1. 06 May, 2022 4 commits
  2. 21 Apr, 2022 4 commits
  3. 11 Apr, 2022 4 commits
  4. 25 Mar, 2022 28 commits
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid · 5e206459
      Linus Torvalds authored
      Pull HID updates from Jiri Kosina:
      
       - rework of generic input handling which ultimately makes the
         processing of tablet events more generic and reliable (Benjamin
         Tissoires)
      
       - fixes for handling unnumbered reports fully correctly in i2c-hid
         (Angela Czubak, Dmitry Torokhov)
      
       - untangling of intermingled code for sending and handling output
         reports in i2c-hid (Dmitry Torokhov)
      
       - Apple magic keyboard support improvements for newer models (José
         Expósito)
      
       - Apple T2 Macs support improvements (Aun-Ali Zaidi, Paul Pawlowski)
      
       - driver for Razer Blackwidow keyboards (Jelle van der Waa)
      
       - driver for SiGma Micro keyboards (Desmond Lim)
      
       - integration of first part of DIGImend patches in order to ultimately
         vastly improve Linux support of tablets (Nikolai Kondrashov, José
         Expósito)
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (55 commits)
        HID: intel-ish-hid: Use dma_alloc_coherent for firmware update
        Input: docs: add more details on the use of BTN_TOOL
        HID: input: accommodate priorities for slotted devices
        HID: input: remove the need for HID_QUIRK_INVERT
        HID: input: enforce Invert usage to be processed before InRange
        HID: core: for input reports, process the usages by priority list
        HID: compute an ordered list of input fields to process
        HID: input: move up out-of-range processing of input values
        HID: input: rework spaghetti code with switch statements
        HID: input: tag touchscreens as such if the physical is not there
        HID: core: split data fetching from processing in hid_input_field()
        HID: core: de-duplicate some code in hid_input_field()
        HID: core: statically allocate read buffers
        HID: uclogic: Support multiple frame input devices
        HID: uclogic: Define report IDs before their descriptors
        HID: uclogic: Put version first in rdesc namespace
        HID: uclogic: Use "frame" instead of "buttonpad"
        HID: uclogic: Use different constants for frame report IDs
        HID: uclogic: Specify total report size to buttonpad macro
        HID: uclogic: Switch to matching subreport bytes
        ...
    • Merge tag 'platform-drivers-x86-v5.18-1' of... · 14646776
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
      
      Pull x86 platform driver updates from Hans de Goede:
        "New drivers:
          - AMD Host System Management Port (HSMP)
          - Intel Software Defined Silicon
      
        Removed drivers (functionality folded into other drivers):
          - intel_cht_int33fe_microb
          - surface3_button
      
        amd-pmc:
          - s2idle bug-fixes
          - Support for AMD Spill to DRAM STB feature
      
        hp-wmi:
          - Fix SW_TABLET_MODE detection method (and other fixes)
          - Support omen thermal profile policy v1
      
        serial-multi-instantiate:
          - Add SPI device support
          - Add support for CS35L41 amplifiers used in new laptops
      
        think-lmi:
          - sysfs-class-firmware-attributes Certificate authentication support
      
        thinkpad_acpi:
          - Fixes + quirks
          - Add platform_profile support on AMD based ThinkPads
      
        x86-android-tablets:
          - Improve Asus ME176C / TF103C support
          - Support Nextbook Ares 8, Lenovo Tab 2 830 and 1050 tablets
      
        Lots of various other small fixes and hardware-id additions"
      
      * tag 'platform-drivers-x86-v5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (60 commits)
        platform/x86: think-lmi: Certificate authentication support
        Documentation: syfs-class-firmware-attributes: Lenovo Certificate support
        platform/x86: amd-pmc: Only report STB errors when STB enabled
        platform/x86: amd-pmc: Drop CPU QoS workaround
        platform/x86: amd-pmc: Output error codes in messages
        platform/x86: amd-pmc: Move to later in the suspend process
        ACPI / x86: Add support for LPS0 callback handler
        platform/x86: thinkpad_acpi: consistently check fan_get_status return.
        platform/x86: hp-wmi: support omen thermal profile policy v1
        platform/x86: hp-wmi: Changing bios_args.data to be dynamically allocated
        platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls
        platform/x86: hp-wmi: Fix SW_TABLET_MODE detection method
        platform/x86: hp-wmi: Fix hp_wmi_read_int() reporting error (0x05)
        platform/x86: amd-pmc: Validate entry into the deepest state on resume
        platform/x86: thinkpad_acpi: Don't use test_bit on an integer
        platform/x86: thinkpad_acpi: Fix compiler warning about uninitialized err variable
        platform/x86: thinkpad_acpi: clean up dytc profile convert
        platform/x86: x86-android-tablets: Depend on EFI and SPI
        platform/x86: amd-pmc: uninitialized variable in amd_pmc_s2d_init()
        platform/x86: intel-uncore-freq: fix uncore_freq_common_init() error codes
        ...
    • Merge tag 'kbuild-gnu11-v5.18' of... · 50560ce6
      Linus Torvalds authored
      Merge tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild update for C11 language base from Masahiro Yamada:
       "Kbuild -std=gnu11 updates for v5.18
      
        Linus pointed out the benefits of C99 some years ago, especially
        variable declarations in loops [1]. At that time, we were not ready
        for the migration due to old compilers.
      
        Recently, Jakob Koschel reported a bug in list_for_each_entry(), which
        leaks the invalid pointer out of the loop [2]. In the discussion, we
        agreed that the time had come. Now that GCC 5.1 is the minimum
        compiler version, there is nothing to prevent us from going to
        -std=gnu99, or even straight to -std=gnu11.
      
        Discussions for a better list iterator implementation are ongoing, but
        this patch set must land first"
      
      [1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/
      [2] https://lore.kernel.org/lkml/86C4CE7D-6D93-456B-AA82-F8ADEACA40B7@gmail.com/
      
      * tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        Kbuild: use -std=gnu11 for KBUILD_USERCFLAGS
        Kbuild: move to -std=gnu11
        Kbuild: use -Wdeclaration-after-statement
        Kbuild: add -Wno-shift-negative-value where -Wextra is used
    • Merge branch 'akpm' (patches from Andrew) · 29c8c183
      Linus Torvalds authored
      Merge yet more updates from Andrew Morton:
       "This is the material which was staged after willystuff in linux-next.
      
        Subsystems affected by this patch series: mm (debug, selftests,
        pagecache, thp, rmap, migration, kasan, hugetlb, pagemap, madvise),
        and selftests"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (113 commits)
        selftests: kselftest framework: provide "finished" helper
        mm: madvise: MADV_DONTNEED_LOCKED
        mm: fix race between MADV_FREE reclaim and blkdev direct IO read
        mm: generalize ARCH_HAS_FILTER_PGPROT
        mm: unmap_mapping_range_tree() with i_mmap_rwsem shared
        mm: warn on deleting redirtied only if accounted
        mm/huge_memory: remove stale locking logic from __split_huge_pmd()
        mm/huge_memory: remove stale page_trans_huge_mapcount()
        mm/swapfile: remove stale reuse_swap_page()
        mm/khugepaged: remove reuse_swap_page() usage
        mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
        mm: streamline COW logic in do_swap_page()
        mm: slightly clarify KSM logic in do_swap_page()
        mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
        mm: optimize do_wp_page() for exclusive pages in the swapcache
        mm/huge_memory: make is_transparent_hugepage() static
        userfaultfd/selftests: enable hugetlb remap and remove event testing
        selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
        mm: enable MADV_DONTNEED for hugetlb mappings
        kasan: disable LOCKDEP when printing reports
        ...
    • Merge tag 'riscv-for-linus-5.18-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · aa5b537b
      Linus Torvalds authored
      Pull RISC-V updates from Palmer Dabbelt:
      
       - Support for Sv57-based virtual memory.
      
       - Various improvements for the MicroChip PolarFire SOC and the
         associated Icicle dev board, which should allow upstream kernels to
         boot without any additional modifications.
      
       - An improved memmove() implementation.
      
       - Support for the new Sscofpmf and SBI PMU extensions, which allows
         for a much more useful perf implementation on RISC-V systems.
      
       - Support for restartable sequences.
      
      * tag 'riscv-for-linus-5.18-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (36 commits)
        rseq/selftests: Add support for RISC-V
        RISC-V: Add support for restartable sequence
        MAINTAINERS: Add entry for RISC-V PMU drivers
        Documentation: riscv: Remove the old documentation
        RISC-V: Add sscofpmf extension support
        RISC-V: Add perf platform driver based on SBI PMU extension
        RISC-V: Add RISC-V SBI PMU extension definitions
        RISC-V: Add a simple platform driver for RISC-V legacy perf
        RISC-V: Add a perf core library for pmu drivers
        RISC-V: Add CSR encodings for all HPMCOUNTERS
        RISC-V: Remove the current perf implementation
        RISC-V: Improve /proc/cpuinfo output for ISA extensions
        RISC-V: Do no continue isa string parsing without correct XLEN
        RISC-V: Implement multi-letter ISA extension probing framework
        RISC-V: Extract multi-letter extension names from "riscv, isa"
        RISC-V: Minimal parser for "riscv, isa" strings
        RISC-V: Correctly print supported extensions
        riscv: Fixed misaligned memory access. Fixed pointer comparison.
        MAINTAINERS: update riscv/microchip entry
        riscv: dts: microchip: add new peripherals to icicle kit device tree
        ...
    • Merge tag 's390-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · d710d370
      Linus Torvalds authored
      Pull s390 updates from Vasily Gorbik:
      
       - Raise minimum supported machine generation to z10, which comes with
         various cleanups and code simplifications (usercopy/spectre
         mitigation/etc).
      
       - Rework extables and get rid of anonymous out-of-line fixups.
      
       - Page table helpers cleanup. Add set_pXd()/set_pte() helper functions.
         Convert pte_val()/pXd_val() macros to functions.
      
       - Optimize kretprobe handling by avoiding extra kprobe on
         __kretprobe_trampoline.
      
       - Add support for CEX8 crypto cards.
      
       - Allow triggering an AP bus rescan by writing to /sys/bus/ap/scans.
      
       - Add CONFIG_EXPOLINE_EXTERN option to build the kernel without COMDAT
         group sections which simplifies kpatch support.
      
       - Always use the packed stack layout and extend kernel unwinder tests.
      
       - Add sanity checks for ftrace code patching.
      
       - Add s390dbf debug log for the vfio_ap device driver.
      
       - Various virtual vs physical address confusion fixes.
      
       - Various small fixes and improvements all over the code.
      
      * tag 's390-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (69 commits)
        s390/test_unwind: add kretprobe tests
        s390/kprobes: Avoid additional kprobe in kretprobe handling
        s390: convert ".insn" encoding to instruction names
        s390: assume stckf is always present
        s390/nospec: move to single register thunks
        s390: raise minimum supported machine generation to z10
        s390/uaccess: Add copy_from/to_user_key functions
        s390/nospec: align and size extern thunks
        s390/nospec: add an option to use thunk-extern
        s390/nospec: generate single register thunks if possible
        s390/pci: make zpci_set_irq()/zpci_clear_irq() static
        s390: remove unused expoline to BC instructions
        s390/irq: use assignment instead of cast
        s390/traps: get rid of magic cast for per code
        s390/traps: get rid of magic cast for program interruption code
        s390/signal: fix typo in comments
        s390/asm-offsets: remove unused defines
        s390/test_unwind: avoid build warning with W=1
        s390: remove .fixup section
        s390/bpf: encode register within extable entry
        ...
    • Merge tag 'xtensa-20220325' of https://github.com/jcmvbkbc/linux-xtensa · 744465da
      Linus Torvalds authored
      Pull Xtensa updates from Max Filippov:
      
       - remove dependency on the compiler's libgcc
      
       - allow selection of internal kernel ABI via Kconfig
      
       - enable compiler plugins support for gcc-12 or newer
      
       - various minor cleanups and fixes
      
      * tag 'xtensa-20220325' of https://github.com/jcmvbkbc/linux-xtensa:
        xtensa: define update_mmu_tlb function
        xtensa: fix xtensa_wsr always writing 0
        xtensa: enable plugin support
        xtensa: clean up kernel exit assembly code
        xtensa: rearrange NMI exit path
        xtensa: merge stack alignment definitions
        xtensa: fix DTC warning unit_address_format
        xtensa: fix stop_machine_cpuslocked call in patch_text
        xtensa: make secondary reset vector support conditional
        xtensa: add kernel ABI selection to Kconfig
        xtensa: don't link with libgcc
        xtensa: add helpers for division, remainder and shifts
        xtensa: add missing XCHAL_HAVE_WINDOWED check
        xtensa: use XCHAL_NUM_AREGS as pt_regs::areg size
        xtensa: rename PT_SIZE to PT_KERNEL_SIZE
        xtensa: Remove unused early_read_config_byte() et al declarations
        xtensa: use strscpy to copy strings
        net: xtensa: use strscpy to copy strings
    • Merge tag 'powerpc-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 1f1c153e
      Linus Torvalds authored
      Pull powerpc updates from Michael Ellerman:
       "Livepatch support for 32-bit is probably the standout new feature,
        otherwise mostly just lots of bits and pieces all over the board.
      
        There's a series of commits cleaning up function descriptor handling,
        which touches a few other arches as well as LKDTM. It has acks from
        Arnd, Kees and Helge.
      
        Summary:
      
         - Enforce kernel RO, and implement STRICT_MODULE_RWX for 603.
      
         - Add support for livepatch to 32-bit.
      
         - Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS.
      
         - Merge vdso64 and vdso32 into a single directory.
      
         - Fix build errors with newer binutils.
      
         - Add support for UADDR64 relocations, which are emitted by some
           toolchains. This allows powerpc to build with the latest lld.
      
         - Fix (another) potential userspace r13 corruption in transactional
           memory handling.
      
         - Cleanups of function descriptor handling & related fixes to LKDTM.
      
        Thanks to Abdul Haleem, Alexey Kardashevskiy, Anders Roxell, Aneesh
        Kumar K.V, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Bhaskar
        Chowdhury, Cédric Le Goater, Chen Jingwen, Christophe JAILLET,
        Christophe Leroy, Corentin Labbe, Daniel Axtens, Daniel Henrique
        Barboza, David Dai, Fabiano Rosas, Ganesh Goudar, Guo Zhengkui, Hangyu
        Hua, Haren Myneni, Hari Bathini, Igor Zhbanov, Jakob Koschel, Jason
        Wang, Jeremy Kerr, Joachim Wiberg, Jordan Niethe, Julia Lawall, Kajol
        Jain, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mamatha Inamdar,
        Maxime Bizon, Maxim Kiselev, Maxim Kochetkov, Michal Suchanek,
        Nageswara R Sastry, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
        Nour-eddine Taleb, Paul Menzel, Ping Fang, Pratik R. Sampat, Randy
        Dunlap, Ritesh Harjani, Rohan McLure, Russell Currey, Sachin Sant,
        Segher Boessenkool, Shivaprasad G Bhat, Sourabh Jain, Thierry Reding,
        Tobias Waldekranz, Tyrel Datwyler, Vaibhav Jain, Vladimir Oltean,
        Wedson Almeida Filho, and YueHaibing"
      
      * tag 'powerpc-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
        powerpc/pseries: Fix use after free in remove_phb_dynamic()
        powerpc/time: improve decrementer clockevent processing
        powerpc/time: Fix KVM host re-arming a timer beyond decrementer range
        powerpc/tm: Fix more userspace r13 corruption
        powerpc/xive: fix return value of __setup handler
        powerpc/64: Add UADDR64 relocation support
        powerpc: 8xx: fix a return value error in mpc8xx_pic_init
        powerpc/ps3: remove unneeded semicolons
        powerpc/64: Force inlining of prevent_user_access() and set_kuap()
        powerpc/bitops: Force inlining of fls()
        powerpc: declare unmodified attribute_group usages const
        powerpc/spufs: Fix build warning when CONFIG_PROC_FS=n
        powerpc/secvar: fix refcount leak in format_show()
        powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E
        powerpc: Move C prototypes out of asm-prototypes.h
        powerpc/kexec: Declare kexec_paca static
        powerpc/smp: Declare current_set static
        powerpc: Cleanup asm-prototypes.c
        powerpc/ftrace: Use STK_GOT in ftrace_mprofile.S
        powerpc/ftrace: Regroup PPC64 specific operations in ftrace_mprofile.S
        ...
    • Merge tag 'mips_5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 9a8b3d5f
      Linus Torvalds authored
      Pull MIPS updates from Thomas Bogendoerfer:
      
       - added support for QCN550x (ath79)
      
       - enabled KCSAN
      
       - removed TX39XX support
      
       - various cleanups and fixes
      
      * tag 'mips_5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (31 commits)
        MIPS: Fix build error for loongson64 and sgi-ip27
        MIPS: ingenic: correct unit node address
        MIPS: Fix wrong comments in asm/prom.h
        MIPS: Remove redundant definitions of device_tree_init()
        MIPS: Remove redundant check in device_tree_init()
        MIPS: pgalloc: fix memory leak caused by pgd_free()
        MIPS: RB532: fix return value of __setup handler
        MIPS: Only use current_stack_pointer on GCC
        MIPS: boot/compressed: Use array reference for image bounds
        mips: cdmm: Fix refcount leak in mips_cdmm_phys_base
        mips: remove reference to "newer Loongson-3"
        mips: Always permit to build u-boot images
        MIPS: Sanitise Cavium switch cases in TLB handler synthesizers
        DEC: Limit PMAX memory probing to R3k systems
        mips: DEC: honor CONFIG_MIPS_FP_SUPPORT=n
        MIPS: fix fortify panic when copying asm exception handlers
        mips: ralink: fix a refcount leak in ill_acc_of_setup()
        mips: Implement "current_stack_pointer"
        MIPS: Remove TX39XX support
        MIPS: Modernize READ_IMPLIES_EXEC
        ...
    • Merge tag 'iommu-updates-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 34af78c4
      Linus Torvalds authored
      Pull iommu updates from Joerg Roedel:
      
       - IOMMU Core changes:
            - Removal of aux domain related code as it is basically dead and
              will be replaced by iommu-fd framework
            - Split of iommu_ops to carry domain-specific call-backs separately
            - Cleanup to remove useless ops->capable implementations
            - Improve 32-bit free space estimate in iova allocator
      
       - Intel VT-d updates:
            - Various cleanups of the driver
            - Support for ATS of SoC-integrated devices listed in ACPI/SATC
              table
      
       - ARM SMMU updates:
            - Fix SMMUv3 soft lockup during continuous stream of events
            - Fix error path for Qualcomm SMMU probe()
            - Rework SMMU IRQ setup to prepare the ground for PMU support
            - Minor cleanups and refactoring
      
       - AMD IOMMU driver:
            - Some minor cleanups and error-handling fixes
      
       - Rockchip IOMMU driver:
            - Use standard driver registration
      
       - MSM IOMMU driver:
            - Minor cleanup and change to standard driver registration
      
       - Mediatek IOMMU driver:
            - Fixes for IOTLB flushing logic
      
      * tag 'iommu-updates-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (47 commits)
        iommu/amd: Improve amd_iommu_v2_exit()
        iommu/amd: Remove unused struct fault.devid
        iommu/amd: Clean up function declarations
        iommu/amd: Call memunmap in error path
        iommu/arm-smmu: Account for PMU interrupts
        iommu/vt-d: Enable ATS for the devices in SATC table
        iommu/vt-d: Remove unused function intel_svm_capable()
        iommu/vt-d: Add missing "__init" for rmrr_sanity_check()
        iommu/vt-d: Move intel_iommu_ops to header file
        iommu/vt-d: Fix indentation of goto labels
        iommu/vt-d: Remove unnecessary prototypes
        iommu/vt-d: Remove unnecessary includes
        iommu/vt-d: Remove DEFER_DEVICE_DOMAIN_INFO
        iommu/vt-d: Remove domain and devinfo mempool
        iommu/vt-d: Remove iova_cache_get/put()
        iommu/vt-d: Remove finding domain in dmar_insert_one_dev_info()
        iommu/vt-d: Remove intel_iommu::domains
        iommu/mediatek: Always tlb_flush_all when each PM resume
        iommu/mediatek: Add tlb_lock in tlb_flush_all
        iommu/mediatek: Remove the power status checking in tlb flush all
        ...
    • Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 6f2689a7
      Linus Torvalds authored
      Pull SCSI updates from James Bottomley:
       "This series consists of the usual driver updates (qla2xxx, pm8001,
        libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates
        and bug fixes.
      
        The high blast radius core update is the removal of write same, which
        affects block and several non-SCSI devices. The other big change,
        which is more local, is the removal of the SCSI pointer"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits)
        scsi: scsi_ioctl: Drop needless assignment in sg_io()
        scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn()
        scsi: lpfc: Copyright updates for 14.2.0.0 patches
        scsi: lpfc: Update lpfc version to 14.2.0.0
        scsi: lpfc: SLI path split: Refactor BSG paths
        scsi: lpfc: SLI path split: Refactor Abort paths
        scsi: lpfc: SLI path split: Refactor SCSI paths
        scsi: lpfc: SLI path split: Refactor CT paths
        scsi: lpfc: SLI path split: Refactor misc ELS paths
        scsi: lpfc: SLI path split: Refactor VMID paths
        scsi: lpfc: SLI path split: Refactor FDISC paths
        scsi: lpfc: SLI path split: Refactor LS_RJT paths
        scsi: lpfc: SLI path split: Refactor LS_ACC paths
        scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths
        scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths
        scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path
        scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe
        scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4
        scsi: lpfc: SLI path split: Refactor lpfc_iocbq
        scsi: lpfc: Use kcalloc()
        ...
    • Merge tag 'for-5.18/dm-changes' of... · b1f8ccda
      Linus Torvalds authored
      Merge tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper updates from Mike Snitzer:
      
       - Significant refactoring and fixing of how DM core does bio-based IO
         accounting with focus on fixing wildly inaccurate IO stats for
         dm-crypt (and other DM targets that defer bio submission in their own
         workqueues). End result is proper IO accounting, made possible by
         targets being updated to use the new dm_submit_bio_remap() interface.
      
       - Add hipri bio polling support (REQ_POLLED) to bio-based DM.
      
       - Reduce dm_io and dm_target_io structs so that a single dm_io (which
         contains dm_target_io and first clone bio) weighs in at 256 bytes.
         For reference the bio struct is 128 bytes.
      
       - Various other small cleanups, fixes or improvements in DM core and
         targets.
      
       - Update MAINTAINERS with my kernel.org email address to allow
         distinction between my "upstream" and "Red" Hats.
      
      * tag 'for-5.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (46 commits)
        dm: consolidate spinlocks in dm_io struct
        dm: reduce size of dm_io and dm_target_io structs
        dm: switch dm_target_io booleans over to proper flags
        dm: switch dm_io booleans over to proper flags
        dm: update email address in MAINTAINERS
        dm: return void from __send_empty_flush
        dm: factor out dm_io_complete
        dm cache: use dm_submit_bio_remap
        dm: simplify dm_sumbit_bio_remap interface
        dm thin: use dm_submit_bio_remap
        dm: add WARN_ON_ONCE to dm_submit_bio_remap
        dm: support bio polling
        block: add ->poll_bio to block_device_operations
        dm mpath: use DMINFO instead of printk with KERN_INFO
        dm: stop using bdevname
        dm-zoned: remove the ->name field in struct dmz_dev
        dm: remove unnecessary local variables in __bind
        dm: requeue IO if mapping table not yet available
        dm io: remove stale comment block for dm_io()
        dm thin metadata: remove unused dm_thin_remove_block and __remove
        ...
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 2dacc1e5
      Linus Torvalds authored
      Pull rdma updates from Jason Gunthorpe:
      
       - Minor bug fixes in mlx5, mthca, pvrdma, rtrs, mlx4, hfi1, hns
      
       - Minor cleanups: coding style, useless includes and documentation
      
       - Reorganize how multicast processing works in rxe
      
       - Replace a red/black tree with xarray in rxe which improves performance
      
       - DSCP support and HW address handle re-use in irdma
      
       - Simplify the mailbox command handling in hns
      
       - Simplify iser now that FMR is eliminated
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (93 commits)
        RDMA/nldev: Prevent underflow in nldev_stat_set_counter_dynamic_doit()
        IB/iser: Fix error flow in case of registration failure
        IB/iser: Generalize map/unmap dma tasks
        IB/iser: Use iser_fr_desc as registration context
        IB/iser: Remove iser_reg_data_sg helper function
        RDMA/rxe: Use standard names for ref counting
        RDMA/rxe: Replace red-black trees by xarrays
        RDMA/rxe: Shorten pool names in rxe_pool.c
        RDMA/rxe: Move max_elem into rxe_type_info
        RDMA/rxe: Replace obj by elem in declaration
        RDMA/rxe: Delete _locked() APIs for pool objects
        RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC
        RDMA/rxe: Replace mr by rkey in responder resources
        RDMA/rxe: Fix ref error in rxe_av.c
        RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT
        RDMA/irdma: Add support for address handle re-use
        RDMA/qib: Fix typos in comments
        RDMA/mlx5: Fix memory leak in error flow for subscribe event routine
        Revert "RDMA/core: Fix ib_qp_usecnt_dec() called when error"
        RDMA/rxe: Remove useless argument for update_state()
        ...
    • selftests: kselftest framework: provide "finished" helper · 25fd2d41
      Kees Cook authored
      Instead of requiring each test that wants to use ksft_exit() to figure
      out the internals of kselftest.h, add the helper ksft_finished(), which
      checks that the passes, xfails, and skips add up to the test plan count.
      
      Link: https://lkml.kernel.org/r/20220201013717.2464392-1-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: madvise: MADV_DONTNEED_LOCKED · 9457056a
      Johannes Weiner authored
      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
      and MCL_ONFAULT allowing to mlock without populating, there are valid use
      cases for depopulating locked ranges as well.
      
      Users mlock memory to protect secrets.  There are allocators for secure
      buffers that want in-use memory generally mlocked, but cleared and
      invalidated memory to give up the physical pages.  This could be done with
      explicit munlock -> mlock calls on free -> alloc of course, but that adds
      two unnecessary syscalls, heavy mmap_sem write locks, vma splits and
      re-merges - only to get rid of the backing pages.
      
      Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are
      okay with on-demand initial population.  It seems valid to selectively
      free some memory during the lifetime of such a process, without having to
      mess with its overall policy.
      
      Why add a separate flag? Isn't this a pretty niche usecase?
      
      - MADV_DONTNEED has been bailing on locked vmas forever. It's at least
        conceivable that someone, somewhere is relying on mlock to protect
        data from perhaps broader invalidation calls. Changing this behavior
        now could lead to quiet data corruption.
      
      - It also clarifies expectations around MADV_FREE and maybe
        MADV_REMOVE. It avoids the situation where one quietly behaves
        different than the others. MADV_FREE_LOCKED can be added later.
      
      - The combination of mlock() and madvise() in the first place is
        probably niche. But where it happens, I'd say that dropping pages
        from a locked region once they don't contain secrets or won't page
        anymore is much saner than relying on mlock to protect memory from
        speculative or errant invalidation calls. It's just that we can't
        change the default behavior because of the two previous points.
      
      Given that, an explicit new flag seems to make the most sense.
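
      The secure-buffer use case described above could look roughly like the
      following from userspace.  This is a minimal sketch, not code from the
      patch: drop_locked_secret() is a hypothetical helper, and the fallback
      value 24 for MADV_DONTNEED_LOCKED is an assumption taken from the Linux
      5.18 uapi headers for systems whose libc headers predate the flag.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_DONTNEED_LOCKED
#define MADV_DONTNEED_LOCKED 24	/* assumed value, per the Linux 5.18 uapi headers */
#endif

/*
 * Scrub a secret held in an mlocked page, then give the physical page back
 * without a munlock()/mlock() round trip.
 * Returns 0 if the page was dropped and reads back as zero, 1 if the kernel
 * rejects the flag (older kernels return EINVAL), -1 on any other failure.
 */
int drop_locked_secret(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len;
	int ret = -1;
	char *buf;

	assert(page_size > 0);
	len = (size_t)page_size;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	if (mlock(buf, len) != 0)	/* may fail under a low RLIMIT_MEMLOCK */
		goto out;

	memset(buf, 0xaa, len);		/* the "secret" */
	memset(buf, 0, len);		/* clear it before invalidating */
	buf[0] = 1;			/* marker: must be gone after the drop */

	if (madvise(buf, len, MADV_DONTNEED_LOCKED) != 0) {
		ret = (errno == EINVAL) ? 1 : -1;
		goto out;
	}

	/* The range is still mlocked; touching it faults in a fresh zero page. */
	ret = (buf[0] == 0) ? 0 : -1;
out:
	munmap(buf, len);
	return ret;
}
```

      An allocator would call something like this on free, keeping the vma
      locked so the next allocation needs no further mlock() call.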
      
      [hannes@cmpxchg.org: fix mips build]
      
      Link: https://lkml.kernel.org/r/20220304171912.305060-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9457056a
    • Mauricio Faria de Oliveira's avatar
      mm: fix race between MADV_FREE reclaim and blkdev direct IO read · 6c8e2a25
      Mauricio Faria de Oliveira authored
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct IO
      read on a block device if madvise(MADV_FREE) was called on the buffers
      earlier (this is discussed below), due to a race between page reclaim of
      MADV_FREE pages and the blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the situation happens.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure addresses out (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from direct IO read) now prevent the
      rmap/PTE from being unmapped/dropped, similarly to how the page is not
      freed per shrink_page_list()/page_ref_freeze().
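
      The reference accounting described in this section can be written down
      as a small predicate.  This is a userspace toy model, not the kernel
      code: struct toy_page and toy_can_discard_lazyfree() are hypothetical
      names standing in for `struct page` and the check in try_to_unmap_one().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few `struct page` fields that matter here. */
struct toy_page {
	int refcount;	/* all references, including the reclaim isolation */
	int mapcount;	/* number of page table mappings */
	bool dirty;
};

/*
 * Reclaim may drop the PTE of a lazyfree page only if the page is clean and
 * every reference is accounted for: one from isolation plus one per mapping.
 */
bool toy_can_discard_lazyfree(const struct toy_page *page)
{
	assert(page->mapcount >= 1);	/* we are unmapping an existing mapping */
	return page->refcount == 1 + page->mapcount && !page->dirty;
}
```

      With one mapping, a refcount of 2 allows the discard.  A page shared
      after fork() with a refcount of 3 and a mapcount of 2 is still fine,
      but a refcount of 3 with a single mapping (e.g., a direct IO reference)
      is not.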
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization w/ full memory barrier: it writes
      the page reference count first then it reads the PTE later, while
      try_to_unmap() writes PTE first then it reads page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
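
      The ordering requirement between the GUP fast path and reclaim is the
      classic "each side writes its own variable, then reads the other's"
      pattern.  The sketch below models it with C11 atomics and full fences;
      struct toy_state, toy_gup_fast() and toy_try_discard() are hypothetical
      names, and the model collapses the page tables to one boolean and
      assumes a single mapping (expected refcount of 2).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy state: one PTE (present or not) and one page refcount. */
struct toy_state {
	atomic_bool pte_present;
	atomic_int refcount;	/* starts at 2: isolation + one mapping */
};

/* GUP fast path side: write refcount, full barrier, then re-read the PTE. */
bool toy_gup_fast(struct toy_state *s)
{
	assert(atomic_load_explicit(&s->refcount, memory_order_relaxed) > 0);
	if (!atomic_load_explicit(&s->pte_present, memory_order_relaxed))
		return false;	/* nothing mapped: go to the slow path */

	atomic_fetch_add_explicit(&s->refcount, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);

	if (!atomic_load_explicit(&s->pte_present, memory_order_relaxed)) {
		/* PTE changed under us: drop the reference, go slow path. */
		atomic_fetch_sub_explicit(&s->refcount, 1, memory_order_relaxed);
		return false;
	}
	return true;	/* reference taken while the PTE was still present */
}

/* Reclaim side: write PTE, full barrier, then read the refcount. */
bool toy_try_discard(struct toy_state *s)
{
	atomic_store_explicit(&s->pte_present, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load_explicit(&s->refcount, memory_order_relaxed) != 2) {
		/* Extra reference seen: restore the PTE, keep the page mapped. */
		atomic_store_explicit(&s->pte_present, true, memory_order_relaxed);
		return false;
	}
	return true;	/* PTE stays cleared, the page can be discarded */
}
```

      With a full fence on both sides, at least one side observes the
      other's write: either reclaim sees the elevated refcount and backs off,
      or the fast path sees the cleared PTE and drops its reference.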
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in the unlikely case that TTU_SPLIT_HUGE_PMD/split_huge_pmd_address()
      happens, which should be rare if it happens at all, the page refcount
      should be greater than mapcount: the head page is referenced by tail
      pages.  That also prevents checking the head `page` and then incorrectly
      calling page_remove_rmap(subpage) for a tail page that isn't even in
      shrink_page_list()'s page_list (an effect of split huge pmd/pvmw), as
      might happen today in this unlikely scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point would seem a reasonable consideration and
      balance, as the madvise(2) manual page apparently (and fairly) seems to
      assume that 'writes' are memory accesses from the userspace process (not
      explicitly considering writes from the kernel or its corner cases; again,
      fairly).  Plus, the kernel fix for the corner case of the largely
      'non-atomic write' encompassed by a direct IO read operation is
      relatively simple; and it helps.
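
      For contrast, the case the manual page does cover, a plain userspace
      write after MADV_FREE, can be sketched as follows.  This is an
      illustrative helper only: write_after_madv_free() is a hypothetical
      name, and the fallback value 8 for MADV_FREE is an assumption taken
      from the Linux uapi headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumed value, per the Linux uapi headers */
#endif

/*
 * Mark one anonymous page MADV_FREE, then write to it from userspace.
 * Returns the byte read back after the write, or -1 if setup fails
 * (for example, madvise() rejecting MADV_FREE on kernels before v4.5).
 */
int write_after_madv_free(unsigned char value)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	int ret = -1;

	assert(page_size > 0);
	buf = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	buf[0] = 1;			/* populate the page */
	if (madvise(buf, (size_t)page_size, MADV_FREE) != 0)
		goto out;

	/*
	 * Until this write, the kernel may free the page at any time, and a
	 * read could see either the old byte or zero.
	 */
	buf[0] = value;			/* userspace write: cancels the free */
	ret = buf[0];			/* the dirtied page can no longer be freed */
out:
	munmap(buf, (size_t)page_size);
	return ret;
}
```

      The direct IO read differs from this sketch in that the 'write' comes
      from the kernel on the caller's behalf and only dirties the page later,
      which is exactly the window discussed above.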
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
      support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
      do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is Ceph
      on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
      release unused memory to the system from the mmap'ed page heap (might be
      committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
      -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
      TCMalloc_SystemCommit() -> do nothing.
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
      Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c8e2a25
    • Anshuman Khandual's avatar
      mm: generalize ARCH_HAS_FILTER_PGPROT · 24e988c7
      Anshuman Khandual authored
      The ARCH_HAS_FILTER_PGPROT config has duplicate definitions on the
      platforms that subscribe to it.  Instead, make it a generic config
      option which can be selected on applicable platforms when required.
      
      Link: https://lkml.kernel.org/r/1643004823-16441-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24e988c7
    • Hugh Dickins's avatar
      mm: unmap_mapping_range_tree() with i_mmap_rwsem shared · 2c865995
      Hugh Dickins authored
      Revert 48ec833b ("Revert "mm/memory.c: share the i_mmap_rwsem"") to
      reinstate c8475d14 ("mm/memory.c: share the i_mmap_rwsem"): the
      unmap_mapping_range family of functions do the unmapping of user pages
      (ultimately via zap_page_range_single) without modifying the interval tree
      itself, and unmapping races are necessarily guarded by page table lock,
      thus the i_mmap_rwsem should be shared in unmap_mapping_pages() and
      unmap_mapping_folio().
      
      Commit 48ec833b was intended as a short-term measure, allowing the
      other shared lock changes into 3.19 final, before investigating three
      trinity crashes, one of which had been bisected to commit c8475d14:
      
      [1] https://lkml.org/lkml/2014/11/14/342
      https://lore.kernel.org/lkml/5466142C.60100@oracle.com/
      [2] https://lkml.org/lkml/2014/12/22/213
      https://lore.kernel.org/lkml/549832E2.8060609@oracle.com/
      [3] https://lkml.org/lkml/2014/12/9/741
      https://lore.kernel.org/lkml/5487ACC5.1010002@oracle.com/
      
      Two of those were Bad page states: free_pages_prepare() found PG_mlocked
      still set - almost certain to have been fixed by 4.4 commit b87537d9
      ("mm: rmap use pte lock not mmap_sem to set PageMlocked").  The NULL deref
      on rwsem in [2]: unclear, only happened once, not bisected to c8475d14.
      
      No change to the i_mmap_lock_write() around __unmap_hugepage_range_final()
      in unmap_single_vma(): IIRC that's a special usage, helping to serialize
      hugetlbfs page table sharing, not to be dabbled with lightly.  No change
      to other uses of i_mmap_lock_write() by hugetlbfs.
      
      I am not aware of any significant gains from the concurrency allowed by
      this commit: it is submitted more to resolve an ancient misunderstanding.
      
      Link: https://lkml.kernel.org/r/e4a5e356-6c87-47b2-3ce8-c2a95ae84e20@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c865995
    • Hugh Dickins's avatar
      mm: warn on deleting redirtied only if accounted · 566d3362
      Hugh Dickins authored
      filemap_unaccount_folio() has a WARN_ON_ONCE(folio_test_dirty(folio)).  It
      is good to warn of late dirtying on a persistent filesystem, but late
      dirtying on tmpfs can only lose data which is expected to be thrown away;
      and it's a pity if that warning comes ONCE on tmpfs, then hides others
      which really matter.  Make it conditional on mapping_cap_writeback().
      
      Cleanup: then folio_account_cleaned() no longer needs to check that for
      itself, and so no longer needs to know the mapping.
      
      Link: https://lkml.kernel.org/r/b5a1106c-7226-a5c6-ad41-ad4832cae1f@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jan Kara <jack@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      566d3362
    • David Hildenbrand's avatar
      mm/huge_memory: remove stale locking logic from __split_huge_pmd() · 7f760917
      David Hildenbrand authored
      Let's remove the stale logic that was required for reuse_swap_page().
      
      [akpm@linux-foundation.org: simplification, per Yang Shi]
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f760917
    • David Hildenbrand's avatar
      mm/huge_memory: remove stale page_trans_huge_mapcount() · 55c62fa7
      David Hildenbrand authored
      All users are gone, let's remove it.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55c62fa7
    • David Hildenbrand's avatar
      mm/swapfile: remove stale reuse_swap_page() · 03104c2c
      David Hildenbrand authored
      All users are gone, let's remove it.  We'll let SWP_STABLE_WRITES stick
      around for now, as it might come in handy in the near future.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03104c2c
    • David Hildenbrand's avatar
      mm/khugepaged: remove reuse_swap_page() usage · 363106c4
      David Hildenbrand authored
      reuse_swap_page() currently indicates if we can write to an anon page
      without COW.  A COW is required if the page is shared by multiple
      processes (either already mapped or via swap entries) or if there is
      concurrent writeback that cannot tolerate concurrent page modifications.
      
      However, in the context of khugepaged we're not actually going to write to
      a read-only mapped page, we'll copy the page content to our newly
      allocated THP and map that THP writable.  All we have to make sure is that
      the read-only mapped page we're about to copy won't get reused by another
      process sharing the page, otherwise, page content would get modified.  But
      that is already guaranteed via multiple mechanisms (e.g., holding a
      reference, holding the page lock, removing the rmap after copying the
      page).
      
      The swapcache handling was introduced in commit 10359213 ("mm:
      incorporate read-only pages into transparent huge pages") and it sounds
      like it merely wanted to mimic what do_swap_page() would do when trying to
      map a page obtained via the swapcache writable.
      
      As that logic is unnecessary, let's just remove it, removing the last user
      of reuse_swap_page().
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      363106c4
    • David Hildenbrand's avatar
      mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() · 3bff7e3f
      David Hildenbrand authored
      We currently have a different COW logic for anon THP than we have for
      ordinary anon pages in do_wp_page(): the effect is that the issue reported
      in CVE-2020-29374 is currently still possible for anon THP: an unintended
      information leak from the parent to the child.
      
      Let's apply the same logic (page_count() == 1), with similar optimizations
      to remove additional references first, as we really want to avoid
      PTE-mapping the THP and copying individual pages as best we can.
      
      If we end up with a page that has page_count() != 1, we'll have to PTE-map
      the THP and fall back to do_wp_page(), which will always copy the page.
      
      Note that KSM does not apply to THP.
      
      I. Interaction with the swapcache and writeback
      
      While a THP is in the swapcache, the swapcache holds one reference on each
      subpage of the THP.  So with PageSwapCache() set, we expect as many
      additional references as we have subpages.  If we manage to remove the THP
      from the swapcache, all these references will be gone.
      
      Usually, a THP is not split when entered into the swapcache and stays a
      compound page.  However, try_to_unmap() will PTE-map the THP and use PTE
      swap entries.  There are no PMD swap entries for that purpose;
      consequently, we always swap in only subpages into PTEs.
      
      Removing a page from the swapcache can fail either when there are
      remaining swap entries (in which case COW is the right thing to do) or if
      the page is currently under writeback.
      
      Having a locked, R/O PMD-mapped THP that is in the swapcache seems to be
      possible only in corner cases, for example, if try_to_unmap() failed after
      adding the page to the swapcache.  However, it's comparatively easy to
      handle.
      
      As we have to fully unmap a THP before starting writeback, and swapin is
      always done on the PTE level, we shouldn't find a R/O PMD-mapped THP in
      the swapcache that is under writeback.  This should at least leave
      writeback out of the picture.
      
      II. Interaction with GUP references
      
      Having a R/O PMD-mapped THP with GUP references (i.e., R/O references)
      will result in PTE-mapping the THP on a write fault.  Similar to ordinary
      anon pages, do_wp_page() will have to copy sub-pages and result in a
      disconnect between the GUP references and the pages actually mapped into
      the page tables.  To improve the situation in the future, we'll need
      additional handling to mark anonymous pages as definitely exclusive to a
      single process, only allow GUP pins on exclusive anon pages, and disallow
      sharing of exclusive anon pages with GUP pins e.g., during fork().
      
      III. Interaction with references from LRU pagevecs
      
      There is no need to try draining the (local) LRU pagevecs in case we would
      stumble over a !PageLRU() page: folio_add_lru() and friends will always
      flush the affected pagevec immediately after adding a compound page to it
      -- pagevec_add_and_need_flush() always returns "true" for them.  Note that
      the LRU pagevecs will hold a reference on the compound page for a very
      short time, between adding the page to the pagevec and draining it
      immediately afterwards.
      
      IV. Interaction with speculative/temporary references
      
      Similar to ordinary anon pages, other speculative/temporary references on
      the THP, for example, from the pagecache or page migration code, will
      disallow exclusive reuse of the page.  We'll have to PTE-map the THP.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bff7e3f
    • David Hildenbrand's avatar
      mm: streamline COW logic in do_swap_page() · c145e0b4
      David Hildenbrand authored
      Currently we have a different COW logic when:
      * triggering a read-fault to swapin first and then triggering a write-fault
        -> do_swap_page() + do_wp_page()
      * triggering a write-fault to swapin
        -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()
      
      The COW logic in do_swap_page() differs from our reuse logic in
      do_wp_page().  The COW logic in do_wp_page() -- page_count() == 1 --
      currently makes sure that we don't have a remaining reference, e.g.,
      via GUP, on the target page we want to reuse: if there is any unexpected
      reference, we have to copy to avoid information leaks.
      
      As do_swap_page() behaves differently, in environments with swap enabled
      we can currently have an unintended information leak from the parent to
      the child, similar to the one known from CVE-2020-29374:
      
      	1. Parent writes to anonymous page
      	-> Page is mapped writable and modified
      	2. Page is swapped out
      	-> Page is unmapped and replaced by swap entry
      	3. fork()
      	-> Swap entries are copied to child
      	4. Child pins page R/O
      	-> Page is mapped R/O into child
      	5. Child unmaps page
      	-> Child still holds GUP reference
      	6. Parent writes to page
      	-> Page is reused in do_swap_page()
      	-> Child can observe changes
      
      Exchanging 2. and 3. should have the same effect.
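      
      The six steps above can be replayed in a toy userspace model to see why
      the reference count matters.  Everything here is illustrative: Page,
      simulate() and the "v1"/"v2" contents are made-up names, and the
      require_single_ref=False branch is only a simplification of a reuse
      decision that ignores GUP references, not the old kernel code itself.

```python
class Page:
    """Toy physical page: content plus a reference count (not a kernel struct)."""

    def __init__(self, data):
        self.data = data
        self.refcount = 0


def simulate(require_single_ref):
    """Replay steps 1-6; return what the child's GUP pin observes afterwards.

    require_single_ref=False models reuse that ignores extra references;
    require_single_ref=True models the page_count() == 1 rule.
    """
    # 1. Parent writes to an anonymous page.
    page = Page("v1")
    # 2. Page is swapped out: only the swapcache references it now.
    page.refcount = 1
    swap_entry_users = 1          # the parent's swap entry
    # 3. fork(): the swap entry is copied to the child.
    swap_entry_users += 1
    # 4. Child pins the page R/O: swapin maps it, GUP takes a reference.
    swap_entry_users -= 1
    page.refcount += 1            # child's mapping
    page.refcount += 1            # child's GUP pin
    child_pin = page
    # 5. Child unmaps the page but keeps the GUP reference.
    page.refcount -= 1
    # 6. Parent writes: the fault handler looks the page up (one reference),
    #    frees its swap entry and drops the page from the swapcache.
    page.refcount += 1
    swap_entry_users -= 1
    if swap_entry_users == 0:
        page.refcount -= 1        # swapcache reference is gone
    if require_single_ref and page.refcount != 1:
        target = Page(page.data)  # copy: the pinned page stays untouched
        target.refcount = 1
        page.refcount -= 1
    else:
        target = page             # reuse: map the same page writable
    target.data = "v2"
    return child_pin.data
```

      With require_single_ref=False the child's pin observes the parent's new
      data; with require_single_ref=True the leftover GUP reference forces a
      copy and the child keeps seeing the content from before the write.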
      
      Let's apply the same COW logic as in do_wp_page(), conditionally trying to
      remove the page from the swapcache after freeing the swap entry, however,
      before actually mapping our page.  We can change the order now that we use
      try_to_free_swap(), which doesn't care about the mapcount, instead of
      reuse_swap_page().
      
      To handle references from the LRU pagevecs, conditionally drain the local
      LRU pagevecs when required; however, to keep it simple for now, don't
      consider the page_count() when deciding whether to drain.
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c145e0b4
    • David Hildenbrand's avatar
      mm: slightly clarify KSM logic in do_swap_page() · 84d60fdd
      David Hildenbrand authored
      Let's make it clearer that KSM might only have to copy a page in case we
      have a page in the swapcache, not if we allocated a fresh page and
      bypassed the swapcache.  While at it, add a comment why this is usually
      necessary and merge the two swapcache conditions.
      
      [akpm@linux-foundation.org: fix comment, per David]
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84d60fdd
    • David Hildenbrand's avatar
      mm: optimize do_wp_page() for fresh pages in local LRU pagevecs · d4c47097
      David Hildenbrand authored
      For example, if a page just got swapped in via a read fault, the LRU
      pagevecs might still hold a reference to the page.  If we trigger a write
      fault on such a page, the additional reference from the LRU pagevecs will
      prohibit reusing the page.
      
      Let's conditionally drain the local LRU pagevecs when we stumble over a
      !PageLRU() page.  We cannot easily drain remote LRU pagevecs and it might
      not be desirable performance-wise.  Consequently, this will only avoid
      copying in some cases.
      
      Add a simple "page_count(page) > 3" check first but keep the
      "page_count(page) > 1 + PageSwapCache(page)" check in place, as we want to
      minimize cases where we remove a page from the swapcache but won't be able
      to reuse it, for example, because another process has it mapped R/O, to
      not affect reclaim.
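      
      The order of the checks described above can be sketched as follows.  This
      is a userspace model under stated assumptions, not the kernel function:
      AnonPage, its fields and wp_page_can_reuse() are hypothetical names, and
      reading the limit of 3 as "mapping + swapcache + local LRU pagevec" is an
      interpretation of the text rather than something the message spells out.

```python
from dataclasses import dataclass


@dataclass
class AnonPage:
    """Toy stand-in for an ordinary anon page; not a kernel structure."""
    refcount: int
    on_lru: bool = True
    in_local_pagevec: bool = False
    in_swapcache: bool = False
    swap_entries_left: bool = False
    under_writeback: bool = False


def lru_add_drain(page):
    """Model: draining the local pagevec drops its temporary reference."""
    if page.in_local_pagevec:
        page.in_local_pagevec = False
        page.on_lru = True
        page.refcount -= 1


def try_to_free_swap(page):
    """Model: fails with remaining swap entries or while under writeback."""
    if not page.in_swapcache or page.swap_entries_left or page.under_writeback:
        return False
    page.in_swapcache = False
    page.refcount -= 1
    return True


def wp_page_can_reuse(page):
    """Return True if the page may be mapped writable, False if we must copy."""
    if page.refcount > 3:
        return False  # cheap early exit: no hope of ending up exclusive
    if not page.on_lru:
        lru_add_drain(page)  # only the local CPU's pagevecs are drained
    if page.refcount > 1 + int(page.in_swapcache):
        return False  # don't touch the swapcache if reuse would fail anyway
    if page.in_swapcache:
        try_to_free_swap(page)
    return page.refcount == 1
```

      A page that is not on the LRU for another reason, such as isolation for
      migration, keeps its extra reference after the drain in this model and
      is therefore copied.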
      
      We cannot easily handle the following cases and we will always have to
      copy:
      
      (1) The page is referenced in the LRU pagevecs of other CPUs. We really
          would have to drain the LRU pagevecs of all CPUs -- most probably
          copying is much cheaper.
      
      (2) The page is already PageLRU() but is getting moved between LRU
          lists, for example, for activation (e.g., mark_page_accessed()),
          deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to
          drain mostly unconditionally, which might be bad performance-wise.
          Most probably this won't happen too often in practice.
      
      Note that there are other reasons why an anon page might temporarily not
      be PageLRU(): for example, compaction and migration have to isolate LRU
      pages from the LRU lists first (isolate_lru_page()), moving them to
      temporary local lists and clearing PageLRU() and holding an additional
      reference on the page.  In that case, we'll always copy.
      
      This change seems to be fairly effective with the reproducer [1] shared by
      Nadav, as long as writeback is done synchronously, for example, using
      zram.  However, with asynchronous writeback, we'll usually fail to free
      the swapcache because the page is still under writeback: something we
      cannot easily optimize for, and maybe it's not really relevant in
      practice.
      
      [1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4c47097
    • David Hildenbrand's avatar
      mm: optimize do_wp_page() for exclusive pages in the swapcache · 53a05ad9
      David Hildenbrand authored
      Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3.
      
      This series attempts to optimize and streamline the COW logic for ordinary
      anon pages and THP anon pages, fixing two remaining instances of
      CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information
      can leak from a parent process to a child process via anonymous pages
      shared during fork().
      
      This issue, including other related COW issues, has been summarized in [2]:
      
       "1. Observing Memory Modifications of Private Pages From A Child Process
      
        Long story short: process-private memory might not be as private as you
        think once you fork(): successive modifications of private memory
        regions in the parent process can still be observed by the child
        process, for example, by smart use of vmsplice()+munmap().
      
        The core problem is that pinning pages readable in a child process, such
        as done via the vmsplice system call, can result in a child process
        observing memory modifications done in the parent process the child is
        not supposed to observe. [1] contains an excellent summary and [2]
        contains further details. This issue was assigned CVE-2020-29374 [9].
      
        For this to trigger, it's required to use a fork() without subsequent
        exec(), for example, as used under Android zygote. Without further
        details about an application that forks less-privileged child processes,
        one cannot really say what's actually affected and what's not -- see the
        details section the end of this mail for a short sshd/openssh analysis.
      
        While commit 17839856 ("gup: document and work around "COW can break
        either way" issue") fixed this issue and resulted in other problems
        (e.g., ptrace on pmem), commit 09854ba9 ("mm: do_wp_page()
        simplification") re-introduced part of the problem unfortunately.
      
        The original reproducer can be modified quite easily to use THP [3] and
        make the issue appear again on upstream kernels. I modified it to use
        hugetlb [4] and it triggers as well. The problem is certainly less
        severe with hugetlb than with THP; it merely highlights that we still
        have plenty of open holes we should be closing/fixing.
      
        Regarding vmsplice(), the only known workaround is to disallow the
        vmsplice() system call ... or disable THP and hugetlb. But who knows
        what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
        the end, it's a more generic issue"
      
      This security issue was first reported by Jann Horn on 27 May 2020 and it
      currently affects anonymous pages during swapin, anonymous THP and hugetlb.
      This series tackles anonymous pages during swapin and anonymous THP:
      
       - do_swap_page() for handling COW on PTEs during swapin directly
      
       - do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
         faults
      
      With this series, we'll apply the same COW logic we have in do_wp_page()
      to all swappable anon pages: don't reuse (map writable) the page in
      case there are additional references (page_count() != 1). All users of
      reuse_swap_page() are removed, and consequently reuse_swap_page() is
      removed.
      
      In general, we're struggling with the following COW-related issues:
      
      (1) "missed COW": we fail to copy on write and reuse the page (map it
          writable) although we must copy because there are pending references
          from another process to this page. The result is a security issue.
      
      (2) "wrong COW": we copy on write although we wouldn't have to and
          shouldn't: if there are valid GUP references, they will become out
          of sync with the pages mapped into the page table. We fail to detect
          that such a page can be reused safely, especially if never more than
          a single process mapped the page. The result is an intra process
          memory corruption.
      
      (3) "unnecessary COW": we copy on write although we wouldn't have to:
          performance degradation and temporary increases in swap+memory
          consumption can be the result.
      
      While this series fixes (1) for swappable anon pages, it first tries to
      reduce reported cases of (3) as well and as easily as possible to limit
      the impact when streamlining.  The individual patches try to describe in
      which cases we will run into (3).
      
      This series certainly makes (2) worse for THP, because a THP will now
      get PTE-mapped on write faults if there are additional references, even
      if there was only ever a single process involved: once PTE-mapped, we'll
      copy each and every subpage and won't reuse any subpage as long as the
      underlying compound page wasn't split.
      
      I'm working on an approach to fix (2) and improve (3): PageAnonExclusive
      to mark anon pages that are exclusive to a single process, allow GUP
      pins only on such exclusive pages, and allow turning exclusive pages
      shared (clearing PageAnonExclusive) only if there are no GUP pins.  Anon
      pages with PageAnonExclusive set never have to be copied during write
      faults, but eventually during fork() if they cannot be turned shared.
      The improved reuse logic in this series will essentially also be the
      logic to reset PageAnonExclusive.  This work will certainly take a
      while, but I'm planning on sharing details before having code fully
      ready.
      
      #1-#5 can be applied independently of the rest. #6-#9 are mostly only
      cleanups related to reuse_swap_page().
      
      Notes:
      * For now, I'll leave hugetlb code untouched: "unnecessary COW" might
        easily break existing setups because hugetlb pages are a scarce resource
        and we could just end up having to crash the application when we run out
        of hugetlb pages. We have to be very careful and the security aspect with
        hugetlb is most certainly less relevant than for unprivileged anon pages.
      * Instead of lru_add_drain() we might actually just drain the lru_add list
        or even just remove the single page of interest from the lru_add list.
        This would require a new helper function, and could be added if the
        conditional lru_add_drain() turns out to be a problem.
      * I extended the test case already included in [1] to also test for the
        newly found do_swap_page() case. I'll send that out separately once/if
        this part was merged.
      
      [1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      This patch (of 9):
      
      Liang Zhang reported [1] that the current COW logic in do_wp_page() is
      sub-optimal when it comes to swap+read fault+write fault of anonymous
      pages that have a single user, visible via a performance degradation in
      the redis benchmark.  Something similar was previously reported [2] by
      Nadav with a simple reproducer.
      
      After we put an anon page into the swapcache and unmapped it from a single
      process, that process might read that page again and refault it read-only.
      If that process then writes to that page, the process is actually the
      exclusive user of the page; however, the COW logic in do_wp_page() won't
      be able to reuse it due to the additional reference from the swapcache.
      
      Let's optimize for pages that have been added to the swapcache but only
      have an exclusive user.  Try removing the swapcache reference if there is
      hope that we're the exclusive user.
      
      We will fail removing the swapcache reference in two scenarios:
      (1) There are additional swap entries referencing the page: copying
          instead of reusing is the right thing to do.
      (2) The page is under writeback: theoretically we might be able to reuse
          in some cases, however, we cannot remove the additional reference
          and will have to copy.
      
      Note that we'll only try removing the page from the swapcache when it's
      highly likely that we'll be the exclusive owner after removing the page
      from the swapcache.  As we're about to map that page writable and redirty
      it, that should not affect reclaim but is rather the right thing to do.
      
      Further, we might have additional references from the LRU pagevecs, which
      will force us to copy instead of being able to reuse.  We'll try handling
      such references for some scenarios next.  Concurrent writeback cannot be
      handled easily and we'll always have to copy.
      
      While at it, remove the superfluous page_mapcount() check: it's
      implicitly covered by the page_count() for ordinary anon pages.
      
      [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
      [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Liang Zhang <zhangliang5@huawei.com>
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53a05ad9