1. 03 Dec, 2022 1 commit
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 0e15c3c7
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - build fix for the NR_CPUS Kconfig SBI version dependency
      
       - fixes to early memory initialization, to fix page permissions in EFI
         and post-initmem-free
      
       - build fix for the VDSO, to avoid trying to profile the VDSO functions
      
       - fixes for kexec crash handling, to fix multi-core and interrupt
         related initialization inside the crash kernel
      
       - fix for a race condition when handling multiple concurrect kernel
         stack overflows
      
      * tag 'riscv-for-linus-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: kexec: Fixup crash_smp_send_stop without multi cores
        riscv: kexec: Fixup irq controller broken in kexec crash path
        riscv: mm: Proper page permissions after initmem free
        riscv: vdso: fix section overlapping under some conditions
        riscv: fix race when vmap stack overflow
        riscv: Sync efi page table's kernel mappings before switching
        riscv: Fix NR_CPUS range conditions
      0e15c3c7
  2. 02 Dec, 2022 11 commits
    • Linus Torvalds's avatar
      Merge tag 'mmc-v6.1-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · 2df2adc3
      Linus Torvalds authored
      Pull MMC fixes from Ulf Hansson:
       "MMC core:
         - Fix ambiguous TRIM and DISCARD args
         - Fix removal of debugfs file for mmc_test
      
        MMC host:
         - mtk-sd: Add missing clk_disable_unprepare() in an error path
         - sdhci: Fix I/O voltage switch delay for UHS-I SD cards
         - sdhci-esdhc-imx: Fix CQHCI exit halt state check
         - sdhci-sprd: Fix voltage switch"
      
      * tag 'mmc-v6.1-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: sdhci-sprd: Fix no reset data and command after voltage switch
        mmc: sdhci: Fix voltage switch delay
        mmc: mtk-sd: Fix missing clk_disable_unprepare in msdc_of_clock_parse()
        mmc: mmc_test: Fix removal of debugfs file
        mmc: sdhci-esdhc-imx: correct CQHCI exit halt state check
        mmc: core: Fix ambiguous TRIM and DISCARD arg
      2df2adc3
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · f66f62f8
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
       "Intel VT-d fixes:
      
         - IO/TLB flush fix
      
         - Various pci_dev refcount fixes"
      
      * tag 'iommu-fixes-v6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/vt-d: Fix PCI device refcount leak in dmar_dev_scope_init()
        iommu/vt-d: Fix PCI device refcount leak in has_external_pci()
        iommu/vt-d: Fix PCI device refcount leak in prq_event_thread()
        iommu/vt-d: Add a fix for devices need extra dtlb flush
      f66f62f8
    • Pawan Gupta's avatar
      x86/bugs: Make sure MSR_SPEC_CTRL is updated properly upon resume from S3 · 66065157
      Pawan Gupta authored
      The "force" argument to write_spec_ctrl_current() is currently ambiguous
      as it does not guarantee the MSR write. This is due to the optimization
      that writes to the MSR happen only when the new value differs from the
      cached value.
      
      This is fine in most cases, but breaks for S3 resume when the cached MSR
      value gets out of sync with the hardware MSR value due to S3 resetting
      it.
      
      When x86_spec_ctrl_current is same as x86_spec_ctrl_base, the MSR write
      is skipped. Which results in SPEC_CTRL mitigations not getting restored.
      
      Move the MSR write from write_spec_ctrl_current() to a new function that
      unconditionally writes to the MSR. Update the callers accordingly and
      rename functions.
      
        [ bp: Rework a bit. ]
      
      Fixes: caa0ff24 ("x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value")
      Suggested-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Signed-off-by: default avatarBorislav Petkov (AMD) <bp@alien8.de>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>
      Link: https://lore.kernel.org/r/806d39b0bfec2fe8f50dc5446dff20f5bb24a959.1669821572.git.pawan.kumar.gupta@linux.intel.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66065157
    • Linus Torvalds's avatar
      Merge tag 'sound-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · a1e9185d
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Likely the last piece for 6.1; the only significant fixes are ASoC
        core ops fixes, while others are device-specific (rather minor) fixes
        in ASoC and FireWire drivers.
      
        All appear safe enough to take as a late stage material"
      
      * tag 'sound-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: dice: fix regression for Lexicon I-ONIX FW810S
        ASoC: cs42l51: Correct PGA Volume minimum value
        ASoC: ops: Correct bounds check for second channel on SX controls
        ASoC: tlv320adc3xxx: Fix build error for implicit function declaration
        ASoC: ops: Check bounds for second channel in snd_soc_put_volsw_sx()
        ASoC: ops: Fix bounds check for _sx controls
        ASoC: fsl_micfil: explicitly clear CHnF flags
        ASoC: fsl_micfil: explicitly clear software reset bit
      a1e9185d
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2022-12-02' of git://anongit.freedesktop.org/drm/drm · c290db01
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Things do seem to have finally settled down, just four i915 and one
        amdgpu this week. Probably won't have much for next week if you do
        push rc8 out.
      
        i915:
         - Fix dram info readout
         - Remove non-existent pipes from bigjoiner pipe mask
         - Fix negative value passed as remaining time
         - Never return 0 if not all requests retired
      
        amdgpu:
         - VCN fix for vangogh"
      
      * tag 'drm-fixes-2022-12-02' of git://anongit.freedesktop.org/drm/drm:
        drm/amdgpu: enable Vangogh VCN indirect sram mode
        drm/i915: Never return 0 if not all requests retired
        drm/i915: Fix negative value passed as remaining time
        drm/i915: Remove non-existent pipes from bigjoiner pipe mask
        drm/i915/mtl: Fix dram info readout
      c290db01
    • Linus Torvalds's avatar
      Merge tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm · bdaa78c6
      Linus Torvalds authored
      Pull misc hotfixes from Andrew Morton:
       "15 hotfixes,  11 marked cc:stable.
      
        Only three or four of the latter address post-6.0 issues, which is
        hopefully a sign that things are converging"
      
      * tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible"
        Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled
        drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame
        mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
        mm/khugepaged: fix GUP-fast interaction by sending IPI
        mm/khugepaged: take the right locks for page table retraction
        mm: migrate: fix THP's mapcount on isolation
        mm: introduce arch_has_hw_nonleaf_pmd_young()
        mm: add dummy pmd_young() for architectures not having it
        mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes()
        tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"
        nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()
        hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
        madvise: use zap_page_range_single for madvise dontneed
        mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
      bdaa78c6
    • Linus Torvalds's avatar
      v4l2: don't fall back to follow_pfn() if pin_user_pages_fast() fails · 6647e76a
      Linus Torvalds authored
      The V4L2_MEMORY_USERPTR interface is long deprecated and shouldn't be
      used (and is discouraged for any modern v4l drivers).  And Seth Jenkins
      points out that the fallback to VM_PFNMAP/VM_IO is fundamentally racy
      and dangerous.
      
      Note that it's not even a case that should trigger, since any normal
      user pointer logic ends up just using the pin_user_pages_fast() call
      that does the proper page reference counting.  That's not the problem
      case, only if you try to use special device mappings do you have any
      issues.
      
      Normally I'd just remove this during the merge window, but since Seth
      pointed out the problem cases, we really want to know as soon as
      possible if there are actually any users of this odd special case of a
      legacy interface.  Neither Hans nor Mauro seem to think that such
      mis-uses of the old legacy interface should exist.  As Mauro says:
      
       "See, V4L2 has actually 4 streaming APIs:
              - Kernel-allocated mmap (usually referred simply as just mmap);
              - USERPTR mmap;
              - read();
              - dmabuf;
      
        The USERPTR is one of the oldest way to use it, coming from V4L
        version 1 times, and by far the least used one"
      
      And Hans chimed in on the USERPTR interface:
      
       "To be honest, I wouldn't mind if it goes away completely, but that's a
        bit of a pipe dream right now"
      
      but while removing this legacy interface entirely may be a pipe dream we
      can at least try to remove the unlikely (and actively broken) case of
      using special device mappings for USERPTR accesses.
      
      This replaces it with a WARN_ONCE() that we can remove once we've
      hopefully confirmed that no actual users exist.
      
      NOTE! Longer term, this means that a 'struct frame_vector' only ever
      contains proper page pointers, and all the games we have with converting
      them to pages can go away (grep for 'frame_vector_to_pages()' and the
      uses of 'vec->is_pfns').  But this is just the first step, to verify
      that this code really is all dead, and do so as quickly as possible.
      Reported-by: default avatarSeth Jenkins <sethjenkins@google.com>
      Acked-by: default avatarHans Verkuil <hverkuil@xs4all.nl>
      Acked-by: default avatarMauro Carvalho Chehab <mchehab@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6647e76a
    • Xiongfeng Wang's avatar
      iommu/vt-d: Fix PCI device refcount leak in dmar_dev_scope_init() · 4bedbbd7
      Xiongfeng Wang authored
      for_each_pci_dev() is implemented by pci_get_device(). The comment of
      pci_get_device() says that it will increase the reference count for the
      returned pci_dev and also decrease the reference count for the input
      pci_dev @from if it is not NULL.
      
      If we break for_each_pci_dev() loop with pdev not NULL, we need to call
      pci_dev_put() to decrease the reference count. Add the missing
      pci_dev_put() for the error path to avoid reference count leak.
      
      Fixes: 2e455289 ("iommu/vt-d: Unify the way to process DMAR device scope array")
      Signed-off-by: default avatarXiongfeng Wang <wangxiongfeng2@huawei.com>
      Link: https://lore.kernel.org/r/20221121113649.190393-3-wangxiongfeng2@huawei.comSigned-off-by: default avatarLu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      4bedbbd7
    • Xiongfeng Wang's avatar
      iommu/vt-d: Fix PCI device refcount leak in has_external_pci() · afca9e19
      Xiongfeng Wang authored
      for_each_pci_dev() is implemented by pci_get_device(). The comment of
      pci_get_device() says that it will increase the reference count for the
      returned pci_dev and also decrease the reference count for the input
      pci_dev @from if it is not NULL.
      
      If we break for_each_pci_dev() loop with pdev not NULL, we need to call
      pci_dev_put() to decrease the reference count. Add the missing
      pci_dev_put() before 'return true' to avoid reference count leak.
      
      Fixes: 89a6079d ("iommu/vt-d: Force IOMMU on for platform opt in hint")
      Signed-off-by: default avatarXiongfeng Wang <wangxiongfeng2@huawei.com>
      Link: https://lore.kernel.org/r/20221121113649.190393-2-wangxiongfeng2@huawei.comSigned-off-by: default avatarLu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      afca9e19
    • Yang Yingliang's avatar
      iommu/vt-d: Fix PCI device refcount leak in prq_event_thread() · 6927d352
      Yang Yingliang authored
      As comment of pci_get_domain_bus_and_slot() says, it returns a pci device
      with refcount increment, when finish using it, the caller must decrease
      the reference count by calling pci_dev_put(). So call pci_dev_put() after
      using the 'pdev' to avoid refcount leak.
      
      Besides, if the 'pdev' is null or intel_svm_prq_report() returns error,
      there is no need to trace this fault.
      
      Fixes: 06f4b8d0 ("iommu/vt-d: Remove unnecessary SVA data accesses in page fault path")
      Suggested-by: default avatarLu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      Link: https://lore.kernel.org/r/20221119144028.2452731-1-yangyingliang@huawei.comSigned-off-by: default avatarLu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      6927d352
    • Jacob Pan's avatar
      iommu/vt-d: Add a fix for devices need extra dtlb flush · e65a6897
      Jacob Pan authored
      QAT devices on Intel Sapphire Rapids and Emerald Rapids have a defect in
      address translation service (ATS). These devices may inadvertently issue
      ATS invalidation completion before posted writes initiated with
      translated address that utilized translations matching the invalidation
      address range, violating the invalidation completion ordering.
      
      This patch adds an extra device TLB invalidation for the affected devices,
      it is needed to ensure no more posted writes with translated address
      following the invalidation completion. Therefore, the ordering is
      preserved and data-corruption is prevented.
      
      Device TLBs are invalidated under the following six conditions:
      1. Device driver does DMA API unmap IOVA
      2. Device driver unbind a PASID from a process, sva_unbind_device()
      3. PASID is torn down, after PASID cache is flushed. e.g. process
      exit_mmap() due to crash
      4. Under SVA usage, called by mmu_notifier.invalidate_range() where
      VM has to free pages that were unmapped
      5. userspace driver unmaps a DMA buffer
      6. Cache invalidation in vSVA usage (upcoming)
      
      For #1 and #2, device drivers are responsible for stopping DMA traffic
      before unmap/unbind. For #3, iommu driver gets mmu_notifier to
      invalidate TLB the same way as normal user unmap which will do an extra
      invalidation. The dTLB invalidation after PASID cache flush does not
      need an extra invalidation.
      
      Therefore, we only need to deal with #4 and #5 in this patch. #1 is also
      covered by this patch due to common code path with #5.
      Tested-by: default avatarYuzhang Luo <yuzhang.luo@intel.com>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarKevin Tian <kevin.tian@intel.com>
      Signed-off-by: default avatarJacob Pan <jacob.jun.pan@linux.intel.com>
      Link: https://lore.kernel.org/r/20221130062449.1360063-1-jacob.jun.pan@linux.intel.comSigned-off-by: default avatarLu Baolu <baolu.lu@linux.intel.com>
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      e65a6897
  3. 01 Dec, 2022 12 commits
  4. 30 Nov, 2022 16 commits
    • Linus Torvalds's avatar
      Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 063c0e77
      Linus Torvalds authored
      Pull clk fixes from Stephen Boyd:
       "A set of clk driver fixes that resolve issues for various SoCs.
      
        Most of these are incorrect clk data, like bad parent descriptions.
        When the clk tree is improperly described things don't work, like USB
        and UFS controllers, because clk frequencies are wonky. Here are the
        extra details:
      
         - Fix the parent of UFS reference clks on Qualcomm SC8280XP so that
           UFS works properly
      
         - Fix the clk ID for USB on AT91 RM9200 so the USB driver continues
           to probe
      
         - Stop using of_device_get_match_data() on the wrong device for a
           Samsung Exynos driver so it gets the proper clk data
      
         - Fix ExynosAutov9 binding
      
         - Fix the parent of the div4 clk on Exynos7885
      
         - Stop calling runtime PM APIs from the Qualcomm GDSC driver directly
           as it leads to a lockdep splat and is just plain wrong because it
           violates runtime PM semantics by calling runtime PM APIs when the
           device has been runtime PM disabled"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: qcom: gcc-sc8280xp: add cxo as parent for three ufs ref clks
        ARM: at91: rm9200: fix usb device clock id
        clk: samsung: Revert "clk: samsung: exynos-clkout: Use of_device_get_match_data()"
        dt-bindings: clock: exynosautov9: fix reference to CMU_FSYS1
        clk: qcom: gdsc: Remove direct runtime PM calls
        clk: samsung: exynos7885: Correct "div4" clock parents
      063c0e77
    • Andrew Morton's avatar
      revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible" · 1d351f18
      Andrew Morton authored
      It causes build failures with unusual CC/HOSTCC combinations.
      
      Quoting
      https://lkml.kernel.org/r/A222B1E6-69B8-4085-AD1B-27BDB72CA971@goldelico.com:
      
        HOSTCC  scripts/mod/modpost.o - due to target missing
      In file included from include/linux/string.h:5,
                       from scripts/mod/../../include/linux/license.h:5,
                       from scripts/mod/modpost.c:24:
      include/linux/compiler.h:246:10: fatal error: asm/rwonce.h: No such file or directory
        246 | #include <asm/rwonce.h>
            |          ^~~~~~~~~~~~~~
      compilation terminated.
      
      ...
      
      The problem is that HOSTCC is not necessarily the same compiler or even
      architecture as CC and pulling in <linux/compiler.h> or <asm/rwonce.h>
      files indirectly isn't a good idea then.
      
      My toolchain is providing HOSTCC = gcc (MacPorts) and CC = arm-linux-gnueabihf
      (built from gcc source) and all running on Darwin.
      
      If I change the include to <string.h> I can then "HOSTCC scripts/mod/modpost.c"
      but then it fails for "CC kernel/module/main.c" not finding <string.h>:
      
        CC      kernel/module/main.o - due to target missing
      In file included from kernel/module/main.c:43:0:
      ./include/linux/license.h:5:20: fatal error: string.h: No such file or directory
       #include <string.h>
                          ^
      compilation terminated.
      Reported-by: default avatar"H. Nikolaus Schaller" <hns@goldelico.com>
      Cc: Sam James <sam@gentoo.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1d351f18
    • Lee Jones's avatar
      Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled · 152fe65f
      Lee Jones authored
      When enabled, KASAN enlarges function's stack-frames.  Pushing quite a few
      over the current threshold.  This can mainly be seen on 32-bit
      architectures where the present limit (when !GCC) is a lowly 1024-Bytes.
      
      Link: https://lkml.kernel.org/r/20221125120750.3537134-3-lee@kernel.orgSigned-off-by: default avatarLee Jones <lee@kernel.org>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@gmail.com>
      Cc: Harry Wentland <harry.wentland@amd.com>
      Cc: Leo Li <sunpeng.li@amd.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
      Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      152fe65f
    • Lee Jones's avatar
      drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame · 6f6cb171
      Lee Jones authored
      Patch series "Fix a bunch of allmodconfig errors", v2.
      
      Since b339ec9c ("kbuild: Only default to -Werror if COMPILE_TEST")
      WERROR now defaults to COMPILE_TEST meaning that it's enabled for
      allmodconfig builds.  This leads to some interesting build failures when
      using Clang, each resolved in this set.
      
      With this set applied, I am able to obtain a successful allmodconfig Arm
      build.
      
      
      This patch (of 2):
      
      calculate_bandwidth() is presently broken on all !(X86_64 || SPARC64 ||
      ARM64) architectures built with Clang (all released versions), whereby the
      stack frame gets blown up to well over 5k.  This would cause an immediate
      kernel panic on most architectures.  We'll revert this when the following
      bug report has been resolved:
      https://github.com/llvm/llvm-project/issues/41896.
      
      Link: https://lkml.kernel.org/r/20221125120750.3537134-1-lee@kernel.org
      Link: https://lkml.kernel.org/r/20221125120750.3537134-2-lee@kernel.orgSigned-off-by: default avatarLee Jones <lee@kernel.org>
      Suggested-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@gmail.com>
      Cc: Harry Wentland <harry.wentland@amd.com>
      Cc: Lee Jones <lee@kernel.org>
      Cc: Leo Li <sunpeng.li@amd.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
      Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6f6cb171
    • Jann Horn's avatar
      mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths · f268f6cf
      Jann Horn authored
      Any codepath that zaps page table entries must invoke MMU notifiers to
      ensure that secondary MMUs (like KVM) don't keep accessing pages which
      aren't mapped anymore.  Secondary MMUs don't hold their own references to
      pages that are mirrored over, so failing to notify them can lead to page
      use-after-free.
      
      I'm marking this as addressing an issue introduced in commit f3f0e1d2
      ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
      the security impact of this only came in commit 27e1f827 ("khugepaged:
      enable collapse pmd for pte-mapped THP"), which actually omitted flushes
      for the removal of present PTEs, not just for the removal of empty page
      tables.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f268f6cf
    • Jann Horn's avatar
      mm/khugepaged: fix GUP-fast interaction by sending IPI · 2ba99c5e
      Jann Horn authored
      Since commit 70cbc3cc ("mm: gup: fix the fast GUP race against THP
      collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to
      ensure that the page table was not removed by khugepaged in between.
      
      However, lockless_pages_from_mm() still requires that the page table is
      not concurrently freed.  Fix it by sending IPIs (if the architecture uses
      semi-RCU-style page table freeing) before freeing/reusing page tables.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2ba99c5e
    • Jann Horn's avatar
      mm/khugepaged: take the right locks for page table retraction · 8d3c106e
      Jann Horn authored
      pagetable walks on address ranges mapped by VMAs can be done under the
      mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
      VMA's address_space.  Only one of these needs to be held, and it does not
      need to be held in exclusive mode.
      
      Under those circumstances, the rules for concurrent access to page table
      entries are:
      
       - Terminal page table entries (entries that don't point to another page
         table) can be arbitrarily changed under the page table lock, with the
         exception that they always need to be consistent for
         hardware page table walks and lockless_pages_from_mm().
         This includes that they can be changed into non-terminal entries.
       - Non-terminal page table entries (which point to another page table)
         can not be modified; readers are allowed to READ_ONCE() an entry, verify
         that it is non-terminal, and then assume that its value will stay as-is.
      
      Retracting a page table involves modifying a non-terminal entry, so
      page-table-level locks are insufficient to protect against concurrent page
      table traversal; it requires taking all the higher-level locks under which
      it is possible to start a page walk in the relevant range in exclusive
      mode.
      
      The collapse_huge_page() path for anonymous THP already follows this rule,
      but the shmem/file THP path was getting it wrong, making it possible for
      concurrent rmap-based operations to cause corruption.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
      Fixes: 27e1f827 ("khugepaged: enable collapse pmd for pte-mapped THP")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      8d3c106e
    • Gavin Shan's avatar
      mm: migrate: fix THP's mapcount on isolation · 829ae0f8
      Gavin Shan authored
      The issue is reported when removing memory through virtio_mem device.  The
      transparent huge page, experienced copy-on-write fault, is wrongly
      regarded as pinned.  The transparent huge page is escaped from being
      isolated in isolate_migratepages_block().  The transparent huge page can't
      be migrated and the corresponding memory block can't be put into offline
      state.
      
      Fix it by replacing page_mapcount() with total_mapcount().  With this, the
      transparent huge page can be isolated and migrated, and the memory block
      can be put into offline state.  Besides, The page's refcount is increased
      a bit earlier to avoid the page is released when the check is executed.
      
      Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com
      Fixes: 1da2f328 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
      Signed-off-by: default avatarGavin Shan <gshan@redhat.com>
      Reported-by: default avatarZhenyu Zhang <zhenyzha@redhat.com>
      Tested-by: default avatarZhenyu Zhang <zhenyzha@redhat.com>
      Suggested-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>	[5.7+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      829ae0f8
    • Juergen Gross's avatar
      mm: introduce arch_has_hw_nonleaf_pmd_young() · 4aaf269c
      Juergen Gross authored
      When running as a Xen PV guests commit eed9a328 ("mm: x86: add
      CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in
      pmdp_test_and_clear_young():
      
       BUG: unable to handle page fault for address: ffff8880083374d0
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0003) - permissions violation
       PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065
       Oops: 0003 [#1] PREEMPT SMP NOPTI
       CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1
       RIP: e030:pmdp_test_and_clear_young+0x25/0x40
      
      This happens because the Xen hypervisor can't emulate direct writes to
      page table entries other than PTEs.
      
      This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young()
      similar to arch_has_hw_pte_young() and test that instead of
      CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG.
      
      Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com
      Fixes: eed9a328 ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG")
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Reported-by: default avatarSander Eikelenboom <linux@eikelenboom.it>
      Acked-by: default avatarYu Zhao <yuzhao@google.com>
      Tested-by: default avatarSander Eikelenboom <linux@eikelenboom.it>
      Acked-by: David Hildenbrand <david@redhat.com>	[core changes]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4aaf269c
    • Juergen Gross's avatar
      mm: add dummy pmd_young() for architectures not having it · 6617da8f
      Juergen Gross authored
      In order to avoid #ifdeffery add a dummy pmd_young() implementation as a
      fallback.  This is required for the later patch "mm: introduce
      arch_has_hw_nonleaf_pmd_young()".
      
      Link: https://lkml.kernel.org/r/fd3ac3cd-7349-6bbd-890a-71a9454ca0b3@suse.comSigned-off-by: default avatarJuergen Gross <jgross@suse.com>
      Acked-by: default avatarYu Zhao <yuzhao@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6617da8f
    • SeongJae Park's avatar
      mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes() · 95bc35f9
      SeongJae Park authored
      Commit da878780 ("mm/damon/sysfs: support online inputs update") made
      'damon_sysfs_set_schemes()' to be called for running DAMON context, which
      could have schemes.  In the case, DAMON sysfs interface is supposed to
      update, remove, or add schemes to reflect the sysfs files.  However, the
      code is assuming the DAMON context wouldn't have schemes at all, and
      therefore creates and adds new schemes.  As a result, the code doesn't
      work as intended for online schemes tuning and could have more than
      expected memory footprint.  The schemes are all in the DAMON context, so
      it doesn't leak the memory, though.
      
      Remove the wrong asssumption (the DAMON context wouldn't have schemes) in
      'damon_sysfs_set_schemes()' to fix the bug.
      
      Link: https://lkml.kernel.org/r/20221122194831.3472-1-sj@kernel.org
      Fixes: da878780 ("mm/damon/sysfs: support online inputs update")
      Signed-off-by: default avatarSeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      95bc35f9
    • Tiezhu Yang's avatar
      tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep" · a435874b
      Tiezhu Yang authored
      The latest version of grep claims the egrep is now obsolete so the build
      now contains warnings that look like:
      
      	egrep: warning: egrep is obsolescent; using grep -E
      
      fix this up by moving the related file to use "grep -E" instead.
      
        sed -i "s/egrep/grep -E/g" `grep egrep -rwl tools/vm`
      
      Here are the steps to install the latest grep:
      
        wget http://ftp.gnu.org/gnu/grep/grep-3.8.tar.gz
        tar xf grep-3.8.tar.gz
        cd grep-3.8 && ./configure && make
        sudo make install
        export PATH=/usr/local/bin:$PATH
      
      Link: https://lkml.kernel.org/r/1668825419-30584-1-git-send-email-yangtiezhu@loongson.cnSigned-off-by: default avatarTiezhu Yang <yangtiezhu@loongson.cn>
      Reviewed-by: default avatarSergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a435874b
    • ZhangPeng's avatar
      nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry() · f0a0ccda
      ZhangPeng authored
      Syzbot reported a null-ptr-deref bug:
      
       NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP
       frequency < 30 seconds
       general protection fault, probably for non-canonical address
       0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN
       KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
       CPU: 1 PID: 3603 Comm: segctord Not tainted
       6.1.0-rc2-syzkaller-00105-gb229b6ca #0
       Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google
       10/11/2022
       RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0
       fs/nilfs2/alloc.c:608
       Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00
       00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02
       00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7
       RSP: 0018:ffffc90003dff830 EFLAGS: 00010212
       RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d
       RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010
       RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f
       R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158
       R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004
       FS:  0000000000000000(0000) GS:ffff8880b9b00000(0000)
       knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0
       Call Trace:
        <TASK>
        nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline]
        nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193
        nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236
        nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940
        nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline]
        nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline]
        nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088
        nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337
        nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568
        nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018
        nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067
        nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline]
        nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline]
        nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045
        nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379
        nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline]
        nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570
        kthread+0x2e4/0x3a0 kernel/kthread.c:376
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
        </TASK>
       ...
      
      If DAT metadata file is corrupted on disk, there is a case where
      req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during
      a b-tree operation that cascadingly updates ancestor nodes of the b-tree,
      because nilfs_dat_commit_alloc() for a lower level block can initialize
      the blocknr on the same DAT entry between nilfs_dat_prepare_end() and
      nilfs_dat_commit_end().
      
      If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free()
      without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and
      causes the NULL pointer dereference above in
      nilfs_palloc_commit_free_entry() function, which leads to a crash.
      
      Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh
      before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free().
      
      This also calls nilfs_error() in that case to notify that there is a fatal
      flaw in the filesystem metadata and prevent further operations.
      
      Link: https://lkml.kernel.org/r/00000000000097c20205ebaea3d6@google.com
      Link: https://lkml.kernel.org/r/20221114040441.1649940-1-zhangpeng362@huawei.com
      Link: https://lkml.kernel.org/r/20221119120542.17204-1-konishi.ryusuke@gmail.comSigned-off-by: default avatarZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: syzbot+ebe05ee8e98f755f61d0@syzkaller.appspotmail.com
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f0a0ccda
    • Mike Kravetz's avatar
      hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing · 04ada095
      Mike Kravetz authored
      madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
      tables associated with the address range.  For hugetlb vmas,
      zap_page_range will call __unmap_hugepage_range_final.  However,
      __unmap_hugepage_range_final assumes the passed vma is about to be removed
      and deletes the vma_lock to prevent pmd sharing as the vma is on the way
      out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
      missing vma_lock prevents pmd sharing and could potentially lead to issues
      with truncation/fault races.
      
      This issue was originally reported here [1] as a BUG triggered in
      page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
      vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
      prevent pmd sharing.  Subsequent faults on this vma were confused as
      VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
      not set in new pages added to the page table.  This resulted in pages that
      appeared anonymous in a VM_SHARED vma and triggered the BUG.
      
      Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
      call from unmap_vmas().  This is used to indicate the 'final' unmapping of
      a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
      the vm_lock is not deleted.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: default avatarWei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      04ada095
    • Mike Kravetz's avatar
      madvise: use zap_page_range_single for madvise dontneed · 21b85b09
      Mike Kravetz authored
      This series addresses the issue first reported in [1], and fully described
      in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
      for stable backports.
      
      While exploring solutions to this issue, related problems with mmu
      notification calls were discovered.  This is addressed in the patch
      "hugetlb: remove duplicate mmu notifications:".  Since there are no user
      visible effects, this third is not tagged for stable backports.
      
      Previous discussions suggested further cleanup by removing the
      routine zap_page_range.  This is possible because zap_page_range_single
      is now exported, and all callers of zap_page_range pass ranges entirely
      within a single vma.  This work will be done in a later patch so as not
      to distract from this bug fix.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      
      This patch (of 2):
      
      Expose the routine zap_page_range_single to zap a range within a single
      vma.  The madvise routine madvise_dontneed_single_vma can use this routine
      as it explicitly operates on a single vma.  Also, update the mmu
      notification range in zap_page_range_single to take hugetlb pmd sharing
      into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: default avatarWei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      21b85b09
    • Yang Shi's avatar
      mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE · dec1d352
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
      include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
      6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
      ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
      96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
      f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      It is because khugepaged allocates pages with __GFP_THISNODE, but the
      preferred node is bogus.  The previous patch fixed the khugepaged code to
      avoid allocating page from non-existing node.  But it is still racy
      against memory hotremove.  There is no synchronization with the memory
      hotplug so it is possible that memory gets offline during a longer taking
      scanning.
      
      So this warning still seems not quite helpful because:
        * There is no guarantee the node is online for __GFP_THISNODE context
          for all the callsites.
        * Kernel just fails the allocation regardless the warning, and it looks
          all callsites handle the allocation failure gracefully.
      
      Although while the warning has helped to identify a buggy code, it is not
      safe in general and this warning could panic the system with panic-on-warn
      configuration which tends to be used surprisingly often.  So replace
      VM_WARN_ON to pr_warn().  And the warning will be triggered if
      __GFP_NOWARN is set since the allocator would print out warning for such
      case if __GFP_NOWARN is not set.
      
      [shy828301@gmail.com: rename nid to this_node and gfp to warn_gfp]
        Link: https://lkml.kernel.org/r/20221123193014.153983-1-shy828301@gmail.com
      [akpm@linux-foundation.org: fix whitespace]
      [akpm@linux-foundation.org: print gfp_mask instead of warn_gfp, per Michel]
      Link: https://lkml.kernel.org/r/20221108184357.55614-3-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      dec1d352