1. 03 Feb, 2023 18 commits
  2. 01 Feb, 2023 22 commits
    • Sync mm-stable with mm-hotfixes-stable to pick up dependent patches · 5ab0fc15
      Andrew Morton authored
      Merge branch 'mm-hotfixes-stable' into mm-stable
      5ab0fc15
    • mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath() · ac86f547
      Kefeng Wang authored
      Since commit 18365225 ("hwpoison, memcg: forcibly uncharge LRU pages"),
      hwpoison forcibly uncharges an LRU hwpoisoned page, so folio_memcg can
      be NULL and mem_cgroup_track_foreign_dirty_slowpath() can then
      dereference a NULL pointer.  Fix this by not recording foreign
      writebacks in mem_cgroup_track_foreign_dirty() when the folio memcg is
      NULL.
      
      Link: https://lkml.kernel.org/r/20230129040945.180629-1-wangkefeng.wang@huawei.com
      Fixes: 97b27821 ("writeback, memcg: Implement foreign dirty flushing")
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reported-by: Ma Wupeng <mawupeng1@huawei.com>
      Tested-by: Miko Larsson <mikoxyzzz@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Ma Wupeng <mawupeng1@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac86f547
    • Kconfig.debug: fix the help description in SCHED_DEBUG · 1e90e35b
      ye xingchen authored
      The correct file path for SCHED_DEBUG is /sys/kernel/debug/sched.
      
      Link: https://lkml.kernel.org/r/202301291013573466558@zte.com.cn
      Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Josh Poimboeuf <jpoimboe@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1e90e35b
    • mm/swapfile: add cond_resched() in get_swap_pages() · 7717fc1a
      Longlong Xia authored
      The softlockup still occurs in get_swap_pages() under memory pressure.
      The test system has 64 CPU cores, 64GB of memory, and 28 zram devices;
      each zram device has a disksize of 50MB and the same priority as si.
      The stress-ng tool is used to increase memory pressure, causing the
      system to OOM frequently.

      The plist_for_each_entry_safe() loop in get_swap_pages() can iterate tens
      of thousands of times while looking for available space (in the extreme
      case, cond_resched() is never called in scan_swap_map_slots()).  Add
      cond_resched() to get_swap_pages() when it fails to find available space
      to avoid the softlockup.
      
      Link: https://lkml.kernel.org/r/20230128094757.1060525-1-xialonglong1@huawei.com
      Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7717fc1a
    • mm: use stack_depot_early_init for kmemleak · 993f57e0
      Zhaoyang Huang authored
      Mirsad reported the error below, which is caused by a stack_depot_init()
      failure in kvcalloc.  Solve this by having stackdepot use
      stack_depot_early_init().
      
      On 1/4/23 17:08, Mirsad Goran Todorovac wrote:
      I hate to bring bad news again, but there seems to be a problem with the output of /sys/kernel/debug/kmemleak:
      
      [root@pc-mtodorov ~]# cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff951c118568b0 (size 16):
      comm "kworker/u12:2", pid 56, jiffies 4294893952 (age 4356.548s)
      hex dump (first 16 bytes):
          6d 65 6d 73 74 69 63 6b 30 00 00 00 00 00 00 00 memstick0.......
          backtrace:
      [root@pc-mtodorov ~]#
      
      Apparently, backtrace of called functions on the stack is no longer
      printed with the list of memory leaks.  This appeared on Lenovo desktop
      10TX000VCR, with AlmaLinux 8.7 and BIOS version M22KT49A (11/10/2022) and
      6.2-rc1 and 6.2-rc2 builds.  This worked on 6.1 with the same
      CONFIG_KMEMLEAK=y and MGLRU enabled on a vanilla mainstream kernel from
      Mr.  Torvalds' tree.  I don't know if this is deliberate feature for some
      reason or a bug.  Please find attached the config, lshw and kmemleak
      output.
      
      [vbabka@suse.cz: remove stack_depot_init() call]
      Link: https://lore.kernel.org/all/5272a819-ef74-65ff-be61-4d2d567337de@alu.unizg.hr/
      Link: https://lkml.kernel.org/r/1674091345-14799-2-git-send-email-zhaoyang.huang@unisoc.com
      Fixes: 56a61617 ("mm: use stack_depot for recording kmemleak's backtrace")
      Reported-by: Mirsad Todorovac <mirsad.todorovac@alu.unizg.hr>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: ke.wang <ke.wang@unisoc.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      993f57e0
    • Squashfs: fix handling and sanity checking of xattr_ids count · f65c4bbb
      Phillip Lougher authored
      A syzbot [1] corrupted filesystem exposes two flaws in the handling and
      sanity checking of the xattr_ids count in the filesystem.  Both of these
      flaws cause computation overflow due to incorrect typing.
      
      In the corrupted filesystem the xattr_ids value is 4294967071, which
      stored in a signed variable becomes the negative number -225.
      
      Flaw 1 (64-bit systems only):
      
      The signed integer xattr_ids variable causes sign extension.
      
      This causes variable overflow in the SQUASHFS_XATTR_*(A) macros.  The
      variable is first multiplied by sizeof(struct squashfs_xattr_id) where the
      type of the sizeof operator is "unsigned long".
      
      On a 64-bit system this is 64-bits in size, and causes the negative number
      to be sign extended and widened to 64-bits and then become unsigned.  This
      produces the very large number 18446744073709548016 or 2^64 - 3600.  This
      number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and
      divided by SQUASHFS_METADATA_SIZE overflows and produces a length of 0
      (stored in len).
      
      Flaw 2 (32-bit systems only):
      
      On a 32-bit system the integer variable is not widened by the unsigned
      long type of the sizeof operator (32-bits), and the signedness of the
      variable has no effect due to it always being treated as unsigned.
      
      The above corrupted xattr_ids value of 4294967071, when multiplied
      overflows and produces the number 4294963696 or 2^32 - 3600.  This number
      when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by
      SQUASHFS_METADATA_SIZE overflows again and produces a length of 0.
      
      The effect of the 0 length computation:
      
      In conjunction with the corrupted xattr_ids field, the filesystem also has
      a corrupted xattr_table_start value, where it matches the end of
      filesystem value of 850.
      
      This causes the following sanity check code to fail because the
      incorrectly computed len of 0 matches the incorrect size of the table
      reported by the superblock (0 bytes).
      
          len = SQUASHFS_XATTR_BLOCK_BYTES(*xattr_ids);
          indexes = SQUASHFS_XATTR_BLOCKS(*xattr_ids);
      
          /*
           * The computed size of the index table (len bytes) should exactly
           * match the table start and end points
          */
          start = table_start + sizeof(*id_table);
          end = msblk->bytes_used;
      
          if (len != (end - start))
                  return ERR_PTR(-EINVAL);
      
      Changing the xattr_ids variable to be "unsigned int" fixes the flaw on a
      64-bit system.  This relies on the fact the computation is widened by the
      unsigned long type of the sizeof operator.
      
      Casting the variable to u64 in the above macro fixes this flaw on a 32-bit
      system.
      
      It also means 64-bit systems do not implicitly rely on the type of the
      sizeof operator to widen the computation.
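
      The arithmetic described above can be reproduced in user space.  The
      following is a minimal sketch, not the kernel code: the function names
      are hypothetical, and the entry size of 16 bytes is an assumption
      inferred from the commit's own numbers (225 x 16 = 3600).  It models
      the signed 64-bit case, the 32-bit case, and the widened unsigned
      computation that the fix relies on.

```c
#include <stdint.h>

/* Assumed constants for this sketch. */
#define XATTR_ID_SIZE 16u   /* assumed sizeof(struct squashfs_xattr_id) */
#define METADATA_SIZE 8192u /* SQUASHFS_METADATA_SIZE */

/* Flaw 1 model: a signed 32-bit count is sign-extended to 64 bits
 * before being multiplied by a 64-bit unsigned sizeof result. */
static uint64_t blocks_signed_64bit(uint32_t raw_count)
{
	/* Reinterpret the raw on-disk value as a signed 32-bit number. */
	int64_t ids = (int64_t)raw_count -
		      ((raw_count >> 31) ? ((int64_t)1 << 32) : 0);
	uint64_t bytes = (uint64_t)ids * XATTR_ID_SIZE; /* 2^64 - 3600 */

	/* The round-up addition wraps, so the division yields 0. */
	return (bytes + METADATA_SIZE - 1) / METADATA_SIZE;
}

/* Flaw 2 model: everything is computed in 32-bit unsigned arithmetic. */
static uint32_t blocks_32bit(uint32_t raw_count)
{
	uint32_t bytes = raw_count * XATTR_ID_SIZE; /* wraps to 2^32 - 3600 */

	return (bytes + METADATA_SIZE - 1) / METADATA_SIZE;
}

/* Fixed model: unsigned count, explicitly widened to 64 bits. */
static uint64_t blocks_fixed(uint32_t raw_count)
{
	uint64_t bytes = (uint64_t)raw_count * XATTR_ID_SIZE;

	return (bytes + METADATA_SIZE - 1) / METADATA_SIZE;
}
```

      With the corrupted count 4294967071, both flawed models return 0 blocks,
      while the widened computation returns a large non-zero block count that
      the size check against the superblock can then reject.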
      
      [1] https://lore.kernel.org/lkml/000000000000cd44f005f1a0f17f@google.com/
      
      Link: https://lkml.kernel.org/r/20230127061842.10965-1-phillip@squashfs.org.uk
      Fixes: 506220d2 ("squashfs: add more sanity checks in xattr id lookup")
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com>
      Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Cc: Fedor Pchelkin <pchelkin@ispras.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f65c4bbb
    • sh: define RUNTIME_DISCARD_EXIT · c1c551be
      Tom Saeger authored
      sh vmlinux fails to link with GNU ld < 2.40 (likely < 2.36) since
      commit 99cb0d91 ("arch: fix broken BuildID for arm64 and riscv").
      
      This is similar to fixes for powerpc and s390:
      commit 4b9880db ("powerpc/vmlinux.lds: Define RUNTIME_DISCARD_EXIT").
      commit a494398b ("s390: define RUNTIME_DISCARD_EXIT to fix link error
      with GNU ld < 2.36").
      
        $ sh4-linux-gnu-ld --version | head -n1
        GNU ld (GNU Binutils for Debian) 2.35.2
      
        $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu- microdev_defconfig
        $ make ARCH=sh CROSS_COMPILE=sh4-linux-gnu-
      
        `.exit.text' referenced in section `__bug_table' of crypto/algboss.o:
        defined in discarded section `.exit.text' of crypto/algboss.o
        `.exit.text' referenced in section `__bug_table' of
        drivers/char/hw_random/core.o: defined in discarded section
        `.exit.text' of drivers/char/hw_random/core.o
        make[2]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
        make[1]: *** [Makefile:1252: vmlinux] Error 2
      
      arch/sh/kernel/vmlinux.lds.S keeps EXIT_TEXT:
      
      	/*
      	 * .exit.text is discarded at runtime, not link time, to deal with
      	 * references from __bug_table
      	 */
      	.exit.text : AT(ADDR(.exit.text)) { EXIT_TEXT }
      
      However, EXIT_TEXT is thrown away by
      DISCARD(include/asm-generic/vmlinux.lds.h) because
      sh does not define RUNTIME_DISCARD_EXIT.
      
      GNU ld 2.40 does not have this issue and builds fine.
      This corresponds with Masahiro's comments in a494398b:
      "Nathan [Chancellor] also found that binutils
      commit 21401fc7bf67 ("Duplicate output sections in scripts") cured this
      issue, so we cannot reproduce it with binutils 2.36+, but it is better
      to not rely on it."
      
      Link: https://lkml.kernel.org/r/9166a8abdc0f979e50377e61780a4bba1dfa2f52.1674518464.git.tom.saeger@oracle.com
      Fixes: 99cb0d91 ("arch: fix broken BuildID for arm64 and riscv")
      Link: https://lore.kernel.org/all/Y7Jal56f6UBh1abE@dev-arch.thelio-3990X/
      Link: https://lore.kernel.org/all/20230123194218.47ssfzhrpnv3xfez@oracle.com/
      Signed-off-by: Tom Saeger <tom.saeger@oracle.com>
      Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dennis Gilmore <dennis@ausil.us>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Palmer Dabbelt <palmer@rivosinc.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c1c551be
    • highmem: round down the address passed to kunmap_flush_on_unmap() · 88d7b120
      Matthew Wilcox (Oracle) authored
      We already round down the address in kunmap_local_indexed() which is the
      other implementation of __kunmap_local().  The only implementation of
      kunmap_flush_on_unmap() is PA-RISC which is expecting a page-aligned
      address.  This may be causing PA-RISC to be flushing the wrong addresses
      currently.
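
      Rounding an address down to its page boundary is a single mask
      operation.  The sketch below is a user-space illustration only: the
      macro and function names are hypothetical and a 4096-byte page size is
      assumed, whereas the kernel uses its own PAGE_MASK definition.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for this sketch */
#define SKETCH_PAGE_MASK (~(uintptr_t)(SKETCH_PAGE_SIZE - 1))

/* Clear the offset-within-page bits so the result is page-aligned. */
static uintptr_t page_align_down(uintptr_t addr)
{
	return addr & SKETCH_PAGE_MASK;
}
```

      An address that points into the middle of a mapped page, such as
      0x12345678, becomes 0x12345000, which is the form a flush routine
      expecting a page-aligned address needs.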
      
      Link: https://lkml.kernel.org/r/20230126200727.1680362-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Fixes: 298fa1ad ("highmem: Provide generic variant of kmap_atomic*")
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      88d7b120
    • migrate: hugetlb: check for hugetlb shared PMD in node migration · 73bdf65e
      Mike Kravetz authored
      migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
      move pages shared with another process to a different node.  page_mapcount
      > 1 is being used to determine if a hugetlb page is shared.  However, a
      hugetlb page will have a mapcount of 1 if mapped by multiple processes via
      a shared PMD.  As a result, hugetlb pages shared by multiple processes and
      mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
      
      To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
      consider the page shared.
      
      Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
      Fixes: e2d8cf40 ("migrate: add hugepage migration code to migrate_pages()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      73bdf65e
    • mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps · 3489dbb6
      Mike Kravetz authored
      Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs".
      
      This issue of mapcount in hugetlb pages referenced by shared PMDs was
      discussed in [1].  The following two patches address user visible behavior
      caused by this issue.
      
      [1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/
      
      
      This patch (of 2):
      
      A hugetlb page will have a mapcount of 1 if mapped by multiple processes
      via a shared PMD.  This is because only the first process increases the
      map count, and subsequent processes just add the shared PMD page to their
      page table.
      
      page_mapcount is being used to decide if a hugetlb page is shared or
      private in /proc/PID/smaps.  Pages referenced via a shared PMD were
      incorrectly being counted as private.
      
      To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
      count the hugetlb page as shared.  A new helper to check for a shared PMD
      is added.
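
      The decision rule can be modelled outside the kernel.  The sketch below
      is hypothetical throughout: the struct and function names are invented
      for illustration, and it assumes that a shared PMD is detected through
      the reference count of the page-table page holding it, which the
      page_count() note in this commit suggests but the message does not
      spell out.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel state involved. */
struct hugetlb_mapping_model {
	int mapcount;          /* models page_mapcount() of the hugetlb page */
	int pmd_page_refcount; /* models the refcount of the page-table page
				* that holds the PMD (assumption) */
};

/* Returns true when the page should be accounted as shared. */
static bool model_page_is_shared(const struct hugetlb_mapping_model *m)
{
	if (m->mapcount > 1)
		return true;

	/* A mapcount of 1 can still hide sharing through a shared PMD. */
	return m->mapcount == 1 && m->pmd_page_refcount > 1;
}
```

      Under this model a page with mapcount 1 whose PMD page is referenced by
      several processes is counted as shared, which is the case the old
      mapcount-only test reported as private.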
      
      [akpm@linux-foundation.org: simplification, per David]
      [akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()]
      Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com
      Fixes: 25ee01a2 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3489dbb6
    • mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups · edb5d0cf
      Zach O'Keefe authored
      In commit 34488399 ("mm/madvise: add file and shmem support to
      MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():
      
      	-       if (!pmd_present(pmde))
      	-               return SCAN_PMD_NULL;
      	+       if (pmd_none(pmde))
      	+               return SCAN_PMD_NONE;
      
      This was for-use by MADV_COLLAPSE file/shmem codepaths, where
      MADV_COLLAPSE might identify a pte-mapped hugepage, only to have
      khugepaged race-in, free the pte table, and clear the pmd.  Such codepaths
      include:
      
      A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
         already in the pagecache.
      B) In retract_page_tables(), if we fail to grab mmap_lock for the target
         mm/address.
      
      In these cases, collapse_pte_mapped_thp() really does expect a none (not
      just !present) pmd, and we want to suitably identify that case separate
      from the case where no pmd is found, or it's a bad-pmd (of course, many
      things could happen once we drop mmap_lock, and the pmd could plausibly
      undergo multiple transitions due to intervening fault, split, etc). 
      Regardless, the code is prepared to install a huge-pmd only when the existing
      pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.
      
      However, the commit introduces a logical hole; namely, that we've allowed
      !none- && !huge- && !bad-pmds to be classified as genuine
      pte-table-mapping-pmds.  One such example that could leak through are swap
      entries.  The pmd values aren't checked again before use in
      pte_offset_map_lock(), which is expecting nothing less than a genuine
      pte-table-mapping-pmd.
      
      We want to put back the !pmd_present() check (below the pmd_none() check),
      but need to be careful to deal with subtleties in pmd transitions and
      treatments by various arch.
      
      The issue is that __split_huge_pmd_locked() temporarily clears the present
      bit (or otherwise marks the entry as invalid), but pmd_present() and
      pmd_trans_huge() still need to return true while the pmd is in this
      transitory state.  For example, x86's pmd_present() also checks the
      _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64
      also checks a PMD_PRESENT_INVALID bit.
      
      Covering all 4 cases for x86 (all checks done on the same pmd value):
      
      1) pmd_present() && pmd_trans_huge()
         All we actually know here is that the PSE bit is set. Either:
         a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
            is set.
            => huge-pmd
         b) We are currently racing with __split_huge_page().  The danger here
            is that we proceed as-if we have a huge-pmd, but really we are
            looking at a pte-mapping-pmd.  So, what is the risk of this
            danger?
      
            The only relevant path is:
      
      	madvise_collapse() -> collapse_pte_mapped_thp()
      
            Where we might just incorrectly report back "success", when really
            the memory isn't pmd-backed.  This is fine, since split could
            happen immediately after (actually) successful madvise_collapse().
            So, it should be safe to just assume huge-pmd here.
      
      2) pmd_present() && !pmd_trans_huge()
         Either:
         a) PSE not set and either PRESENT or PROTNONE is.
            => pte-table-mapping pmd (or PROT_NONE)
         b) devmap.  This routine can be called immediately after
            unlocking/locking mmap_lock -- or called with no locks held (see
            khugepaged_scan_mm_slot()), so previous VMA checks have since been
            invalidated.
      
      3) !pmd_present() && pmd_trans_huge()
        Not possible.
      
      4) !pmd_present() && !pmd_trans_huge()
        Neither PRESENT nor PROTNONE set
        => not present
      
      I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
      powerpc, loongarch, x86, mips, s390) and this logic roughly translates
      (though devmap treatment is unique to x86 and powerpc, and (3) doesn't
      necessarily hold in general -- but that doesn't matter since
      !pmd_present() always takes failure path).
      
      Also, add a comment above find_pmd_or_thp_or_none() to help future
      travelers reason about the validity of the code; namely, the possible
      mutations that might happen out from under us, depending on how mmap_lock
      is held (if at all).
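
      The resulting order of checks can be sketched as a small classifier.
      This is a user-space model, not the kernel function: the pmd is reduced
      to a set of booleans, the type and function names are hypothetical, and
      the placement of the devmap and bad-pmd checks after the huge check is
      an assumption.  Only the none check followed by the restored
      !pmd_present() check is taken directly from the description above.

```c
#include <stdbool.h>

/* Hypothetical result codes; the first two mirror SCAN_PMD_NONE and
 * SCAN_PMD_NULL from the text, the others are assumed for the sketch. */
enum scan_result_model {
	MODEL_SUCCEED,
	MODEL_PMD_NONE,
	MODEL_PMD_NULL,
	MODEL_PMD_MAPPED
};

/* Each flag models the answer of the matching pmd_*() predicate. */
struct pmd_model {
	bool none;
	bool present;
	bool trans_huge;
	bool devmap;
	bool bad;
};

static enum scan_result_model classify_pmd(struct pmd_model p)
{
	if (p.none)
		return MODEL_PMD_NONE;
	if (!p.present)		/* restored check: catches e.g. swap entries */
		return MODEL_PMD_NULL;
	if (p.trans_huge)
		return MODEL_PMD_MAPPED;
	if (p.devmap)
		return MODEL_PMD_NULL;
	if (p.bad)
		return MODEL_PMD_NULL;
	return MODEL_SUCCEED;	/* genuine pte-table-mapping pmd */
}
```

      In this model a swap entry (not none, not present) takes the failure
      path instead of being classified as a pte-table-mapping pmd.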
      
      Link: https://lkml.kernel.org/r/20230125225358.2576151-1-zokeefe@google.com
      Fixes: 34488399 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      edb5d0cf
    • Revert "mm: kmemleak: alloc gray object for reserved region with direct map" · 8ef852f1
      Isaac J. Manjarres authored
      This reverts commit 972fa3a7.
      
      Kmemleak operates by periodically scanning memory regions for pointers to
      allocated memory blocks to determine if they are leaked or not.  However,
      reserved memory regions can be used for DMA transactions between a device
      and a CPU, and thus, wouldn't contain pointers to allocated memory blocks,
      making them inappropriate for kmemleak to scan.  Thus, revert this commit.
      
      Link: https://lkml.kernel.org/r/20230124230254.295589-1-isaacmanjarres@google.com
      Fixes: 972fa3a7 ("mm: kmemleak: alloc gray object for reserved region with direct map")
      Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Calvin Zhang <calvinzhang.cool@gmail.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Saravana Kannan <saravanak@google.com>
      Cc: <stable@vger.kernel.org>	[5.17+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8ef852f1
    • freevxfs: Kconfig: fix spelling · 0d7866ea
      Randy Dunlap authored
      Fix a spello in freevxfs Kconfig.
      (reported by codespell)
      
      Link: https://lkml.kernel.org/r/20230124181638.15604-1-rdunlap@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0d7866ea
    • maple_tree: should get pivots boundary by type · ab6ef70a
      Wei Yang authored
      We should get pivots boundary by type.  Fixes a potential overindexing of
      mt_pivots[].
      
      Link: https://lkml.kernel.org/r/20221112234308.23823-1-richard.weiyang@gmail.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ab6ef70a
    • Eugen Hristev's avatar
    • mm, mremap: fix mremap() expanding for vma's with vm_ops->close() · d014cd7c
      Vlastimil Babka authored
      Fabian has reported another regression in 6.1 due to ca3d76b0 ("mm:
      add merging after mremap resize").  The problem is that vma_merge() can
      fail when vma has a vm_ops->close() method, causing is_mergeable_vma()
      test to be negative.  This was happening for vma mapping a file from
      fuse-overlayfs, which does have the method.  But when we are simply
      expanding the vma, we never remove it due to the "merge" with the added
      area, so the test should not prevent the expansion.
      
      As a quick fix, check for such vmas and expand them using vma_adjust()
      directly as was done before commit ca3d76b0.  For a more robust long
      term solution we should try to limit the check for vm_ops->close only to
      cases that actually result in vma removal, so that no merge would be
      prevented unnecessarily.
      
      [akpm@linux-foundation.org: fix indenting whitespace, reflow comment]
      Link: https://lkml.kernel.org/r/20230117101939.9753-1-vbabka@suse.cz
      Fixes: ca3d76b0 ("mm: add merging after mremap resize")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Fabian Vogt <fvogt@suse.com>
        Link: https://bugzilla.suse.com/show_bug.cgi?id=1206359#c35
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Cc: Jakub Matěna <matenajakub@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d014cd7c
    • squashfs: harden sanity check in squashfs_read_xattr_id_table · 72e544b1
      Fedor Pchelkin authored
      While mounting a corrupted filesystem, a signed integer '*xattr_ids' can
      become less than zero.  This leads to the incorrect computation of 'len'
      and 'indexes' values which can cause null-ptr-deref in copy_bio_to_actor()
      or out-of-bounds accesses in the next sanity checks inside
      squashfs_read_xattr_id_table().
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Link: https://lkml.kernel.org/r/20230117105226.329303-2-pchelkin@ispras.ru
      Fixes: 506220d2 ("squashfs: add more sanity checks in xattr id lookup")
      Reported-by: <syzbot+082fa4af80a5bb1a9843@syzkaller.appspotmail.com>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Cc: Phillip Lougher <phillip@squashfs.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      72e544b1
    • ia64: fix build error due to switch case label appearing next to declaration · 6f28a261
      James Morse authored
      Since commit aa06a9bd ("ia64: fix clock_getres(CLOCK_MONOTONIC) to
      report ITC frequency"), gcc 10.1.0 fails to build ia64 with the gnomic:
      | ../arch/ia64/kernel/sys_ia64.c: In function 'ia64_clock_getres':
      | ../arch/ia64/kernel/sys_ia64.c:189:3: error: a label can only be part of a statement and a declaration is not a statement
      |   189 |   s64 tick_ns = DIV_ROUND_UP(NSEC_PER_SEC, local_cpu_data->itc_freq);
      
      This line appears immediately after a case label in a switch.
      
      Move the declarations out of the case, to the top of the function.
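
      The shape of the fix can be shown with a small stand-alone function.
      This is a sketch under assumptions, not the ia64 code: all names are
      hypothetical, and the frequency is passed in as a parameter instead of
      being read from per-CPU data.  The point is only that the variable is
      declared at the top of the function, so the case label is followed by
      a statement and not by a declaration.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

enum sketch_clock { SKETCH_CLOCK_MONOTONIC, SKETCH_CLOCK_OTHER };

/* Returns the clock resolution in nanoseconds, or -1 if unsupported. */
static int64_t sketch_clock_res_ns(enum sketch_clock which, int64_t itc_freq_hz)
{
	int64_t tick_ns; /* declared here, not directly after a case label */

	switch (which) {
	case SKETCH_CLOCK_MONOTONIC:
		/* An assignment is a statement, so the label is valid. */
		tick_ns = SKETCH_DIV_ROUND_UP(SKETCH_NSEC_PER_SEC, itc_freq_hz);
		return tick_ns;
	default:
		return -1;
	}
}
```

      Writing "int64_t tick_ns = ...;" directly after the case label is what
      the compiler in the report above rejects; hoisting the declaration
      avoids the problem without changing behaviour.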
      
      Link: https://lkml.kernel.org/r/20230117151632.393836-1-james.morse@arm.com
      Fixes: aa06a9bd ("ia64: fix clock_getres(CLOCK_MONOTONIC) to report ITC frequency")
      Signed-off-by: James Morse <james.morse@arm.com>
      Reviewed-by: Sergei Trofimovich <slyich@gmail.com>
      Cc: Émeric Maschino <emeric.maschino@gmail.com>
      Cc: matoro <matoro_mailinglist_kernel@matoro.tk>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6f28a261
    • mm: multi-gen LRU: fix crash during cgroup migration · de08eaa6
      Yu Zhao authored
      lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself.  This
      isn't true for the following scenario:
      
          CPU 1                         CPU 2
      
        clone()
          cgroup_can_fork()
                                      cgroup_procs_write()
          cgroup_post_fork()
                                        task_lock()
                                        lru_gen_migrate_mm()
                                        task_unlock()
          task_lock()
          lru_gen_add_mm()
          task_unlock()
      
      And when the above happens, kernel crashes because of linked list
      corruption (mm_struct->lru_gen.list).
      
      Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/
      Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: msizanoen <msizanoen@qtmlabs.xyz>
      Tested-by: msizanoen <msizanoen@qtmlabs.xyz>
      Cc: <stable@vger.kernel.org>	[6.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      de08eaa6
    • Revert "mm: add nodes= arg to memory.reclaim" · 55ab834a
      Michal Hocko authored
      This reverts commit 12a5d395.
      
      Although it is recognized that a finer grained pro-active reclaim is
      something we need and want the semantic of this implementation is really
      ambiguous.
      
      In a follow-up discussion it became clear that there are two essential
      usecases here.  One is to use memory.reclaim to pro-actively reclaim
      memory, where the expectation is that the requested and reported amount
      of memory is uncharged from the memcg.  The other usecase focuses on
      pro-active demotion, where the memory is merely shuffled around to
      demotion targets while the overall charged memory stays unchanged.
      
      The current implementation considers demoted pages as reclaimed, and that
      breaks both usecases.  [1] tried to address the reporting part, but
      there are more issues with that, summarized in [2] and follow-up emails.
      
      Let's revert the nodemask-based extension of the memcg pro-active
      reclaim for now, until we settle on more robust semantics.
      
      [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com
      [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
      
      Link: https://lkml.kernel.org/r/Y5xASNe1x8cusiTx@dhcp22.suse.cz
      Fixes: 12a5d395 ("mm: add nodes= arg to memory.reclaim")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: zefan li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55ab834a
    • Nhat Pham's avatar
      zsmalloc: fix a race with deferred_handles storing · 85b32581
      Nhat Pham authored
      Currently, there is a race between zs_free() and zs_reclaim_page():
      zs_reclaim_page() finds a handle to an allocated object, but before the
      eviction happens, an independent zs_free() call to the same handle could
      come in and overwrite the object value stored at the handle with the last
      deferred handle.  When zs_reclaim_page() finally gets to call the eviction
      handler, it will see an invalid object value (i.e., the previous deferred
      handle instead of the original object value).
      
      This race happens quite infrequently.  We only managed to produce it with
      out-of-tree developmental code that triggers zsmalloc writeback with a
      much higher frequency than usual.
      
      This patch fixes this race by storing the deferred handle in the object
      header instead.  We differentiate the deferred handle from the other two
      cases (handle for allocated object, and linkage for free object) with a
      new tag.  If zspage reclamation succeeds, we will free these deferred
      handles by walking through the zspage objects.  On the other hand, if
      zspage reclamation fails, we reconstruct the zspage freelist (with the
      deferred handle tag and allocated tag) before trying again with the
      reclamation.
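
      As a rough illustration of how a tag can separate the three header
      cases, the sketch below packs a state into the low bits of an aligned
      word.  The names (TAG_BITS, enum obj_state, header_pack() and friends)
      and the specific tag values are assumptions made for this example, not
      the identifiers or encoding used in zsmalloc.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the two low bits of the header word carry the state. */
#define TAG_BITS 2u
#define TAG_MASK ((uintptr_t)((1u << TAG_BITS) - 1))

enum obj_state {
    OBJ_FREE      = 0,  /* header holds freelist linkage */
    OBJ_ALLOCATED = 1,  /* header holds the handle of a live object */
    OBJ_DEFERRED  = 2   /* header holds a handle whose free was deferred */
};

/* Combine a value whose low TAG_BITS are clear with a state tag. */
static inline uintptr_t header_pack(uintptr_t value, enum obj_state state)
{
    assert((value & TAG_MASK) == 0);
    return value | (uintptr_t)state;
}

static inline enum obj_state header_state(uintptr_t header)
{
    return (enum obj_state)(header & TAG_MASK);
}

static inline uintptr_t header_value(uintptr_t header)
{
    return header & ~TAG_MASK;
}
```

      Because the deferred handle lives in the object header with its own tag,
      a walker over the objects can tell it apart from a live handle without
      touching the memory that the handle itself points to.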
      
      [arnd@arndb.de: avoid unused-function warning]
        Link: https://lkml.kernel.org/r/20230117170507.2651972-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20230110231701.326724-1-nphamcs@gmail.com
      Fixes: 9997bc01 ("zsmalloc: implement writeback mechanism for zsmalloc")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      85b32581
    • Jann Horn's avatar
      mm/khugepaged: fix ->anon_vma race · 023f47a8
      Jann Horn authored
      If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
      it to be locked.
      
      Page table traversal is allowed under any one of the mmap lock, the
      anon_vma lock (if the VMA is associated with an anon_vma), and the
      mapping lock (if the VMA is associated with a mapping); and so to be
      able to remove page tables, we must hold all three of them. 
      retract_page_tables() bails out if an ->anon_vma is attached, but does
      this check before holding the mmap lock (as the comment above the check
      explains).
      
      If we racily merged an existing ->anon_vma (shared with a child
      process) from a neighboring VMA, subsequent rmap traversals on pages
      belonging to the child will be able to see the page tables that we are
      concurrently removing while assuming that nothing else can access them.
      
      Repeat the ->anon_vma check once we hold the mmap lock to ensure that
      there really is no concurrent page table access.
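
      The check-then-recheck pattern can be sketched in user space as follows.
      Everything here is hypothetical: struct toy_vma, toy_try_retract() and
      the race_hook callback (which stands in for another thread running
      between the early check and the lock acquisition) are illustration
      only, and a plain flag replaces the real mmap lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_vma {
    void *anon_vma;     /* non-NULL once an anon_vma is attached */
    bool mmap_locked;   /* stands in for holding the mmap lock */
};

/* Simulates another thread acting in the unlocked window. */
typedef void (*race_hook)(struct toy_vma *);

/* Returns true only if it is still safe to remove page tables. */
static inline bool toy_try_retract(struct toy_vma *vma, race_hook hook)
{
    bool ok;

    if (vma->anon_vma)          /* early, unlocked check: a cheap bail-out only */
        return false;

    if (hook)
        hook(vma);              /* window in which the state may change */

    vma->mmap_locked = true;
    ok = (vma->anon_vma == NULL);   /* repeat the check under the lock */
    /* page tables would be removed here, and only if ok is true */
    vma->mmap_locked = false;
    return ok;
}

/* Example hook: attach an anon_vma while the lock is not yet held. */
static inline void attach_anon_vma(struct toy_vma *vma)
{
    static int dummy;
    vma->anon_vma = &dummy;
}
```

      With only the early check, the second case in which attach_anon_vma()
      runs inside the window would go unnoticed; the check under the lock is
      what catches it.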
      
      Hitting this bug causes a lockdep warning in collapse_and_free_pmd(),
      in the line "lockdep_assert_held_write(&vma->anon_vma->root->rwsem)". 
      It can also lead to use-after-free access.
      
      Link: https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20230111133351.807024-1-jannh@google.com
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: Jann Horn <jannh@google.com>
      Reported-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      023f47a8