1. 17 Oct, 2024 28 commits
    • maple_tree: add regression test for spanning store bug · e993457d
      Lorenzo Stoakes authored
      Add a regression test to assert that performing a spanning store which
      consumes the entirety of the rightmost right leaf node does not result in
      maple tree corruption.
      
      The test achieves this by building a test tree of 3 levels and establishing
      a store which ultimately results in a spanned store of this nature.
      
      Link: https://lkml.kernel.org/r/30cdc101a700d16e03ba2f9aa5d83f2efa894168.1728314403.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: correct tree corruption on spanning store · bea07fd6
      Lorenzo Stoakes authored
      Patch series "maple_tree: correct tree corruption on spanning store", v3.
      
      There has been a nasty yet subtle maple tree corruption bug that appears
      to have been in existence since the inception of the algorithm.
      
      This bug seems far more likely to happen since commit f8d112a4
      ("mm/mmap: avoid zeroing vma tree in mmap_region()"), which is the point
      at which reports started to be submitted concerning this bug.
      
      We became definitively aware of the bug thanks to the kind efforts of
      Bert Karwatzki, who helped enormously in tracking it down and identifying
      its cause.
      
      The bug arises when an attempt is made to perform a spanning store across
      two leaf nodes, where the right leaf node is the rightmost child of the
      shared parent, AND the store completely consumes the right-most node.
      
      This results in mas_wr_spanning_store() mistakenly duplicating the new and
      existing entries at the maximum pivot within the range, and thus maple
      tree corruption.
      
      The fix patch corrects this by detecting this scenario and disallowing the
      mistaken duplicate copy.
      
      The fix patch commit message goes into great detail as to how this occurs.
      
      This series also includes a test which reliably reproduces the issue, and
      asserts that the fix works correctly.
      
      Bert has kindly tested the fix and confirmed it resolved his issues.  Also
      Mikhail Gavrilov kindly reported what appears to be precisely the same
      bug, which this fix should also resolve.
      
      
      This patch (of 2):
      
      There has been a subtle bug present in the maple tree implementation from
      its inception.
      
      This arises from how stores are performed - when a store occurs, it will
      overwrite overlapping ranges and adjust the tree as necessary to
      accommodate this.
      
      A range may always ultimately span two leaf nodes.  In this instance we
      walk the two leaf nodes, determine which elements are not overwritten to
      the left and to the right of the start and end of the ranges respectively
      and then rebalance the tree to contain these entries and the newly
      inserted one.
      
      This kind of store is dubbed a 'spanning store' and is implemented by
      mas_wr_spanning_store().
      
      In order to reach this stage, mas_store_gfp() invokes
      mas_wr_preallocate(), mas_wr_store_type() and mas_wr_walk() in turn to
      walk the tree and update the object (mas) to traverse to the location
      where the write should be performed, determining its store type.
      
      When a spanning store is required, mas_wr_walk() returns false, stopping at
      the parent node which contains the target range, and mas_wr_store_type()
      marks the mas->store_type as wr_spanning_store to denote this fact.
      
      When we go to perform the store in mas_wr_spanning_store(), we first
      determine the elements AFTER the END of the range we wish to store (that
      is, to the right of the entry to be inserted) - we do this by walking to
      the NEXT pivot in the tree (i.e.  r_mas.last + 1), starting at the node we
      have just determined contains the range over which we intend to write.
      
      We then turn our attention to the entries to the left of the entry we are
      inserting, whose state is represented by l_mas, and copy these into a 'big
      node', which is a special node which contains enough slots to contain two
      leaf nodes' worth of data.
      
      We then copy the entry we wish to store immediately after this - the copy
      and the insertion of the new entry is performed by mas_store_b_node().
      
      After this we copy the elements to the right of the end of the range which
      we are inserting, if we have not exceeded the length of the node (i.e. 
      r_mas.offset <= r_mas.end).
      
      Herein lies the bug - under very specific circumstances, this logic can
      break and corrupt the maple tree.
      
      Consider the following tree:
      
      Height
        0                             Root Node
                                       /      \
                       pivot = 0xffff /        \ pivot = ULONG_MAX
                                     /          \
        1                       A [-----]       ...
                                   /   \
                   pivot = 0x4fff /     \ pivot = 0xffff
                                 /       \
        2 (LEAVES)          B [-----]  [-----] C
                                            ^--- Last pivot 0xffff.
      
      Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
      that all ranges expressed in maple tree code are inclusive):
      
      1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
         determines that this is a spanning store across nodes B and C. The mas
         state is set such that the current node from which we traverse further
         is node A.
      
      2. In mas_wr_spanning_store() we try to find elements to the right of pivot
         0xffff by searching for an index of 0x10000:
      
          - mas_wr_walk_index() invokes mas_wr_walk_descend() and
            mas_wr_node_walk() in turn.
      
              - mas_wr_node_walk() loops over entries in node A until EITHER it
                finds an entry whose pivot equals or exceeds 0x10000 OR it
                reaches the final entry.
      
              - Since no entry has a pivot equal to or exceeding 0x10000, pivot
                0xffff is selected, leading to node C.
      
          - mas_wr_walk_traverse() resets the mas state to traverse node C. We
            loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
            in turn once again.
      
               - Again, we reach the last entry in node C, which has a pivot of
                 0xffff.
      
      3. We then copy the elements to the left of 0x4000 in node B to the big
         node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
         too.
      
      4. We determine whether we have any entries to copy from the right of the
         end of the range by checking r_mas.offset <= r_mas.end.  With r_mas set
         up at the entry at pivot 0xffff, this check passes, and we then
         DUPLICATE the entry at pivot 0xffff.
      
      5. BUG! The maple tree is corrupted with a duplicate entry.
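
      The walk in step 2 can be modelled in isolation.  The following is a toy
      sketch, not the kernel code: toy_node and toy_node_walk() are hypothetical
      names, and a node is reduced to a sorted array of inclusive upper pivots.
      It illustrates why searching node A for index 0x10000 still lands on the
      slot whose pivot is 0xffff.

```c
#include <stddef.h>

/* Toy stand-in for a maple node: only the sorted, inclusive upper pivots. */
struct toy_node {
	const unsigned long *pivots;
	size_t count;		/* number of populated slots, must be >= 1 */
};

/*
 * Return the offset of the first slot whose pivot is >= index, or the
 * final slot when no pivot reaches index (the case described in step 2).
 */
static size_t toy_node_walk(const struct toy_node *node, unsigned long index)
{
	size_t offset;

	for (offset = 0; offset + 1 < node->count; offset++) {
		if (node->pivots[offset] >= index)
			break;
	}
	return offset;
}
```

      With pivots { 0x4fff, 0xffff }, toy_node_walk() returns offset 1 for index
      0x10000, the same slot it returns for any index in [0x5000, 0xffff].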
      
      This requires a very specific set of circumstances - we must be spanning
      the last element in a leaf node, which is the last element in the parent
      node, i.e. performing a spanning store across two leaf nodes with a range
      that ends at that shared pivot.
      
      A potential solution to this problem would simply be to reset the walk
      each time we traverse r_mas, however given the rarity of this situation it
      seems that would be rather inefficient.
      
      Instead, this patch detects if the right hand node is populated, i.e.  has
      anything we need to copy.
      
      We do so by only copying elements from the right of the entry being
      inserted when the maximum value present exceeds the last, rather than
      basing this on offset position.
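
      The difference between the two checks can be sketched as follows.  This is
      a simplified model with hypothetical names (toy_right_state and both
      predicates), assuming the right-hand walk state can be reduced to four
      fields; it is not the patched kernel function.

```c
#include <stdbool.h>

/* Hypothetical summary of the right-hand walk state after the descent. */
struct toy_right_state {
	unsigned char offset;	/* slot the walk stopped at */
	unsigned char end;	/* last populated slot in that node */
	unsigned long max;	/* largest index covered by that node */
	unsigned long last;	/* inclusive end of the range being stored */
};

/* Position-based check: the behaviour described above as buggy. */
static bool toy_copy_right_by_offset(const struct toy_right_state *r)
{
	return r->offset <= r->end;
}

/* Value-based check: copy only if something lies beyond the stored range. */
static bool toy_copy_right_by_max(const struct toy_right_state *r)
{
	return r->max > r->last;
}
```

      For the example tree, offset == end and max == last == 0xffff, so the
      position-based check asks for a copy while the value-based check reports
      that nothing remains to the right.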
      
      The patch also updates some comments and eliminates the unused bool return
      value in mas_wr_walk_index().
      
      The work performed in commit f8d112a4 ("mm/mmap: avoid zeroing vma
      tree in mmap_region()") seems to have made the probability of this event
      much more likely, which is the point at which reports started to be
      submitted concerning this bug.
      
      The motivation for this change arose from Bert Karwatzki's report of
      encountering mm instability after the release of kernel v6.12-rc1 which,
      after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
      options, was identified as maple tree corruption.
      
      After Bert very generously provided his time and ability to reproduce this
      event consistently, I was able to finally identify that the issue
      discussed in this commit message was occurring for him.
      
      Link: https://lkml.kernel.org/r/cover.1728314402.git.lorenzo.stoakes@oracle.com
      Link: https://lkml.kernel.org/r/48b349a2a0f7c76e18772712d0997a5e12ab0a3b.1728314403.git.lorenzo.stoakes@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reported-by: Bert Karwatzki <spasswolf@web.de>
      Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
      Tested-by: Bert Karwatzki <spasswolf@web.de>
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mglru: only clear kswapd_failures if reclaimable · b130ba4a
      Wei Xu authored
      lru_gen_shrink_node() unconditionally clears kswapd_failures, which can
      prevent kswapd from sleeping and cause 100% kswapd cpu usage even when
      kswapd repeatedly fails to make progress in reclaim.
      
      Only clear kswapd_failures in lru_gen_shrink_node() if reclaim makes some
      progress, similar to shrink_node().
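
      The effect of the unconditional clear can be illustrated with a small
      simulation.  This is a sketch under stated assumptions: toy_pgdat,
      TOY_MAX_RETRIES and toy_kswapd_gives_up() are hypothetical, and the model
      only captures "a failure counter that must reach a threshold before
      kswapd backs off".

```c
#include <stdbool.h>

#define TOY_MAX_RETRIES 16	/* assumed back-off threshold for this model */

struct toy_pgdat {
	int kswapd_failures;
};

/*
 * Simulate 'rounds' reclaim attempts in which nothing is reclaimable.
 * Returns true if the failure counter reaches the back-off threshold.
 */
static bool toy_kswapd_gives_up(bool clear_unconditionally, int rounds)
{
	struct toy_pgdat pgdat = { 0 };

	for (int i = 0; i < rounds; i++) {
		unsigned long nr_reclaimed = 0;	/* no progress in this model */

		/* Shrink step: old behaviour always clears the counter. */
		if (clear_unconditionally || nr_reclaimed)
			pgdat.kswapd_failures = 0;

		/* The caller records the failed attempt. */
		if (!nr_reclaimed)
			pgdat.kswapd_failures++;
	}
	return pgdat.kswapd_failures >= TOY_MAX_RETRIES;
}
```

      In this model the counter never exceeds 1 when it is cleared
      unconditionally, so the back-off threshold is never reached no matter how
      many rounds fail.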
      
      I happened to run into this problem in one of my tests recently.  It
      requires a combination of several conditions: the allocator needs to
      allocate just the right number of pages such that it can wake up kswapd
      without itself being OOM killed; there is no memory for kswapd to
      reclaim (my test disables swap and cleans page cache first); no other
      process frees enough memory at the same time.
      
      Link: https://lkml.kernel.org/r/20241014221211.832591-1-weixugc@google.com
      Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists")
      Signed-off-by: Wei Xu <weixugc@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Jan Alexander Steffens <heftig@archlinux.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/swapfile: skip HugeTLB pages for unuse_vma · 7528c4fb
      Liu Shixin authored
      I got a bad pud error and lost a 1GB HugeTLB page when calling swapoff.  The
      problem can be reproduced by the following steps:
      
       1. Allocate an anonymous 1GB HugeTLB page and some other anonymous memory.
       2. Swap out the above anonymous memory.
       3. Run swapoff; a bad pud error appears in the kernel log:
      
        mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7)
      
      Using ftrace, we can tell that pud_clear_bad() is called by
      pud_none_or_clear_bad() in unuse_pud_range().  As a result, the HugeTLB
      page will never be freed because it is lost from the page table.  Fix this
      by skipping HugeTLB pages for unuse_vma.
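
      The shape of such a skip can be sketched with a toy model.  The names
      below (toy_vma, toy_should_unuse(), toy_count_unuse()) are hypothetical
      and the VMA is reduced to two booleans; this is not the kernel change
      itself, only an illustration of filtering HugeTLB mappings out of a
      swapoff-style walk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal stand-in for a VMA; not the kernel structure. */
struct toy_vma {
	bool has_anon;		/* has anonymous memory that may be swapped */
	bool is_hugetlb;	/* backed by HugeTLB pages */
};

/* Decide whether the swapoff-style walk should descend into this VMA. */
static bool toy_should_unuse(const struct toy_vma *vma)
{
	return vma->has_anon && !vma->is_hugetlb;
}

/* Count how many VMAs the walk would visit. */
static size_t toy_count_unuse(const struct toy_vma *vmas, size_t n)
{
	size_t i, visited = 0;

	for (i = 0; i < n; i++) {
		if (toy_should_unuse(&vmas[i]))
			visited++;
	}
	return visited;
}
```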
      
      Link: https://lkml.kernel.org/r/20241015014521.570237-1-liushixin2@huawei.com
      Fixes: 0fe6e20b ("hugetlb, rmap: add reverse mapping for hugepage")
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: fix the incorrect usage() info of khugepaged · 3e822bed
      Nanyong Sun authored
      The tmpfs mount option should be huge=advise, not madvise, which is not
      supported and may mislead users.
      
      Link: https://lkml.kernel.org/r/20241015020257.139235-1-sunnanyong@huawei.com
      Fixes: 1b03d0d5 ("selftests/vm: add thp collapse file and tmpfs testing")
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: add Jann as memory mapping/VMA reviewer · cb2bb9c5
      Jann Horn authored
      Add myself as a reviewer for memory mapping / VMA code.  I will probably
      only reply to patches sporadically, but hopefully this will help me keep
      up with changes that look interesting security-wise.
      
      Link: https://lkml.kernel.org/r/20241014-maintainers-mmap-reviewer-v1-1-50dce0514752@google.com
      Signed-off-by: Jann Horn <jannh@google.com>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: prevent possible data-race in __try_to_reclaim_swap · 818f916e
      Jeongjun Park authored
      syzbot uploaded a report [1].
      
      Since commit 862590ac ("mm: swap: allow cache reclaim to skip slot
      cache"), the __try_to_reclaim_swap() function reads offset and
      folio->entry from the folio without folio_lock protection.
      
      In the currently reported KCSAN log, the actual data-race is assumed not
      to occur, because the call trace that does the WRITE obtains the
      folio_lock before writing.
      
      However, __try_to_reclaim_swap() was previously implemented to perform
      these reads under folio_lock protection, and there is a risk of a
      data-race occurring through a function other than the one shown in the
      KCSAN log.
      
      Therefore, I think it is appropriate to perform the read operations on
      the folio under folio_lock.
      
      [1]
      
      ==================================================================
      BUG: KCSAN: data-race in __delete_from_swap_cache / __try_to_reclaim_swap
      
      write to 0xffffea0004c90328 of 8 bytes by task 5186 on cpu 0:
       __delete_from_swap_cache+0x1f0/0x290 mm/swap_state.c:163
       delete_from_swap_cache+0x72/0xe0 mm/swap_state.c:243
       folio_free_swap+0x1d8/0x1f0 mm/swapfile.c:1850
       free_swap_cache mm/swap_state.c:293 [inline]
       free_pages_and_swap_cache+0x1fc/0x410 mm/swap_state.c:325
       __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
       tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
       tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
       tlb_flush_mmu+0x2cf/0x440 mm/mmu_gather.c:373
       zap_pte_range mm/memory.c:1700 [inline]
       zap_pmd_range mm/memory.c:1739 [inline]
       zap_pud_range mm/memory.c:1768 [inline]
       zap_p4d_range mm/memory.c:1789 [inline]
       unmap_page_range+0x1f3c/0x22d0 mm/memory.c:1810
       unmap_single_vma+0x142/0x1d0 mm/memory.c:1856
       unmap_vmas+0x18d/0x2b0 mm/memory.c:1900
       exit_mmap+0x18a/0x690 mm/mmap.c:1864
       __mmput+0x28/0x1b0 kernel/fork.c:1347
       mmput+0x4c/0x60 kernel/fork.c:1369
       exit_mm+0xe4/0x190 kernel/exit.c:571
       do_exit+0x55e/0x17f0 kernel/exit.c:926
       do_group_exit+0x102/0x150 kernel/exit.c:1088
       get_signal+0xf2a/0x1070 kernel/signal.c:2917
       arch_do_signal_or_restart+0x95/0x4b0 arch/x86/kernel/signal.c:337
       exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
       exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
       __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
       syscall_exit_to_user_mode+0x59/0x130 kernel/entry/common.c:218
       do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      read to 0xffffea0004c90328 of 8 bytes by task 5189 on cpu 1:
       __try_to_reclaim_swap+0x9d/0x510 mm/swapfile.c:198
       free_swap_and_cache_nr+0x45d/0x8a0 mm/swapfile.c:1915
       zap_pte_range mm/memory.c:1656 [inline]
       zap_pmd_range mm/memory.c:1739 [inline]
       zap_pud_range mm/memory.c:1768 [inline]
       zap_p4d_range mm/memory.c:1789 [inline]
       unmap_page_range+0xcf8/0x22d0 mm/memory.c:1810
       unmap_single_vma+0x142/0x1d0 mm/memory.c:1856
       unmap_vmas+0x18d/0x2b0 mm/memory.c:1900
       exit_mmap+0x18a/0x690 mm/mmap.c:1864
       __mmput+0x28/0x1b0 kernel/fork.c:1347
       mmput+0x4c/0x60 kernel/fork.c:1369
       exit_mm+0xe4/0x190 kernel/exit.c:571
       do_exit+0x55e/0x17f0 kernel/exit.c:926
       __do_sys_exit kernel/exit.c:1055 [inline]
       __se_sys_exit kernel/exit.c:1053 [inline]
       __x64_sys_exit+0x1f/0x20 kernel/exit.c:1053
       x64_sys_call+0x2d46/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:61
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      value changed: 0x0000000000000242 -> 0x0000000000000000
      
      Link: https://lkml.kernel.org/r/20241007070623.23340-1-aha310510@gmail.com
      Reported-by: syzbot+fa43f1b63e3aa6f66329@syzkaller.appspotmail.com
      Fixes: 862590ac ("mm: swap: allow cache reclaim to skip slot cache")
      Signed-off-by: Jeongjun Park <aha310510@gmail.com>
      Acked-by: Chris Li <chrisl@kernel.org>
      Reviewed-by: Kairui Song <kasong@tencent.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: khugepaged: fix the incorrect statistics when collapsing large file folios · d60fcaf0
      Baolin Wang authored
      Khugepaged already supports collapsing file large folios (including shmem
      mTHP) by commit 7de856ff ("mm: khugepaged: support shmem mTHP
      collapse"), and the control parameters in khugepaged:
      'khugepaged_max_ptes_swap' and 'khugepaged_max_ptes_none', still compare
      based on PTE granularity to determine whether a file collapse is needed. 
      However, the statistics for 'present' and 'swap' in
      hpage_collapse_scan_file() do not take into account the large folios,
      which may lead to incorrect judgments regarding the
      khugepaged_max_ptes_swap/none parameters, resulting in unnecessary file
      collapses.
      
      To fix this issue, take into account the large folios' statistics for
      'present' and 'swap' variables in the hpage_collapse_scan_file().
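
      The accounting difference can be sketched with a toy counter.  This is an
      illustration only: toy_count_present() is a hypothetical helper, and it
      assumes each scanned folio of a given order covers 1 << order PTE-sized
      slots.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Sum the 'present' statistic over a list of folio orders.
 * per_page == false: each folio counts once (ignores large folios).
 * per_page == true:  each folio counts for every page it covers.
 */
static unsigned long toy_count_present(const unsigned int *orders, size_t n,
				       bool per_page)
{
	unsigned long present = 0;
	size_t i;

	for (i = 0; i < n; i++)
		present += per_page ? (1UL << orders[i]) : 1UL;
	return present;
}
```

      For folios of order 0, 4 and 9, counting per folio yields 3 while
      counting per page yields 529, which is the value comparable against
      PTE-granularity limits.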
      
      Link: https://lkml.kernel.org/r/c76305d96d12d030a1a346b50503d148364246d2.1728901391.git.baolin.wang@linux.alibaba.com
      Fixes: 7de856ff ("mm: khugepaged: support shmem mTHP collapse")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: kasan, kcov: add bugzilla links · 22ff9b0f
      Andrey Konovalov authored
      Add links to the Bugzilla component that's used to track KASAN and KCOV
      issues.
      
      Link: https://lkml.kernel.org/r/20241012225524.117871-1-andrey.konovalov@linux.dev
      Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: don't install PMD mappings when THPs are disabled by the hw/process/vma · 2b0f9223
      David Hildenbrand authored
      We (or rather, readahead logic :) ) might be allocating a THP in the
      pagecache and then try mapping it into a process that explicitly disabled
      THP: we might end up installing PMD mappings.
      
      This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
      THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
      starting the VM.
      
      For example, starting a VM backed by a file system with large folio
      support makes the VM crash when the VM tries accessing such a mapping
      using KVM.
      
      Is it also a problem when the HW disabled THP using
      TRANSPARENT_HUGEPAGE_UNSUPPORTED?  At least on x86 this would be the case
      without X86_FEATURE_PSE.
      
      In the future, we might be able to do better on s390x and only disallow
      PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
      really wants.  For now, fix it by essentially performing the same check as
      would be done in __thp_vma_allowable_orders() or in shmem code, where this
      works as expected, and disallow PMD mappings, making us fall back to PTE
      mappings.
      
      Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com
      Fixes: 793917d9 ("mm/readahead: Add large folio readahead")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Leo Fu <bfu@redhat.com>
      Tested-by: Thomas Huth <thuth@redhat.com>
      Cc: Thomas Huth <thuth@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw() · 963756aa
      Kefeng Wang authored
      Patch series "mm: don't install PMD mappings when THPs are disabled by the
      hw/process/vma".
      
      During testing, it was found that we can get PMD mappings in processes
      where THP (and more precisely, PMD mappings) are supposed to be disabled. 
      While it works as expected for anon+shmem, the pagecache is the
      problematic bit.
      
      For s390 KVM this currently means that a VM backed by a file located on a
      filesystem with large folio support can crash when KVM tries accessing the
      problematic page, because the readahead logic might decide to use a
      PMD-sized THP and faulting it into the page tables will install a PMD
      mapping, something that s390 KVM cannot tolerate.
      
      This might also be a problem with HW that does not support PMD mappings,
      but I did not try reproducing it.
      
      Fix it by respecting the ways to disable THPs when deciding whether we can
      install a PMD mapping.  khugepaged should already be taking care of not
      collapsing if THPs are effectively disabled for the hw/process/vma.
      
      
      This patch (of 2):
      
      Add vma_thp_disabled() and thp_disabled_by_hw() helpers to be shared by
      shmem_allowable_huge_orders() and __thp_vma_allowable_orders().
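
      How two such helpers could gate a PMD mapping decision is sketched below.
      This is a toy model, not the kernel helpers: every name is prefixed with
      toy_ or TOY_ and is hypothetical, and the three ways of disabling THP
      (hw, process, vma) are reduced to a global flag, a boolean and a flag
      bit.

```c
#include <stdbool.h>

#define TOY_VM_NOHUGEPAGE 0x1UL	/* hypothetical per-VMA "no THP" flag bit */

struct toy_thp_vma {
	unsigned long vm_flags;
	bool process_disabled_thp;	/* hypothetical per-process opt-out */
};

/* Hypothetical global: set when the hardware cannot support THP. */
static bool toy_hw_thp_unsupported;

/* THP disabled for this VMA, either by the VMA itself or by its process. */
static bool toy_vma_thp_disabled(const struct toy_thp_vma *vma)
{
	return (vma->vm_flags & TOY_VM_NOHUGEPAGE) || vma->process_disabled_thp;
}

/* THP disabled for the whole system by the hardware. */
static bool toy_thp_disabled_by_hw(void)
{
	return toy_hw_thp_unsupported;
}

/* A PMD mapping is only allowed when no level has disabled THP. */
static bool toy_may_install_pmd(const struct toy_thp_vma *vma)
{
	return !toy_thp_disabled_by_hw() && !toy_vma_thp_disabled(vma);
}
```

      When toy_may_install_pmd() returns false, a caller in this model would
      fall back to PTE mappings.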
      
      [david@redhat.com: rename to vma_thp_disabled(), split out thp_disabled_by_hw() ]
      Link: https://lkml.kernel.org/r/20241011102445.934409-2-david@redhat.com
      Fixes: 793917d9 ("mm/readahead: Add large folio readahead")
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Leo Fu <bfu@redhat.com>
      Tested-by: Thomas Huth <thuth@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Boqiao Fu <bfu@redhat.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/damon/maintainer-profile: update deprecated awslabs GitHub URLs · f4050cca
      SeongJae Park authored
      DAMON GitHub repos have moved from awslabs GitHub org to damonitor org[1].
      Following the change, URLs on documents are also updated[2].  However,
      commit 2e9b3d6e ("Docs/damon/maintainer-profile: add links in place"),
      which was added just after the update, was using the deprecated GitHub
      URLs.  Update those to use damonitor GitHub URLs instead.
      
      [1] https://lore.kernel.org/20240813232158.83903-1-sj@kernel.org
      [2] https://lore.kernel.org/20240826015741.80707-2-sj@kernel.org
      
      Link: https://lkml.kernel.org/r/20241011170154.70651-3-sj@kernel.org
      Fixes: 2e9b3d6e ("Docs/damon/maintainer-profile: add links in place")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/damon/maintainer-profile: add missing '_' suffixes for external web links · 46e10f64
      SeongJae Park authored
      Patch series "Docs/damon/maintainer-profile: a couple of minor hotfixes".
      
      DAMON maintainer-profile.rst file patches[1] that were merged into the
      v6.12-rc1 have a couple of minor mistakes.  Fix those.
      
      [1] https://lore.kernel.org/20240826015741.80707-1-sj@kernel.org
      
      
      This patch (of 2):
      
      Links to external web pages on DAMON's maintainer-profile.rst are missing
      '_' suffixes.  As a result, the rendered document shows only verbose URLs
      that cannot be clicked.  Fix those.
      
      Also, update the link texts for git trees to contain the names of the
      trees, for better readability and to avoid the Sphinx warning below.
      
          maintainer-profile.rst:4: WARNING: Duplicate explicit target name: "tree".
      
      Link: https://lkml.kernel.org/r/20241011170154.70651-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20241011170154.70651-2-sj@kernel.org
      Fixes: 2e9b3d6e ("Docs/damon/maintainer-profile: add links in place")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • maple_tree: check for MA_STATE_BULK on setting wr_rebalance · a6e0ceb7
      Sidhartha Kumar authored
      It is possible for a bulk operation (MA_STATE_BULK is set) to enter the
      new_end < mt_min_slots[type] case and set wr_rebalance as a store type. 
      This is incorrect as bulk stores do not rebalance per write, but rather
      after all of the writes are done, through the mas_bulk_rebalance()
      path.  Therefore, add a check to make sure MA_STATE_BULK is not set before
      we return wr_rebalance as the store type.
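
      The added guard can be sketched as follows.  This is a simplified model
      with hypothetical names: the enum has only two values, TOY_WR_OTHER
      stands in for whichever store type would be chosen otherwise, and
      TOY_MA_STATE_BULK is an assumed flag bit rather than the kernel
      definition.

```c
#define TOY_MA_STATE_BULK 0x1u	/* assumed flag bit for this model */

enum toy_store_type {
	TOY_WR_OTHER,		/* placeholder for any non-rebalance type */
	TOY_WR_REBALANCE,
};

/*
 * Pick a store type for a node that would end up with new_end entries.
 * Rebalancing is only selected when the node would be underfull AND the
 * state is not in bulk mode.
 */
static enum toy_store_type toy_pick_store_type(unsigned int new_end,
					       unsigned int min_slots,
					       unsigned int flags)
{
	if (new_end < min_slots && !(flags & TOY_MA_STATE_BULK))
		return TOY_WR_REBALANCE;
	return TOY_WR_OTHER;
}
```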
      
      Also add a test to make sure wr_rebalance is never the store type when
      doing bulk operations via mas_expected_entries().
      
      This is a hotfix for this rc; however, it has no userspace effects as
      there are no users of the bulk insertion mode.
      
      Link: https://lkml.kernel.org/r/20241011214451.7286-1-sidhartha.kumar@oracle.com
      Fixes: 5d659bbb ("maple_tree: introduce mas_wr_store_type()")
      Suggested-by: Liam Howlett <liam.howlett@oracle.com>
      Signed-off-by: Sidhartha <sidhartha.kumar@oracle.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point · 37f0b47c
      Yang Shi authored
      The "addr" and "is_shmem" arguments have different order in TP_PROTO and
      TP_ARGS.  This resulted in the incorrect trace result:
      
      text-hugepage-644429 [276] 392092.878683: mm_khugepaged_collapse_file:
      mm=0xffff20025d52c440, hpage_pfn=0x200678c00, index=512, addr=1, is_shmem=0,
      filename=text-hugepage, nr=512, result=failed
      
      The value of "addr" is wrong because it was treated as a bool value, the
      type of is_shmem.
      
      Fix the order in TP_PROTO so that "addr" comes before "is_shmem", since the
      original patch review suggested this order to achieve the best packing.
      
      Also use "lx" for "addr" instead of "ld" in TP_printk because addresses
      are typically shown in hex.
      
      After the fix, the trace result looks correct:
      
      text-hugepage-7291  [004]   128.627251: mm_khugepaged_collapse_file:
      mm=0xffff0001328f9500, hpage_pfn=0x20016ea00, index=512, addr=0x400000,
      is_shmem=0, filename=text-hugepage, nr=512, result=failed
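
      The effect of the mismatched order can be reproduced in plain C.  This is
      a toy model with hypothetical names, assuming the generated trace wrapper
      and the probe both take their parameters in TP_PROTO order while the
      wrapper forwards them in TP_ARGS order; the real macros are not used
      here.

```c
#include <stdbool.h>

struct toy_record {
	unsigned long addr;
	bool is_shmem;
};

/* Probe with the buggy prototype order: is_shmem before addr. */
static struct toy_record toy_probe_buggy(bool is_shmem, unsigned long addr)
{
	return (struct toy_record){ .addr = addr, .is_shmem = is_shmem };
}

/* Wrapper with the same prototype, forwarding as (addr, is_shmem). */
static struct toy_record toy_trace_buggy(bool is_shmem, unsigned long addr)
{
	return toy_probe_buggy(addr, is_shmem);
}

/* Fixed variant: prototype order matches the forwarding order. */
static struct toy_record toy_probe_fixed(unsigned long addr, bool is_shmem)
{
	return (struct toy_record){ .addr = addr, .is_shmem = is_shmem };
}

static struct toy_record toy_trace_fixed(unsigned long addr, bool is_shmem)
{
	return toy_probe_fixed(addr, is_shmem);
}
```

      A caller passing (0x400000, false) to toy_trace_buggy() ends up with
      addr == 1 and is_shmem == false, because the address is squeezed through
      a bool on the way; toy_trace_fixed() preserves both values.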
      
      Link: https://lkml.kernel.org/r/20241012011702.1084846-1-yang@os.amperecomputing.com
      Fixes: 4c9473e8 ("mm/khugepaged: add tracepoint to collapse_file()")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Cc: Gautam Menghani <gautammenghani201@gmail.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>    [6.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      37f0b47c
    • Jinjie Ruan's avatar
      mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets() · 2d6a1c83
      Jinjie Ruan authored
      The sysfs_target->regions allocated in damon_sysfs_regions_alloc() is not
      freed in damon_sysfs_test_add_targets(), which causes the following memory
      leak.  Free it to fix the leak.
      
      	unreferenced object 0xffffff80c2a8db80 (size 96):
      	  comm "kunit_try_catch", pid 187, jiffies 4294894363
      	  hex dump (first 32 bytes):
      	    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      	    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      	  backtrace (crc 0):
      	    [<0000000001e3714d>] kmemleak_alloc+0x34/0x40
      	    [<000000008e6835c1>] __kmalloc_cache_noprof+0x26c/0x2f4
      	    [<000000001286d9f8>] damon_sysfs_test_add_targets+0x1cc/0x738
      	    [<0000000032ef8f77>] kunit_try_run_case+0x13c/0x3ac
      	    [<00000000f3edea23>] kunit_generic_run_threadfn_adapter+0x80/0xec
      	    [<00000000adf936cf>] kthread+0x2e8/0x374
      	    [<0000000041bb1628>] ret_from_fork+0x10/0x20
      
      Link: https://lkml.kernel.org/r/20241010125323.3127187-1-ruanjinjie@huawei.com
      Fixes: b8ee5575 ("mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2d6a1c83
    • Andy Shevchenko's avatar
      mm: remove unused stub for can_swapin_thp() · a5e8eb25
      Andy Shevchenko authored
      When can_swapin_thp() is unused, it prevents kernel builds with clang,
      `make W=1` and CONFIG_WERROR=y:
      
      mm/memory.c:4184:20: error: unused function 'can_swapin_thp' [-Werror,-Wunused-function]
      
      Fix this by removing the unused stub.
      
      See also commit 6863f564 ("kbuild: allow Clang to find unused static
      inline functions for W=1 build").
      
      Link: https://lkml.kernel.org/r/20241008191329.2332346-1-andriy.shevchenko@linux.intel.com
      Fixes: 242d12c9 ("mm: support large folios swap-in for sync io devices")
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Cc: Bill Wendling <morbo@google.com>
      Cc: Chuanhua Han <hanchuanhua@oppo.com>
      Cc: Justin Stitt <justinstitt@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a5e8eb25
    • Andy Chiu's avatar
      mailmap: add an entry for Andy Chiu · 3f4e74cb
      Andy Chiu authored
      Map my outdated addresses within mailmap.
      
      Link: https://lkml.kernel.org/r/20241009144934.43027-1-andybnac@gmail.com
      Signed-off-by: Andy Chiu <andybnac@gmail.com>
      Cc: Greentime Hu <greentime.hu@sifive.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Leon Chien <leonchien@synology.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3f4e74cb
    • Lorenzo Stoakes's avatar
      MAINTAINERS: add memory mapping/VMA co-maintainers · f8dc524e
      Lorenzo Stoakes authored
      Add myself and Liam as co-maintainers of the memory mapping and VMA code
      alongside Andrew as we are heavily involved in its implementation and
      maintenance.
      
      Link: https://lkml.kernel.org/r/20241009201032.6130-1-lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8dc524e
    • Brahmajit Das's avatar
      fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization · 5778ace0
      Brahmajit Das authored
      show_smap_vma_flags() has been using a misspelled initializer in
      mnemonics[]: it needed to initialize a 2-element array of char, but it used
      NUL-padded 2-character string literals (i.e.  3-element initializers).
      
      This has been spotted by gcc-15[*]; prior to that, gcc quietly dropped the
      3rd element of the initializers.  To fix this we are increasing the size of
      mnemonics[] (from mnemonics[BITS_PER_LONG][2] to
      mnemonics[BITS_PER_LONG][3]) to accommodate the NUL-padded string literals.
      
      This also helps us simplify the logic for printing the flags: instead of
      printing each character from mnemonics[], we can just print the
      mnemonics[] entry using seq_printf.
      
      [*]: fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
        917 |                 [0 ... (BITS_PER_LONG-1)] = "??",
            |                                                 ^~~~
      fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
      fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
      fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
      fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
      fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
      ...
      
      
      Stephen pointed out:
      
      : The C standard explicitly allows for a string initializer to be too long
      : due to the NUL byte at the end ...  so this warning may be overzealous.
      
      but let's make the warning go away anyway.
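      
      The effect of the size change can be sketched in userspace C as follows.
      This is only an illustration under stated assumptions: NBITS, the
      four-row table and format_flags() are hypothetical, and snprintf() stands
      in for seq_printf().

```c
#include <stddef.h>
#include <stdio.h>

#define NBITS 4	/* hypothetical; the commit's table has BITS_PER_LONG rows */

/*
 * Each row holds two mnemonic characters plus the terminating NUL, so
 * every row is a valid C string and can be printed with "%s".  With rows
 * of size 2 the literal "rd" would fill the row and leave no room for
 * the NUL, which is the situation gcc-15 diagnoses.
 */
const char mnemonics[NBITS][3] = {
	[0] = "rd",
	[1] = "wr",
	[2] = "ex",
	[3] = "??",	/* placeholder for a bit without a mnemonic */
};

/*
 * Append "xx " to buf for every set bit in flags.  Returns the number of
 * characters written, not counting the terminating NUL.
 */
size_t format_flags(unsigned long flags, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (size_t i = 0; i < NBITS; i++) {
		int n;

		if (!(flags & (1UL << i)))
			continue;
		n = snprintf(buf + used, len - used, "%s ", mnemonics[i]);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop a partial entry */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

      Because every row is NUL-terminated, the loop can hand the whole row to
      a "%s" conversion instead of emitting the two characters one by one.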
      
      Link: https://lkml.kernel.org/r/20241005063700.2241027-1-brahmajit.xyz@gmail.com
      Link: https://lkml.kernel.org/r/20241003093040.47c08382@canb.auug.org.au
      Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5778ace0
    • Florian Westphal's avatar
      lib: alloc_tag_module_unload must wait for pending kfree_rcu calls · dc783ba4
      Florian Westphal authored
      Ben Greear reports the following splat:
       ------------[ cut here ]------------
       net/netfilter/nf_nat_core.c:1114 module nf_nat func:nf_nat_register_fn has 256 allocated at module unload
       WARNING: CPU: 1 PID: 10421 at lib/alloc_tag.c:168 alloc_tag_module_unload+0x22b/0x3f0
       Modules linked in: nf_nat(-) btrfs ufs qnx4 hfsplus hfs minix vfat msdos fat
      ...
       Hardware name: Default string Default string/SKYBAY, BIOS 5.12 08/04/2020
       RIP: 0010:alloc_tag_module_unload+0x22b/0x3f0
        codetag_unload_module+0x19b/0x2a0
        ? codetag_load_module+0x80/0x80
      
      nf_nat module exit calls kfree_rcu on those addresses, but the free
      operation is likely still pending by the time alloc_tag checks for leaks.
      
      Waiting for outstanding kfree_rcu operations to complete before checking
      resolves this warning.
      
      Reproducer:
      unshare -n iptables-nft -t nat -A PREROUTING -p tcp
      grep nf_nat /proc/allocinfo # will list 4 allocations
      rmmod nft_chain_nat
      rmmod nf_nat                # will WARN.
      
      [akpm@linux-foundation.org: add comment]
      Link: https://lkml.kernel.org/r/20241007205236.11847-1-fw@strlen.de
      Fixes: a4735739 ("lib: code tagging module support")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reported-by: Ben Greear <greearb@candelatech.com>
      Closes: https://lore.kernel.org/netdev/bdaaef9d-4364-4171-b82b-bcfc12e207eb@candelatech.com/
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dc783ba4
    • Jann Horn's avatar
      mm/mremap: fix move_normal_pmd/retract_page_tables race · 6fa1066f
      Jann Horn authored
      In mremap(), move_page_tables() looks at the type of the PMD entry and the
      specified address range to figure out by which method the next chunk of
      page table entries should be moved.
      
      At that point, the mmap_lock is held in write mode, but no rmap locks are
      held yet.  For PMD entries that point to page tables and are fully covered
      by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called,
      which first takes rmap locks, then does move_normal_pmd(). 
      move_normal_pmd() takes the necessary page table locks at source and
      destination, then moves an entire page table from the source to the
      destination.
      
      The problem is: The rmap locks, which protect against concurrent page
      table removal by retract_page_tables() in the THP code, are only taken
      after the PMD entry has been read and it has been decided how to move it. 
      So we can race as follows (with two processes that have mappings of the
      same tmpfs file that is stored on a tmpfs mount with huge=advise); note
      that process A accesses page tables through the MM while process B does it
      through the file rmap:
      
      process A                      process B
      =========                      =========
      mremap
        mremap_to
          move_vma
            move_page_tables
              get_old_pmd
              alloc_new_pmd
                            *** PREEMPT ***
                                     madvise(MADV_COLLAPSE)
                                       do_madvise
                                         madvise_walk_vmas
                                           madvise_vma_behavior
                                             madvise_collapse
                                               hpage_collapse_scan_file
                                                 collapse_file
                                                   retract_page_tables
                                                     i_mmap_lock_read(mapping)
                                                     pmdp_collapse_flush
                                                     i_mmap_unlock_read(mapping)
              move_pgt_entry(NORMAL_PMD, ...)
                take_rmap_locks
                move_normal_pmd
                drop_rmap_locks
      
      When this happens, move_normal_pmd() can end up creating bogus PMD entries
      in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`.  The effect
      depends on arch-specific and machine-specific details; on x86, you can end
      up with physical page 0 mapped as a page table, which is likely
      exploitable for user->kernel privilege escalation.
      
      Fix the race by letting process A (the mremap() side, which takes the rmap
      locks in the diagram above) recheck that the PMD still points to a page
      table after the rmap locks have been taken.  If it no longer does, we bail
      and let the caller fall back to the PTE-level copying path, which will then
      bail immediately at the pmd_none() check.
      
      Bug reachability: Reaching this bug requires that you can create
      shmem/file THP mappings - anonymous THP uses different code that doesn't
      zap stuff under rmap locks.  File THP is gated on an experimental config
      flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need
      shmem THP to hit this bug.  As far as I know, getting shmem THP normally
      requires that you can mount your own tmpfs with the right mount flags,
      which would require creating your own user+mount namespace; though I don't
      know if some distros maybe enable shmem THP by default or something like
      that.
      
      Bug impact: This issue can likely be used for user->kernel privilege
      escalation when it is reachable.
      
      Link: https://lkml.kernel.org/r/20241007-move_normal_pmd-vs-collapse-fix-2-v1-1-5ead9631f2ea@google.com
      Fixes: 1d65b771 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Co-developed-by: David Hildenbrand <david@redhat.com>
      Closes: https://project-zero.issues.chromium.org/371047675
      Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6fa1066f
    • Sebastian Andrzej Siewior's avatar
      mm: percpu: increase PERCPU_DYNAMIC_SIZE_SHIFT on certain builds. · 8f3ce3d9
      Sebastian Andrzej Siewior authored
      Arnd reported a build failure due to the BUILD_BUG_ON() statement in
      alloc_kmem_cache_cpus().  The statement triggers when:
      
        PERCPU_DYNAMIC_EARLY_SIZE < NR_KMALLOC_TYPES * KMALLOC_SHIFT_HIGH * sizeof(struct kmem_cache_cpu)
      
      The factors that increase the right side of the equation:
      - PAGE_SIZE > 4KiB increases KMALLOC_SHIFT_HIGH
      - For the local_lock_t in kmem_cache_cpu:
        - PREEMPT_RT adds an actual lock.
        - LOCKDEP increases the size of the lock.
        - LOCK_STAT adds additional bytes plus padding to the lockdep
          structure.
      
      The net difference with and without PREEMPT_RT is 88 bytes for the
      local_lock_t, and 96 bytes for kmem_cache_cpu due to additional padding.
      This is enough to exceed the 80KiB limit with 16KiB page size - the 8KiB
      page size is fine.
      
      Increase PERCPU_DYNAMIC_SIZE_SHIFT to 13 on configs with PAGE_SIZE larger
      than 4KiB and LOCKDEP enabled.
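      
      The arithmetic behind the check can be sketched as below.  This is a
      hedged illustration, not the kernel's code: the helper names are
      hypothetical, and the sketch assumes the early dynamic area is
      20 << PERCPU_DYNAMIC_SIZE_SHIFT bytes, which matches the 80KiB limit
      quoted above for a shift of 12.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed size of the early dynamic per-CPU area for a given shift. */
size_t percpu_dynamic_early_size(unsigned int shift)
{
	return (size_t)20 << shift;
}

/* Right-hand side of the test: bytes needed by the per-CPU slab data. */
size_t kmem_cache_cpu_need(size_t nr_kmalloc_types,
			   size_t kmalloc_shift_high,
			   size_t cpu_struct_size)
{
	return nr_kmalloc_types * kmalloc_shift_high * cpu_struct_size;
}

/* True when the build-time check would fire for this configuration. */
bool early_area_too_small(unsigned int shift, size_t need)
{
	return percpu_dynamic_early_size(shift) < need;
}
```

      With a shift of 12 the area is 80KiB and with a shift of 13 it is
      160KiB, so any requirement between those two sizes trips the check
      before the change and passes after it.  The real factor values depend
      on the kernel configuration and are not reproduced here.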
      
      Link: https://lkml.kernel.org/r/20241007143049.gyMpEu89@linutronix.de
      Fixes: d8fccd9c ("arm64: Allow to enable PREEMPT_RT.")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202410020326.iaZIteIx-lkp@intel.com/
      Reported-by: Arnd Bergmann <arnd@kernel.org>
      Closes: https://lore.kernel.org/20241004095702.637528-1-arnd@kernel.org
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8f3ce3d9
    • Edward Liaw's avatar
      selftests/mm: fix deadlock for fork after pthread_create on ARM · e142cc87
      Edward Liaw authored
      On Android with arm, there is some synchronization needed to avoid a
      deadlock when forking after pthread_create.
      
      Link: https://lkml.kernel.org/r/20241003211716.371786-3-edliaw@google.com
      Fixes: cff29458 ("selftests/mm: extend and rename uffd pagemap test")
      Signed-off-by: Edward Liaw <edliaw@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e142cc87
    • Edward Liaw's avatar
      selftests/mm: replace atomic_bool with pthread_barrier_t · e61ef21e
      Edward Liaw authored
      Patch series "selftests/mm: fix deadlock after pthread_create".
      
      On Android arm, pthread_create followed by a fork caused a deadlock in the
      case where the fork required work to be completed by the created thread.
      
      Update the synchronization primitive to use pthread_barrier instead of
      atomic_bool.
      
      Apply the same fix to the wp-fork-with-event test.
      
      
      This patch (of 2):
      
      Swap synchronization primitive with pthread_barrier, so that stdatomic.h
      does not need to be included.
      
      The synchronization is needed on Android ARM64; we see a deadlock with
      pthread_create when the parent thread races forward before the child has a
      chance to start doing work.
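      
      A minimal sketch of this kind of synchronization, using only POSIX
      calls, is shown below.  It is not the selftest's code: worker(),
      ready_barrier and fork_after_thread_started() are hypothetical names,
      and the child process does nothing except exit.

```c
#define _POSIX_C_SOURCE 200809L

#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_barrier_t ready_barrier;

/* Worker: reach the barrier to show it is running, then return. */
static void *worker(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&ready_barrier);
	return NULL;
}

/*
 * Start a thread, wait until it is known to be running, then fork.
 * Returns 0 if the child exited cleanly, -1 on any failure.
 */
int fork_after_thread_started(void)
{
	pthread_t thread;
	int status = -1;
	pid_t pid;

	if (pthread_barrier_init(&ready_barrier, NULL, 2))
		return -1;
	if (pthread_create(&thread, NULL, worker, NULL)) {
		pthread_barrier_destroy(&ready_barrier);
		return -1;
	}

	/* Block here until the worker has also reached the barrier. */
	pthread_barrier_wait(&ready_barrier);

	pid = fork();
	if (pid == 0)
		_exit(0);	/* child: nothing to do in this sketch */
	if (pid > 0 && waitpid(pid, &status, 0) != pid)
		status = -1;

	pthread_join(thread, NULL);
	pthread_barrier_destroy(&ready_barrier);

	return (pid > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ?
		0 : -1;
}
```

      The barrier is initialized with a count of 2, so neither the parent
      nor the worker proceeds until both have arrived; the parent therefore
      cannot race ahead to fork() before the worker has started.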
      
      Link: https://lkml.kernel.org/r/20241003211716.371786-1-edliaw@google.com
      Link: https://lkml.kernel.org/r/20241003211716.371786-2-edliaw@google.com
      Fixes: cff29458 ("selftests/mm: extend and rename uffd pagemap test")
      Signed-off-by: Edward Liaw <edliaw@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e61ef21e
    • OGAWA Hirofumi's avatar
      fat: fix uninitialized variable · 963a7f4d
      OGAWA Hirofumi authored
      syzbot produced this with a corrupted fs image.  In theory, however, an I/O
      error would also trigger this.
      
      This affects just an error report, so should not be a serious error.
      
      Link: https://lkml.kernel.org/r/87r08wjsnh.fsf@mail.parknet.co.jp
      Link: https://lkml.kernel.org/r/66ff2c95.050a0220.49194.03e9.GAE@google.com
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Reported-by: syzbot+ef0d7bc412553291aa86@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      963a7f4d
    • Ryusuke Konishi's avatar
      nilfs2: propagate directory read errors from nilfs_find_entry() · 08cfa12a
      Ryusuke Konishi authored
      Syzbot reported that a task hang occurs in vcs_open() during a fuzzing
      test for nilfs2.
      
      The root cause of this problem is that nilfs_find_entry(), which searches
      for directory entries, ignores errors when loading a directory page/folio
      via nilfs_get_folio() fails.
      
      If the filesystem image is corrupted, the i_size of the directory inode is
      large, and the directory page/folio is successfully read but fails the
      sanity check, for example when it is zero-filled, nilfs_check_folio() may
      continue to spit out error messages in bursts.
      
      Fix this issue by propagating the error to the callers when loading a
      page/folio fails in nilfs_find_entry().
      
      The current interface of nilfs_find_entry() and its callers is outdated
      and cannot propagate error codes such as -EIO and -ENOMEM returned via
      nilfs_find_entry(), so fix it together.
      
      Link: https://lkml.kernel.org/r/20241004033640.6841-1-konishi.ryusuke@gmail.com
      Fixes: 2ba466d7 ("nilfs2: directory entry operations")
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: Lizhi Xu <lizhi.xu@windriver.com>
      Closes: https://lkml.kernel.org/r/20240927013806.3577931-1-lizhi.xu@windriver.com
      Reported-by: syzbot+8a192e8d090fa9a31135@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=8a192e8d090fa9a31135
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      08cfa12a
    • Lorenzo Stoakes's avatar
      mm/mmap: correct error handling in mmap_region() · 74874c57
      Lorenzo Stoakes authored
      Commit f8d112a4 ("mm/mmap: avoid zeroing vma tree in mmap_region()")
      changed how error handling is performed in mmap_region().
      
      The error value defaults to -ENOMEM, but then gets reassigned immediately
      to the result of vms_gather_munmap_vmas() if we are performing a MAP_FIXED
      mapping over existing VMAs (and thus unmapping them).
      
      This overwrites the error value, potentially clearing it.
      
      After this, we invoke may_expand_vm() and possibly vm_area_alloc(), and
      check to see if they failed. If they did, then we perform error-handling
      logic, but importantly, we do NOT update the error code.
      
      This means that, if vms_gather_munmap_vmas() succeeds, but one of these
      calls does not, the function will return indicating no error, but rather an
      address value of zero, which is entirely incorrect.
      
      Correct this and avoid future confusion by strictly setting error on each
      and every occasion we jump to the error handling logic, and set the error
      code immediately prior to doing so.
      
      This way we can see at a glance that the error code is always correct.
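      
      The pattern can be sketched in plain C as follows.  This is an
      illustration only, not mmap_region() itself: struct steps and
      map_region_sketch() are hypothetical, the boolean fields merely model
      whether each step succeeds, and 0x1000 is a placeholder address.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical knobs standing in for the steps that may fail. */
struct steps {
	int gather_result;	/* 0 on success, negative errno on failure */
	bool may_expand;	/* false models a failed limit check */
	bool alloc_ok;		/* false models a failed allocation */
};

/* Returns a nonzero placeholder address on success, or a negative errno. */
long map_region_sketch(const struct steps *s)
{
	long error;

	/* The first step reports its own error code. */
	error = s->gather_result;
	if (error)
		goto out;

	if (!s->may_expand) {
		error = -ENOMEM;	/* set immediately before the jump */
		goto out;
	}

	if (!s->alloc_ok) {
		error = -ENOMEM;	/* set immediately before the jump */
		goto out;
	}

	return 0x1000;	/* placeholder for the mapped address */

out:
	return error;
}
```

      Because each jump to the error path is preceded by its own assignment,
      a successful earlier step can no longer leave a stale zero in the error
      variable, and the sketch never returns 0 as if it were an address.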
      
      Many thanks to Vegard Nossum who spotted this issue in discussion around
      this problem.
      
      Link: https://lkml.kernel.org/r/20241002073932.13482-1-lorenzo.stoakes@oracle.com
      Fixes: f8d112a4 ("mm/mmap: avoid zeroing vma tree in mmap_region()")
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Suggested-by: Vegard Nossum <vegard.nossum@oracle.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      74874c57
  2. 13 Oct, 2024 5 commits
  3. 12 Oct, 2024 2 commits
  4. 11 Oct, 2024 5 commits
    • Linus Torvalds's avatar
      Merge tag 'linux_kselftest-fixes-6.12-rc3' of... · 09f6b0c8
      Linus Torvalds authored
      Merge tag 'linux_kselftest-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest fixes from Shuah Khan:
       "Fixes for build, run-time errors, and reporting errors:
      
         - ftrace: regression test for a kernel crash when running function
           graph tracing and then enabling function profiler.
      
         - rseq: fix for mm_cid test failure.
      
         - vDSO:
            - fixes to reporting skip and other error conditions
            - changes to unconditionally build chacha and getrandom tests on all
              architectures to make it easier for them to run in CIs
            - fix for a build error: include sched.h to bring in the
              CLONE_NEWTIME define"
      
      * tag 'linux_kselftest-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        ftrace/selftest: Test combination of function_graph tracer and function profiler
        selftests/rseq: Fix mm_cid test failure
        selftests: vDSO: Explicitly include sched.h
        selftests: vDSO: improve getrandom and chacha error messages
        selftests: vDSO: unconditionally build getrandom test
        selftests: vDSO: unconditionally build chacha test
      09f6b0c8
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux · 974099e4
      Linus Torvalds authored
      Pull devicetree fixes from Rob Herring:
      
       - Disable kunit tests for arm64+ACPI
      
       - Fix refcount issue in kunit tests
      
       - Drop constraints on non-conformant 'interrupt-map' in fsl,ls-extirq
      
       - Drop type ref on 'msi-parent' in fsl,qoriq-mc binding
      
       - Move elgin,jg10309-01 to its own binding from trivial-devices
      
      * tag 'devicetree-fixes-for-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        of: Skip kunit tests when arm64+ACPI doesn't populate root node
        of: Fix unbalanced of node refcount and memory leaks
        dt-bindings: interrupt-controller: fsl,ls-extirq: workaround wrong interrupt-map number
        dt-bindings: misc: fsl,qoriq-mc: remove ref for msi-parent
        dt-bindings: display: elgin,jg10309-01: Add own binding
      974099e4
    • Linus Torvalds's avatar
      Merge tag 'fbdev-for-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev · 9066258d
      Linus Torvalds authored
      Pull fbdev platform driver fix from Helge Deller:
       "Switch fbdev drivers back to struct platform_driver::remove()
      
        Now that 'remove()' has been converted to the sane new API, there's
        no reason for the 'remove_new()' use, so this converts back to the
        traditional and simpler name.
      
        See commits
      
           5c5a7680 ("platform: Provide a remove callback that returns no value")
           0edb555a ("platform: Make platform_driver::remove() return void")
      
        for background to this all"
      
      * tag 'fbdev-for-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
        fbdev: Switch back to struct platform_driver::remove()
      9066258d
    • Linus Torvalds's avatar
      Merge tag 'gpio-fixes-for-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 547fc322
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
      
       - fix clock handle leak in probe() error path in gpio-aspeed
      
       - add a dummy register read to ensure the write actually completed
      
      * tag 'gpio-fixes-for-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        gpio: aspeed: Use devm_clk api to manage clock source
        gpio: aspeed: Add the flush write to ensure the write complete.
      547fc322
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs · 6254d537
      Linus Torvalds authored
      Pull NFS client fixes from Anna Schumaker:
       "Localio Bugfixes:
         - remove duplicated include in localio.c
         - fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
         - fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
         - fix nfsd_file tracepoints to handle NULL rqstp pointers
      
        Other Bugfixes:
         - fix program selection loop in svc_process_common
         - fix integer overflow in decode_rc_list()
         - prevent NULL-pointer dereference in nfs42_complete_copies()
         - fix CB_RECALL performance issues when using a large number of
           delegations"
      
      * tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
        NFS: remove revoked delegation from server's delegation list
        nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp
        nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
        nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
        NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies()
        SUNRPC: Fix integer overflow in decode_rc_list()
        sunrpc: fix prog selection loop in svc_process_common
        nfs: Remove duplicated include in localio.c
      6254d537