1. 02 Sep, 2024 40 commits
    • mm/swap: take folio refcount after testing the LRU flag · 67b9a353
      yangge authored
      Whoever passes a folio to __folio_batch_add_and_move() must hold a
      reference, otherwise something else would already be messed up.  If the
      folio is referenced, it will not be freed elsewhere, so we can safely
      clear the folio's lru flag.  As discussed with David in [1], we should
      take the reference after testing the LRU flag, not before.
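
The ordering described above can be illustrated with a small userspace model. This is only a sketch: `toy_folio`, `toy_test_clear_lru()` and `toy_batch_add()` are hypothetical names built on C11 atomics, not the kernel's folio API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: a reference count plus an LRU flag. */
struct toy_folio {
	atomic_int refcount;
	atomic_bool lru;
};

/* Atomically clear the LRU flag and report whether it was set. */
static bool toy_test_clear_lru(struct toy_folio *folio)
{
	return atomic_exchange(&folio->lru, false);
}

/*
 * The caller already holds a reference, so the folio cannot be freed while
 * the flag is tested.  The extra reference is taken only after the flag has
 * been claimed; on failure the refcount is never touched.
 */
static bool toy_batch_add(struct toy_folio *folio)
{
	if (!toy_test_clear_lru(folio))
		return false;	/* someone else owns the LRU state */

	atomic_fetch_add(&folio->refcount, 1);
	return true;
}
```

With this ordering, the failure path needs no matching "put" because no reference was taken before the test.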
      
      Link: https://lore.kernel.org/lkml/d41865b4-d6fa-49ba-890a-921eefad27dd@redhat.com/ [1]
      Link: https://lkml.kernel.org/r/1723542743-32179-1-git-send-email-yangge1116@126.com
      Signed-off-by: yangge <yangge1116@126.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • filemap: add trace events for get_pages, map_pages, and fault · b6273b55
      Takaya Saeki authored
      To allow precise tracking of page caches accessed, add new tracepoints
      that trigger when a process actually accesses them.
      
      The ureadahead program used by ChromeOS traces the disk access of programs
      as they start up at boot up.  It uses mincore(2) or the
      'mm_filemap_add_to_page_cache' trace event to accomplish this.  It stores
      this information in a "pack" file and on subsequent boots, it will read
      the pack file and call readahead(2) on the information so that disk
      storage can be loaded into RAM before the applications actually need it.
      
      A problem we see is that the kernel's readahead algorithm can
      aggressively pull in more data than needed (to try to accomplish the same
      goal), and this data is also recorded.  The end result is that the pack
      file contains a lot of pages on disk that are never actually used.
      Calling readahead(2) on these unused pages can slow down system boot
      times.
      
      To solve this, add 3 new trace events: get_pages, map_pages, and fault.
      These will be used to trace the pages that are not only pulled in from
      disk, but are actually used by the application.  Only those pages will be
      stored in the pack file, which improves boot performance.
      
      With the combination of these 3 new trace events and
      mm_filemap_add_to_page_cache, we observed a reduction in the pack file by
      7.3% - 20% on ChromeOS varying by device.
      
      Link: https://lkml.kernel.org/r/20240813100312.3930505-1-takayas@chromium.org
      Signed-off-by: Takaya Saeki <takayas@chromium.org>
      Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Junichi Uekawa <uekawa@chromium.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: fix dax pud handlings · cb0f01be
      Peter Xu authored
      This is only relevant to the two archs that support PUD dax, aka, x86_64
      and ppc64.  PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not
      count in this case.
      
      DAX has had PUD mappings for years, but the change-protection path has
      never worked.  When the path is triggered in any form (a simple test
      program would be: call mprotect() on a 1G dev_dax mapping), the kernel
      will report "bad pud".  This patch should fix that.
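
A test program along the lines mentioned above might look like the following sketch. The function name `toy_mprotect_dax()` and the device path passed to it are assumptions, and whether the touch actually installs a PUD mapping depends on how the device-dax node is configured (for example its alignment), which this sketch does not check.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define TOY_PUD_SIZE (1UL << 30)	/* 1G, the PUD mapping size on x86_64 */

/*
 * Map 1G of a device-dax node, touch it so that a mapping is faulted in,
 * then change its protection.  Returns 0 on success, -1 if any step fails.
 */
static int toy_mprotect_dax(const char *path)
{
	int ret = -1;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;

	char *p = mmap(NULL, TOY_PUD_SIZE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p != MAP_FAILED) {
		*(volatile char *)p = 1;	/* fault in the mapping */
		ret = mprotect(p, TOY_PUD_SIZE, PROT_READ) ? -1 : 0;
		munmap(p, TOY_PUD_SIZE);
	}
	close(fd);
	return ret;
}
```

A caller would pass the path of a 1G-aligned device-dax node (something like `/dev/dax0.0`, which is a hypothetical example here) and watch the kernel log while the mprotect() call runs.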
      
      The new change_huge_pud() tries to keep everything simple.  For example,
      it doesn't optimize the write bit, as that would need even more PUD
      helpers.  It's not too bad anyway to have one more write fault in the
      worst case once per 1G range; it may be a bigger thing for each
      PAGE_SIZE, though.  Neither does it support userfault-wp bits, as no
      such PUD mappings are supported; file mappings always need a split there.
      
      The same applies to TLB shootdown: the pmd path (which was for x86 only)
      has the trick of using the _ad() version of pmdp_invalidate*(), which can
      avoid one redundant TLB flush, but let's also leave that for later.
      Again, the larger the mapping, the smaller such effect.
      
      There's some difference in handling "retry" for change_huge_pud() (where
      it can return 0): it isn't like change_huge_pmd(), as the pmd version is
      safe with all conditions handled in change_pte_range() later, thanks to
      Hugh's new pte_offset_map_lock().  In short, change_pte_range() is simply
      smarter.  For that, change_pud_range() will need proper retry if it races
      with something else when a huge PUD changed from under us.
      
      The last thing to mention is currently the PUD path ignores the huge pte
      numa counter (NUMA_HUGE_PTE_UPDATES), not only because DAX is not
      applicable to NUMA, but also that it's ambiguous on its own to decide how
      to account pud in this case.  In one earlier version of this patchset I
      proposed to remove the counter as it doesn't even look right to do the
      accounting as of now [1], but then a further discussion suggests we can
      leave that for later, as that doesn't block this series if we choose to
      ignore that counter.  That's what this patch does, by ignoring it.
      
      While at it, touch up the comment in pgtable_split_needed() to make it
      generic to either pmd or pud file THPs.
      
      [1] https://lore.kernel.org/all/20240715192142.3241557-3-peterx@redhat.com/
      [2] https://lore.kernel.org/r/added2d0-b8be-4108-82ca-1367a388d0b1@redhat.com
      
      Link: https://lkml.kernel.org/r/20240812181225.1360970-8-peterx@redhat.com
      Fixes: a00cc7d9 ("mm, x86: add support for PUD-sized transparent hugepages")
      Fixes: 27af67f3 ("powerpc/book3s64/mm: enable transparent pud hugepage")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/x86: add missing pud helpers · 473f2490
      Peter Xu authored
      Some new helpers will be needed for pud entry updates soon.  Introduce
      these helpers by referencing the pmd ones.  Namely:
      
        - pudp_invalidate(): this helper invalidates a huge pud before a
          split happens, so that the invalidated pud entry will make sure no
          race will happen (either with software, like a concurrent zap, or
          hardware, like a/d bit lost).
      
        - pud_modify(): this helper applies a new pgprot to an existing huge
          pud mapping.
      
      For more information on why we need these two helpers, please refer to the
      corresponding pmd helpers in the mprotect() code path.
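
The roles of the two helpers can be sketched with a toy model. Everything below is hypothetical: the bit layout, the `TOY_*` names and the plain (non-atomic) update are simplifications, whereas a real implementation must swap the entry atomically and flush the TLB.

```c
#include <stdint.h>

/* Hypothetical layout: low 12 bits are flags, the rest is the frame number. */
#define TOY_PFN_MASK	(~(uint64_t)0xfff)
#define TOY_PRESENT	((uint64_t)1 << 0)
#define TOY_ACCESSED	((uint64_t)1 << 5)
#define TOY_DIRTY	((uint64_t)1 << 6)
#define TOY_HUGE	((uint64_t)1 << 7)

/* Bits that survive a protection change: frame, huge marker and A/D state. */
#define TOY_CHG_MASK	(TOY_PFN_MASK | TOY_ACCESSED | TOY_DIRTY | TOY_HUGE)

/* Apply new protection bits to an existing huge entry. */
static uint64_t toy_pud_modify(uint64_t pud, uint64_t newprot)
{
	return (pud & TOY_CHG_MASK) | (newprot & ~TOY_CHG_MASK);
}

/*
 * Mark the entry not-present and hand back the old value, so that the
 * caller works on a stable snapshot (including its A/D bits).
 */
static uint64_t toy_pudp_invalidate(uint64_t *pudp)
{
	uint64_t old = *pudp;

	*pudp = old & ~TOY_PRESENT;
	return old;
}
```

In this model a protection change would first invalidate the entry, then compute the new value from the returned snapshot with `toy_pud_modify()`, and finally write it back.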
      
      While at it, simplify the pud_modify()/pmd_modify() comments on shadow
      stack pgtable entries to reference pte_modify() to avoid duplicating the
      whole paragraph three times.
      
      Link: https://lkml.kernel.org/r/20240812181225.1360970-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/x86: implement arch_check_zapped_pud() · 1c399e74
      Peter Xu authored
      Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD
      zaps.  It has the same logic as the PMD helper.
      
      One thing to mention is, it might be a good idea to use page_table_check
      in the future for trapping wrong setups of shadow stack pgtable entries
      [1].  That is left for the future as a separate effort.
      
      [1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com
      
      Link: https://lkml.kernel.org/r/20240812181225.1360970-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/x86: make pud_leaf() only care about PSE bit · 144bb0ae
      Peter Xu authored
      When working on mprotect() on 1G dax entries, I hit a zap bad pud error
      when zapping a huge pud that has PROT_NONE permission.
      
      Here the problem is x86's pud_leaf() requires both PRESENT and PSE bits
      set to report a pud entry as a leaf, but that doesn't look right, as it's
      not following the pXd_leaf() definition that we stick with so far, where
      PROT_NONE entries should be reported as leaves.
      
      To fix it, change x86's pud_leaf() implementation to only check the PSE
      bit to report a leaf, irrespective of whether the PRESENT bit is set.
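
The difference between the two checks can be shown with a small userspace sketch. The `TOY_*` names and helper functions are hypothetical; the bit positions are chosen to mirror the x86 PRESENT (bit 0), PSE (bit 7) and PROT_NONE (bit 8) flags.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_PRESENT	((uint64_t)1 << 0)
#define TOY_PAGE_PSE		((uint64_t)1 << 7)
#define TOY_PAGE_PROTNONE	((uint64_t)1 << 8)

/* Old behaviour: a leaf needs both PRESENT and PSE. */
static bool toy_pud_leaf_old(uint64_t pud)
{
	const uint64_t both = TOY_PAGE_PRESENT | TOY_PAGE_PSE;

	return (pud & both) == both;
}

/* New behaviour: PSE alone marks a leaf, whether or not PRESENT is set. */
static bool toy_pud_leaf_new(uint64_t pud)
{
	return (pud & TOY_PAGE_PSE) != 0;
}
```

A PROT_NONE huge entry has PSE set but PRESENT cleared, so only the second check reports it as a leaf; an entry pointing to a lower-level table has no PSE bit and is rejected by both.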
      
      Link: https://lkml.kernel.org/r/20240812181225.1360970-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/powerpc: add missing pud helpers · 4dd7724f
      Peter Xu authored
      Some new helpers will be needed for pud entry updates soon.  Introduce
      these helpers by referencing the pmd ones.  Namely:
      
        - pudp_invalidate(): this helper invalidates a huge pud before a split
        happens, so that the invalidated pud entry will make sure no race will
        happen (either with software, like a concurrent zap, or hardware, like
        a/d bit lost).
      
        - pud_modify(): this helper applies a new pgprot to an existing huge pud
        mapping.
      
      For more information on why we need these two helpers, please refer to the
      corresponding pmd helpers in the mprotect() code path.
      
      Link: https://lkml.kernel.org/r/20240812181225.1360970-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: push mmu notifier to PUDs · 7f06e3aa
      Peter Xu authored
      mprotect() does mmu notifiers at the PMD level.  It has been that way
      since 2014, from commit a5338093 ("mm: move mmu notifier call from
      change_protection to change_pmd_range").
      
      At that time, the issue was that NUMA balancing can be applied on a huge
      range of VM memory, even if nothing was populated.  The notification can
      be avoided in this case if no valid pmd is detected, where a valid pmd
      is either a THP or a PTE pgtable page.
      
      Now, to pave the way for PUD handling, this isn't enough.  We need to
      generate mmu notifications properly even on PUD entries.  mprotect() is
      currently broken on PUD (e.g., one can already easily trigger a kernel
      error with dax 1G mappings); this is the start of fixing it.
      
      To fix that, this patch proposes to push such notifications to the PUD
      layers.
      
      There is a risk of regressing the problem Rik wanted to resolve before,
      but I think it shouldn't really happen, and I still chose this solution
      for a few reasons:
      
        1) Consider a large VM that should definitely contain more than GBs of
        memory; it's highly likely that PUDs are also none.  In this case there
        will be no regression.
      
        2) KVM has evolved a lot over the years to get rid of rmap walks, which
        might be the major cause of the previous soft-lockup.  At least TDP MMU
        already got rid of rmap as long as not nested (which should be the major
        use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM
        pgtable (e.g. EPT on x86), the invalidation of a full empty region in
        most cases could be pretty fast now, comparing to 2014.
      
        3) KVM has explicit code paths now to even give way for mmu notifiers
        just like this one, e.g. in commit d02c357e ("KVM: x86/mmu: Retry
        fault before acquiring mmu_lock if mapping is changing").  It'll also
        avoid contentions that may also contribute to a soft-lockup.
      
        4) Sticking with the PMD layer simply doesn't work when PUD is there...
        We need one way or another to fix PUD mappings on mprotect().
      
      Pushing it to PUD should be the safest approach as of now, e.g. there's yet
      no sign of huge P4D coming on any known archs.
      
      Link: https://lkml.kernel.org/r/20240812181225.1360970-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/dax: dump start address in fault handler · 5b198b47
      Peter Xu authored
      Patch series "mm/mprotect: Fix dax puds", v5.
      
      Dax has supported pud pages for a while, but mprotect on puds has been
      missing since the start.  This series tries to fix that by providing pud
      handling in mprotect().  The goal is to add more types of pud mappings
      like hugetlb or pfnmaps.  This series paves the way for that by fixing
      known pud entries.
      
      Considering nobody reported this until when I looked at those other types
      of pud mappings, I am thinking maybe it doesn't need to be a fix for
      stable and this may not need to be backported.  I would guess whoever
      cares about mprotect() won't care 1G dax puds yet, vice versa.  I hope
      fixing that in new kernels would be fine, but I'm open to suggestions.
      
      There're a few small things changed to teach mprotect to work on PUDs.  E.g.
      it will need to start with dropping NUMA_HUGE_PTE_UPDATES which may stop
      making sense when there can be more than one type of huge pte.  OTOH,
      we'll also need to push the mmu notifiers from pmd to pud layers, which
      might need some attention but so far I think it's safe.  For such details,
      please refer to each patch's commit message.
      
      The mprotect() pud process should be straightforward, as I kept it as
      simple as possible.  There's no NUMA handling, as dax simply doesn't
      support that.  There's also no userfault involvement, as file memory (even
      if it works with userfault-wp async mode) will need to split a pud, so a
      pud entry doesn't yet need to know about userfault's existence (but
      hugetlb entries will; that's also for later).
      
      
      This patch (of 7):
      
      Currently the dax fault handler dumps the vma range when dynamic debugging
      is enabled.  That's mostly not useful.  Dump the (aligned) address instead
      with the order info.
      
      Link: https://lkml.kernel.org/r/20240812181225.1360970-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240812181225.1360970-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true · bceeeaed
      Yuanchu Xie authored
      When non-leaf pmd accessed bits are available, MGLRU page table walks can
      clear the non-leaf pmd accessed bit and ignore the accessed bit on the pte
      if it's on a different node, skipping a generation update as well.  If
      another scan then occurs on the same node as said skipped pte, the
      non-leaf pmd accessed bit might remain cleared and the pte accessed bits
      won't be checked.
      
      While this is sufficient for reclaim-driven aging, where the goal is to
      select a reasonably cold page, the access can be missed when aging
      proactively for workingset estimation of a node/memcg.
      
      In more detail, get_pfn_folio returns NULL if the folio's nid != node
      under scanning, so the page table walk skips processing of said pte.  Now
      the pmd_young flag on this pmd is cleared, and if none of the pte's are
      accessed before another scan occurs on the folio's node, the pmd_young
      check fails and the pte accessed bit is skipped.
      
      Since force_scan disables various other optimizations, we check force_scan
      to ignore the non-leaf pmd accessed bit.
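
The resulting decision can be summarized with a tiny sketch. `toy_should_walk_ptes()` is a hypothetical name; the real walker has further conditions that are left out here.

```c
#include <stdbool.h>

/*
 * Decide whether the ptes under a non-leaf pmd should be examined.
 * With force_scan the pmd accessed bit is ignored, so a pte whose
 * access was skipped by an earlier scan of another node is still seen.
 */
static bool toy_should_walk_ptes(bool force_scan, bool pmd_young)
{
	if (force_scan)
		return true;

	return pmd_young;
}
```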
      
      Link: https://lkml.kernel.org/r/20240813163759.742675-1-yuanchu@google.com
      Signed-off-by: Yuanchu Xie <yuanchu@google.com>
      Acked-by: Yu Zhao <yuzhao@google.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Lance Yang <ioworker0@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: add optimization hint on page existence check · 6963f008
      Miao Wang authored
      In commit 21e516b9 ("mm: vmalloc: dump page owner info if page is
      already mapped"), a BUG_ON macro was changed into an if statement, where
      the compiler optimization hint introduced in the BUG_ON macro was removed
      along with this change.  This patch adds back the hint.
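
As a userspace illustration of such a hint, the sketch below wraps the GCC/Clang `__builtin_expect()` builtin. The names `toy_unlikely()` and `toy_map_slot()` are hypothetical, and the error path is reduced to a return code.

```c
/* Tell the compiler that the condition is expected to be false. */
#define toy_unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Install a value into an empty slot.  Finding the slot already populated
 * is an error path, so it is marked as unlikely and kept off the hot path.
 */
static int toy_map_slot(int *slot, int value)
{
	if (toy_unlikely(*slot != 0))
		return -1;	/* already mapped: report instead of crashing */

	*slot = value;
	return 0;
}
```

The hint changes only how the compiler lays out the branches, not the result of the check.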
      
      Link: https://lkml.kernel.org/r/20240814-fix_vmap_unlikely-v1-1-cd7954775f12@gmail.com
      Fixes: 21e516b9 ("mm: vmalloc: dump page owner info if page is already mapped")
      Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hariom Panthi <hariom1.p@samsung.com>
      Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: accept to promo watermark · 59149bf8
      Kirill A. Shutemov authored
      Commit c574bbe9 ("NUMA balancing: optimize page placement for memory
      tiering system") introduced a new watermark above "high" -- "promo".
      
      Accept memory up to the highest watermark, which is now WMARK_PROMO.
      
      Link: https://lkml.kernel.org/r/20240809114854.3745464-9-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: page_isolation: handle unaccepted memory isolation · e44dd9b1
      Kirill A. Shutemov authored
      The page isolation machinery doesn't know anything about unaccepted memory
      and considers it non-free.  This leads to alloc_contig_pages() failure.
      
      Treat unaccepted memory as free and accept memory on pageblock isolation.
      Once memory is accepted, it becomes PageBuddy() and page isolation knows
      how to deal with it.
      
      Link: https://lkml.kernel.org/r/20240809114854.3745464-8-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add a helper to accept page · 55ad43e8
      Kirill A. Shutemov authored
      Accept a given struct page and add it to the free list.
      
      The helper is useful for physical memory scanners that want to use free
      unaccepted memory.
      
      Link: https://lkml.kernel.org/r/20240809114854.3745464-7-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: rework accept memory helpers · 5adfeaec
      Kirill A. Shutemov authored
      Make accept_memory() and range_contains_unaccepted_memory() take 'start'
      and 'size' arguments instead of 'start' and 'end'.
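
The calling convention can be sketched in userspace as follows. The `toy_*` functions, the fixed-size bitmap and the 4K unit size are all assumptions made for illustration; only the 'start'/'size' argument shape is taken from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_UNIT	4096u	/* hypothetical acceptance granularity */
#define TOY_NR_UNITS	16u

/* One flag per unit: true while the unit is still unaccepted. */
static bool toy_unaccepted[TOY_NR_UNITS];

/* True if any unit overlapping [start, start + size) is still unaccepted. */
static bool toy_range_contains_unaccepted(uint64_t start, uint64_t size)
{
	for (uint64_t u = start / TOY_UNIT;
	     u < TOY_NR_UNITS && u * TOY_UNIT < start + size; u++)
		if (toy_unaccepted[u])
			return true;
	return false;
}

/* Accept every unit overlapping [start, start + size). */
static void toy_accept_memory(uint64_t start, uint64_t size)
{
	for (uint64_t u = start / TOY_UNIT;
	     u < TOY_NR_UNITS && u * TOY_UNIT < start + size; u++)
		toy_unaccepted[u] = false;
}
```

A caller that previously tracked a 'start' and 'end' pair would now pass `end - start` as the second argument.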
      
      Remove accept_page(), replacing it with direct calls to accept_memory(). 
      The accept_page() name is going to be used for a different function.
      
      Link: https://lkml.kernel.org/r/20240809114854.3745464-6-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: introduce PageUnaccepted() page type · 310183de
      Kirill A. Shutemov authored
      The new page type allows physical memory scanners to detect unaccepted
      memory and handle it accordingly.
      
      The page type is serialized with zone lock.
      
      Link: https://lkml.kernel.org/r/20240809114854.3745464-5-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: accept memory in __alloc_pages_bulk() · 4be9064b
      Kirill A. Shutemov authored
      Currently, the kernel only accepts memory in get_page_from_freelist(), but
      there is another path that directly takes pages from free lists -
      __alloc_pages_bulk().  This function can consume all accepted memory and
      will resort to __alloc_pages_noprof() if necessary.
      
      Conditionally accept memory in __alloc_pages_bulk().
      
      The same issue may arise due to deferred page initialization.  Kick the
      deferred initialization machinery before abandoning the zone, as the
      kernel does in get_page_from_freelist().
      
      Link: https://lkml.kernel.org/r/20240809114854.3745464-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: reduce deferred struct page init ifdeffery · 3a80b822
      Kirill A. Shutemov authored
      Patch series "mm: Fix several issues with unaccepted memory", v2.
      
      The patchset addresses several issues related to unaccepted memory.
      
      Patch 1/7 is a preparatory cleanup.
      
      Patch 2/7 ensures that __alloc_pages_bulk() will not exhaust all
      accepted memory without accepting more.
      
      Patches 3/7-5/7 are preparations for patch 6/7, which fixes
      alloc_contig_pages() on machines with unaccepted memory.  This allows, for
      example, the allocation of gigantic pages at runtime.
      
      Patch 7/7 enables the kernel to accept memory up to the promo watermark.
      
      
      This patch (of 7):
      
      Add dummy _deferred_grow_zone() for !DEFERRED_STRUCT_PAGE_INIT and remove
      #ifdefs in two places.
      
      No functional changes.
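
      The stub pattern described above can be sketched in user space roughly as
      follows.  This is only an illustration: struct zone_sketch,
      grow_zone_sketch() and the SKETCH_DEFERRED_INIT macro are hypothetical
      stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct zone. */
struct zone_sketch {
	unsigned long spanned_pages;
};

#ifdef SKETCH_DEFERRED_INIT
/* Real implementation, only built when the feature is configured in. */
bool grow_zone_sketch(struct zone_sketch *zone, unsigned int order);
#else
/* Dummy version: nothing is deferred, so there is never anything to grow. */
static inline bool grow_zone_sketch(struct zone_sketch *zone,
				    unsigned int order)
{
	(void)zone;
	(void)order;
	return false;
}
#endif

/* Callers can now call the helper without wrapping the call in #ifdef. */
static bool retry_after_grow_sketch(struct zone_sketch *zone,
				    unsigned int order)
{
	return grow_zone_sketch(zone, order);
}
```

      With the macro undefined, the call compiles down to a constant false, so
      the caller behaves as it did when the call was hidden behind an #ifdef.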
      
      Link: https://lkml.kernel.org/r/20240809114854.3745464-1-kirill.shutemov@linux.intel.com
      Link: https://lkml.kernel.org/r/20240809114854.3745464-3-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a80b822
    • Zi Yan's avatar
      mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep) · 727d50a7
      Zi Yan authored
      do_numa_page() and do_huge_pmd_numa_page() share a lot of common code.  To
      reduce redundancy, move common code to numa_migrate_prep() and rename the
      function to numa_migrate_check() to reflect its functionality.
      
      Now do_huge_pmd_numa_page() also checks shared folios to set the
      TNF_SHARED flag.
      
      Link: https://lkml.kernel.org/r/20240809145906.1513458-4-ziy@nvidia.com
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      727d50a7
    • Shakeel Butt's avatar
      memcg: replace memcg ID idr with xarray · 07222371
      Shakeel Butt authored
      At the moment memcg IDs are managed through IDR which requires external
      synchronization mechanisms and makes the allocation code a bit awkward. 
      Let's switch to xarray and make the code simpler.
      
      [shakeel.butt@linux.dev: fix error path in mem_cgroup_alloc(), per Dan]
        Link: https://lkml.kernel.org/r/20240815155402.3630804-1-shakeel.butt@linux.dev
      Link: https://lkml.kernel.org/r/20240809172618.2946790-1-shakeel.butt@linux.dev
      Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Muchun Song <muchun.song@linux.dev>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      07222371
    • Jeff Xu's avatar
      selftest mm/mseal: fix test_seal_mremap_move_dontunmap_anyaddr · 072cd213
      Jeff Xu authored
      The mremap() syscall accepts the following:
      
      mremap(src, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, dst)
      
      When src is sealed, the call fails with the error code EPERM.
      
      Previously, the test used a hard-coded 0xdeaddead as dst, which fails
      on systems with a newer glibc installed.
      
      This patch removes the test's dependency on glibc for mremap(), fixes
      the test, and removes the hard-coded address.
      
      Link: https://lkml.kernel.org/r/20240807212320.2831848-1-jeffxu@chromium.org
      Fixes: 4926c7a5 ("selftest mm/mseal memory sealing")
      Signed-off-by: Jeff Xu <jeffxu@chromium.org>
      Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      072cd213
    • Matthew Wilcox (Oracle)'s avatar
      mm: return the folio from swapin_readahead · 94dc8bff
      Matthew Wilcox (Oracle) authored
      The unuse_pte_range() caller only wants the folio while do_swap_page()
      wants both the page and the folio.  Since do_swap_page() already has logic
      for handling both the folio and the page, move the folio-to-page logic
      there.  This also lets us allocate larger folios in the SWP_SYNCHRONOUS_IO
      path in future.
      
      Link: https://lkml.kernel.org/r/20240807193734.1865400-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      94dc8bff
    • Matthew Wilcox (Oracle)'s avatar
      mm: remove PG_error · 09022bc1
      Matthew Wilcox (Oracle) authored
      The PG_error bit is now unused; delete it and free up a bit in
      page->flags.
      
      Link: https://lkml.kernel.org/r/20240807193528.1865100-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      09022bc1
    • Matthew Wilcox (Oracle)'s avatar
      fs: remove calls to set and clear the folio error flag · 420e05d0
      Matthew Wilcox (Oracle) authored
      Nobody checks the folio error flag any more, so we can stop setting and
      clearing it.  Also remove the documentation suggesting to not bother
      setting the error bit.
      
      Link: https://lkml.kernel.org/r/20240807193528.1865100-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      420e05d0
    • qiwu.chen's avatar
      mm: kfence: print the elapsed time for allocated/freed track · 62e73fd8
      qiwu.chen authored
      Print the elapsed time for the allocated or freed track, which can be
      useful in some debugging scenarios.
      
      Link: https://lkml.kernel.org/r/20240807025627.37419-1-qiwu.chen@transsion.com
      Signed-off-by: qiwu.chen <qiwu.chen@transsion.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: chenqiwu <qiwu.chen@transsion.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      62e73fd8
    • Jianhui Zhou's avatar
      percpu: remove pcpu_alloc_size() · 47baed6a
      Jianhui Zhou authored
      pcpu_alloc_size() was added in 7ac5c53e ("mm/percpu.c: introduce
      pcpu_alloc_size()") to get the allocated memory size in bpf.  However,
      pcpu_alloc_size() is no longer used since "bpf: Use c->unit_size to
      select target cache during free", because the actual allocated memory
      size may change at runtime due to the slab merging mechanism.  Therefore,
      pcpu_alloc_size() can be removed.
      
      Link: https://lkml.kernel.org/r/tencent_AD5C50E8D78C07A3CE539BD5F6BF39706507@qq.com
      Signed-off-by: Jianhui Zhou <912460177@qq.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: JonasZhou <JonasZhou@zhaoxin.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      47baed6a
    • David Hildenbrand's avatar
      mm/rmap: minimize folio->_nr_pages_mapped updates when batching PTE (un)mapping · 43c9074e
      David Hildenbrand authored
      It is not immediately obvious, but we can move the folio->_nr_pages_mapped
      update out of the loop and reduce the number of atomic ops without
      affecting the stats.
      
      The important point to realize is that only removing the last PMD mapping
      will result in _nr_pages_mapped going below ENTIRELY_MAPPED, not the
      individual atomic_inc_return_relaxed() calls.  Concurrent races with
      removal of PMD mappings should be handled as expected, just like when we
      would have such races right now on a single mapcount update.
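
      The mapping side of the idea can be sketched in user space with C11
      atomics.  This is a simplified model, not the kernel's rmap code:
      page_sketch, folio_sketch, map_ptes_batched() and the value of
      ENTIRELY_MAPPED_SKETCH are assumptions made for illustration.

```c
#include <stdatomic.h>

/* Assumed bias that stands for "the folio is PMD-mapped". */
#define ENTIRELY_MAPPED_SKETCH 0x800000

struct page_sketch {
	atomic_int mapcount;		/* starts at -1, i.e. "not mapped" */
};

struct folio_sketch {
	atomic_int nr_pages_mapped;	/* folio-wide counter */
};

/*
 * Map nr_pages consecutive pages via PTEs.  Returns how many pages should
 * be accounted as newly mapped in the stats.
 */
static int map_ptes_batched(struct folio_sketch *folio,
			    struct page_sketch *page, int nr_pages)
{
	int first = 0;

	/* The per-page mapcounts still need one atomic op each. */
	for (int i = 0; i < nr_pages; i++) {
		/* -1 -> 0 means this PTE is the first mapping of the page. */
		if (atomic_fetch_add_explicit(&page[i].mapcount, 1,
					      memory_order_relaxed) == -1)
			first++;
	}

	if (!first)
		return 0;

	/* A single update of the folio-wide counter for the whole batch. */
	int total = atomic_fetch_add_explicit(&folio->nr_pages_mapped, first,
					      memory_order_relaxed) + first;

	/* If a PMD mapping exists, these pages were already accounted. */
	return total < ENTIRELY_MAPPED_SKETCH ? first : 0;
}
```

      For a batch of N pages this performs one atomic update of the
      folio-wide counter instead of up to N.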
      
      In a simple munmap() microbenchmark [1] on 1 GiB of memory backed by the
      same PTE-mapped folio size (only mapped by a single process such that they
      will get completely unmapped), this change results in a speedup (positive
      is good) per folio size on an x86-64 Intel machine of roughly (a bit of
      noise expected):
      
      * 16 KiB: +10%
      * 32 KiB: +15%
      * 64 KiB: +17%
      * 128 KiB: +21%
      * 256 KiB: +22%
      * 512 KiB: +22%
      * 1024 KiB: +23%
      * 2048 KiB: +27%
      
      [1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c
      
      Link: https://lkml.kernel.org/r/20240807115515.1640951-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      43c9074e
    • Pedro Falcato's avatar
      selftests/mm: add mseal test for no-discard madvise · 67203f3f
      Pedro Falcato authored
      Add an mseal test for madvise() operations that aren't considered
      "discard" (e.g. purely advisory ops such as MADV_RANDOM).
      
      [pedro.falcato@gmail.com: adjust the mseal test's plan]
        Link: https://lkml.kernel.org/r/20240807203724.2686144-1-pedro.falcato@gmail.com
      Link: https://lkml.kernel.org/r/20240807173336.2523757-3-pedro.falcato@gmail.com
      Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com>
      Tested-by: Jeff Xu <jeffxu@chromium.org>
      Reviewed-by: Jeff Xu <jeffxu@chromium.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      67203f3f
    • Marco Elver's avatar
      kfence: introduce burst mode · cc0a0f98
      Marco Elver authored
      Introduce burst mode, which can be configured with kfence.burst=$count,
      where the burst count denotes the additional successive slab allocations
      to be allocated through KFENCE for each sample interval.
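
      One way to picture the burst count is as an allocation gate that is
      re-armed once per sample interval.  The following user-space model is a
      sketch under that assumption; gate_sketch, gate_open() and
      gate_try_sample() are hypothetical names and this is not the KFENCE
      source.

```c
#include <stdbool.h>

struct gate_sketch {
	/*
	 * Counts up from -burst on every allocation; an allocation is
	 * sampled while the incremented value is still <= 1.
	 */
	int gate;
	int burst;	/* additional allocations to sample per interval */
};

/* Called once per sample interval: re-arm the gate. */
static void gate_open(struct gate_sketch *g)
{
	g->gate = -g->burst;
}

/* Called for every slab allocation: true if this one should be sampled. */
static bool gate_try_sample(struct gate_sketch *g)
{
	return ++g->gate <= 1;
}
```

      In this model a burst of 0 samples one allocation per interval, and a
      burst of N samples N additional successive allocations before the gate
      closes until the next interval.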
      
      The idea is that this can give developers an additional knob to make
      KFENCE more aggressive when debugging specific issues of systems where
      either rebooting or recompiling the kernel with KASAN is not possible.
      
      Experiment: To assess the effectiveness of the new option, we randomly
      picked a recent out-of-bounds [1] and use-after-free bug [2], each with a
      reproducer provided by syzbot, that initially detected these bugs with
      KASAN.  We then tried to reproduce the bugs with KFENCE below.
      
      [1] Fixed by: 7c55b788 ("jfs: xattr: fix buffer overflow for invalid xattr")
          https://syzkaller.appspot.com/bug?id=9d1b59d4718239da6f6069d3891863c25f9f24a2
      [2] Fixed by: f8ad00f3 ("l2tp: fix possible UAF when cleaning up tunnels")
          https://syzkaller.appspot.com/bug?id=4f34adc84f4a3b080187c390eeef60611fd450e1
      
      The following KFENCE configs were compared. A pool size of 1023 objects
      was used for all configurations.
      
      	Baseline
      		kfence.sample_interval=100
      		kfence.skip_covered_thresh=75
      		kfence.burst=0
      
      	Aggressive
      		kfence.sample_interval=1
      		kfence.skip_covered_thresh=10
      		kfence.burst=0
      
      	AggressiveBurst
      		kfence.sample_interval=1
      		kfence.skip_covered_thresh=10
      		kfence.burst=1000
      
      Each reproducer was run 10 times (after a fresh reboot), with the
      following detection counts for each KFENCE config:
      
                          | Detection Count out of 10 |
                          |    OOB [1]  |    UAF [2]  |
        ------------------+-------------+-------------+
        Baseline          |     0/10    |     0/10    |
        Aggressive        |     0/10    |     0/10    |
        AggressiveBurst   |     8/10    |     8/10    |
      
      With the Baseline and even the Aggressive configs the results are
      unsurprising, given KFENCE has not been designed for deterministic bug
      detection of small test cases.
      
      However, when enabling burst mode with a relatively large burst count,
      KFENCE can start to detect heap memory-safety bugs even in simpler test
      cases with high probability (in the above cases with ~80% probability).
      
      Link: https://lkml.kernel.org/r/20240805124203.2692278-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cc0a0f98
    • Jann Horn's avatar
      mm: fix (harmless) type confusion in lock_vma_under_rcu() · 17fe833b
      Jann Horn authored
      There is a (harmless) type confusion in lock_vma_under_rcu(): After
      vma_start_read(), we have taken the VMA lock but don't know yet whether
      the VMA has already been detached and scheduled for RCU freeing.  At this
      point, ->vm_start and ->vm_end are accessed.
      
      vm_area_struct contains a union such that ->vm_rcu uses the same memory as
      ->vm_start and ->vm_end; so accessing ->vm_start and ->vm_end of a
      detached VMA is illegal and leads to type confusion between union members.
      
      Fix it by reordering the vma->detached check above the address checks, and
      document the rules for RCU readers accessing VMAs.
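
      The union aliasing and the reordered check can be illustrated with a
      small user-space model.  The structures below are hypothetical
      simplifications (vma_sketch, rcu_head_sketch, check_vma()) and all
      locking is omitted; only the order of the checks is the point.

```c
#include <stdbool.h>
#include <stddef.h>

struct rcu_head_sketch {
	void *next;
	void (*func)(struct rcu_head_sketch *);
};

struct vma_sketch {
	union {
		struct {
			unsigned long vm_start;
			unsigned long vm_end;
		};
		/* Shares storage with the bounds; valid only once detached. */
		struct rcu_head_sketch vm_rcu;
	};
	bool detached;
};

/* Returns the VMA if it is still attached and covers addr, else NULL. */
static struct vma_sketch *check_vma(struct vma_sketch *vma,
				    unsigned long addr)
{
	/* Check detached first: the bounds may be overwritten by vm_rcu. */
	if (vma->detached)
		return NULL;
	if (addr < vma->vm_start || addr >= vma->vm_end)
		return NULL;
	return vma;
}
```

      With the detached check first, the bounds are only read while they are
      the active union member.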
      
      This will probably change the number of observed VMA_LOCK_MISS events
      (since previously, trying to access a detached VMA whose ->vm_rcu has been
      scheduled would bail out when checking the fault address against the
      rcu_head members reinterpreted as VMA bounds).
      
      Link: https://lkml.kernel.org/r/20240805-fix-vma-lock-type-confusion-v1-1-9f25443a9a71@google.com
      Fixes: 50ee3253 ("mm: introduce lock_vma_under_rcu to be used from arch-specific code")
      Signed-off-by: Jann Horn <jannh@google.com>
      Acked-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      17fe833b
    • Nhat Pham's avatar
      zswap: track swapins from disk more accurately · 0e400844
      Nhat Pham authored
      Currently, there are a couple of issues with our disk swapin tracking for
      dynamic zswap shrinker heuristics:
      
      1. We only increment the swapin counter on pivot pages. This means we
         are not taking into account pages that also need to be swapped in,
         but are already taken care of as part of the readahead window.
      
      2. We are also incrementing when the pages are read from the zswap pool,
         which is inaccurate.
      
      This patch rectifies these issues by incrementing the counter whenever we
      need to perform a non-zswap read.  Note that we are slightly overcounting,
      as a page might be read into memory by the readahead algorithm even though
      it will not be needed by users - however, this is an acceptable
      inaccuracy, as the readahead logic itself will adapt to these kinds of
      scenarios.
      
      To test this change, I built the kernel under a cgroup with its memory.max
      set to 2 GB:
      
      real: 236.66s
      user: 4286.06s
      sys: 652.86s
      swapins: 81552
      
      For comparison, with just the new second chance algorithm, the build time
      is as follows:
      
      real: 244.85s
      user: 4327.22s
      sys: 664.39s
      swapins: 94663
      
      With neither:
      
      real: 263.89s
      user: 4318.11s
      sys: 673.29s
      swapins: 227300.5
      
      (average over 5 runs)
      
      With this change, the kernel CPU time reduces by a further 1.7%, and the
      real time is reduced by another 3.3%, compared to just the second chance
      algorithm by itself.  The swapins count also reduces by another 13.85%.
      
      Combining the two changes, we reduce the real time by 10.32%, kernel CPU
      time by 3%, and number of swapins by 64.12%.
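
      The percentages quoted above follow directly from the tables.  A
      minimal helper to reproduce them (pct_reduction() is a hypothetical
      name used only for this illustration):

```c
/* Relative reduction from `before` to `after`, in percent. */
static double pct_reduction(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

      For example, pct_reduction(244.85, 236.66) is about 3.3 (real time
      versus the second chance algorithm alone), pct_reduction(94663, 81552)
      is about 13.85 (swapins), and pct_reduction(227300.5, 81552) is about
      64.12 (swapins versus neither change).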
      
      To gauge the new scheme's ability to offload cold data, I ran another
      benchmark, in which the kernel was built under a cgroup with memory.max
      set to 3 GB, but with 0.5 GB worth of cold data allocated before each
      build (in a shmem file).
      
      Under the old scheme:
      
      real: 197.18s
      user: 4365.08s
      sys: 289.02s
      zswpwb: 72115.2
      
      Under the new scheme:
      
      real: 195.8s
      user: 4362.25s
      sys: 290.14s
      zswpwb: 87277.8
      
      (average over 5 runs)
      
      Notice that we actually observe a 21% increase in the number of written
      back pages - so the new scheme is just as good, if not better at
      offloading pages from the zswap pool when they are cold.  Build time
      reduces by around 0.7% as a result.
      
      [nphamcs@gmail.com: squeeze a comment into a single line]
        Link: https://lkml.kernel.org/r/20240806004518.3183562-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240805232243.2896283-3-nphamcs@gmail.com
      Fixes: b5ba474f ("zswap: shrink zswap pool based on memory pressure")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Takero Funaki <flintglass@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0e400844
    • Nhat Pham's avatar
      zswap: implement a second chance algorithm for dynamic zswap shrinker · e31c38e0
      Nhat Pham authored
      Patch series "improving dynamic zswap shrinker protection scheme", v3.
      
      When experimenting with the memory-pressure based (i.e. "dynamic") zswap
      shrinker in production, we observed a sharp increase in the number of
      swapins, which led to performance regression.  We were able to trace this
      regression to the following problems with the shrinker's warm pages
      protection scheme:
      
      1. The protection decays way too rapidly, and the decaying is coupled with
         zswap stores, leading to anomalous patterns, in which a small batch of
         zswap stores effectively erase all the protection in place for the
         warmer pages in the zswap LRU.
      
         This observation has also been corroborated upstream by Takero Funaki
         (in [1]).
      
      2. We inaccurately track the number of swapped in pages, missing the
         non-pivot pages that are part of the readahead window, while counting
         the pages that are found in the zswap pool.
      
      
      To alleviate these two issues, this patch series improves the dynamic
      zswap shrinker in the following manner:
      
      1. Replace the protection size tracking scheme with a second chance
         algorithm. This new scheme removes the need for haphazard stats
         decaying, and automatically adjusts the pace of pages aging with memory
         pressure, and writeback rate with pool activities: slowing down when
         the pool is dominated with zswpouts, and speeding up when the pool is
         dominated with stale entries.
      
      2. Fix the tracking of the number of swapins to take into account
         non-pivot pages in the readahead window.
      
      With these two changes in place, in a kernel-building benchmark without
      any cold data added, the number of swapins is reduced by 64.12%.  This
      translates to a 10.32% reduction in build time.  We also observe a 3%
      reduction in kernel CPU time.
      
      In another benchmark, with cold data added (to gauge the new algorithm's
      ability to offload cold data), the new second chance scheme outperforms
      the old protection scheme by around 0.7%, and actually wrote back around
      21% more pages to the backing swap device.  So the new scheme is just as
      good, if not even better than the old scheme on this front as well.
      
      [1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/
      
      
      This patch (of 2):
      
      The current zswap shrinker's heuristics to prevent overshrinking are
      brittle and inaccurate, specifically in the way we decay the protection
      size (i.e. making pages in the zswap LRU eligible for reclaim).
      
      We currently decay protection aggressively in zswap_lru_add() calls.  This
      leads to the following unfortunate effect: when a new batch of pages enters
      zswap, the protection size rapidly decays to below 25% of the zswap LRU
      size, which is way too low.
      
      We have observed this effect in production, when experimenting with the
      zswap shrinker: the rate of shrinking shoots up massively right after a
      new batch of zswap stores.  This is somewhat the opposite of what we
      originally wanted - when new pages enter zswap, we want to protect both
      these new pages AND the pages that are already protected in the zswap LRU.
      
      Replace the existing heuristics with a second chance algorithm:
      
      1. When a new zswap entry is stored in the zswap pool, its referenced
         bit is set.
      2. When the zswap shrinker encounters a zswap entry with the referenced
         bit set, it gives the entry a second chance: it only clears the
         referenced bit and rotates the entry in the LRU.
      3. If the shrinker encounters the entry again, this time with its
         referenced bit unset, then it can reclaim the entry.
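
      The three steps above can be modelled in user space with a simple
      queue.  This is an illustrative sketch, not the zswap implementation:
      entry_sketch, lru_sketch, lru_store() and lru_shrink_one() are
      hypothetical names, and locking and writeback are left out.

```c
#include <stdbool.h>
#include <stddef.h>

struct entry_sketch {
	bool referenced;
	struct entry_sketch *next;	/* towards the newest entry */
};

struct lru_sketch {
	struct entry_sketch *oldest;	/* scanned first by the shrinker */
	struct entry_sketch *newest;
};

static void lru_add_newest(struct lru_sketch *lru, struct entry_sketch *e)
{
	e->next = NULL;
	if (lru->newest)
		lru->newest->next = e;
	else
		lru->oldest = e;
	lru->newest = e;
}

/* Step 1: a newly stored entry starts with its referenced bit set. */
static void lru_store(struct lru_sketch *lru, struct entry_sketch *e)
{
	e->referenced = true;
	lru_add_newest(lru, e);
}

/*
 * Scan the oldest entry.  Returns the entry if it was reclaimed, or NULL if
 * the LRU is empty or the entry was given a second chance.
 */
static struct entry_sketch *lru_shrink_one(struct lru_sketch *lru)
{
	struct entry_sketch *e = lru->oldest;

	if (!e)
		return NULL;

	/* Unlink the oldest entry. */
	lru->oldest = e->next;
	if (!lru->oldest)
		lru->newest = NULL;

	if (e->referenced) {
		/* Step 2: clear the bit and rotate instead of reclaiming. */
		e->referenced = false;
		lru_add_newest(lru, e);
		return NULL;
	}

	/* Step 3: seen again with the bit unset, so reclaim it. */
	e->next = NULL;
	return e;
}
```

      In this model an entry survives exactly one pass of the shrinker, so
      how quickly entries age depends on how often the shrinker runs rather
      than on how many new entries are stored.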
      
      In this manner, the aging of the pages in the zswap LRUs is decoupled
      from zswap stores, and picks up the pace with increasing memory pressure
      (which is what we want).
      
      The second chance scheme allows us to modulate the writeback rate based on
      recent pool activities.  Entries that recently entered the pool will be
      protected, so if the pool is dominated by such entries the writeback rate
      will reduce proportionally, protecting the workload's working set.  On the
      other hand, stale entries will be written back quickly, which increases
      the effective writeback rate.
      
      The referenced bit is added at the hole after the `length` field of struct
      zswap_entry, so there is no extra space overhead for this algorithm.
      
      We will still maintain the count of swapins, which is consumed and
      subtracted from the lru size in zswap_shrinker_count(), to further
      penalize past overshrinking that led to disk swapins.  The idea is that
      had we considered this many more pages in the LRU active/protected, they
      would not have been written back and we would not have had to swap them
      in.
      
      To test these new heuristics, I built the kernel under a cgroup with
      memory.max set to 2G, on a host with 36 cores:
      
      With the old shrinker:
      
      real: 263.89s
      user: 4318.11s
      sys: 673.29s
      swapins: 227300.5
      
      With the second chance algorithm:
      
      real: 244.85s
      user: 4327.22s
      sys: 664.39s
      swapins: 94663
      
      (average over 5 runs)
      
      We observe a 1.3% reduction in kernel CPU usage, and around 7.2%
      reduction in real time. Note that the number of swapped in pages
      dropped by 58%.
      
      [nphamcs@gmail.com: fix a small mistake in the referenced bit documentation]
        Link: https://lkml.kernel.org/r/20240806003403.3142387-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240805232243.2896283-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240805232243.2896283-2-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Takero Funaki <flintglass@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e31c38e0
    • David Gow's avatar
      mm: only enforce minimum stack gap size if it's sensible · 69b50d43
      David Gow authored
      The generic mmap_base code tries to leave a gap between the top of the
      stack and the mmap base address, but enforces a minimum gap size (MIN_GAP)
      of 128MB, which is too large on some setups.  In particular, on arm tasks
      without ADDR_LIMIT_32BIT, the STACK_TOP value is less than 128MB, so it's
      impossible to fit such a gap in.
      
      Only enforce this minimum if MIN_GAP < MAX_GAP, as we'd prefer to honour
      MAX_GAP, which is defined proportionally, so scales better and always
      leaves us with both _some_ stack space and some room for mmap.
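
      The clamping rule can be sketched as a plain function.  This is a
      user-space illustration: clamp_gap() is a hypothetical name, and the
      5/6 proportion used for the maximum gap is an assumption about how
      MAX_GAP scales with STACK_TOP.

```c
/* 128MB minimum gap, as described above. */
#define MIN_GAP_SKETCH (128UL * 1024 * 1024)

/* Clamp the stack gap for a task whose stack top is at `stack_top`. */
static unsigned long clamp_gap(unsigned long gap, unsigned long stack_top)
{
	unsigned long min_gap = MIN_GAP_SKETCH;
	/* Assumed proportional cap: five sixths of the stack top. */
	unsigned long max_gap = stack_top / 6 * 5;

	/* Only enforce the minimum when it fits below the maximum. */
	if (gap < min_gap && min_gap < max_gap)
		gap = min_gap;
	else if (gap > max_gap)
		gap = max_gap;

	return gap;
}
```

      With a 64MB (26-bit) stack top the minimum is skipped, so an 8MB gap is
      left unchanged; with a 3GB stack top the same 8MB gap is raised to
      128MB.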
      
      This fixes the usercopy KUnit test suite on 32-bit arm, as it doesn't set
      any personality flags so gets the default (in this case 26-bit) task size.
      This test can be run with: ./tools/testing/kunit/kunit.py run --arch arm
      usercopy --make_options LLVM=1
      
      Link: https://lkml.kernel.org/r/20240803074642.1849623-2-davidgow@google.com
      Fixes: dba79c3d ("arm: use generic mmap top-down layout and brk randomization")
      Signed-off-by: David Gow <davidgow@google.com>
      Reviewed-by: Kees Cook <kees@kernel.org>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      69b50d43
    • Yang Li's avatar
      mm: remove duplicated include in vma_internal.h · a06e79d3
      Yang Li authored
      The header file linux/bug.h is included twice in vma_internal.h, so one
      of the inclusions can be removed.
      
      Link: https://lkml.kernel.org/r/20240802060216.24591-1-yang.lee@linux.alibaba.com
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9636
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a06e79d3
    • David Hildenbrand's avatar
      mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk · e317a8d8
      David Hildenbrand authored
      Let's simplify by reusing folio_walk.  Keep the existing behavior by
      handling migration entries and zeropages.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-12-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e317a8d8
    • David Hildenbrand's avatar
      mm: remove follow_page() · 7290840d
      David Hildenbrand authored
      All users are gone, let's remove it and any leftovers in comments.  We'll
      leave any FOLL/follow_page_() naming cleanups as future work.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-11-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7290840d
    • David Hildenbrand's avatar
      s390/mm/fault: convert do_secure_storage_access() from follow_page() to folio_walk · 0b31a3ce
      David Hildenbrand authored
      Let's get rid of another follow_page() user and perform the conversion
      under PTL: Note that this is also what follow_page_pte() ends up doing.
      
      Unfortunately we cannot currently optimize out the additional reference,
      because arch_make_folio_accessible() must be called with a raised refcount
      to protect against concurrent conversion to secure.  We can just move the
      arch_make_folio_accessible() under the PTL, like follow_page_pte() would.
      
      We'll effectively drop the "writable" check implied by FOLL_WRITE:
      follow_page_pte() would also not check that when calling
      arch_make_folio_accessible(), so there is no good reason for doing that
      here.
      
      We'll lose the secretmem check from follow_page() as well, about which we
      shouldn't really care.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b31a3ce
    • David Hildenbrand's avatar
      s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk · 85a7e543
      David Hildenbrand authored
      Let's get rid of another follow_page() user and perform the UV calls under
      PTL -- which likely should be fine.
      
      No need for an additional reference while holding the PTL:
      uv_destroy_folio() and uv_convert_from_secure_folio() raise the refcount,
      so any concurrent make_folio_secure() would see an unexpected reference and
      cannot set PG_arch_1 concurrently.
      
      Do we really need a writable PTE?  Likely yes, because the "destroy" part
      is, in comparison to the export, a destructive operation.  So we'll keep
      the writability check for now.
      
      We'll lose the secretmem check from follow_page().  Likely we don't care
      about that here.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      85a7e543
    • David Hildenbrand's avatar
      mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk · 8710f6ed
      David Hildenbrand authored
      Let's remove yet another follow_page() user.  Note that we have to do the
      split without holding the PTL, after folio_walk_end().  We don't care
      about losing the secretmem check in follow_page().
      
      [david@redhat.com: teach can_split_folio() that we are not holding an additional reference]
        Link: https://lkml.kernel.org/r/c75d1c6c-8ea6-424f-853c-1ccda6c77ba2@redhat.com
      Link: https://lkml.kernel.org/r/20240802155524.517137-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8710f6ed
    • David Hildenbrand's avatar
      mm/ksm: convert scan_get_next_rmap_item() from follow_page() to folio_walk · b1d3e9bb
      David Hildenbrand authored
      Let's use folio_walk instead, for example avoiding taking temporary folio
      references if the folio does obviously not even apply and getting rid of
      one more follow_page() user.  We cannot move all handling under the PTL,
      so leave the rmap handling (which implies an allocation) out.
      
      Note that zeropages obviously don't apply: old code could just have
      specified FOLL_DUMP.  Further, we don't care about losing the secretmem
      check in follow_page(): these are never anon pages and
      vma_ksm_compatible() would never consider secretmem vmas (VM_SHARED |
      VM_MAYSHARE must be set for secretmem, see secretmem_mmap()).
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b1d3e9bb