1. 27 Sep, 2022 (40 commits)
    • kernel/fork: use maple tree for dup_mmap() during forking · c9dbe82c
      Liam R. Howlett authored
      The maple tree was already tracking VMAs in this function by an earlier
      commit, but the rbtree iterator was being used to iterate the list.
      Change the iterator to use a maple tree native iterator and switch to the
      maple tree advanced API to avoid multiple walks of the tree during insert
      operations.  Unexport the now-unused vma_store() function.
      
      For performance reasons we bulk allocate the maple tree nodes.  The node
      calculations are done internally to the tree and use the VMA count and
      assume the worst-case node requirements.  The VM_DONTCOPY flag does not
      allow for the most efficient copy method of the tree and so a bulk loading
      algorithm is used.
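
      A minimal sketch of the shape of the new loop (not the exact kernel/fork.c
      code; per-VMA setup and error handling are elided), using the advanced API:

        MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
        MA_STATE(mas, &mm->mm_mt, 0, 0);
        struct vm_area_struct *mpnt, *tmp;

        /* Bulk preallocate nodes for the expected number of VMAs. */
        if (mas_expected_entries(&mas, oldmm->map_count))
                goto fail_nomem;

        mas_for_each(&old_mas, mpnt, ULONG_MAX) {
                tmp = vm_area_dup(mpnt);        /* copy of the old VMA */
                /* ... per-VMA setup elided ... */
                mas.index = tmp->vm_start;
                mas.last = tmp->vm_end - 1;
                mas_store(&mas, tmp);           /* no extra tree walk per insert */
        }
        mas_destroy(&mas);                      /* return unused preallocated nodes */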
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-15-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c9dbe82c
    • mm/mmap: use maple tree for unmapped_area{_topdown} · 3499a131
      Liam R. Howlett authored
      The maple tree code was added to find the unmapped area in a previous
      commit and was checked against what the rbtree returned, but the actual
      result was never used.  Start using the maple tree implementation and
      remove the rbtree code.
      
      Add kernel documentation comment for these functions.
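
      For reference, the gap search itself reduces to mas_empty_area() for the
      bottom-up case and mas_empty_area_rev() for the top-down case; a minimal
      sketch, with low_limit, high_limit and length standing in for the
      vm_unmapped_area_info fields:

        MA_STATE(mas, &current->mm->mm_mt, 0, 0);

        /* Lowest gap of at least 'length' bytes within [low_limit, high_limit): */
        if (mas_empty_area(&mas, low_limit, high_limit - 1, length))
                return -ENOMEM;
        gap = mas.index;

        /* Top-down variant: highest suitable gap instead. */
        if (mas_empty_area_rev(&mas, low_limit, high_limit - 1, length))
                return -ENOMEM;
        gap = mas.last + 1 - length;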
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-14-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3499a131
    • mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree · 7fdbd37d
      Liam R. Howlett authored
      Use the maple tree's advanced API and a maple state to walk the tree for
      the entry at the address of the next vma, then use the maple state to walk
      back one entry to find the previous entry.
      
      Add kernel documentation comments for this API.
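
      A minimal sketch of the resulting walk (close to, but not necessarily
      identical to, the final mm/mmap.c code):

        struct vm_area_struct *
        find_vma_prev(struct mm_struct *mm, unsigned long addr,
                      struct vm_area_struct **pprev)
        {
                struct vm_area_struct *vma;
                MA_STATE(mas, &mm->mm_mt, addr, addr);

                vma = mas_walk(&mas);           /* entry containing addr, if any */
                *pprev = mas_prev(&mas, 0);     /* walk back one entry */
                if (!vma)
                        vma = mas_next(&mas, ULONG_MAX);
                return vma;
        }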
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-13-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7fdbd37d
    • mm/mmap: use the maple tree in find_vma() instead of the rbtree. · be8432e7
      Liam R. Howlett authored
      The maple tree interface mt_find() handles the RCU locking and searches
      from the given address up to the limit, ULONG_MAX in this case.
      
      Add kernel documentation to this API.
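
      With mt_find() the lookup reduces to roughly the following (a sketch of the
      resulting helper):

        struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
        {
                unsigned long index = addr;

                mmap_assert_locked(mm);
                /* First VMA whose range ends above addr, searching up to ULONG_MAX. */
                return mt_find(&mm->mm_mt, &index, ULONG_MAX);
        }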
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-12-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be8432e7
    • mmap: use the VMA iterator in count_vma_pages_range() · 2e3af1db
      Matthew Wilcox (Oracle) authored
      This simplifies the implementation and is faster than using the linked
      list.
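
      A sketch of the iterator-based version, assuming the for_each_vma_range()
      helper introduced by this series:

        static unsigned long count_vma_pages_range(struct mm_struct *mm,
                        unsigned long addr, unsigned long end)
        {
                VMA_ITERATOR(vmi, mm, addr);
                struct vm_area_struct *vma;
                unsigned long nr_pages = 0;

                for_each_vma_range(vmi, vma, end) {
                        unsigned long vm_start = max(addr, vma->vm_start);
                        unsigned long vm_end = min(end, vma->vm_end);

                        nr_pages += PHYS_PFN(vm_end - vm_start);
                }

                return nr_pages;
        }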
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-11-Liam.Howlett@oracle.com
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2e3af1db
    • mm: add VMA iterator · f39af059
      Matthew Wilcox (Oracle) authored
      This thin layer of abstraction over the maple tree state is for iterating
      over VMAs.  You can go forwards, go backwards or ask where the iterator
      is.  Rename the existing vma_next() to __vma_next() -- it will be removed
      by the end of this series.
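
      A sketch of typical usage of the new iterator (mm and nr are placeholder
      variables for illustration):

        struct vm_area_struct *vma;
        VMA_ITERATOR(vmi, mm, 0);

        for_each_vma(vmi, vma)
                nr++;                   /* visit every VMA in address order */

        vma = vma_next(&vmi);           /* next VMA at or after the current position */
        vma = vma_prev(&vmi);           /* step the iterator back one entry */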
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-10-Liam.Howlett@oracle.com
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f39af059
    • mm: start tracking VMAs with maple tree · d4af56c5
      Liam R. Howlett authored
      Start tracking the VMAs with the new maple tree structure in parallel with
      the rb_tree.  Add debug and trace events for maple tree operations and
      duplicate the rb_tree that is created on forks into the maple tree.
      
      The maple tree is added to the mm_struct, including the mm_init struct;
      support is added to the required mm/mmap functions; tracking is added in
      kernel/fork for process forking; and the tree is used to find the
      unmapped_area, which is checked against what the rbtree finds.
      
      This also moves the mmap_lock() in exit_mmap() since the oom reaper call
      does walk the VMAs.  Otherwise lockdep will be unhappy if oom happens.
      
      When splitting a vma fails due to allocations of the maple tree nodes,
      the error path in __split_vma() calls new->vm_ops->close(new).  The page
      accounting for hugetlb is actually in the close() operation,  so it
      accounts for the removal of 1/2 of the VMA which was not adjusted.  This
      results in a negative exit value.  To avoid the negative charge, set
      vm_start = vm_end and vm_pgoff = 0.
      
      There is also a potential accounting issue in special mappings from
      insert_vm_struct() failing to allocate, so reverse the charge there in
      the failure scenario.
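
      A sketch of the __split_vma() error-path workaround described above
      (surrounding cleanup and labels elided):

        /* Avoid vm accounting in the close() operation by emptying the range. */
        new->vm_start = new->vm_end;
        new->vm_pgoff = 0;
        if (new->vm_ops && new->vm_ops->close)
                new->vm_ops->close(new);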
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d4af56c5
    • lib/test_maple_tree: add testing for maple tree · e15e06a8
      Liam R. Howlett authored
      This is a test suite that uses the radix test infrastructure.  It has been
      split into its own commit to allow for easier review of the maple tree
      code.
      
      The testing includes:
      - Allocation of nodes
      - gfp flag allocation checks
      - Expansion & contraction of tree
      - preallocation checks
      - tree navigation by next/prev
      - tree navigation by iterators (mas_for_each, etc)
      - Number of nodes for a given number of entries
      - Generic tree construction tests
      - Addition and removal of entries in forward and reverse numerical indexes
      - gap searching both forward and reverse
      - Combining gaps by overwriting entries in different ways
      - splitting right-most node
      - splitting left-most node
      - overwriting multiple slots
      - overwriting across different levels of the tree
      - overwriting the middle of a tree
      - causing a 3-way split up to the root by overwriting the last slot and
        first slot of different nodes and spanning different levels
      - RCU stress testing of the tree with threads
      - Duplication of the tree by entry count
      - Tests which were generated by fuzzers have been added.
      - A large number of tests which come from recording crashes in a VM and
        reconstructing the tree (see check_erase2_set())
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-8-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e15e06a8
    • radix tree test suite: add lockdep_is_held to header · c349fa18
      Liam R. Howlett authored
      The maple tree uses lockdep_is_held(), so define it as external in the header.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-7-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c349fa18
    • radix tree test suite: add support for slab bulk APIs · cc86e0c2
      Liam R. Howlett authored
      Add support for kmem_cache_free_bulk() and kmem_cache_alloc_bulk() to the
      radix tree test suite.
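
      For reference, these mirror the in-kernel slab bulk interfaces, which have
      the following signatures:

        /* Allocate up to 'size' objects; returns the number actually allocated. */
        int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
                                  void **p);

        /* Free 'size' objects previously obtained from the cache. */
        void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);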
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-6-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cc86e0c2
    • radix tree test suite: add allocation counts and size to kmem_cache · 000a4493
      Liam R. Howlett authored
      Add functions to get the number of allocations, and total allocations from
      a kmem_cache.  Also add a function to get the allocated size and a way to
      zero the total allocations.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-5-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      000a4493
    • radix tree test suite: add kmem_cache_set_non_kernel() · e73cb368
      Liam R. Howlett authored
      kmem_cache_set_non_kernel() is a mechanism to allow a certain number of
      kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set in
      the flags.  This functionality allows for testing different paths through
      the code.
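
      A minimal usage sketch in a test (the cache name is illustrative only):

        /*
         * Let the next two allocations succeed even without GFP_KERNEL set,
         * so the atomic / externally locked allocation paths can be exercised.
         */
        kmem_cache_set_non_kernel(test_cache, 2);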
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-4-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e73cb368
    • radix tree test suite: add pr_err define · fbeea9d1
      Liam R. Howlett authored
      define pr_err to printk
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-3-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fbeea9d1
    • Maple Tree: add new data structure · 54a611b6
      Liam R. Howlett authored
      Patch series "Introducing the Maple Tree"
      
      The maple tree is an RCU-safe range based B-tree designed to use modern
      processor cache efficiently.  There are a number of places in the kernel
      that a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      Davidlohr said
      
      : Yes I like the maple tree, and at this stage I don't think we can ask for
      : more from this series wrt the MM - albeit there seems to still be some
      : folks reporting breakage.  Fundamentally I see Liam's work to (re)move
      : complexity out of the MM (not to say that the actual maple tree is not
      : complex) by consolidating the three complementary data structures very
      : much worth it considering performance does not take a hit.  This was very
      : much a turn off with the range locking approach, which worst case scenario
      : incurred in prohibitive overhead.  Also as Liam and Matthew have
      : mentioned, RCU opens up a lot of nice performance opportunities, and in
      : addition academia[1] has shown outstanding scalability of address spaces
      : with the foundation of replacing the locked rbtree with RCU aware trees.
      
      A similar work has been discovered in the academic press
      
      	https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
      
      Sheer coincidence.  We designed our tree with the intention of solving the
      hardest problem first.  Upon settling on a b-tree variant and a rough
      outline, we researched range-based b-trees and RCU b-trees and did find
      that article.  So it was nice to find reassurances that we were on the
      right path, but our design choice of using ranges made that paper unusable
      for us.
      
      This patch (of 70):
      
      The maple tree is an RCU-safe range based B-tree designed to use modern
      processor cache efficiently.  There are a number of places in the kernel
      that a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      There are additional BUG_ON() calls added within the tree, most of which
      are in debug code.  These will be replaced with a WARN_ON() call in the
      future.  There are also additional BUG_ON() calls within the code which
      will also be reduced in number at a later date.  These exist to catch
      things such as out-of-range accesses which would crash anyway.
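
      For orientation, the simplest (non-advanced) interface operates on ranges
      directly; a minimal sketch using the normal API (ranges and values are
      arbitrary):

        #include <linux/maple_tree.h>

        static DEFINE_MTREE(mt);        /* empty tree using the internal lock */

        static int maple_example(void)
        {
                void *entry;
                int ret;

                /* Store one entry covering the whole range [10, 19]. */
                ret = mtree_store_range(&mt, 10, 19, xa_mk_value(0x1234), GFP_KERNEL);
                if (ret)
                        return ret;

                /* Any index within the range returns the same entry. */
                entry = mtree_load(&mt, 15);

                /* Erasing by any index inside the range removes the whole range. */
                mtree_erase(&mt, 12);
                return entry ? 0 : -ENOENT;
        }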
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
      Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Sven Schnelle <svens@linux.ibm.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      54a611b6
    • mm/demotion: expose memory tier details via sysfs · 9832fb87
      Aneesh Kumar K.V authored
      Add /sys/devices/virtual/memory_tiering/ where all memory tier related
      details can be found.  All allocated memory tiers will be listed there as
      /sys/devices/virtual/memory_tiering/memory_tierN/
      
      The nodes which are part of a specific memory tier can be listed via
      /sys/devices/virtual/memory_tiering/memory_tierN/nodes
      
      A directory hierarchy looks like
      :/sys/devices/virtual/memory_tiering$ tree memory_tier4/
      memory_tier4/
      ├── nodes
      ├── subsystem -> ../../../../bus/memory_tiering
      └── uevent
      
      :/sys/devices/virtual/memory_tiering$ cat memory_tier4/nodes
      0,2
      
      [aneesh.kumar@linux.ibm.com: drop toptier_nodes from sysfs]
        Link: https://lkml.kernel.org/r/20220922102201.62168-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220830081736.119281-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9832fb87
    • lib/nodemask: optimize node_random for nodemask with single NUMA node · 3e061d92
      Aneesh Kumar K.V authored
      The most common case for certain node_random usage (demotion nodemask) is
      with nodemask weight 1.  We can avoid calling get_random_int() in that
      case and always return the only node set in the nodemask.
      
      A simple test as below:
        before = rdtsc_ordered();
        for (i = 0; i < 100; i++) {
            rand = node_random(&nmask);
        }
        after = rdtsc_ordered();

      Without fix, after - before: 16438
      With fix, after - before: 816
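
      A sketch of the fast path this adds (the default arm keeps the existing
      random selection):

        w = nodes_weight(*maskp);
        switch (w) {
        case 0:
                bit = NUMA_NO_NODE;             /* empty mask */
                break;
        case 1:
                bit = first_node(*maskp);       /* single node: no RNG call needed */
                break;
        default:
                bit = bitmap_ord_to_pos(maskp->bits,
                                        get_random_int() % w, MAX_NUMNODES);
                break;
        }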
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-11-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3e061d92
    • mm/demotion: update node_is_toptier to work with memory tiers · 467b171a
      Aneesh Kumar K.V authored
      With memory tier support we can have memory only NUMA nodes in the top
      tier from which we want to avoid promotion tracking NUMA faults.  Update
      node_is_toptier to work with memory tiers.  All NUMA nodes are by default
      top tier nodes.  With lower (slower) memory tiers added, we consider all
      memory tiers above a memory tier having CPU NUMA nodes to be top memory
      tiers.
      
      [sj@kernel.org: include missed header file, memory-tiers.h]
        Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
      [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
      [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
        Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      467b171a
    • mm/demotion: demote pages according to allocation fallback order · 32008027
      Jagdish Gediya authored
      Currently, a higher tier node can only be demoted to selected nodes on the
      next lower tier as defined by the demotion path.  This strict demotion
      order does not work in all use cases (e.g.  some use cases may want to
      allow cross-socket demotion to another node in the same demotion tier as a
      fallback when the preferred demotion node is out of space).  This demotion
      order is also inconsistent with the page allocation fallback order when
      all the nodes in a higher tier are out of space: The page allocation can
      fall back to any node from any lower tier, whereas the demotion order
      doesn't allow that currently.
      
      This patch adds support to get all the allowed demotion targets for a
      memory tier.  The demote_page_list() function is now modified to utilize this
      allowed node mask as the fallback allocation mask.
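
      A sketch of how a caller can combine the new helper with the migration
      target control (gfp flags simplified; node_get_allowed_targets() is the
      interface added by this patch):

        nodemask_t allowed_mask = NODE_MASK_NONE;
        struct migration_target_control mtc = {
                .gfp_mask = GFP_HIGHUSER_MOVABLE,
                .nid = target_nid,              /* preferred demotion node */
                .nmask = &allowed_mask,         /* fallback: any allowed lower-tier node */
        };

        node_get_allowed_targets(pgdat, &allowed_mask);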
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-9-aneesh.kumar@linux.ibm.com
      Signed-off-by: Jagdish Gediya <jvgediya.oss@gmail.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      32008027
    • mm/demotion: drop memtier from memtype · b26ac6f3
      Aneesh Kumar K.V authored
      Now that we track node-specific memtier in pg_data_t, we can drop memtier
      from memtype.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-8-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b26ac6f3
    • mm/demotion: add pg_data_t member to track node memory tier details · 7766cf7a
      Aneesh Kumar K.V authored
      Also update the different helpers to use NODE_DATA()->memtier.  Since the
      node-specific memtier can change based on the reassignment of a NUMA node to
      a different memory tier, accessing NODE_DATA()->memtier needs to happen
      under an RCU read lock or memory_tier_lock.
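
      A minimal sketch of a reader (memtier being the pointer this patch adds to
      pg_data_t):

        struct memory_tier *memtier;

        rcu_read_lock();
        memtier = rcu_dereference(NODE_DATA(node)->memtier);
        /* ... use memtier, e.g. look at its abstract distance range ... */
        rcu_read_unlock();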
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-7-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7766cf7a
    • mm/demotion: build demotion targets based on explicit memory tiers · 6c542ab7
      Aneesh Kumar K.V authored
      This patch switches the demotion target building logic to use memory tiers
      instead of NUMA distance.  All N_MEMORY NUMA nodes will be placed in the
      default memory tier and additional memory tiers will be added by drivers
      like dax kmem.
      
      This patch builds the demotion target for a NUMA node by looking at all
      memory tiers below the tier to which the NUMA node belongs.  The closest
      node in the immediately following memory tier is used as a demotion
      target.
      
      Since we are now only building demotion targets for N_MEMORY NUMA nodes,
      the CPU hotplug calls are removed in this patch.
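
      Consumers still see the result through next_demotion_node(); a sketch of
      walking the demotion chain from a hypothetical source node src_nid:

        int nid = src_nid;

        while ((nid = next_demotion_node(nid)) != NUMA_NO_NODE)
                pr_info("node %d is a demotion target\n", nid);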
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6c542ab7
    • mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE · 7b88bda3
      Aneesh Kumar K.V authored
      By default, all nodes are assigned to the default memory tier, which is the
      memory tier designated for nodes with DRAM.
      
      Set the dax kmem device node's tier to a slower memory tier by assigning its
      abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE.  Low-level drivers
      like papr_scm or ACPI NFIT can initialize memory device type to a more
      accurate value based on device tree details or HMAT.  If the kernel
      doesn't find the memory type initialized, a default slower memory type is
      assigned by the kmem driver.
      
      [aneesh.kumar@linux.ibm.com: assign correct memory type for multiple dax devices with the same node affinity]
        Link: https://lkml.kernel.org/r/20220826100224.542312-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-5-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7b88bda3
    • mm/demotion: add hotplug callbacks to handle new numa node onlined · c6123a19
      Aneesh Kumar K.V authored
      If a newly onlined NUMA node doesn't have an abstract distance assigned,
      the kernel adds the NUMA node to the default memory tier.
      
      [aneesh.kumar@linux.ibm.com: fix kernel error with memory hotplug]
        Link: https://lkml.kernel.org/r/20220825092019.379069-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-4-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6123a19
    • mm/demotion: move memory demotion related code · 91952440
      Aneesh Kumar K.V authored
      This moves memory demotion related code to mm/memory-tiers.c.  No
      functional change in this patch.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91952440
    • mm/demotion: add support for explicit memory tiers · 992bf775
      Aneesh Kumar K.V authored
      Patch series "mm/demotion: Memory tiers and demotion", v15.
      
      The current kernel has the basic memory tiering support: Inactive pages on
      a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
      node to make room for new allocations on the higher tier NUMA node. 
      Frequently accessed pages on a lower tier NUMA node can be migrated
      (promoted) to a higher tier NUMA node to improve the performance.
      
      In the current kernel, memory tiers are defined implicitly via a demotion
      path relationship between NUMA nodes, which is created during the kernel
      initialization and updated when a NUMA node is hot-added or hot-removed. 
      The current implementation puts all nodes with CPU into the highest tier,
      and builds the tier hierarchy tier-by-tier by establishing the per-node
      demotion targets based on the distances between nodes.
      
      This current memory tier kernel implementation needs to be improved for
      several important use cases:
      
      * The current tier initialization code always initializes each
        memory-only NUMA node into a lower tier.  But a memory-only NUMA node
        may have a high performance memory device (e.g.  a DRAM-backed
        memory-only node on a virtual machine) and that should be put into a
        higher tier.
      
      * The current tier hierarchy always puts CPU nodes into the top tier. 
        But on a system with HBM (e.g.  GPU memory) devices, these memory-only
        HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
        better placed into the next lower tier.
      
      * Also because the current tier hierarchy always puts CPU nodes into the
        top tier, when a CPU is hot-added (or hot-removed) and turns a memory
        node from CPU-less into a CPU node (or vice versa), the memory tier
        hierarchy gets changed, even though no memory node is added or removed. 
        This can make the tier hierarchy unstable and make it difficult to
        support tier-based memory accounting.
      
      * A higher tier node can only be demoted to nodes with the shortest distance
        on the next lower tier as defined by the demotion path, not any other
        node from any lower tier.  This strict demotion order does not work in
        all use cases (e.g.  some use cases may want to allow cross-socket
        demotion to another node in the same demotion tier as a fallback when
        the preferred demotion node is out of space), and has resulted in the
        feature request for an interface to override the system-wide, per-node
        demotion order from the userspace.  This demotion order is also
        inconsistent with the page allocation fallback order when all the nodes
        in a higher tier are out of space: The page allocation can fall back to
        any node from any lower tier, whereas the demotion order doesn't allow
        that.
      
      This patch series makes the creation of memory tiers explicit under the
      control of device drivers.
      
      Memory Tier Initialization
      ==========================
      
      Linux kernel presents memory devices as NUMA nodes and each memory device
      is of a specific type.  The memory type of a device is represented by its
      abstract distance.  A memory tier corresponds to a range of abstract
      distance.  This allows for classifying memory devices with a specific
      performance range into a memory tier.
      
      By default, all memory nodes are assigned to the default tier with
      abstract distance 512.
      
      A device driver can move its memory nodes from the default tier.  For
      example, PMEM can move its memory nodes below the default tier, whereas
      GPU can move its memory nodes above the default tier.
      
      The kernel initialization code makes the decision on which exact tier a
      memory node should be assigned to based on the requests from the device
      drivers as well as the memory device hardware information provided by the
      firmware.
      
      Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
      
      
      This patch (of 10):
      
      In the current kernel, memory tiers are defined implicitly via a demotion
      path relationship between NUMA nodes, which is created during the kernel
      initialization and updated when a NUMA node is hot-added or hot-removed. 
      The current implementation puts all nodes with CPU into the highest tier,
      and builds the tier hierarchy by establishing the per-node demotion
      targets based on the distances between nodes.
      
      This current memory tier kernel implementation needs to be improved for
      several important use cases:
      
      The current tier initialization code always initializes each memory-only
      NUMA node into a lower tier.  But a memory-only NUMA node may have a high
      performance memory device (e.g.  a DRAM-backed memory-only node on a
      virtual machine) that should be put into a higher tier.
      
      The current tier hierarchy always puts CPU nodes into the top tier.  But
      on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
      these devices should be in the top tier, and DRAM nodes with CPUs are
      better placed into the next lower tier.
      
      With the current kernel, a higher tier node can only be demoted to nodes
      with the shortest distance on the next lower tier as defined by the demotion
      path, not to any other node from any lower tier.  This strict demotion order
      does not work in all use cases (e.g.  some use cases may want to allow
      cross-socket demotion to another node in the same demotion tier as a
      fallback when the preferred demotion node is out of space).  This demotion
      order is also inconsistent with the page allocation fallback order when
      all the nodes in a higher tier are out of space: the page allocation can
      fall back to any node from any lower tier, whereas the demotion order
      doesn't allow that.
      
      This patch series addresses the above by defining memory tiers explicitly.
      
      Linux kernel presents memory devices as NUMA nodes and each memory device
      is of a specific type.  The memory type of a device is represented by its
      abstract distance.  A memory tier corresponds to a range of abstract
      distance.  This allows for classifying memory devices with a specific
      performance range into a memory tier.
      
      This patch configures the range/chunk size to be 128.  The default DRAM
      abstract distance is 512.  We can have 4 memory tiers below the default
      DRAM abstract distance, with ranges 0-127, 128-255, 256-383 and 384-511.
      Faster memory devices can be placed in these faster (higher) memory tiers.
      Slower memory devices like persistent memory will have abstract distance
      higher than the default DRAM level.
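
      In other words, a memory tier is just the 128-wide bucket an abstract
      distance falls into; a sketch of the mapping (memtier_index() is an
      illustrative helper, not part of the series):

        #define MEMTIER_CHUNK_BITS      7
        #define MEMTIER_CHUNK_SIZE      (1 << MEMTIER_CHUNK_BITS)       /* 128 */
        #define MEMTIER_ADISTANCE_DRAM  (4 * MEMTIER_CHUNK_SIZE)        /* 512 */

        /* Devices whose adistance rounds down to the same chunk share a tier. */
        static inline int memtier_index(int adistance)
        {
                return adistance >> MEMTIER_CHUNK_BITS;
        }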
      
      [akpm@linux-foundation.org: fix comment, per Aneesh]
      Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      992bf775
    • mm: multi-gen LRU: design doc · 8be976a0
      Yu Zhao authored
      Add a design doc.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8be976a0
    • mm: multi-gen LRU: admin guide · 07017acb
      Yu Zhao authored
      Add an admin guide.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      07017acb
    • mm: multi-gen LRU: debugfs interface · d6c3af7d
      Yu Zhao authored
      Add /sys/kernel/debug/lru_gen for working set estimation and proactive
      reclaim.  These techniques are commonly used to optimize job scheduling
      (bin packing) in data centers [1][2].
      
      Compared with the page table-based approach and the PFN-based
      approach, this lruvec-based approach has the following advantages:
      1. It offers better choices because it is aware of memcgs, NUMA nodes,
         shared mappings and unmapped page cache.
      2. It is more scalable because it is O(nr_hot_pages), whereas the
         PFN-based approach is O(nr_total_pages).
      
      Add /sys/kernel/debug/lru_gen_full for debugging.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-13-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d6c3af7d
    • mm: multi-gen LRU: thrashing prevention · 1332a809
      Yu Zhao authored
      Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
      requested by many desktop users [1].
      
      When set to value N, it prevents the working set of N milliseconds from
      getting evicted.  The OOM killer is triggered if this working set cannot
      be kept in memory.  Based on the average human detectable lag (~100ms),
      N=1000 usually eliminates intolerable lags due to thrashing.  Larger
      values like N=3000 make lags less noticeable at the risk of premature OOM
      kills.
      
      Compared with the size-based approach [2], this time-based approach
      has the following advantages:
      
      1. It is easier to configure because it is agnostic to applications
         and memory sizes.
      2. It is more reliable because it is directly wired to the OOM killer.
      
      [1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
      [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-12-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1332a809
    • Yu Zhao's avatar
      mm: multi-gen LRU: kill switch · 354ed597
      Yu Zhao authored
      Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
      can be disabled include:
        0x0001: the multi-gen LRU core
        0x0002: walking page tables, when arch_has_hw_pte_young() returns
                true
        0x0004: clearing the accessed bit in non-leaf PMD entries, when
                CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
        [yYnN]: apply to all the components above
      E.g.,
        echo y >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0007
        echo 5 >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0005
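
      The same bit assignments, as a minimal C sketch (the constant names
      below are hypothetical, not the kernel's):

        #include <stdbool.h>

        enum {
                LRU_GEN_CORE_BIT        = 0x0001,  /* the multi-gen LRU core */
                LRU_GEN_MM_WALK_BIT     = 0x0002,  /* page table walks */
                LRU_GEN_NONLEAF_PMD_BIT = 0x0004,  /* non-leaf PMD accessed bit */
        };

        static bool mm_walk_enabled(unsigned int caps)
        {
                /* e.g., caps == 0x0005 keeps the core but disables the walks */
                return caps & LRU_GEN_MM_WALK_BIT;
        }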
      
      NB: the page table walks happen on the scale of seconds under heavy memory
      pressure, in which case the mmap_lock contention is a lesser concern,
      compared with the LRU lock contention and the I/O congestion.  So far the
      only well-known case of the mmap_lock contention happens on Android, due
      to Scudo [1] which allocates several thousand VMAs for merely a few
      hundred MBs.  The SPF and Maple Tree patchsets have also provided their own
      assessments [2][3].  However, if walking page tables does worsen the
      mmap_lock contention, the kill switch can be used to disable it.  In this
      case the multi-gen LRU will suffer a minor performance degradation, as
      shown previously.
      
      Clearing the accessed bit in non-leaf PMD entries can also be disabled,
      since this behavior was not tested on x86 varieties other than Intel and
      AMD.
      
      [1] https://source.android.com/devices/tech/debug/scudo
      [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
      [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      354ed597
    • Yu Zhao's avatar
      mm: multi-gen LRU: optimize multiple memcgs · f76c8337
      Yu Zhao authored
      When multiple memcgs are available, it is possible to use generations as a
      frame of reference to make better choices and improve overall performance
      under global memory pressure.  This patch adds a basic optimization to
      select memcgs that can drop single-use unmapped clean pages first.  Doing
      so reduces the chance of going into the aging path or swapping, which can
      be costly.
      
      A typical example that benefits from this optimization is a server running
      mixed types of workloads, e.g., heavy anon workload in one memcg and heavy
      buffered I/O workload in the other.
      
      Though this optimization can be applied to both kswapd and direct reclaim,
      it is only added to kswapd to keep the patchset manageable.  Later
      improvements may cover the direct reclaim path.
      
      While ensuring certain fairness to all eligible memcgs, proportional scans
      of individual memcgs also require proper backoff to avoid overshooting
      their aggregate reclaim target by too much.  Otherwise it can cause high
      direct reclaim latency.  The conditions for backoff (see the sketch after
      this list) are:
      
      1. At low priorities, for direct reclaim, if aging fairness or direct
         reclaim latency is at risk, i.e., aging one memcg multiple times or
         swapping after the target is met.
      2. At high priorities, for global reclaim, if per-zone free pages are
         above respective watermarks.
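
      A minimal sketch of these two conditions (field and function names are
      hypothetical and make no claim to match the kernel code):

        #include <stdbool.h>

        struct scan_state {
                bool low_priority;           /* early, less aggressive passes */
                bool direct_reclaim;         /* vs. kswapd */
                bool global_reclaim;         /* vs. memcg-targeted reclaim */
                bool aged_memcg_twice;       /* aging fairness at risk */
                bool swapped_past_target;    /* direct reclaim latency at risk */
                bool free_above_watermarks;  /* per-zone free pages look healthy */
        };

        static bool should_back_off(const struct scan_state *s)
        {
                /* condition 1 above */
                if (s->low_priority && s->direct_reclaim &&
                    (s->aged_memcg_twice || s->swapped_past_target))
                        return true;
                /* condition 2 above */
                if (!s->low_priority && s->global_reclaim &&
                    s->free_above_watermarks)
                        return true;
                return false;
        }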
      
      Server benchmark results:
        Mixed workloads:
          fio (buffered I/O): +[19, 21]%
                      IOPS         BW
            patch1-8: 1880k        7343MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[119, 123]%
                      Ops/sec      KB/sec
            patch1-8: 862768.65    33514.68
            patch1-9: 1911022.12   74234.54
      
        Mixed workloads:
          fio (buffered I/O): +[75, 77]%
                      IOPS         BW
            5.19-rc1: 1279k        4996MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[13, 15]%
                      Ops/sec      KB/sec
            5.19-rc1: 1673524.04   65008.87
            patch1-9: 1911022.12   74234.54
      
        Configurations:
          (changes since patch 6)
      
          cat mixed.sh
          modprobe brd rd_nr=2 rd_size=56623104
      
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          mkfs.ext4 /dev/ram1
          mount -t ext4 /dev/ram1 /mnt
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=90m --group_reporting &
          pid=$!
      
          sleep 200
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
          kill -INT $pid
          wait
      
      Client benchmark results:
        no change (CONFIG_MEMCG=n)
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f76c8337
    • Yu Zhao's avatar
      mm: multi-gen LRU: support page table walks · bd74fdae
      Yu Zhao authored
      To further exploit spatial locality, the aging prefers to walk page tables
      to search for young PTEs and promote hot pages.  A kill switch will be
      added in the next patch to disable this behavior.  When disabled, the
      aging relies on the rmap only.
      
      NB: this behavior has nothing in common with the page table scanning in the
      2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
      to swapcache and unmaps them.
      
      To avoid confusion, the term "iteration" specifically means the traversal
      of an entire mm_struct list; the term "walk" will be applied to page
      tables and the rmap, as usual.
      
      An mm_struct list is maintained for each memcg, and an mm_struct follows
      its owner task to the new memcg when this task is migrated.  Given an
      lruvec, the aging iterates lruvec_memcg()->mm_list and calls
      walk_page_range() with each mm_struct on this list to promote hot pages
      before it increments max_seq.
      
      When multiple page table walkers iterate the same list, each of them gets
      a unique mm_struct; therefore they can run concurrently.  Page table
      walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
      pages it left in the previous memcg will not be promoted when its current
      memcg is under reclaim.  Similarly, page table walkers will not promote
      pages from nodes other than the one under reclaim.
      
      This patch uses the following optimizations when walking page tables:
      1. It tracks the usage of mm_struct's between context switches so that
         page table walkers can skip processes that have been sleeping since
         the last iteration.
      2. It uses generational Bloom filters to record populated branches so
         that page table walkers can reduce their search space based on the
         query results, e.g., to skip page tables containing mostly holes or
         misplaced pages (a sketch of this idea follows the list).
      3. It takes advantage of the accessed bit in non-leaf PMD entries when
         CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
      4. It does not zigzag between a PGD table and the same PMD table
         spanning multiple VMAs. IOW, it finishes all the VMAs within the
         range of the same PMD table before it returns to a PGD table. This
         improves the cache performance for workloads that have large
         numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
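
      A minimal userspace sketch of the generational Bloom filter mentioned
      in item 2 (hash functions, sizes and names are hypothetical, not the
      kernel's):

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define SKETCH_BLOOM_BITS (1u << 15)

        struct gen_bloom {
                /* two filters: [0]/[1] alternate between previous and current */
                uint8_t bits[2][SKETCH_BLOOM_BITS / 8];
                int cur;
        };

        static uint32_t hash1(uint64_t key) { return (uint32_t)((key * 0x9e3779b97f4a7c15ull) >> 49); }
        static uint32_t hash2(uint64_t key) { return (uint32_t)((key * 0xc2b2ae3d27d4eb4full) >> 49); }

        /* Record that the range identified by 'key' contained young PTEs. */
        static void bloom_set(struct gen_bloom *b, uint64_t key)
        {
                uint32_t h1 = hash1(key), h2 = hash2(key);

                b->bits[b->cur][h1 / 8] |= 1u << (h1 % 8);
                b->bits[b->cur][h2 / 8] |= 1u << (h2 % 8);
        }

        /* Query the previous iteration's filter to decide whether to skip a range. */
        static bool bloom_test_prev(const struct gen_bloom *b, uint64_t key)
        {
                int prev = !b->cur;
                uint32_t h1 = hash1(key), h2 = hash2(key);

                return (b->bits[prev][h1 / 8] & (1u << (h1 % 8))) &&
                       (b->bits[prev][h2 / 8] & (1u << (h2 % 8)));
        }

        /* Start a new iteration: the current filter becomes the previous one. */
        static void bloom_new_iteration(struct gen_bloom *b)
        {
                b->cur = !b->cur;
                memset(b->bits[b->cur], 0, sizeof(b->bits[b->cur]));
        }

      A walker would use, e.g., a PMD table address as the key and skip the
      range when bloom_test_prev() returns false; a false positive only
      costs an unnecessary walk of a sparse range.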
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[8, 10]%
                      Ops/sec      KB/sec
            patch1-7: 1147696.57   44640.29
            patch1-8: 1245274.91   48435.66
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
          patch1-8
            49.44%  lzo1x_1_do_compress (real work)
             6.19%  page_vma_mapped_walk (overhead)
             5.97%  _raw_spin_unlock_irq
             3.13%  get_pfn_folio
             2.85%  ptep_clear_flush
             2.42%  __zram_bvec_write
             2.08%  do_raw_spin_lock
             1.92%  memmove
             1.44%  alloc_zspage
             1.36%  memset
      
        Configurations:
          no change
      
      Thanks to the following developers for their efforts [3].
        kernel test robot <lkp@intel.com>
      
      [1] https://lwn.net/Articles/23732/
      [2] https://llvm.org/docs/ScudoHardenedAllocator.html
      [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bd74fdae
    • Yu Zhao's avatar
      mm: multi-gen LRU: exploit locality in rmap · 018ee47f
      Yu Zhao authored
      Searching the rmap for PTEs mapping each page on an LRU list (to test and
      clear the accessed bit) can be expensive because pages from different VMAs
      (PA space) are not cache friendly to the rmap (VA space).  For workloads
      mostly using mapped pages, searching the rmap can incur the highest CPU
      cost in the reclaim path.
      
      This patch exploits spatial locality to reduce the trips into the rmap. 
      When shrink_page_list() walks the rmap and finds a young PTE, a new
      function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
      PTEs.  On finding another young PTE, it clears the accessed bit and
      updates the gen counter of the page mapped by this PTE to
      (max_seq%MAX_NR_GENS)+1.
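
      A simplified userspace model of this look-around (the 512-entry table
      and field names are stand-ins, not the kernel's data structures):

        #include <stdbool.h>

        #define MAX_NR_GENS   4
        #define BITS_PER_LONG (8 * (int)sizeof(long))

        struct sim_pte {
                bool young;        /* simulated accessed bit */
                unsigned int gen;  /* simulated gen counter of the mapped folio */
        };

        /*
         * The rmap found a young PTE at index 'hit': scan at most
         * BITS_PER_LONG-1 nearby entries, clear their accessed bits and
         * promote the mapped pages to (max_seq % MAX_NR_GENS) + 1.
         */
        static void look_around(struct sim_pte pt[512], int hit,
                                unsigned long max_seq)
        {
                int span = BITS_PER_LONG - 1;
                int start = hit - span / 2;

                if (start < 0)
                        start = 0;
                if (start + span > 512)
                        start = 512 - span;

                for (int i = start; i < start + span; i++) {
                        if (i == hit || !pt[i].young)
                                continue;
                        pt[i].young = false;
                        pt[i].gen = (unsigned int)(max_seq % MAX_NR_GENS) + 1;
                }
        }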
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[3, 5]%
                      Ops/sec      KB/sec
            patch1-6: 1106168.46   43025.04
            patch1-7: 1147696.57   44640.29
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
        Configurations:
          no change
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarBarry Song <baohua@kernel.org>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      018ee47f
    • Yu Zhao's avatar
      mm: multi-gen LRU: minimal implementation · ac35a490
      Yu Zhao authored
      To avoid confusion, the terms "promotion" and "demotion" will be applied
      to the multi-gen LRU, as a new convention; the terms "activation" and
      "deactivation" will be applied to the active/inactive LRU, as usual.
      
      The aging produces young generations.  Given an lruvec, it increments
      max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
      hot pages to the youngest generation when it finds them accessed through
      page tables; the demotion of cold pages happens consequently when it
      increments max_seq.  Promotion in the aging path does not involve any LRU
      list operations, only the updates of the gen counter and
      lrugen->nr_pages[]; demotion, unless as the result of the increment of
      max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
      aging has the complexity O(nr_hot_pages), since it is only interested in
      hot pages.
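
      In sketch form (hedged; not the kernel code), the aging trigger
      described above is simply:

        #include <stdbool.h>

        #define MIN_NR_GENS 2

        /* Age (increment max_seq) when only MIN_NR_GENS generations remain. */
        static bool need_aging(unsigned long max_seq, unsigned long min_seq)
        {
                return max_seq - min_seq + 1 <= MIN_NR_GENS;
        }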
      
      The eviction consumes old generations.  Given an lruvec, it increments
      min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
      A feedback loop modeled after the PID controller monitors refaults over
      anon and file types and decides which type to evict when both types are
      available from the same generation.
      
      The protection of pages accessed multiple times through file descriptors
      takes place in the eviction path.  Each generation is divided into
      multiple tiers.  A page accessed N times through file descriptors is in
      tier order_base_2(N); a sketch of this mapping follows the list below.
      Tiers do not have dedicated lrugen->lists[], only
      bits in folio->flags.  The aforementioned feedback loop also monitors
      refaults over all tiers and decides when to protect pages in which tiers
      (N>1), using the first tier (N=0,1) as a baseline.  The first tier
      contains single-use unmapped clean pages, which are most likely the best
      choices.  In contrast to promotion in the aging path, the protection of a
      page in the eviction path is achieved by moving this page to the next
      generation, i.e., min_seq+1, if the feedback loop decides so.  This
      approach has the following advantages:
      
      1. It removes the cost of activation in the buffered access path by
         inferring whether pages accessed multiple times through file
         descriptors are statistically hot and thus worth protecting in the
         eviction path.
      2. It takes pages accessed through page tables into account and avoids
         overprotecting pages accessed multiple times through file
         descriptors. (Pages accessed through page tables are in the first
         tier, since N=0.)
      3. More tiers provide better protection for pages accessed more than
         twice through file descriptors, when under heavy buffered I/O
         workloads.
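
      As a hedged sketch of that N-to-tier mapping (plain C, not the kernel's
      order_base_2() macro):

        /* tier = order_base_2(N), i.e. ceil(log2(N)); N = 0 or 1 maps to tier 0 */
        static unsigned int refs_to_tier(unsigned int refs)
        {
                unsigned int tier = 0;

                while ((1u << tier) < refs)
                        tier++;
                return tier;    /* 0,1 -> 0; 2 -> 1; 3,4 -> 2; 5..8 -> 3 */
        }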
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): +[30, 32]%
                      IOPS         BW
            5.19-rc1: 2673k        10.2GiB/s
            patch1-6: 3491k        13.3GiB/s
      
        Single workload:
          memcached (anon): -[4, 6]%
                      Ops/sec      KB/sec
            5.19-rc1: 1161501.04   45177.25
            patch1-6: 1106168.46   43025.04
      
        Configurations:
          CPU: two Xeon 6154
          Mem: total 256G
      
          Node 1 was only used as a ram disk to reduce the variance in the
          results.
      
          patch drivers/block/brd.c <<EOF
          99,100c99,100
          < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
          < 	page = alloc_page(gfp_flags);
          ---
          > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
          > 	page = alloc_pages_node(1, gfp_flags, 0);
          EOF
      
          cat >>/etc/systemd/system.conf <<EOF
          CPUAffinity=numa
          NUMAPolicy=bind
          NUMAMask=0
          EOF
      
          cat >>/etc/memcached.conf <<EOF
          -m 184320
          -s /var/run/memcached/memcached.sock
          -a 0766
          -t 36
          -B binary
          EOF
      
          cat fio.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkfs.ext4 /dev/ram0
          mount -t ext4 /dev/ram0 /mnt
      
          mkdir /sys/fs/cgroup/user.slice/test
          echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
          echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
          fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=5m --group_reporting
      
          cat memcached.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
      Client benchmark results:
        kswapd profiles:
          5.19-rc1
            40.33%  page_vma_mapped_walk (overhead)
            21.80%  lzo1x_1_do_compress (real work)
             7.53%  do_raw_spin_lock
             3.95%  _raw_spin_unlock_irq
             2.52%  vma_interval_tree_iter_next
             2.37%  folio_referenced_one
             2.28%  vma_interval_tree_subtree_search
             1.97%  anon_vma_interval_tree_iter_first
             1.60%  ptep_clear_flush
             1.06%  __zram_bvec_write
      
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
        Configurations:
          CPU: single Snapdragon 7c
          Mem: total 4G
      
          ChromeOS MemoryPressure [1]
      
      [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ac35a490
    • Yu Zhao's avatar
      mm: multi-gen LRU: groundwork · ec1c86b2
      Yu Zhao authored
      Evictable pages are divided into multiple generations for each lruvec.
      The youngest generation number is stored in lrugen->max_seq for both
      anon and file types as they are aged on an equal footing. The oldest
      generation numbers are stored in lrugen->min_seq[] separately for anon
      and file types as clean file pages can be evicted regardless of swap
      constraints. These three variables are monotonically increasing.
      
      Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
      in order to fit into the gen counter in folio->flags. Each truncated
      generation number is an index to lrugen->lists[]. The sliding window
      technique is used to track at least MIN_NR_GENS and at most
      MAX_NR_GENS generations. The gen counter stores a value within [1,
      MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
      stores 0.
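
      A userspace sketch of this truncation, assuming MAX_NR_GENS = 4 (so
      order_base_2(MAX_NR_GENS+1) = 3 bits) and a hypothetical bit offset in
      the flags word:

        #define MAX_NR_GENS   4
        #define LRU_GEN_WIDTH 3     /* order_base_2(MAX_NR_GENS + 1) */
        #define LRU_GEN_PGOFF 10    /* hypothetical offset in flags */
        #define LRU_GEN_MASK  (((1ul << LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)

        /* Store seq as a value in [1, MAX_NR_GENS]; 0 means "not on a list". */
        static unsigned long flags_set_gen(unsigned long flags, unsigned long seq)
        {
                unsigned long gen = (seq % MAX_NR_GENS) + 1;

                return (flags & ~LRU_GEN_MASK) | (gen << LRU_GEN_PGOFF);
        }

        /* Recover the lrugen->lists[] index, or -1 if the page is off-list. */
        static int flags_to_gen_index(unsigned long flags)
        {
                unsigned long gen = (flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF;

                return gen ? (int)gen - 1 : -1;
        }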
      
      There are two conceptually independent procedures: "the aging", which
      produces young generations, and "the eviction", which consumes old
      generations.  They form a closed-loop system, i.e., "the page reclaim". 
      Both procedures can be invoked from userspace for the purposes of working
      set estimation and proactive reclaim.  These techniques are commonly used
      to optimize job scheduling (bin packing) in data centers [1][2].
      
      To avoid confusion, the terms "hot" and "cold" will be applied to the
      multi-gen LRU, as a new convention; the terms "active" and "inactive" will
      be applied to the active/inactive LRU, as usual.
      
      The protection of hot pages and the selection of cold pages are based
      on page access channels and patterns. There are two access channels:
      one through page tables and the other through file descriptors. The
      protection of the former channel is by design stronger because:
      1. The uncertainty in determining the access patterns of the former
         channel is higher due to the approximation of the accessed bit.
      2. The cost of evicting the former channel is higher due to the TLB
         flushes required and the likelihood of encountering the dirty bit.
      3. The penalty of underprotecting the former channel is higher because
         applications usually do not prepare themselves for major page
         faults like they do for blocked I/O. E.g., GUI applications
         commonly use dedicated I/O threads to avoid blocking rendering
         threads.
      
      There are also two access patterns: one with temporal locality and the
      other without.  For the reasons listed above, the former channel is
      assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
      present; the latter channel is assumed to follow the latter pattern unless
      outlying refaults have been observed [3][4].
      
      The next patch will address the "outlying refaults".  Three macros, i.e.,
      LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
      this patch to make the entire patchset less diffy.
      
      A page is added to the youngest generation on faulting.  The aging needs
      to check the accessed bit at least twice before handing this page over to
      the eviction.  The first check takes care of the accessed bit set on the
      initial fault; the second check makes sure this page has not been used
      since then.  This protocol, AKA second chance, requires a minimum of two
      generations, hence MIN_NR_GENS.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      [3] https://lwn.net/Articles/495543/
      [4] https://lwn.net/Articles/815342/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ec1c86b2
    • Yu Zhao's avatar
      Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" · aa1b6790
      Yu Zhao authored
      This patch undoes the following refactor: commit 289ccba1
      ("include/linux/mm_inline.h: fold __update_lru_size() into its sole
      caller")
      
      The upcoming changes to include/linux/mm_inline.h will reuse
      __update_lru_size().
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-5-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      aa1b6790
    • Yu Zhao's avatar
      mm/vmscan.c: refactor shrink_node() · f1e1a7be
      Yu Zhao authored
      This patch refactors shrink_node() to improve readability for the upcoming
      changes to mm/vmscan.c.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-4-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Reviewed-by: default avatarBarry Song <baohua@kernel.org>
      Reviewed-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      f1e1a7be
    • Yu Zhao's avatar
      mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG · eed9a328
      Yu Zhao authored
      Some architectures support the accessed bit in non-leaf PMD entries, e.g.,
      x86 sets the accessed bit in a non-leaf PMD entry when using it as part of
      linear address translation [1].  Page table walkers that clear the
      accessed bit may use this capability to reduce their search space.
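
      A hedged sketch of how a page table walker might use this capability
      (the data layout and names are illustrative, not the kernel's; the
      real feature is gated by CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG):

        #include <stdbool.h>

        #define PTRS_PER_PTE 512

        struct sim_pmd {
                bool young;                    /* accessed bit of the non-leaf entry */
                bool pte_young[PTRS_PER_PTE];  /* accessed bits of the leaf PTEs */
        };

        /* Count young PTEs, pruning PMD ranges the CPU never marked accessed. */
        static int scan_pmd(const struct sim_pmd *pmd, bool nonleaf_pmd_young)
        {
                int young = 0;

                if (nonleaf_pmd_young && !pmd->young)
                        return 0;       /* nothing below was accessed; skip 512 PTEs */

                for (int i = 0; i < PTRS_PER_PTE; i++)
                        young += pmd->pte_young[i];
                return young;
        }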
      
      Note that:
      1. Although an inline function is preferable, this capability is added
         as a configuration option for consistency with the existing macros.
      2. Due to the little interest in other varieties, this capability was
         only tested on Intel and AMD CPUs.
      
      Thanks to the following developers for their efforts [2][3].
        Randy Dunlap <rdunlap@infradead.org>
        Stephen Rothwell <sfr@canb.auug.org.au>
      
      [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
           Volume 3 (June 2021), section 4.8
      [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
      [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Reviewed-by: default avatarBarry Song <baohua@kernel.org>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      eed9a328
    • Yu Zhao's avatar
      mm: x86, arm64: add arch_has_hw_pte_young() · e1fd09e3
      Yu Zhao authored
      Patch series "Multi-Gen LRU Framework", v14.
      
      What's new
      ==========
      1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
         Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
      2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
         machines. The old direct reclaim backoff, which tries to enforce a
         minimum fairness among all eligible memcgs, over-swapped by about
         (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
         pulls the plug on swapping once the target is met, trades some
         fairness for curtailed latency:
         https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
      3. Fixed minor build warnings and conflicts. More comments and nits.
      
      TLDR
      ====
      The current page reclaim is too expensive in terms of CPU usage and it
      often makes poor choices about what to evict. This patchset offers an
      alternative solution that is performant, versatile and
      straightforward.
      
      Patchset overview
      =================
      The design and implementation overview is in patch 14:
      https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
      
      01. mm: x86, arm64: add arch_has_hw_pte_young()
      02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
      Take advantage of hardware features when trying to clear the accessed
      bit in many PTEs.
      
      03. mm/vmscan.c: refactor shrink_node()
      04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
          its sole caller"
      Minor refactors to improve readability for the following patches.
      
      05. mm: multi-gen LRU: groundwork
      Adds the basic data structure and the functions that insert pages to
      and remove pages from the multi-gen LRU (MGLRU) lists.
      
      06. mm: multi-gen LRU: minimal implementation
      A minimal implementation without optimizations.
      
      07. mm: multi-gen LRU: exploit locality in rmap
      Exploits spatial locality to improve efficiency when using the rmap.
      
      08. mm: multi-gen LRU: support page table walks
      Further exploits spatial locality by optionally scanning page tables.
      
      09. mm: multi-gen LRU: optimize multiple memcgs
      Optimizes the overall performance for multiple memcgs running mixed
      types of workloads.
      
      10. mm: multi-gen LRU: kill switch
      Adds a kill switch to enable or disable MGLRU at runtime.
      
      11. mm: multi-gen LRU: thrashing prevention
      12. mm: multi-gen LRU: debugfs interface
      Provide userspace with features like thrashing prevention, working set
      estimation and proactive reclaim.
      
      13. mm: multi-gen LRU: admin guide
      14. mm: multi-gen LRU: design doc
      Add an admin guide and a design doc.
      
      Benchmark results
      =================
      Independent lab results
      -----------------------
      Based on the popularity of searches [01] and the memory usage in
      Google's public cloud, the most popular open-source memory-hungry
      applications, in alphabetical order, are:
            Apache Cassandra      Memcached
            Apache Hadoop         MongoDB
            Apache Spark          PostgreSQL
            MariaDB (MySQL)       Redis
      
      An independent lab evaluated MGLRU with the most widely used benchmark
      suites for the above applications. They posted 960 data points along
      with kernel metrics and perf profiles collected over more than 500
      hours of total benchmark time. Their final reports show that, with 95%
      confidence intervals (CIs), the above applications all performed
      significantly better for at least part of their benchmark matrices.
      
      On 5.14:
      1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
         less wall time to sort three billion random integers, respectively,
         under the medium- and the high-concurrency conditions, when
         overcommitting memory. There were no statistically significant
         changes in wall time for the rest of the benchmark matrix.
      2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
         more transactions per minute (TPM), respectively, under the medium-
         and the high-concurrency conditions, when overcommitting memory.
         There were no statistically significant changes in TPM for the rest
         of the benchmark matrix.
      3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
         and [21.59, 30.02]% more operations per second (OPS), respectively,
         for sequential access, random access and Gaussian (distribution)
         access, when THP=always; 95% CIs [13.85, 15.97]% and
         [23.94, 29.92]% more OPS, respectively, for random access and
         Gaussian access, when THP=never. There were no statistically
         significant changes in OPS for the rest of the benchmark matrix.
      4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
         [2.16, 3.55]% more operations per second (OPS), respectively, for
         exponential (distribution) access, random access and Zipfian
         (distribution) access, when underutilizing memory; 95% CIs
         [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
         respectively, for exponential access, random access and Zipfian
         access, when overcommitting memory.
      
      On 5.15:
      5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
         and [4.11, 7.50]% more operations per second (OPS), respectively,
         for exponential (distribution) access, random access and Zipfian
         (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
         [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
         exponential access, random access and Zipfian access, when swap was
         on.
      6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
         less average wall time to finish twelve parallel TeraSort jobs,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in average wall time for the rest of the
         benchmark matrix.
      7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
         minute (TPM) under the high-concurrency condition, when swap was
         off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in TPM for the rest of the benchmark matrix.
      8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
         [11.47, 19.36]% more total operations per second (OPS),
         respectively, for sequential access, random access and Gaussian
         (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
         [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
         for sequential access, random access and Gaussian access, when
         THP=never.
      
      Our lab results
      ---------------
      To supplement the above results, we ran the following benchmark suites
      on 5.16-rc7 and found no regressions [10].
            fs_fio_bench_hdd_mq      pft
            fs_lmbench               pgsql-hammerdb
            fs_parallelio            redis
            fs_postmark              stream
            hackbench                sysbenchthread
            kernbench                tpcc_spark
            memcached                unixbench
            multichase               vm-scalability
            mutilate                 will-it-scale
            nginx
      
      [01] https://trends.google.com
      [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
      [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
      [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
      [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
      [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
      [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
      [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
      [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
      [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
      
      Real-world applications
      =======================
      Third-party testimonials
      ------------------------
      Konstantin reported [11]:
         I have Archlinux with 8G RAM + zswap + swap. While developing, I
         have lots of apps opened such as multiple LSP-servers for different
         langs, chats, two browsers, etc... Usually, my system gets quickly
         to a point of SWAP-storms, where I have to kill LSP-servers,
         restart browsers to free memory, etc, otherwise the system lags
         heavily and is barely usable.
         
         1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
         patchset, and I started up by opening lots of apps to create memory
         pressure, and worked for a day like this. Till now I had not a
         single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
         getting to the point of 3G in SWAP before without a single
         SWAP-storm.
      
      Vaibhav from IBM reported [12]:
         In a synthetic MongoDB Benchmark, seeing an average of ~19%
         throughput improvement on POWER10(Radix MMU + 64K Page Size) with
         MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
         three different request distributions, namely, Exponential, Uniform
         and Zipfian.
      
      Shuang from U of Rochester reported [13]:
         With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
         and [9.26, 10.36]% higher throughput, respectively, for random
         access, Zipfian (distribution) access and Gaussian (distribution)
         access, when the average number of jobs per CPU is 1; 95% CIs
         [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
         throughput, respectively, for random access, Zipfian access and
         Gaussian access, when the average number of jobs per CPU is 2.
      
      Daniel from Michigan Tech reported [14]:
         With Memcached allocating ~100GB of byte-addressable Optane,
         performance improvement in terms of throughput (measured as queries
         per second) was about 10% for a series of workloads.
      
      Large-scale deployments
      -----------------------
      We've rolled out MGLRU to tens of millions of ChromeOS users and
      about a million Android users. Google's fleetwide profiling [15] shows
      an overall 40% decrease in kswapd CPU usage, in addition to
      improvements in other UX metrics, e.g., an 85% decrease in the number
      of low-memory kills at the 75th percentile and an 18% decrease in
      app launch time at the 50th percentile.
      
      The downstream kernels that have been using MGLRU include:
      1. Android [16]
      2. Arch Linux Zen [17]
      3. Armbian [18]
      4. ChromeOS [19]
      5. Liquorix [20]
      6. OpenWrt [21]
      7. post-factum [22]
      8. XanMod [23]
      
      [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
      [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
      [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
      [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
      [15] https://dl.acm.org/doi/10.1145/2749469.2750392
      [16] https://android.com
      [17] https://archlinux.org
      [18] https://armbian.com
      [19] https://chromium.org
      [20] https://liquorix.net
      [21] https://openwrt.org
      [22] https://codeberg.org/pf-kernel
      [23] https://xanmod.org
      
      Summary
      =======
      The facts are:
      1. The independent lab results and the real-world applications
         indicate substantial improvements; there are no known regressions.
      2. Thrashing prevention, working set estimation and proactive reclaim
         work out of the box; there are no equivalent solutions.
      3. There is a lot of new code; no smaller change has been shown to
         have similar effects.
      
      Our options, accordingly, are:
      1. Given the amount of evidence, the reported improvements will likely
         materialize for a wide range of workloads.
      2. Gauging the interest from the past discussions, the new features
         will likely be put to use for both personal computers and data
         centers.
      3. Based on Google's track record, the new code will likely be well
         maintained in the long term. It'd be more difficult, if not
         impossible, to achieve similar effects with other approaches.
      
      
      This patch (of 14):
      
      Some architectures automatically set the accessed bit in PTEs, e.g., x86
      and arm64 v8.2.  On architectures that do not have this capability,
      clearing the accessed bit in a PTE usually triggers a page fault following
      the TLB miss of this PTE (to emulate the accessed bit).
      
      Being aware of this capability can help make better decisions, e.g.,
      whether to spread the work out over a period of time to reduce bursty page
      faults when trying to clear the accessed bit in many PTEs.
      
      Note that theoretically this capability can be unreliable, e.g.,
      hotplugged CPUs might be different from builtin ones.  Therefore it should
      not be used in architecture-independent code that involves correctness,
      e.g., to determine whether TLB flushes are required (in combination with
      the accessed bit).
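
      As a rough sketch of the idea (the helper name, the generic fallback and
      the x86 override below are assumptions drawn from this description, not
      necessarily the exact interface added by this patch), such a capability
      can be exposed as a per-architecture query:

         /* include/linux/pgtable.h: conservative generic fallback */
         #ifndef arch_has_hw_pte_young
         static inline bool arch_has_hw_pte_young(void)
         {
                 /* Assume the accessed bit is emulated via page faults. */
                 return false;
         }
         #endif

         /* arch/x86/include/asm/pgtable.h: the MMU sets the accessed bit */
         #define arch_has_hw_pte_young arch_has_hw_pte_young
         static inline bool arch_has_hw_pte_young(void)
         {
                 return true;
         }

      A caller that needs to clear the accessed bit in many PTEs can then
      check this query and, when it returns false, spread the work out over
      time instead of triggering a burst of emulated-accessed-bit page faults.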
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
      Signed-off-by: default avatarYu Zhao <yuzhao@google.com>
      Reviewed-by: default avatarBarry Song <baohua@kernel.org>
      Acked-by: default avatarBrian Geffon <bgeffon@google.com>
      Acked-by: default avatarJan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: default avatarSteven Barrett <steven@liquorix.net>
      Acked-by: default avatarSuleiman Souhlal <suleiman@google.com>
      Acked-by: default avatarWill Deacon <will@kernel.org>
      Tested-by: default avatarDaniel Byrne <djbyrne@mtu.edu>
      Tested-by: default avatarDonald Carr <d@chaos-reins.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: default avatarKonstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: default avatarShuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: default avatarSofia Trinh <sofia.trinh@edi.works>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e1fd09e3
    • Yang Yang's avatar
      mm/page_io: count submission time as thrashing delay for delayacct · 3a9bb7b1
      Yang Yang authored
      Once upon a time, we only supported accounting thrashing of the page
      cache.  Then Joonsoo introduced workingset detection for anonymous pages
      and we gained the ability to account their thrashing as well [1].
      
      Like PSI, we count submission time as thrashing delay because, when the
      device is congested or the submitting cgroup is IO-throttled, submission
      can be a significant part of the overall IO time.
      
      Without this patch, swap thrashing through frontswap or a block device
      supporting the rw_page operation isn't measured correctly.
      
      This patch is based on "delayacct: support re-entrance detection of
      thrashing accounting".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
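
      As a rough illustration (the exact call sites and helper signatures in
      mm/page_io.c are assumptions; the point is only that accounting starts
      before submission, the same way PSI's memstall accounting already does),
      the submission path would be bracketed like this:

         static void swap_read_example(void)
         {
                 unsigned long pflags;
                 bool in_thrashing;

                 /*
                  * Start accounting before the bio is submitted, not only
                  * around the wait, so time spent in a congested or
                  * IO-throttled submission is charged as thrashing delay.
                  */
                 psi_memstall_enter(&pflags);
                 delayacct_thrashing_start(&in_thrashing);

                 /* ... build and submit the bio, wait for completion ... */

                 delayacct_thrashing_end(&in_thrashing);
                 psi_memstall_leave(&pflags);
         }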
      
      Link: https://lkml.kernel.org/r/20220815072835.74876-1-yang.yang29@zte.com.cn
      Signed-off-by: default avatarYang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: default avatarCGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: default avatarRan Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: default avatarwangyong <wang.yong12@zte.com.cn>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3a9bb7b1