1. 27 Sep, 2022 40 commits
    • xen: use vma_lookup() in privcmd_ioctl_mmap() · 7ccf089b
      Liam R. Howlett authored
      vma_lookup() walks the VMA tree to a specific value, whereas find_vma()
      keeps searching the tree after walking to that value.  It is more efficient
      to walk only to the requested value, since privcmd_ioctl_mmap() exits
      the loop if vm_start != msg->va.
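
The difference in results between the two lookups can be illustrated with a minimal userspace sketch. This is not kernel code: `struct toy_vma`, `toy_find_vma()` and `toy_vma_lookup()` are hypothetical stand-ins that scan a sorted array, so they model only what each call returns, not how the tree is walked or what it costs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for vm_area_struct: covers [start, end). */
struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/* find_vma()-like: first region ending above addr; it may start after addr. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *v, size_t n,
                                          unsigned long addr)
{
    for (size_t i = 0; i < n; i++) {
        assert(v[i].start < v[i].end);
        if (addr < v[i].end)
            return &v[i];
    }
    return NULL;
}

/* vma_lookup()-like: only a region that actually contains addr. */
static const struct toy_vma *toy_vma_lookup(const struct toy_vma *v, size_t n,
                                            unsigned long addr)
{
    const struct toy_vma *vma = toy_find_vma(v, n, addr);

    return (vma && addr >= vma->start) ? vma : NULL;
}
```

For an address that falls in a hole between two regions, the find_vma()-like call returns the following region while the vma_lookup()-like call returns NULL, which is the case a caller comparing vm_start against the requested address would reject anyway.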
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-20-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mmap: change zeroing of maple tree in __vma_adjust() · 3b0e81a1
      Liam R. Howlett authored
      Only write to the maple tree if we are not inserting or the insert isn't
      going to overwrite the area to clear.  This avoids spanning writes and
      node coalescing when unnecessary.
      
      The change requires a custom search for the linked list addition to find
      the correct VMA for the prev link.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-19-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove rb tree. · 524e00b3
      Liam R. Howlett authored
      Remove the RB tree and start using the maple tree for vm_area_struct
      tracking.
      
      Drop validate_mm() calls in expand_upwards() and expand_downwards() as the
      lock is not held.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • proc: remove VMA rbtree use from nommu · 0c563f14
      Matthew Wilcox (Oracle) authored
      These users of the rbtree should probably have been walks of the linked
      list, but convert them to use walks of the maple tree.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-17-Liam.Howlett@oracle.com
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • damon: convert __damon_va_three_regions to use the VMA iterator · d0cf3dd4
      Liam R. Howlett authored
      This rather specialised walk can use the VMA iterator.  If this proves to
      be too slow, we can write a custom routine to find the two largest gaps,
      but it will be somewhat complicated, so let's see if we need it first.
      
      Update the kunit test case to use the maple tree.  This also fixes an
      issue with the kunit testcase not adding the last VMA to the list.
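
As an illustration of the walk described above, here is a minimal userspace sketch of picking the two largest gaps between sorted mappings and deriving three regions from them. It is only a model under stated assumptions: `struct toy_range` and `toy_three_regions()` are hypothetical names, a plain array stands in for the VMA iterator, and any region-size alignment the real code may apply is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address range [start, end). */
struct toy_range {
    unsigned long start;
    unsigned long end;
};

static unsigned long toy_size(struct toy_range r)
{
    return r.end - r.start;
}

/*
 * Given n sorted, non-overlapping mappings, find the two largest gaps and
 * fill out[] with the three regions separated by those gaps.
 * Returns 0 on success, -1 if two non-empty gaps do not exist.
 */
static int toy_three_regions(const struct toy_range *v, size_t n,
                             struct toy_range out[3])
{
    struct toy_range big = {0, 0}, second = {0, 0}, tmp;

    assert(out != NULL);
    if (n == 0)
        return -1;
    for (size_t i = 1; i < n; i++) {
        struct toy_range gap = {v[i - 1].end, v[i].start};

        if (toy_size(gap) > toy_size(second)) {
            second = gap;
            if (toy_size(second) > toy_size(big)) {
                tmp = big; big = second; second = tmp;
            }
        }
    }
    if (toy_size(second) == 0)
        return -1;
    /* Order the two gaps by address before cutting the regions. */
    if (big.start > second.start) {
        tmp = big; big = second; second = tmp;
    }
    out[0] = (struct toy_range){v[0].start, big.start};
    out[1] = (struct toy_range){big.end, second.start};
    out[2] = (struct toy_range){second.end, v[n - 1].end};
    return 0;
}
```

A single pass over the mappings is enough because only the two largest gaps need to be remembered, which is why a plain iterator walk is a reasonable starting point before writing anything more specialised.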
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-16-Liam.Howlett@oracle.com
      Fixes: 17ccae8b (mm/damon: add kunit tests)
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kernel/fork: use maple tree for dup_mmap() during forking · c9dbe82c
      Liam R. Howlett authored
      The maple tree was already tracking VMAs in this function by an earlier
      commit, but the rbtree iterator was being used to iterate the list.
      Change the iterator to use a maple tree native iterator and switch to the
      maple tree advanced API to avoid multiple walks of the tree during insert
      operations.  Unexport the now-unused vma_store() function.
      
      For performance reasons we bulk allocate the maple tree nodes.  The node
      calculations are done internally to the tree and use the VMA count and
      assume the worst-case node requirements.  The VM_DONT_COPY flag does not
      allow for the most efficient copy method of the tree and so a bulk loading
      algorithm is used.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-15-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mmap: use maple tree for unmapped_area{_topdown} · 3499a131
      Liam R. Howlett authored
      The maple tree code was added to find the unmapped area in a previous
      commit and was checked against what the rbtree returned, but the actual
      result was never used.  Start using the maple tree implementation and
      remove the rbtree code.
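
For readers unfamiliar with what an unmapped-area search computes, the following is a minimal userspace sketch of the bottom-up case only. It is a model, not the kernel routine: `toy_unmapped_area()` and `struct toy_vma` are hypothetical, a sorted array replaces the tree, alignment handling is omitted, and -1 stands in for whatever error code the real function returns.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/*
 * Lowest-address first fit: find a hole of at least len bytes inside
 * [low, high) that no region in the sorted array v covers.
 * Returns 0 and stores the hole's start in *out, or -1 if nothing fits.
 */
static int toy_unmapped_area(const struct toy_vma *v, size_t n,
                             unsigned long low, unsigned long high,
                             unsigned long len, unsigned long *out)
{
    unsigned long gap_start = low;

    assert(out != NULL && low <= high && len > 0);
    for (size_t i = 0; i < n && gap_start < high; i++) {
        if (v[i].start > gap_start) {
            /* A hole exists before this region; clamp it to the limit. */
            unsigned long gap_end = v[i].start < high ? v[i].start : high;

            if (gap_end - gap_start >= len) {
                *out = gap_start;
                return 0;
            }
        }
        if (v[i].end > gap_start)
            gap_start = v[i].end;
    }
    /* Check the space after the last region. */
    if (gap_start < high && high - gap_start >= len) {
        *out = gap_start;
        return 0;
    }
    return -1;
}
```

The linear scan here visits every region; a tree that tracks gap sizes in its nodes can skip whole subtrees that cannot contain a large enough hole, which is the point of doing the search in the tree itself.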
      
      Add kernel documentation comment for these functions.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-14-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree · 7fdbd37d
      Liam R. Howlett authored
      Use the maple tree's advanced API and a maple state to walk the tree for
      the entry at the address of the next vma, then use the maple state to walk
      back one entry to find the previous entry.
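
The "walk forward, then step back one entry" idea can be sketched in userspace as follows. This is only an illustration of the result: `toy_find_vma_prev()` is a hypothetical function over a sorted array, and the array index plays the role that the maple state plays in the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/*
 * Returns the first region ending above addr (or NULL) and stores the
 * region just before it in *pprev (or NULL if there is none).
 */
static const struct toy_vma *toy_find_vma_prev(const struct toy_vma *v, size_t n,
                                               unsigned long addr,
                                               const struct toy_vma **pprev)
{
    size_t i = 0;

    assert(pprev != NULL);
    /* Walk forward to the first region that ends above addr. */
    while (i < n && addr >= v[i].end)
        i++;
    /* Step back one entry from that position to get the previous one. */
    *pprev = (i > 0) ? &v[i - 1] : NULL;
    return (i < n) ? &v[i] : NULL;
}
```

When the address lies beyond every region, the sketch returns NULL and reports the last region as the previous one.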
      
      Add kernel documentation comments for this API.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-13-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mmap: use the maple tree in find_vma() instead of the rbtree. · be8432e7
      Liam R. Howlett authored
      Using the maple tree interface mt_find() will handle the RCU locking and
      will start searching at the address up to the limit, ULONG_MAX in this
      case.
      
      Add kernel documentation to this API.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-12-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mmap: use the VMA iterator in count_vma_pages_range() · 2e3af1db
      Matthew Wilcox (Oracle) authored
      This simplifies the implementation and is faster than using the linked
      list.
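
A minimal userspace sketch of the computation such a function performs: sum the pages of every mapping that overlaps a given range, clamping each mapping to the range. The names `toy_count_pages()` and `TOY_PAGE_SHIFT` are hypothetical, the 4 KiB page size is an assumption, and a sorted array stands in for the iterator.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/* Count the pages of all regions that overlap [addr, end). */
static unsigned long toy_count_pages(const struct toy_vma *v, size_t n,
                                     unsigned long addr, unsigned long end)
{
    unsigned long nr_pages = 0;

    assert(addr <= end);
    for (size_t i = 0; i < n; i++) {
        /* Clamp the region to the requested range. */
        unsigned long lo = v[i].start > addr ? v[i].start : addr;
        unsigned long hi = v[i].end < end ? v[i].end : end;

        if (lo < hi)
            nr_pages += (hi - lo) >> TOY_PAGE_SHIFT;
    }
    return nr_pages;
}
```

An iterator that starts at the first overlapping mapping and stops at the range end avoids visiting unrelated mappings, which this array loop does not attempt to model.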
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-11-Liam.Howlett@oracle.com
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add VMA iterator · f39af059
      Matthew Wilcox (Oracle) authored
      This thin layer of abstraction over the maple tree state is for iterating
      over VMAs.  You can go forwards, go backwards or ask where the iterator
      is.  Rename the existing vma_next() to __vma_next() -- it will be removed
      by the end of this series.
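
To make the three operations concrete (go forwards, go backwards, ask where the iterator is), here is a toy userspace cursor over a sorted array. It is a sketch under loud assumptions: `struct toy_iter` and the `toy_iter_*()` helpers are hypothetical and do not reproduce the kernel iterator's exact semantics or names.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long start;
    unsigned long end;
};

/* Hypothetical cursor: cur is the index of the current region, -1 before the first. */
struct toy_iter {
    const struct toy_vma *v;
    size_t n;
    long cur;
};

/* Advance to the next region, or return NULL at the end. */
static const struct toy_vma *toy_iter_next(struct toy_iter *it)
{
    assert(it != NULL);
    if (it->cur + 1 >= (long)it->n)
        return NULL;
    return &it->v[++it->cur];
}

/* Step back to the previous region, or return NULL at the beginning. */
static const struct toy_vma *toy_iter_prev(struct toy_iter *it)
{
    assert(it != NULL);
    if (it->cur <= 0)
        return NULL;
    return &it->v[--it->cur];
}

/* Report where the cursor is: start of the current region, 0 before the first. */
static unsigned long toy_iter_addr(const struct toy_iter *it)
{
    assert(it != NULL);
    return it->cur >= 0 ? it->v[it->cur].start : 0;
}
```

Keeping the position inside the iterator is what lets callers move in either direction without walking the structure again from the top.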
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-10-Liam.Howlett@oracle.com
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: start tracking VMAs with maple tree · d4af56c5
      Liam R. Howlett authored
      Start tracking the VMAs with the new maple tree structure in parallel with
      the rb_tree.  Add debug and trace events for maple tree operations and
      duplicate the rb_tree that is created on forks into the maple tree.
      
      The maple tree is added to the mm_struct (including the mm_init struct),
      supported in the required mm/mmap functions, tracked in kernel/fork
      for process forking, and used to find the unmapped_area, with the result
      checked against what the rbtree finds.
      
      This also moves the mmap_lock() in exit_mmap() since the oom reaper call
      does walk the VMAs.  Otherwise lockdep will be unhappy if oom happens.
      
      When splitting a vma fails due to allocations of the maple tree nodes,
      the error path in __split_vma() calls new->vm_ops->close(new).  The page
      accounting for hugetlb is actually in the close() operation, so it
      accounts for the removal of 1/2 of the VMA which was not adjusted.  This
      results in a negative exit value.  To avoid the negative charge, set
      vm_start = vm_end and vm_pgoff = 0.
      
      There is also a potential accounting issue in special mappings from
      insert_vm_struct() failing to allocate, so reverse the charge there in
      the failure scenario.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/test_maple_tree: add testing for maple tree · e15e06a8
      Liam R. Howlett authored
      This is a test suite that uses the radix test infrastructure.  It has been
      split into its own commit to allow for easier review of the maple tree
      code.
      
      The testing includes:
      - Allocation of nodes
      - gfp flag allocation checks
      - Expansion & contraction of tree
      - preallocation checks
      - tree navigation by next/prev
      - tree navigation by iterators (mas_for_each, etc)
      - Number of nodes for a given number of entries
      - Generic tree construction tests
      - Addition and removal of entries in forward and reverse numerical indexes
      - gap searching both forward and reverse
      - Combining gaps by overwriting entries in different ways
      - splitting right-most node
      - splitting left-most node
      - overwriting multiple slots
      - overwriting across different levels of the tree
      - overwriting the middle of a tree
      - causing a 3-way split up to the root by overwriting the last slot and
        first slot of different nodes and spanning different levels
      - RCU stress testing of the tree with threads
      - Duplication of the tree by entry count
      - Tests which were generated by fuzzers have been added.
      - A large number of tests which come from recording crashes in a VM and
        reconstructing the tree (see check_erase2_set())
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-8-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • radix tree test suite: add lockdep_is_held to header · c349fa18
      Liam R. Howlett authored
      maple tree uses lockdep_is_held, so define it as external in the header.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-7-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • radix tree test suite: add support for slab bulk APIs · cc86e0c2
      Liam R. Howlett authored
      Add support for kmem_cache_free_bulk() and kmem_cache_alloc_bulk() to the
      radix tree test suite.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-6-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • radix tree test suite: add allocation counts and size to kmem_cache · 000a4493
      Liam R. Howlett authored
      Add functions to get the number of allocations, and total allocations from
      a kmem_cache.  Also add a function to get the allocated size and a way to
      zero the total allocations.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-5-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • radix tree test suite: add kmem_cache_set_non_kernel() · e73cb368
      Liam R. Howlett authored
      kmem_cache_set_non_kernel() is a mechanism to allow a certain number of
      kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set in
      the flags.  This functionality allows for testing different paths through
      the code.
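
The mechanism can be pictured as an allowance counter on the cache. The sketch below is a userspace model, not the test suite's implementation: `struct toy_cache`, `toy_cache_set_non_kernel()`, `toy_cache_alloc()` and the flag value `TOY_GFP_KERNEL` are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_GFP_KERNEL 0x1u /* hypothetical flag bit */

/* Hypothetical cache: object size plus a budget of non-kernel allocations. */
struct toy_cache {
    size_t size;
    unsigned int non_kernel;
};

/* Allow the next val allocations without TOY_GFP_KERNEL to succeed. */
static void toy_cache_set_non_kernel(struct toy_cache *c, unsigned int val)
{
    assert(c != NULL);
    c->non_kernel = val;
}

/* Fail allocations lacking TOY_GFP_KERNEL unless the budget allows them. */
static void *toy_cache_alloc(struct toy_cache *c, unsigned int gfp)
{
    assert(c != NULL && c->size > 0);
    if (!(gfp & TOY_GFP_KERNEL)) {
        if (c->non_kernel == 0)
            return NULL;
        c->non_kernel--;
    }
    return malloc(c->size);
}
```

A test can set the budget to a small number, drive an operation that allocates without the kernel flag, and then check that the code under test copes when the budget runs out partway through.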
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-4-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • radix tree test suite: add pr_err define · fbeea9d1
      Liam R. Howlett authored
      Define pr_err as printk.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-3-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Maple Tree: add new data structure · 54a611b6
      Liam R. Howlett authored
      Patch series "Introducing the Maple Tree"
      
      The maple tree is an RCU-safe range based B-tree designed to use modern
      processor cache efficiently.  There are a number of places in the kernel
      where a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
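
The effect of the branching factor on height can be checked with some rough arithmetic. The sketch below assumes every node is completely full (16 entries per leaf, 10 children per non-leaf node), which a real tree will generally not be, and compares against a perfectly balanced binary tree as an optimistic stand-in for the rbtree. Both helper names are hypothetical.

```c
#include <assert.h>

/* Levels of a fully packed tree: 16 entries per leaf, 10 children per inner node. */
static unsigned int toy_btree_levels(unsigned long entries)
{
    unsigned long nodes;
    unsigned int levels = 1;

    assert(entries > 0);
    nodes = (entries + 15) / 16;      /* number of leaf nodes */
    while (nodes > 1) {
        nodes = (nodes + 9) / 10;     /* parents needed for this level */
        levels++;
    }
    return levels;
}

/* Levels of a perfectly balanced binary tree with one entry per node. */
static unsigned int toy_binary_levels(unsigned long entries)
{
    unsigned int levels = 0;

    assert(entries > 0);
    while (entries > 0) {
        entries >>= 1;
        levels++;
    }
    return levels;
}
```

Under these assumptions, 1000 entries fit in 3 levels of the wide tree, while a balanced binary tree needs 10, so a lookup touches far fewer nodes.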
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      Davidlohr said
      
      : Yes I like the maple tree, and at this stage I don't think we can ask for
      : more from this series wrt the MM - albeit there seems to still be some
      : folks reporting breakage.  Fundamentally I see Liam's work to (re)move
      : complexity out of the MM (not to say that the actual maple tree is not
      : complex) by consolidating the three complementary data structures very
      : much worth it considering performance does not take a hit.  This was very
      : much a turn off with the range locking approach, which worst case scenario
      : incurred prohibitive overhead.  Also as Liam and Matthew have
      : mentioned, RCU opens up a lot of nice performance opportunities, and in
      : addition academia[1] has shown outstanding scalability of address spaces
      : with the foundation of replacing the locked rbtree with RCU aware trees.
      
      A similar work has been discovered in the academic press
      
      	https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
      
      Sheer coincidence.  We designed our tree with the intention of solving the
      hardest problem first.  Upon settling on a b-tree variant and a rough
      outline, we researched ranged based b-trees and RCU b-trees and did find
      that article.  So it was nice to find reassurances that we were on the
      right path, but our design choice of using ranges made that paper unusable
      for us.
      
      This patch (of 70):
      
      The maple tree is an RCU-safe range based B-tree designed to use modern
      processor cache efficiently.  There are a number of places in the kernel
      where a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      There are additional BUG_ON() calls added within the tree, most of which
      are in debug code.  These will be replaced with WARN_ON() calls in the
      future.  There are also additional BUG_ON() calls within the code, which
      will also be reduced in number at a later date.  These exist to catch
      things such as out-of-range accesses, which would crash anyway.
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
      Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Sven Schnelle <svens@linux.ibm.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/demotion: expose memory tier details via sysfs · 9832fb87
      Aneesh Kumar K.V authored
      Add /sys/devices/virtual/memory_tiering/ where all memory tier related
      details can be found.  All allocated memory tiers will be listed there as
      /sys/devices/virtual/memory_tiering/memory_tierN/
      
      The nodes which are part of a specific memory tier can be listed via
      /sys/devices/virtual/memory_tiering/memory_tierN/nodes
      
      A directory hierarchy looks like
      :/sys/devices/virtual/memory_tiering$ tree memory_tier4/
      memory_tier4/
      ├── nodes
      ├── subsystem -> ../../../../bus/memory_tiering
      └── uevent
      
      :/sys/devices/virtual/memory_tiering$ cat memory_tier4/nodes
      0,2
      
      [aneesh.kumar@linux.ibm.com: drop toptier_nodes from sysfs]
        Link: https://lkml.kernel.org/r/20220922102201.62168-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220830081736.119281-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/nodemask: optimize node_random for nodemask with single NUMA node · 3e061d92
      Aneesh Kumar K.V authored
      The most common case for certain node_random usage (demotion nodemask) is
      with nodemask weight 1.  We can avoid calling get_random_int() in that
      case and always return the only node set in the nodemask.
      
      A simple test such as the one below:
        before = rdtsc_ordered();
        for (i = 0; i < 100; i++) {
            rand = node_random(&nmask);
        }
        after = rdtsc_ordered();

      Without fix, after - before: 16438
      With fix, after - before: 816
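
The shape of the fast path can be sketched in userspace as follows. This is a model under assumptions, not the patched kernel function: a 64-bit integer stands in for the nodemask, `rand()` stands in for the kernel's random number source, and `toy_node_random()`, `toy_weight()`, `toy_nth_set_bit()` and `TOY_NO_NODE` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_NO_NODE (-1) /* hypothetical "no node" value */

/* Number of set bits in the mask. */
static int toy_weight(uint64_t mask)
{
    int w = 0;

    while (mask) {
        mask &= mask - 1; /* clear the lowest set bit */
        w++;
    }
    return w;
}

/* Position of the n-th set bit (0-based), or TOY_NO_NODE if there is none. */
static int toy_nth_set_bit(uint64_t mask, int n)
{
    assert(n >= 0);
    for (int bit = 0; bit < 64; bit++) {
        if ((mask & (UINT64_C(1) << bit)) && n-- == 0)
            return bit;
    }
    return TOY_NO_NODE;
}

/* Pick a random set bit, skipping the random number call when only one is set. */
static int toy_node_random(uint64_t mask)
{
    int w = toy_weight(mask);

    switch (w) {
    case 0:
        return TOY_NO_NODE;
    case 1:
        return toy_nth_set_bit(mask, 0); /* fast path: no RNG call */
    default:
        return toy_nth_set_bit(mask, rand() % w);
    }
}
```

With a single bit set there is only one possible answer, so drawing a random number first is pure overhead, which is consistent with the large drop in the measured difference above.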
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-11-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3e061d92
    • Aneesh Kumar K.V's avatar
      mm/demotion: update node_is_toptier to work with memory tiers · 467b171a
      Aneesh Kumar K.V authored
      With memory tier support we can have memory only NUMA nodes in the top
      tier from which we want to avoid promotion tracking NUMA faults.  Update
      node_is_toptier to work with memory tiers.  All NUMA nodes are by default
      top tier nodes.  With lower (slower) memory tiers added, we consider all
      memory tiers above a memory tier having CPU NUMA nodes to be top memory
      tiers.
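
      The rule above can be sketched in user-space C.  This is an illustration
      under assumptions, not the kernel implementation: struct tier,
      top_tier_bound() and tier_is_toptier() are hypothetical, tiers are
      described only by the start of their abstract distance range, and a
      chunk size of 128 is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_SIZE 128  /* assumed abstract distance range covered by one tier */

struct tier {
    int adistance_start;  /* lower bound of the tier's abstract distance range */
    bool has_cpu_node;    /* true if any NUMA node in the tier has CPUs */
};

/*
 * tiers[] is ordered from fastest (smallest abstract distance) to slowest.
 * Returns the inclusive upper bound of the abstract distances that count as
 * top tier, or -1 if no tier has CPUs.
 */
static int top_tier_bound(const struct tier *tiers, size_t n)
{
    assert(tiers != NULL);

    /* Walk from the slowest tier up and stop at the first one with CPUs. */
    for (size_t i = n; i-- > 0; ) {
        if (tiers[i].has_cpu_node)
            return tiers[i].adistance_start + CHUNK_SIZE - 1;
    }
    return -1;
}

static bool tier_is_toptier(const struct tier *t, int bound)
{
    /* Assumption of this sketch: with no CPU tier, everything is top tier. */
    if (bound < 0)
        return true;
    return t->adistance_start <= bound;
}
```

      In this sketch a memory-only tier that is faster than the tier holding
      the CPU nodes is still reported as top tier, so no promotion tracking
      would be needed for it.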
      
      [sj@kernel.org: include missed header file, memory-tiers.h]
        Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
      [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
      [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
        Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      467b171a
    • Jagdish Gediya's avatar
      mm/demotion: demote pages according to allocation fallback order · 32008027
      Jagdish Gediya authored
      Currently, a higher tier node can only be demoted to selected nodes on the
      next lower tier as defined by the demotion path.  This strict demotion
      order does not work in all use cases (e.g.  some use cases may want to
      allow cross-socket demotion to another node in the same demotion tier as a
      fallback when the preferred demotion node is out of space).  This demotion
      order is also inconsistent with the page allocation fallback order when
      all the nodes in a higher tier are out of space: The page allocation can
      fall back to any node from any lower tier, whereas the demotion order
      doesn't allow that currently.
      
      This patch adds support to get all the allowed demotion targets for a
      memory tier.  demote_page_list() function is now modified to utilize this
      allowed node mask as the fallback allocation mask.
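
      A simplified sketch of the fallback idea, assuming a 64-bit node mask
      and a per-node free page count.  pick_demotion_node() is a hypothetical
      helper and does not mirror the actual demote_page_list() code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64
#define NO_NODE (-1)

/*
 * free_pages[i] is the number of free pages on node i.
 * Try the preferred demotion target first, then fall back to any other
 * node that is set in the allowed mask.
 */
static int pick_demotion_node(int preferred, uint64_t allowed,
                              const unsigned long free_pages[MAX_NODES])
{
    assert(preferred >= 0 && preferred < MAX_NODES);

    if (free_pages[preferred] > 0)
        return preferred;

    for (int nid = 0; nid < MAX_NODES; nid++) {
        if (nid == preferred || !(allowed & (UINT64_C(1) << nid)))
            continue;
        if (free_pages[nid] > 0)
            return nid;
    }
    return NO_NODE;
}
```

      Nodes outside the allowed mask are never chosen, even when they have
      free pages, which keeps demotion within the lower tiers.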
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-9-aneesh.kumar@linux.ibm.com
      Signed-off-by: Jagdish Gediya <jvgediya.oss@gmail.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      32008027
    • Aneesh Kumar K.V's avatar
      mm/demotion: drop memtier from memtype · b26ac6f3
      Aneesh Kumar K.V authored
      Now that we track node-specific memtier in pg_data_t, we can drop memtier
      from memtype.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-8-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b26ac6f3
    • Aneesh Kumar K.V's avatar
      mm/demotion: add pg_data_t member to track node memory tier details · 7766cf7a
      Aneesh Kumar K.V authored
      Also update different helpers to use NODE_DATA()->memtier.  Since the
      node-specific memtier can change based on the reassignment of a NUMA node
      to a different memory tier, accessing NODE_DATA()->memtier needs to happen
      under an RCU read lock or memory_tier_lock.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-7-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7766cf7a
    • Aneesh Kumar K.V's avatar
      mm/demotion: build demotion targets based on explicit memory tiers · 6c542ab7
      Aneesh Kumar K.V authored
      This patch switches the demotion target building logic to use memory tiers
      instead of NUMA distance.  All N_MEMORY NUMA nodes will be placed in the
      default memory tier and additional memory tiers will be added by drivers
      like dax kmem.
      
      This patch builds the demotion target for a NUMA node by looking at all
      memory tiers below the tier to which the NUMA node belongs.  The closest
      node in the immediately following memory tier is used as a demotion
      target.
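
      The selection rule can be illustrated with a small user-space sketch.
      It assumes a fixed number of nodes, a per-node tier index and a NUMA
      distance matrix; demotion_target() is a hypothetical helper that returns
      a single node, whereas the kernel may track more than one target.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4
#define NO_NODE (-1)

/*
 * node_tier[i]:   tier index of node i (0 is fastest, larger is slower).
 * distance[i][j]: NUMA distance between nodes i and j.
 * Returns the closest node in the nearest populated slower tier, or NO_NODE
 * if the node already sits in the slowest tier.
 */
static int demotion_target(int node, const int node_tier[NR_NODES],
                           int distance[NR_NODES][NR_NODES])
{
    int next_tier = INT_MAX;
    int best = NO_NODE;

    assert(node >= 0 && node < NR_NODES);

    /* Find the nearest populated tier that is slower than the node's own. */
    for (int i = 0; i < NR_NODES; i++) {
        if (node_tier[i] > node_tier[node] && node_tier[i] < next_tier)
            next_tier = node_tier[i];
    }

    /* Among the nodes of that tier, keep the one with the smallest distance. */
    for (int i = 0; i < NR_NODES; i++) {
        if (node_tier[i] != next_tier)
            continue;
        if (best == NO_NODE || distance[node][i] < distance[node][best])
            best = i;
    }
    return best;
}
```

      Because only the tier index and the distance matrix are consulted,
      whether a node has CPUs plays no role in the result.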
      
      Since we are now only building demotion targets for N_MEMORY NUMA nodes,
      the CPU hotplug calls are removed in this patch.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6c542ab7
    • Aneesh Kumar K.V's avatar
      mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE · 7b88bda3
      Aneesh Kumar K.V authored
      By default, all nodes are assigned to the default memory tier which is the
      memory tier designated for nodes with DRAM.
      
      Set the dax kmem device node's tier to a slower memory tier by setting its
      abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE.  Low-level drivers
      like papr_scm or ACPI NFIT can initialize memory device type to a more
      accurate value based on device tree details or HMAT.  If the kernel
      doesn't find the memory type initialized, a default slower memory type is
      assigned by the kmem driver.
      
      [aneesh.kumar@linux.ibm.com: assign correct memory type for multiple dax devices with the same node affinity]
        Link: https://lkml.kernel.org/r/20220826100224.542312-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-5-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7b88bda3
    • Aneesh Kumar K.V's avatar
      mm/demotion: add hotplug callbacks to handle new numa node onlined · c6123a19
      Aneesh Kumar K.V authored
      If the newly onlined NUMA node doesn't have an abstract distance assigned,
      the kernel adds the NUMA node to the default memory tier.
      
      [aneesh.kumar@linux.ibm.com: fix kernel error with memory hotplug]
        Link: https://lkml.kernel.org/r/20220825092019.379069-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-4-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6123a19
    • Aneesh Kumar K.V's avatar
      mm/demotion: move memory demotion related code · 91952440
      Aneesh Kumar K.V authored
      This moves memory demotion related code to mm/memory-tiers.c.  No
      functional change in this patch.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91952440
    • Aneesh Kumar K.V's avatar
      mm/demotion: add support for explicit memory tiers · 992bf775
      Aneesh Kumar K.V authored
      Patch series "mm/demotion: Memory tiers and demotion", v15.
      
      The current kernel has the basic memory tiering support: Inactive pages on
      a higher tier NUMA node can be migrated (demoted) to a lower tier NUMA
      node to make room for new allocations on the higher tier NUMA node. 
      Frequently accessed pages on a lower tier NUMA node can be migrated
      (promoted) to a higher tier NUMA node to improve the performance.
      
      In the current kernel, memory tiers are defined implicitly via a demotion
      path relationship between NUMA nodes, which is created during the kernel
      initialization and updated when a NUMA node is hot-added or hot-removed. 
      The current implementation puts all nodes with CPU into the highest tier,
      and builds the tier hierarchy tier-by-tier by establishing the per-node
      demotion targets based on the distances between nodes.
      
      This current memory tier kernel implementation needs to be improved for
      several important use cases:
      
      * The current tier initialization code always initializes each
        memory-only NUMA node into a lower tier.  But a memory-only NUMA node
        may have a high performance memory device (e.g.  a DRAM-backed
        memory-only node on a virtual machine) and that should be put into a
        higher tier.
      
      * The current tier hierarchy always puts CPU nodes into the top tier. 
        But on a system with HBM (e.g.  GPU memory) devices, these memory-only
        HBM NUMA nodes should be in the top tier, and DRAM nodes with CPUs are
        better to be placed into the next lower tier.
      
      * Also because the current tier hierarchy always puts CPU nodes into the
        top tier, when a CPU is hot-added (or hot-removed) and triggers a memory
        node from CPU-less into a CPU node (or vice versa), the memory tier
        hierarchy gets changed, even though no memory node is added or removed. 
        This can make the tier hierarchy unstable and make it difficult to
        support tier-based memory accounting.
      
      * A higher tier node can only be demoted to nodes with shortest distance
        on the next lower tier as defined by the demotion path, not any other
        node from any lower tier.  This strict demotion order does not work in
        all use cases (e.g.  some use cases may want to allow cross-socket
        demotion to another node in the same demotion tier as a fallback when
        the preferred demotion node is out of space), and has resulted in the
        feature request for an interface to override the system-wide, per-node
        demotion order from the userspace.  This demotion order is also
        inconsistent with the page allocation fallback order when all the nodes
        in a higher tier are out of space: The page allocation can fall back to
        any node from any lower tier, whereas the demotion order doesn't allow
        that.
      
      This patch series makes the creation of memory tiers explicit under the
      control of device drivers.
      
      Memory Tier Initialization
      ==========================
      
      The Linux kernel presents memory devices as NUMA nodes and each memory device
      is of a specific type.  The memory type of a device is represented by its
      abstract distance.  A memory tier corresponds to a range of abstract
      distance.  This allows for classifying memory devices with a specific
      performance range into a memory tier.
      
      By default, all memory nodes are assigned to the default tier with
      abstract distance 512.
      
      A device driver can move its memory nodes from the default tier.  For
      example, PMEM can move its memory nodes below the default tier, whereas
      GPU can move its memory nodes above the default tier.
      
      The kernel initialization code makes the decision on which exact tier a
      memory node should be assigned to based on the requests from the device
      drivers as well as the memory device hardware information provided by the
      firmware.
      
      Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
      
      
      This patch (of 10):
      
      In the current kernel, memory tiers are defined implicitly via a demotion
      path relationship between NUMA nodes, which is created during the kernel
      initialization and updated when a NUMA node is hot-added or hot-removed. 
      The current implementation puts all nodes with CPU into the highest tier,
      and builds the tier hierarchy by establishing the per-node demotion
      targets based on the distances between nodes.
      
      This current memory tier kernel implementation needs to be improved for
      several important use cases:
      
      The current tier initialization code always initializes each memory-only
      NUMA node into a lower tier.  But a memory-only NUMA node may have a high
      performance memory device (e.g.  a DRAM-backed memory-only node on a
      virtual machine) that should be put into a higher tier.
      
      The current tier hierarchy always puts CPU nodes into the top tier.  But
      on a system with HBM or GPU devices, the memory-only NUMA nodes mapping
      these devices should be in the top tier, and DRAM nodes with CPUs are
      better to be placed into the next lower tier.
      
      With the current kernel, a higher tier node can only be demoted to nodes
      with the shortest distance on the next lower tier as defined by the
      demotion path, not any other node from any lower tier.  This strict
      demotion order does not work in all use cases (e.g.  some use cases may
      want to allow cross-socket demotion to another node in the same demotion
      tier as a fallback when the preferred demotion node is out of space).
      This demotion order is also inconsistent with the page allocation
      fallback order when all the nodes in a higher tier are out of space: The
      page allocation can fall back to any node from any lower tier, whereas
      the demotion order doesn't allow that.
      
      This patch series addresses the above by defining memory tiers explicitly.
      
      The Linux kernel presents memory devices as NUMA nodes and each memory device
      is of a specific type.  The memory type of a device is represented by its
      abstract distance.  A memory tier corresponds to a range of abstract
      distance.  This allows for classifying memory devices with a specific
      performance range into a memory tier.
      
      This patch configures the range/chunk size to be 128.  The default DRAM
      abstract distance is 512.  We can have 4 memory tiers below the default
      DRAM with abstract distance ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
      Faster memory devices can be placed in these faster (higher) memory tiers.
      Slower memory devices like persistent memory will have abstract distance
      higher than the default DRAM level.
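
      The mapping from abstract distance to tier can be sketched as rounding
      down to a multiple of the chunk size.  The constants below take the
      values stated in this message (chunk size 128, DRAM at 512); the names
      CHUNK_SIZE, DRAM_ADISTANCE, tier_start() and same_tier() are
      hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define CHUNK_SIZE     128  /* width of one tier's abstract distance range */
#define DRAM_ADISTANCE 512  /* default abstract distance of DRAM */

/* Start of the abstract distance range (tier) that contains adistance. */
static int tier_start(int adistance)
{
    assert(adistance >= 0);
    return adistance - (adistance % CHUNK_SIZE);
}

/* Two devices share a tier when their distances fall into the same chunk. */
static bool same_tier(int a, int b)
{
    return tier_start(a) == tier_start(b);
}
```

      For example, tier_start(511) yields 384 and tier_start(DRAM_ADISTANCE)
      yields 512, so a device at distance 511 lands in the tier just above
      (faster than) the default DRAM tier.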
      
      [akpm@linux-foundation.org: fix comment, per Aneesh]
      Link: https://lkml.kernel.org/r/20220818131042.113280-1-aneesh.kumar@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220818131042.113280-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      992bf775
    • Yu Zhao's avatar
      mm: multi-gen LRU: design doc · 8be976a0
      Yu Zhao authored
      Add a design doc.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8be976a0
    • Yu Zhao's avatar
      mm: multi-gen LRU: admin guide · 07017acb
      Yu Zhao authored
      Add an admin guide.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      07017acb
    • Yu Zhao's avatar
      mm: multi-gen LRU: debugfs interface · d6c3af7d
      Yu Zhao authored
      Add /sys/kernel/debug/lru_gen for working set estimation and proactive
      reclaim.  These techniques are commonly used to optimize job scheduling
      (bin packing) in data centers [1][2].
      
      Compared with the page table-based approach and the PFN-based
      approach, this lruvec-based approach has the following advantages:
      1. It offers better choices because it is aware of memcgs, NUMA nodes,
         shared mappings and unmapped page cache.
      2. It is more scalable because it is O(nr_hot_pages), whereas the
         PFN-based approach is O(nr_total_pages).
      
      Add /sys/kernel/debug/lru_gen_full for debugging.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-13-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d6c3af7d
    • Yu Zhao's avatar
      mm: multi-gen LRU: thrashing prevention · 1332a809
      Yu Zhao authored
      Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
      requested by many desktop users [1].
      
      When set to value N, it prevents the working set of N milliseconds from
      getting evicted.  The OOM killer is triggered if this working set cannot
      be kept in memory.  Based on the average human detectable lag (~100ms),
      N=1000 usually eliminates intolerable lags due to thrashing.  Larger
      values like N=3000 make lags less noticeable at the risk of premature OOM
      kills.
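
      The time-based rule can be sketched as follows.  This is a user-space
      illustration under assumptions, not the kernel logic: timestamps are
      plain millisecond counters, and generation_is_protected() and
      should_oom() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * birth_ms: time at which a candidate generation was created.
 * now_ms:   current time on the same clock.
 * Returns true if the generation is younger than min_ttl_ms and therefore
 * belongs to the working set that must not be evicted.
 */
static bool generation_is_protected(uint64_t birth_ms, uint64_t now_ms,
                                    uint64_t min_ttl_ms)
{
    assert(now_ms >= birth_ms);
    return now_ms - birth_ms < min_ttl_ms;
}

/*
 * Assumes reclaim is required.  If every candidate is protected, nothing
 * can be evicted without breaking the guarantee, so an OOM kill follows.
 */
static bool should_oom(const uint64_t *birth_ms, int n, uint64_t now_ms,
                       uint64_t min_ttl_ms)
{
    for (int i = 0; i < n; i++) {
        if (!generation_is_protected(birth_ms[i], now_ms, min_ttl_ms))
            return false;
    }
    return true;
}
```

      A larger min_ttl_ms protects more generations, which is why larger
      values raise the chance of an OOM kill in this sketch.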
      
      Compared with the size-based approach [2], this time-based approach
      has the following advantages:
      
      1. It is easier to configure because it is agnostic to applications
         and memory sizes.
      2. It is more reliable because it is directly wired to the OOM killer.
      
      [1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
      [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-12-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: kill switch · 354ed597
      Yu Zhao authored
      Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
      can be disabled include:
        0x0001: the multi-gen LRU core
        0x0002: walking page tables, when arch_has_hw_pte_young() returns
                true
        0x0004: clearing the accessed bit in non-leaf PMD entries, when
                CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
        [yYnN]: apply to all the components above
      E.g.,
        echo y >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0007
        echo 5 >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0005
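
      The mapping from written values to the resulting mask can be sketched
      in userspace C as below.  This is only an illustration of the bit
      layout listed above, not the kernel's sysfs handler: the macro names
      and parse_enabled() are hypothetical, and dropping unknown bits is an
      assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Bit values copied from the list above; the macro names are hypothetical. */
#define LRU_GEN_CAP_CORE          0x0001u /* the multi-gen LRU core */
#define LRU_GEN_CAP_MM_WALK       0x0002u /* walking page tables */
#define LRU_GEN_CAP_NONLEAF_YOUNG 0x0004u /* accessed bit in non-leaf PMD entries */
#define LRU_GEN_CAP_ALL           0x0007u /* all three components */

/*
 * Parse a string as it might be written to the "enabled" file.
 * Returns the resulting mask, or -1 if the input cannot be parsed.
 * Assumption of this sketch: bits outside LRU_GEN_CAP_ALL are dropped.
 */
int parse_enabled(const char *buf)
{
	char *end;
	unsigned long val;

	assert(buf != NULL);

	/* [yYnN] applies to all the components at once. */
	if (tolower((unsigned char)buf[0]) == 'y')
		return (int)LRU_GEN_CAP_ALL;
	if (tolower((unsigned char)buf[0]) == 'n')
		return 0;

	/* Base 0 accepts both "5" and "0x0005". */
	val = strtoul(buf, &end, 0);
	if (end == buf)
		return -1;
	return (int)(val & LRU_GEN_CAP_ALL);
}
```

      With this sketch, "y" yields 0x0007 and "5" yields 0x0005, matching
      the two examples above.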
      
      NB: the page table walks happen on the scale of seconds under heavy memory
      pressure, in which case the mmap_lock contention is a lesser concern,
      compared with the LRU lock contention and the I/O congestion.  So far the
      only well-known case of the mmap_lock contention happens on Android, due
      to Scudo [1] which allocates several thousand VMAs for merely a few
      hundred MBs.  The SPF and the Maple Tree have also provided their own
      assessments [2][3].  However, if walking page tables does worsen the
      mmap_lock contention, the kill switch can be used to disable it.  In this
      case the multi-gen LRU will suffer a minor performance degradation, as
      shown previously.
      
      Clearing the accessed bit in non-leaf PMD entries can also be disabled,
      since this behavior was not tested on x86 varieties other than Intel and
      AMD.
      
      [1] https://source.android.com/devices/tech/debug/scudo
      [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
      [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: optimize multiple memcgs · f76c8337
      Yu Zhao authored
      When multiple memcgs are available, it is possible to use generations as a
      frame of reference to make better choices and improve overall performance
      under global memory pressure.  This patch adds a basic optimization to
      select memcgs that can drop single-use unmapped clean pages first.  Doing
      so reduces the chance of going into the aging path or swapping, which can
      be costly.
      
      A typical example that benefits from this optimization is a server running
      mixed types of workloads, e.g., heavy anon workload in one memcg and heavy
      buffered I/O workload in the other.
      
      Though this optimization can be applied to both kswapd and direct reclaim,
      it is only added to kswapd to keep the patchset manageable.  Later
      improvements may cover the direct reclaim path.
      
      While ensuring certain fairness to all eligible memcgs, proportional scans
      of individual memcgs also require proper backoff to avoid overshooting
      their aggregate reclaim target by too much.  Otherwise it can cause high
      direct reclaim latency.  The conditions for backoff are:
      
      1. At low priorities, for direct reclaim, if aging fairness or direct
         reclaim latency is at risk, i.e., aging one memcg multiple times or
         swapping after the target is met.
      2. At high priorities, for global reclaim, if per-zone free pages are
         above respective watermarks.
      
      Server benchmark results:
        Mixed workloads:
          fio (buffered I/O): +[19, 21]%
                      IOPS         BW
            patch1-8: 1880k        7343MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[119, 123]%
                      Ops/sec      KB/sec
            patch1-8: 862768.65    33514.68
            patch1-9: 1911022.12   74234.54
      
        Mixed workloads:
          fio (buffered I/O): +[75, 77]%
                      IOPS         BW
            5.19-rc1: 1279k        4996MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[13, 15]%
                      Ops/sec      KB/sec
            5.19-rc1: 1673524.04   65008.87
            patch1-9: 1911022.12   74234.54
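
          The bracketed ranges above can be cross-checked against the raw
          figures with a relative-change calculation.  The helper below is
          a hypothetical illustration and is not part of the patchset or
          the benchmark tooling.

```c
#include <assert.h>

/* Relative change from base to value, in percent. */
double percent_change(double base, double value)
{
	assert(base != 0.0);
	return (value - base) / base * 100.0;
}
```

          For example, fio IOPS going from 1880k to 2252k is about +19.8%,
          and memcached Ops/sec going from 862768.65 to 1911022.12 is about
          +121.5%; both fall inside the ranges quoted above.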
      
        Configurations:
          (changes since patch 6)
      
          cat mixed.sh
          modprobe brd rd_nr=2 rd_size=56623104
      
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          mkfs.ext4 /dev/ram1
          mount -t ext4 /dev/ram1 /mnt
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=90m --group_reporting &
          pid=$!
      
          sleep 200
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
          kill -INT $pid
          wait
      
      Client benchmark results:
        no change (CONFIG_MEMCG=n)
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: support page table walks · bd74fdae
      Yu Zhao authored
      To further exploit spatial locality, the aging prefers to walk page tables
      to search for young PTEs and promote hot pages.  A kill switch will be
      added in the next patch to disable this behavior.  When disabled, the
      aging relies on the rmap only.
      
      NB: this behavior has nothing in common with the page table scanning in
      the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
      pages to swapcache and unmaps them.
      
      To avoid confusion, the term "iteration" specifically means the traversal
      of an entire mm_struct list; the term "walk" will be applied to page
      tables and the rmap, as usual.
      
      An mm_struct list is maintained for each memcg, and an mm_struct follows
      its owner task to the new memcg when this task is migrated.  Given an
      lruvec, the aging iterates lruvec_memcg()->mm_list and calls
      walk_page_range() with each mm_struct on this list to promote hot pages
      before it increments max_seq.
      
      When multiple page table walkers iterate the same list, each of them gets
      a unique mm_struct; therefore they can run concurrently.  Page table
      walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
      pages it left in the previous memcg will not be promoted when its current
      memcg is under reclaim.  Similarly, page table walkers will not promote
      pages from nodes other than the one under reclaim.
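
      One way to picture how concurrent walkers each receive a unique
      mm_struct from a shared list is a shared cursor that hands out every
      entry exactly once per iteration.  The sketch below uses a C11 atomic
      counter for brevity; the struct, its fields and mm_list_claim() are
      hypothetical, and the text above does not say which mechanism the
      patch actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-memcg list of mm_structs. */
struct mm_list {
	size_t nr_mms;      /* number of entries on the list */
	atomic_size_t next; /* shared cursor: index of the next unclaimed entry */
};

/*
 * Claim the next entry for the calling walker.  Returns its index, or
 * (size_t)-1 once every entry of this iteration has been handed out.
 * Because the fetch-and-add is atomic, no two walkers get the same index.
 */
size_t mm_list_claim(struct mm_list *list)
{
	size_t idx;

	assert(list != NULL);

	idx = atomic_fetch_add(&list->next, 1);
	return idx < list->nr_mms ? idx : (size_t)-1;
}
```

      A walker would call mm_list_claim() in a loop and walk the page tables
      of each claimed entry until it receives (size_t)-1.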
      
      This patch uses the following optimizations when walking page tables:
      1. It tracks the usage of mm_struct's between context switches so that
         page table walkers can skip processes that have been sleeping since
         the last iteration.
      2. It uses generational Bloom filters to record populated branches so
         that page table walkers can reduce their search space based on the
         query results, e.g., to skip page tables containing mostly holes or
         misplaced pages.
      3. It takes advantage of the accessed bit in non-leaf PMD entries when
         CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
      4. It does not zigzag between a PGD table and the same PMD table
         spanning multiple VMAs. IOW, it finishes all the VMAs within the
         range of the same PMD table before it returns to a PGD table. This
         improves the cache performance for workloads that have large
         numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
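
      The idea behind item 2 can be sketched as a pair of Bloom filters
      selected by generation parity: a walker in generation seq queries the
      filter filled during the previous walk and records populated branches
      in the filter for seq+1.  Everything below (filter size, hash mixing,
      function names) is an assumption made for illustration, not the code
      added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS  (1u << 15)          /* assumed size of each filter */
#define FILTER_WORDS (FILTER_BITS / 64)

struct gen_bloom {
	uint64_t bits[2][FILTER_WORDS];  /* one filter per generation parity */
};

/* Derive two bit positions from one 64-bit key (illustrative mixing only). */
static void bloom_keys(uint64_t item, uint32_t key[2])
{
	uint64_t h = item * 0x9E3779B97F4A7C15ull;

	key[0] = (uint32_t)(h >> 49);                      /* top 15 bits */
	key[1] = (uint32_t)(h >> 34) & (FILTER_BITS - 1);  /* next 15 bits */
}

/* Record an item in the filter that belongs to generation seq. */
void bloom_set(struct gen_bloom *f, unsigned long seq, uint64_t item)
{
	uint32_t key[2];
	uint64_t *filter = f->bits[seq % 2];

	bloom_keys(item, key);
	filter[key[0] / 64] |= 1ull << (key[0] % 64);
	filter[key[1] / 64] |= 1ull << (key[1] % 64);
}

/* False means "definitely not recorded"; true means "possibly recorded". */
bool bloom_test(const struct gen_bloom *f, unsigned long seq, uint64_t item)
{
	uint32_t key[2];
	const uint64_t *filter = f->bits[seq % 2];

	bloom_keys(item, key);
	return ((filter[key[0] / 64] >> (key[0] % 64)) & 1) &&
	       ((filter[key[1] / 64] >> (key[1] % 64)) & 1);
}

/* Clear the filter that belongs to generation seq before refilling it. */
void bloom_reset(struct gen_bloom *f, unsigned long seq)
{
	memset(f->bits[seq % 2], 0, sizeof(f->bits[0]));
}
```

      In this sketch a walker for generation seq would call
      bloom_reset(f, seq + 1) once, skip a branch when bloom_test(f, seq, ...)
      returns false, and call bloom_set(f, seq + 1, ...) for branches that
      turned out to be worth visiting.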
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[8, 10]%
                      Ops/sec      KB/sec
            patch1-7: 1147696.57   44640.29
            patch1-8: 1245274.91   48435.66
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
          patch1-8
            49.44%  lzo1x_1_do_compress (real work)
             6.19%  page_vma_mapped_walk (overhead)
             5.97%  _raw_spin_unlock_irq
             3.13%  get_pfn_folio
             2.85%  ptep_clear_flush
             2.42%  __zram_bvec_write
             2.08%  do_raw_spin_lock
             1.92%  memmove
             1.44%  alloc_zspage
             1.36%  memset
      
        Configurations:
          no change
      
      Thanks to the following developers for their efforts [3].
        kernel test robot <lkp@intel.com>
      
      [1] https://lwn.net/Articles/23732/
      [2] https://llvm.org/docs/ScudoHardenedAllocator.html
      [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: exploit locality in rmap · 018ee47f
      Yu Zhao authored
      Searching the rmap for PTEs mapping each page on an LRU list (to test and
      clear the accessed bit) can be expensive because pages from different VMAs
      (PA space) are not cache friendly to the rmap (VA space).  For workloads
      mostly using mapped pages, searching the rmap can incur the highest CPU
      cost in the reclaim path.
      
      This patch exploits spatial locality to reduce the trips into the rmap. 
      When shrink_page_list() walks the rmap and finds a young PTE, a new
      function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
      PTEs.  On finding another young PTE, it clears the accessed bit and
      updates the gen counter of the page mapped by this PTE to
      (max_seq%MAX_NR_GENS)+1.
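
      The look-around step can be modeled in userspace as a bounded scan
      over an array of simplified PTEs.  The sketch below is an illustration
      only: struct fake_pte, the way the window is centered on the hit, and
      MAX_NR_GENS=4 are assumptions, and the real lru_gen_look_around()
      operates on page tables rather than arrays.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)
#define MAX_NR_GENS   4 /* assumed value; not stated in the text above */

/* Hypothetical, simplified stand-in for one PTE and the page it maps. */
struct fake_pte {
	bool young; /* accessed bit */
	int gen;    /* gen counter of the mapped page, 0 if not on a list */
};

/*
 * Starting from a young PTE at index hit, scan at most BITS_PER_LONG-1
 * neighbors.  Each young neighbor has its accessed bit cleared and the
 * gen counter of its page set to (max_seq % MAX_NR_GENS) + 1.
 * Returns the number of neighbors updated.
 */
int look_around(struct fake_pte *ptes, size_t nr, size_t hit,
		unsigned long max_seq)
{
	size_t start, end, i;
	int updated = 0;

	assert(hit < nr);

	/* A window of at most BITS_PER_LONG entries that contains hit. */
	start = hit >= BITS_PER_LONG / 2 ? hit - BITS_PER_LONG / 2 : 0;
	end = start + BITS_PER_LONG < nr ? start + BITS_PER_LONG : nr;

	for (i = start; i < end; i++) {
		if (i == hit || !ptes[i].young)
			continue;
		ptes[i].young = false;
		ptes[i].gen = (int)(max_seq % MAX_NR_GENS) + 1;
		updated++;
	}
	return updated;
}
```

      The point of the sketch is that one trip into the rmap is amortized
      over every young neighbor found in the window, which is where the
      reduction in page_vma_mapped_walk overhead would come from.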
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[3, 5]%
                      Ops/sec      KB/sec
            patch1-6: 1106168.46   43025.04
            patch1-7: 1147696.57   44640.29
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
        Configurations:
          no change
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: minimal implementation · ac35a490
      Yu Zhao authored
      To avoid confusion, the terms "promotion" and "demotion" will be applied
      to the multi-gen LRU, as a new convention; the terms "activation" and
      "deactivation" will be applied to the active/inactive LRU, as usual.
      
      The aging produces young generations.  Given an lruvec, it increments
      max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
      hot pages to the youngest generation when it finds them accessed through
      page tables; the demotion of cold pages happens consequently when it
      increments max_seq.  Promotion in the aging path does not involve any LRU
      list operations, only the updates of the gen counter and
      lrugen->nr_pages[]; demotion, unless as the result of the increment of
      max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
      aging has the complexity O(nr_hot_pages), since it is only interested in
      hot pages.
      
      The eviction consumes old generations.  Given an lruvec, it increments
      min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
      A feedback loop modeled after the PID controller monitors refaults over
      anon and file types and decides which type to evict when both types are
      available from the same generation.
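
      The interplay of the two sequence numbers can be sketched as a small
      state machine for a single page type.  This is a simplified model, not
      the kernel code: struct gens and both function names are hypothetical,
      MAX_NR_GENS=4 is an assumed value, and the refault feedback loop is
      left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2 /* minimum number of generations kept */
#define MAX_NR_GENS 4 /* assumed value */

/* Hypothetical, simplified lruvec state for a single page type. */
struct gens {
	unsigned long max_seq;               /* youngest generation */
	unsigned long min_seq;               /* oldest generation */
	unsigned long nr_pages[MAX_NR_GENS]; /* pages per truncated generation */
};

/* Aging: produce a new youngest generation when few generations are left. */
bool try_inc_max_seq(struct gens *g)
{
	unsigned long nr_gens = g->max_seq - g->min_seq + 1;

	if (nr_gens > MIN_NR_GENS)
		return false;
	g->max_seq++;
	return true;
}

/*
 * Eviction: retire the oldest generation once its list is empty, while
 * keeping at least MIN_NR_GENS generations in the window.
 */
bool try_inc_min_seq(struct gens *g)
{
	if (g->min_seq + MIN_NR_GENS > g->max_seq)
		return false;
	if (g->nr_pages[g->min_seq % MAX_NR_GENS] != 0)
		return false;
	g->min_seq++;
	return true;
}
```

      In this model the eviction shrinks the window from the old end and the
      aging extends it at the young end, so the number of live generations
      stays between MIN_NR_GENS and MAX_NR_GENS.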
      
      The protection of pages accessed multiple times through file descriptors
      takes place in the eviction path.  Each generation is divided into
      multiple tiers.  A page accessed N times through file descriptors is in
      tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
      bits in folio->flags.  The aforementioned feedback loop also monitors
      refaults over all tiers and decides when to protect pages in which tiers
      (N>1), using the first tier (N=0,1) as a baseline.  The first tier
      contains single-use unmapped clean pages, which are most likely the best
      choices.  In contrast to promotion in the aging path, the protection of a
      page in the eviction path is achieved by moving this page to the next
      generation, i.e., min_seq+1, if the feedback loop decides so.  This
      approach has the following advantages:
      
      1. It removes the cost of activation in the buffered access path by
         inferring whether pages accessed multiple times through file
         descriptors are statistically hot and thus worth protecting in the
         eviction path.
      2. It takes pages accessed through page tables into account and avoids
         overprotecting pages accessed multiple times through file
         descriptors. (Pages accessed through page tables are in the first
         tier, since N=0.)
      3. More tiers provide better protection for pages accessed more than
         twice through file descriptors, when under heavy buffered I/O
         workloads.
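
      The tier of a page follows directly from order_base_2(N).  The
      userspace sketch below reimplements order_base_2() with the usual
      meaning (the smallest k such that 2^k >= n, with 0 and 1 both mapping
      to 0); tier_from_accesses() is a hypothetical name used only for this
      illustration.

```c
#include <assert.h>

/* Smallest k such that 2^k >= n; both n=0 and n=1 yield 0. */
int order_base_2(unsigned long n)
{
	int order = 0;

	if (n > 1) {
		/* Count the bits needed to represent n-1. */
		for (n--; n != 0; n >>= 1)
			order++;
	}
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
int tier_from_accesses(unsigned long n)
{
	return order_base_2(n);
}
```

      With this definition N=0 and N=1 share the first tier, N=2 lands in
      the next one, N=3 and N=4 share the one after that, and so on, so each
      additional tier covers twice as many access counts as the previous one.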
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): +[30, 32]%
                      IOPS         BW
            5.19-rc1: 2673k        10.2GiB/s
            patch1-6: 3491k        13.3GiB/s
      
        Single workload:
          memcached (anon): -[4, 6]%
                      Ops/sec      KB/sec
            5.19-rc1: 1161501.04   45177.25
            patch1-6: 1106168.46   43025.04
      
        Configurations:
          CPU: two Xeon 6154
          Mem: total 256G
      
          Node 1 was only used as a ram disk to reduce the variance in the
          results.
      
          patch drivers/block/brd.c <<EOF
          99,100c99,100
          < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
          < 	page = alloc_page(gfp_flags);
          ---
          > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
          > 	page = alloc_pages_node(1, gfp_flags, 0);
          EOF
      
          cat >>/etc/systemd/system.conf <<EOF
          CPUAffinity=numa
          NUMAPolicy=bind
          NUMAMask=0
          EOF
      
          cat >>/etc/memcached.conf <<EOF
          -m 184320
          -s /var/run/memcached/memcached.sock
          -a 0766
          -t 36
          -B binary
          EOF
      
          cat fio.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkfs.ext4 /dev/ram0
          mount -t ext4 /dev/ram0 /mnt
      
          mkdir /sys/fs/cgroup/user.slice/test
          echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
          echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
          fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=5m --group_reporting
      
          cat memcached.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
      Client benchmark results:
        kswapd profiles:
          5.19-rc1
            40.33%  page_vma_mapped_walk (overhead)
            21.80%  lzo1x_1_do_compress (real work)
             7.53%  do_raw_spin_lock
             3.95%  _raw_spin_unlock_irq
             2.52%  vma_interval_tree_iter_next
             2.37%  folio_referenced_one
             2.28%  vma_interval_tree_subtree_search
             1.97%  anon_vma_interval_tree_iter_first
             1.60%  ptep_clear_flush
             1.06%  __zram_bvec_write
      
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
        Configurations:
          CPU: single Snapdragon 7c
          Mem: total 4G
      
          ChromeOS MemoryPressure [1]
      
      [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: groundwork · ec1c86b2
      Yu Zhao authored
      Evictable pages are divided into multiple generations for each lruvec.
      The youngest generation number is stored in lrugen->max_seq for both
      anon and file types as they are aged on an equal footing. The oldest
      generation numbers are stored in lrugen->min_seq[] separately for anon
      and file types as clean file pages can be evicted regardless of swap
      constraints. These three variables are monotonically increasing.
      
      Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
      in order to fit into the gen counter in folio->flags. Each truncated
      generation number is an index to lrugen->lists[]. The sliding window
      technique is used to track at least MIN_NR_GENS and at most
      MAX_NR_GENS generations. The gen counter stores a value within [1,
      MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
      stores 0.
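
      The truncation and the off-by-one encoding of the gen counter can be
      sketched as follows.  MAX_NR_GENS=4 is an assumed value and the
      function names are hypothetical; the sketch only mirrors the
      arithmetic described above.

```c
#include <assert.h>

#define MAX_NR_GENS 4 /* assumed value; not stated in the text above */

/* Truncated generation number: an index into the per-generation lists. */
int gen_from_seq(unsigned long seq)
{
	return (int)(seq % MAX_NR_GENS);
}

/* Value stored in the gen counter: gen+1 while on a list, 0 otherwise. */
int gen_counter(unsigned long seq, int on_list)
{
	return on_list ? gen_from_seq(seq) + 1 : 0;
}

/* Bits needed for counter values 0..MAX_NR_GENS: order_base_2(MAX_NR_GENS+1). */
int gen_counter_width(void)
{
	int bits = 0;

	while ((1 << bits) < MAX_NR_GENS + 1)
		bits++;
	return bits;
}
```

      Under the assumed MAX_NR_GENS=4 the counter needs 3 bits: values 1 to 4
      identify a list, and 0 is reserved for pages that are on none of them.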
      
      There are two conceptually independent procedures: "the aging", which
      produces young generations, and "the eviction", which consumes old
      generations.  They form a closed-loop system, i.e., "the page reclaim". 
      Both procedures can be invoked from userspace for the purposes of working
      set estimation and proactive reclaim.  These techniques are commonly used
      to optimize job scheduling (bin packing) in data centers [1][2].
      
      To avoid confusion, the terms "hot" and "cold" will be applied to the
      multi-gen LRU, as a new convention; the terms "active" and "inactive" will
      be applied to the active/inactive LRU, as usual.
      
      The protection of hot pages and the selection of cold pages are based
      on page access channels and patterns. There are two access channels:
      one through page tables and the other through file descriptors. The
      protection of the former channel is by design stronger because:
      1. The uncertainty in determining the access patterns of the former
         channel is higher due to the approximation of the accessed bit.
      2. The cost of evicting the former channel is higher due to the TLB
         flushes required and the likelihood of encountering the dirty bit.
      3. The penalty of underprotecting the former channel is higher because
         applications usually do not prepare themselves for major page
         faults like they do for blocked I/O. E.g., GUI applications
         commonly use dedicated I/O threads to avoid blocking rendering
         threads.
      
      There are also two access patterns: one with temporal locality and the
      other without.  For the reasons listed above, accesses through page
      tables are assumed to have temporal locality unless VM_SEQ_READ or
      VM_RAND_READ is present; accesses through file descriptors are assumed
      to lack it unless outlying refaults have been observed [3][4].
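
      The default assumptions above can be summarized as a small decision
      function.  This is only a sketch: the enum, the HINT_* bits (stand-ins
      for VM_SEQ_READ and VM_RAND_READ) and the function name are
      hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* The two access channels named in the text. */
enum access_channel {
	CHANNEL_PAGE_TABLE,
	CHANNEL_FILE_DESCRIPTOR,
};

/* Hypothetical stand-ins for VM_SEQ_READ and VM_RAND_READ. */
#define HINT_SEQ_READ  0x1U
#define HINT_RAND_READ 0x2U

/*
 * Returns true when an access is assumed to have temporal locality.
 * Page table accesses default to "yes" unless a read hint says otherwise;
 * file descriptor accesses default to "no" unless outlying refaults
 * have been observed.
 */
static bool assume_temporal_locality(enum access_channel ch,
				     unsigned int hints,
				     bool outlying_refaults)
{
	assert(ch == CHANNEL_PAGE_TABLE || ch == CHANNEL_FILE_DESCRIPTOR);

	if (ch == CHANNEL_PAGE_TABLE)
		return !(hints & (HINT_SEQ_READ | HINT_RAND_READ));

	return outlying_refaults;
}
```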
      
      The next patch will address the "outlying refaults".  Three macros, i.e.,
      LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, which are only used
      later, are added in this patch to keep the diffs of the following
      patches smaller.
      
      A page is added to the youngest generation on faulting.  The aging needs
      to check the accessed bit at least twice before handing this page over to
      the eviction.  The first check takes care of the accessed bit set on the
      initial fault; the second check makes sure this page has not been used
      since then.  This protocol, AKA second chance, requires a minimum of two
      generations, hence MIN_NR_GENS.
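
      The second chance protocol can be modeled with a toy simulation.  The
      sketch below assumes MIN_NR_GENS is 2 and uses hypothetical types and
      helpers (lru_sketch, page_sketch, fault_in(), aging_pass(),
      evictable()); it only mirrors the description above and is not the
      implementation in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_NR_GENS 2UL	/* assumed value for illustration */

struct lru_sketch {
	unsigned long max_seq;	/* sequence number of the youngest generation */
};

struct page_sketch {
	bool accessed;		/* stand-in for the accessed bit */
	unsigned long seq;	/* generation the page currently belongs to */
};

/* A faulting page joins the youngest generation with its accessed bit set. */
static void fault_in(const struct lru_sketch *lru, struct page_sketch *page)
{
	page->accessed = true;
	page->seq = lru->max_seq;
}

/*
 * One aging pass: open a new youngest generation, then test and clear the
 * accessed bit of every page.  Pages found accessed move to the new
 * youngest generation; the others stay where they are and grow older.
 */
static void aging_pass(struct lru_sketch *lru, struct page_sketch *pages,
		       size_t n)
{
	lru->max_seq++;
	for (size_t i = 0; i < n; i++) {
		if (pages[i].accessed) {
			pages[i].accessed = false;
			pages[i].seq = lru->max_seq;
		}
	}
}

/* A page may be handed to the eviction once it is MIN_NR_GENS - 1 behind. */
static bool evictable(const struct lru_sketch *lru,
		      const struct page_sketch *page)
{
	assert(page->seq <= lru->max_seq);
	return lru->max_seq - page->seq >= MIN_NR_GENS - 1;
}
```

      In this model a page that is faulted in and never touched again only
      becomes evictable after the second aging pass: the first pass consumes
      the accessed bit set by the fault, and the second finds it still clear.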
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      [3] https://lwn.net/Articles/495543/
      [4] https://lwn.net/Articles/815342/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec1c86b2