1. 24 Feb, 2024 23 commits
    • mm: vmalloc: support multiple nodes in vread_iter · 53becf32
      Uladzislau Rezki (Sony) authored
      Extend vread_iter() so that it can sequentially read VAs that are spread
      across multiple nodes, so that data read over /dev/kmem correctly
      reflects the vmalloc memory layout.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-9-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: add a scan area of VA only once · 96aa8437
      Uladzislau Rezki (Sony) authored
      Invoke the kmemleak_scan_area() function only for newly allocated objects
      to add a scan area within that object.  There is no reason to add the
      same scan area (a pointer to the beginning of or inside the object)
      several times.  If a VA is obtained from the cache, its scan area has
      already been associated.
      
      Link: https://lkml.kernel.org/r/20240202190628.47806-1-urezki@gmail.com
      Fixes: 7db166b4aa0d ("mm: vmalloc: offload free_vmap_area_lock lock")
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: offload free_vmap_area_lock lock · 72210662
      Uladzislau Rezki (Sony) authored
      Concurrent access to the global vmap space is a bottleneck.  High
      contention can be simulated by running the vmalloc test suite.
      
      To address it, introduce an effective vmap node logic.  Each node behaves
      as an independent entity.  When a node is accessed, it serves a request
      directly (if possible) from its pool.
      
      This model has a size-based pool for requests, i.e. pools are serialized
      and populated based on object size and real demand.  The maximum object
      size that a pool can handle is set to 256 pages.
      
      This technique reduces pressure on the global vmap lock.
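
      A minimal sketch of the size-based pool lookup described above, assuming
      4 KiB pages and one pool slot per page count from 1 to 256.  The names
      PAGE_SIZE_BYTES, MAX_POOL_PAGES and size_to_pool_index() are hypothetical
      and chosen for illustration; the patch itself may organise its pools
      differently.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL  /* assumption: 4 KiB pages */
#define MAX_POOL_PAGES  256UL   /* from the commit message: a pool handles at most 256 pages */

/*
 * Map a request size to a pool slot (hypothetical helper).
 * Returns 0 for a 1-page request up to 255 for a 256-page request,
 * or -1 when the size is zero, not page-aligned, or too large for any
 * pool; in that case a caller would fall back to the global vmap space.
 */
static long size_to_pool_index(size_t size)
{
	size_t pages;

	if (size == 0 || size % PAGE_SIZE_BYTES != 0)
		return -1;

	pages = size / PAGE_SIZE_BYTES;
	if (pages > MAX_POOL_PAGES)
		return -1;

	return (long)(pages - 1);
}
```

      Keeping one slot per page count means a request can be matched to a
      cached VA of exactly the right size without searching.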
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-8-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: remove global purge_vmap_area_root rb-tree · 282631cb
      Uladzislau Rezki (Sony) authored
      Similar to a busy VA, a lazily-freed area is stored in the node it
      belongs to.  Such an approach does not require any global locking
      primitive; instead, access becomes scalable, which mitigates contention.
      
      This patch removes the global purge-lock, the global purge-tree and the
      global purge list.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-7-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmalloc: remove vmap_area_list · 55c49fee
      Baoquan He authored
      Earlier, vmap_area_list was exported to vmcoreinfo so that makedumpfile
      could get the base address of the vmalloc area.  Now vmap_area_list is
      empty, so export VMALLOC_START to vmcoreinfo instead, and remove
      vmap_area_list.
      
      [urezki@gmail.com: fix a warning in the crash_save_vmcoreinfo_init()]
        Link: https://lkml.kernel.org/r/20240111192329.449189-1-urezki@gmail.com
      Link: https://lkml.kernel.org/r/20240102184633.748113-6-urezki@gmail.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: remove global vmap_area_root rb-tree · d0936029
      Uladzislau Rezki (Sony) authored
      Store allocated objects in separate nodes.  A va->va_start address is
      converted into the node where the VA should be placed and reside.  The
      addr_to_node() function performs this address conversion to determine
      the node that contains a VA.
      
      Such an approach balances VAs across nodes; as a result, access becomes
      scalable.  The number of nodes in a system depends on the number of CPUs.
      
      Please note:
      
      1. As of now, allocated VAs are bound to node-0.  This means the
         patch makes no difference compared with the current
         behavior;
      
      2. The global vmap_area_lock and vmap_area_root are removed as there
         is no longer any need for them.  The vmap_area_list is still kept
         and is _empty_.  It is exported for kexec only;
      
      3. The vmallocinfo and vread() have to be reworked to be able to
         handle multiple nodes.
      
      [urezki@gmail.com: mark vmap_init_free_space() with __init tag]
        Link: https://lkml.kernel.org/r/20240111132628.299644-1-urezki@gmail.com
      [urezki@gmail.com: fix a wrong value passed to __find_vmap_area()]
        Link: https://lkml.kernel.org/r/20240111121104.180993-1-urezki@gmail.com
      Link: https://lkml.kernel.org/r/20240102184633.748113-5-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: move vmap_init_free_space() down in vmalloc.c · 7fa8cee0
      Uladzislau Rezki (Sony) authored
      vmap_init_free_space() is a function that sets up the vmap space and is
      considered part of the initialization phase.  Since the main entry point,
      vmalloc_init(), has been moved down in vmalloc.c, it makes sense to
      follow the pattern.
      
      There is no functional change as a result of this patch.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-4-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: rename adjust_va_to_fit_type() function · 5b75b8e1
      Uladzislau Rezki (Sony) authored
      This patch renames the adjust_va_to_fit_type() function to va_clip(), which
      is shorter and more expressive.
      
      There is no functional change as a result of this patch.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-3-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: add va_alloc() helper · 38f6b9af
      Uladzislau Rezki (Sony) authored
      Patch series "Mitigate a vmap lock contention", v3.
      
      1. Motivation
      
      - Offload the global vmap locks so that they scale with the number of
        CPUs;
      
      - If possible and there is an agreement, we can remove the "Per cpu kva
        allocator" to make the vmap code simpler;
      
      - There were complaints from XFS folk that vmalloc might be contended
        on their workloads.
      
      2. Design (high level overview)
      
      We introduce an effective vmap node logic.  A node behaves as an
      independent entity that serves an allocation request directly (if
      possible) from its pool.  That way it bypasses the global vmap space,
      which is protected by its own lock.
      
      Access to pools is serialized by CPUs.  The number of nodes is equal to
      the number of CPUs in a system.  Please note that the upper limit is
      128 nodes.
      
      Pools are size segregated and populated based on system demand.  The
      maximum alloc request that can be stored in a segregated storage is 256
      pages.  The lazy drain path first decays a pool by 25% and then
      populates it with freshly freed VAs for reuse instead of returning them
      to the global space.
      
      When a VA is obtained (alloc path), it is stored in separate nodes.  A
      va->va_start address is converted into the node where the VA should be
      placed and reside.  Doing so balances VAs across the nodes; as a result,
      access becomes scalable.  The addr_to_node() function converts an
      address to the correct node.
      
      The vmap space is divided into segments of a fixed size, 16 pages.  That
      way any address can be associated with a segment number.  The number of
      segments is equal to num_possible_cpus() but not greater than 128.
      Numbering starts from 0.  See below how it is converted:
      
      static inline unsigned int
      addr_to_node_id(unsigned long addr)
      {
      	return (addr / zone_size) % nr_nodes;
      }
      
      On the free path, a VA can be easily found by converting its "va_start"
      address to the node it resides in.  It is moved from the "busy" data
      structure to the "lazy" one.  Later on, as noted earlier, the lazy
      kworker decays each node pool and populates it with fresh incoming VAs.
      Please note that a VA is returned to the node that did the alloc request.
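
      The address-to-node conversion above can be tried outside the kernel
      with a small sketch.  It assumes 4 KiB pages, so a 16-page segment is
      64 KiB; ZONE_SIZE, MAX_NODES and clamp_nr_nodes() are illustrative names
      rather than kernel definitions, and the node count is passed in as a
      parameter instead of being read from a global.

```c
#define PAGE_SIZE_BYTES 4096UL                    /* assumption: 4 KiB pages */
#define ZONE_SIZE       (16UL * PAGE_SIZE_BYTES)  /* 16-page segments, per the text */
#define MAX_NODES       128U                      /* upper limit stated in the text */

/* Clamp a CPU count to the stated limit of 128 nodes (at least one node). */
static unsigned int clamp_nr_nodes(unsigned int nr_cpus)
{
	if (nr_cpus == 0)
		return 1;
	return nr_cpus > MAX_NODES ? MAX_NODES : nr_cpus;
}

/* Same arithmetic as addr_to_node_id() above: segment number modulo nodes. */
static unsigned int addr_to_node_id(unsigned long addr, unsigned int nr_nodes)
{
	return (unsigned int)((addr / ZONE_SIZE) % nr_nodes);
}
```

      With four nodes, addresses 0x0 to 0xffff land on node 0, 0x10000 lands
      on node 1, and 0x40000 wraps around to node 0 again, so consecutive
      segments are spread round-robin across the nodes.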
      
      3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
      
      sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      
      <default perf>
       94.41%     0.89%  [kernel]        [k] _raw_spin_lock
       93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
       76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
       72.96%     0.81%  [kernel]        [k] alloc_vmap_area
       56.94%     0.00%  [kernel]        [k] __get_vm_area_node
       41.95%     0.00%  [kernel]        [k] vmalloc
       37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
       35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
       35.17%     0.00%  [kernel]        [k] ret_from_fork
       35.17%     0.00%  [kernel]        [k] kthread
       35.08%     0.00%  [test_vmalloc]  [k] test_func
       34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
       28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
       23.53%     0.25%  [kernel]        [k] vfree.part.0
       21.72%     0.00%  [kernel]        [k] remove_vm_area
       20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
        2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
      <default perf>
         vs
      <patch-series perf>
       82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
       63.36%     0.02%  [kernel]        [k] vmalloc
       63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
       30.42%     4.46%  [kernel]        [k] vfree.part.0
       28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
       27.28%     0.19%  [kernel]        [k] __get_vm_area_node
       26.13%     1.50%  [kernel]        [k] alloc_vmap_area
       21.72%    21.67%  [kernel]        [k] clear_page_rep
       19.51%     2.43%  [kernel]        [k] _raw_spin_lock
       16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
       13.40%     2.07%  [kernel]        [k] free_unref_page
       10.62%     0.01%  [kernel]        [k] remove_vm_area
        9.02%     8.73%  [kernel]        [k] insert_vmap_area
        8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
        8.94%     0.00%  [kernel]        [k] ret_from_fork
        8.94%     0.00%  [kernel]        [k] kthread
        8.29%     0.00%  [test_vmalloc]  [k] test_func
        7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
        5.30%     4.73%  [kernel]        [k] purge_vmap_node
        4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
      <patch-series perf>
      
      The comparison confirms that native_queued_spin_lock_slowpath drops
      from 93.07% to 16.51%.
      
      The throughput is ~12x higher:
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    10m51.271s
      user    0m0.013s
      sys     0m0.187s
      urezki@pc638:~$
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    0m51.301s
      user    0m0.015s
      sys     0m0.040s
      urezki@pc638:~$
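
      The "~12x" figure can be checked from the two "real" times above.  This
      sketch assumes the first run (10m51.271s) is the default kernel and the
      second (0m51.301s) is the patched one, which the output itself does not
      label; the helper names are illustrative.

```c
/* Convert a "real" time as printed by time(1) into seconds. */
static double to_seconds(unsigned int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* Ratio of the slower run time to the faster run time. */
static double speedup(double before_s, double after_s)
{
	return before_s / after_s;
}
```

      speedup(to_seconds(10, 51.271), to_seconds(0, 51.301)) evaluates to
      roughly 12.7, consistent with the stated ~12x.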
      
      
      This patch (of 11):
      
      Currently the __alloc_vmap_area() function contains open-coded logic that
      finds and adjusts a VA based on the allocation request.
      
      Introduce a va_alloc() helper that only adjusts the found VA.  There is
      no functional change as a result of this patch.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-1-urezki@gmail.com
      Link: https://lkml.kernel.org/r/20240102184633.748113-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: update Documentation regarding page_owner_stacks · ba6fe537
      Oscar Salvador authored
      Update page_owner documentation including the new page_owner_stacks
      feature to show how it can be used.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-8-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Marco Elver <elver@google.com>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: filter out stacks by a threshold · 05bb6f4e
      Oscar Salvador authored
      We want to be able to filter out the stacks based on a threshold we can
      tune.  By writing to the 'count_threshold' file, we can adjust the
      threshold value.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-7-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: display all stacks and their count · 765973a0
      Oscar Salvador authored
      This patch adds a new directory called 'page_owner_stacks' under
      /sys/kernel/debug/, with a file called 'show_stacks' in it.  Reading from
      that file will show all stacks that were added by page_owner followed by
      their counts, giving us a clear overview of the stack <-> count
      relationship.
      
      E.g:
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_write+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 4578
      
      The seq stack_{start,next} functions will iterate through the list
      stack_list in order to print all stacks.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-6-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: implement the tracking of the stacks count · 217b2119
      Oscar Salvador authored
      Implement {inc,dec}_stack_record_count(), which increment or decrement
      the count on the respective allocation and free operations, via
      __set_page_owner() (alloc operation) and __reset_page_owner() (free
      operation).
      
      Newly allocated stack_record structs will be added to the list stack_list
      via add_stack_record_to_list().  Modifications to the list are protected
      by a spinlock with IRQs disabled, since this code can also be reached
      from IRQ context.
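
      A userspace sketch of the counting scheme described above.  It is an
      analogue, not the kernel code: the struct layout, the on_list flag and
      demo_outstanding() are assumptions made for illustration, a C11
      atomic_flag stands in for the kernel spinlock, and disabling IRQs has no
      userspace equivalent.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct stack_record. */
struct stack_record {
	atomic_int count;            /* outstanding allocations for this stack */
	struct stack_record *next;   /* link in stack_list */
	int on_list;                 /* set once the record has been added */
};

static struct stack_record *stack_list;                /* head of the list */
static atomic_flag stack_list_lock = ATOMIC_FLAG_INIT; /* stand-in spinlock */

static void list_lock(void)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&stack_list_lock,
						 memory_order_acquire))
		;
}

static void list_unlock(void)
{
	atomic_flag_clear_explicit(&stack_list_lock, memory_order_release);
}

/* Add a record to the list the first time it is seen. */
static void add_stack_record_to_list(struct stack_record *rec)
{
	list_lock();
	if (!rec->on_list) {
		rec->next = stack_list;
		stack_list = rec;
		rec->on_list = 1;
	}
	list_unlock();
}

/* Allocation path: make sure the record is listed, then count it. */
static void inc_stack_record_count(struct stack_record *rec)
{
	add_stack_record_to_list(rec);
	atomic_fetch_add(&rec->count, 1);
}

/* Free path: one fewer outstanding allocation for this stack. */
static void dec_stack_record_count(struct stack_record *rec)
{
	atomic_fetch_sub(&rec->count, 1);
}

/* Demo: outstanding count after some allocations and frees on one record. */
static int demo_outstanding(int allocs, int frees)
{
	static struct stack_record rec;  /* zero-initialised, outlives the call */

	atomic_store(&rec.count, 0);
	for (int i = 0; i < allocs; i++)
		inc_stack_record_count(&rec);
	for (int i = 0; i < frees; i++)
		dec_stack_record_count(&rec);
	return atomic_load(&rec.count);
}
```

      Only the list update needs the lock; the per-record counter is updated
      atomically, so concurrent allocations and frees of the same stack do
      not have to serialize on the list lock.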
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-5-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,page_owner: maintain own list of stack_records structs · 4bedfb31
      Oscar Salvador authored
      page_owner needs to increment a stack_record refcount when a new
      allocation occurs, and decrement it on a free operation.  In order to do
      that, we need to have a way to get a stack_record from a handle.
      Implement __stack_depot_get_stack_record(), which does just that, and make
      it public so page_owner can use it.
      
      Also, traversing all stackdepot buckets comes with its own complexity,
      plus we would have to implement a way to mark only those stack_records
      that originated from page_owner, as those are the ones we are
      interested in.  For that reason, page_owner maintains its own list of
      stack_records, because traversing that list is faster than traversing all
      buckets while at the same time keeping complexity low.
      
      For now, add to stack_list only the stack_records of dummy_handle and
      failure_handle, and set their refcount to 1.
      
      Further patches will add code to increment or decrement the stack_records
      count on allocation and free operations.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-4-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Marco Elver <elver@google.com>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/stackdepot: move stack_record struct definition into the header · 8151c7a3
      Oscar Salvador authored
      In order to move the heavy lifting into page_owner code, this one needs to
      have access to the stack_record structure, which right now sits in
      lib/stackdepot.c.  Move it to the stackdepot.h header so page_owner can
      access stack_record's struct fields.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/stackdepot: fix first entry having a 0-handle · 3ee34eab
      Oscar Salvador authored
      Patch series "page_owner: print stacks and their outstanding allocations",
      v10.
      
      page_owner is a great debugging tool that lets us know about all pages
      that have been allocated/freed and their specific stacktrace.  This
      comes in very handy when debugging memory leaks, since with some
      scripting we can see the outstanding allocations, which might point to a
      memory leak.
      
      In my experience, that is one of the most useful cases, but it can get
      really tedious to screen through all pages and try to reconstruct the
      stack <-> allocated/freed relationship, becoming most of the time a
      daunting and slow process when we have tons of allocation/free operations.
      
      This patchset aims to ease that by adding a new functionality into
      page_owner.  This functionality creates a new directory called
      'page_owner_stacks' under '/sys/kernel/debug' with a read-only file called
      'show_stacks', which prints out all the stacks followed by their
      outstanding number of allocations (that is, the number of times the
      stacktrace has allocated memory that has not been freed yet).  This gives
      us a clear and quick overview of stacks <-> allocated/free.
      
      We take advantage of the new refcount_t field that the stack_record
      struct gained, and increment/decrement the stack refcount on every
      __set_page_owner() (alloc operation) and __reset_page_owner() (free
      operation) call.
      
      Unfortunately, we cannot use the new stackdepot api STACK_DEPOT_FLAG_GET
      because it does not fulfill page_owner needs, meaning we would have to
      special case things, at which point it makes more sense for page_owner to
      do its own {dec,inc}rementing of the stacks.  E.g. using
      STACK_DEPOT_FLAG_PUT, once the refcount reaches 0, such a stack gets
      evicted, so page_owner would lose information.
      
      This patchset also creates a new file called 'set_threshold' within the
      'page_owner_stacks' directory, and by writing a value to it, the stacks
      whose refcount is below that value will be filtered out.
      
      A PoC can be found below:
      
       # cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks.txt
       # head -40 page_owner_full_stacks.txt 
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        page_cache_ra_unbounded+0x96/0x180
        filemap_get_pages+0xfd/0x590
        filemap_read+0xcc/0x330
        blkdev_read_iter+0xb8/0x150
        vfs_read+0x285/0x320
        ksys_read+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 521
      
      
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_write+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 4609
      ...
      ...
      
       # echo 5000 > /sys/kernel/debug/page_owner_stacks/set_threshold 
       # cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks_5000.txt
       # head -40 page_owner_full_stacks_5000.txt 
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_pwrite64+0x75/0x90
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 6781
      
      
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        pcpu_populate_chunk+0xec/0x350
        pcpu_balance_workfn+0x2d1/0x4a0
        process_scheduled_works+0x84/0x380
        worker_thread+0x12a/0x2a0
        kthread+0xe3/0x110
        ret_from_fork+0x30/0x50
        ret_from_fork_asm+0x1b/0x30
       stack_count: 8641
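
      The filtering shown in the PoC above can be sketched in a few lines of
      C.  The stack_entry struct and both helpers are hypothetical; the counts
      are the four stack_count values printed above, each labelled with one
      frame from its trace.  A stack is kept when its count is not below the
      threshold, following the description of the threshold file.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one tracked stack. */
struct stack_entry {
	const char *frame;     /* one frame from the trace, used as a label */
	unsigned long count;   /* outstanding allocations for this stack */
};

/* Counts taken from the PoC output above; the labels are illustrative. */
static const struct stack_entry sample[] = {
	{ "ksys_read",            521 },
	{ "ksys_write",          4609 },
	{ "ksys_pwrite64",       6781 },
	{ "pcpu_populate_chunk", 8641 },
};

/* A stack is shown only when its count is not below the threshold. */
static int is_shown(const struct stack_entry *e, unsigned long threshold)
{
	return e->count >= threshold;
}

/* Number of entries that survive the filter. */
static size_t count_shown(const struct stack_entry *entries, size_t n,
			  unsigned long threshold)
{
	size_t shown = 0;

	for (size_t i = 0; i < n; i++)
		shown += is_shown(&entries[i], threshold) ? 1 : 0;
	return shown;
}
```

      With a threshold of 5000, only the 6781 and 8641 entries remain, which
      matches the two stacks printed after the threshold was written.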
      
      
      This patch (of 7):
      
      The very first entry of stack_record gets a handle of 0, but this is wrong
      because stackdepot treats a 0-handle as an invalid one.  E.g. see the
      check in stack_depot_fetch().
      
      Fix this by adding an offset of 1.
      
      This bug has been lurking since the very beginning of stackdepot, but it
      seems no one really cared.  Because of that I am not adding a Fixes
      tag.
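
      A minimal sketch of the idea behind the fix, assuming a simplified
      handle that is just a record index shifted by one.  The real stackdepot
      handle packs several fields, so the typedef, HANDLE_OFFSET and both
      helpers here are illustrative only.

```c
#include <stdint.h>

/* Hypothetical flat handle; the real one packs several fields. */
typedef uint32_t depot_handle_t;

#define HANDLE_OFFSET 1U  /* assumption: shift indices so that 0 stays invalid */

/* Encode a zero-based record index (below UINT32_MAX) as a non-zero handle. */
static depot_handle_t index_to_handle(uint32_t index)
{
	return index + HANDLE_OFFSET;
}

/* Decode a handle; returns -1 for the reserved invalid handle 0. */
static int64_t handle_to_index(depot_handle_t handle)
{
	if (handle == 0)
		return -1;
	return (int64_t)handle - HANDLE_OFFSET;
}
```

      Without the offset, the first record (index 0) would encode to handle 0
      and be rejected by any lookup that treats 0 as "no stack".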
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20240215215907.20121-2-osalvador@suse.de
      Co-developed-by: Marco Elver <elver@google.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/debug_vm_pgtable: fix BUG_ON with pud advanced test · 720da1e5
      Aneesh Kumar K.V (IBM) authored
      Architectures like powerpc add debug checks to ensure we find only devmap
      PUD pte entries.  These debug checks are only done with CONFIG_DEBUG_VM.
      This patch marks the ptes used for the PUD advanced test as devmap pte
      entries so that we don't hit the debug checks on architectures like
      ppc64, as shown below.
      
      WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138
      ....
      NIP [c0000000000a7004] radix__pud_hugepage_update+0x38/0x138
      LR [c0000000000a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
      Call Trace:
      [c000000004a2f950] [c000000004a2f9a0] 0xc000000004a2f9a0 (unreliable)
      [c000000004a2f980] [000d34c100000000] 0xd34c100000000
      [c000000004a2f9a0] [c00000000206ba98] pud_advanced_tests+0x118/0x334
      [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
      [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
      
      Also
      
       kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
       ....
      
       NIP [c000000000096510] pudp_huge_get_and_clear_full+0x98/0x174
       LR [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       Call Trace:
       [c000000004a2f950] [000d34c100000000] 0xd34c100000000 (unreliable)
       [c000000004a2f9a0] [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
       [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
      
      Link: https://lkml.kernel.org/r/20240129060022.68044-1-aneesh.kumar@kernel.org
      Fixes: 27af67f3 ("powerpc/book3s64/mm: enable transparent pud hugepage")
      Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      720da1e5
    • Nhat Pham's avatar
      mm: cachestat: fix folio read-after-free in cache walk · 3a75cb05
      Nhat Pham authored
      In cachestat, we access the folio from the page cache's xarray to compute
      its page offset, and check for its dirty and writeback flags.  However, we
      do not hold a reference to the folio before performing these actions,
      which means the folio can concurrently be released and reused as another
      folio/page/slab.
      
      Get around this altogether by just using xarray's existing machinery for
      the folio page offsets and dirty/writeback states.
      
      This changes behavior for tmpfs files to now always report zeroes in their
      dirty and writeback counters.  This is okay as tmpfs doesn't follow
      conventional writeback cache behavior: its pages get "cleaned" during
      swapout, after which they're no longer resident etc.
      
      Link: https://lkml.kernel.org/r/20240220153409.GA216065@cmpxchg.org
      Fixes: cf264e13 ("cachestat: implement cachestat syscall")
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Jann Horn <jannh@google.com>
      Cc: <stable@vger.kernel.org>	[6.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a75cb05
    • Lorenzo Stoakes's avatar
      MAINTAINERS: add memory mapping entry with reviewers · 00130266
      Lorenzo Stoakes authored
      Recently there have been a number of patches which have affected various
      aspects of the memory mapping logic as implemented in mm/mmap.c where it
      would have been useful for regular contributors to have been notified.
      
      Add an entry for this part of mm in particular with regular contributors
      tagged as reviewers.
      
      Link: https://lkml.kernel.org/r/20240220064410.4639-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      00130266
    • Byungchul Park's avatar
      mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index · 2774f256
      Byungchul Park authored
      With NUMA balancing enabled, the following oops has been observed on a
      NUMA system where a node has no local memory and therefore no managed
      zones.  It happens because wakeup_kswapd() is called with a wrong zone
      index, -1.  Fix it by checking the index before calling
      wakeup_kswapd().
      
      > BUG: unable to handle page fault for address: 00000000000033f3
      > #PF: supervisor read access in kernel mode
      > #PF: error_code(0x0000) - not-present page
      > PGD 0 P4D 0
      > Oops: 0000 [#1] PREEMPT SMP NOPTI
      > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
      > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      >    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
      > Code: (omitted)
      > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
      > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
      > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
      > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
      > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
      > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
      > FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
      > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      > PKRU: 55555554
      > Call Trace:
      >  <TASK>
      > ? __die
      > ? page_fault_oops
      > ? __pte_offset_map_lock
      > ? exc_page_fault
      > ? asm_exc_page_fault
      > ? wakeup_kswapd
      > migrate_misplaced_page
      > __handle_mm_fault
      > handle_mm_fault
      > do_user_addr_fault
      > exc_page_fault
      > asm_exc_page_fault
      > RIP: 0033:0x55b897ba0808
      > Code: (omitted)
      > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
      > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
      > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
      > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
      > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
      >  </TASK>
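The check described above can be sketched as follows. This is a simplified model, not the kernel code: a node is represented as a list of per-zone managed page counts, and `highest_managed_zone()` and `maybe_wake_kswapd()` are hypothetical names used only for illustration.

```python
def highest_managed_zone(managed_pages):
    """Return the index of the highest zone with managed pages, or -1 if none."""
    for z in range(len(managed_pages) - 1, -1, -1):
        if managed_pages[z] > 0:
            return z
    return -1  # the node has no managed zones at all


def maybe_wake_kswapd(managed_pages, wake):
    """Call wake(zone_index) only when a valid zone index was found."""
    z = highest_managed_zone(managed_pages)
    if z < 0:
        # A memoryless node: do not proceed with an index of -1.
        return False
    wake(z)
    return True
```

For a node whose zones are all empty, the search falls through to -1 and the wakeup is skipped rather than being issued with an out-of-range index.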
      
      Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
      Signed-off-by: Byungchul Park <byungchul@sk.com>
      Reported-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Fixes: c574bbe9 ("NUMA balancing: optimize page placement for memory tiering system")
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2774f256
    • Marco Elver's avatar
      kasan: revert eviction of stack traces in generic mode · 711d3491
      Marco Elver authored
      This partially reverts commits cc478e0b, 63b85ac5, 08d7c94d,
      a414d428, and 773688a6 to make use of variable-sized stack depot
      records, since eviction of stack entries from stack depot forces fixed-
      sized stack records.  Care was taken to retain the code cleanups by the
      above commits.
      
      Eviction was added to generic KASAN in response to the additional memory
      usage from fixed-sized stack records, but this still uses more memory
      than previously.
      
      With the re-introduction of variable-sized records for stack depot, we can
      just switch back to non-evictable stack records, and return to the
      previous performance and memory usage baseline.
      
      Before (observed after a KASAN kernel boot):
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
      
      After:
      
        pools: 319
        refcounted_allocations: 0
        refcounted_frees: 0
        refcounted_in_use: 0
        freelist_size: 0
        persistent_count: 29397
        persistent_bytes: 5183536
      
      As can be seen from the counters, with a generic KASAN config, refcounted
      allocations and evictions are no longer used.  Due to using variable-sized
      records, I observe a reduction of 278 stack depot pools (saving 4448 KiB)
      with my test setup.
      
      Link: https://lkml.kernel.org/r/20240129100708.39460-2-elver@google.com
      Fixes: cc478e0b ("kasan: avoid resetting aux_lock")
      Fixes: 63b85ac5 ("kasan: stop leaking stack trace handles")
      Fixes: 08d7c94d ("kasan: memset free track in qlink_free")
      Fixes: a414d428 ("kasan: handle concurrent kasan_record_aux_stack calls")
      Fixes: 773688a6 ("kasan: use stack_depot_put for Generic mode")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      711d3491
    • Marco Elver's avatar
      stackdepot: use variable size records for non-evictable entries · 31639fd6
      Marco Elver authored
      With the introduction of stack depot evictions, each stack record is now
      fixed size, so that future reuse after an eviction can safely store
      differently sized stack traces.  In all cases that do not make use of
      evictions, this wastes lots of space.
      
      Fix it by re-introducing variable size stack records (up to the max
      allowed size) for entries that will never be evicted.  We know an entry
      will never be evicted if the flag STACK_DEPOT_FLAG_GET is not provided,
      since a later stack_depot_put() attempt is undefined behavior.
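The sizing rule can be illustrated with a small model. This is a sketch under stated assumptions rather than the stack depot implementation: `WORD_BYTES`, `HEADER_BYTES`, and `MAX_FRAMES` are placeholder values, and `record_bytes()` is a hypothetical helper.

```python
WORD_BYTES = 8     # assumed size of one stack frame entry (64-bit)
HEADER_BYTES = 32  # hypothetical per-record header size
MAX_FRAMES = 64    # assumed maximum number of frames per record


def record_bytes(nr_entries, may_be_evicted):
    """Bytes reserved for one stack record.

    Evictable records reserve space for MAX_FRAMES so the slot can later be
    reused for a trace of any length. Records that are never evicted only
    reserve what the trace needs, capped at MAX_FRAMES.
    """
    frames = MAX_FRAMES if may_be_evicted else min(nr_entries, MAX_FRAMES)
    return HEADER_BYTES + frames * WORD_BYTES
```

Under these placeholder values a 10-frame trace takes 112 bytes as a never-evicted record and 544 bytes as an evictable one, which is the kind of waste the variable-size records avoid.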
      
      With my current kernel config that enables KASAN and also SLUB owner
      tracking, I observe (after a kernel boot) a whopping reduction of 296
      stack depot pools, which translates into 4736 KiB saved.  The savings here
      are from SLUB owner tracking only, because KASAN generic mode still uses
      refcounting.
      
      Before:
      
        pools: 893
        allocations: 29841
        frees: 6524
        in_use: 23317
        freelist_size: 3454
      
      After:
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
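The quoted savings can be cross-checked from the counters. The sketch below assumes all pools have the same size and derives that size from the figures quoted in this message (4736 KiB over 296 pools); the helper names are made up for illustration.

```python
# Pool size implied by the quoted figures: 4736 KiB saved over 296 pools.
POOL_KIB = 4736 // 296  # 16 KiB per pool


def pools_saved(pools_before, pools_after):
    """Number of stack depot pools no longer needed."""
    return pools_before - pools_after


def kib_saved(pools_before, pools_after, pool_kib=POOL_KIB):
    """Memory saved in KiB, assuming equally sized pools."""
    return pools_saved(pools_before, pools_after) * pool_kib


def avg_persistent_record_bytes(persistent_bytes, persistent_count):
    """Average size of a persistent (never evicted) record in bytes."""
    return persistent_bytes / persistent_count
```

With the counters above, `pools_saved(893, 597)` gives 296 and `kib_saved(893, 597)` gives 4736, and the persistent records average roughly 141 bytes each (1717008 / 12163).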
      
      [elver@google.com: fix -Wstringop-overflow warning]
        Link: https://lore.kernel.org/all/20240201135747.18eca98e@canb.auug.org.au/
        Link: https://lkml.kernel.org/r/20240201090434.1762340-1-elver@google.com
        Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20240129100708.39460-1-elver@google.com
      Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
      Fixes: 108be8de ("lib/stackdepot: allow users to evict stack traces")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      31639fd6
  2. 22 Feb, 2024 17 commits