1. 26 Apr, 2024 40 commits
    • rust: add a rust helper for krealloc() · 53ed0af4
      Kent Overstreet authored
      Memory allocation profiling is turning krealloc() into a nontrivial macro,
      so for now we need a helper for it.
      
      Until we have proper support on the Rust side for memory allocation
      profiling, this means that all Rust allocations will be accounted to
      the helper.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-25-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Alice Ryhl <aliceryhl@google.com>
      Acked-by: Miguel Ojeda <ojeda@kernel.org>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/slab: add allocation accounting into slab allocation and free paths · 4b873696
      Suren Baghdasaryan authored
      Account slab allocations using codetag reference embedded into slabobj_ext.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-24-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib: add codetag reference into slabobj_ext · c789b5fe
      Suren Baghdasaryan authored
      To store a code tag for every slab object, a codetag reference is embedded
      into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-23-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_ext: enable early_page_ext when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y · 26865a1b
      Suren Baghdasaryan authored
      For all page allocations to be tagged, page_ext has to be initialized
      before the first page allocation.  Early tasks allocate their stacks using
      the page allocator before alloc_node_page_ext() initializes the page_ext area,
      unless early_page_ext is enabled.  Therefore these allocations will
      generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled. 
      Enable early_page_ext whenever CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to
      ensure page_ext initialization prior to any page allocation.  This will
      have all the negative effects associated with early_page_ext, such as
      possible longer boot time, therefore we enable it only when debugging with
      CONFIG_MEM_ALLOC_PROFILING_DEBUG enabled and not universally for
      CONFIG_MEM_ALLOC_PROFILING.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-22-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: fix non-compound multi-order memory accounting in __free_pages · cc92eba1
      Suren Baghdasaryan authored
      When a non-compound multi-order page is freed, it is possible that a
      speculative reference keeps the page pinned.  In this case we free all
      pages except for the first page, which will be freed later by the last
      put_page().  However the page passed to put_page() is indistinguishable
      from an order-0 page, so it cannot do the accounting, just as it cannot
      free the subsequent pages.  Do the accounting here, where we free the
      pages.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-21-surenb@google.com
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: create new codetag references during page splitting · be25d1d4
      Suren Baghdasaryan authored
      When a high-order page is split into smaller ones, each newly split page
      should get its codetag.  After the split each split page will be
      referencing the original codetag.  The codetag's "bytes" counter remains
      the same because the amount of allocated memory has not changed, however
      the "calls" counter gets increased to keep the counter correct when these
      individual pages get freed.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-20-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
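The counter arithmetic described in the page-splitting commit (be25d1d4) can be illustrated with a small user-space sketch. The struct, the function names, and the 4096-byte page size are assumptions made for this example; none of it is the kernel's actual code.

```c
/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096L

/* Hypothetical stand-in for the "bytes" and "calls" counters of a codetag. */
struct sketch_counters {
    long bytes;
    long calls;
};

/* Allocating one order-N block accounts all of its bytes but only one call. */
void sketch_alloc(struct sketch_counters *c, unsigned int order)
{
    c->bytes += SKETCH_PAGE_SIZE << order;
    c->calls += 1;
}

/*
 * Splitting an order-N block into 2^N order-0 pages leaves "bytes" alone
 * and adds 2^N - 1 calls, one for every page that did not exist before.
 */
void sketch_split(struct sketch_counters *c, unsigned int order)
{
    c->calls += (1L << order) - 1;
}

/* Freeing a block removes its bytes and exactly one call. */
void sketch_free(struct sketch_counters *c, unsigned int order)
{
    c->bytes -= SKETCH_PAGE_SIZE << order;
    c->calls -= 1;
}
```

With these rules, allocating an order-2 block, splitting it, and freeing the four resulting order-0 pages returns both counters to zero. Without the adjustment in the split step, "calls" would end up negative.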
    • mm: enable page allocation tagging · b951aaff
      Suren Baghdasaryan authored
      Redefine page allocators to record allocation tags upon their invocation.
      Instrument post_alloc_hook and free_pages_prepare to modify the current
      allocation tag.
      
      [surenb@google.com: undo _noprof additions in the documentation]
        Link: https://lkml.kernel.org/r/20240326231453.1206227-3-surenb@google.com
      Link: https://lkml.kernel.org/r/20240321163705.3067592-19-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • change alloc_pages name in dma_map_ops to avoid name conflicts · 8a2f1187
      Suren Baghdasaryan authored
      After redefining alloc_pages as a macro, every use of that name is replaced.
      Change the conflicting names to prevent the preprocessor from replacing them
      where that is not intended.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
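The name conflict addressed by the dma_map_ops commit (8a2f1187) can be reproduced in a few lines of user-space C. This is a sketch under assumptions: the `_sketch` types, the counter, and the function bodies are hypothetical, and the member name `alloc_pages_op` is used here only to illustrate a rename.

```c
/* Counts how many calls went through the profiling macro. */
int profiled_calls;

/* The "real" allocator, reachable only through the macro below. */
int alloc_pages_noprof(int order)
{
    return 1 << order;
}

/*
 * Once alloc_pages is a function-like macro, every "alloc_pages(" token
 * sequence is rewritten, including "ops->alloc_pages(...)", which would
 * expand into invalid code.
 */
#define alloc_pages(order) (profiled_calls++, alloc_pages_noprof(order))

/* Renaming the member keeps it out of the macro's reach. */
struct dma_ops_sketch {
    int (*alloc_pages_op)(int order);
};

/* A backend implementation that has nothing to do with the macro. */
int backend_alloc(int order)
{
    return 100 + order;
}

/* Calls through the ops table are left alone by the preprocessor. */
int call_through_ops(const struct dma_ops_sketch *ops, int order)
{
    return ops->alloc_pages_op(order);
}

/* Direct calls are routed through the macro and therefore counted. */
int call_direct(int order)
{
    return alloc_pages(order);
}
```

Had the member kept the name `alloc_pages`, the call in `call_through_ops` would expand to `ops->(profiled_calls++, ...)` and fail to compile.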
    • mm: percpu: increase PERCPU_MODULE_RESERVE to accommodate allocation tags · ccdabb1d
      Suren Baghdasaryan authored
      As each allocation tag generates a per-cpu variable, more space is
      required to store them.  Increase PERCPU_MODULE_RESERVE to provide enough
      area.  A better long-term solution would be to allocate this memory
      dynamically.
      
      [surenb@google.com: increase PERCPU_MODULE_RESERVE to accommodate allocation tags]
        Link: https://lkml.kernel.org/r/20240406214044.1114406-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20240321163705.3067592-17-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib: introduce early boot parameter to avoid page_ext memory overhead · 8d469d0b
      Suren Baghdasaryan authored
      The highest memory overhead from memory allocation profiling comes from
      page_ext objects.  This overhead exists even if the feature is disabled
      but compiled-in.  To avoid it, introduce an early boot parameter that
      prevents page_ext object creation.  The new boot parameter is a tri-state
      with possible values of 0|1|never.  When it is set to "never" the memory
      allocation profiling support is disabled, and overhead is minimized
      (currently no page_ext objects are allocated, in the future more overhead
      might be eliminated).  As a result we also lose ability to enable memory
      allocation profiling at runtime (because there is no space to store
      alloctag references).  Runtime sysctl becomes read-only if the early boot
      parameter was set to "never".  Note that the default value of this boot
      parameter depends on the CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
      configuration.  When CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n the
      boot parameter is set to "never", therefore eliminating any overhead. 
      CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y results in boot parameter
      being set to 1 (enabled).  This allows distributions to avoid any overhead
      by setting CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n config and with
      no changes to the kernel command line.
      
      We reuse sysctl.vm.mem_profiling boot parameter name in order to avoid
      introducing yet another control.  This change turns it into a tri-state
      early boot parameter.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-16-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
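The tri-state behaviour described in the boot-parameter commit (8d469d0b) can be sketched as a small user-space parser. The enum, struct, and function names below are hypothetical, and the real kernel parsing code may differ. Only the accepted values (0, 1, never), the default rule, and the consequences of "never" are taken from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum sketch_mode { SKETCH_OFF, SKETCH_ON, SKETCH_NEVER };

/* Hypothetical summary of what the chosen mode implies. */
struct sketch_state {
    enum sketch_mode mode;
    bool page_ext_reserved; /* room for tag references exists */
    bool sysctl_writable;   /* profiling can be toggled at runtime */
};

/*
 * arg is the value given on the command line, or NULL if the parameter
 * was absent. enabled_by_default models
 * CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
 * Returns 0 on success and -1 for an unrecognised value.
 */
int sketch_parse_mem_profiling(const char *arg, bool enabled_by_default,
                               struct sketch_state *st)
{
    enum sketch_mode mode;

    if (arg == NULL)
        mode = enabled_by_default ? SKETCH_ON : SKETCH_NEVER;
    else if (strcmp(arg, "0") == 0)
        mode = SKETCH_OFF;
    else if (strcmp(arg, "1") == 0)
        mode = SKETCH_ON;
    else if (strcmp(arg, "never") == 0)
        mode = SKETCH_NEVER;
    else
        return -1;

    st->mode = mode;
    /* "never" skips page_ext creation, so nothing can be enabled later. */
    st->page_ext_reserved = (mode != SKETCH_NEVER);
    st->sysctl_writable = (mode != SKETCH_NEVER);
    return 0;
}
```

The distinction between 0 and never is the point of the third state: 0 keeps the storage so profiling can be switched on later, while never gives up that option in exchange for lower overhead.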
    • lib: introduce support for page allocation tagging · dcfe378c
      Suren Baghdasaryan authored
      Introduce helper functions to easily instrument page allocators by storing
      a pointer to the allocation tag associated with the code that allocated
      the page in a page_ext field.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-15-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib: add allocation tagging support for memory allocation profiling · 22d407b1
      Suren Baghdasaryan authored
      Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
      instrument memory allocators.  It registers an "alloc_tags" codetag type
      with /proc/allocinfo interface to output allocation tag information when
      the feature is enabled.
      
      CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
      allocation profiling instrumentation.
      
      Memory allocation profiling can be enabled or disabled at runtime using
      /proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
      CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
      profiling by default.
      
      [surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title]
        Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com
      [surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU]
        Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com
      [klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h]
        Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com
      [surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()]
        Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Klara Modin <klarasmodin@gmail.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
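The idea of instrumenting an allocator so that each call site gets its own tag, as introduced in commit 22d407b1, can be approximated in user space. The sketch below is an assumption-laden illustration. It uses a GNU C statement expression (supported by GCC and Clang) to place a static tag at each call site and links tags into a list on first use. All names are hypothetical, free-side accounting is omitted, and the kernel's own mechanism for collecting tags is not shown.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-callsite tag: where the call is, and what it allocated. */
struct sketch_tag {
    const char *file;
    int line;
    size_t bytes;
    size_t calls;
    struct sketch_tag *next;
    int registered;
};

/* List of every tag that has been used at least once. */
struct sketch_tag *sketch_tags;

/* Allocate and charge the allocation to the given tag. */
void *sketch_alloc_tagged(struct sketch_tag *tag, size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        return NULL;
    if (!tag->registered) {
        tag->registered = 1;
        tag->next = sketch_tags;
        sketch_tags = tag;
    }
    tag->bytes += size;
    tag->calls++;
    return p;
}

/* Each expansion creates its own static tag, so accounting is per call site. */
#define sketch_alloc(size) ({                                        \
    static struct sketch_tag _tag = { __FILE__, __LINE__, 0, 0, NULL, 0 }; \
    sketch_alloc_tagged(&_tag, (size)); })

size_t sketch_tag_count(void)
{
    size_t n = 0;

    for (struct sketch_tag *t = sketch_tags; t != NULL; t = t->next)
        n++;
    return n;
}

size_t sketch_total_calls(void)
{
    size_t n = 0;

    for (struct sketch_tag *t = sketch_tags; t != NULL; t = t->next)
        n += t->calls;
    return n;
}

/* Two call sites: the first runs three times, the second once. */
void sketch_demo(void)
{
    for (int i = 0; i < 3; i++)
        free(sketch_alloc(16));
    free(sketch_alloc(32));
}
```

Because the tag lives at the macro expansion site, wrapping an allocator in a helper function would charge everything to the helper, which is why some wrappers in this series are converted to macros.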
    • lib: prevent module unloading if memory is not freed · 47a92dfb
      Suren Baghdasaryan authored
      Skip freeing a module's data section if it has allocation tags with
      non-zero counters, because otherwise, once these allocations are freed,
      the access to their code tag would cause a use-after-free (UAF).
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-13-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib: code tagging module support · a4735739
      Suren Baghdasaryan authored
      Add support for code tagging from dynamically loaded modules.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-12-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib: code tagging framework · 916cc516
      Suren Baghdasaryan authored
      Add basic infrastructure to support code tagging which stores tag common
      information consisting of the module name, function, file name and line
      number.  Provide functions to register a new code tag type and navigate
      between code tags.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-11-surenb@google.com
      Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • slab: objext: introduce objext_flags as extension to page_memcg_data_flags · 53ce7203
      Suren Baghdasaryan authored
      Introduce objext_flags to store additional objext flags unrelated to memcg.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-10-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation · 45012241
      Suren Baghdasaryan authored
      Slab extension objects can't be allocated before slab infrastructure is
      initialized.  Some caches, like kmem_cache and kmem_cache_node, are
      created before slab infrastructure is initialized.  Objects from these
      caches can't have extension objects.  Introduce SLAB_NO_OBJ_EXT slab flag
      to mark these caches and avoid creating extensions for objects allocated
      from these slabs.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-9-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext creation · 768c33be
      Suren Baghdasaryan authored
      Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
      when allocating slabobj_ext on a slab.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-8-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
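The recursion that the __GFP_NO_OBJ_EXT commit (768c33be) guards against can be modelled with two counters. This is a deliberately simplified sketch: the flag value, the names, and the assumption of one extension allocation per object are all made up for the example and do not reflect how the slab allocator is organised.

```c
/* Hypothetical flag bit standing in for __GFP_NO_OBJ_EXT. */
#define SKETCH_GFP_NO_OBJ_EXT 0x1u

int objects_allocated;
int ext_vectors_allocated;

/*
 * Model of a slab allocation that also needs extension storage. The
 * extension storage comes from the same allocator, so without the flag
 * the nested call would ask for its own extension, and so on forever.
 */
void sketch_slab_alloc(unsigned int flags)
{
    objects_allocated++;

    if (flags & SKETCH_GFP_NO_OBJ_EXT)
        return; /* nested request: do not create another extension */

    ext_vectors_allocated++;
    sketch_slab_alloc(flags | SKETCH_GFP_NO_OBJ_EXT);
}
```

A single top-level call therefore performs exactly two allocations, the object and its extension storage, and then stops.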
    • mm: introduce slabobj_ext to support slab object extensions · 21c690a3
      Suren Baghdasaryan authored
      Currently slab pages can store only vectors of obj_cgroup pointers in
      page->memcg_data.  Introduce slabobj_ext structure to allow more data to
      be stored for each slab object.  Wrap obj_cgroup into slabobj_ext to
      support current functionality while allowing to extend slabobj_ext in the
      future.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-7-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      21c690a3
    • Kent Overstreet's avatar
      fs: convert alloc_inode_sb() to a macro · a5674119
      Kent Overstreet authored
      We're introducing alloc tagging, which tracks memory allocations by
      callsite.  Converting alloc_inode_sb() to a macro means allocations will
      be tracked by its caller, which is a bit more useful.
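
The effect of turning a wrapper into a macro can be sketched in userspace C. This is only an illustration, not the kernel's code-tagging implementation: `struct callsite_tag`, `tagged_alloc()`, `my_alloc()` and `last_tag` are hypothetical names, and the macro relies on GNU C statement expressions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-callsite record; the kernel's real code tags differ. */
struct callsite_tag {
    const char *file;
    int line;
    size_t bytes;   /* total bytes successfully allocated from this callsite */
    size_t calls;   /* number of successful allocations from this callsite */
};

/* For inspection only: the tag used by the most recent allocation. */
static struct callsite_tag *last_tag;

/* Plain function: does the work and accounts it to the tag it is given. */
static void *tagged_alloc(struct callsite_tag *tag, size_t size)
{
    void *p = malloc(size);

    last_tag = tag;
    if (p) {
        tag->bytes += size;
        tag->calls++;
    }
    return p;
}

/*
 * Macro wrapper: the static tag is instantiated wherever the macro is
 * expanded, so every caller gets its own counters. Had my_alloc() been a
 * function, there would be a single tag located inside that function.
 */
#define my_alloc(size) ({                                            \
    static struct callsite_tag _tag = { __FILE__, __LINE__, 0, 0 };  \
    tagged_alloc(&_tag, (size));                                     \
})
```

Two `my_alloc()` calls on different source lines are accounted to two distinct tags, while repeated calls from the same line (for example inside a loop) accumulate on one tag.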
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-6-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a5674119
    • Kent Overstreet's avatar
      scripts/kallysms: always include __start and __stop symbols · a7f13d0f
      Kent Overstreet authored
      These symbols are used to denote section boundaries: by always including
      them we can unify loading sections from modules with loading built-in
      sections, which leads to some significant cleanup.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-5-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a7f13d0f
    • Kent Overstreet's avatar
      mm/slub: mark slab_free_freelist_hook() __always_inline · 9ea9cd8e
      Kent Overstreet authored
      It seems we need to be more forceful with the compiler on this one.  This
      is done for performance reasons only.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-4-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9ea9cd8e
    • Kent Overstreet's avatar
      asm-generic/io.h: kill vmalloc.h dependency · 690da22d
      Kent Overstreet authored
      Needed to avoid a new circular dependency with the memory allocation
      profiling series.
      
      Naturally, a whole bunch of files needed to include vmalloc.h that were
      previously getting it implicitly.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-3-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      690da22d
    • Kent Overstreet's avatar
      fix missing vmalloc.h includes · 0069455b
      Kent Overstreet authored
      Patch series "Memory allocation profiling", v6.
      
      Overview:
      Low overhead [1] per-callsite memory allocation profiling. Not just for
      debug kernels, overhead low enough to be deployed in production.
      
      Example output:
        root@moria-kvm:~# sort -rn /proc/allocinfo
         127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
          56373248     4737 mm/slub.c:2259 func:alloc_slab_page
          14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
          14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
          13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
          11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
           9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
           4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
           4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
           3940352      962 mm/memory.c:4214 func:alloc_anon_folio
           2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
           ...
      
      Usage:
      kconfig options:
       - CONFIG_MEM_ALLOC_PROFILING
       - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
       - CONFIG_MEM_ALLOC_PROFILING_DEBUG
         adds warnings for allocations that weren't accounted because of a
         missing annotation
      
      sysctl:
        /proc/sys/vm/mem_profiling
      
      Runtime info:
        /proc/allocinfo
      
      Notes:
      
      [1]: Overhead
      To measure the overhead we are comparing the following configurations:
      (1) Baseline with CONFIG_MEMCG_KMEM=n
      (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n)
      (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y)
      (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
      (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
      (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y
      (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y
      
      Performance overhead:
      To evaluate performance we implemented an in-kernel test executing
      multiple get_free_page/free_page and kmalloc/kfree calls with allocation
      sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
      affinity set to a specific CPU to minimize the noise. Below are results
      from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
      56 core Intel Xeon:
      
                              kmalloc                 pgalloc
      (1 baseline)            6.764s                  16.902s
      (2 default disabled)    6.793s  (+0.43%)        17.007s (+0.62%)
      (3 default enabled)     7.197s  (+6.40%)        23.666s (+40.02%)
      (4 runtime enabled)     7.405s  (+9.48%)        23.901s (+41.41%)
      (5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
      (6 def disabled+memcg)  13.332s (+97.10%)       48.105s (+184.61%)
      (7 def enabled+memcg)   13.446s (+98.78%)       54.963s (+225.18%)
      
      Memory overhead:
      Kernel size:
      
           text        data        bss         dec         diff
      (1)  26515311    18890222    17018880    62424413
      (2)  26524728    19423818    16740352    62688898    264485
      (3)  26524724    19423818    16740352    62688894    264481
      (4)  26524728    19423818    16740352    62688898    264485
      (5)  26541782    18964374    16957440    62463596    39183
      
      Memory consumption on a 56 core Intel CPU with 125GB of memory:
      Code tags:           192 kB
      PageExts:         262144 kB (256MB)
      SlabExts:           9876 kB (9.6MB)
      PcpuExts:            512 kB (0.5MB)
      
      Total overhead is 0.2% of total memory.
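
The 0.2% figure can be checked from the four numbers above. The helper below is a hypothetical sketch (`overhead_percent()` is not a kernel function) and assumes that "125GB" means 125 GiB, i.e. 125 * 1024 * 1024 kB. The four categories sum to 272724 kB, which is roughly 0.21% of that total.

```c
/*
 * Sum the per-category overheads (in kB) and express the sum as a
 * percentage of total memory (also in kB).
 */
static double overhead_percent(const double *parts_kb, int n, double total_kb)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += parts_kb[i];
    return 100.0 * sum / total_kb;
}
```

Calling it with `{192, 262144, 9876, 512}` and `125.0 * 1024 * 1024` gives a value between 0.20 and 0.21, consistent with the rounded 0.2% quoted above.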
      
      Benchmarks:
      
      Hackbench tests run 100 times:
      hackbench -s 512 -l 200 -g 15 -f 25 -P
            baseline       disabled profiling           enabled profiling
      avg   0.3543         0.3559 (+0.0016)             0.3566 (+0.0023)
      stdev 0.0137         0.0188                       0.0077
      
      
      hackbench -l 10000
            baseline       disabled profiling           enabled profiling
      avg   6.4218         6.4306 (+0.0088)             6.5077 (+0.0859)
      stdev 0.0933         0.0286                       0.0489
      
      stress-ng tests:
      stress-ng --class memory --seq 4 -t 60
      stress-ng --class cpu --seq 4 -t 60
      Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/
      
      [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/
      
      
      This patch (of 37):
      
      The next patch drops vmalloc.h from a system header in order to fix a
      circular dependency; this adds it to all the files that were pulling it in
      implicitly.
      
      [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c]
        Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev
      [surenb@google.com: fix arch/x86/mm/numa_32.c]
        Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com
      [kent.overstreet@linux.dev: a few places were depending on sizes.h]
        Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev
      [arnd@arndb.de: fix mm/kasan/hw_tags.c]
        Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org
      [surenb@google.com: fix arc build]
        Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0069455b
    • Randy Dunlap's avatar
      scripts/kernel-doc: drop "_noprof" on function prototypes · 51a7bf02
      Randy Dunlap authored
      Memory profiling introduces macros as hooks for function-level allocation
      profiling[1].  Memory allocation functions that are profiled are named
      like xyz_alloc() for API access to the function.  xyz_alloc() then calls
      xyz_alloc_noprof() to do the allocation work.
      
      The kernel-doc comments for the memory allocation functions are introduced
      with the xyz_alloc() function names but the function implementations are
      the xyz_alloc_noprof() names.  This causes kernel-doc warnings for
      mismatched documentation and function prototype names.  By dropping the
      "_noprof" part of the function name, the kernel-doc function name matches
      the function prototype name, so the warnings are resolved.
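
The name normalization described above can be sketched as follows. kernel-doc itself is a script that works on whole prototypes, so this C helper (`drop_noprof_suffix()`, a hypothetical name) only illustrates the idea of comparing names with a trailing "_noprof" removed.

```c
#include <string.h>

/* Strip a trailing "_noprof" from name in place and return name. */
static char *drop_noprof_suffix(char *name)
{
    static const char suffix[] = "_noprof";
    size_t n = strlen(name);
    size_t s = sizeof(suffix) - 1;

    /* Only strip when the name really ends with the suffix. */
    if (n >= s && strcmp(name + n - s, suffix) == 0)
        name[n - s] = '\0';
    return name;
}
```

With this, an implementation named `xyz_alloc_noprof` compares equal to the documented name `xyz_alloc`, while names without the suffix are left untouched.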
      
      [1] https://lore.kernel.org/all/20240321163705.3067592-1-surenb@google.com/
      
      Link: https://lkml.kernel.org/r/20240326054149.2121-1-rdunlap@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Closes: https://lore.kernel.org/all/20240325123603.1bdd6588@canb.auug.org.au/
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      51a7bf02
    • Yosry Ahmed's avatar
      percpu: clean up all mappings when pcpu_map_pages() fails · 2ccd48ce
      Yosry Ahmed authored
      In pcpu_map_pages(), if __pcpu_map_pages() fails on a CPU, we call
      __pcpu_unmap_pages() to clean up mappings on all CPUs where mappings were
      created, but not on the CPU where __pcpu_map_pages() fails.
      
      __pcpu_map_pages() and __pcpu_unmap_pages() are wrappers around
      vmap_pages_range_noflush() and vunmap_range_noflush().  All other callers
      of vmap_pages_range_noflush() call vunmap_range_noflush() when mapping
      fails, except pcpu_map_pages().  The reason could be that partial mappings
      may be left behind from a failed mapping attempt.
      
      Call __pcpu_unmap_pages() for the failed CPU as well in pcpu_map_pages().
      
      This was found by code inspection, no failures or bugs were observed.
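
The unwinding pattern can be sketched in userspace C. This is a simulation under stated assumptions, not the percpu code: `mapped[]`, `map_one()`, `unmap_one()` and `map_all()` are hypothetical stand-ins, and the failing step is assumed to leave a partial mapping behind.

```c
#include <stdbool.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical state: true while a CPU holds a (possibly partial) mapping. */
static bool mapped[NR_CPUS_SKETCH];

/*
 * Stand-in for the per-CPU mapping step: it fails on fail_cpu, but some
 * work may already have been done for that CPU when it fails.
 */
static int map_one(int cpu, int fail_cpu)
{
    mapped[cpu] = true;
    return cpu == fail_cpu ? -1 : 0;
}

static void unmap_one(int cpu)
{
    mapped[cpu] = false;
}

/*
 * Returns 0 on success. On failure, unwinds every CPU up to and
 * including the one that failed, so no partial mapping is left behind.
 */
static int map_all(int fail_cpu)
{
    int cpu, err;

    for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        err = map_one(cpu, fail_cpu);
        if (err)
            goto unwind;
    }
    return 0;

unwind:
    for (int i = 0; i <= cpu; i++)   /* "<=" covers the failed CPU too */
        unmap_one(i);
    return err;
}
```

Using `i < cpu` in the unwind loop would model the old behaviour, where the failed CPU is skipped.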
      
      Link: https://lkml.kernel.org/r/20240311194346.2291333-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Christoph Lameter (Ampere) <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2ccd48ce
    • Donet Tom's avatar
      mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy · 133d04b1
      Donet Tom authored
      commit bda420b9 ("numa balancing: migrate on fault among multiple
      bound nodes") added support for migrate on protnone reference with
      MPOL_BIND memory policy.  This allowed numa fault migration when the
      executing node is part of the policy mask for MPOL_BIND.  This patch
      extends migration support to MPOL_PREFERRED_MANY policy.
      
      Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
      MPOL_F_NUMA_BALANCING.  This causes issues when we want to use
      NUMA_BALANCING_MEMORY_TIERING.  To effectively use the slow memory tier,
      the kernel should not allocate pages from the slower memory tier via
      allocation control zonelist fallback.  Instead, we should move cold pages
      from the faster memory node via memory demotion.  For a page allocation,
      kswapd is only woken up after we try to allocate pages from all nodes in
      the allocation zone list.  This implies that, without using memory
      policies, we will end up allocating hot pages in the slower memory tier.
      
      MPOL_PREFERRED_MANY was added by commit b27abacc ("mm/mempolicy: add
      MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
      allocation control when we have memory tiers in the system.  With
      MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
      of faster memory nodes.  When we fail to allocate pages from the faster
      memory node, kswapd would be woken up, allowing demotion of cold pages to
      slower memory nodes.
      
      With the current kernel, such usage of memory policies implies we can't do
      page promotion from a slower memory tier to a faster memory tier using
      numa fault.  This patch fixes this issue.
      
      For MPOL_PREFERRED_MANY, if the executing node is in the policy node mask,
      we allow numa migration to the executing nodes.  If the executing node is
      not in the policy node mask, we do not allow numa migration.
      
      Example:
      On a 2-socket system, NUMA nodes N0, N1 and N2 are in socket 0 and
      N3 is in socket 1.  N0, N1 and N3 have fast memory and CPUs, while
      N2 has slow memory and no CPU.  For a workload, we may use
      MPOL_PREFERRED_MANY with nodemask N0 and N1 set because the workload
      runs on CPUs of socket 0 most of the time.  Then, even if the workload
      runs on CPUs of N3 occasionally, we will not try to migrate the workload
      pages from N2 to N3 because users may want to avoid cross-socket access
      as much as possible in the long term.
      
      In the table below, "Process" is the node the process is executing on and
      "Curr Loc Pgs" is the NUMA node where the page currently resides (folio node).
      ===========================================================
      Process  Policy  Curr Loc Pgs     Observation
      -----------------------------------------------------------
      N0       N0 N1      N1         Pages Migrated from N1 to N0
      N0       N0 N1      N2         Pages Migrated from N2 to N0
      N0       N0 N1      N3         Pages Migrated from N3 to N0

      N3       N0 N1      N0         Pages NOT Migrated to N3
      N3       N0 N1      N1         Pages NOT Migrated to N3
      N3       N0 N1      N2         Pages NOT Migrated to N3
      -----------------------------------------------------------
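
The rule in the table can be expressed as a small decision function. This is a simplified sketch, not the kernel's mpol_misplaced() logic, which applies further checks: the nodemask is modeled as a plain bitmask, and `node_in_mask()` and `migration_target()` are hypothetical names.

```c
#include <stdbool.h>

/* Bit n of policy_mask set means node n is in the policy nodemask. */
static bool node_in_mask(unsigned long policy_mask, int node)
{
    return (policy_mask >> node) & 1UL;
}

/*
 * Returns the node a page should be migrated to on a NUMA hint fault,
 * or -1 when no migration should happen.
 */
static int migration_target(unsigned long policy_mask, int exec_node,
                            int page_node)
{
    /* Executing node outside the preferred set: leave the page alone. */
    if (!node_in_mask(policy_mask, exec_node))
        return -1;
    /* Page is already local to the executing node. */
    if (page_node == exec_node)
        return -1;
    return exec_node;
}
```

With `policy_mask = 0x3` (N0 and N1), a process on N0 pulls pages from N1, N2 or N3 to N0, while a process on N3 triggers no migration, matching the six rows above.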
      
      Link: https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Signed-off-by: Donet Tom <donettom@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      133d04b1
    • Donet Tom's avatar
      mm/mempolicy: use numa_node_id() instead of cpu_to_node() · f8fd525b
      Donet Tom authored
      Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY
      policy", v4.
      
      This patchset is to optimize the cross-socket memory access with
      MPOL_PREFERRED_MANY policy.
      
      To test this patch we ran the following test on a 3 node system.
       Node 0 - 2GB   - Tier 1
       Node 1 - 11GB  - Tier 1
       Node 6 - 10GB  - Tier 2
      
      The following changes were made to memcached to set the memory policy.
      They select Node0 and Node1 as the preferred nodes.

         #include <errno.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <numaif.h>
         #include <numa.h>

         unsigned long nodemask;
         int ret;

         /* Bits 0 and 1 set: prefer Node0 and Node1. */
         nodemask = 0x03;
         ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
                             &nodemask, 10);
         /* If MPOL_F_NUMA_BALANCING isn't supported,
          * fall back to MPOL_PREFERRED_MANY */
         if (ret < 0 && errno == EINVAL) {
             printf("set mem policy normal\n");
             ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
         }
         if (ret < 0) {
             perror("Failed to call set_mempolicy");
             exit(-1);
         }
      
      Test Procedure:
      ===============
      1. Make sure memory tiering and demotion are enabled.
      2. Start memcached.
      
         # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7
             -d -s "/tmp/memcached.sock"
      
      3. Run memtier_benchmark to store 3200000 keys.
      
        #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
          --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1
          --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024
      
      4. Start a memory eater on node 0 and 1. This will demote all memcached
         pages to node 6.
      5. Make sure all the memcached pages got demoted to the lower tier by reading
         /proc/<memcached PID>/numa_maps.
      
          # cat /proc/2771/numa_maps
           ---
          default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
          default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
           ---
      
      6. Kill memory eater.
      7. Read the pgpromote_success counter.
      8. Start reading the keys by running memtier_benchmark.
      
        #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
         --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R
         --key-minimum=1 --key-maximum=3200000 -n allkeys
         --threads=64 -c 1 -R -x 6
      
      9. Read the pgpromote_success counter.
      
      Test Results:
      =============
      Without Patch
      ------------------
      1. pgpromote_success  before test
      Node 0:  pgpromote_success 11
      Node 1:  pgpromote_success 140974
      
      pgpromote_success  after test
      Node 0:  pgpromote_success 11
      Node 1:  pgpromote_success 140974
      
      2. Memtier-benchmark result.
      AGGREGATED AVERAGE RESULTS (6 runs)
      ==================================================================
      Type    Ops/sec   Hits/sec   Misses/sec  Avg. Latency  p50 Latency
      ------------------------------------------------------------------
      Sets     0.00       ---         ---        ---          ---
      Gets    305792.03  305791.93   0.10       0.18949       0.16700
      Waits    0.00       ---         ---        ---          ---
      Totals  305792.03  305791.93   0.10       0.18949       0.16700
      
      ======================================
      p99 Latency  p99.9 Latency  KB/sec
      -------------------------------------
      ---          ---            0.00
      0.44700     1.71100        11542.69
      ---           ---            ---
      0.44700     1.71100        11542.69
      
      With Patch
      ---------------
      1. pgpromote_success  before test
      Node 0:  pgpromote_success 5
      Node 1:  pgpromote_success 89386
      
      pgpromote_success  after test
      Node 0:  pgpromote_success 57895
      Node 1:  pgpromote_success 141463
      
      2. Memtier-benchmark result.
      AGGREGATED AVERAGE RESULTS (6 runs)
      ====================================================================
      Type    Ops/sec    Hits/sec  Misses/sec  Avg. Latency  p50 Latency
      --------------------------------------------------------------------
      Sets     0.00        ---       ---        ---           ---
      Gets    521942.24  521942.07  0.17       0.11459        0.10300
      Waits    0.00        ---       ---         ---          ---
      Totals  521942.24  521942.07  0.17       0.11459        0.10300
      
      =======================================
      p99 Latency  p99.9 Latency  KB/sec
      ---------------------------------------
       ---          ---            0.00
      0.23100      0.31900        19701.68
      ---          ---             ---
      0.23100      0.31900        19701.68
      
      
      Test Result Analysis:
      =====================
      1. With the patch, we can observe pages getting promoted.
      2. Memtier-benchmark results show that, with the patch,
         performance has increased by more than 50%.
      
       Ops/sec without fix -  305792.03
       Ops/sec with fix    -  521942.24
      
      
      This patch (of 2):
      
      Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is
      quicker.  smp_processor_id is guaranteed to be stable in the
      'mpol_misplaced()' function because it is called with ptl held. 
      lockdep_assert_held was added to ensure that.
      
      No functional change in this patch.
      
      [donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment]
        Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Signed-off-by: Donet Tom <donettom@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8fd525b
    • Yosry Ahmed's avatar
      mm: zswap: remove unnecessary check in zswap_find_zpool() · fea68a75
      Yosry Ahmed authored
      zswap_find_zpool() checks if ZSWAP_NR_ZPOOLS > 1, which is always true. 
      This is a remnant from a patch version that had ZSWAP_NR_ZPOOLS as a
      config option and never made it upstream.  Remove the unnecessary check.
      
      Link: https://lkml.kernel.org/r/20240311235210.2937484-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fea68a75
    • Duoming Zhou's avatar
      lib/test_hmm.c: handle src_pfns and dst_pfns allocation failure · c2af060d
      Duoming Zhou authored
      The kcalloc() in dmirror_device_evict_chunk() will return null if the
      physical memory has run out.  As a result, if src_pfns or dst_pfns is
      dereferenced, the null pointer dereference bug will happen.
      
      Moreover, the device is going away.  If the kcalloc() fails, the pages
      mapping a chunk could not be evicted.  So add a __GFP_NOFAIL flag in
      kcalloc().
      
      Finally, as there is no need to have physically contiguous memory, switch
      kcalloc() to kvcalloc() in order to avoid failing allocations.
      
      Link: https://lkml.kernel.org/r/20240312005905.9939-1-duoming@zju.edu.cn
      Fixes: b2ef9f5a ("mm/hmm/test: add selftest driver for HMM")
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c2af060d
    • Johannes Weiner's avatar
      mm: zpool: return pool size in pages · 4196b48d
      Johannes Weiner authored
      All zswap backends track their pool sizes in pages.  Currently they
      multiply by PAGE_SIZE for zswap, only for zswap to divide again in order
      to do limit math.  Report pages directly.
      
      Link: https://lkml.kernel.org/r/20240312153901.3441-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4196b48d
    • Johannes Weiner's avatar
      mm: zswap: optimize zswap pool size tracking · 91cdcd8d
      Johannes Weiner authored
      Profiling the munmap() of a zswapped memory region shows 60% of the total
      cycles currently going into updating the zswap_pool_total_size.
      
      There are three consumers of this counter:
      - store, to enforce the globally configured pool limit
      - meminfo & debugfs, to report the size to the user
      - shrink, to determine the batch size for each cycle
      
      Instead of aggregating every time an entry enters or exits the zswap
      pool, aggregate the value from the zpools on-demand:
      
      - Stores aggregate the counter anyway upon success. Aggregating to
        check the limit instead is the same amount of work.
      
      - Meminfo & debugfs might benefit somewhat from a pre-aggregated
        counter, but aren't exactly hotpaths.
      
      - Shrinking can aggregate once for every cycle instead of doing it for
        every freed entry. As the shrinker might work on tens or hundreds of
        objects per scan cycle, this is a large reduction in aggregations.
      
      The paths that benefit dramatically are swapin, swapoff, and unmaps. 
      There could be millions of pages being processed until somebody asks for
      the pool size again.  This eliminates the pool size updates from those
      paths entirely.
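
The difference between the two schemes can be modelled with a small userspace sketch. Everything here is hypothetical (NR_POOLS, struct sketch_state, and the function names are not zswap code); the aggregations field exists only to count how often the per-pool sizes are summed.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_POOLS 4	/* hypothetical number of backend pools */

struct sketch_state {
	uint64_t pool_pages[NR_POOLS];	/* per-pool size, kept by the backend */
	uint64_t cached_total;		/* eagerly maintained total (old scheme) */
	unsigned long aggregations;	/* how many times the pools were summed */
};

/* Sum all per-pool sizes; this is the work the change tries to avoid. */
static uint64_t aggregate(struct sketch_state *s)
{
	uint64_t total = 0;

	for (size_t i = 0; i < NR_POOLS; i++)
		total += s->pool_pages[i];
	s->aggregations++;
	return total;
}

/* Old scheme: every freed page re-aggregates the global total. */
static void free_page_eager(struct sketch_state *s, size_t pool)
{
	s->pool_pages[pool]--;
	s->cached_total = aggregate(s);
}

/* New scheme: freeing only touches the per-pool counter. */
static void free_page_lazy(struct sketch_state *s, size_t pool)
{
	s->pool_pages[pool]--;
}

/* New scheme: consumers aggregate only when they need the total. */
static uint64_t total_pages_lazy(struct sketch_state *s)
{
	return aggregate(s);
}
```

In this model, freeing N pages costs N aggregations under the eager scheme and none under the lazy one; the lazy scheme pays for a single aggregation when a consumer next asks for the total.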
      
      Top profile entries for a 24G range munmap(), before:
      
          38.54%  zswap-unmap  [kernel.kallsyms]  [k] zs_zpool_total_size
          12.51%  zswap-unmap  [kernel.kallsyms]  [k] zpool_get_total_size
           9.10%  zswap-unmap  [kernel.kallsyms]  [k] zswap_update_total_size
           2.95%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
           2.88%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
           2.86%  zswap-unmap  [kernel.kallsyms]  [k] xas_store
      
      and after:
      
           7.70%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
           7.16%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
           6.74%  zswap-unmap  [kernel.kallsyms]  [k] xas_store
      
      It was also briefly considered to move to a single atomic in zswap
      that is updated by the backends, since zswap only cares about the sum
      of all pools anyway. However, zram directly needs per-pool information
      out of zsmalloc. To keep the backend from having to update two atomics
      every time, I opted for the lazy aggregation instead for now.
      
      Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91cdcd8d
    • Peter Xu's avatar
      mm: document pXd_leaf() API · 64078b3d
      Peter Xu authored
      There's one small section already, but since we're going to remove
      pXd_huge(), that comment may become obsolete.
      
      Rewrite that section with more information, so that it is clear what the
      API implies.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-15-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      64078b3d
    • Peter Xu's avatar
      mm/arm: remove pmd_thp_or_huge() · 502016e3
      Peter Xu authored
      ARM/ARM64 used to define pmd_thp_or_huge().  Now this macro is completely
      redundant.  Remove it and use pmd_leaf().
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-14-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      502016e3
    • Peter Xu's avatar
      mm/treewide: remove pXd_huge() · 9636f055
      Peter Xu authored
      This API is no longer used; drop it from the whole tree.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9636f055
    • Peter Xu's avatar
      mm/treewide: replace pXd_huge() with pXd_leaf() · 1965e933
      Peter Xu authored
      Now that we're sure all pXd_huge() definitions are the same as
      pXd_leaf(), use pXd_leaf() instead.  Luckily, pXd_huge() isn't widely used.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1965e933
    • Peter Xu's avatar
      mm/gup: merge pXd huge mapping checks · 7db86dc3
      Peter Xu authored
      Huge mapping checks in GUP are slightly redundant and can be simplified.
      
      pXd_huge() is now the same as pXd_leaf().  pmd_trans_huge() and
      pXd_devmap() should both imply pXd_leaf().  Time to merge them into one.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-11-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7db86dc3
    • Peter Xu's avatar
      mm/powerpc: redefine pXd_huge() with pXd_leaf() · 460b9adc
      Peter Xu authored
      PowerPC book3s 4K mostly defines the two the same way, except that
      pXd_huge() always returns 0 for hash MMUs.  As Michael Ellerman
      pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the bit
      will never be set so it will keep returning false.
      
      As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
      such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
      pgtable walker __find_linux_pte() already used pXd_leaf() to check leaf
      hugetlb mappings.
      
      The goal should be that we will have one API pXd_leaf() to detect all
      kinds of huge mappings (hugepd is still special in this case, though).
      AFAICT we need to use the pXd_leaf() impl (rather than pXd_huge()'s) to
      make sure that, e.g., THPs on hash MMUs will also return true.
      
      This helps to simplify a follow-up patch to drop pXd_huge() treewide.
      
      NOTE: the *_leaf() definitions need to be moved before the inclusion of
      asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with them.
      
      [1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      460b9adc
    • Peter Xu's avatar
      mm/arm64: merge pXd_huge() and pXd_leaf() definitions · 961a6ee5
      Peter Xu authored
      Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly
      differently.  Redefine the pXd_huge() with pXd_leaf().
      
      There used to be two traps in the old aarch64 definitions of these APIs
      that I found while reading the surrounding code:
      
       (1) 4797ec2d ("arm64: fix pud_huge() for 2-level pagetables")
       (2) 23bc8f69 ("arm64: mm: fix p?d_leaf()")
      
      Defining pXd_huge() with the current pXd_leaf() makes sure (2) isn't a
      problem (on PROT_NONE checks).  To make sure it also works for (1), we
      move the __PAGETABLE_PMD_FOLDED check over to pud_leaf(), so that it
      always returns "false" for 2-level pgtables, which looks even safer
      as it now covers both.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-9-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      961a6ee5
    • Peter Xu's avatar
      mm/arm: redefine pmd_huge() with pmd_leaf() · 6818135d
      Peter Xu authored
      Most of the archs already define these two APIs the same way.  ARM is more
      complicated in two aspects:
      
        - For pXd_huge() it's always checking against !PXD_TABLE_BIT, while for
          pXd_leaf() it's always checking against PXD_TYPE_SECT.
      
        - SECT/TABLE bits are defined differently on 2-level vs. 3-level ARM
          pgtables, which makes the whole thing even harder to follow.
      
      Luckily, the second complexity should be hidden by the pmd_leaf()
      implementation in the 2-level vs. 3-level headers.  Invoke pmd_leaf()
      directly for pmd_huge(), to remove the first part of the complexity.  This
      prepares for dropping the pXd_huge() API globally.
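
To make the first point concrete, here is a userspace model of the two styles of check. The bit values are illustrative assumptions modelled on a 3-level descriptor layout, and all sk_* names are hypothetical; this is not the ARM header code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed, illustrative descriptor bits (low two bits encode the type). */
#define SK_PMD_TYPE_MASK	UINT64_C(3)
#define SK_PMD_TYPE_TABLE	UINT64_C(3)
#define SK_PMD_TYPE_SECT	UINT64_C(1)
#define SK_PMD_TABLE_BIT	UINT64_C(2)

/* "huge"-style check: entry is non-empty and the table bit is clear. */
static bool sk_pmd_huge_old(uint64_t pmd)
{
	return pmd && !(pmd & SK_PMD_TABLE_BIT);
}

/* "leaf"-style check: the type field says this is a section mapping. */
static bool sk_pmd_leaf(uint64_t pmd)
{
	return (pmd & SK_PMD_TYPE_MASK) == SK_PMD_TYPE_SECT;
}

/* After unification, the "huge" check simply defers to the "leaf" check. */
static bool sk_pmd_huge_new(uint64_t pmd)
{
	return sk_pmd_leaf(pmd);
}
```

In this model the two checks agree on section, table, and empty entries, and differ only on an entry whose type field is 0 while other bits are set. Routing pmd_huge() through pmd_leaf() leaves a single definition to reason about.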
      
      While at it, drop the outdated comments.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6818135d