1. 26 Apr, 2024 33 commits
    • fs: convert alloc_inode_sb() to a macro · a5674119
      Kent Overstreet authored
      We're introducing alloc tagging, which tracks memory allocations by
      callsite.  Converting alloc_inode_sb() to a macro means allocations will
      be tracked by its caller, which is a bit more useful.
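A minimal userspace sketch of why a macro helps here; it is not the kernel's implementation. The names `struct site_tag`, `tagged_alloc()`, `alloc_tagged()` and `last_tag` are hypothetical, and the macro assumes GNU C statement expressions (GCC or Clang).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-callsite record; the kernel's real structures differ. */
struct site_tag {
    const char *file;
    int line;
    size_t bytes;   /* total bytes requested from this call site */
    size_t calls;   /* successful allocations from this call site */
};

/* Most recently used tag, kept only so the example can be inspected. */
static struct site_tag *last_tag;

/* Out-of-line helper: performs the allocation and charges the given tag. */
static void *tagged_alloc(struct site_tag *tag, size_t size)
{
    void *p = malloc(size);

    if (p) {
        tag->bytes += size;
        tag->calls++;
    }
    last_tag = tag;
    return p;
}

/*
 * Each expansion of this macro gets its own static tag, so __FILE__ and
 * __LINE__ describe the caller.  Had this been a function, every caller
 * would share the single tag defined inside that function.
 */
#define alloc_tagged(size)                                              \
    ({                                                                  \
        static struct site_tag _tag = { __FILE__, __LINE__, 0, 0 };     \
        tagged_alloc(&_tag, (size));                                    \
    })
```

Two `alloc_tagged()` calls on different source lines charge two distinct tags, which is the per-caller attribution the commit message describes.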
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-6-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • scripts/kallysms: always include __start and __stop symbols · a7f13d0f
      Kent Overstreet authored
      These symbols are used to denote section boundaries: by always including
      them we can unify loading sections from modules with loading built-in
      sections, which leads to some significant cleanup.
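For readers unfamiliar with section boundary symbols, here is a small userspace sketch, assuming an ELF target and a GNU-compatible linker that synthesises `__start_<name>` and `__stop_<name>` for sections whose names are valid C identifiers. The section name `demo_tags`, `struct entry` and `DEFINE_ENTRY()` are hypothetical and unrelated to the kernel's own sections.

```c
#include <stddef.h>

/* Hypothetical record type stored in the custom section. */
struct entry {
    const char *name;
};

/* Place one record in the "demo_tags" section; 'used' keeps the compiler from dropping it. */
#define DEFINE_ENTRY(ident, str)                                    \
    static const struct entry ident                                 \
        __attribute__((used, section("demo_tags"))) = { str }

DEFINE_ENTRY(entry_a, "alpha");
DEFINE_ENTRY(entry_b, "beta");

/* Provided by the linker: first record and one past the last record. */
extern const struct entry __start_demo_tags[];
extern const struct entry __stop_demo_tags[];

/* The section is walked like an ordinary array delimited by the two symbols. */
static size_t count_entries(void)
{
    return (size_t)(__stop_demo_tags - __start_demo_tags);
}
```

Code that only sees the two boundary symbols does not need to know where the records came from, which is the kind of uniform handling the commit message refers to.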
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-5-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/slub: mark slab_free_freelist_hook() __always_inline · 9ea9cd8e
      Kent Overstreet authored
      It seems we need to be more forceful with the compiler on this one.  This
      is done for performance reasons only.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-4-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • asm-generic/io.h: kill vmalloc.h dependency · 690da22d
      Kent Overstreet authored
      Needed to avoid a new circular dependency with the memory allocation
      profiling series.
      
      Naturally, a whole bunch of files needed to include vmalloc.h that were
      previously getting it implicitly.
      
      Link: https://lkml.kernel.org/r/20240321163705.3067592-3-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • fix missing vmalloc.h includes · 0069455b
      Kent Overstreet authored
      Patch series "Memory allocation profiling", v6.
      
      Overview:
      Low overhead [1] per-callsite memory allocation profiling. Not just for
      debug kernels, overhead low enough to be deployed in production.
      
      Example output:
        root@moria-kvm:~# sort -rn /proc/allocinfo
         127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
          56373248     4737 mm/slub.c:2259 func:alloc_slab_page
          14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
          14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
          13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
          11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
           9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
           4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
           4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
           3940352      962 mm/memory.c:4214 func:alloc_anon_folio
           2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
           ...
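A small sketch of how one line of this output could be parsed, assuming the layout seen in the example: a byte total, a second number read here as an allocation count, then a `file:line` call site. `struct alloc_line`, `parse_alloc_line()` and `avg_bytes()` are hypothetical helpers, not part of any kernel interface.

```c
#include <stdio.h>

/* Hypothetical holder for the leading fields of one /proc/allocinfo line. */
struct alloc_line {
    unsigned long long bytes;   /* first column */
    unsigned long long calls;   /* second column */
    char site[256];             /* third column, e.g. "mm/page_ext.c:270" */
};

/* Returns 1 if the line starts with two numbers and a call site, else 0. */
static int parse_alloc_line(const char *line, struct alloc_line *out)
{
    return sscanf(line, "%llu %llu %255s",
                  &out->bytes, &out->calls, out->site) == 3;
}

/* Average bytes per counted allocation; 0 when the count is 0. */
static unsigned long long avg_bytes(const struct alloc_line *l)
{
    return l->calls ? l->bytes / l->calls : 0;
}
```

Applied to the first example line, this gives 127664128 / 31168 = 4096 bytes per counted allocation.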
      
      Usage:
      kconfig options:
       - CONFIG_MEM_ALLOC_PROFILING
       - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
       - CONFIG_MEM_ALLOC_PROFILING_DEBUG
         adds warnings for allocations that weren't accounted because of a
         missing annotation
      
      sysctl:
        /proc/sys/vm/mem_profiling
      
      Runtime info:
        /proc/allocinfo
      
      Notes:
      
      [1]: Overhead
      To measure the overhead we are comparing the following configurations:
      (1) Baseline with CONFIG_MEMCG_KMEM=n
      (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n)
      (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y)
      (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n &&
          /proc/sys/vm/mem_profiling=1)
      (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
      (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y
      (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
          CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y
      
      Performance overhead:
      To evaluate performance we implemented an in-kernel test executing
      multiple get_free_page/free_page and kmalloc/kfree calls with allocation
      sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
      affinity set to a specific CPU to minimize the noise. Below are results
      from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
      56 core Intel Xeon:
      
                              kmalloc                 pgalloc
      (1 baseline)            6.764s                  16.902s
      (2 default disabled)    6.793s  (+0.43%)        17.007s (+0.62%)
      (3 default enabled)     7.197s  (+6.40%)        23.666s (+40.02%)
      (4 runtime enabled)     7.405s  (+9.48%)        23.901s (+41.41%)
      (5 memcg)               13.388s (+97.94%)       48.460s (+186.71%)
      (6 def disabled+memcg)  13.332s (+97.10%)       48.105s (+184.61%)
      (7 def enabled+memcg)   13.446s (+98.78%)       54.963s (+225.18%)
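The percentages in this table appear to be computed against configuration (1) in every row, including the memcg rows. A short sketch of that calculation follows; `overhead_pct()` is a hypothetical helper, and because the displayed times are rounded, the last digit can differ from the table.

```c
/* Relative slowdown of `measured` over `baseline`, in percent. */
static double overhead_pct(double baseline, double measured)
{
    return (measured - baseline) / baseline * 100.0;
}
```

For example, `overhead_pct(6.764, 7.197)` is about 6.40 and `overhead_pct(16.902, 23.666)` is about 40.02, matching row (3).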
      
      Memory overhead:
      Kernel size:
      
          text        data        bss         dec         diff
      (1) 26515311    18890222    17018880    62424413
      (2) 26524728    19423818    16740352    62688898    264485
      (3) 26524724    19423818    16740352    62688894    264481
      (4) 26524728    19423818    16740352    62688898    264485
      (5) 26541782    18964374    16957440    62463596    39183
      
      Memory consumption on a 56 core Intel CPU with 125GB of memory:
      Code tags:           192 kB
      PageExts:         262144 kB (256MB)
      SlabExts:           9876 kB (9.6MB)
      PcpuExts:            512 kB (0.5MB)
      
      Total overhead is 0.2% of total memory.
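The 0.2% figure can be reproduced from the four numbers listed above. The sketch below assumes 1 GB = 1024 * 1024 kB; both function names are hypothetical.

```c
/* Sum of the per-category figures quoted above, in kB. */
static unsigned long total_overhead_kb(void)
{
    return 192UL + 262144UL + 9876UL + 512UL;
}

/* Overhead as a percentage of `mem_gb` GB of RAM. */
static double overhead_share_pct(unsigned long overhead_kb, unsigned long mem_gb)
{
    return 100.0 * (double)overhead_kb / ((double)mem_gb * 1024.0 * 1024.0);
}
```

The sum is 272724 kB, and on a 125GB machine that is roughly 0.21% of memory, dominated by the page extensions.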
      
      Benchmarks:
      
      Hackbench tests run 100 times:
      hackbench -s 512 -l 200 -g 15 -f 25 -P
            baseline       disabled profiling           enabled profiling
      avg   0.3543         0.3559 (+0.0016)             0.3566 (+0.0023)
      stdev 0.0137         0.0188                       0.0077
      
      
      hackbench -l 10000
            baseline       disabled profiling           enabled profiling
      avg   6.4218         6.4306 (+0.0088)             6.5077 (+0.0859)
      stdev 0.0933         0.0286                       0.0489
      
      stress-ng tests:
      stress-ng --class memory --seq 4 -t 60
      stress-ng --class cpu --seq 4 -t 60
      Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/
      
      [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/
      
      
      This patch (of 37):
      
      The next patch drops vmalloc.h from a system header in order to fix a
      circular dependency; this adds it to all the files that were pulling it in
      implicitly.
      
      [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c]
        Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev
      [surenb@google.com: fix arch/x86/mm/numa_32.c]
        Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com
      [kent.overstreet@linux.dev: a few places were depending on sizes.h]
        Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev
      [arnd@arndb.de: fix mm/kasan/hw_tags.c]
        Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org
      [surenb@google.com: fix arc build]
        Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Tested-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alex Gaynor <alex.gaynor@gmail.com>
      Cc: Alice Ryhl <aliceryhl@google.com>
      Cc: Andreas Hindborg <a.hindborg@samsung.com>
      Cc: Benno Lossin <benno.lossin@proton.me>
      Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Gary Guo <gary@garyguo.net>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • scripts/kernel-doc: drop "_noprof" on function prototypes · 51a7bf02
      Randy Dunlap authored
      Memory profiling introduces macros as hooks for function-level allocation
      profiling[1].  Memory allocation functions that are profiled are named
      like xyz_alloc() for API access to the function.  xyz_alloc() then calls
      xyz_alloc_noprof() to do the allocation work.
      
      The kernel-doc comments for the memory allocation functions are introduced
      with the xyz_alloc() function names but the function implementations are
      the xyz_alloc_noprof() names.  This causes kernel-doc warnings for
      mismatched documentation and function prototype names.  By dropping the
      "_noprof" part of the function name, the kernel-doc function name matches
      the function prototype name, so the warnings are resolved.
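The name normalisation described above can be illustrated in C. This is only a sketch of the idea; the actual change is in the kernel-doc script and is not shown here, and `strip_noprof()` is a hypothetical helper.

```c
#include <string.h>

/*
 * Copy `name` into `out` (capacity `outlen`), dropping a trailing
 * "_noprof" if present.  The result is truncated to fit and always
 * NUL-terminated when outlen > 0.
 */
static void strip_noprof(const char *name, char *out, size_t outlen)
{
    static const char suffix[] = "_noprof";
    size_t n = strlen(name);
    size_t s = sizeof(suffix) - 1;

    if (outlen == 0)
        return;
    if (n >= s && strcmp(name + n - s, suffix) == 0)
        n -= s;                 /* documented name omits the suffix */
    if (n >= outlen)
        n = outlen - 1;
    memcpy(out, name, n);
    out[n] = '\0';
}
```

With this, `xyz_alloc_noprof` and `xyz_alloc` both normalise to `xyz_alloc`, so the documented name and the prototype name compare equal.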
      
      [1] https://lore.kernel.org/all/20240321163705.3067592-1-surenb@google.com/
      
      Link: https://lkml.kernel.org/r/20240326054149.2121-1-rdunlap@infradead.org
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Closes: https://lore.kernel.org/all/20240325123603.1bdd6588@canb.auug.org.au/
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • percpu: clean up all mappings when pcpu_map_pages() fails · 2ccd48ce
      Yosry Ahmed authored
      In pcpu_map_pages(), if __pcpu_map_pages() fails on a CPU, we call
      __pcpu_unmap_pages() to clean up mappings on all CPUs where mappings were
      created, but not on the CPU where __pcpu_map_pages() fails.
      
      __pcpu_map_pages() and __pcpu_unmap_pages() are wrappers around
      vmap_pages_range_noflush() and vunmap_range_noflush().  All other callers
      of vmap_pages_range_noflush() call vunmap_range_noflush() when mapping
      fails, except pcpu_map_pages().  The reason could be that partial mappings
      may be left behind from a failed mapping attempt.
      
      Call __pcpu_unmap_pages() for the failed CPU as well in pcpu_map_pages().
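The cleanup pattern can be modelled in userspace as follows. This is a simplified sketch, not the percpu code: `NR_UNITS`, `mapped[]`, `map_unit()`, `unmap_unit()` and `map_all()` are hypothetical stand-ins, and the failing step is assumed to leave a partial mapping behind.

```c
#include <stdbool.h>

#define NR_UNITS 4      /* hypothetical number of CPUs */

/* true while a unit holds mappings, including partial ones */
static bool mapped[NR_UNITS];

/* Stand-in for the per-CPU map step: may fail after mapping some pages. */
static int map_unit(int unit, int fail_at)
{
    mapped[unit] = true;            /* part of the range is already mapped */
    return unit == fail_at ? -1 : 0;
}

/* Stand-in for the per-CPU unmap step. */
static void unmap_unit(int unit)
{
    mapped[unit] = false;
}

/* Map every unit; on failure undo all units up to AND including the failing one. */
static int map_all(int fail_at)
{
    int unit;
    int err = 0;

    for (unit = 0; unit < NR_UNITS; unit++) {
        err = map_unit(unit, fail_at);
        if (err)
            goto undo;
    }
    return 0;

undo:
    for (int u = 0; u <= unit; u++)     /* '<=' covers the failed unit too */
        unmap_unit(u);
    return err;
}
```

Using `u < unit` in the undo loop would reproduce the behaviour described above, where the failing CPU's partial mappings are left in place.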
      
      This was found by code inspection, no failures or bugs were observed.
      
      Link: https://lkml.kernel.org/r/20240311194346.2291333-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Christoph Lameter (Ampere) <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy · 133d04b1
      Donet Tom authored
      commit bda420b9 ("numa balancing: migrate on fault among multiple
      bound nodes") added support for migrate on protnone reference with
      MPOL_BIND memory policy.  This allowed numa fault migration when the
      executing node is part of the policy mask for MPOL_BIND.  This patch
      extends migration support to MPOL_PREFERRED_MANY policy.
      
      Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
      MPOL_F_NUMA_BALANCING.  This causes issues when we want to use
      NUMA_BALANCING_MEMORY_TIERING.  To effectively use the slow memory tier,
      the kernel should not allocate pages from the slower memory tier via
      allocation control zonelist fallback.  Instead, we should move cold pages
      from the faster memory node via memory demotion.  For a page allocation,
      kswapd is only woken up after we try to allocate pages from all nodes in
      the allocation zone list.  This implies that, without using memory
      policies, we will end up allocating hot pages in the slower memory tier.
      
      MPOL_PREFERRED_MANY was added by commit b27abacc ("mm/mempolicy: add
      MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
      allocation control when we have memory tiers in the system.  With
      MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
      of faster memory nodes.  When we fail to allocate pages from the faster
      memory node, kswapd would be woken up, allowing demotion of cold pages to
      slower memory nodes.
      
      With the current kernel, such usage of memory policies implies we can't do
      page promotion from a slower memory tier to a faster memory tier using
      numa fault.  This patch fixes this issue.
      
      For MPOL_PREFERRED_MANY, if the executing node is in the policy node mask,
      we allow numa migration to the executing node.  If the executing node is
      not in the policy node mask, we do not allow numa migration.

      Example:
      On a 2-socket system, NUMA nodes N0, N1 and N2 are in socket 0 and
      N3 is in socket 1.  N0, N1 and N3 have fast memory and CPUs, while
      N2 has slow memory and no CPU.  For a workload, we may use
      MPOL_PREFERRED_MANY with nodemask N0 and N1 set because the workload
      runs on CPUs of socket 0 most of the time.  Then, even if the workload
      runs on CPUs of N3 occasionally, we will not try to migrate the workload
      pages from N2 to N3 because users may want to avoid cross-socket access
      as much as possible in the long term.
      
      In the table below, Process is the node on which the process is executing
      and Curr Loc Pgs is the NUMA node where the page currently resides (the
      folio node).
      ===========================================================
      Process  Policy  Curr Loc Pgs     Observation
      -----------------------------------------------------------
      N0       N0 N1      N1         Pages Migrated from N1 to N0
      N0       N0 N1      N2         Pages Migrated from N2 to N0
      N0       N0 N1      N3         Pages Migrated from N3 to N0

      N3       N0 N1      N0         Pages NOT Migrated to N3
      N3       N0 N1      N1         Pages NOT Migrated to N3
      N3       N0 N1      N2         Pages NOT Migrated to N3
      -----------------------------------------------------------
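The rule the table illustrates reduces to a membership test on the policy node mask. The sketch below models only that test, with nodes as small integers and the mask as a bit set; `may_migrate_to()` is a hypothetical helper and ignores every other check the kernel performs before migrating a page.

```c
#include <stdbool.h>

/*
 * Migration towards the executing node is allowed only if that node is
 * part of the MPOL_PREFERRED_MANY policy mask (bit n set means node n).
 */
static bool may_migrate_to(int exec_node, unsigned long policy_mask)
{
    return (policy_mask >> exec_node) & 1UL;
}
```

With a mask of N0 and N1 (`0x3`), a process running on N0 may pull pages towards N0, while a process running on N3 may not pull them towards N3, as in the table.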
      
      Link: https://lkml.kernel.org/r/158acc57319129aa46d50fd64c9330f3e7c7b4bf.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/369d6a58758396335fd1176d97bbca4e7730d75a.1709909210.git.donettom@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Signed-off-by: Donet Tom <donettom@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mempolicy: use numa_node_id() instead of cpu_to_node() · f8fd525b
      Donet Tom authored
      Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY
      policy", v4.
      
      This patchset is to optimize the cross-socket memory access with
      MPOL_PREFERRED_MANY policy.
      
      To test this patch we ran the following test on a 3 node system.
       Node 0 - 2GB   - Tier 1
       Node 1 - 11GB  - Tier 1
       Node 6 - 10GB  - Tier 2
      
      The following changes were made to memcached to set the memory policy;
      they select Node 0 and Node 1 as the preferred nodes.
      
         #include <errno.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <numaif.h>
         #include <numa.h>

          unsigned long nodemask;
          int ret;

          /* Bits 0 and 1 select Node 0 and Node 1 as preferred nodes. */
          nodemask = 0x03;
          ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
                              &nodemask, 10);
          /* If MPOL_F_NUMA_BALANCING isn't supported,
           * fall back to MPOL_PREFERRED_MANY */
          if (ret < 0 && errno == EINVAL) {
              printf("set mem policy normal\n");
              ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
          }
          if (ret < 0) {
              perror("Failed to call set_mempolicy");
              exit(-1);
          }
      
      Test Procedure:
      ===============
      1. Make sure memory tiering and demotion are enabled.
      2. Start memcached.
      
         # ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7
             -d -s "/tmp/memcached.sock"
      
      3. Run memtier_benchmark to store 3200000 keys.
      
        #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
          --threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1
          --key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024
      
      4. Start a memory eater on node 0 and 1. This will demote all memcached
         pages to node 6.
      5. Make sure all the memcached pages got demoted to the lower tier by
         reading /proc/<memcached PID>/numa_maps.
      
          # cat /proc/2771/numa_maps
           ---
          default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
          default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
           ---
      
      6. Kill memory eater.
      7. Read the pgpromote_success counter.
      8. Start reading the keys by running memtier_benchmark.
      
        #./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
         --pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R
         --key-minimum=1 --key-maximum=3200000 -n allkeys
         --threads=64 -c 1 -R -x 6
      
      9. Read the pgpromote_success counter.
      
      Test Results:
      =============
      Without Patch
      ------------------
      1. pgpromote_success  before test
      Node 0:  pgpromote_success 11
      Node 1:  pgpromote_success 140974
      
      pgpromote_success  after test
      Node 0:  pgpromote_success 11
      Node 1:  pgpromote_success 140974
      
      2. Memtier-benchmark result.
      AGGREGATED AVERAGE RESULTS (6 runs)
      ==================================================================
      Type    Ops/sec   Hits/sec   Misses/sec  Avg. Latency  p50 Latency
      ------------------------------------------------------------------
      Sets     0.00       ---         ---        ---          ---
      Gets    305792.03  305791.93   0.10       0.18949       0.16700
      Waits    0.00       ---         ---        ---          ---
      Totals  305792.03  305791.93   0.10       0.18949       0.16700
      
      ======================================
      p99 Latency  p99.9 Latency  KB/sec
      --------------------------------------
      ---          ---            0.00
      0.44700      1.71100        11542.69
      ---          ---            ---
      0.44700      1.71100        11542.69
      
      With Patch
      ---------------
      1. pgpromote_success  before test
      Node 0:  pgpromote_success 5
      Node 1:  pgpromote_success 89386
      
      pgpromote_success  after test
      Node 0:  pgpromote_success 57895
      Node 1:  pgpromote_success 141463
      
      2. Memtier-benchmark result.
      AGGREGATED AVERAGE RESULTS (6 runs)
      ====================================================================
      Type    Ops/sec    Hits/sec  Misses/sec  Avg. Latency  p50 Latency
      --------------------------------------------------------------------
      Sets     0.00        ---       ---        ---           ---
      Gets    521942.24  521942.07  0.17       0.11459        0.10300
      Waits    0.00        ---       ---         ---          ---
      Totals  521942.24  521942.07  0.17       0.11459        0.10300
      
      =======================================
      p99 Latency  p99.9 Latency  KB/sec
      ---------------------------------------
      ---          ---            0.00
      0.23100      0.31900        19701.68
      ---          ---            ---
      0.23100      0.31900        19701.68
      
      
      Test Result Analysis:
      =====================
      1. With the patch, pages are promoted: pgpromote_success increases
         during the test, whereas without the patch it stays unchanged.
      2. The memtier_benchmark results show that, with the patch, throughput
         increased by more than 50% (roughly 70%).
      
       Ops/sec without fix -  305792.03
       Ops/sec with fix    -  521942.24
      
      
      This patch (of 2):
      
      Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is
      quicker.  smp_processor_id is guaranteed to be stable in the
      'mpol_misplaced()' function because it is called with ptl held. 
      lockdep_assert_held was added to ensure that.
      
      No functional change in this patch.
      
      [donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment]
        Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
      Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Signed-off-by: Donet Tom <donettom@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: remove unnecessary check in zswap_find_zpool() · fea68a75
      Yosry Ahmed authored
      zswap_find_zpool() checks if ZSWAP_NR_ZPOOLS > 1, which is always true. 
      This is a remnant from a patch version that had ZSWAP_NR_ZPOOLS as a
      config option and never made it upstream.  Remove the unnecessary check.
      
      Link: https://lkml.kernel.org/r/20240311235210.2937484-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/test_hmm.c: handle src_pfns and dst_pfns allocation failure · c2af060d
      Duoming Zhou authored
      The kcalloc() in dmirror_device_evict_chunk() will return NULL if
      physical memory has run out.  As a result, if src_pfns or dst_pfns is
      dereferenced, a NULL pointer dereference will occur.

      Moreover, the device is going away.  If the kcalloc() fails, the pages
      mapping a chunk cannot be evicted.  So add the __GFP_NOFAIL flag to
      kcalloc().

      Finally, as there is no need for physically contiguous memory, switch
      kcalloc() to kvcalloc() in order to avoid failing allocations.
      
      Link: https://lkml.kernel.org/r/20240312005905.9939-1-duoming@zju.edu.cn
      Fixes: b2ef9f5a ("mm/hmm/test: add selftest driver for HMM")
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zpool: return pool size in pages · 4196b48d
      Johannes Weiner authored
      All zswap backends track their pool sizes in pages.  Currently they
      multiply by PAGE_SIZE for zswap, only for zswap to divide again in order
      to do limit math.  Report pages directly.
      
      Link: https://lkml.kernel.org/r/20240312153901.3441-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: optimize zswap pool size tracking · 91cdcd8d
      Johannes Weiner authored
      Profiling the munmap() of a zswapped memory region shows 60% of the total
      cycles currently going into updating the zswap_pool_total_size.
      
      There are three consumers of this counter:
      - store, to enforce the globally configured pool limit
      - meminfo & debugfs, to report the size to the user
      - shrink, to determine the batch size for each cycle
      
      Instead of aggregating every time an entry enters or exits the zswap
      pool, aggregate the value from the zpools on-demand:
      
      - Stores aggregate the counter anyway upon success. Aggregating to
        check the limit instead is the same amount of work.
      
      - Meminfo & debugfs might benefit somewhat from a pre-aggregated
        counter, but aren't exactly hot paths.
      
      - Shrinking can aggregate once for every cycle instead of doing it for
        every freed entry. As the shrinker might work on tens or hundreds of
        objects per scan cycle, this is a large reduction in aggregations.
      
      The paths that benefit dramatically are swapin, swapoff, and unmaps. 
      There could be millions of pages being processed until somebody asks for
      the pool size again.  This eliminates the pool size updates from those
      paths entirely.
      
      Top profile entries for a 24G range munmap(), before:
      
          38.54%  zswap-unmap  [kernel.kallsyms]  [k] zs_zpool_total_size
          12.51%  zswap-unmap  [kernel.kallsyms]  [k] zpool_get_total_size
           9.10%  zswap-unmap  [kernel.kallsyms]  [k] zswap_update_total_size
           2.95%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
           2.88%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
           2.86%  zswap-unmap  [kernel.kallsyms]  [k] xas_store
      
      and after:
      
           7.70%  zswap-unmap  [kernel.kallsyms]  [k] __slab_free
           7.16%  zswap-unmap  [kernel.kallsyms]  [k] obj_cgroup_uncharge_zswap
           6.74%  zswap-unmap  [kernel.kallsyms]  [k] xas_store
      
      It was also briefly considered to move to a single atomic in zswap
      that is updated by the backends, since zswap only cares about the sum
      of all pools anyway. However, zram directly needs per-pool information
      out of zsmalloc. To keep the backend from having to update two atomics
      every time, I opted for the lazy aggregation instead for now.
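
      The lazy aggregation described above can be sketched in userspace C as
      follows.  This is a simplified model under stated assumptions, not the
      zswap implementation: struct model_pool, MODEL_NR_POOLS and the model_*()
      helpers are hypothetical names, and C11 atomics stand in for the kernel's
      counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model: each backend pool keeps its own page counter. */
struct model_pool {
    atomic_ulong nr_pages;
};

#define MODEL_NR_POOLS 4
static struct model_pool model_pools[MODEL_NR_POOLS];

/* Hot path: only the per-pool counter is touched; no global total is updated. */
static void model_pool_add_pages(size_t idx, unsigned long nr)
{
    assert(idx < MODEL_NR_POOLS);
    atomic_fetch_add(&model_pools[idx].nr_pages, nr);
}

static void model_pool_sub_pages(size_t idx, unsigned long nr)
{
    assert(idx < MODEL_NR_POOLS);
    atomic_fetch_sub(&model_pools[idx].nr_pages, nr);
}

/* Slow path: consumers (limit check, reporting, shrinker) sum on demand. */
static unsigned long model_total_pages(void)
{
    unsigned long total = 0;

    for (size_t i = 0; i < MODEL_NR_POOLS; i++)
        total += atomic_load(&model_pools[i].nr_pages);
    return total;
}
```

      In this model the cost of summing moves from every add/remove to the
      comparatively rare callers of model_total_pages().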
      
      Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: document pXd_leaf() API · 64078b3d
      Peter Xu authored
      There's one small section already, but since we're going to remove
      pXd_huge(), that comment may become obsolete.
      
      Rewrite that section with more information, so that it is hopefully
      crystal clear what the API implies.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-15-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/arm: remove pmd_thp_or_huge() · 502016e3
      Peter Xu authored
      ARM/ARM64 used to define pmd_thp_or_huge().  Now this macro is completely
      redundant.  Remove it and use pmd_leaf().
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-14-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/treewide: remove pXd_huge() · 9636f055
      Peter Xu authored
      This API is not used anymore, drop it for the whole tree.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/treewide: replace pXd_huge() with pXd_leaf() · 1965e933
      Peter Xu authored
      Now that we're sure all pXd_huge() definitions are the same as pXd_leaf(),
      use pXd_leaf() directly.  Luckily, pXd_huge() isn't widely used.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: merge pXd huge mapping checks · 7db86dc3
      Peter Xu authored
      Huge mapping checks in GUP are slightly redundant and can be simplified.
      
      pXd_huge() is now the same as pXd_leaf().  pmd_trans_huge() and
      pXd_devmap() should both imply pXd_leaf().  Time to merge them into one.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-11-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/powerpc: redefine pXd_huge() with pXd_leaf() · 460b9adc
      Peter Xu authored
      PowerPC book3s 4K mostly has the same definition on both, except
      pXd_huge() constantly returns 0 for hash MMUs.  As Michael Ellerman
      pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the bit
      will never be set so it will keep returning false.
      
      As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
      such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
      pgtable walker __find_linux_pte() already used pXd_leaf() to check leaf
      hugetlb mappings.
      
      The goal should be that we will have one API pXd_leaf() to detect all
      kinds of huge mappings (hugepd is still special in this case, though). 
      AFAICT we need to use the pXd_leaf() impl (rather than pXd_huge()'s) to
      make sure that e.g. THPs on hash MMUs will also return true.
      
      This helps to simplify a follow up patch to drop pXd_huge() treewide.
      
      NOTE: the *_leaf() definitions need to be moved before the inclusion of
      asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it.
      
      [1] https://lore.kernel.org/r/87v85zo6w7.fsf@mail.lhotse
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/arm64: merge pXd_huge() and pXd_leaf() definitions · 961a6ee5
      Peter Xu authored
      Unlike most archs, aarch64 defines pXd_huge() and pXd_leaf() slightly
      differently.  Redefine the pXd_huge() with pXd_leaf().
      
      There used to be two traps in the old aarch64 definitions of these APIs
      that I found while reading the surrounding code:
      
       (1) 4797ec2d ("arm64: fix pud_huge() for 2-level pagetables")
       (2) 23bc8f69 ("arm64: mm: fix p?d_leaf()")
      
      Defining pXd_huge() with the current pXd_leaf() makes sure (2) isn't a
      problem (on PROT_NONE checks).  To make sure it also works for (1), we
      move the __PAGETABLE_PMD_FOLDED check over to pud_leaf(), so that it
      constantly returns "false" for 2-level pgtables, which looks even safer
      and covers both cases now.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-9-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/arm: redefine pmd_huge() with pmd_leaf() · 6818135d
      Peter Xu authored
      Most of the archs already define these two APIs the same way.  ARM is more
      complicated in two aspects:
      
        - For pXd_huge() it's always checking against !PXD_TABLE_BIT, while for
          pXd_leaf() it's always checking against PXD_TYPE_SECT.
      
        - SECT/TABLE bits are defined differently on 2-level vs. 3-level ARM
          pgtables, which makes the whole thing even harder to follow.
      
      Luckily, the second complexity should be hidden by the pmd_leaf()
      implementation in the 2-level vs. 3-level headers.  Invoke pmd_leaf()
      directly for pmd_huge(), to remove the first part of the complexity.  This
      prepares for dropping the pXd_huge() API globally.
      
      While at it, drop the outdated comments.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/arm: use macros to define pmd/pud helpers · 7966a2b7
      Peter Xu authored
      It's already confusing that ARM 2-level vs. 3-level defines the SECT bit
      differently on pmd/puds.  Always use a macro, which is much clearer.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/sparc: change pXd_huge() behavior to exclude swap entries · ae798490
      Peter Xu authored
      Please refer to the previous patch on the reasoning for x86.  Now sparc is
      the only architecture that will allow swap entries to be reported as
      pXd_huge().  After this patch, all architectures should forbid swap
      entries in pXd_huge().
      
      [akpm@linux-foundation.org: s/;;/;/, per Muchun]
      Link: https://lkml.kernel.org/r/20240318200404.448346-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/x86: change pXd_huge() behavior to exclude swap entries · d0973cb9
      Peter Xu authored
      This patch partly reverts below commits:
      
      3a194f3f ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry")
      cbef8478 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage")
      
      Right now, the pXd_huge() definition across the kernel is unclear.  We have
      two groups that think differently about swap entries:
      
        - x86/sparc:     Allow pXd_huge() to accept swap entries
        - all the rest:  Do not allow pXd_huge() to accept swap entries
      
      This is confusing.  The sparc helpers seem to have been added in 2016,
      after x86's (2015), so sparc may simply have followed the trend.  x86
      proposed such swap handling in 2015 to resolve hugetlb swap entries hit in
      GUP, but now GUP guards swap entries with !pXd_present() in all layers so
      we should be safe.
      
      We should define this API properly, one way or another, rather than keep
      them defined differently across archs.
      
      Gut feeling tells me that pXd_huge() shouldn't include swap entries, and it
      turns out that I am not the only one thinking so, the question was raised
      when the current pmd_huge() for x86 was proposed by Ville Syrjälä:
      
      https://lore.kernel.org/all/Y2WQ7I4LXh8iUIRd@intel.com/
      
        I might also be missing something obvious, but why is it even necessary
        to treat PRESENT==0+PSE==0 as a huge entry?
      
      It was also questioned when Jason Gunthorpe reviewed the other patchset on
      swap entry handling:
      
      https://lore.kernel.org/all/20240221125753.GQ13330@nvidia.com/
      
      Revert its meaning back to the original.  There shouldn't be any functional
      change, as we should already have explicit guards on !pXd_present()
      everywhere.
      
      Note that I also dropped the "#if CONFIG_PGTABLE_LEVELS > 2"; it was there
      probably because it was breaking things when 3a194f3f was proposed,
      according to the report here:
      
      https://lore.kernel.org/all/Y2LYXItKQyaJTv8j@intel.com/
      
      Now we shouldn't need that.
      
      Instead of reverting to _PAGE_PSE raw check, leverage pXd_leaf().
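
      To make the behavior change concrete, here is a small userspace model of
      the two predicates.  It is a sketch under stated assumptions rather than
      the x86 code: the MODEL_* bit masks, model_mk_swap() and both predicate
      names are hypothetical, and a "swap entry" is modeled simply as a
      non-zero value with both the present and PSE bits clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PRESENT (UINT64_C(1) << 0)  /* hypothetical "present" bit */
#define MODEL_PSE     (UINT64_C(1) << 7)  /* hypothetical "large page" bit */

/* Build a modeled swap entry: non-zero, with PRESENT and PSE both clear. */
static uint64_t model_mk_swap(uint64_t payload)
{
    assert(payload != 0);
    assert((payload & (MODEL_PRESENT | MODEL_PSE)) == 0);
    return payload;
}

/* Old-style check: any non-empty entry that is not a present table pointer. */
static bool model_huge_old(uint64_t val)
{
    return val != 0 && (val & (MODEL_PRESENT | MODEL_PSE)) != MODEL_PRESENT;
}

/* Leaf-style check: only the large-page bit decides. */
static bool model_leaf(uint64_t val)
{
    return (val & MODEL_PSE) != 0;
}
```

      In this model the two predicates agree on present entries and differ only
      on the modeled swap entry, which is the PRESENT==0+PSE==0 case questioned
      above.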
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: check p4d presence before going on · 089f9214
      Peter Xu authored
      Currently there should be no p4d swap entries, so it may not matter much;
      however, this may help us rule out swap entries in the pXd_huge() API, which
      will include p4d_huge().  The p4d_present() checks make it 100% clear that
      we won't rely on p4d_huge() for swap entries.
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: cache p4d in follow_p4d_mask() · e6fd5564
      Peter Xu authored
      Add a variable to cache p4d in follow_p4d_mask().  It's good practice to
      make sure all the following checks will have a consistent view of the
      entry.
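
      The idea of caching the entry can be sketched in userspace C as below.
      This is only a model: model_follow_entry(), the MODEL_* masks and the
      result enum are hypothetical, and the C11 relaxed atomic load is a
      stand-in for the single read that kernel code would typically do with
      READ_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PRESENT (UINT64_C(1) << 0)  /* hypothetical "present" bit */
#define MODEL_LEAF    (UINT64_C(1) << 7)  /* hypothetical "leaf" bit */

enum model_result { MODEL_NO_PAGE, MODEL_LEAF_ENTRY, MODEL_NEXT_LEVEL };

static enum model_result model_follow_entry(_Atomic uint64_t *slot)
{
    assert(slot != NULL);

    /* Read the shared slot exactly once; later checks use only the copy. */
    uint64_t entry = atomic_load_explicit(slot, memory_order_relaxed);

    if (entry == 0 || !(entry & MODEL_PRESENT))
        return MODEL_NO_PAGE;
    if (entry & MODEL_LEAF)
        return MODEL_LEAF_ENTRY;
    return MODEL_NEXT_LEVEL;
}
```

      Because every check looks at the same local copy, a concurrent update of
      the slot cannot make the presence check and the leaf check disagree
      about which value they saw.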
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hmm: process pud swap entry without pud_huge() · 9abc71b4
      Peter Xu authored
      Swap pud entries do not return true for pud_huge() on all archs.
      x86 and sparc (so far) allow it, but all the rest do not accept a swap
      entry to be reported as pud_huge().  So it's not safe to check swap
      entries within pud_huge().  Check swap entries before pud_huge(), so it
      should always be safe.
      
      This is the only place in the kernel that (IMHO, wrongly) relies on
      pud_huge() to return true on pud swap entries.  The plan is to clean up
      pXd_huge() to only report non-swap mappings for all archs.
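
      The ordering argument can be illustrated with a small userspace model.
      It is a sketch under stated assumptions: the MODEL_* masks, the two
      "huge" predicates (one rejecting and one accepting non-present entries)
      and model_classify() are hypothetical and only mimic the two groups of
      architectures described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PRESENT (UINT64_C(1) << 0)  /* hypothetical "present" bit */
#define MODEL_LEAF    (UINT64_C(1) << 7)  /* hypothetical "leaf/huge" bit */

enum model_kind { MODEL_NONE, MODEL_NONPRESENT, MODEL_HUGE, MODEL_TABLE };

/* Flavour 1: rejects non-present (swap-like) entries. */
static bool model_huge_strict(uint64_t v)
{
    return (v & MODEL_PRESENT) && (v & MODEL_LEAF);
}

/* Flavour 2: also reports non-empty, non-present entries as huge. */
static bool model_huge_lenient(uint64_t v)
{
    return v != 0 && (v & (MODEL_PRESENT | MODEL_LEAF)) != MODEL_PRESENT;
}

/* Check none and non-present first, so the predicate only sees present entries. */
static enum model_kind model_classify(uint64_t v, bool (*huge)(uint64_t))
{
    assert(huge != NULL);
    if (v == 0)
        return MODEL_NONE;
    if (!(v & MODEL_PRESENT))
        return MODEL_NONPRESENT;
    return huge(v) ? MODEL_HUGE : MODEL_TABLE;
}
```

      With the non-present check placed first, model_classify() gives the same
      answer for a swap-like entry regardless of which predicate flavour is
      passed in.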
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9abc71b4
    • Lucas Stach's avatar
      mm: page_alloc: control latency caused by zone PCP draining · 55f77df7
      Lucas Stach authored
      Patch series "mm/treewide: Remove pXd_huge() API", v2.
      
      In previous work [1], we removed the pXd_large() API, which is arch
      specific.  This patchset further removes the hugetlb pXd_huge() API.
      
      Hugetlb was never special on creating huge mappings when compared with
      other huge mappings.  Having a standalone API just to detect such pgtable
      entries is more or less redundant, especially after the pXd_leaf() API set
      is introduced with/without CONFIG_HUGETLB_PAGE.
      
      When looking at this problem, a few issues were also exposed: we don't
      have a clear definition of the *_huge() API variants.  This patchset
      starts by cleaning up these issues, then replaces all *_huge() users
      with *_leaf(), then drops all *_huge() code.
      
      On x86/sparc, swap entries will be reported "true" in pXd_huge(), while
      for all the other archs they're reported "false" instead.  This part is
      done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix,
      but I'll leave that to hmm experts to decide.
      
      Besides, there are three archs (arm, arm64, powerpc) that have slightly
      different definitions between the *_huge() and *_leaf() variants.  I
      tackled them separately so that it'll be easier for arch experts to chime
      in when necessary.  This part is done in patch 6-9.
      
      The final patches 10-14 do the rest of the final removal.  Since *_leaf()
      will be the ultimate API in the future, and we seem to have quite some
      confusion on how *_huge() APIs can be defined, they also provide a rich
      comment for the *_leaf() API set to define them properly and avoid future
      misuse.  Hopefully that'll also help new archs to start supporting huge
      mappings and avoid traps (like either swap entries, or PROT_NONE entry
      checks).
      
      [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com
      
      
      This patch (of 14):
      
      When the complete PCP is drained a much larger number of pages than the
      usual batch size might be freed at once, causing large IRQ and preemption
      latency spikes, as they are all freed while holding the pcp and zone
      spinlocks.
      
      To avoid those latency spikes, limit the number of pages freed in a single
      bulk operation to common batch limits.
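
      The batching idea can be sketched in user space as follows.  The
      model_pcp struct and drain_in_batches() are hypothetical names, and the
      locking is only indicated by a comment; this is a simplified model, not
      the kernel's drain path:

```c
#include <stddef.h>

/* Hypothetical model of a per-CPU page list; not the kernel's structure. */
struct model_pcp {
    size_t count;  /* pages currently on the list */
    size_t batch;  /* usual bulk-free batch size */
};

/*
 * Drain the whole list, but free at most 'batch' pages per lock hold.
 * Returns how many separate lock hold periods were needed.
 */
static size_t drain_in_batches(struct model_pcp *pcp)
{
    size_t batch = pcp->batch ? pcp->batch : 1;  /* avoid a zero-sized step */
    size_t rounds = 0;

    while (pcp->count > 0) {
        size_t to_free = pcp->count < batch ? pcp->count : batch;

        /* lock taken here; 'to_free' pages freed; lock dropped again */
        pcp->count -= to_free;
        rounds++;
    }
    return rounds;
}
```

      Each round bounds the time spent with the locks held, at the cost of
      taking them more often.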
      
      Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andreas Larsson <andreas@gaisler.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fabio Estevam <festevam@denx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Shawn Guo <shawnguo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55f77df7
    • Dev Jain's avatar
      selftests/mm: virtual_address_range: Switch to ksft_exit_fail_msg · 13e86096
      Dev Jain authored
      mmap() must not succeed in validate_lower_address_hint(), for if it does,
      it is a bug in mmap() itself.  Reflect this behaviour with
      ksft_exit_fail_msg().  While at it, do some formatting changes.
      
      Link: https://lkml.kernel.org/r/20240314122250.68534-1-dev.jain@arm.com
      Signed-off-by: Dev Jain <dev.jain@arm.com>
      Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      13e86096
    • David Hildenbrand's avatar
      mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE) · fa9fcd8b
      David Hildenbrand authored
      We changed faultin_page_range() to no longer consume a VMA, because
      faultin_page_range() might internally release the mm lock to look up
      the VMA again -- required to cleanly handle VM_FAULT_RETRY. But
      independent of that, __get_user_pages() will always look up the VMA
      itself.
      
      Now that we let __get_user_pages() just handle VMA checks in a way that
      is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise()
      is just overhead. So let's just call madvise_populate()
      on the full range instead.
      
      There is one change in behavior: madvise_walk_vmas() would skip any VMA
      holes, and if everything succeeded, it would return -ENOMEM after
      processing all VMAs.
      
      However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to
      notice any difference: -ENOMEM might either indicate that there were VMA
      holes or that populating page tables failed because there was not enough
      memory. So it's unlikely that user space will notice the difference, and
      that special handling likely only makes sense for some other madvise()
      actions.
      
      Further, we'd already fail with -ENOMEM early in the past if looking up the
      VMA after dropping the MM lock failed because of concurrent VMA
      modifications. So let's just keep it simple and avoid the madvise VMA
      walk, and consistently fail early if we find a VMA hole.
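
      The behavioural difference can be modelled in user space as below.  The
      range is reduced to an array of "is this page covered by a VMA" flags,
      and both function names are hypothetical; the sketch only contrasts
      "skip holes, report at the end" with "fail at the first hole":

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Old behaviour: skip holes, keep populating, report -ENOMEM at the end. */
static int populate_walk(const bool *mapped, size_t n, size_t *populated)
{
    int ret = 0;

    *populated = 0;
    for (size_t i = 0; i < n; i++) {
        if (!mapped[i]) {
            ret = -ENOMEM;  /* remember the hole, but continue */
            continue;
        }
        (*populated)++;
    }
    return ret;
}

/* New behaviour: stop and fail as soon as a hole is found. */
static int populate_fail_early(const bool *mapped, size_t n, size_t *populated)
{
    *populated = 0;
    for (size_t i = 0; i < n; i++) {
        if (!mapped[i])
            return -ENOMEM;
        (*populated)++;
    }
    return 0;
}
```

      Both variants return the same error code for a range with a hole; only
      the amount of work done before returning differs.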
      
      Link: https://lkml.kernel.org/r/20240314161300.382526-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa9fcd8b
    • Yosry Ahmed's avatar
      mm: memcg: add NULL check to obj_cgroup_put() · 91b71e78
      Yosry Ahmed authored
      9 out of 16 callers perform a NULL check before calling obj_cgroup_put().
      Move the NULL check into the function, similar to mem_cgroup_put().  The
      unlikely() NULL check in current_objcg_update() was left alone to avoid
      dropping the unlikely() annotation, as this is a fast path.
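
      The pattern of moving the NULL check into the put helper looks roughly
      like this in a user-space sketch.  model_objcg and model_objcg_put()
      are hypothetical names, and a plain int stands in for the real
      reference counter:

```c
#include <stddef.h>

/* Hypothetical refcounted object; a plain int models the reference count. */
struct model_objcg {
    int refs;
};

/* The helper tolerates NULL, so callers no longer need their own check. */
static void model_objcg_put(struct model_objcg *objcg)
{
    if (objcg)
        objcg->refs--;
}
```

      Callers can then write model_objcg_put(ptr) unconditionally instead of
      wrapping every call in "if (ptr)".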
      
      Link: https://lkml.kernel.org/r/20240316015803.2777252-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91b71e78
    • Christophe Leroy's avatar
      mm: remove guard around pgd_offset_k() macro · 5b0a6700
      Christophe Leroy authored
      The last architecture redefining pgd_offset_k() was IA64 and it was
      removed by commit cf8e8658 ("arch: Remove Itanium (IA-64)
      architecture").
      
      There is no longer any need to guard the generic version of
      pgd_offset_k() with #ifndef pgd_offset_k.
      
      Link: https://lkml.kernel.org/r/59d3f47d5615d18cca1986f269be2fcb3df34556.1710589838.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5b0a6700
    • Andrew Morton's avatar
  2. 25 Apr, 2024 7 commits
    • Miaohe Lin's avatar
      mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() · 52ccdde1
      Miaohe Lin authored
      When I did memory failure tests recently, the warning below occurred:
      
      DEBUG_LOCKS_WARN_ON(1)
      WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
      Modules linked in: mce_inject hwpoison_inject
      CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      RIP: 0010:__lock_acquire+0xccb/0x1ca0
      RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
      RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
      RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
      RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
      R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
      R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
      FS:  00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       lock_acquire+0xbe/0x2d0
       _raw_spin_lock_irqsave+0x3a/0x60
       hugepage_subpool_put_pages.part.0+0xe/0xc0
       free_huge_folio+0x253/0x3f0
       dissolve_free_huge_page+0x147/0x210
       __page_handle_poison+0x9/0x70
       memory_failure+0x4e6/0x8c0
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff9f3114887
      RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
      RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
      RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
       </TASK>
      Kernel panic - not syncing: kernel: panic_on_warn set ...
      CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       panic+0x326/0x350
       check_panic_on_warn+0x4f/0x50
       __warn+0x98/0x190
       report_bug+0x18e/0x1a0
       handle_bug+0x3d/0x70
       exc_invalid_op+0x18/0x70
       asm_exc_invalid_op+0x1a/0x20
      RIP: 0010:__lock_acquire+0xccb/0x1ca0
      RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
      RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
      RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
      RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
      R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
      R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
       lock_acquire+0xbe/0x2d0
       _raw_spin_lock_irqsave+0x3a/0x60
       hugepage_subpool_put_pages.part.0+0xe/0xc0
       free_huge_folio+0x253/0x3f0
       dissolve_free_huge_page+0x147/0x210
       __page_handle_poison+0x9/0x70
       memory_failure+0x4e6/0x8c0
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x380/0x540
       ksys_write+0x64/0xe0
       do_syscall_64+0xbc/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff9f3114887
      RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
      RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
      RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
      R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
       </TASK>
      
      After git bisecting and digging into the code, I believe the root cause is
      that the _deferred_list field of folio is unioned with the
      _hugetlb_subpool field.  In __update_and_free_hugetlb_folio(),
      folio->_deferred_list is initialized, corrupting folio->_hugetlb_subpool
      when the folio is hugetlb.  Later free_huge_folio() uses
      _hugetlb_subpool and the above warning happens.
      
      But it is assumed the hugetlb flag must have been cleared when calling
      folio_put() in update_and_free_hugetlb_folio().  This assumption is broken
      by the race below:
      
      CPU1					CPU2
      dissolve_free_huge_page			update_and_free_pages_bulk
       update_and_free_hugetlb_folio		 hugetlb_vmemmap_restore_folios
      					  folio_clear_hugetlb_vmemmap_optimized
        clear_flag = folio_test_hugetlb_vmemmap_optimized
        if (clear_flag) <-- False, it's already cleared.
         __folio_clear_hugetlb(folio) <-- Hugetlb is not cleared.
        folio_put
         free_huge_folio <-- free_the_page is expected.
      					 list_for_each_entry()
      					  __folio_clear_hugetlb <-- Too late.
      
      Fix this issue by checking whether folio is hugetlb directly instead of
      checking clear_flag to close the race window.
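
      A user-space sketch of the two decision rules follows.  model_folio and
      both prepare_free_*() functions are hypothetical names; the struct only
      carries the two flags that matter for the race, and the concurrent
      clearing by the other CPU is modelled by the flag's initial value:

```c
#include <stdbool.h>

/* Hypothetical model of the two folio flags involved in the race. */
struct model_folio {
    bool hugetlb;            /* folio is still marked as hugetlb */
    bool vmemmap_optimized;  /* may be cleared concurrently by another CPU */
};

/* Old rule: decide from a snapshot of the vmemmap-optimized flag. */
static void prepare_free_old(struct model_folio *f)
{
    bool clear_flag = f->vmemmap_optimized;

    if (clear_flag)
        f->hugetlb = false;
}

/* Fixed rule: test the hugetlb flag itself. */
static void prepare_free_fixed(struct model_folio *f)
{
    if (f->hugetlb)
        f->hugetlb = false;
}
```

      If the other CPU has already cleared vmemmap_optimized, the old rule
      leaves hugetlb set and the folio is freed down the wrong path; the
      fixed rule does not depend on that snapshot.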
      
      Link: https://lkml.kernel.org/r/20240419085819.1901645-1-linmiaohe@huawei.com
      Fixes: 32c87719 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      52ccdde1
    • Muhammad Usama Anjum's avatar
      selftests: mm: protection_keys: save/restore nr_hugepages value from launch script · ed74abcd
      Muhammad Usama Anjum authored
      The save/restore of nr_hugepages was added to the test itself by using the
      atexit() functionality.  But it is broken, as the parent exits after
      creating a child, so the atexit() handler runs early.  That's not all:
      the child exits after creating its own child, and so on.
      
      The parent cannot wait to get the termination status for its children as
      it'll keep on holding the resources until the new pkey allocation fails.
      It is impossible to wait for exits of all the grand and great grand
      children.  Hence restoring the nr_hugepages value from the parent is wrong.
      
      Let's save/restore the nr_hugepages settings in the launch script
      instead of doing it in the test.
      
      Link: https://lkml.kernel.org/r/20240419115027.3848958-1-usama.anjum@collabora.com
      Fixes: c52eb6db ("selftests: mm: restore settings from only parent process")
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Reported-by: Joey Gouly <joey.gouly@arm.com>
      Closes: https://lore.kernel.org/all/20240418125250.GA2941398@e124191.cambridge.arm.com
      Cc: Joey Gouly <joey.gouly@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ed74abcd
    • Andrey Ryabinin's avatar
      stackdepot: respect __GFP_NOLOCKDEP allocation flag · 6fe60465
      Andrey Ryabinin authored
      If stack_depot_save_flags() allocates memory it always drops the
      __GFP_NOLOCKDEP flag.  So when KASAN tries to track a __GFP_NOLOCKDEP
      allocation we may end up with a lockdep splat like below:
      
      ======================================================
       WARNING: possible circular locking dependency detected
       6.9.0-rc3+ #49 Not tainted
       ------------------------------------------------------
       kswapd0/149 is trying to acquire lock:
       ffff88811346a920 (&xfs_nondir_ilock_class){++++}-{4:4}, at: xfs_reclaim_inode+0x3ac/0x590 [xfs]
      
       but task is already holding lock:
       ffffffff8bb33100 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x5d9/0xad0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
       -> #1 (fs_reclaim){+.+.}-{0:0}:
              __lock_acquire+0x7da/0x1030
              lock_acquire+0x15d/0x400
              fs_reclaim_acquire+0xb5/0x100
             prepare_alloc_pages.constprop.0+0xc5/0x230
              __alloc_pages+0x12a/0x3f0
              alloc_pages_mpol+0x175/0x340
              stack_depot_save_flags+0x4c5/0x510
              kasan_save_stack+0x30/0x40
              kasan_save_track+0x10/0x30
              __kasan_slab_alloc+0x83/0x90
              kmem_cache_alloc+0x15e/0x4a0
              __alloc_object+0x35/0x370
              __create_object+0x22/0x90
             __kmalloc_node_track_caller+0x477/0x5b0
              krealloc+0x5f/0x110
              xfs_iext_insert_raw+0x4b2/0x6e0 [xfs]
              xfs_iext_insert+0x2e/0x130 [xfs]
              xfs_iread_bmbt_block+0x1a9/0x4d0 [xfs]
              xfs_btree_visit_block+0xfb/0x290 [xfs]
              xfs_btree_visit_blocks+0x215/0x2c0 [xfs]
              xfs_iread_extents+0x1a2/0x2e0 [xfs]
             xfs_buffered_write_iomap_begin+0x376/0x10a0 [xfs]
             iomap_iter+0x1d1/0x2d0
             iomap_file_buffered_write+0x120/0x1a0
              xfs_file_buffered_write+0x128/0x4b0 [xfs]
              vfs_write+0x675/0x890
              ksys_write+0xc3/0x160
              do_syscall_64+0x94/0x170
             entry_SYSCALL_64_after_hwframe+0x71/0x79
      
      Always preserve __GFP_NOLOCKDEP to fix this.
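
      The effect of the flag filtering can be sketched with made-up bit
      values.  All MODEL_* constants and both filter functions are
      hypothetical and do not match the kernel's gfp_t encoding; the sketch
      only shows that a flag survives an allow-list mask when it is part of
      the allow list:

```c
/* Hypothetical flag bits for illustration; not the kernel's gfp_t values. */
#define MODEL_GFP_KERNEL    0x01u
#define MODEL_GFP_ATOMIC    0x02u
#define MODEL_GFP_NOLOCKDEP 0x04u
#define MODEL_GFP_OTHER     0x08u

/* Old filtering: the allow list silently drops the NOLOCKDEP bit. */
static unsigned int filter_old(unsigned int flags)
{
    return flags & (MODEL_GFP_KERNEL | MODEL_GFP_ATOMIC);
}

/* Fixed filtering: NOLOCKDEP is part of the allow list and is preserved. */
static unsigned int filter_fixed(unsigned int flags)
{
    return flags & (MODEL_GFP_KERNEL | MODEL_GFP_ATOMIC |
                    MODEL_GFP_NOLOCKDEP);
}
```

      Unrelated bits are still stripped by both variants; only the caller's
      request to skip lockdep tracking is carried through.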
      
      Link: https://lkml.kernel.org/r/20240418141133.22950-1-ryabinin.a.a@gmail.com
      Fixes: cd11016e ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
      Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Reported-by: Xiubo Li <xiubli@redhat.com>
      Closes: https://lore.kernel.org/all/a0caa289-ca02-48eb-9bf2-d86fd47b71f4@redhat.com/
      Reported-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Closes: https://lore.kernel.org/all/f9ff999a-e170-b66b-7caf-293f2b147ac2@opensource.wdc.com/
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Xiubo Li <xiubli@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6fe60465
    • Vishal Moola (Oracle)'s avatar
      hugetlb: check for anon_vma prior to folio allocation · 37641efa
      Vishal Moola (Oracle) authored
      Commit 9acad7ba ("hugetlb: use vmf_anon_prepare() instead of
      anon_vma_prepare()") may bail out after allocating a folio if we do not
      hold the mmap lock.  When this occurs, vmf_anon_prepare() will release the
      vma lock.  Hugetlb then attempts to call restore_reserve_on_error(), which
      depends on the vma lock being held.
      
      We can move vmf_anon_prepare() prior to the folio allocation in order to
      avoid calling restore_reserve_on_error() without the vma lock.
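
      The reordering can be modelled in user space as below.  Every name here
      is a hypothetical stand-in: model_anon_prepare() mimics a prepare step
      that drops the vma lock when it bails out, and model_restore_reserve()
      mimics an undo step that expects the lock to be held:

```c
#include <stdbool.h>

/* Hypothetical model of the fault state relevant to the ordering. */
struct model_fault {
    bool vma_locked;        /* true while the vma lock is held */
    bool prepare_fails;     /* knob: make the prepare step bail out */
    bool restore_unlocked;  /* set if the undo step ran without the lock */
};

/* Prepare step: on bailout it releases the vma lock. */
static bool model_anon_prepare(struct model_fault *f)
{
    if (f->prepare_fails) {
        f->vma_locked = false;
        return false;
    }
    return true;
}

/* Undo step for an allocated folio: expects the vma lock to be held. */
static void model_restore_reserve(struct model_fault *f)
{
    if (!f->vma_locked)
        f->restore_unlocked = true;
}

/* Old order: allocate first, so a prepare failure needs the undo step. */
static bool fault_old(struct model_fault *f)
{
    /* folio allocated here, consuming a reservation */
    if (!model_anon_prepare(f)) {
        model_restore_reserve(f);
        return false;
    }
    return true;
}

/* New order: prepare first, so a failure has nothing to undo. */
static bool fault_fixed(struct model_fault *f)
{
    if (!model_anon_prepare(f))
        return false;
    /* folio allocated here */
    return true;
}
```

      In the new order the failure path returns before anything was
      allocated, so the undo step is never reached without the lock.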
      
      Link: https://lkml.kernel.org/r/ZiFqSrSRLhIV91og@fedora
      Fixes: 9acad7ba ("hugetlb: use vmf_anon_prepare() instead of anon_vma_prepare()")
      Reported-by: syzbot+ad1b592fc4483655438b@syzkaller.appspotmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      37641efa
    • Johannes Weiner's avatar
      mm: zswap: fix shrinker NULL crash with cgroup_disable=memory · 682886ec
      Johannes Weiner authored
      Christian reports a NULL deref in zswap that he bisected down to the zswap
      shrinker.  The issue also cropped up in the bug trackers of libguestfs [1]
      and the Red Hat bugzilla [2].
      
      The problem is that when memcg is disabled with the boot time flag, the
      zswap shrinker might get called with sc->memcg == NULL.  This is okay in
      many places, like the lruvec operations.  But it crashes in
      memcg_page_state() - which is only used due to the non-node accounting of
      the cgroup's zswap memory to begin with.
      
      Nhat spotted that the memcg can be NULL in the memcg-disabled case, and I
      was then able to reproduce the crash locally as well.
      
      [1] https://github.com/libguestfs/libguestfs/issues/139
      [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
      
      Link: https://lkml.kernel.org/r/20240418124043.GC1055428@cmpxchg.org
      Link: https://lkml.kernel.org/r/20240417143324.GA1055428@cmpxchg.org
      Fixes: b5ba474f ("zswap: shrink zswap pool based on memory pressure")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Christian Heusel <christian@heusel.eu>
      Debugged-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Nhat Pham <nphamcs@gmail.com>
      Tested-by: Christian Heusel <christian@heusel.eu>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Richard W.M. Jones <rjones@redhat.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: <stable@vger.kernel.org>	[v6.8]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      682886ec
    • Matthew Wilcox (Oracle)'s avatar
      mm: turn folio_test_hugetlb into a PageType · d99e3140
      Matthew Wilcox (Oracle) authored
      The current folio_test_hugetlb() can be fooled by a concurrent folio split
      into returning true for a folio which has never belonged to hugetlbfs. 
      This can't happen if the caller holds a refcount on it, but we have a few
      places (memory-failure, compaction, procfs) which do not and should not
      take a speculative reference.
      
      Since hugetlb pages do not use individual page mapcounts (they are always
      fully mapped and use the entire_mapcount field to record the number of
      mappings), the PageType field is available now that page_mapcount()
      ignores the value in this field.
      
      In compaction and with CONFIG_DEBUG_VM enabled, the current implementation
      can result in an oops, as reported by Luis. This happens since 9c5ccf2d
      ("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks
      in the PageHuge() testing path.
      
      [willy@infradead.org: update vmcoreinfo]
        Link: https://lkml.kernel.org/r/ZgGZUvsdhaT1Va-T@casper.infradead.org
      Link: https://lkml.kernel.org/r/20240321142448.1645400-6-willy@infradead.org
      Fixes: 9c5ccf2d ("mm: remove HUGETLB_PAGE_DTOR")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Luis Chamberlain <mcgrof@kernel.org>
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d99e3140
    • Matthew Wilcox (Oracle)'s avatar
      mm: support page_mapcount() on page_has_type() pages · fd1a745c
      Matthew Wilcox (Oracle) authored
      Return 0 for pages which can't be mapped.  This matches how page_mapped()
      works.  It is more convenient for users to not have to filter out these
      pages.
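
      A user-space sketch of the clamping idea follows.  It assumes the raw
      counter is stored as "mapcount minus one" and that typed pages hold a
      large negative value in the same field; model_page_mapcount() and the
      sample values are hypothetical:

```c
/*
 * Hypothetical model: 'raw' is the stored counter, where -1 means
 * "not mapped" and typed pages are assumed to hold values far below -1.
 */
static int model_page_mapcount(int raw)
{
    int mapcount = raw + 1;

    /* Pages that cannot be mapped report 0 instead of a negative count. */
    if (mapcount < 0)
        mapcount = 0;
    return mapcount;
}
```

      Callers that sum or compare map counts then need no separate filter
      for typed pages.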
      
      Link: https://lkml.kernel.org/r/20240321142448.1645400-5-willy@infradead.org
      Fixes: 9c5ccf2d ("mm: remove HUGETLB_PAGE_DTOR")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fd1a745c