1. 04 Oct, 2023 40 commits
    • kvm: mmu: dynamically allocate the x86-mmu shrinker · e5985c40
      Qi Zheng authored
      Use new APIs to dynamically allocate the x86-mmu shrinker.
      
      Link: https://lkml.kernel.org/r/20230911094444.68966-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shrinker: add infrastructure for dynamically allocating shrinker · c42d50ae
      Qi Zheng authored
      Patch series "use refcount+RCU method to implement lockless slab shrink",
      v6.
      
      1. Background
      =============
      
      We previously implemented lockless slab shrink with SRCU [1], but the kernel
      test robot then reported a -88.8% regression in the
      stress-ng.ramfs.ops_per_sec test case [2], so we reverted it [3].
      
      This patch series aims to re-implement the lockless slab shrink using the
      refcount+RCU method proposed by Dave Chinner [4].
      
      [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
      [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
      [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
      [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
      
      2. Implementation
      =================
      
      Currently, the shrinker instances can be divided into the following three types:
      
      a) global shrinker instance statically defined in the kernel, such as
         workingset_shadow_shrinker.
      
      b) global shrinker instance statically defined in the kernel modules, such as
         mmu_shrinker in x86.
      
      c) shrinker instance embedded in other structures.
      
      For case a, the memory of shrinker instance is never freed. For case b, the
      memory of shrinker instance will be freed after synchronize_rcu() when the
      module is unloaded. For case c, the memory of shrinker instance will be freed
      along with the structure it is embedded in.
      
      In preparation for implementing lockless slab shrink, we need to dynamically
      allocate the shrinker instances in case c, so that their memory can be freed
      independently by calling kfree_rcu().
      
      This patchset adds the following new APIs for dynamically allocating
      shrinkers, and adds a private_data field to struct shrinker to record and
      retrieve the structure the shrinker was originally embedded in.
      
      1. shrinker_alloc()
      2. shrinker_register()
      3. shrinker_free()
      
      To simplify the shrinker-related APIs and make shrinkers more independent of
      other kernel mechanisms, this patchset uses the above APIs to convert all
      shrinkers (including cases a and b) to dynamic allocation, and then removes
      all existing APIs. This has another advantage mentioned by Dave Chinner:
      
      ```
      The other advantage of this is that it will break all the existing out of tree
      code and third party modules using the old API and will no longer work with a
      kernel using lockless slab shrinkers. They need to break (both at the source and
      binary levels) to stop bad things from happening due to using unconverted
      shrinkers in the new setup.
      ```
      
      The shrinker is then freed by calling call_rcu(), and rcu_read_{lock,unlock}()
      is used to ensure that the shrinker instance remains valid. The
      shrinker::refcount mechanism ensures that the shrinker instance will not be
      run again after unregistration, so the structure that records the pointer to
      the shrinker instance can be safely freed without waiting for the RCU
      read-side critical section.
      
      In this way, once lockless slab shrink is implemented, we no longer need to
      block in unregister_shrinker() waiting for the RCU read-side critical section.
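
The lifetime rule described above can be modelled in user space. What follows is only a sketch and not the kernel code: `model_shrinker`, `model_tryget()`, `model_put()`, `model_unregister()` and `model_demo()` are hypothetical names, C11 atomics stand in for the kernel's refcount, and the RCU grace period is not modelled at all (the `freed` flag merely marks the point where the real code would hand the memory to call_rcu()).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a shrinker instance. */
struct model_shrinker {
	atomic_int refcount;	/* 1 while registered, plus one per running scan */
	bool freed;		/* set where the kernel would queue call_rcu() */
};

void model_register(struct model_shrinker *s)
{
	s->freed = false;
	atomic_init(&s->refcount, 1);	/* reference owned by the registration */
}

/* Take a reference unless the count has already dropped to zero. */
bool model_tryget(struct model_shrinker *s)
{
	int old = atomic_load(&s->refcount);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put marks the instance as freed. */
void model_put(struct model_shrinker *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		s->freed = true;
}

/* Unregistration only drops the registration reference; it never waits. */
void model_unregister(struct model_shrinker *s)
{
	model_put(s);
}

/* Returns 1 if the model behaves as described above, 0 otherwise. */
int model_demo(void)
{
	struct model_shrinker s;

	model_register(&s);
	if (!model_tryget(&s))		/* a scan starts */
		return 0;
	model_unregister(&s);		/* returns immediately */
	if (s.freed)			/* still pinned by the running scan */
		return 0;
	model_put(&s);			/* scan ends, last reference is gone */
	if (!s.freed)
		return 0;
	return model_tryget(&s) ? 0 : 1;	/* cannot be run again */
}
```

In this model the instance lives on the stack, so reading the count after the last put is harmless. In the kernel that read is only safe because the memory is released after an RCU grace period, which is the part this sketch leaves out.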
      
      PATCH 1: introduce new APIs
      PATCH 2~38: convert all shrinkers to use new APIs
      PATCH 39: remove old APIs
      PATCH 40~41: some cleanups and preparations
      PATCH 42~43: implement the lockless slab shrink
      PATCH 44~45: convert shrinker_rwsem to mutex
      
      3. Testing
      ==========
      
      3.1 slab shrink stress test
      ---------------------------
      
      We can reproduce the down_read_trylock() hotspot through the following script:
      
      ```
      
      DIR="/root/shrinker/memcg/mnt"
      
      do_create()
      {
          mkdir -p /sys/fs/cgroup/memory/test
          echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          for i in `seq 0 $1`;
          do
              mkdir -p /sys/fs/cgroup/memory/test/$i;
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              mkdir -p $DIR/$i;
          done
      }
      
      do_mount()
      {
          for i in `seq $1 $2`;
          do
              mount -t tmpfs $i $DIR/$i;
          done
      }
      
      do_touch()
      {
          for i in `seq $1 $2`;
          do
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
          done
      }
      
      case "$1" in
        touch)
          do_touch $2 $3
          ;;
        test)
          do_create 4000
          do_mount 0 4000
          do_touch 0 3000
          ;;
        *)
          exit 1
          ;;
      esac
      ```
      
      Save the script above, then run it with the test and touch arguments. We can
      then use the following perf command to view hotspots:
      
      perf top -U -F 999
      
      1) Before applying this patchset:
      
        33.15%  [kernel]          [k] down_read_trylock
        25.38%  [kernel]          [k] shrink_slab
        21.75%  [kernel]          [k] up_read
         4.45%  [kernel]          [k] _find_next_bit
         2.27%  [kernel]          [k] do_shrink_slab
         1.80%  [kernel]          [k] intel_idle_irq
         1.79%  [kernel]          [k] shrink_lruvec
         0.67%  [kernel]          [k] xas_descend
         0.41%  [kernel]          [k] mem_cgroup_iter
         0.40%  [kernel]          [k] shrink_node
         0.38%  [kernel]          [k] list_lru_count_one
      
      2) After applying this patchset:
      
        64.56%  [kernel]          [k] shrink_slab
        12.18%  [kernel]          [k] do_shrink_slab
         3.30%  [kernel]          [k] __rcu_read_unlock
         2.61%  [kernel]          [k] shrink_lruvec
         2.49%  [kernel]          [k] __rcu_read_lock
         1.93%  [kernel]          [k] intel_idle_irq
         0.89%  [kernel]          [k] shrink_node
         0.81%  [kernel]          [k] mem_cgroup_iter
         0.77%  [kernel]          [k] mem_cgroup_calculate_protection
         0.66%  [kernel]          [k] list_lru_count_one
      
      We can see that the top perf hotspot is now shrink_slab rather than
      down_read_trylock, which is what we expect.
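
To make the comparison concrete, the locking-related entries in the two profiles can be added up. A small sketch follows; the function names are illustrative, and the percentages are copied from the two listings above as hundredths of a percent to avoid floating point.

```c
#include <assert.h>
#include <stddef.h>

/* Sum perf shares given in hundredths of a percent (33.15% -> 3315). */
int sum_shares(const int *shares, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		total += shares[i];
	return total;
}

/* down_read_trylock + up_read, before the patchset. */
int rwsem_share_before(void)
{
	const int shares[] = { 3315, 2175 };

	return sum_shares(shares, sizeof(shares) / sizeof(shares[0]));
}

/* __rcu_read_unlock + __rcu_read_lock, after the patchset. */
int rcu_share_after(void)
{
	const int shares[] = { 330, 249 };

	return sum_shares(shares, sizeof(shares) / sizeof(shares[0]));
}
```

With these inputs the rwsem entries account for 54.90% of samples before the patchset, while the RCU entries account for 5.79% after it.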
      
      3.2 registration and unregistration stress test
      -----------------------------------------------
      
      Run the command below to test:
      
      stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
      
      1) Before applying this patchset:
      
      setting to a 60 second run per stressor
      dispatching hogs: 9 ramfs
      stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                                (secs)    (secs)    (secs)   (real time) (usr+sys time)
      ramfs            473062     60.00      8.00    279.13      7884.12        1647.59
      for a 60.01s run time:
         1440.34s available CPU time
            7.99s user time   (  0.55%)
          279.13s system time ( 19.38%)
          287.12s total time  ( 19.93%)
      load average: 7.12 2.99 1.15
      successful run completed in 60.01s (1 min, 0.01 secs)
      
      2) After applying this patchset:
      
      setting to a 60 second run per stressor
      dispatching hogs: 9 ramfs
      stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                                (secs)    (secs)    (secs)   (real time) (usr+sys time)
      ramfs            477165     60.00      8.13    281.34      7952.55        1648.40
      for a 60.01s run time:
         1440.33s available CPU time
            8.12s user time   (  0.56%)
          281.34s system time ( 19.53%)
          289.46s total time  ( 20.10%)
      load average: 6.98 3.03 1.19
      successful run completed in 60.01s (1 min, 0.01 secs)
      
      We can see that the ops/s has hardly changed.
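
For reference, the size of the change can be computed directly from the two stress-ng runs. This is a minimal sketch; `relative_change()` is a helper name introduced here for illustration.

```c
#include <assert.h>

/* Relative change from before to after; 0.01 means +1%. */
double relative_change(double before, double after)
{
	return (after - before) / before;
}
```

Applied to the real-time bogo ops/s figures above (7884.12 before, 7952.55 after) this gives a difference of a little under 1%, and for the usr+sys figures (1647.59 and 1648.40) roughly 0.05%.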
      
      
      This patch (of 45):
      
      Currently, the shrinker instances can be divided into the following three
      types:
      
      a) global shrinker instance statically defined in the kernel, such as
         workingset_shadow_shrinker.
      
      b) global shrinker instance statically defined in the kernel modules, such
         as mmu_shrinker in x86.
      
      c) shrinker instance embedded in other structures.
      
      For case a, the memory of shrinker instance is never freed. For case b,
      the memory of shrinker instance will be freed after synchronize_rcu() when
      the module is unloaded. For case c, the memory of shrinker instance will
      be freed along with the structure it is embedded in.
      
      In preparation for implementing lockless slab shrink, we need to
      dynamically allocate the shrinker instances in case c, so that their
      memory can be freed independently by calling kfree_rcu().
      
      So this commit adds the following new APIs for dynamically allocating
      shrinkers, and adds a private_data field to struct shrinker to record and
      retrieve the structure the shrinker was originally embedded in.
      
      1. shrinker_alloc()
      
      Used to allocate the shrinker instance itself and related memory. It
      returns a pointer to the shrinker instance on success and NULL on failure.
      
      2. shrinker_register()
      
      Used to register the shrinker instance, which is the same as the current
      register_shrinker_prepared().
      
      3. shrinker_free()
      
      Used to unregister (if needed) and free the shrinker instance.
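
The allocate, register, free lifecycle and the role of private_data can be illustrated with a user-space model. This is a sketch under stated assumptions, not the kernel implementation: every `demo_*` name is hypothetical, the struct has only the fields needed for the illustration, and `calloc()`/`free()` stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in; not the kernel's struct shrinker. */
struct demo_shrinker {
	unsigned long (*count_objects)(struct demo_shrinker *s);
	void *private_data;	/* points back at the owning structure */
	bool registered;
};

/* Owner that used to embed the shrinker and now only holds a pointer. */
struct demo_cache {
	unsigned long nr_objects;
	struct demo_shrinker *shrinker;
};

/* Models shrinker_alloc(): returns the instance, or NULL on failure. */
struct demo_shrinker *demo_shrinker_alloc(void)
{
	return calloc(1, sizeof(struct demo_shrinker));
}

/* Models shrinker_register(): make the instance visible to reclaim. */
void demo_shrinker_register(struct demo_shrinker *s)
{
	s->registered = true;
}

/* Models shrinker_free(): unregister if needed, then free. */
void demo_shrinker_free(struct demo_shrinker *s)
{
	if (!s)
		return;
	if (s->registered)
		s->registered = false;	/* unregistration step */
	free(s);	/* later patches in the series defer this via call_rcu() */
}

/* The callback finds its owner through private_data. */
unsigned long demo_cache_count(struct demo_shrinker *s)
{
	struct demo_cache *cache = s->private_data;

	return cache->nr_objects;
}

/* Returns 0 on success, -1 if the shrinker cannot be allocated. */
int demo_cache_init(struct demo_cache *cache, unsigned long nr_objects)
{
	cache->nr_objects = nr_objects;
	cache->shrinker = demo_shrinker_alloc();
	if (!cache->shrinker)
		return -1;
	cache->shrinker->count_objects = demo_cache_count;
	cache->shrinker->private_data = cache;
	demo_shrinker_register(cache->shrinker);
	return 0;
}

void demo_cache_exit(struct demo_cache *cache)
{
	demo_shrinker_free(cache->shrinker);
	cache->shrinker = NULL;
}
```

Because the shrinker is no longer embedded in its owner, the callback cannot recover the owner with a container_of()-style cast; that is the gap private_data fills in this model.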
      
      To simplify the shrinker-related APIs and make shrinkers more independent
      of other kernel mechanisms, subsequent patches will use the above APIs to
      convert all shrinkers (including cases a and b) to dynamic allocation, and
      then remove all existing APIs.
      
      This will also have another advantage mentioned by Dave Chinner:
      
      ```
      The other advantage of this is that it will break all the existing
      out of tree code and third party modules using the old API and will
      no longer work with a kernel using lockless slab shrinkers. They
      need to break (both at the source and binary levels) to stop bad
      things from happening due to using unconverted shrinkers in the new
      setup.
      ```
      
      [zhengqi.arch@bytedance.com: mm: shrinker: some cleanup]
        Link: https://lkml.kernel.org/r/20230919024607.65463-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20230911094444.68966-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20230911094444.68966-2-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • drm/ttm: introduce pool_shrink_rwsem · 0b2f5ea1
      Qi Zheng authored
      Currently, synchronize_shrinkers() is only used by TTM pool.  It only
      requires that no shrinkers run in parallel.
      
      Once lockless slab shrink is implemented with the RCU+refcount method, we
      can no longer use shrinker_rwsem or synchronize_rcu() to guarantee that
      all shrinker invocations have seen an update before freeing memory.
      
      So we introduce a new pool_shrink_rwsem to implement a private
      ttm_pool_synchronize_shrinkers(), so as to achieve the same purpose.
      
      Link: https://lkml.kernel.org/r/20230911092517.64141-5-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: shrinker: remove redundant shrinker_rwsem in debugfs operations · 1dd49e58
      Qi Zheng authored
      debugfs_remove_recursive() waits for debugfs_file_put() to return, so the
      shrinker cannot be freed while a debugfs operation (such as
      shrinker_debugfs_count_show() or shrinker_debugfs_scan_write()) is in
      progress. There is therefore no need to hold shrinker_rwsem during
      debugfs operations.
      
      Link: https://lkml.kernel.org/r/20230911092517.64141-4-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmscan: move shrinker-related code into a separate file · 96f7b2b9
      Qi Zheng authored
      The mm/vmscan.c file is too large, so separate the shrinker-related code
      from it into a separate file.  No functional changes.
      
      Link: https://lkml.kernel.org/r/20230911092517.64141-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move some shrinker-related function declarations to mm/internal.h · 3ee0aa9f
      Qi Zheng authored
      Patch series "cleanups for lockless slab shrink", v4.
      
      This series contains some cleanups for lockless slab shrink.
      
      
      This patch (of 4):
      
      The following functions are only used inside the mm subsystem, so it's
      better to move their declarations to the mm/internal.h file.
      
      1. shrinker_debugfs_add()
      2. shrinker_debugfs_detach()
      3. shrinker_debugfs_remove()
      
      Link: https://lkml.kernel.org/r/20230911092517.64141-1-zhengqi.arch@bytedance.com
      Link: https://lkml.kernel.org/r/20230911092517.64141-2-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3ee0aa9f
    • Adrian Hunter's avatar
      proc/kcore: do not try to access unaccepted memory · e538a582
      Adrian Hunter authored
      Support for unaccepted memory was added recently, refer to commit
      dcdfdd40 ("mm: Add support for unaccepted memory"), whereby a virtual
      machine may need to accept memory before it can be used.
      
      Do not try to access unaccepted memory because it can cause the guest to
      fail.
      
      For /proc/kcore, which is read-only and does not support mmap, this means a
      read of unaccepted memory will return zeros.
      
      Link: https://lkml.kernel.org/r/20230911112114.91323-3-adrian.hunter@intel.com
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e538a582
    • Adrian Hunter's avatar
      efi/unaccepted: do not let /proc/vmcore try to access unaccepted memory · 7cd34dd3
      Adrian Hunter authored
      Patch series "Do not try to access unaccepted memory", v2.
      
      Support for unaccepted memory was added recently, refer to commit
      dcdfdd40 ("mm: Add support for unaccepted memory"), whereby
      a virtual machine may need to accept memory before it can be used.
      
      Plug a few gaps where RAM is exposed without checking if it is
      unaccepted memory.
      
      
      This patch (of 2):
      
      Support for unaccepted memory was added recently, refer to commit
      dcdfdd40 ("mm: Add support for unaccepted memory"), whereby a virtual
      machine may need to accept memory before it can be used.
      
      Do not let /proc/vmcore try to access unaccepted memory because it can
      cause the guest to fail.
      
      For /proc/vmcore, which is read-only, this means a read or mmap of
      unaccepted memory will return zeros.
      
      Link: https://lkml.kernel.org/r/20230911112114.91323-1-adrian.hunter@intel.com
      Link: https://lkml.kernel.org/r/20230911112114.91323-2-adrian.hunter@intel.com
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7cd34dd3
    • Alexander Potapenko's avatar
      kmsan: introduce test_memcpy_initialized_gap() · 46fa84a2
      Alexander Potapenko authored
      Add a regression test for the special case where memcpy() previously
      failed to correctly set the origins: if upon memcpy() four aligned
      initialized bytes with a zero origin value ended up split between two
      aligned four-byte chunks, one of those chunks could have received the
      zero origin value even though it contained uninitialized bytes from
      other writes.
      
      Link: https://lkml.kernel.org/r/20230911145702.2663753-4-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Suggested-by: Marco Elver <elver@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      46fa84a2
    • Alexander Potapenko's avatar
      kmsan: merge test_memcpy_aligned_to_unaligned{,2}() together · c3ab4873
      Alexander Potapenko authored
      Introduce report_reset() that allows checking for more than one KMSAN
      report per testcase.
      
      Fold test_memcpy_aligned_to_unaligned2() into
      test_memcpy_aligned_to_unaligned(), so that they share the setup phase and
      check the behavior of a single memcpy() call.
      
      Link: https://lkml.kernel.org/r/20230911145702.2663753-3-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c3ab4873
    • Alexander Potapenko's avatar
      kmsan: prevent optimizations in memcpy tests · 0be7b2c2
      Alexander Potapenko authored
      Clang 18 learned to optimize away memcpy() calls of small uninitialized
      scalar values.  To ensure that memcpy tests in kmsan_test.c still perform
      calls to memcpy() (which KMSAN replaces with __msan_memcpy()), declare a
      separate memcpy_noinline() function with volatile parameters, which won't
      be optimized.
      
      Also retire DO_NOT_OPTIMIZE(), as memcpy_noinline() is apparently enough.
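
      The wrapper described above can be approximated in userspace as
      follows.  This is a minimal sketch, not the code from kmsan_test.c:
      it assumes a GCC- or Clang-compatible compiler for the noinline
      attribute, and the name and signature are taken from the description
      rather than from the patch itself.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace approximation of the helper described in the message.
 * The volatile-qualified parameters and the noinline attribute are meant
 * to keep the compiler from folding the copy into plain loads and stores.
 */
__attribute__((noinline))
void *memcpy_noinline(volatile void *dst, const volatile void *src, size_t size)
{
	/* Cast the qualifiers away only at the point of the real call. */
	return memcpy((void *)dst, (const void *)src, size);
}
```

      A test would then call memcpy_noinline(&dst, &src, sizeof(src)) in
      place of memcpy(), so that an actual call is emitted.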
      
      Link: https://lkml.kernel.org/r/20230911145702.2663753-2-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0be7b2c2
    • Alexander Potapenko's avatar
      kmsan: simplify kmsan_internal_memmove_metadata() · be1ab60e
      Alexander Potapenko authored
      kmsan_internal_memmove_metadata() is the function that implements copying
      metadata every time memcpy()/memmove() is called.  Because shadow memory
      stores 1 byte per each byte of kernel memory, copying the shadow is
      trivial and can be done by a single memmove() call.
      
      Origins, on the other hand, are stored as 4-byte values corresponding to
      every aligned 4 bytes of kernel memory.  Therefore, if either the source
      or the destination of kmsan_internal_memmove_metadata() is unaligned, the
      number of origin slots corresponding to the source or destination may
      differ:
      
        1) memcpy(0xffff888080a00000, 0xffff888080900000, 4)
           copies 1 origin slot into 1 origin slot:
      
           src (0xffff888080900000): xxxx
           src origins:              o111
           dst (0xffff888080a00000): xxxx
           dst origins:              o111
      
        2) memcpy(0xffff888080a00001, 0xffff888080900000, 4)
           copies 1 origin slot into 2 origin slots:
      
           src (0xffff888080900000): xxxx
           src origins:              o111
           dst (0xffff888080a00000): .xxx x...
           dst origins:              o111 o111
      
        3) memcpy(0xffff888080a00000, 0xffff888080900001, 4)
           copies 2 origin slots into 1 origin slot:
      
           src (0xffff888080900000): .xxx x...
           src origins:              o111 o222
           dst (0xffff888080a00000): xxxx
           dst origins:              o111
                                 (or o222)
      
      Previously, kmsan_internal_memmove_metadata() tried to solve this problem
      by copying min(src_slots, dst_slots) as is and cloning the missing slot on
      one of the ends, if needed.
      
      This was error-prone even in the simple cases where 4 bytes were copied,
      and did not account for situations where the total number of nonzero
      origin slots could have increased by more than one after copying:
      
        memcpy(0xffff888080a00000, 0xffff888080900002, 8)
      
        src (0xffff888080900002): ..xx .... xx..
        src origins:              o111 0000 o222
        dst (0xffff888080a00000): xx.. ..xx
                                  o111 0000
                              (or 0000 o222)
      
      The new implementation simply copies the shadow byte by byte, and updates
      the corresponding origin slot, if the shadow byte is nonzero.  This
      approach can handle complex cases with mixed initialized and uninitialized
      bytes.  Similarly to KMSAN inline instrumentation, later writes to bytes
      sharing the same origin slots take precedence.
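
      The byte-by-byte approach can be modelled in userspace as below.
      This is a simplified sketch and not the kernel implementation: the
      struct meta type and copy_metadata() are hypothetical names, the
      regions are assumed not to overlap, and addresses are modelled as
      offsets into small arrays that start at an aligned address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ORIGIN_SIZE 4	/* one origin slot per 4 aligned bytes, per the text */

/* Hypothetical model of KMSAN metadata for one small, aligned region. */
struct meta {
	uint8_t *shadow;	/* one shadow byte per data byte, nonzero = uninitialized */
	uint32_t *origin;	/* origin[i] covers bytes 4*i .. 4*i + 3 */
};

/* Copy n bytes of metadata from src + src_off to dst + dst_off. */
void copy_metadata(struct meta *dst, size_t dst_off,
		   const struct meta *src, size_t src_off, size_t n)
{
	assert(dst && src);

	for (size_t i = 0; i < n; i++) {
		uint8_t s = src->shadow[src_off + i];

		/* The shadow is always copied as is. */
		dst->shadow[dst_off + i] = s;

		/*
		 * Only a nonzero shadow byte updates the destination slot,
		 * so a later byte sharing the slot overrides an earlier one.
		 */
		if (s)
			dst->origin[(dst_off + i) / ORIGIN_SIZE] =
				src->origin[(src_off + i) / ORIGIN_SIZE];
	}
}
```

      Applied to the 8-byte example above (source offset 2, destination
      offset 0), this model leaves the two destination slots holding o111
      and o222.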
      
      Link: https://lkml.kernel.org/r/20230911145702.2663753-1-glider@google.com
      Fixes: f80be457 ("kmsan: add KMSAN runtime core")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be1ab60e
    • Aleksa Sarai's avatar
      memfd: drop warning for missing exec-related flags · 1717449b
      Aleksa Sarai authored
      Commit 434ed335 ("memfd: improve userspace warnings for missing
      exec-related flags") attempted to make these warnings more useful (so
      they would work as an incentive to get users to switch to specifying
      these flags -- as intended by the original MFD_NOEXEC_SEAL patchset).
      Unfortunately, it turns out that even INFO-level logging is too extreme
      to enable by default and alternative solutions to the spam issue (such
      as doing more extreme rate-limiting per-task) are either too ugly or
      overkill for something as simple as emitting a log as a developer aid.
      
      Given that the flags are new and there is no harm in not specifying them
      (after all, we maintain backwards compatibility) we can just drop the
      warnings for now until some time in the future when most programs have
      migrated and distributions start using vm.memfd_noexec=1 (where failing
      to pass the flag would result in unexpected errors for programs that use
      executable memfds).
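
      For programs that do want to pass one of the exec-related flags
      explicitly, a minimal userspace sketch might look like this.  The
      helper name create_noexec_memfd() is hypothetical, the fallback on
      EINVAL is an assumption about how a program may want to treat
      kernels that predate the flag, and the raw syscall is used only so
      the sketch does not depend on a libc wrapper.

```c
#include <errno.h>
#include <linux/memfd.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older UAPI headers; the value is the one in <linux/memfd.h>. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

/*
 * Hypothetical helper: create a memfd that explicitly asks for a
 * non-executable, exec-sealed file.  Returns a file descriptor or -1.
 */
int create_noexec_memfd(const char *name)
{
	long fd = syscall(SYS_memfd_create, name, MFD_CLOEXEC | MFD_NOEXEC_SEAL);

	/* Kernels that do not know the flag reject it with EINVAL: retry without. */
	if (fd < 0 && errno == EINVAL)
		fd = syscall(SYS_memfd_create, name, MFD_CLOEXEC);

	return (int)fd;
}
```

      A program that needs an executable memfd would pass MFD_EXEC instead
      of MFD_NOEXEC_SEAL.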
      
      Link: https://lkml.kernel.org/r/20230912-memfd-reduce-spam-v2-1-7d92a4964b6a@cyphar.com
      Fixes: 434ed335 ("memfd: improve userspace warnings for missing exec-related flags")
      Fixes: 2562d67b ("revert "memfd: improve userspace warnings for missing exec-related flags".")
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Reported-by: Damian Tometzki <dtometzki@fedoraproject.org>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Cc: Daniel Verkamp <dverkamp@chromium.org>
      Cc: Jeff Xu <jeffxu@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1717449b
    • Ying Sun's avatar
      mm/shmem: remove dead code can not be satisfied by "(CONFIG_SHMEM)&&(!(CONFIG_SHMEM))" · 84e8e54e
      Ying Sun authored
      The ".fs_flags" value at line 4608 is dead code that will never be
      compiled, because the conditions at line 47 ("#ifdef CONFIG_SHMEM")
      and line 4607 are mutually exclusive.  Delete the redundant code.
      
      Link: https://lkml.kernel.org/r/20230906045012.14999-1-sunying@nj.iscas.ac.cn
      Signed-off-by: Ying Sun <sunying@nj.iscas.ac.cn>
      Suggested-by: Yanjie Ren <renyanjie01@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      84e8e54e
    • Angus Chen's avatar
      mm/vmscan: print err before panic · 037dd8f9
      Angus Chen authored
      If panic is enabled, the error information is not printed before the
      BUG_ON(), so swap the two.  Also print the return value of
      PTR_ERR(pgdat->kswapd).
      
      Link: https://lkml.kernel.org/r/20230906083700.181-1-angus.chen@jaguarmicro.com
      Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      037dd8f9
    • Yajun Deng's avatar
      mm/mm_init.c: remove redundant pr_info when node is memoryless · 40dca9b3
      Yajun Deng authored
      There is a similar pr_info in free_area_init_node(), so remove the
      redundant pr_info.
      
      before:
      [    0.006314] Initializing node 0 as memoryless
      [    0.006445] Initmem setup node 0 as memoryless
      [    0.006450] Initmem setup node 1 [mem 0x0000000000001000-0x000000003fffffff]
      [    0.006453] Initmem setup node 2 [mem 0x0000000040000000-0x000000007ffd7fff]
      [    0.006454] Initializing node 3 as memoryless
      [    0.006584] Initmem setup node 3 as memoryless
      [    0.006585] Initmem setup node 4 [mem 0x0000000100000000-0x00000001bfffffff]
      [    0.006586] Initmem setup node 5 [mem 0x00000001c0000000-0x00000001ffffffff]
      [    0.006587] Initmem setup node 6 [mem 0x0000000200000000-0x000000023fffffff]
      
      after:
      [    0.004147] Initmem setup node 0 as memoryless
      [    0.004148] Initmem setup node 1 [mem 0x0000000000001000-0x000000003fffffff]
      [    0.004150] Initmem setup node 2 [mem 0x0000000040000000-0x000000007ffd7fff]
      [    0.004154] Initmem setup node 3 as memoryless
      [    0.004155] Initmem setup node 4 [mem 0x0000000100000000-0x00000001bfffffff]
      [    0.004156] Initmem setup node 5 [mem 0x00000001c0000000-0x00000001ffffffff]
      [    0.004157] Initmem setup node 6 [mem 0x0000000200000000-0x000000023fffffff]
      
      Link: https://lkml.kernel.org/r/20230906091113.4029983-1-yajun.deng@linux.dev
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      40dca9b3
    • Yuan Can's avatar
      mm: hugetlb_vmemmap: allow alloc vmemmap pages fallback to other nodes · 6a898c27
      Yuan Can authored
      In vmemmap_remap_free(), a new head vmemmap page is allocated to avoid
      breaking a contiguous block of struct page memory.  However, the
      allocation always fails when the given node is a movable node.  Remove
      __GFP_THISNODE so that the allocation can fall back to other nodes,
      which helps avoid fragmentation.
      
      Link: https://lkml.kernel.org/r/20230906093157.9737-1-yuancan@huawei.com
      Signed-off-by: Yuan Can <yuancan@huawei.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6a898c27
    • Xiu Jianfeng's avatar
      mm: remove duplicated vma->vm_flags check when expanding stack · 7fa38d0e
      Xiu Jianfeng authored
      expand_upwards() and expand_downwards() return -EFAULT if VM_GROWSUP or
      VM_GROWSDOWN is not correctly set in vma->vm_flags.  However, in the
      !CONFIG_STACK_GROWSUP case, expand_stack_locked() first returns -EINVAL
      if !(vma->vm_flags & VM_GROWSDOWN) before calling expand_downwards().
      To stay consistent with the CONFIG_STACK_GROWSUP case, remove this check.
      
      The usages of this function are as below:
      
      A:fs/exec.c
      ret = expand_stack_locked(vma, stack_base);
      if (ret)
      	ret = -EFAULT;
      
      or
      
      B:mm/memory.c mm/mmap.c
      if (expand_stack_locked(vma, addr))
      	return NULL;
      
      which means the return value does not propagate to other places, so I
      believe there are no user-visible effects of this change, and it is
      unnecessary to backport it to earlier versions.
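
      The argument can be checked with a small userspace model.  This is
      only a sketch of the reasoning: expand_stack_locked_model(),
      caller_a() and caller_b() are hypothetical stand-ins for the kernel
      functions and the two usage patterns quoted above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in: without VM_GROWSDOWN the old code returned
 * -EINVAL early, while the new code reaches the -EFAULT path.
 */
int expand_stack_locked_model(bool has_growsdown, bool old_behaviour)
{
	if (!has_growsdown)
		return old_behaviour ? -EINVAL : -EFAULT;
	return 0;
}

/* Pattern A (fs/exec.c in the message): any failure becomes -EFAULT. */
int caller_a(bool has_growsdown, bool old_behaviour)
{
	int ret = expand_stack_locked_model(has_growsdown, old_behaviour);

	if (ret)
		ret = -EFAULT;
	return ret;
}

/* Pattern B (mm/memory.c, mm/mmap.c): any failure becomes NULL. */
const char *caller_b(bool has_growsdown, bool old_behaviour)
{
	static const char vma_token[] = "vma";

	if (expand_stack_locked_model(has_growsdown, old_behaviour))
		return NULL;
	return vma_token;
}
```

      In this model both callers produce the same result whether the
      helper fails with -EINVAL or with -EFAULT.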
      
      Link: https://lkml.kernel.org/r/20230906103312.645712-1-xiujianfeng@huaweicloud.com
      Fixes: f440fa1a ("mm: make find_extend_vma() fail if write lock not held")
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7fa38d0e
    • SeongJae Park's avatar
      mm/damon/core: remove 'struct target *' parameter from damon_aggregated tracepoint · 2d00946b
      SeongJae Park authored
      The damon_aggregated tracepoint receives 'struct target *', but doesn't
      use it.  Remove it from the prototype.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-12-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2d00946b
    • SeongJae Park's avatar
      mm/damon/core: remove duplicated comment for watermarks-based deactivation · cf0a96bd
      SeongJae Park authored
      The comment explaining the watermarks-based deactivation of the
      monitoring part is duplicated in two paragraphs.  Remove one.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-11-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cf0a96bd
    • SeongJae Park's avatar
      mm/damon/core: add more comments for nr_accesses · d896073f
      SeongJae Park authored
      The comment on struct damon_region about the nr_accesses field is not
      sufficient.  Many people have asked what nr_accesses means.  There is
      a more detailed explanation of the mechanism in the comment for
      struct damon_attrs, but it is also ambiguous, as it doesn't specify the
      name of the counter for aggregating the access check results.  Make
      those more detailed.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d896073f
    • SeongJae Park's avatar
      mm/damon/core: fix a comment about damon_set_attrs() call timings · 27e68c4b
      SeongJae Park authored
      The comment on damon_set_attrs() says it should not be called while the
      kdamond is running, but now some DAMON modules like sysfs interface and
      DAMON_RECLAIM call it from after_aggregation() and/or
      after_wmarks_check() callbacks for online tuning.  Update the comment.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      27e68c4b
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon/usage: link design doc for details of kdamond and context · 46158bf2
      SeongJae Park authored
      The explanation of kdamond and context is duplicated in the design and
      the usage documents.  Replace that in the usage with links to those in
      the design document.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      46158bf2
    • SeongJae Park's avatar
      Docs/mm/damon/design: add a section for kdamond and DAMON context · 86ae64cd
      SeongJae Park authored
      The design document does not explain the concepts of kdamond and the
      DAMON context, while the usage document does.  Those explanations
      should be in the design document, and the usage document should link
      to them.  Add a section for those.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      86ae64cd
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon/usage: explain the format of damon_aggregate tracepoint · 4f554ca1
      SeongJae Park authored
      The example in the section for the damon_aggregated tracepoint does not
      explain what the output looks like or how it can be interpreted.
      Add it.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4f554ca1
    • SeongJae Park's avatar
      Docs/mm/damon/design: explicitly introduce ``nr_accesses`` · 24df886f
      SeongJae Park authored
      The design document explains the access tracking mechanism and the
      access rate counter (nr_accesses), but does not directly mention the
      name.  Add a sentence to make it clear.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      24df886f
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon/usage: move debugfs intro to the bottom of the section · 4f711278
      SeongJae Park authored
      In the DAMON usage introduction section, the introduction of the DAMON
      debugfs interface, which is deprecated, is above the kernel API, which
      is actively supported.  Move the DAMON debugfs intro to the bottom, so
      that readers are less likely to read it.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4f711278
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon/usage: place debugfs usage at the bottom · 75999724
      SeongJae Park authored
      The debugfs interface is deprecated.  Put it at the bottom of the
      document so that readers are less likely to read it.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      75999724
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon/usage: fixup missed :ref: keyword · b4c07800
      SeongJae Park authored
      Patch series "mm/damon: misc fixups for documents, comments and its
      tracepoint".
      
      This patchset contains miscellaneous simple fixups for documents, comments and
      tracepoint of DAMON.
      
      
      This patch (of 11):
      
      A cross-link reference in the DAMON usage document is missing the
      ':ref:' Sphinx keyword.  Fix it.
      
      Link: https://lkml.kernel.org/r/20230907022929.91361-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20230907022929.91361-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b4c07800
    • Nhat Pham's avatar
      zswap: change zswap's default allocator to zsmalloc · 64d4d49c
      Nhat Pham authored
      Out of zswap's 3 allocators, zsmalloc is clearly superior in terms of
      memory utilization, both in theory and as observed in practice, with its
      high storage density and low internal fragmentation.  zsmalloc is also
      more actively developed and maintained, since it is the allocator of
      choice for zswap for many users, as well as the only allocator for zram.
      
      A historical objection to the selection of zsmalloc as the default
      allocator for zswap is its lack of writeback capability.  However, this
      has changed, with the zsmalloc writeback patchset, and the subsequent
      zswap LRU refactor.  With this, there are not many good reasons to keep
      zbud, an otherwise inferior allocator, as the default instead of zsmalloc.
      
      This patch changes the default allocator to zsmalloc.  The only exception
      is on settings without MMU, in which case zbud will remain as the default.
      
      Link: https://lkml.kernel.org/r/20230908235115.2943486-1-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      64d4d49c
    • Joel Fernandes's avatar
      selftests: mm: add a test for moving from an offset from start of mapping · 7b709f38
      Joel Fernandes authored
      It is possible that the aligned address falls on no existing mapping,
      however that does not mean that we can just align it down to that.  This
      test verifies that the "vma->vm_start != addr_to_align" check in
      can_align_down() prevents disastrous results if aligning down when source
      and dest are mutually aligned within a PMD but the source/dest addresses
      requested are not at the beginning of the respective mapping containing
      these addresses.
      
      Link: https://lkml.kernel.org/r/20230903151328.2981432-8-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7b709f38
    • Joel Fernandes (Google)'s avatar
      selftests: mm: add a test for remapping within a range · 85a22845
      Joel Fernandes (Google) authored
      Move a block of memory within a memory range.  Any alignment optimization
      on the source address may cause corruption.  Verify using kselftest that
      it works.  I have also verified with tracing that such optimization does
      not happen due to this check in can_align_down():
      
      if (!for_stack && vma->vm_start != addr_to_align)
      	return false;
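
      The effect of that check can be illustrated with a small userspace
      model.  This is a sketch only: struct vma_model, PMD_SIZE_MODEL and
      both helpers are hypothetical names, the 2 MiB size is taken from
      the mask value in the debug prints of this series, and only the
      single check quoted above is modelled, not the rest of
      can_align_down().

```c
#include <stdbool.h>

#define PMD_SIZE_MODEL (2UL << 20)	/* assumed 2 MiB PMD, as on x86-64 */

/* Hypothetical stand-in for the fields of struct vm_area_struct used here. */
struct vma_model {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Round an address down to the start of its PMD-sized block. */
unsigned long pmd_align_down(unsigned long addr)
{
	return addr & ~(PMD_SIZE_MODEL - 1);
}

/*
 * Models only the quoted check: outside the stack case, aligning down is
 * vetoed unless the address is the very start of its mapping.
 */
bool start_check_allows_align_down(const struct vma_model *vma,
				   unsigned long addr_to_align, bool for_stack)
{
	if (!for_stack && vma->vm_start != addr_to_align)
		return false;
	return true;
}
```

      For a mapping that starts at 0x2900000, a move from 0x2900000 passes
      the check, while a move from an address inside the mapping does not,
      so the source is never aligned down over data that precedes it in
      the same mapping.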
      
      Link: https://lkml.kernel.org/r/20230903151328.2981432-7-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: add a test for remapping to area immediately after existing mapping · a4cb3b24
      Joel Fernandes (Google) authored
      This patch adds support for verifying that we correctly handle the
      situation where something is already mapped before the destination of the remap.
      
      Any realignment of destination address and PMD-copy will destroy that
      existing mapping. In such cases, we need to avoid doing the optimization.
      
      To test this, we map an area called the preamble before the remap
      region. Then we verify after the mremap operation that this region did not get
      corrupted.
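
The preamble check described above can be sketched in user space roughly as follows. This is a minimal illustration rather than the selftest's actual code: the function name preamble_survives_mremap() and the fill patterns are assumptions made for the example, and it uses single pages instead of PMD-sized regions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for mremap() and the MREMAP_* flags */
#endif
#include <assert.h>		/* only for the usage shown after the block */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns true if a one-page "preamble" mapped directly
 * below the remap destination still holds its pattern after mremap(), and
 * false on corruption or on any mapping failure.
 */
static bool preamble_survives_mremap(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = 2 * page;		/* size of the region being moved */
	bool ok = false;

	/* One contiguous area: [preamble page][destination pages]. */
	char *area = mmap(NULL, page + len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return false;

	char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED) {
		munmap(area, page + len);
		return false;
	}

	memset(area, 0xAA, page);	/* preamble pattern */
	memset(src, 0x55, len);		/* source pattern */

	/* Move the source onto the pages right after the preamble. */
	char *dst = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
			   area + page);
	if (dst == MAP_FAILED) {
		munmap(src, len);
		goto out;
	}

	ok = true;
	for (size_t i = 0; i < page; i++)
		if ((unsigned char)area[i] != 0xAA)
			ok = false;	/* preamble was corrupted */
	for (size_t i = 0; i < len; i++)
		if ((unsigned char)dst[i] != 0x55)
			ok = false;	/* moved data did not arrive intact */
out:
	munmap(area, page + len);	/* also unmaps the moved pages */
	return ok;
}
```

A caller could simply run assert(preamble_survives_mremap()); on a Linux system. Triggering the PMD-level path itself would need regions larger than a PMD with matching offsets, which this sketch does not attempt.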
      
      Putting some prints in the kernel, I verified that we optimize
      correctly in different situations:
      
      Optimize when there is alignment and no previous mapping (this is tested
      by the previous patch).
      <prints>
      can_align_down(old_vma->vm_start=2900000, old_addr=2900000, mask=-2097152): 0
      can_align_down(new_vma->vm_start=2f00000, new_addr=2f00000, mask=-2097152): 0
      === Starting move_page_tables ===
      Doing PUD move for 2800000 -> 2e00000 of extent=200000 <-- Optimization
      Doing PUD move for 2a00000 -> 3000000 of extent=200000
      Doing PUD move for 2c00000 -> 3200000 of extent=200000
      </prints>
      
      Don't optimize when there is alignment but a previous mapping exists
      (this is tested by this patch).
      Notice that can_align_down() returns 1 for the destination mapping,
      as we detected that something is mapped there.
      <prints>
      can_align_down(old_vma->vm_start=2900000, old_addr=2900000, mask=-2097152): 0
      can_align_down(new_vma->vm_start=5700000, new_addr=5700000, mask=-2097152): 1
      === Starting move_page_tables ===
      Doing move_ptes for 2900000 -> 5700000 of extent=100000 <-- Unoptimized
      Doing PUD move for 2a00000 -> 5800000 of extent=200000
      Doing PUD move for 2c00000 -> 5a00000 of extent=200000
      </prints>
      
      Link: https://lkml.kernel.org/r/20230903151328.2981432-6-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: add a test for mutually aligned moves > PMD size · 8ed873d8
      Joel Fernandes (Google) authored
      This patch adds a test case to check if a PMD-alignment optimization
      successfully happens.
      
      I add support to make sure there is some room before the source mapping.
      Otherwise, the kernel will detect that a mapping exists before the source
      and will disable the PMD-aligned move optimization.
      
      Link: https://lkml.kernel.org/r/20230903151328.2981432-5-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: fix failure case when new remap region was not found · 99eb26d5
      Joel Fernandes (Google) authored
      When a valid remap region could not be found, the source mapping is not
      cleaned up.  Fix the goto statement such that the clean up happens.
      
      Link: https://lkml.kernel.org/r/20230903151328.2981432-4-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mremap: allow moves within the same VMA for stack moves · b1e5a3de
      Joel Fernandes (Google) authored
      For the stack move happening in shift_arg_pages(), the move is happening
      within the same VMA which spans the old and new ranges.
      
      In case the aligned address happens to fall within that VMA, allow such
      moves and don't abort the mremap alignment optimization.
      
      In the regular non-stack mremap case, we cannot allow any such moves as
      they will end up destroying some part of the mapping (either the source of
      the move, or part of the existing mapping).  So permit this only for stack
      moves.
      
      Link: https://lkml.kernel.org/r/20230903151328.2981432-3-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mremap: optimize the start addresses in move_page_tables() · af8ca1c1
      Joel Fernandes (Google) authored
      Patch series "Optimize mremap during mutual alignment within PMD", v6.
      
      This patchset optimizes the start addresses in move_page_tables() and
      tests the changes.  It addresses a warning [1] that occurs due to a
      downward, overlapping move on a mutually-aligned offset within a PMD
      during exec.  By initiating the copy process at the PMD level when such
      alignment is present, we can prevent this warning and speed up the copying
      process at the same time.  Linus Torvalds suggested this idea.  Check the
      individual patches for more details.

      [1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
      
      
      This patch (of 7):
      
      Recently, we have seen reports [1] of a warning that triggers due to
      move_page_tables() doing a downward and overlapping move on a
      mutually-aligned offset within a PMD.  By mutual alignment, I mean the
      source and destination addresses of the mremap are at the same offset
      within a PMD.
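
As an illustration of the definition above, mutual alignment can be expressed as a comparison of the two addresses' offsets within a PMD-sized region. The sketch assumes a 2 MiB PMD (as on x86-64 with 4 KiB pages); the SKETCH_* macros and both function names are hypothetical and do not come from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a PMD entry covers 2 MiB, as on x86-64 with 4 KiB pages. */
#define SKETCH_PMD_SIZE ((uintptr_t)1 << 21)
#define SKETCH_PMD_MASK (~(SKETCH_PMD_SIZE - 1))

/* Offset of an address from the start of the PMD-sized region containing it. */
static uintptr_t pmd_offset_of(uintptr_t addr)
{
	return addr & ~SKETCH_PMD_MASK;
}

/* True when source and destination sit at the same offset within a PMD. */
static bool mutually_aligned(uintptr_t src, uintptr_t dst)
{
	return pmd_offset_of(src) == pmd_offset_of(dst);
}
```

For example, 0x2900000 and 0x2f00000 both lie 0x100000 bytes above a 2 MiB boundary, so mutually_aligned(0x2900000, 0x2f00000) is true, while shifting either address by one page makes it false.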
      
      This mutual alignment along with the fact that the move is downward is
      sufficient to cause a warning related to having an allocated PMD that does
      not have PTEs in it.
      
      This warning will only trigger when there is mutual alignment in the move
      operation.  A solution, as suggested by Linus Torvalds [2], is to initiate
      the copy process at the PMD level whenever such alignment is present. 
      Implementing this approach will not only prevent the warning from being
      triggered, but it will also optimize the operation as this method should
      enhance the speed of the copy process whenever there's a possibility to
      start copying at the PMD level.
      
      Some more points:
      a. The optimization can be done only when both the source and
      destination of the mremap do not have anything mapped below it up to a
      PMD boundary. I add support to detect that.
      
      b. Point (a) is not a problem for the call to move_page_tables() from
      exec.c, as nothing is expected to be mapped below the source. However, for
      non-overlapping mutually aligned moves as triggered by mremap(2), I
      added support for checking such cases.
      
      c. I currently only optimize for PMD moves, in the future I/we can build
      on this work and do PUD moves as well if there is a need for this. But I
      want to take it one step at a time.
      
      d. We need to be careful about mremap of ranges within the VMA itself.
      For this purpose, I added checks to determine if the address after
      alignment falls within its VMA itself.
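
The checks in points (a) and (d), together with the stack exception added later in the series, can be modelled in user space along the following lines. This is a simplified sketch and not the kernel's implementation: struct model_vma, the flat array of mappings, and the model_* names are all assumptions made for illustration, and a 2 MiB PMD is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: the half-open range [start, end). */
struct model_vma {
	unsigned long start;
	unsigned long end;
};

/* Assumption: 2 MiB PMD coverage, so the mask clears the low 21 bits. */
#define MODEL_PMD_MASK (~((1UL << 21) - 1))

/* True if any mapping in the array intersects [lo, hi). */
static bool model_range_mapped(const struct model_vma *maps, size_t n,
			       unsigned long lo, unsigned long hi)
{
	for (size_t i = 0; i < n; i++)
		if (maps[i].start < hi && maps[i].end > lo)
			return true;
	return false;
}

/* Model of the decision: may @addr be rounded down to a PMD boundary? */
static bool model_can_align_down(const struct model_vma *maps, size_t n,
				 const struct model_vma *vma,
				 unsigned long addr, bool for_stack)
{
	unsigned long aligned = addr & MODEL_PMD_MASK;

	/* Point (d): a regular mremap must start exactly at the VMA start. */
	if (!for_stack && vma->start != addr)
		return false;

	/* Stack moves may align to an address inside the same VMA. */
	if (for_stack && aligned >= vma->start)
		return true;

	/* Point (a): nothing may be mapped between the boundary and the VMA. */
	return !model_range_mapped(maps, n, aligned, vma->start);
}
```

With a mapping occupying [0x5600000, 0x5700000) directly below a VMA that starts at 0x5700000, the model refuses to align down; with nothing mapped below the VMA start, it allows the alignment.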
      
      [1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
      [2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20230903151328.2981432-1-joel@joelfernandes.org
      Link: https://lkml.kernel.org/r/20230903151328.2981432-2-joel@joelfernandes.org
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: hugetlb_vmemmap: fix hugetlb page number decrease failed on movable nodes · 2eaa6c2a
      Yuan Can authored
      Decreasing the number of hugetlb pages fails with the following
      message:
      
       sh: page allocation failure: order:0, mode:0x204cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_THISNODE)
       CPU: 1 PID: 112 Comm: sh Not tainted 6.5.0-rc7-... #45
       Hardware name: linux,dummy-virt (DT)
       Call trace:
        dump_backtrace.part.6+0x84/0xe4
        show_stack+0x18/0x24
        dump_stack_lvl+0x48/0x60
        dump_stack+0x18/0x24
        warn_alloc+0x100/0x1bc
        __alloc_pages_slowpath.constprop.107+0xa40/0xad8
        __alloc_pages+0x244/0x2d0
        hugetlb_vmemmap_restore+0x104/0x1e4
        __update_and_free_hugetlb_folio+0x44/0x1f4
        update_and_free_hugetlb_folio+0x20/0x68
        update_and_free_pages_bulk+0x4c/0xac
        set_max_huge_pages+0x198/0x334
        nr_hugepages_store_common+0x118/0x178
        nr_hugepages_store+0x18/0x24
        kobj_attr_store+0x18/0x2c
        sysfs_kf_write+0x40/0x54
        kernfs_fop_write_iter+0x164/0x1dc
        vfs_write+0x3a8/0x460
        ksys_write+0x6c/0x100
        __arm64_sys_write+0x1c/0x28
        invoke_syscall+0x44/0x100
        el0_svc_common.constprop.1+0x6c/0xe4
        do_el0_svc+0x38/0x94
        el0_svc+0x28/0x74
        el0t_64_sync_handler+0xa0/0xc4
        el0t_64_sync+0x174/0x178
       Mem-Info:
        ...
      
      The reason is that the hugetlb pages being released were allocated from
      movable nodes, and with hugetlb_optimize_vmemmap enabled, vmemmap pages
      need to be allocated from the same node while the hugetlb pages are
      released. With GFP_KERNEL and __GFP_THISNODE set, allocation from a
      movable node always fails. Fix this problem by removing __GFP_THISNODE.
      
      Link: https://lkml.kernel.org/r/20230905124503.24899-1-yuancan@huawei.com
      Fixes: ad2fa371 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
      Signed-off-by: Yuan Can <yuancan@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmstat: use this_cpu_try_cmpxchg in mod_{zone,node}_state · 77cd8148
      Uros Bizjak authored
      Use this_cpu_try_cmpxchg instead of this_cpu_cmpxchg (*ptr, old, new) ==
      old in mod_zone_state and mod_node_state.  The x86 CMPXCHG instruction
      reports success in the ZF flag, so this change saves a compare after
      cmpxchg (and the related move instruction in front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails.  There is no need to re-read the value in the loop.
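
The difference between the two loop shapes can be illustrated in portable C11, whose atomic_compare_exchange_* functions share the property described above: on failure they write the observed value back into the "expected" argument. This is only an analogy under that assumption, not the kernel's per-CPU operations; model_cmpxchg() and the add_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Value-returning cmpxchg, modelled on top of the C11 primitive. */
static long model_cmpxchg(_Atomic long *ptr, long old, long val)
{
	/* On failure, "old" is overwritten with the value actually observed. */
	atomic_compare_exchange_strong(ptr, &old, val);
	return old;
}

/* Old shape: compare the returned value with "old", re-read on every retry. */
static long add_old_style(_Atomic long *ptr, long delta)
{
	long old, val;

	do {
		old = atomic_load(ptr);		/* explicit re-read each iteration */
		val = old + delta;
	} while (model_cmpxchg(ptr, old, val) != old);

	return val;
}

/* try_cmpxchg shape: a failed attempt already refreshed "old". */
static long add_try_style(_Atomic long *ptr, long delta)
{
	long old = atomic_load(ptr);		/* read once, before the loop */
	long val;

	do {
		val = old + delta;
	} while (!atomic_compare_exchange_weak(ptr, &old, val));

	return val;
}
```

Both helpers return the value they stored. The second form needs no separate comparison and no reload inside the loop, which is the saving the commit message describes.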
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230904150917.8318-1-ubizjak@gmail.com
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: convert DAX lock/unlock page to lock/unlock folio · 91e79d22
      Matthew Wilcox (Oracle) authored
      The one caller of DAX lock/unlock page already calls compound_head(), so
      use page_folio() instead, then use a folio throughout the DAX code to
      remove uses of page->mapping and page->index.
      
      [jane.chu@oracle.com: add comment to mf_generic_kill_procs(), simplify mf_generic_kill_procs:folio initialization]
        Link: https://lkml.kernel.org/r/20230908222336.186313-1-jane.chu@oracle.com
      Link: https://lkml.kernel.org/r/20230822231314.349200-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>