1. 29 Jun, 2021 40 commits
    • mm: ignore MAP_EXECUTABLE in ksys_mmap_pgoff() · 3b8db39f
      David Hildenbrand authored
      Let's also remove masking off MAP_EXECUTABLE from ksys_mmap_pgoff(): the
      last in-tree occurrence of MAP_EXECUTABLE is now in LEGACY_MAP_MASK, which
      accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag is
      ignored throughout the kernel now.
      
      Add a comment to LEGACY_MAP_MASK stating that MAP_EXECUTABLE is ignored.
      
      Link: https://lkml.kernel.org/r/20210421093453.6904-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
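
The commit message above says MAP_EXECUTABLE is now ignored throughout the kernel. A minimal userspace sketch, assuming a Linux system with glibc, of what that means for a caller: an anonymous private mapping is requested with and without the flag. The helper name try_anon_map() is hypothetical, and a successful mapping is only consistent with the flag being ignored; it does not prove it.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Request an anonymous private mapping with optional extra flag bits.
 * Returns 1 when mmap() succeeds and 0 otherwise.
 * Hypothetical helper for illustration; not part of the patch.
 */
int try_anon_map(int extra_flags)
{
    const size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);

    if (p == MAP_FAILED)
        return 0;
    munmap(p, len);
    return 1;
}
```

Calling try_anon_map(0) and try_anon_map(MAP_EXECUTABLE) is expected to give the same result on such a system.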
    • binfmt: remove in-tree usage of MAP_EXECUTABLE · a4eec6a3
      David Hildenbrand authored
      Ever since commit e9714acf ("mm: kill vma flag VM_EXECUTABLE and
      mm->num_exe_file_vmas"), VM_EXECUTABLE is gone and MAP_EXECUTABLE is
      essentially completely ignored.  Let's remove all usage of MAP_EXECUTABLE.
      
      [akpm@linux-foundation.org: fix blooper in fs/binfmt_aout.c. per David]
      
      Link: https://lkml.kernel.org/r/20210421093453.6904-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • perf: MAP_EXECUTABLE does not indicate VM_MAYEXEC · 8fa20752
      David Hildenbrand authored
      Patch series "perf/binfmt/mm: remove in-tree usage of MAP_EXECUTABLE".
      
      Stumbling over the history of MAP_EXECUTABLE, I noticed that we still have
      some in-tree users that we can get rid of.
      
      This patch (of 3):
      
      Before commit e9714acf ("mm: kill vma flag VM_EXECUTABLE and
      mm->num_exe_file_vmas"), VM_EXECUTABLE indicated MAP_EXECUTABLE.
      MAP_EXECUTABLE is nowadays essentially ignored by the kernel and does not
      relate to VM_MAYEXEC.
      
      Link: https://lkml.kernel.org/r/20210421093453.6904-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210421093453.6904-2-david@redhat.com
      Fixes: f972eb63 ("perf: Pass protection and flags bits through mmap2 interface")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: remove trailing semicolon in macros · 6a1803bb
      Huilong Deng authored
      Macros should not use a trailing semicolon.
      
      Link: https://lkml.kernel.org/r/20210614091530.22117-1-denghuilong@cdjrlc.com
      Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
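
The rule in the commit above ("macros should not use a trailing semicolon") can be illustrated with a small sketch. The macros BUMP_BAD and BUMP_GOOD and the function bump_if() are hypothetical names invented for this example; they are not taken from mm/memcontrol.c.

```c
/* Hypothetical macros for illustration only. */
#define BUMP_BAD(counter)  (counter)++;   /* trailing semicolon baked in */
#define BUMP_GOOD(counter) (counter)++    /* caller supplies the semicolon */

/*
 * Increments *hits when cond is non-zero, *misses otherwise, and
 * returns the difference between the two counters.
 */
int bump_if(int cond, int *hits, int *misses)
{
    /*
     * Writing "BUMP_BAD(*hits);" here would expand to "(*hits)++;;".
     * The extra empty statement ends the if, so the following else
     * would have no matching if and the code would not compile.
     * BUMP_GOOD behaves like an ordinary expression statement.
     */
    if (cond)
        BUMP_GOOD(*hits);
    else
        BUMP_GOOD(*misses);
    return *hits - *misses;
}
```

Leaving the semicolon to the caller keeps the macro usable anywhere a single statement or expression is expected.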
    • loop: charge i/o to mem and blk cg · c74d40e8
      Dan Schatzberg authored
      The current code only associates with the existing blkcg when aio is used
      to access the backing file.  This patch covers all types of i/o to the
      backing file and also associates the memcg so if the backing file is on
      tmpfs, memory is charged appropriately.
      
      This patch also exports cgroup_get_e_css and int_active_memcg so they can
      be used by the loop module.
      
      Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com
      Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: charge active memcg when no mm is set · 04f94e3f
      Dan Schatzberg authored
      set_active_memcg() worked for kernel allocations but was silently ignored
      for user pages.
      
      This patch establishes a precedence order for who gets charged:
      
      1. If there is a memcg associated with the page already, that memcg is
         charged. This happens during swapin.
      
      2. If an explicit mm is passed, mm->memcg is charged. This happens
         during page faults, which can be triggered in remote VMs (eg gup).
      
      3. Otherwise consult the current process context. If there is an
         active_memcg, use that. Otherwise, current->mm->memcg.
      
      Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would
      always charge the root cgroup.  Now it looks up the active_memcg first
      (falling back to charging the root cgroup if not set).
      
      Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.com
      Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Chris Down <chris@chrisdown.name>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
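
The precedence order described in the commit above can be sketched as a small userspace model. All types and names here (struct memcg, struct task_ctx, pick_memcg(), root_memcg) are hypothetical stand-ins, not the kernel's definitions, and the final fallback to the root group when the current context has no mm is an assumption made for the sketch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel structures. */
struct memcg { const char *name; };
struct mm    { struct memcg *memcg; };
struct page  { struct memcg *memcg; };

struct task_ctx {
    struct memcg *active_memcg;  /* explicitly set active memcg, or NULL */
    struct mm *mm;               /* NULL for a context without an mm */
};

struct memcg root_memcg = { "root" };

/* Pick the memcg to charge, following the three-step precedence order. */
struct memcg *pick_memcg(const struct page *page, const struct mm *mm,
                         const struct task_ctx *cur)
{
    if (page && page->memcg)      /* 1. page already has a memcg (swapin) */
        return page->memcg;
    if (mm)                       /* 2. an explicit mm was passed */
        return mm->memcg;
    if (cur->active_memcg)        /* 3. current context: active memcg first */
        return cur->active_memcg;
    if (cur->mm)                  /*    then the current task's own mm */
        return cur->mm->memcg;
    return &root_memcg;           /* assumed fallback: charge the root group */
}
```

The point of the ordering is that an active memcg is only consulted when the caller supplied neither a pre-charged page nor an explicit mm.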
    • loop: use worker per cgroup instead of kworker · 87579e9b
      Dan Schatzberg authored
      Patch series "Charge loop device i/o to issuing cgroup", v14.
      
      The loop device runs all i/o to the backing file on a separate kworker
      thread which results in all i/o being charged to the root cgroup.  This
      allows a loop device to be used to trivially bypass resource limits and
      other policy.  This patch series fixes this gap in accounting.
      
      A simple script to demonstrate this behavior on cgroupv2 machine:
      
      '''
      #!/bin/bash
      set -e
      
      CGROUP=/sys/fs/cgroup/test.slice
      LOOP_DEV=/dev/loop0
      
      if [[ ! -d $CGROUP ]]
      then
          sudo mkdir $CGROUP
      fi
      
      grep oom_kill $CGROUP/memory.events
      
      # Set a memory limit, write more than that limit to tmpfs -> OOM kill
      sudo unshare -m bash -c "
      echo \$\$ > $CGROUP/cgroup.procs;
      echo 0 > $CGROUP/memory.swap.max;
      echo 64M > $CGROUP/memory.max;
      mount -t tmpfs -o size=512m tmpfs /tmp;
      dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
      
      grep oom_kill $CGROUP/memory.events
      
      # Set a memory limit, write more than that limit through loopback
      # device -> no OOM kill
      sudo unshare -m bash -c "
      echo \$\$ > $CGROUP/cgroup.procs;
      echo 0 > $CGROUP/memory.swap.max;
      echo 64M > $CGROUP/memory.max;
      mount -t tmpfs -o size=512m tmpfs /tmp;
      truncate -s 512m /tmp/backing_file
      losetup $LOOP_DEV /tmp/backing_file
      dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
      losetup -D $LOOP_DEV" || true
      
      grep oom_kill $CGROUP/memory.events
      '''
      
      Naively charging cgroups could result in priority inversions through the
      single kworker thread in the case where multiple cgroups are
      reading/writing to the same loop device.  This patch series does some
      minor modification to the loop driver so that each cgroup can make forward
      progress independently to avoid this inversion.
      
      With this patch series applied, the above script triggers OOM kills when
      writing through the loop device as expected.
      
      This patch (of 3):
      
      Existing uses of loop device may have multiple cgroups reading/writing to
      the same device.  Simply charging resources for I/O to the backing file
      could result in priority inversion where one cgroup gets synchronously
      blocked, holding up all other I/O to the loop device.
      
      In order to avoid this priority inversion, we use a single workqueue where
      each work item is a "struct loop_worker" which contains a queue of struct
      loop_cmds to issue.  The loop device maintains a tree mapping blk css_id
      -> loop_worker.  This allows each cgroup to independently make forward
      progress issuing I/O to the backing file.
      
      There is also a single queue for I/O associated with the rootcg which can
      be used in cases of extreme memory shortage where we cannot allocate a
      loop_worker.
      
      The locking for the tree and queues is fairly heavy handed - we acquire a
      per-loop-device spinlock any time either is accessed.  The existing
      implementation serializes all I/O through a single thread anyways, so I
      don't believe this is any worse.
      
      [colin.king@canonical.com: fixes]
      
      Link: https://lkml.kernel.org/r/20210610173944.1203706-1-schatzberg.dan@gmail.com
      Link: https://lkml.kernel.org/r/20210610173944.1203706-2-schatzberg.dan@gmail.com
      Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
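
The per-cgroup worker scheme described in the commit above (a tree mapping css_id to loop_worker, plus a root-cgroup queue used when a worker cannot be allocated) can be sketched in userspace C. This is a simplified model under stated assumptions: a plain unbalanced binary search tree stands in for the kernel's tree, queues are reduced to counters, the per-device spinlock is omitted, and the simulate_oom field and function names are hypothetical.

```c
#include <stdlib.h>

/* Simplified stand-in for a per-cgroup worker; a counter replaces the queue. */
struct loop_worker {
    unsigned long css_id;
    int queued;
    struct loop_worker *left, *right;
};

struct loop_dev {
    struct loop_worker *tree;   /* css_id -> worker lookup tree */
    int rootcg_queued;          /* fallback queue for the root cgroup */
    int simulate_oom;           /* hypothetical knob: make allocation fail */
};

/* Find the worker for css_id, creating it if needed; NULL if that fails. */
static struct loop_worker *get_worker(struct loop_dev *lo, unsigned long css_id)
{
    struct loop_worker **link = &lo->tree;
    struct loop_worker *w;

    while (*link) {
        if (css_id == (*link)->css_id)
            return *link;
        link = css_id < (*link)->css_id ? &(*link)->left : &(*link)->right;
    }
    if (lo->simulate_oom)
        return NULL;
    w = calloc(1, sizeof(*w));
    if (!w)
        return NULL;
    w->css_id = css_id;
    *link = w;
    return w;
}

/*
 * Queue one command for css_id. Returns the depth of the queue it landed
 * in and sets *used_root when the root-cgroup fallback queue was used.
 */
int loop_queue_cmd(struct loop_dev *lo, unsigned long css_id, int *used_root)
{
    struct loop_worker *w = get_worker(lo, css_id);

    if (!w) {
        *used_root = 1;
        return ++lo->rootcg_queued;
    }
    *used_root = 0;
    return ++w->queued;
}
```

Because each css_id gets its own queue, a stalled cgroup only backs up its own worker in this model; the root queue exists so that I/O can still be issued when no worker can be allocated.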
    • memcontrol: use flexible-array member · b51478a0
      wenhuizhang authored
      Change deprecated zero-length and one-element arrays into flexible array
      members.  The zero-length and one-element arrays were detected by Lukas's
      CodeChecker.  Zero/one-element arrays cause undefined behaviour if
      sizeof() is used.
      
      Link: https://lkml.kernel.org/r/20210518200910.29912-1-wenhui@gwmail.gwu.edu
      Signed-off-by: wenhuizhang <wenhui@gwmail.gwu.edu>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
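
For readers unfamiliar with the construct named in the commit above, here is a minimal sketch of a C99 flexible array member. The structure sample_ary and the helper sample_ary_alloc() are hypothetical and are not the definitions touched by the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical structure for illustration only. */
struct sample_ary {
    unsigned int size;
    unsigned long entries[];   /* flexible array member: must come last */
};

/* Allocate a header plus room for n zero-initialised entries. */
struct sample_ary *sample_ary_alloc(unsigned int n)
{
    struct sample_ary *t;
    unsigned int i;

    /* sizeof(*t) covers only the header; the entries are added explicitly. */
    t = malloc(sizeof(*t) + (size_t)n * sizeof(t->entries[0]));
    if (!t)
        return NULL;
    t->size = n;
    for (i = 0; i < n; i++)
        t->entries[i] = 0;
    return t;
}
```

With a flexible array member the compiler knows the trailing array has no fixed size, so the allocation size is always spelled out by the caller rather than partly hidden in sizeof of the structure.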
    • mm: vmscan: remove noinline_for_stack · 9ef56b78
      Muchun Song authored
      noinline_for_stack was introduced by commit 66635629 ("vmscan: set
      up pagevec as late as possible in shrink_inactive_list()") to delay
      the allocation of the pagevec as late as possible and so save stack
      memory.  But commit 2bcf8879 ("mm: take pagevecs off reclaim
      stack") replaced pagevecs with lists of pages_to_free, so
      noinline_for_stack is no longer needed.  Remove it and let the
      compiler decide whether to inline.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-9-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock · 271dd6b1
      Muchun Song authored
      The css_set_lock is used to guard the list of inherited objcgs.  So there
      is no need to uncharge kernel memory under css_set_lock.  Just move it out
      of the lock.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-8-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: simplify the logic of objcg pinning memcg · 9838354e
      Muchun Song authored
      The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by the
      css_set_lock.  We do not need to care about objcg->memcg being released in
      the process of obj_cgroup_release().  So there is no need to pin memcg
      before releasing objcg.  Remove that pinning logic to simplify the code.
      
      There are only two places that modify the objcg->memcg.  One is the
      initialization to objcg->memcg in the memcg_online_kmem(), another is
      objcgs reparenting in the memcg_reparent_objcgs().  It is also impossible
      for the two to run in parallel.  So xchg() is unnecessary and it is enough
      to use WRITE_ONCE().
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec · 7467c391
      Muchun Song authored
      lruvec_holds_page_lru_lock() doesn't check anything about locking and is
      used to check whether the page belongs to the lruvec.  So rename it to
      page_matches_lruvec().
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: simplify lruvec_holds_page_lru_lock · f2e4d28d
      Muchun Song authored
      We already have a helper lruvec_memcg() to get the memcg from lruvec, we
      do not need to do it ourselves in the lruvec_holds_page_lru_lock().  So
      use lruvec_memcg() instead.  And if mem_cgroup_disabled() returns false,
      the page_memcg(page) (the LRU pages) cannot be NULL.  So remove the odd
      logic of "memcg = page_memcg(page) ?  : root_mem_cgroup".  And use
      lruvec_pgdat to simplify the code.  We can have a single definition for
      this function that works for !CONFIG_MEMCG, CONFIG_MEMCG +
      mem_cgroup_disabled() and CONFIG_MEMCG.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec · a984226f
      Muchun Song authored
      All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as
      the 2nd parameter to it (except isolate_migratepages_block()).  But for
      isolate_migratepages_block(), the page_pgdat(page) is also equal to the
      local variable of @pgdat.  So mem_cgroup_page_lruvec() does not need the
      pgdat parameter.  Just remove it to simplify the code.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm · 2884b6b7
      Muchun Song authored
      When mm is NULL, we do not need to hold rcu lock and call css_tryget for
      the root memcg.  And we also do not need to check !mm in every loop of
      while.  So bail out early when !mm.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix page charging in page replacement · 8dc87c7d
      Muchun Song authored
      Patch series "memcontrol code cleanup and simplification", v3.
      
      This patch (of 8):
      
      The pages aren't accounted at the root level, so do not charge the page to
      the root memcg in page replacement.  Although we do not display the value
      (mem_cgroup_usage) so there shouldn't be any actual problem, but there is
      a WARN_ON_ONCE in the page_counter_cancel().  Who knows if it will
      trigger?  So it is better to fix it.
      
      Link: https://lkml.kernel.org/r/20210417043538.9793-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20210417043538.9793-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix root_mem_cgroup charging · c5c8b16b
      Muchun Song authored
      The below scenario can cause the page counters of the root_mem_cgroup to
      be out of balance.
      
      CPU0:                                   CPU1:
      
      objcg = get_obj_cgroup_from_current()
      obj_cgroup_charge_pages(objcg)
                                              memcg_reparent_objcgs()
                                                  // reparent to root_mem_cgroup
                                                  WRITE_ONCE(iter->memcg, parent)
          // memcg == root_mem_cgroup
          memcg = get_mem_cgroup_from_objcg(objcg)
          // do not charge to the root_mem_cgroup
          try_charge(memcg)
      
      obj_cgroup_uncharge_pages(objcg)
          memcg = get_mem_cgroup_from_objcg(objcg)
          // uncharge from the root_mem_cgroup
          refill_stock(memcg)
              drain_stock(memcg)
                  page_counter_uncharge(&memcg->memory)
      
      get_obj_cgroup_from_current() never returns a root_mem_cgroup's objcg, so
      we never explicitly charge the root_mem_cgroup.  And it's not going to
      change.  It's all about a race when we got an obj_cgroup pointing at some
      non-root memcg, but before we were able to charge it, the cgroup was gone,
      objcg was reparented to the root and so we're skipping the charging.  Then
      we store the objcg pointer and later use it to uncharge the root_mem_cgroup.
      
      This can cause the page counter to be less than the actual value.
      Although we do not display the value (mem_cgroup_usage) so there shouldn't
      be any actual problem, but there is a WARN_ON_ONCE in the
      page_counter_cancel().  Who knows if it will trigger?  So it is better to
      fix it.
      
      Link: https://lkml.kernel.org/r/20210425075410.19255-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
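
The imbalance described in the commit above can be replayed sequentially in a small userspace model. This is only a sketch of the described interleaving, not kernel code: the types, the model_* functions and replay_race() are hypothetical, a plain long replaces the page counter, and the race is reduced to a fixed ordering in which reparenting happens before the charge.

```c
/* Hypothetical, simplified model of a memcg and an objcg. */
struct memcg { int is_root; long counter; };
struct objcg { struct memcg *memcg; };

/* Charge path as described: charging is skipped for the root memcg. */
void model_charge(struct objcg *o, long pages)
{
    if (o->memcg->is_root)
        return;
    o->memcg->counter += pages;
}

/* Uncharge path as described: the counter is decremented unconditionally. */
void model_uncharge(struct objcg *o, long pages)
{
    o->memcg->counter -= pages;
}

/* Reparenting redirects the objcg to its parent memcg. */
void model_reparent(struct objcg *o, struct memcg *parent)
{
    o->memcg = parent;
}

/* Replay the interleaving and return the root counter afterwards. */
long replay_race(void)
{
    struct memcg root = { 1, 0 };
    struct memcg child = { 0, 0 };
    struct objcg o = { &child };

    model_reparent(&o, &root);  /* CPU1: objcg now points at the root */
    model_charge(&o, 1);        /* CPU0: charge skipped, memcg is root */
    model_uncharge(&o, 1);      /* later: root is uncharged anyway */
    return root.counter;
}
```

In this model the root counter ends up one page below where it started, which is the "less than the actual value" condition the message warns about.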
    • mm: memcg/slab: disable cache merging for KMALLOC_NORMAL caches · 13e680fb
      Waiman Long authored
      The KMALLOC_NORMAL (kmalloc-<n>) caches are for unaccounted objects only
      when CONFIG_MEMCG_KMEM is enabled.  To make sure that this condition
      remains true, we will have to prevent KMALLOC_NORMAL caches from merging
      with other kmem caches.  This is now done by setting its refcount to -1 right
      after its creation.
      
      Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: create a new set of kmalloc-cg-<n> caches · 494c1dfe
      Waiman Long authored
      There are currently two problems in the way the objcg pointer array
      (memcg_data) in the page structure is being allocated and freed.
      
      On its allocation, it is possible that the allocated objcg pointer
      array comes from the same slab that requires memory accounting. If this
      happens, the slab will never become empty again as there is at least
      one object left (the obj_cgroup array) in the slab.
      
      When it is freed, the objcg pointer array object may be the last one
      in its slab and hence causes kfree() to be called again. With the
      right workload, the slab cache may be set up in a way that allows the
      recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.
      
      One way to solve this problem is to split the kmalloc-<n> caches
      (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n>
      (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of
      kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All
      the other caches can still allow a mix of accounted and unaccounted
      objects.
      
      With this change, all the objcg pointer array objects will come from
      KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So
      both the recursive kfree() problem and non-freeable slab problem are
      gone.
      
      Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have
      mixed accounted and unaccounted objects, this will slightly reduce the
      number of objcg pointer arrays that need to be allocated and save a bit
      of memory. On the other hand, creating a new set of kmalloc caches does
      have the effect of reducing cache utilization. So it is probably a wash.
      
      The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and
      KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches()
      will include the newly added caches without change.
      
      [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem]
        Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      [akpm@linux-foundation.org: un-fat-finger v5 delta creation]
      [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches]
        Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com
      
      Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      [longman@redhat.com: fix for CONFIG_ZONE_DMA=n]
      Suggested-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      494c1dfe
    • Waiman Long's avatar
      mm: memcg/slab: properly set up gfp flags for objcg pointer array · 41eb5df1
      Waiman Long authored
      Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4.
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure stores a pointer to objcg pointer array for slab pages.  When
      the slab has no used objects, it can be freed in free_slab() which will
      call kfree() to free the objcg pointer array allocated in
      memcg_alloc_page_obj_cgroups().  If it happens that the objcg pointer
      array is the last used object in its slab, that slab may then be freed,
      which may cause kfree() to be called again.
      
      With the right workload, the slab cache may be set up in a way that allows
      the recursive kfree() calling loop to nest deep enough to cause a kernel
      stack overflow and panic the system.  In fact, we have a reproducer that
      can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256
      and kmalloc-rcl-128 slabs with the following kfree() loop recursively
      called 74 times:
      
        [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560
        [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228
        [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0
        [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560
          :

      While investigating this issue, I also found an issue on the allocation
      side.  If the objcg pointer array happens to come from the same slab, or
      a circular dependency linkage is formed with multiple slabs, those
      affected slabs can never be freed again.
      
      This patch series addresses these two issues by introducing a new set of
      kmalloc-cg-<n> caches split from kmalloc-<n> caches.  The new set will
      only contain non-reclaimable and non-dma objects that are accounted in
      memory cgroups whereas the old set are now for unaccounted objects only.
      By making this split, all the objcg pointer arrays will come from the
      kmalloc-<n> caches, but those caches will never hold any objcg pointer
      array.  As a result, deeply nested kfree() call and the unfreeable slab
      problems are now gone.
      
      This patch (of 4):
      
      Since the merging of the new slab memory controller in v5.9, the page
      structure may store a pointer to obj_cgroup pointer array for slab pages.
      Currently, only the __GFP_ACCOUNT bit is masked off.  However, the array
      is not readily reclaimable and doesn't need to come from the DMA buffer.
      So those GFP bits should be masked off as well.
      
      Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure
      that it is consistently applied no matter where it is called.
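
      A minimal sketch of the flag clearing described above, written as a
      userspace model. The bit values and the names GFP_*_BIT,
      OBJCGS_CLEAR_MASK_MODEL and objcg_array_gfp() are hypothetical and
      chosen only for illustration; they are not the kernel's definitions.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Hypothetical bit values, chosen only for this sketch. */
#define GFP_DMA_BIT		0x01u
#define GFP_RECLAIMABLE_BIT	0x10u
#define GFP_ACCOUNT_BIT		0x400000u
#define GFP_KERNEL_MODEL	0xcc0u	/* arbitrary "other" bits */

/* Bits that make no sense for the objcg pointer array itself. */
#define OBJCGS_CLEAR_MASK_MODEL \
	(GFP_DMA_BIT | GFP_RECLAIMABLE_BIT | GFP_ACCOUNT_BIT)

/*
 * Derive the gfp flags for allocating the objcg pointer array from the
 * flags of the slab allocation that triggered it. Doing the masking in
 * one helper keeps it consistent across all callers.
 */
static gfp_t objcg_array_gfp(gfp_t gfp)
{
	return gfp & ~OBJCGS_CLEAR_MASK_MODEL;
}
```

      All three bits are dropped together while every other bit of the
      caller's flags passes through unchanged.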
      
      Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com
      Fixes: 286e04b8 ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41eb5df1
    • Waiman Long's avatar
      mm/memcg: optimize user context object stock access · 55927114
      Waiman Long authored
      Most kmem_cache_alloc() calls are from user context.  With instrumentation
      enabled, the measured amount of kmem_cache_alloc() calls from non-task
      context was about 0.01% of the total.
      
      The irq disable/enable sequence used in this case to access content from
      object stock is slow.  To optimize for user context access, there are now
      two sets of object stocks (in the new obj_stock structure) for task
      context and interrupt context access respectively.
      
      The task context object stock can be accessed after disabling preemption,
      which is cheap in a non-preempt kernel.  The interrupt context object
      stock can only be accessed after disabling interrupts.  User context code
      can access the interrupt object stock, but not vice versa.
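
      The two-stock arrangement can be sketched as follows. This is a
      userspace model under stated assumptions: the struct layout, the guard
      enum and pick_obj_stock() are hypothetical names, and the in_task
      argument stands in for whatever context test the kernel code uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one object stock. */
struct obj_stock_model {
	unsigned int nr_bytes;
};

/* Per-CPU stock with separate task and interrupt context halves. */
struct stock_pcp_model {
	struct obj_stock_model task_obj;
	struct obj_stock_model irq_obj;
};

/* Which protection the caller must hold while touching the stock. */
enum stock_guard {
	GUARD_PREEMPT_OFF,	/* cheap path, task context only */
	GUARD_IRQ_OFF		/* slower path, safe from any context */
};

/*
 * Pick the stock for the current context and report the guard that
 * has to be taken before using it.
 */
static struct obj_stock_model *pick_obj_stock(struct stock_pcp_model *pcp,
					      bool in_task,
					      enum stock_guard *guard)
{
	if (in_task) {
		*guard = GUARD_PREEMPT_OFF;
		return &pcp->task_obj;
	}
	*guard = GUARD_IRQ_OFF;
	return &pcp->irq_obj;
}
```

      Since almost all allocations come from task context, nearly every call
      takes the cheap branch.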
      
      The downside of this change is that more data is stored in local
      object stocks and not reflected in the charge counter and the vmstat
      arrays.  However, this is a small price to pay for better performance.
      
      [longman@redhat.com: fix potential uninitialized variable warning]
        Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55927114
    • Waiman Long's avatar
      mm/memcg: improve refill_obj_stock() performance · 5387c904
      Waiman Long authored
      There are two issues with the current refill_obj_stock() code.  First of
      all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
      atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and
      do an obj_cgroup_put().  It is likely that the same obj_cgroup will be
      used again, which leads to another call to drain_obj_stock() and
      obj_cgroup_get() as well as an atomic retrieval of the available bytes
      from obj_cgroup.  That is costly.  Instead, we should just uncharge the excess
      pages, reduce the stock bytes and be done with it.  The drain_obj_stock()
      function should only be called when obj_cgroup changes.
      
      Secondly, when charging an object of size not less than a page in
      obj_cgroup_charge(), it is possible that the remaining bytes to be
      refilled to the stock will overflow a page and cause refill_obj_stock() to
      uncharge 1 page.  To avoid the additional uncharge in this case, a new
      allow_uncharge flag is added to refill_obj_stock() which will be set to
      false when called from obj_cgroup_charge() so that an uncharge_pages()
      call won't be issued right after a charge_pages() call unless the objcg
      changes.
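
      The two changes can be sketched with a single-threaded userspace model.
      It is only an illustration under stated assumptions: the struct fields,
      the 4k page size and the names refill_stock()/drain_stock() are
      hypothetical, and locking and atomics are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(1u << MODEL_PAGE_SHIFT)

/* Hypothetical stand-ins for obj_cgroup and the per-CPU stock. */
struct objcg_model {
	unsigned int nr_charged_bytes;	/* sub-page remainder kept on objcg */
	unsigned int uncharged_pages;	/* pages handed back so far */
};

struct stock_model {
	struct objcg_model *cached;
	unsigned int nr_bytes;
	unsigned int drains;		/* how often a real drain happened */
};

/* Flush the stock back to its objcg; only needed when the objcg changes. */
static void drain_stock(struct stock_model *s)
{
	if (s->cached) {
		s->cached->uncharged_pages += s->nr_bytes >> MODEL_PAGE_SHIFT;
		s->cached->nr_charged_bytes += s->nr_bytes & (MODEL_PAGE_SIZE - 1);
		s->drains++;
	}
	s->cached = NULL;
	s->nr_bytes = 0;
}

static void refill_stock(struct stock_model *s, struct objcg_model *objcg,
			 unsigned int nr_bytes, bool allow_uncharge)
{
	if (s->cached != objcg) {
		drain_stock(s);
		s->cached = objcg;
		s->nr_bytes = objcg->nr_charged_bytes;
		objcg->nr_charged_bytes = 0;
		allow_uncharge = true;	/* always allowed when objcg changes */
	}
	s->nr_bytes += nr_bytes;

	/* Uncharge only the excess whole pages; keep the stock cached. */
	if (allow_uncharge && s->nr_bytes > MODEL_PAGE_SIZE) {
		objcg->uncharged_pages += s->nr_bytes >> MODEL_PAGE_SHIFT;
		s->nr_bytes &= MODEL_PAGE_SIZE - 1;
	}
}
```

      In this model, repeated refills for the same objcg never drain the
      stock, and a refill with allow_uncharge set to false leaves an
      over-page remainder in place instead of uncharging it right away.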
      
      A multithreaded kmalloc+kfree microbenchmark was run on a 2-socket
      48-core 96-thread x86-64 system with 96 testing threads.  Before this
      patch, the total rate of kmalloc+kfree operations on a 4k object across
      all the testing threads was 4,304 kops/s (cgroup v1) and 8,478 kops/s
      (cgroup v2).  After applying this patch, the rates were 4,731 kops/s
      (cgroup v1) and 418,142 kops/s (cgroup v2) respectively.  This
      represents a performance improvement of 1.10X (cgroup v1) and 49.3X
      (cgroup v2).
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5387c904
    • Waiman Long's avatar
      mm/memcg: cache vmstat data in percpu memcg_stock_pcp · 68ac5b3c
      Waiman Long authored
      Before the new slab memory controller with per object byte charging,
      charging and vmstat data update happen only when new slab pages are
      allocated or freed.  Now they are done with every kmem_cache_alloc() and
      kmem_cache_free().  This causes additional overhead for workloads that
      generate a lot of alloc and free calls.
      
      The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup
      to reduce that overhead.  To further reduce it, this patch caches the
      vmstat data in the memcg_stock_pcp structure as well, until it
      accumulates a page size worth of updates or other cached data changes.
      Caching the vmstat data in the per-cpu stock replaces two writes to
      non-hot cachelines (for memcg specific as well as memcg-lruvecs specific
      vmstat data) with a single write to a hot local stock cacheline.
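
      The caching idea can be sketched in a few lines of userspace C. This is
      a simplified model, not the patch itself: the struct, the threshold
      macro and mod_state_cached() are hypothetical names, a single counter
      stands in for the per-memcg and per-lruvec statistics, and the flush on
      "other cached data change" is omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define VMSTAT_FLUSH_THRESHOLD	4096	/* one 4k page, assumed */

/* Hypothetical model of the vmstat part of the per-CPU stock. */
struct vmstat_stock_model {
	int cached_bytes;	/* pending delta, hot local cacheline */
	long global_bytes;	/* stands in for the shared counters */
	unsigned int flushes;	/* number of writes to shared counters */
};

/*
 * Accumulate a byte delta locally and push it to the shared counter
 * only once more than a page worth of change has built up.
 */
static void mod_state_cached(struct vmstat_stock_model *s, int nr)
{
	s->cached_bytes += nr;
	if (abs(s->cached_bytes) > VMSTAT_FLUSH_THRESHOLD) {
		s->global_bytes += s->cached_bytes;
		s->cached_bytes = 0;
		s->flushes++;
	}
}
```

      With 64-byte updates, for example, the shared counter in this model is
      touched once per 65 calls rather than on every call.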
      
      On a 2-socket Cascade Lake server with instrumentation enabled and this
      patch applied, it was found that about 20% (634400 out of 3243830) of the
      time when mod_objcg_state() is called leads to an actual call to
      __mod_objcg_state() after initial boot.  When doing parallel kernel build,
      the figure was about 17% (24329265 out of 142512465).  So caching the
      vmstat data reduces the number of calls to __mod_objcg_state() by more
      than 80%.
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68ac5b3c
    • Waiman Long's avatar
      mm/memcg: move mod_objcg_state() to memcontrol.c · fdbcb2a6
      Waiman Long authored
      Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6.
      
      With the recent introduction of the new slab memory controller, we
      eliminate the need for having separate kmemcaches for each memory cgroup
      and reduce overall kernel memory usage.  However, we also add additional
      memory accounting overhead to each call of kmem_cache_alloc() and
      kmem_cache_free().
      
      Workloads that require a lot of kmemcache allocations and
      de-allocations may experience performance regressions, as illustrated
      in [1] and [2].
      
      A simple kernel module that performs a repeated loop of 100,000,000
      kmem_cache_alloc() and kmem_cache_free() calls on either a small 32-byte
      object or a big 4k object at module init time, with a batch size of 4
      (4 kmalloc's followed by 4 kfree's), is used for benchmarking.  The
      benchmarking tool was run on a kernel based on linux-next-20210419.  The
      test was run on a CascadeLake server with turbo-boosting disabled to
      reduce run-to-run variation.
      
      The small object test exercises mainly the object stock charging and
      vmstat update code paths.  The large object test also exercises the
      refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code
      paths.
      
      With memory accounting disabled, the run time was 3.130s for both the
      small object and big object tests.
      
      With memory accounting enabled, both cgroup v1 and v2 showed similar
      results in the small object test.  The performance results of the large
      object test, however, differed between cgroup v1 and v2.
      
      The execution times with the application of various patches in the
      patchset were:
      
        Applied patches   Run time   Accounting overhead   %age 1   %age 2
        ---------------   --------   -------------------   ------   ------
      
        Small 32-byte object:
             None          11.634s         8.504s          100.0%   271.7%
              1-2           9.425s         6.295s           74.0%   201.1%
              1-3           9.708s         6.578s           77.4%   210.2%
              1-4           8.062s         4.932s           58.0%   157.6%
      
        Large 4k object (v2):
             None          22.107s        18.977s          100.0%   606.3%
              1-2          20.960s        17.830s           94.0%   569.6%
              1-3          14.238s        11.108s           58.5%   354.9%
              1-4          11.329s         8.199s           43.2%   261.9%
      
        Large 4k object (v1):
             None          36.807s        33.677s          100.0%  1075.9%
              1-2          36.648s        33.518s           99.5%  1070.9%
              1-3          22.345s        19.215s           57.1%   613.9%
              1-4          18.662s        15.532s           46.1%   496.2%
      
        N.B. %age 1 = overhead/unpatched overhead
             %age 2 = overhead/accounting disabled time
      
      Patch 2 (vmstat data stock caching) helps in both the small object test
      and the large v2 object test. It doesn't help much in v1 big object test.
      
      Patch 3 (refill_obj_stock improvement) does not help the small object
      test but offers a significant performance improvement for the large
      object test (both v1 and v2).
      
      Patch 4 (eliminating irq disable/enable) helps in all test cases.
      
      To test for the extreme case, a multi-threaded kmalloc/kfree
      microbenchmark was run on the 2-socket 48-core 96-thread system with
      96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
      with accounting enabled for 10s. The total number of kmalloc+kfree done
      in kilo operations per second (kops/s) were as follows:
      
        Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
        ---------------   ---------   ---------   ---------   ---------
             None           3,520        1.00X      6,242        1.00X
              1-2           4,304        1.22X      8,478        1.36X
              1-3           4,731        1.34X    418,142       66.99X
              1-4           4,587        1.30X    438,838       70.30X
      
      With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
      kops/s. This test shows how significant the memory accounting overhead
      can be in some extreme situations.
      
      For this multithreaded test, the improvement from patch 2 mainly
      comes from the conditional atomic xchg of objcg->nr_charged_bytes in
      mod_objcg_state(). By using an unconditional xchg, the operation rates
      were similar to the unpatched kernel.
      
      Patch 3 eliminates the single highly contended cacheline of
      objcg->nr_charged_bytes for cgroup v2 leading to a huge performance
      improvement. Cgroup v1, however, still has another highly contended
      cacheline in the shared page counter &memcg->kmem. So the improvement
      is only modest.
      
      Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
      eliminating the irq_disable/irq_enable overhead seems to aggravate the
      cacheline contention.
      
      [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
      [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/
      
      This patch (of 4):
      
      mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that
      further optimization can be done to it in later patches without exposing
      unnecessary details to other mm components.
      
      Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com
      Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdbcb2a6
    • Huang Ying's avatar
      swap: check mapping_empty() for swap cache before being freed · eea4a501
      Huang Ying authored
      Check whether all pages and shadow entries in the swap cache have been
      removed before the swap cache is freed.
      
      Link: https://lkml.kernel.org/r/20210608005121.511140-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eea4a501
    • Huang Ying's avatar
      mm: free idle swap cache page after COW · f4c4a3f4
      Huang Ying authored
      With commit 09854ba9 ("mm: do_wp_page() simplification"), after COW,
      the idle swap cache page (neither the page nor the corresponding swap
      entry is mapped by any process) will be left in the LRU list, even if it's
      in the active list or the head of the inactive list.  So, the page
      reclaimer may take quite some overhead to reclaim these actually unused
      pages.
      
      To help page reclaim, this patch tries to free the idle swap cache page
      after COW.  To avoid introducing much overhead to the hot COW code
      path,
      
      a) there's almost zero overhead for the non-swap case, because
         PageSwapCache() is checked first.
      
      b) the page lock is acquired via trylock only.
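
      The two precautions above can be sketched with a userspace model. This
      is an illustration only: struct page_model and the *_model helpers are
      hypothetical, a C11 atomic_flag stands in for the page lock, and the
      actual freeing is reduced to flipping two flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page state relevant here. */
struct page_model {
	bool in_swap_cache;
	bool mapped;
	atomic_flag locked;	/* models the page lock */
	bool freed_from_swap_cache;
};

static void page_model_init(struct page_model *p, bool in_swap_cache,
			    bool mapped)
{
	p->in_swap_cache = in_swap_cache;
	p->mapped = mapped;
	p->freed_from_swap_cache = false;
	atomic_flag_clear(&p->locked);
}

/* Returns true if the lock was taken; never waits. */
static bool trylock_page_model(struct page_model *p)
{
	return !atomic_flag_test_and_set(&p->locked);
}

static void unlock_page_model(struct page_model *p)
{
	atomic_flag_clear(&p->locked);
}

/* Try to drop an idle swap cache page; give up rather than block. */
static bool free_idle_swap_cache_model(struct page_model *p)
{
	if (!p->in_swap_cache)		/* a) cheap check first */
		return false;
	if (p->mapped)			/* still in use, not idle */
		return false;
	if (!trylock_page_model(p))	/* b) trylock only */
		return false;
	p->in_swap_cache = false;
	p->freed_from_swap_cache = true;
	unlock_page_model(p);
	return true;
}
```

      A page that is not in the swap cache costs one flag test, and a page
      whose lock is already held is simply skipped.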
      
      To test the patch, we used the pmbench memory accessing benchmark with a
      working set larger than available memory on a 2-socket Intel server with
      an NVMe SSD as the swap device.  Test results show that the pmbench score
      increases by up to 23.8%, with decreased swap cache size and swapin
      throughput.
      
      Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [use free_swap_cache()]
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4c4a3f4
    • Huang Ying's avatar
      mm, swap: remove unnecessary smp_rmb() in swap_type_to_swap_info() · a4b45114
      Huang Ying authored
      Before commit c10d38cc ("mm, swap: bounds check swap_info array
      accesses to avoid NULL derefs"), the typical code to reference the
      swap_info[] is as follows,
      
        type = swp_type(swp_entry);
        if (type >= nr_swapfiles)
                /* handle invalid swp_entry */;
        p = swap_info[type];
        /* access fields of *p.  OOPS! p may be NULL! */
      
      Because the ordering isn't guaranteed, it's possible that swap_info[type]
      is read before "nr_swapfiles".  And that may result in NULL pointer
      dereference.
      
      So after commit c10d38cc, the code becomes,
      
        struct swap_info_struct *swap_type_to_swap_info(int type)
        {
      	  if (type >= READ_ONCE(nr_swapfiles))
      		  return NULL;
      	  smp_rmb();
      	  return READ_ONCE(swap_info[type]);
        }
      
        /* users */
        type = swp_type(swp_entry);
        p = swap_type_to_swap_info(type);
        if (!p)
      	  /* handle invalid swp_entry */;
        /* dereference p */
      
      Here the value of swap_info[type] (that is, "p") is checked to be
      non-NULL before being dereferenced.  So NULL dereferencing becomes
      impossible even if "nr_swapfiles" is read after swap_info[type].
      Therefore, the "smp_rmb()" becomes unnecessary.
      
      And we don't even need to read "nr_swapfiles" here, because the non-NULL
      check on "p" is sufficient.  We just need to make sure we will not
      access beyond the bounds of the array.  With the change, nr_swapfiles
      will only be accessed with swap_lock held, except in
      swapcache_free_entries(), where the absolute correctness of the value
      isn't needed, as described in the comments.
      
      We still need to guarantee swap_info[type] is read before being
      dereferenced.  That can be satisfied via the data dependency ordering
      enforced by READ_ONCE(swap_info[type]).  This needs to be paired with
      proper write barriers.  So smp_store_release() is used in
      alloc_swap_info() to guarantee that the fields of *swap_info[type] are
      initialized before swap_info[type] itself is written.  Note that the
      fields of *swap_info[type] are first initialized to 0 via kvzalloc().
      The assignment and dereferencing of swap_info[type] is like
      rcu_assign_pointer() and rcu_dereference().
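
      The publish/read pairing can be sketched in userspace C11. This is a
      model, not the kernel code: the *_model names and the array size are
      hypothetical, calloc() stands in for kvzalloc(), a release store stands
      in for smp_store_release(), and an acquire load is used as a stronger
      substitute for the dependency ordering that READ_ONCE() relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_SWAPFILES_MODEL 4

/* Hypothetical, heavily reduced stand-in for swap_info_struct. */
struct swap_info_model {
	int prio;
	unsigned int max;
};

static _Atomic(struct swap_info_model *) swap_info_slots[MAX_SWAPFILES_MODEL];

/* Writer: initialize all fields first, then publish the pointer. */
static struct swap_info_model *alloc_swap_info_model(int type, int prio,
						     unsigned int max)
{
	struct swap_info_model *p;

	if (type < 0 || type >= MAX_SWAPFILES_MODEL)
		return NULL;
	p = calloc(1, sizeof(*p));	/* zeroed, like kvzalloc() */
	if (!p)
		return NULL;
	p->prio = prio;
	p->max = max;
	/* Release: the field writes above are visible before the pointer. */
	atomic_store_explicit(&swap_info_slots[type], p, memory_order_release);
	return p;
}

/* Reader: only a bounds check, then an ordered load; may return NULL. */
static struct swap_info_model *swap_type_to_swap_info_model(int type)
{
	if (type < 0 || type >= MAX_SWAPFILES_MODEL)
		return NULL;
	return atomic_load_explicit(&swap_info_slots[type],
				    memory_order_acquire);
}
```

      A reader therefore sees either NULL or a pointer to a fully initialized
      structure, which is why the NULL check alone is enough in this model.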
      
      Link: https://lkml.kernel.org/r/20210520073301.1676294-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4b45114
    • Miaohe Lin's avatar
      mm/swap_slots.c: delete meaningless forward declarations · 1cfcc830
      Miaohe Lin authored
      deactivate_swap_slots_cache() and reactivate_swap_slots_cache() are only
      called below their implementations.  So these forward declarations are
      meaningless and should be removed.
      
      Link: https://lkml.kernel.org/r/20210520134022.1370406-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cfcc830
    • Miaohe Lin's avatar
      mm/swap: remove unused local variable nr_shadows · eb7709c5
      Miaohe Lin authored
      Since commit 55c653b71e8c ("mm: stop accounting shadow entries"),
      nr_shadows is not used anymore.
      
      Link: https://lkml.kernel.org/r/20210520134022.1370406-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb7709c5
    • Miaohe Lin's avatar
      mm/swapfile: move get_swap_page_of_type() under CONFIG_HIBERNATION · bb243f7d
      Miaohe Lin authored
      Patch series "Cleanups for swap", v2.
      
      This series contains just cleanups to remove some unused variables, delete
      meaningless forward declarations and so on.  More details can be found in
      the respective changelogs.
      
      This patch (of 4):
      
      We should move get_swap_page_of_type() under CONFIG_HIBERNATION since the
      only caller of this function is now the suspend routine.
      
      [linmiaohe@huawei.com: move scan_swap_map() under CONFIG_HIBERNATION]
        Link: https://lkml.kernel.org/r/20210521070855.2015094-1-linmiaohe@huawei.com
      [linmiaohe@huawei.com: fold scan_swap_map() into the only caller get_swap_page_of_type()]
        Link: https://lkml.kernel.org/r/20210527120328.3935132-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210520134022.1370406-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210520134022.1370406-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb243f7d
    • Miaohe Lin's avatar
      mm/shmem: fix shmem_swapin() race with swapoff · 2efa33fc
      Miaohe Lin authored
      When I was investigating the swap code, I found the below possible race
      window:
      
      CPU 1                                         CPU 2
      -----                                         -----
      shmem_swapin
        swap_cluster_readahead
          if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
                                                    swapoff
                                                      ..
                                                      si->swap_file = NULL;
                                                      ..
          struct inode *inode = si->swap_file->f_mapping->host;[oops!]
      
      Close this race window by using get/put_swap_device() to guard against
      concurrent swapoff.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-5-linmiaohe@huawei.com
      Fixes: 8fd2e0b5 ("mm: swap: check if swap backing device is congested or not")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2efa33fc
    • Miaohe Lin's avatar
      mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info() · 5c046235
      Miaohe Lin authored
      non_swap_entry() was introduced for VMA based swap readahead in commit
      ec560175 ("mm, swap: VMA based swap readahead").  At that time, the
      non_swap_entry() check was necessary because the function was called
      before that check was done in do_swap_page().  It was then moved to
      swap_ra_info() by commit eaf649eb ("mm: swap: clean up swap
      readahead").  Since then, the non_swap_entry() check has been
      unnecessary, because swap_ra_info() is called after non_swap_entry() has
      already been checked.  The resulting code is confusing: the
      non_swap_entry() check looks racy, because somebody else might have
      faulted in this pte while we had released the pte lock, which suggests
      we should first check whether it is a swap pte to guard against such a
      race, or swap_type will be unexpected.  But the race isn't important
      because it will not cause a problem: enough checking is done when we
      really operate on the PTE entries later.  So remove the non_swap_entry()
      check here to avoid confusion.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c046235
    • Miaohe Lin's avatar
      swap: fix do_swap_page() race with swapoff · 2799e775
      Miaohe Lin authored
      While investigating the swap code, I found the following possible race
      window:
      
      CPU 1                                   CPU 2
      -----                                   -----
      do_swap_page
        if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
        swap_readpage
          if (data_race(sis->flags & SWP_FS_OPS)) {
                                              swapoff
                                                ..
                                                p->swap_file = NULL;
                                                ..
          struct file *swap_file = sis->swap_file;
          struct address_space *mapping = swap_file->f_mapping; [oops!]
      
      Note that this isn't an issue for pages that are swapped in through the
      swap cache: the page is locked and the swap entry is marked with
      SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
      unlocked.
      
      Fix this race by using get/put_swap_device() to guard against concurrent
      swapoff.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com
      Fixes: 0bcac06f ("mm,swap: skip swapcache for swapin of synchronous device")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2799e775
    • Miaohe Lin's avatar
      mm/swapfile: use percpu_ref to serialize against concurrent swapoff · 63d8620e
      Miaohe Lin authored
      Patch series "close various race windows for swap", v6.
      
      While investigating the swap code, I found some possible race windows.
      This series aims to fix all of them.  However, using the current
      get/put_swap_device() to guard swap_readpage() against concurrent
      swapoff looks terrible, because swap_readpage() may take a really long
      time.  To keep the performance overhead on the hot path as low as
      possible, it appears we can use percpu_ref to close this race window
      (as suggested by Huang, Ying).  Patch 1 adds percpu_ref support for
      swap, and most of the remaining patches use it to close various race
      windows.  More details can be found in the respective changelogs.
      
      This patch (of 4):
      
      Using the current get/put_swap_device() to guard against concurrent
      swapoff for some swap ops, e.g. swap_readpage(), looks terrible because
      they might take a really long time.  This patch adds percpu_ref support
      to serialize against concurrent swapoff (as suggested by Huang, Ying).
      It also removes the SWP_VALID flag, because that flag is only used
      together with the RCU solution.
      
      Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63d8620e
    • Christophe Leroy's avatar
      mm: pagewalk: fix walk for hugepage tables · e17eae2b
      Christophe Leroy authored
      Pagewalk ignores hugepd entries and walks down the tables as if they
      were traditional entries, leading to crazy results.
      
      Add walk_hugepd_range() and use it to walk hugepage tables.
      
      Link: https://lkml.kernel.org/r/38d04410700c8d02f28ba37e020b62c55d6f3d2c.1624597695.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Steven Price <steven.price@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e17eae2b
    • Andrea Arcangeli's avatar
      mm: gup: pack has_pinned in MMF_HAS_PINNED · a458b76a
      Andrea Arcangeli authored
      The 32-bit has_pinned field can be packed into the MMF_HAS_PINNED bit
      as a noop cleanup.
      
      Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
      would reintroduce a loss of SMP scalability to pin-fast, so there's no
      future potential usefulness to keep an atomic in the mm for this.
      
      set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
      (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
      atomic_set after this commit) has to be still issued only once per "mm",
      so the difference between the two will be lost in the noise.
      
      will-it-scale "mmap2" shows no change in performance with enterprise
      config as expected.
      
      will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
      improvement against upstream as expected.
      
      This is a noop as far as overall performance and SMP scalability are
      concerned.
      
      [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
        Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
      [akpm@linux-foundation.org: coding style fixes]
      [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a458b76a
    • Andrea Arcangeli's avatar
      mm: gup: allow FOLL_PIN to scale in SMP · 292648ac
      Andrea Arcangeli authored
      has_pinned cannot be written by each pin-fast or it won't scale in SMP.
      This isn't "false sharing" strictly speaking (it's more like "true
      non-sharing"), but it creates the same SMP scalability bottleneck of
      "false sharing".
      
      To verify the improvement, the test below was run on a 40 cpus host
      with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (the kernel must be
      built with CONFIG_GUP_TEST=y):
      
        $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
      
      This gives (average get time in us over the 40 threads):
      
        Old kernel: 477729.97 (+- 3.79%)
        New kernel:  89144.65 (+-11.76%)
      
      Under similar conditions with 256 cpus, this commit increases the SMP
      scalability of pin_user_pages_fast() executed by different threads of the
      same process by more than 4000%.
      
      [peterx@redhat.com: rewrite commit message, add parentheses against "(A & B)"]
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-3-peterx@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      292648ac
    • Peter Xu's avatar
      mm/gup_benchmark: support threading · f39bd853
      Peter Xu authored
      Patch series "mm/gup: Fix pin page write cache bouncing on has_pinned", v2.
      
      This series contains 3 patches.  The 1st one enables threading for
      gup_benchmark in the kselftest.  The latter two patches are collected from
      Andrea's local branch and fix the write cache bouncing issue with
      pinning fast-gup.
      
      To be explicit on the latter two patches:
      
        - the 2nd patch fixes the performance degradation introduced with
          has_pinned, then
      
        - the last patch replaces has_pinned with a bit in mm->flags
      
      For patch 3: I think we originally planned to turn has_pinned into a
      counter very soon, but that has not happened so far, which suggests we
      can remove it until we really want such a counter for whatever reason.
      As the commit message states, it saves 4 bytes for each mm without
      observable regressions.
      
      Regarding testing: see the commit message of patch 2 for some detailed
      testing with will-it-scale.  I wrote patch 1 so that we can easily
      verify the patchset using the existing kselftest facilities, or even
      regression test it in the future with the repo if we want.
      
      The numbers below are extra verification tests that I did, besides the
      commit message of patch 2, using the new gup_benchmark and 256 cpus.
      The test below was done on a 40 cpus host with Intel(R) Xeon(R) CPU
      E5-2630 v4 @ 2.20GHz, and I get a similar result (of course the write
      cache bouncing gets more severe with even more cores).
      
      After patch 1 is applied (test-only patch, so this is the old kernel):
      
        $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
        PIN_FAST_BENCHMARK: Time: get:459632 put:5990 us
        PIN_FAST_BENCHMARK: Time: get:461967 put:5840 us
        PIN_FAST_BENCHMARK: Time: get:464521 put:6140 us
        PIN_FAST_BENCHMARK: Time: get:465176 put:7100 us
        PIN_FAST_BENCHMARK: Time: get:465960 put:6733 us
        PIN_FAST_BENCHMARK: Time: get:465324 put:6781 us
        PIN_FAST_BENCHMARK: Time: get:466018 put:7130 us
        PIN_FAST_BENCHMARK: Time: get:466362 put:7118 us
        PIN_FAST_BENCHMARK: Time: get:465118 put:6975 us
        PIN_FAST_BENCHMARK: Time: get:466422 put:6602 us
        PIN_FAST_BENCHMARK: Time: get:465791 put:6818 us
        PIN_FAST_BENCHMARK: Time: get:467091 put:6298 us
        PIN_FAST_BENCHMARK: Time: get:467694 put:5432 us
        PIN_FAST_BENCHMARK: Time: get:469575 put:5581 us
        PIN_FAST_BENCHMARK: Time: get:468124 put:6055 us
        PIN_FAST_BENCHMARK: Time: get:468877 put:6720 us
        PIN_FAST_BENCHMARK: Time: get:467212 put:4961 us
        PIN_FAST_BENCHMARK: Time: get:467834 put:6697 us
        PIN_FAST_BENCHMARK: Time: get:470778 put:6398 us
        PIN_FAST_BENCHMARK: Time: get:469788 put:6310 us
        PIN_FAST_BENCHMARK: Time: get:488277 put:7113 us
        PIN_FAST_BENCHMARK: Time: get:486613 put:7085 us
        PIN_FAST_BENCHMARK: Time: get:486940 put:7202 us
        PIN_FAST_BENCHMARK: Time: get:488728 put:7101 us
        PIN_FAST_BENCHMARK: Time: get:487570 put:7327 us
        PIN_FAST_BENCHMARK: Time: get:489260 put:7027 us
        PIN_FAST_BENCHMARK: Time: get:488846 put:6866 us
        PIN_FAST_BENCHMARK: Time: get:488521 put:6745 us
        PIN_FAST_BENCHMARK: Time: get:489950 put:6459 us
        PIN_FAST_BENCHMARK: Time: get:489777 put:6617 us
        PIN_FAST_BENCHMARK: Time: get:488224 put:6591 us
        PIN_FAST_BENCHMARK: Time: get:488644 put:6477 us
        PIN_FAST_BENCHMARK: Time: get:488754 put:6711 us
        PIN_FAST_BENCHMARK: Time: get:488875 put:6743 us
        PIN_FAST_BENCHMARK: Time: get:489290 put:6657 us
        PIN_FAST_BENCHMARK: Time: get:490264 put:6684 us
        PIN_FAST_BENCHMARK: Time: get:489631 put:6737 us
        PIN_FAST_BENCHMARK: Time: get:488434 put:6655 us
        PIN_FAST_BENCHMARK: Time: get:492213 put:6297 us
        PIN_FAST_BENCHMARK: Time: get:491124 put:6173 us
      
      After the whole series is applied (new, fixed kernel):
      
        $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
        PIN_FAST_BENCHMARK: Time: get:82038 put:7041 us
        PIN_FAST_BENCHMARK: Time: get:82144 put:6817 us
        PIN_FAST_BENCHMARK: Time: get:83417 put:6674 us
        PIN_FAST_BENCHMARK: Time: get:82540 put:6594 us
        PIN_FAST_BENCHMARK: Time: get:83214 put:6681 us
        PIN_FAST_BENCHMARK: Time: get:83444 put:6889 us
        PIN_FAST_BENCHMARK: Time: get:83194 put:7499 us
        PIN_FAST_BENCHMARK: Time: get:84876 put:7369 us
        PIN_FAST_BENCHMARK: Time: get:86092 put:10289 us
        PIN_FAST_BENCHMARK: Time: get:86153 put:10415 us
        PIN_FAST_BENCHMARK: Time: get:85026 put:7751 us
        PIN_FAST_BENCHMARK: Time: get:85458 put:7944 us
        PIN_FAST_BENCHMARK: Time: get:85735 put:8154 us
        PIN_FAST_BENCHMARK: Time: get:85851 put:8299 us
        PIN_FAST_BENCHMARK: Time: get:86323 put:9617 us
        PIN_FAST_BENCHMARK: Time: get:86288 put:10496 us
        PIN_FAST_BENCHMARK: Time: get:87697 put:9346 us
        PIN_FAST_BENCHMARK: Time: get:87980 put:8382 us
        PIN_FAST_BENCHMARK: Time: get:88719 put:8400 us
        PIN_FAST_BENCHMARK: Time: get:87616 put:8588 us
        PIN_FAST_BENCHMARK: Time: get:86730 put:9563 us
        PIN_FAST_BENCHMARK: Time: get:88167 put:8673 us
        PIN_FAST_BENCHMARK: Time: get:86844 put:9777 us
        PIN_FAST_BENCHMARK: Time: get:88068 put:11774 us
        PIN_FAST_BENCHMARK: Time: get:86170 put:15676 us
        PIN_FAST_BENCHMARK: Time: get:87967 put:12827 us
        PIN_FAST_BENCHMARK: Time: get:95773 put:7652 us
        PIN_FAST_BENCHMARK: Time: get:87734 put:13650 us
        PIN_FAST_BENCHMARK: Time: get:89833 put:14237 us
        PIN_FAST_BENCHMARK: Time: get:96186 put:8029 us
        PIN_FAST_BENCHMARK: Time: get:95532 put:8886 us
        PIN_FAST_BENCHMARK: Time: get:95351 put:5826 us
        PIN_FAST_BENCHMARK: Time: get:96401 put:8407 us
        PIN_FAST_BENCHMARK: Time: get:96473 put:8287 us
        PIN_FAST_BENCHMARK: Time: get:97177 put:8430 us
        PIN_FAST_BENCHMARK: Time: get:98120 put:5263 us
        PIN_FAST_BENCHMARK: Time: get:96271 put:7757 us
        PIN_FAST_BENCHMARK: Time: get:99628 put:10467 us
        PIN_FAST_BENCHMARK: Time: get:99344 put:10045 us
        PIN_FAST_BENCHMARK: Time: get:94212 put:15485 us
      
      Summary (average get time in us over the 40 threads):
      
        Old kernel: 477729.97 (+-3.79%)
        New kernel:  89144.65 (+-11.76%)
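      
      As a quick reading aid, the two averages above can be turned into a
      ratio.  The rounding is approximate and only the two summary values
      are used:

```latex
\frac{477729.97}{89144.65} \approx 5.36,
\qquad
1 - \frac{89144.65}{477729.97} \approx 0.813
```

      That is, on this 40 cpus host the average get time drops by roughly
      81%, or a factor of about 5.4.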
      
      This patch (of 3):
      
      Add a new parameter "-j N" to support concurrent gup test.
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210507150553.208763-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f39bd853
    • Matthew Wilcox (Oracle)'s avatar
      mm: move page dirtying prototypes from mm.h · 3a6b2162
      Matthew Wilcox (Oracle) authored
      These functions implement the address_space ->set_page_dirty operation and
      should live in pagemap.h, not mm.h so that the rest of the kernel doesn't
      get funny ideas about calling them directly.
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a6b2162
    • Matthew Wilcox (Oracle)'s avatar
      fs: remove noop_set_page_dirty() · b82a96c9
      Matthew Wilcox (Oracle) authored
      Use __set_page_dirty_no_writeback() instead.  This will set the dirty bit
      on the page, which will be used to avoid calling set_page_dirty() in the
      future.  It will have no effect on actually writing the page back, as the
      pages are not on any LRU lists.
      
      [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]
      
      Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b82a96c9