1. 17 Aug, 2018 40 commits
    • Baoquan He's avatar
      mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmap · 9258631b
      Baoquan He authored
      It's used to pass the size of map data unit into
      alloc_usemap_and_memmap, and is preparation for next patch.
      
      Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.comSigned-off-by: default avatarBaoquan He <bhe@redhat.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9258631b
    • Baoquan He's avatar
      mm/sparsemem.c: defer the ms->section_mem_map clearing · 07a34a8c
      Baoquan He authored
      In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, system
      will allocate one continuous memory chunk for mem maps on one node and
      populate the relevant page tables to map memory section one by one.  If
      fail to populate for a certain mem section, print warning and its
      ->section_mem_map will be cleared to cancel the marking of being
      present.  Like this, the number of mem sections marked as present could
      become less during sparse_init() execution.
      
      Here just defer the ms->section_mem_map clearing if failed to populate
      its page tables until the last for_each_present_section_nr() loop.  This
      is in preparation for later optimizing the mem map allocation.
      
      [akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
      Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.comSigned-off-by: default avatarBaoquan He <bhe@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      07a34a8c
    • Baoquan He's avatar
      mm/sparse.c: add a static variable nr_present_sections · f2fc10e0
      Baoquan He authored
      Patch series "mm/sparse: Optimize memmap allocation during
      sparse_init()", v6.
      
      In sparse_init(), two temporary pointer arrays, usemap_map and map_map
      are allocated with the size of NR_MEM_SECTIONS.  They are used to store
      each memory section's usemap and mem map if marked as present.  In
      5-level paging mode, this will cost 512M memory though they will be
      released at the end of sparse_init().  System with few memory, like
      kdump kernel which usually only has about 256M, will fail to boot
      because of allocation failure if CONFIG_X86_5LEVEL=y.
      
      In this patchset, optimize the memmap allocation code to only use
      usemap_map and map_map with the size of nr_present_sections.  This makes
      kdump kernel boot up with normal crashkernel='' setting when
      CONFIG_X86_5LEVEL=y.
      
      This patch (of 5):
      
      nr_present_sections is used to record how many memory sections are
      marked as present during system boot up, and will be used in the later
      patch.
      
      Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.comSigned-off-by: default avatarBaoquan He <bhe@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f2fc10e0
    • Kirill Tkhai's avatar
      mm: use special value SHRINKER_REGISTERING instead of list_empty() check · 7e010df5
      Kirill Tkhai authored
      The patch introduces a special value SHRINKER_REGISTERING to use instead
      of list_empty() to differ a registering shrinker from unregistered
      shrinker.  Why we need that at all?
      
      Shrinker registration is split in two parts.  The first one is
      prealloc_shrinker(), which allocates shrinker memory and reserves ID in
      shrinker_idr.  This function can fail.  The second is
      register_shrinker_prepared(), and it finalizes the registration.  This
      function actually makes shrinker available to be used from
      shrink_slab(), and it can't fail.
      
      One shrinker may be based on more then one LRU lists.  So, we never
      clear the bit in memcg shrinker maps, when (one of) corresponding LRU
      list becomes empty, since other LRU lists may be not empty.  See
      superblock shrinker for example: it is based on two LRU lists:
      s_inode_lru and s_dentry_lru.  We do not want to clear shrinker bit,
      when there are no inodes in s_inode_lru, as s_dentry_lru may contain
      dentries.
      
      Instead of that, we use special algorithm to detect shrinkers having no
      elements at all its LRU lists, and this is made in shrink_slab_memcg().
      See the comment in this function for the details.
      
      Also, in shrink_slab_memcg() we clear shrinker bit in the map, when we
      meet unregistered shrinker (bit is set, while there is no a shrinker in
      IDR).  Otherwise, we would have done that at the moment of shrinker
      unregistration for all memcgs (and this looks worse, since iteration
      over all memcg may take much time).  Also this would have imposed
      restrictions on shrinker unregistration order for its users: they would
      have had to guarantee, there are no new elements after
      unregister_shrinker() (otherwise, a new added element would have set a
      bit).
      
      So, if we meet a set bit in map and no shrinker in IDR when we're
      iterating over the map in shrink_slab_memcg(), this means the
      corresponding shrinker is unregistered, and we must clear the bit.
      
      Another case is shrinker registration.  We want two things there:
      
      1) do_shrink_slab() can be called only for completely registered
         shrinkers;
      
      2) shrinker internal lists may be populated in any order with
         register_shrinker_prepared() (let's talk on the example with sb).  Both
         of:
      
        a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
          memcg_set_shrinker_bit();                               [cpu0]
          ...
          register_shrinker_prepared();                           [cpu1]
      
        and
      
        b)register_shrinker_prepared();                           [cpu0]
          ...
          list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
          memcg_set_shrinker_bit();                               [cpu1]
      
         are legitimate.  We don't want to impose restriction here and to
         force people to use only (b) variant.  We don't want to force people to
         care, there is no elements in LRU lists before the shrinker is
         completely registered.  Internal users of LRU lists and shrinker code
         are two different subsystems, and they have to be closed in themselves
         each other.
      
      In (a) case we have the bit set before shrinker is completely
      registered.  We don't want do_shrink_slab() is called at this moment, so
      we have to detect such the registering shrinkers.
      
      Before this patch list_empty() (shrinker is not linked to the list)
      check was used for that.  So, in (a) there could be a bit set, but we
      don't call do_shrink_slab() unless shrinker is linked to the list.  It's
      just an indicator, I just overloaded linking to the list.
      
      This was not the best solution, since it's better not to touch the
      shrinker memory from shrink_slab_memcg() before it's completely
      registered (this also will be useful in the future to make shrink_slab()
      completely lockless).
      
      So, this patch introduces better way to detect registering shrinker,
      which allows not to dereference shrinker memory.  It's just a ~0UL
      value, which we insert into the IDR during ID allocation.  After
      shrinker is ready to be used, we insert actual shrinker pointer in the
      IDR, and it becomes available to shrink_slab_memcg().
      
      We can't use NULL instead of this new value for this purpose as:
      shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
      and we don't want the function sees NULL and clears the bit, otherwise
      (a) won't work.
      
      This is the only thing the patch makes: the better way to detect
      registering shrinker.  Nothing else this patch makes.
      
      Also this gives a better assembler, but it's minor side of the patch:
      
      Before:
        callq  <idr_find>
        mov    %rax,%r15
        test   %rax,%rax
        je     <shrink_slab_memcg+0x1d5>
        mov    0x20(%rax),%rax
        lea    0x20(%r15),%rdx
        cmp    %rax,%rdx
        je     <shrink_slab_memcg+0xbd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  <do_shrink_slab>
      
      After:
        callq  <idr_find>
        mov    %rax,%r15
        lea    -0x1(%rax),%rax
        cmp    $0xfffffffffffffffd,%rax
        ja     <shrink_slab_memcg+0x1cd>
        mov    0x8(%rsp),%edx
        mov    %r15,%rsi
        lea    0x10(%rsp),%rdi
        callq  ffffffff810cefd0 <do_shrink_slab>
      
      [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
        Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
      Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e010df5
    • Kirill Tkhai's avatar
      mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab() · ac7fb3ad
      Kirill Tkhai authored
      In case of shrink_slab_memcg() we do not zero nid, when shrinker is not
      numa-aware.  This is not a real problem, since currently all memcg-aware
      shrinkers are numa-aware too (we have two: super_block shrinker and
      workingset shrinker), but something may change in the future.
      
      Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac7fb3ad
    • Kirill Tkhai's avatar
      mm/vmscan.c: clear shrinker bit if there are no objects related to memcg · f90280d6
      Kirill Tkhai authored
      To avoid further unneed calls of do_shrink_slab() for shrinkers, which
      already do not have any charged objects in a memcg, their bits have to
      be cleared.
      
      This patch introduces a lockless mechanism to do that without races
      without parallel list lru add.  After do_shrink_slab() returns
      SHRINK_EMPTY the first time, we clear the bit and call it once again.
      Then we restore the bit, if the new return value is different.
      
      Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers
      two situations:
      
      1)list_lru_add()     shrink_slab_memcg
          list_add_tail()    for_each_set_bit() <--- read bit
                               do_shrink_slab() <--- missed list update (no barrier)
          <MB>                 <MB>
          set_bit()            do_shrink_slab() <--- seen list update
      
      This situation, when the first do_shrink_slab() sees set bit, but it
      doesn't see list update (i.e., race with the first element queueing), is
      rare.  So we don't add <MB> before the first call of do_shrink_slab()
      instead of this to do not slow down generic case.  Also, it's need the
      second call as seen in below in (2).
      
      2)list_lru_add()      shrink_slab_memcg()
          list_add_tail()     ...
          set_bit()           ...
        ...                   for_each_set_bit()
        do_shrink_slab()        do_shrink_slab()
          clear_bit()           ...
        ...                     ...
        list_lru_add()          ...
          list_add_tail()       clear_bit()
          <MB>                  <MB>
          set_bit()             do_shrink_slab()
      
      The barriers guarantee that the second do_shrink_slab() in the right
      side task sees list update if really cleared the bit.  This case is
      drawn in the code comment.
      
      [Results/performance of the patchset]
      
      After the whole patchset applied the below test shows signify increase
      of performance:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
            $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
      			    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
      			    mkdir -p s/$i; mount -t tmpfs $i s/$i;
      			    touch s/$i/file; done
      
      Then, 5 sequential calls of drop caches:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
      1)Before:
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      2)After
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show the performance increases at least in 548 times.
      
      Shakeel Butt tested this patchset with fork-bomb on his configuration:
      
       > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
       > file containing few KiBs on corresponding mount. Then in a separate
       > memcg of 200 MiB limit ran a fork-bomb.
       >
       > I ran the "perf record -ag -- sleep 60" and below are the results:
       >
       > Without the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
       > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
       > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
       > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
       > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
       > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       >
       > With the patch series:
       > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
       > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
       > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
       > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
       > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
       > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
       > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
       > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
       > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
       > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
       > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
       > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f90280d6
    • Kirill Tkhai's avatar
      mm: add SHRINK_EMPTY shrinker methods return value · 9b996468
      Kirill Tkhai authored
      We need to distinguish the situations when shrinker has very small
      amount of objects (see vfs_pressure_ratio() called from
      super_cache_count()), and when it has no objects at all.  Currently, in
      the both of these cases, shrinker::count_objects() returns 0.
      
      The patch introduces new SHRINK_EMPTY return value, which will be used
      for "no objects at all" case.  It's is a refactoring mostly, as
      SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
      patch, and all the magic will happen in further.
      
      Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b996468
    • Vladimir Davydov's avatar
      mm/vmscan.c: generalize shrink_slab() calls in shrink_node() · aeed1d32
      Vladimir Davydov authored
      The patch makes shrink_slab() be called for root_mem_cgroup in the same
      way as it's called for the rest of cgroups.  This simplifies the logic
      and improves the readability.
      
      [ktkhai@virtuozzo.com: wrote changelog]
      Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomainSigned-off-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aeed1d32
    • Kirill Tkhai's avatar
      mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab() · b0dedc49
      Kirill Tkhai authored
      Using the preparations made in previous patches, in case of memcg
      shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
      bitmap.  To do that, we separate iterations over memcg-aware and
      !memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
      for_each_set_bit() from the bitmap.  In case of big nodes, having many
      isolated environments, this gives significant performance growth.  See
      next patches for the details.
      
      Note that the patch does not respect to empty memcg shrinkers, since we
      never clear the bitmap bits after we set it once.  Their shrinkers will
      be called again, with no shrinked objects as result.  This functionality
      is provided by next patches.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b0dedc49
    • Kirill Tkhai's avatar
      mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance · fae91d6d
      Kirill Tkhai authored
      Introduce set_shrinker_bit() function to set shrinker-related bit in
      memcg shrinker bitmap, and set the bit after the first item is added and
      in case of reparenting destroyed memcg's items.
      
      This will allow next patch to make shrinkers be called only, in case of
      they have charged objects at the moment, and to improve shrink_slab()
      performance.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fae91d6d
    • Kirill Tkhai's avatar
      mm/memcontrol.c: export mem_cgroup_is_root() · dfd2f10c
      Kirill Tkhai authored
      This will be used in next patch.
      
      Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dfd2f10c
    • Kirill Tkhai's avatar
      mm/list_lru.c: pass lru argument to memcg_drain_list_lru_node() · 3b82c4dc
      Kirill Tkhai authored
      This is just refactoring to allow next patches to have lru pointer in
      memcg_drain_list_lru_node().
      
      Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b82c4dc
    • Kirill Tkhai's avatar
      mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node() · 9bec5c35
      Kirill Tkhai authored
      This is just refactoring to allow the next patches to have dst_memcg
      pointer in memcg_drain_list_lru_node().
      
      Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9bec5c35
    • Kirill Tkhai's avatar
      mm/list_lru.c: add memcg argument to list_lru_from_kmem() · 44bd4a47
      Kirill Tkhai authored
      This is just refactoring to allow the next patches to have memcg pointer
      in list_lru_from_kmem().
      
      Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      44bd4a47
    • Kirill Tkhai's avatar
      fs: propagate shrinker::id to list_lru · c92e8e10
      Kirill Tkhai authored
      Add list_lru::shrinker_id field and populate it by registered shrinker
      id.
      
      This will be used to set correct bit in memcg shrinkers map by lru code
      in next patches, after there appeared the first related to memcg element
      in list_lru.
      
      Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c92e8e10
    • Kirill Tkhai's avatar
      fs/super.c: refactor alloc_super() · 2b3648a6
      Kirill Tkhai authored
      Do two list_lru_init_memcg() calls after prealloc_super().
      destroy_unused_super() in fail path is OK with this.  Next patch needs
      such the order.
      
      Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b3648a6
    • Kirill Tkhai's avatar
      mm/workingset.c: refactor workingset_init() · 39887653
      Kirill Tkhai authored
      Use prealloc_shrinker()/register_shrinker_prepared() instead of
      register_shrinker().  This will be used in next patch.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39887653
    • Kirill Tkhai's avatar
      mm, memcg: assign memcg-aware shrinkers bitmap to memcg · 0a4465d3
      Kirill Tkhai authored
      Imagine a big node with many cpus, memory cgroups and containers.  Let
      we have 200 containers, every container has 10 mounts, and 10 cgroups.
      All container tasks don't touch foreign containers mounts.  If there is
      intensive pages write, and global reclaim happens, a writing task has to
      iterate over all memcgs to shrink slab, before it's able to go to
      shrink_page_list().
      
      Iteration over all the memcg slabs is very expensive: the task has to
      visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
      2000 memcgs, the total calls are 2000 * 2000 = 4000000.
      
      So, the shrinker makes 4 million do_shrink_slab() calls just to try to
      isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcg via
      shrink_page_list().  I've observed a node spending almost 100% in
      kernel, making useless iteration over already shrinked slab.
      
      This patch adds bitmap of memcg-aware shrinkers to memcg.  The size of
      the bitmap depends on bitmap_nr_ids, and during memcg life it's
      maintained to be enough to fit bitmap_nr_ids shrinkers.  Every bit in
      the map is related to corresponding shrinker id.
      
      Next patches will maintain set bit only for really charged memcg.  This
      will allow shrink_slab() to increase its performance in significant way.
      See the last patch for the numbers.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
      [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
        Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
      Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a4465d3
    • Kirill Tkhai's avatar
      mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} defines · b05706f1
      Kirill Tkhai authored
      Next patch requires these defines are above their current position, so
      here they are moved to declarations.
      
      Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b05706f1
    • Kirill Tkhai's avatar
      mm: assign id to every memcg-aware shrinker · b4c2b231
      Kirill Tkhai authored
      Introduce shrinker::id number, which is used to enumerate memcg-aware
      shrinkers.  The number start from 0, and the code tries to maintain it
      as small as possible.
      
      This will be used to represent a memcg-aware shrinkers in memcg
      shrinkers map.
      
      Since all memcg-aware shrinkers are based on list_lru, which is
      per-memcg in case of !CONFIG_MEMCG_KMEM only, the new functionality will
      be under this config option.
      
      [ktkhai@virtuozzo.com: v9]
        Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4c2b231
    • Kirill Tkhai's avatar
      mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB · 84c07d11
      Kirill Tkhai authored
      Introduce new config option, which is used to replace repeating
      CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
      memcg+kmem related code, so let's keep the defines more clearly.
      
      Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      84c07d11
    • Kirill Tkhai's avatar
      mm/list_lru.c: combine code under the same define · e0295238
      Kirill Tkhai authored
      Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.
      
      This patcheset solves the problem with slow shrink_slab() occuring on
      the machines having many shrinkers and memory cgroups (i.e., with many
      containers).  The problem is complexity of shrink_slab() is O(n^2) and
      it grows too fast with the growth of containers numbers.
      
      Let us have 200 containers, and every container has 10 mounts and 10
      cgroups.  All container tasks are isolated, and they don't touch foreign
      containers mounts.
      
      In case of global reclaim, a task has to iterate all over the memcgs and
      to call all the memcg-aware shrinkers for all of them.  This means, the
      task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since
      there are 2000 memcgs, the total calls of do_shrink_slab() are 2000 *
      2000 = 4000000.
      
      4 million calls are not a number operations, which can takes 1 cpu
      cycle.  E.g., super_cache_count() accesses at least two lists, and makes
      arifmetical calculations.  Even, if there are no charged objects, we do
      these calculations, and replaces cpu caches by read memory.  I observed
      nodes spending almost 100% time in kernel, in case of intensive writing
      and global reclaim.  The writer consumes pages fast, but it's need to
      shrink_slab() before the reclaimer reached shrink pages function (and
      frees SWAP_CLUSTER_MAX pages).  Even if there is no writing, the
      iterations just waste the time, and slows reclaim down.
      
      Let's see the small test below:
      
        $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
        $mkdir /sys/fs/cgroup/memory/ct
        $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
        $for i in `seq 0 4000`;
                do mkdir /sys/fs/cgroup/memory/ct/$i;
                echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
                mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
        done
      
      Then, let's see drop caches time (5 sequential calls):
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
        0.00user 13.78system 0:13.78elapsed 99%CPU
        0.00user 5.59system 0:05.60elapsed 99%CPU
        0.00user 5.48system 0:05.48elapsed 99%CPU
        0.00user 8.35system 0:08.35elapsed 99%CPU
        0.00user 8.34system 0:08.35elapsed 99%CPU
      
      The last four calls don't actually shrink anything.  So, the iterations
      over slab shrinkers take 5.48 seconds.  Not so good for scalability.
      
      The patchset solves the problem by making shrink_slab() of O(n)
      complexity.  There are following functional actions:
      
      1) Assign id to every registered memcg-aware shrinker.
      
      2) Maintain per-memcgroup bitmap of memcg-aware shrinkers, and set a
         shrinker-related bit after the first element is added to lru list
         (also, when removed child memcg elements are reparanted).
      
      3) Split memcg-aware shrinkers and !memcg-aware shrinkers, and call a
         shrinker if its bit is set in memcg's shrinker bitmap.  (Also, there is
         a functionality to clear the bit, after last element is shrinked).
      
      This gives significant performance increase.  The result after patchset
      is applied:
      
        $time echo 3 > /proc/sys/vm/drop_caches
      
        0.00user 1.10system 0:01.10elapsed 99%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
        0.00user 0.00system 0:00.01elapsed 64%CPU
        0.00user 0.01system 0:00.01elapsed 82%CPU
      
      The results show the performance increases at least in 548 times.
      
      So, the patchset makes shrink_slab() of less complexity and improves the
      performance in such types of load I pointed.  This will give a profit in
      case of !global reclaim case, since there also will be less
      do_shrink_slab() calls.
      
      This patch (of 17):
      
      These two pairs of blocks of code are under the same #ifdef #else
      #endif.
      
      Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0295238
    • Mike Rapoport's avatar
      mm/memblock.c: replace u64 with phys_addr_t where appropriate · a36aab89
      Mike Rapoport authored
      Most functions in memblock already use phys_addr_t to represent a
      physical address with __memblock_free_late() being an exception.
      
      This patch replaces u64 with phys_addr_t in __memblock_free_late() and
      switches several format strings from %llx to %pa to avoid casting from
      phys_addr_t to u64.
      
      Link: http://lkml.kernel.org/r/1530637506-1256-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a36aab89
    • Oscar Salvador's avatar
      mm/sparse.c: make sparse_init_one_section void and remove check · 4e40987f
      Oscar Salvador authored
      sparse_init_one_section() is being called from two sites: sparse_init()
      and sparse_add_one_section().  The former calls it from a
      for_each_present_section_nr() loop, and the latter marks the section as
      present before calling it.  This means that when
      sparse_init_one_section() gets called, we already know that the section
      is present.  So there is no point to double check that in the function.
      
      This removes the check and makes the function void.
      
      [ross.zwisler@linux.intel.com: fix error path in sparse_add_one_section]
        Link: http://lkml.kernel.org/r/20180706190658.6873-1-ross.zwisler@linux.intel.com
      [ross.zwisler@linux.intel.com: simplification suggested by Oscar]
        Link: http://lkml.kernel.org/r/20180706223358.742-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20180702154325.12196-1-osalvador@techadventures.netSigned-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e40987f
    • Michal Hocko's avatar
      memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Michal Hocko authored
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is is OOM.
      This is behavior is inconsistent with !memcg case where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation of the different memcg oom semantic was the
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to the users space.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent from same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior is hard
      to test for and error prone.  Basically the same reason why the page
      allocator doesn't fail allocations under such conditions.  The later
      might allow runaways but those should be really unlikely unless somebody
      misconfigures the system.  E.g.  allowing to migrate tasks away from the
      memcg to a different unlimited memcg with move_charge_at_immigrate
      disabled.
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29ef680a
    • Mike Rapoport's avatar
      mm: make DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM · d39f8fb4
      Mike Rapoport authored
      The deferred memory initialization relies on section definitions, e.g
      PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on
      most architectures.
      
      Initially DEFERRED_STRUCT_PAGE_INIT depended on explicit
      ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since
      the commit 2e3ca40f ("mm: relax deferred struct page
      requirements") this requirement was relaxed and now it is possible to
      enable DEFERRED_STRUCT_PAGE_INIT on architectures that support
      DISCONTINGMEM and NO_BOOTMEM which causes build failures.
      
      For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc
      causes the following build failure:
      
          CC      mm/page_alloc.o
        mm/page_alloc.c: In function 'update_defer_init':
        mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION'
        undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
              (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                      ^~~~~~~~~~~~~~~~~
                      USEC_PER_SEC
        mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in
        In file included from include/linux/cache.h:5:0,
                         from include/linux/printk.h:9,
                         from include/linux/kernel.h:14,
                         from include/asm-generic/bug.h:18,
                         from arch/arc/include/asm/bug.h:32,
                         from include/linux/bug.h:5,
                         from include/linux/mmdebug.h:5,
                         from include/linux/mm.h:9,
                         from mm/page_alloc.c:18:
        mm/page_alloc.c: In function 'deferred_grow_zone':
        mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
          unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                                            ^
        include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
         #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
                                                       ^~~~
        include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL'
         #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
                              ^~~~~~~~~~~~~~
        mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN'
          unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                          ^~~~~
        In file included from include/asm-generic/bug.h:18:0,
                         from arch/arc/include/asm/bug.h:32,
                         from include/linux/bug.h:5,
                         from include/linux/mmdebug.h:5,
                         from include/linux/mm.h:9,
                         from mm/page_alloc.c:18:
        mm/page_alloc.c: In function 'free_area_init_node':
        mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                          ^
        include/linux/kernel.h:812:22: note: in definition of macro '__typecheck'
           (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                              ^
        include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
          __builtin_choose_expr(__safe_cmp(x, y), \
                                ^~~~~~~~~~
        include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
         #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                                   ^~~~~~~~~~~~~
        mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     ^~~~~
        include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant
          __builtin_choose_expr(__safe_cmp(x, y), \
          ^
        include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
         #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                                   ^~~~~~~~~~~~~
        mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     ^~~~~
        scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed
      
      Let's make the DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
      as the systems that support DISCONTIGMEM do not seem to have that huge
      amounts of memory that would make DEFERRED_STRUCT_PAGE_INIT relevant.
      
      Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d39f8fb4
    • Andrey Ryabinin's avatar
      kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN · 0207df4f
      Andrey Ryabinin authored
      KASAN learns about hotadded memory via the memory hotplug notifier.
      devm_memremap_pages() intentionally skips calling memory hotplug
      notifiers.  So KASAN doesn't know anything about new memory added by
      devm_memremap_pages().  This causes a crash when KASAN tries to access
      non-existent shadow memory:
      
       BUG: unable to handle kernel paging request at ffffed0078000000
       RIP: 0010:check_memory_region+0x82/0x1e0
       Call Trace:
        memcpy+0x1f/0x50
        pmem_do_bvec+0x163/0x720
        pmem_make_request+0x305/0xac0
        generic_make_request+0x54f/0xcf0
        submit_bio+0x9c/0x370
        submit_bh_wbc+0x4c7/0x700
        block_read_full_page+0x5ef/0x870
        do_read_cache_page+0x2b8/0xb30
        read_dev_sector+0xbd/0x3f0
        read_lba.isra.0+0x277/0x670
        efi_partition+0x41a/0x18f0
        check_partition+0x30d/0x5e9
        rescan_partitions+0x18c/0x840
        __blkdev_get+0x859/0x1060
        blkdev_get+0x23f/0x810
        __device_add_disk+0x9c8/0xde0
        pmem_attach_disk+0x9a8/0xf50
        nvdimm_bus_probe+0xf3/0x3c0
        driver_probe_device+0x493/0xbd0
        bus_for_each_drv+0x118/0x1b0
        __device_attach+0x1cd/0x2b0
        bus_probe_device+0x1ac/0x260
        device_add+0x90d/0x1380
        nd_async_device_register+0xe/0x50
        async_run_entry_fn+0xc3/0x5d0
        process_one_work+0xa0a/0x1810
        worker_thread+0x87/0xe80
        kthread+0x2d7/0x390
        ret_from_fork+0x3a/0x50
      
      Add kasan_add_zero_shadow()/kasan_remove_zero_shadow() - post mm_init()
      interface to map/unmap kasan_zero_page at requested virtual addresses.
      And use it to add/remove the shadow memory for hotplugged/unplugged
      device memory.
      
      Link: http://lkml.kernel.org/r/20180629164932.740-1-aryabinin@virtuozzo.com
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0207df4f
    • Song Liu's avatar
      mm: thp: pass correct vm_flags to hugepage_vma_check() · 50f8b92f
      Song Liu authored
      khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
      hugepage_vma_check().  The argument vm_flags contains the latest value.
      Therefore, it is necessary to pass this vm_flags into
      hugepage_vma_check().
      
      With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to
      put memory in huge pages.  Here is an example of failed madvise():
      
         /* mount /dev/shm with huge=advise:
          *     mount -o remount,huge=advise /dev/shm */
         /* create file /dev/shm/huge */
         #define HUGE_FILE "/dev/shm/huge"
      
         fd = open(HUGE_FILE, O_RDONLY);
         ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
         ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);
      
      madvise() will return 0, but this memory region is never put in huge
      page (check from /proc/meminfo: ShmemHugePages).
      
      Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
      Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Reviewed-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      50f8b92f
    • Andrey Ryabinin's avatar
      mm/fadvise.c: fix signed overflow UBSAN complaint · a718e28f
      Andrey Ryabinin authored
      Signed integer overflow is undefined according to the C standard.  The
      overflow in ksys_fadvise64_64() is deliberate, but since it is signed
      overflow, UBSAN complains:
      
      	UBSAN: Undefined behaviour in mm/fadvise.c:76:10
      	signed integer overflow:
      	4 + 9223372036854775805 cannot be represented in type 'long long int'
      
      Use unsigned types to do math.  Unsigned overflow is defined so UBSAN
      will not complain about it.  This patch doesn't change generated code.
      
      [akpm@linux-foundation.org: add comment explaining the casts]
      Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.comSigned-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: <icytxw@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a718e28f
    • Colin Ian King's avatar
      mm/swap_slots.c: make swap_slots_cache_mutex and swap_slots_cache_enable_mutex static · 31f21da1
      Colin Ian King authored
      The mutexes swap_slots_cache_mutex and swap_slots_cache_enable_mutex are
      local to the source and do not need to be in global scope, so make them
      static.
      
      Cleans up sparse warnings:
        symbol 'swap_slots_cache_mutex' was not declared. Should it be static?
        symbol 'swap_slots_cache_enable_mutex' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20180624182536.4937-1-colin.king@canonical.comSigned-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31f21da1
    • Colin Ian King's avatar
      mm/zsmalloc.c: make several functions and a struct static · 4d0a5402
      Colin Ian King authored
      The functions zs_page_isolate, zs_page_migrate, zs_page_putback,
      lock_zspage, trylock_zspage and structure zsmalloc_aops are local to
      source and do not need to be in global scope, so make them static.
      
      Cleans up sparse warnings:
        symbol 'zs_page_isolate' was not declared. Should it be static?
        symbol 'zs_page_migrate' was not declared. Should it be static?
        symbol 'zs_page_putback' was not declared. Should it be static?
        symbol 'zsmalloc_aops' was not declared. Should it be static?
        symbol 'lock_zspage' was not declared. Should it be static?
        symbol 'trylock_zspage' was not declared. Should it be static?
      
      [arnd@arndb.de: hide unused lock_zspage]
        Link: http://lkml.kernel.org/r/20180706130924.3891230-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20180624213322.13776-1-colin.king@canonical.comSigned-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d0a5402
    • Greg Thelen's avatar
      mm/page-writeback.c: update stale account_page_redirty() comment · dcfe4df3
      Greg Thelen authored
      Commit 93f78d88 ("writeback: move backing_dev_info->bdi_stat[] into
      bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
      account_page_redirty().  Update comment to track that change.
      
        BDI_DIRTIED => WB_DIRTIED
        BDI_WRITTEN => WB_WRITTEN
      
      Link: http://lkml.kernel.org/r/20180625171526.173483-1-gthelen@google.comSigned-off-by: default avatarGreg Thelen <gthelen@google.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcfe4df3
    • Shakeel Butt's avatar
      fs, mm: account buffer_head to kmemcg · f745c6f5
      Shakeel Butt authored
      The buffer_head can consume a significant amount of system memory and is
      directly related to the amount of page cache.  In our production
      environment we have observed that a lot of machines are spending a
      significant amount of memory as buffer_head and can not be left as
      system memory overhead.
      
      Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
      allocation.  The buffer_heads can be allocated in a memcg different from
      the memcg of the page for which buffer_heads are being allocated.  One
      concrete example is memory reclaim.  The reclaim can trigger I/O of
      pages of any memcg on the system.  So, the right way to charge
      buffer_head is to extract the memcg from the page for which buffer_heads
      are being allocated and then use targeted memcg charging API.
      
      [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
        Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.comSigned-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f745c6f5
    • Shakeel Butt's avatar
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt authored
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to the memcg
      different from the current processes's memcg.  This patch series
      contains two such concrete use-cases i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happens in the context of syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group to keep the size
      same, at least for 64 bit build, even with additional member by filling
      the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
    • Roman Gushchin's avatar
      mm: introduce mem_cgroup_put() helper · dc0b5864
      Roman Gushchin authored
      Introduce the mem_cgroup_put() helper, which helps to eliminate guarding
      memcg css release with "#ifdef CONFIG_MEMCG" in multiple places.
      
      Link: http://lkml.kernel.org/r/20180623000600.5818-2-guro@fb.comSigned-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@kernel.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc0b5864
    • Luis R. Rodriguez's avatar
      mm: provide a fallback for PAGE_KERNEL_EXEC for architectures · 1a9b4b3d
      Luis R. Rodriguez authored
      Some architectures just don't have PAGE_KERNEL_EXEC.  The mm/nommu.c and
      mm/vmalloc.c code have been using PAGE_KERNEL as a fallback for years.
      Move this fallback to asm-generic.
      
      Link: http://lkml.kernel.org/r/20180510185507.2439-3-mcgrof@kernel.orgSigned-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Suggested-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1a9b4b3d
    • Luis R. Rodriguez's avatar
      mm: provide a fallback for PAGE_KERNEL_RO for architectures · a3266bd4
      Luis R. Rodriguez authored
      Some architectures do not define certain PAGE_KERNEL_* flags, this is
      either because:
      
       a) The way to implement some of these flags is *not yet ported*, or
       b) The architecture *has no way* to describe them
      
      Over time we have accumulated a few PAGE_KERNEL_* fallback workarounds
      for architectures in the kernel which do not define them using
      *relatively safe* equivalents.  Move these scattered fallback hacks into
      asm-generic.
      
      We start off with PAGE_KERNEL_RO using PAGE_KERNEL as a fallback.  This
      has been in place on the firmware loader for years.  Move the fallback
      into the respective asm-generic header.
      
      Link: http://lkml.kernel.org/r/20180510185507.2439-2-mcgrof@kernel.orgSigned-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3266bd4
    • Oscar Salvador's avatar
      mm/memory_hotplug.c: drop unnecessary checks from register_mem_sect_under_node() · 3172e5e6
      Oscar Salvador authored
      Callers of register_mem_sect_under_node() are always passing a valid
      memory_block (not NULL), so we can safely drop the check for NULL.
      
      In the same way, register_mem_sect_under_node() is only called in case
      the node is online, so we can safely remove that check as well.
      
      Link: http://lkml.kernel.org/r/20180622111839.10071-5-osalvador@techadventures.netSigned-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: default avatarJonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3172e5e6
    • Oscar Salvador's avatar
      mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range() · 4fbce633
      Oscar Salvador authored
      link_mem_sections() and walk_memory_range() share most of the code, so
      we can use convert link_mem_sections() into a dummy function that calls
      walk_memory_range() with a callback to register_mem_sect_under_node().
      
      This patch converts register_mem_sect_under_node() in order to match a
      walk_memory_range's callback, getting rid of the check_nid argument and
      checking instead if the system is still boothing, since we only have to
      check for the nid if the system is in such state.
      
      Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.netSigned-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Suggested-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: default avatarJonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: default avatarPavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4fbce633
    • Oscar Salvador's avatar
      mm/memory_hotplug.c: call register_mem_sect_under_node() · d5b6f6a3
      Oscar Salvador authored
      When hotplugging memory, it is possible that two calls are being made to
      register_mem_sect_under_node().
      
      One comes from __add_section()->hotplug_memory_register() and the other
      from add_memory_resource()->link_mem_sections() if we had to register a
      new node.
      
      In case we had to register a new node, hotplug_memory_register() will
      only handle/allocate the memory_block's since
      register_mem_sect_under_node() will return right away because the node
      it is not online yet.
      
      I think it is better if we leave hotplug_memory_register() to
      handle/allocate only memory_block's and make link_mem_sections() to call
      register_mem_sect_under_node().
      
      So this patch removes the call to register_mem_sect_under_node() from
      hotplug_memory_register(), and moves the call to link_mem_sections() out
      of the condition, so it will always be called.  In this way we only have
      one place where the memory sections are registered.
      
      Link: http://lkml.kernel.org/r/20180622111839.10071-3-osalvador@techadventures.netSigned-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Reviewed-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: default avatarJonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5b6f6a3