1. 24 Feb, 2013 40 commits
    • Zhang Yanfei's avatar
      ia64: use %ld to print pages calculated in nr_free_buffer_pages · 6434b94a
      Zhang Yanfei authored
      Now the function nr_free_buffer_pages returns unsigned long, so use %ld
      to print its return value.
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6434b94a
    • Zhang Yanfei's avatar
      mm: fix return type for functions nr_free_*_pages · ebec3862
      Zhang Yanfei authored
      Currently, the amount of RAM that functions nr_free_*_pages return is
      held in unsigned int.  But in machines with big memory (exceeding 16TB),
      the amount may be incorrect because of overflow, so fix it.
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebec3862
    • Michal Hocko's avatar
      memcg: cleanup mem_cgroup_init comment · 1081312f
      Michal Hocko authored
      We should encourage all memcg controller initialization independent on a
      specific mem_cgroup to be done here rather than exploit css_alloc
      callback and assume that nothing happens before root cgroup is created.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1081312f
    • Michal Hocko's avatar
      memcg: move memcg_stock initialization to mem_cgroup_init · e4777496
      Michal Hocko authored
      memcg_stock are currently initialized during the root cgroup allocation
      which is OK but it pointlessly pollutes memcg allocation code with
      something that can be called when the memcg subsystem is initialized by
      mem_cgroup_init along with other controller specific parts.
      
      This patch wraps the current memcg_stock initialization code into a
      helper calls it from the controller subsystem initialization code.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4777496
    • Michal Hocko's avatar
      memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init · 8787a1df
      Michal Hocko authored
      Per-node-zone soft limit tree is currently initialized when the root
      cgroup is created which is OK but it pointlessly pollutes memcg
      allocation code with something that can be called when the memcg
      subsystem is initialized by mem_cgroup_init along with other controller
      specific parts.
      
      While we are at it let's make mem_cgroup_soft_limit_tree_init void
      because it doesn't make much sense to report memory failure because if
      we fail to allocate memory that early during the boot then we are
      screwed anyway (this saves some code).
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8787a1df
    • Minchan Kim's avatar
      mm: use up free swap space before reaching OOM kill · 0e50ce3b
      Minchan Kim authored
      Recently, Luigi reported there are lots of free swap space when OOM
      happens.  It's easily reproduced on zram-over-swap, where many instance
      of memory hogs are running and laptop_mode is enabled.  He said there
      was no problem when he disabled laptop_mode.  The problem when I
      investigate problem is following as.
      
      Assumption for easy explanation: There are no page cache page in system
      because they all are already reclaimed.
      
      1. try_to_free_pages disable may_writepage when laptop_mode is enabled.
      2. shrink_inactive_list isolates victim pages from inactive anon lru list.
      3. shrink_page_list adds them to swapcache via add_to_swap but it doesn't
         pageout because sc->may_writepage is 0 so the page is rotated back into
         inactive anon lru list. The add_to_swap made the page Dirty by SetPageDirty.
      4. 3 couldn't reclaim any pages so do_try_to_free_pages increase priority and
         retry reclaim with higher priority.
      5. shrink_inactlive_list try to isolate victim pages from inactive anon lru list
         but got failed because it try to isolate pages with ISOLATE_CLEAN mode but
         inactive anon lru list is full of dirty pages by 3 so it just returns
         without  any reclaim progress.
      6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned.
         Because sc->nr_scanned is increased by shrink_page_list but we don't call
         shrink_page_list in 5 due to short of isolated pages.
      
      Above loop is continued until OOM happens.
      
      The problem didn't happen before [1] was merged because old logic's
      isolatation in shrink_inactive_list was successful and tried to call
      shrink_page_list to pageout them but it still ends up failed to page out
      by may_writepage.  But important point is that sc->nr_scanned was
      increased although we couldn't swap out them so do_try_to_free_pages
      could set may_writepages.
      
      Since commit f80c0673 ("mm: zone_reclaim: make isolate_lru_page()
      filter-aware") was introduced, it's not a good idea any more to depends
      on only the number of scanned pages for setting may_writepage.  So this
      patch adds new trigger point of setting may_writepage as below
      DEF_PRIOIRTY - 2 which is used to show the significant memory pressure
      in VM so it's good fit for our purpose which would be better to lose
      power saving or clickety rather than OOM killing.
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reported-by: default avatarLuigi Semenzato <semenzato@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0e50ce3b
    • David Rientjes's avatar
      mm: use NUMA_NO_NODE · 00ef2d2f
      David Rientjes authored
      Make a sweep through mm/ and convert code that uses -1 directly to using
      the more appropriate NUMA_NO_NODE.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      00ef2d2f
    • Robin Holt's avatar
      mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts · 751efd86
      Robin Holt authored
      There is a race condition between mmu_notifier_unregister() and
      __mmu_notifier_release().
      
      Assume two tasks, one calling mmu_notifier_unregister() as a result of a
      filp_close() ->flush() callout (task A), and the other calling
      mmu_notifier_release() from an mmput() (task B).
      
                      A                               B
      t1                                              srcu_read_lock()
      t2              if (!hlist_unhashed())
      t3                                              srcu_read_unlock()
      t4              srcu_read_lock()
      t5                                              hlist_del_init_rcu()
      t6                                              synchronize_srcu()
      t7              srcu_read_unlock()
      t8              hlist_del_rcu()  <--- NULL pointer deref.
      
      Additionally, the list traversal in __mmu_notifier_release() is not
      protected by the by the mmu_notifier_mm->hlist_lock which can result in
      callouts to the ->release() notifier from both mmu_notifier_unregister()
      and __mmu_notifier_release().
      
      -stable suggestions:
      
      The stable trees prior to 3.7.y need commits 21a92735 and
      70400303 cherry-picked in that order prior to cherry-picking this
      commit.  The 3.7.y tree already has those two commits.
      Signed-off-by: default avatarRobin Holt <holt@sgi.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Sagi Grimberg <sagig@mellanox.co.il>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      751efd86
    • Cody P Schafer's avatar
      mm/memory_hotplug: use pgdat_end_pfn() instead of open coding the same. · c1f19495
      Cody P Schafer authored
      Replace open coded pgdat_end_pfn() with helper function.
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1f19495
    • Cody P Schafer's avatar
      mm/memory_hotplug: use ensure_zone_is_initialized() · 64dd1b29
      Cody P Schafer authored
      Remove open coding of ensure_zone_is_initialzied().
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64dd1b29
    • Cody P Schafer's avatar
      mm: add helper ensure_zone_is_initialized() · f6bbb78e
      Cody P Schafer authored
      ensure_zone_is_initialized() checks if a zone is in a empty & not
      initialized state (typically occuring after it is created in memory
      hotplugging), and, if so, calls init_currently_empty_zone() to
      initialize the zone.
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f6bbb78e
    • Cody P Schafer's avatar
      mm/page_alloc: add informative debugging message in page_outside_zone_boundaries() · b5e6a5a2
      Cody P Schafer authored
      Add a debug message which prints when a page is found outside of the
      boundaries of the zone it should belong to. Format is:
      	"page $pfn outside zone [ $start_pfn - $end_pfn ]"
      
      [akpm@linux-foundation.org: s/pr_debug/pr_err/]
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5e6a5a2
    • Cody P Schafer's avatar
      mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate. · da3649e1
      Cody P Schafer authored
      Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar
      zone_*() functions.
      
      Change node_end_pfn() to be a wrapper of pgdat_end_pfn().
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da3649e1
    • Cody P Schafer's avatar
      mm/page_alloc: add a VM_BUG in __free_one_page() if the zone is uninitialized. · d29bb978
      Cody P Schafer authored
      Freeing pages to uninitialized zones is not handled by
      __free_one_page(), and should never happen when the code is correct.
      
      Ran into this while writing some code that dynamically onlines extra
      zones.
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d29bb978
    • Cody P Schafer's avatar
      mm: add zone_is_empty() and zone_is_initialized() · 2a6e3ebe
      Cody P Schafer authored
      Factoring out these 2 checks makes it more clear what we are actually
      checking for.
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a6e3ebe
    • Cody P Schafer's avatar
      mm: add & use zone_end_pfn() and zone_spans_pfn() · 108bcc96
      Cody P Schafer authored
      Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
      duplication.
      
      This also switches to using them in compaction (where an additional
      variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
      kmemleak.
      
      Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
      because I expect at some point the sycronization issues with start_pfn &
      spanned_pages will need fixing, either by actually using the seqlock or
      clever memory barrier usage.
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      108bcc96
    • Cody P Schafer's avatar
      mm: add SECTION_IN_PAGE_FLAGS · 9127ab4f
      Cody P Schafer authored
      Instead of directly utilizing a combination of config options to determine
      this, add a macro to specifically address it.
      Signed-off-by: default avatarCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9127ab4f
    • Johannes Weiner's avatar
      mm/mlock.c: document scary-looking stack expansion mlock chain · 4805b02e
      Johannes Weiner authored
      The fact that mlock calls get_user_pages, and get_user_pages might call
      mlock when expanding a stack looks like a potential recursion.
      
      However, mlock makes sure the requested range is already contained
      within a vma, so no stack expansion will actually happen from mlock.
      
      Should this ever change: the stack expansion mlocks only the newly
      expanded range and so will not result in recursive expansion.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: default avatarMichel Lespinasse <walken@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4805b02e
    • Johannes Weiner's avatar
      mm: refactor inactive_file_is_low() to use get_lru_size() · e3790144
      Johannes Weiner authored
      An inactive file list is considered low when its active counterpart is
      bigger, regardless of whether it is a global zone LRU list or a memcg
      zone LRU list.  The only difference is in how the LRU size is assessed.
      
      get_lru_size() does the right thing for both global and memcg reclaim
      situations.
      
      Get rid of inactive_file_is_low_global() and
      mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare
      the numbers in common code.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3790144
    • Johannes Weiner's avatar
      mm: shmem: use new radix tree iterator · 860f2759
      Johannes Weiner authored
      In shmem_find_get_pages_and_swap(), use the faster radix tree iterator
      construct from commit 78c1d784 ("radix-tree: introduce bit-optimized
      iterator").
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      860f2759
    • Hugh Dickins's avatar
      ksm: stop hotremove lockdep warning · ef4d43a8
      Hugh Dickins authored
      Complaints are rare, but lockdep still does not understand the way
      ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
      it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
      problem because notifier callbacks are made under down_read of
      blocking_notifier_head->rwsem (so first the mutex is taken while holding
      the rwsem, then later the rwsem is taken while still holding the mutex);
      but is not in fact a problem because mem_hotplug_mutex is held
      throughout the dance.
      
      There was an attempt to fix this with mutex_lock_nested(); but if that
      happened to fool lockdep two years ago, apparently it does so no longer.
      
      I had hoped to eradicate this issue in extending KSM page migration not
      to need the ksm_thread_mutex.  But then realized that although the page
      migration itself is safe, we do still need to lock out ksmd and other
      users of get_ksm_page() while offlining memory - at some point between
      MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
      vanish, and get_ksm_page()'s accesses to them become a violation.
      
      So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
      MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
      checks, to achieve the same lockout without being caught by lockdep.
      This is less elegant for KSM, but it's more important to keep lockdep
      useful to other users - and I apologize for how long it took to fix.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Tested-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef4d43a8
    • Hugh Dickins's avatar
      mm: remove offlining arg to migrate_pages · 9c620e2b
      Hugh Dickins authored
      No functional change, but the only purpose of the offlining argument to
      migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
      KSM page for memory hotremove (which took ksm_thread_mutex) but not for
      other callers.  Now all cases are safe, remove the arg.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c620e2b
    • Hugh Dickins's avatar
      ksm: enable KSM page migration · b79bc0a0
      Hugh Dickins authored
      Migration of KSM pages is now safe: remove the PageKsm restrictions from
      mempolicy.c and migrate.c.
      
      But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
      are irrelevant to KSM: it looks as if that code was preventing hotremove
      migration of KSM pages, unless they happened to be in swapcache.
      
      There is some question as to whether enforcing a NUMA mempolicy migration
      ought to migrate KSM pages, mapped into entirely unrelated processes; but
      moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
      and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
      any area where this is a worry.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b79bc0a0
    • Hugh Dickins's avatar
      ksm: make !merge_across_nodes migration safe · 4146d2d6
      Hugh Dickins authored
      The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
      set to non-default 0: if a KSM page is migrated to a different NUMA node,
      how do we migrate its stable node to the right tree?  And what if that
      collides with an existing stable node?
      
      ksm_migrate_page() can do no more than it's already doing, updating
      stable_node->kpfn: the stable tree itself cannot be manipulated without
      holding ksm_thread_mutex.  So accept that a stable tree may temporarily
      indicate a page belonging to the wrong NUMA node, leave updating until the
      next pass of ksmd, just be careful not to merge other pages on to a
      misplaced page.  Note nid of holding tree in stable_node, and recognize
      that it will not always match nid of kpfn.
      
      A misplaced KSM page is discovered, either when ksm_do_scan() next comes
      around to one of its rmap_items (we now have to go to cmp_and_merge_page
      even on pages in a stable tree), or when stable_tree_search() arrives at a
      matching node for another page, and this node page is found misplaced.
      
      In each case, move the misplaced stable_node to a list of migrate_nodes
      (and use the address of migrate_nodes as magic by which to identify them):
      we don't need them in a tree.  If stable_tree_search() finds no match for
      a page, but it's currently exiled to this list, then slot its stable_node
      right there into the tree, bringing all of its mappings with it; otherwise
      they get migrated one by one to the original page of the colliding node.
      stable_tree_search() is now modelled more like stable_tree_insert(), in
      order to handle these insertions of migrated nodes.
      
      remove_node_from_stable_tree(), remove_all_stable_nodes() and
      ksm_check_stable_tree() have to handle the migrate_nodes list as well as
      the stable tree itself.  Less obviously, we do need to prune the list of
      stale entries from time to time (scan_get_next_rmap_item() does it once
      each full scan): whereas stale nodes in the stable tree get naturally
      pruned as searches try to brush past them, these migrate_nodes may get
      forgotten and accumulate.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4146d2d6
    • Hugh Dickins's avatar
      ksm: make KSM page migration possible · c8d6553b
      Hugh Dickins authored
      KSM page migration is already supported in the case of memory hotremove,
      which takes the ksm_thread_mutex across all its migrations to keep life
      simple.
      
      But the new KSM NUMA merge_across_nodes knob introduces a problem, when
      it's set to non-default 0: if a KSM page is migrated to a different NUMA
      node, how do we migrate its stable node to the right tree?  And what if
      that collides with an existing stable node?
      
      So far there's no provision for that, and this patch does not attempt to
      deal with it either.  But how will I test a solution, when I don't know
      how to hotremove memory?  The best answer is to enable KSM page migration
      in all cases now, and test more common cases.  With THP and compaction
      added since KSM came in, page migration is now mainstream, and it's a
      shame that a KSM page can frustrate freeing a page block.
      
      Without worrying about merge_across_nodes 0 for now, this patch gets KSM
      page migration working reliably for default merge_across_nodes 1 (but
      leave the patch enabling it until near the end of the series).
      
      It's much simpler than I'd originally imagined, and does not require an
      additional tier of locking: page migration relies on the page lock, KSM
      page reclaim relies on the page lock, the page lock is enough for KSM page
      migration too.
      
      Almost all the care has to be in get_ksm_page(): that's the function which
      worries about when a stable node is stale and should be freed, now it also
      has to worry about the KSM page being migrated.
      
      The only new overhead is an additional put/get/lock/unlock_page when
      stable_tree_search() arrives at a matching node: to make sure migration
      respects the raised page count, and so does not migrate the page while
      we're busy with it here.  That's probably avoidable, either by changing
      internal interfaces from using kpage to stable_node, or by moving the
      ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
      swapcache); but this works well, I've no urge to pull it apart now.
      
      (Descents of the stable tree may pass through nodes whose KSM pages are
      under migration: being unlocked, the raised page count does not prevent
      that, nor need it: it's safe to memcmp against either old or new page.)
      
      You might worry about mremap, and whether page migration's rmap_walk to
      remove migration entries will find all the KSM locations where it inserted
      earlier: that should already be handled, by the satisfyingly heavy hammer
      of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8d6553b
    • Hugh Dickins's avatar
      ksm: remove old stable nodes more thoroughly · cbf86cfe
      Hugh Dickins authored
      Switching merge_across_nodes after running KSM is liable to oops on stale
      nodes still left over from the previous stable tree.  It's not something
      that people will often want to do, but it would be lame to demand a reboot
      when they're trying to determine which merge_across_nodes setting is best.
      
      How can this happen?  We only permit switching merge_across_nodes when
      pages_shared is 0, and usually set run 2 to force that beforehand, which
      ought to unmerge everything: yet oopses still occur when you then run 1.
      
      Three causes:
      
      1. The old stable tree (built according to the inverse
         merge_across_nodes) has not been fully torn down.  A stable node
         lingers until get_ksm_page() notices that the page it references no
         longer references it: but the page is not necessarily freed as soon as
         expected, particularly when swapcache.
      
         Fix this with a pass through the old stable tree, applying
         get_ksm_page() to each of the remaining nodes (most found stale and
         removed immediately), with forced removal of any left over.  Unless the
         page is still mapped: I've not seen that case, it shouldn't occur, but
         better to WARN_ON_ONCE and EBUSY than BUG.
      
      2. __ksm_enter() has a nice little optimization, to insert the new mm
         just behind ksmd's cursor, so there's a full pass for it to stabilize
         (or be removed) before ksmd addresses it.  Nice when ksmd is running,
         but not so nice when we're trying to unmerge all mms: we were missing
         those mms forked and inserted behind the unmerge cursor.  Easily fixed
         by inserting at the end when KSM_RUN_UNMERGE.
      
      3.  It is possible for a KSM page to be faulted back from swapcache
         into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
         it.  Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
         private to ksm.c, so dissolve the distinction between
         ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
         the one call into ksm.c.
      
      A long outstanding, unrelated bugfix sneaks in with that third fix:
      ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
      error when read in from swap) to a page which it then marks Uptodate.  Fix
      this case by not copying, letting do_swap_page() discover the error.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cbf86cfe
    • Hugh Dickins's avatar
      ksm: get_ksm_page locked · 8aafa6a4
      Hugh Dickins authored
      In some places where get_ksm_page() is used, we need the page to be locked.
      
      When KSM migration is fully enabled, we shall want that to make sure that
      the page just acquired cannot be migrated beneath us (raised page count is
      only effective when there is serialization to make sure migration
      notices).  Whereas when navigating through the stable tree, we certainly
      do not want to lock each node (raised page count is enough to guarantee
      the memcmps, even if page is migrated to another node).
      
      Since we're about to add another use case, add the locked argument to
      get_ksm_page() now.
      
      Hmm, what's that rcu_read_lock() about?  Complete misunderstanding, I
      really got the wrong end of the stick on that!  There's a configuration in
      which page_cache_get_speculative() can do something cheaper than
      get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
      disabled preemption for it.  There's no need for rcu_read_lock() around
      get_page_unless_zero() (and mapping checks) here.  Cut out that silliness
      before making this any harder to understand.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8aafa6a4
    • Hugh Dickins's avatar
      ksm: reorganize ksm_check_stable_tree · ee0ea59c
      Hugh Dickins authored
      Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
      (restarting whenever it finds a stale node to remove), but rearrange so
      that at least it does not needlessly restart from nid 0 each time.  And
      add a couple of comments: here is why we keep pfn instead of page.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee0ea59c
    • Hugh Dickins's avatar
      ksm: trivial tidyups · e850dcf5
      Hugh Dickins authored
      Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef
      CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid
      when not NUMA).  Add comment, remove "unsigned" from rmap_item->nid, as
      "int nid" elsewhere.  Define ksm_merge_across_nodes 1U when #ifndef NUMA
      to help optimizing out.  Use ?: in get_kpfn_nid().  Adjust a few
      comments noticed in ongoing work.
      
      Leave stable_tree_insert()'s rb_linkage until after the node has been
      set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page
      lock make either way safe, but we're going to copy and I prefer this
      precedent.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e850dcf5
    • Petr Holasek's avatar
      ksm: add sysfs ABI Documentation · f00dc0ee
      Petr Holasek authored
      Add sysfs documentation for Kernel Samepage Merging (KSM) including new
      merge_across_nodes knob.
      Signed-off-by: default avatarPetr Holasek <pholasek@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f00dc0ee
    • Petr Holasek's avatar
      ksm: allow trees per NUMA node · 90bd6fd3
      Petr Holasek authored
      Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
      Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
      we had with that, fully enabling KSM page migration on the way.
      
      (A different kind of KSM/NUMA issue which I've certainly not begun to
      address here: when KSM pages are unmerged, there's usually no sense in
      preferring to allocate the new pages local to the caller's node.)
      
      This patch:
      
      Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
      which control merging pages across different numa nodes.  When it is set
      to zero only pages from the same node are merged, otherwise pages from
      all nodes can be merged together (default behavior).
      
      Typical use-case could be a lot of KVM guests on NUMA machine and cpus
      from more distant nodes would have significant increase of access
      latency to the merged ksm page.  Sysfs knob was choosen for higher
      variability when some users still prefers higher amount of saved
      physical memory regardless of access latency.
      
      Every numa node has its own stable & unstable trees because of faster
      searching and inserting.  Changing of merge_across_nodes value is
      possible only when there are not any ksm shared pages in system.
      
      I've tested this patch on numa machines with 2, 4 and 8 nodes and
      measured speed of memory access inside of KVM guests with memory pinned
      to one of nodes with this benchmark:
      
        http://pholasek.fedorapeople.org/alloc_pg.c
      
      Population standard deviations of access times in percentage of average
      were following:
      
      merge_across_nodes=1
      2 nodes 1.4%
      4 nodes 1.6%
      8 nodes	1.7%
      
      merge_across_nodes=0
      2 nodes	1%
      4 nodes	0.32%
      8 nodes	0.018%
      
      RFC: https://lkml.org/lkml/2011/11/30/91
      v1: https://lkml.org/lkml/2012/1/23/46
      v2: https://lkml.org/lkml/2012/6/29/105
      v3: https://lkml.org/lkml/2012/9/14/550
      v4: https://lkml.org/lkml/2012/9/23/137
      v5: https://lkml.org/lkml/2012/12/10/540
      v6: https://lkml.org/lkml/2012/12/23/154
      v7: https://lkml.org/lkml/2012/12/27/225
      
      Hugh notes that this patch brings two problems, whose solution needs
      further support in mm/ksm.c, which follows in subsequent patches:
      
      1) switching merge_across_nodes after running KSM is liable to oops
         on stale nodes still left over from the previous stable tree;
      
      2) memory hotremove may migrate KSM pages, but there is no provision
         here for !merge_across_nodes to migrate nodes to the proper tree.
      Signed-off-by: default avatarPetr Holasek <pholasek@redhat.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90bd6fd3
    • Mel Gorman's avatar
      mm: rename page struct field helpers · 22b751c3
      Mel Gorman authored
      The function names page_xchg_last_nid(), page_last_nid() and
      reset_page_last_nid() were judged to be inconsistent so rename them to a
      struct_field_op style pattern.  As it looked jarring to have
      reset_page_mapcount() and page_nid_reset_last() beside each other in
      memmap_init_zone(), this patch also renames reset_page_mapcount() to
      page_mapcount_reset().  There are others like init_page_count() but as
      it is used throughout the arch code a rename would likely cause more
      conflicts than it is worth.
      
      [akpm@linux-foundation.org: fix zcache]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      22b751c3
    • Glauber Costa's avatar
      memcg: avoid dangling reference count in creation failure. · e4715f01
      Glauber Costa authored
      When use_hierarchy is enabled, we acquire an extra reference count in
      our parent during cgroup creation.  We don't release it, though, if any
      failure exist in the creation process.
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Reported-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4715f01
    • Glauber Costa's avatar
      memcg: increment static branch right after limit set · 692e89ab
      Glauber Costa authored
      We were deferring the kmemcg static branch increment to a later time,
      due to a nasty dependency between the cpu_hotplug lock, taken by the
      jump label update, and the cgroup_lock.
      
      Now we no longer take the cgroup lock, and we can save ourselves the
      trouble.
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      692e89ab
    • Glauber Costa's avatar
      memcg: replace cgroup_lock with memcg specific memcg_lock · 0999821b
      Glauber Costa authored
      After the preparation work done in earlier patches, the cgroup_lock can
      be trivially replaced with a memcg-specific lock.  This is an automatic
      translation at every site where the values involved were queried.
      
      The sites where values are written, however, used to be naturally called
      under cgroup_lock.  This is the case for instance in the css_online
      callback.  For those, we now need to explicitly add the memcg lock.
      
      With this, all the calls to cgroup_lock outside cgroup core are gone.
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0999821b
    • Glauber Costa's avatar
      memcg: fast hierarchy-aware child test · b5f99b53
      Glauber Costa authored
      Currently, we use cgroups' provided list of children to verify if it is
      safe to proceed with any value change that is dependent on the cgroup
      being empty.
      
      This is less than ideal, because it enforces a dependency over cgroup
      core that we would be better off without.  The solution proposed here is
      to iterate over the child cgroups and if any is found that is already
      online, we bounce and return: we don't really care how many children we
      have, only if we have any.
      
      This is also made to be hierarchy aware.  IOW, cgroups with hierarchy
      disabled, while they still exist, will be considered for the purpose of
      this interface as having no children.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5f99b53
    • Glauber Costa's avatar
      memcg: split part of memcg creation to css_online · d142e3e6
      Glauber Costa authored
      This patch is a preparatory work for later locking rework to get rid of
      big cgroup lock from memory controller code.
      
      The memory controller uses some tunables to adjust its operation.  Those
      tunables are inherited from parent to children upon children
      intialization.  For most of them, the value cannot be changed after the
      parent has a new children.
      
      cgroup core splits initialization in two phases: css_alloc and css_online.
      After css_alloc, the memory allocation and basic initialization are done.
      But the new group is not yet visible anywhere, not even for cgroup core
      code.  It is only somewhere between css_alloc and css_online that it is
      inserted into the internal children lists.  Copying tunable values in
      css_alloc will lead to inconsistent values: the children will copy the old
      parent values, that can change between the copy and the moment in which
      the groups is linked to any data structure that can indicate the presence
      of children.
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d142e3e6
    • Glauber Costa's avatar
      memcg: prevent changes to move_charge_at_immigrate during task attach · ee5e8472
      Glauber Costa authored
      In memcg, we use the cgroup_lock basically to synchronize against
      attaching new children to a cgroup.  We do this because we rely on
      cgroup core to provide us with this information.
      
      We need to guarantee that upon child creation, our tunables are
      consistent.  For those, the calls to cgroup_lock() all live in handlers
      like mem_cgroup_hierarchy_write(), where we change a tunable in the
      group that is hierarchy-related.  For instance, the use_hierarchy flag
      cannot be changed if the cgroup already have children.
      
      Furthermore, those values are propagated from the parent to the child
      when a new child is created.  So if we don't lock like this, we can end
      up with the following situation:
      
      A                                   B
       memcg_css_alloc()                       mem_cgroup_hierarchy_write()
       copy use hierarchy from parent          change use hierarchy in parent
       finish creation.
      
      This is mainly because during create, we are still not fully connected
      to the css tree.  So all iterators and the such that we could use, will
      fail to show that the group has children.
      
      My observation is that all of creation can proceed in parallel with
      those tasks, except value assignment.  So what this patch series does is
      to first move all value assignment that is dependent on parent values
      from css_alloc to css_online, where the iterators all work, and then we
      lock only the value assignment.  This will guarantee that parent and
      children always have consistent values.  Together with an online test,
      that can be derived from the observation that the refcount of an online
      memcg can be made to be always positive, we should be able to
      synchronize our side without the cgroup lock.
      
      This patch:
      
      Currently, we rely on the cgroup_lock() to prevent changes to
      move_charge_at_immigrate during task migration.  However, this is only
      needed because the current strategy keeps checking this value throughout
      the whole process.  Since all we need is serialization, one needs only
      to guarantee that whatever decision we made in the beginning of a
      specific migration is respected throughout the process.
      
      We can achieve this by just saving it in mc.  By doing this, no kind of
      locking is needed.
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ee5e8472
    • Glauber Costa's avatar
      memcg: reduce the size of struct memcg 244-fold. · 45cf7ebd
      Glauber Costa authored
      In order to maintain all the memcg bookkeeping, we need per-node
      descriptors, which will in turn contain a per-zone descriptor.
      
      Because we want to statically allocate those, this array ends up being
      very big.  Part of the reason is that we allocate something large enough
      to hold MAX_NUMNODES, the compile time constant that holds the maximum
      number of nodes we would ever consider.
      
      However, we can do better in some cases if the firmware help us.  This
      is true for modern x86 machines; coincidentally one of the architectures
      in which MAX_NUMNODES tends to be very big.
      
      By using the firmware-provided maximum number of nodes instead of
      MAX_NUMNODES, we can reduce the memory footprint of struct memcg
      considerably.  In the extreme case in which we have only one node, this
      reduces the size of the structure from ~ 64k to ~2k.  This is
      particularly important because it means that we will no longer resort to
      the vmalloc area for the struct memcg on defconfigs.  We also have
      enough room for an extra node and still be outside vmalloc.
      
      One also has to keep in mind that with the industry's ability to fit
      more processors in a die as fast as the FED prints money, a nodes = 2
      configuration is already respectably big.
      
      [akpm@linux-foundation.org: add check for invalid nid, remove inline]
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarGreg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      45cf7ebd
    • Mel Gorman's avatar
      mm: init: report on last-nid information stored in page->flags · a4e1b4c6
      Mel Gorman authored
      Answering the question "how much space remains in the page->flags" is
      time-consuming.  mminit_loglevel can help answer the question but it
      does not take last_nid information into account.  This patch corrects it
      and while there it corrects the messages related to page flag usage,
      pgshifts and node/zone id.  When applied the relevant output looks
      something like this but will depend on the kernel configuration.
      
        mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
        mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
        mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
        mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
        mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4e1b4c6