1. 04 Jun, 2014 40 commits
    • memcg: optimize the "Search everything else" loop in mm_update_next_owner() · 39af1765
      Oleg Nesterov authored
      for_each_process_thread() is sub-optimal.  All threads share the same
      ->mm, so we can switch to the next process once we have found a thread
      with ->mm != NULL and ->mm != mm.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Chiang <pchiang@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
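      The early exit described above can be modelled in userspace.  This is a
      minimal sketch, not the kernel code: struct mm, struct thread, struct
      process and find_mm_user() are hypothetical stand-ins, and the visited
      counter exists only to make the saving observable.

```c
#include <stddef.h>

struct mm { int id; };              /* stand-in for struct mm_struct */
struct thread { struct mm *mm; };   /* stand-in for one thread's task_struct */
struct process {
	struct thread *threads;
	size_t nr_threads;
};

/*
 * Return the first thread whose ->mm equals 'mm', or NULL if there is none.
 * '*visited' reports how many threads were inspected.
 */
static struct thread *find_mm_user(struct process *procs, size_t nr_procs,
				   const struct mm *mm, size_t *visited)
{
	*visited = 0;
	for (size_t p = 0; p < nr_procs; p++) {
		for (size_t t = 0; t < procs[p].nr_threads; t++) {
			struct thread *c = &procs[p].threads[t];

			(*visited)++;
			if (c->mm == mm)
				return c;
			/*
			 * All threads of a process share ->mm, so one thread
			 * with a different non-NULL mm rules out the rest.
			 */
			if (c->mm)
				break;
		}
	}
	return NULL;
}
```

      With one three-thread process using another mm followed by a process
      using the target mm, only two threads are inspected instead of four.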
    • memcg: mm_update_next_owner() should skip kthreads · f87fb599
      Oleg Nesterov authored
      "Search through everything else" in mm_update_next_owner() can hit a
      kthread which adopted this "mm" via use_mm(), it should not be used as
      mm->owner.  Add the PF_KTHREAD check.
      
      While at it, change this code to use for_each_process_thread() instead
      of deprecated do_each_thread/while_each_thread.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Chiang <pchiang@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
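      A rough userspace illustration of the PF_KTHREAD check follows.  The
      struct task model and the helpers can_own_mm() and pick_new_owner() are
      hypothetical; only the idea of rejecting kernel threads before comparing
      ->mm comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

#define PF_KTHREAD 0x00200000u  /* used here only as a flag bit in the model */

struct mm { int id; };
struct task {
	unsigned int flags;
	struct mm *mm;
};

/* A kthread may have borrowed the mm via use_mm(); it must never own it. */
static bool can_own_mm(const struct task *t, const struct mm *mm)
{
	if (t->flags & PF_KTHREAD)
		return false;
	return t->mm == mm;
}

/* Return the first task allowed to become mm->owner, or NULL. */
static struct task *pick_new_owner(struct task *tasks, size_t n,
				   const struct mm *mm)
{
	for (size_t i = 0; i < n; i++)
		if (can_own_mm(&tasks[i], mm))
			return &tasks[i];
	return NULL;
}
```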
    • mm/memblock.c: use PFN_DOWN · f7e2f7e8
      Fabian Frederick authored
      Replace ((x) >> PAGE_SHIFT) with the pfn macro.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
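      For readers unfamiliar with the macro, the substitution can be shown in
      a standalone sketch.  PAGE_SHIFT is assumed to be 12 (4 KiB pages) for
      the example only, and addr_to_pfn() is a hypothetical helper.

```c
#define PAGE_SHIFT 12                    /* assumption: 4 KiB pages */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)  /* round an address down to its pfn */

/* Before the cleanup this would read: return addr >> PAGE_SHIFT; */
static unsigned long addr_to_pfn(unsigned long addr)
{
	return PFN_DOWN(addr);
}
```

      Both spellings compute the same value; the macro only names the intent.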
    • mm/memory_hotplug.c: use PFN_DOWN() · c8e861a5
      Fabian Frederick authored
      Replace ((x) >> PAGE_SHIFT) with the pfn macro.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • brd: return -ENOSPC rather than -ENOMEM on page allocation failure · 96f8d8e0
      Matthew Wilcox authored
      brd is effectively a thinly provisioned device.  Thinly provisioned
      devices return -ENOSPC when they can't write a new block.  -ENOMEM is an
      implementation detail that callers shouldn't know.
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • brd: add support for rw_page() · a72132c3
      Matthew Wilcox authored
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: use bdev_read_page() / bdev_write_page() · dd6bd0d9
      Matthew Wilcox authored
      By calling the device driver to write the page directly, we avoid
      allocating a BIO, which allows us to free memory without allocating
      memory.
      
      [akpm@linux-foundation.org: fix used-uninitialized bug]
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/block_dev.c: add bdev_read_page() and bdev_write_page() · 47a191fd
      Matthew Wilcox authored
      A block device driver may choose to provide a rw_page operation.  These
      will be called when the filesystem is attempting to do page sized I/O to
      page cache pages (i.e. not for direct I/O).  This does preclude I/Os that
      are larger than page size, so this may only be a performance gain for
      some devices.
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
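      The optional-operation pattern can be sketched in userspace as follows.
      All types here (struct blk_ops, struct blk_dev, the RAM-backed
      ram_rw_page()) are hypothetical models rather than kernel definitions;
      the sketch assumes that a driver without rw_page makes the helper return
      -EOPNOTSUPP so the caller can fall back to the regular bio path.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct page { unsigned char data[4096]; };

struct blk_ops {
	/* Optional: page-sized I/O without building a bio. */
	int (*rw_page)(void *priv, unsigned long sector,
		       struct page *page, int write);
};

struct blk_dev {
	const struct blk_ops *ops;
	void *priv;
};

/* Model of a read helper: use rw_page if the driver provides it. */
static int bdev_read_page_sketch(struct blk_dev *bdev, unsigned long sector,
				 struct page *page)
{
	if (!bdev->ops->rw_page)
		return -EOPNOTSUPP;  /* caller falls back to the bio path */
	return bdev->ops->rw_page(bdev->priv, sector, page, 0);
}

/* Example driver: a RAM-backed store addressed in 512-byte sectors. */
static int ram_rw_page(void *priv, unsigned long sector,
		       struct page *page, int write)
{
	unsigned char *pos = (unsigned char *)priv + sector * 512;

	if (write)
		memcpy(pos, page->data, sizeof(page->data));
	else
		memcpy(page->data, pos, sizeof(page->data));
	return 0;
}

static const struct blk_ops ram_ops = { .rw_page = ram_rw_page };
static const struct blk_ops plain_ops = { .rw_page = NULL };
```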
    • fs/mpage.c: factor page_endio() out of mpage_end_io() · 57d99845
      Matthew Wilcox authored
      page_endio() takes care of updating all the appropriate page flags once
      I/O has finished to a page.  Switch to using mapping_set_error() instead
      of setting AS_EIO directly; this will handle thin-provisioned devices
      correctly.
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
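      Why mapping_set_error() matters for thin-provisioned devices can be
      shown with a small model.  The flag values and struct mapping below are
      hypothetical; the sketch assumes the helper records -ENOSPC separately
      from every other error, which is what lets an out-of-space condition
      reach the caller as such rather than as a generic I/O error.

```c
#include <errno.h>

/* Hypothetical flag bits standing in for AS_EIO and AS_ENOSPC. */
enum {
	AS_EIO_BIT    = 1u << 0,
	AS_ENOSPC_BIT = 1u << 1,
};

struct mapping { unsigned int flags; };

/* Record an I/O error on the mapping, keeping -ENOSPC distinguishable. */
static void mapping_set_error_sketch(struct mapping *m, int error)
{
	if (!error)
		return;
	if (error == -ENOSPC)
		m->flags |= AS_ENOSPC_BIT;
	else
		m->flags |= AS_EIO_BIT;
}
```

      Setting the EIO bit directly, as the old code did, would collapse both
      cases into one.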
    • fs/mpage.c: factor clean_buffers() out of __mpage_writepage() · 90768eee
      Matthew Wilcox authored
      __mpage_writepage() is over 200 lines long, has 20 local variables, four
      goto labels and could desperately use simplification.  Splitting
      clean_buffers() into a helper function improves matters a little,
      removing 20+ lines from it.
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/buffer.c: remove block_write_full_page_endio() · 1b938c08
      Matthew Wilcox authored
      The last in-tree caller of block_write_full_page_endio() was removed in
      January 2013.  It's time to remove the EXPORT_SYMBOL, which leaves
      block_write_full_page() as the only caller of
      block_write_full_page_endio(), so inline block_write_full_page_endio()
      into block_write_full_page().
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: avoid throttling reclaim for loop-back nfsd threads · 399ba0b9
      NeilBrown authored
      When a loopback NFS mount is active and the backing device for the NFS
      mount becomes congested, that can impose throttling delays on the nfsd
      threads.
      
      These delays significantly reduce throughput and so the NFS mount remains
      congested.
      
      This results in a livelock and the reduced throughput persists.
      
      This livelock has been found in testing with the 'wait_iff_congested'
      call, and could possibly be caused by the 'congestion_wait' call.
      
      This livelock is similar to the deadlock which justified the introduction
      of PF_LESS_THROTTLE, and the same flag can be used to remove this
      livelock.
      
      To minimise the impact of the change, we still throttle nfsd when the
      filesystem it is writing to is congested, but not when some separate
      filesystem (e.g.  the NFS filesystem) is congested.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: numa: add migrated transhuge pages to LRU the same way as base pages · 11de9927
      Mel Gorman authored
      Migration of misplaced transhuge pages uses page_add_new_anon_rmap() when
      putting the page back, as it avoids an atomic operation and adds the new
      page to the correct LRU.  A side-effect is that the page gets marked
      activated as part of the migration, meaning that migrated transhuge pages
      are treated differently from migrated base pages from an aging
      perspective.
      
      This patch uses page_add_anon_rmap() and putback_lru_page() on completion
      of a transhuge migration similar to base page migration.  It would require
      fewer atomic operations to use lru_cache_add without taking an additional
      reference to the page.  The downside would be that it's still different to
      base page migration and unevictable pages may be added to the wrong LRU
      for cleaning up later.  Testing of the usual workloads did not show any
      adverse impact to the change.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: simplify synchronization scheme · bd673145
      Vladimir Davydov authored
      At present, we have the following mutexes protecting data related to per
      memcg kmem caches:
      
       - slab_mutex.  This one is held during the whole kmem cache creation
         and destruction paths.  We also take it when updating per root cache
         memcg_caches arrays (see memcg_update_all_caches).  As a result, taking
         it guarantees there will be no changes to any kmem cache (including per
         memcg).  Why do we need something else then?  The point is it is
         private to slab implementation and has some internal dependencies with
         other mutexes (get_online_cpus).  So we just don't want to rely upon it
         and prefer to introduce additional mutexes instead.
      
       - activate_kmem_mutex.  Initially it was added to synchronize
         initializing kmem limit (memcg_activate_kmem).  However, since we can
         grow per root cache memcg_caches arrays only on kmem limit
         initialization (see memcg_update_all_caches), we also employ it to
         protect against memcg_caches arrays relocation (e.g.  see
         __kmem_cache_destroy_memcg_children).
      
       - We have a convention not to take slab_mutex in memcontrol.c, but we
         want to walk over per memcg memcg_slab_caches lists there (e.g.  for
         destroying all memcg caches on offline).  So we have per memcg
         slab_caches_mutex's protecting those lists.
      
      The mutexes are taken in the following order:
      
         activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
      
      Such a synchronization scheme has a number of flaws, for instance:
      
       - We can't call kmem_cache_{destroy,shrink} while walking over a
         memcg::memcg_slab_caches list due to locking order.  As a result, in
         mem_cgroup_destroy_all_caches we schedule the
         memcg_cache_params::destroy work shrinking and destroying the cache.
      
       - We don't have a mutex to synchronize per memcg caches destruction
         between memcg offline (mem_cgroup_destroy_all_caches) and root cache
         destruction (__kmem_cache_destroy_memcg_children).  Currently we just
         don't bother about it.
      
      This patch simplifies it by substituting per memcg slab_caches_mutex's
      with the global memcg_slab_mutex.  It will be held whenever a new per
      memcg cache is created or destroyed, so it protects per root cache
      memcg_caches arrays and per memcg memcg_slab_caches lists.  The locking
      order is as follows:
      
         activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
      
      This allows us to call kmem_cache_{create,shrink,destroy} under the
      memcg_slab_mutex.  As a result, we don't need memcg_cache_params::destroy
      work any more - we can simply destroy caches while iterating over a per
      memcg slab caches list.
      
      Also using the global mutex simplifies synchronization between concurrent
      per memcg caches creation/destruction, e.g.  mem_cgroup_destroy_all_caches
      vs __kmem_cache_destroy_memcg_children.
      
      The downside of this is that we substitute per-memcg slab_caches_mutex's
      with a hammer-like global mutex, but since we already take either the
      slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
      shouldn't hurt concurrency a lot.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
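      The new ordering rule (activate_kmem_mutex, then memcg_slab_mutex, then
      slab_mutex) can be illustrated with a tiny rank checker.  This is a
      hedged sketch of the ordering idea only: it takes no real locks, and
      enum lock_rank, struct held_locks, acquire() and release() are all
      hypothetical names invented for the example.

```c
/* Ranks follow the order given in the commit message: lower is outer. */
enum lock_rank {
	ACTIVATE_KMEM_MUTEX = 1,
	MEMCG_SLAB_MUTEX    = 2,
	SLAB_MUTEX          = 3,
};

#define MAX_HELD 3

struct held_locks {
	enum lock_rank stack[MAX_HELD];
	int depth;
};

/* Return 0 if taking 'r' respects the order, -1 on a violation. */
static int acquire(struct held_locks *h, enum lock_rank r)
{
	if (h->depth == MAX_HELD)
		return -1;
	if (h->depth > 0 && h->stack[h->depth - 1] >= r)
		return -1;  /* an inner lock is already held */
	h->stack[h->depth++] = r;
	return 0;
}

/* Drop the most recently taken lock. */
static void release(struct held_locks *h)
{
	if (h->depth > 0)
		h->depth--;
}
```

      Taking slab_mutex while memcg_slab_mutex is held passes the check, which
      mirrors why caches can now be destroyed while walking a per memcg list;
      the reverse nesting is rejected.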
    • memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab · c67a8a68
      Vladimir Davydov authored
      Currently we have two pairs of kmemcg-related functions that are called on
      slab alloc/free.  The first is memcg_{bind,release}_pages that count the
      total number of pages allocated on a kmem cache.  The second is
      memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
      counter.  Let's just merge them to keep the code clean.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: do not schedule cache destruction when last page goes away · 1e32e77f
      Vladimir Davydov authored
      This patchset is a part of preparations for kmemcg re-parenting.  It
      aims at simplifying kmemcg work-flows and synchronization.
      
      First, it removes async per memcg cache destruction (see patches 1, 2).
      Now caches are only destroyed on memcg offline.  That means the caches
      that are not empty on memcg offline will be leaked.  However, they are
      already leaked, because memcg_cache_params::nr_pages normally never drops
      to 0 so the destruction work is never scheduled unless kmem_cache_shrink
      is called explicitly.  In the future I'm planning to reap such dead caches
      on vmpressure or periodically.
      
      Second, it substitutes per memcg slab_caches_mutex's with the global
      memcg_slab_mutex, which should be taken during the whole per memcg cache
      creation/destruction path before the slab_mutex (see patch 3).  This
      greatly simplifies synchronization among various per memcg cache
      creation/destruction paths.
      
      I'm still not quite sure about the end picture, in particular I don't know
      whether we should reap dead memcgs' kmem caches periodically or try to
      merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
      more details), but whichever way we choose, this set looks like a
      reasonable change to me, because it greatly simplifies kmemcg work-flows
      and eases further development.
      
      This patch (of 3):
      
      After a memcg is offlined, we mark its kmem caches that cannot be deleted
      right now due to pending objects as dead by setting the
      memcg_cache_params::dead flag, so that memcg_release_pages will schedule
      cache destruction (memcg_cache_params::destroy) as soon as the last slab
      of the cache is freed (memcg_cache_params::nr_pages drops to zero).
      
      I guess the idea was to destroy the caches as soon as possible, i.e.
      immediately after freeing the last object.  However, it just doesn't work
      that way, because kmem caches always preserve some pages for the sake of
      performance, so that nr_pages never gets to zero unless the cache is
      shrunk explicitly using kmem_cache_shrink.  Of course, we could account
      the total number of objects on the cache or check if all the slabs
      allocated for the cache are empty on kmem_cache_free and schedule
      destruction if so, but that would be too costly.
      
      Thus we have a piece of code that works only when we explicitly call
      kmem_cache_shrink, but complicates the whole picture a lot.  Moreover,
      it's racy in fact.  For instance, kmem_cache_shrink may free the last slab
      and thus schedule cache destruction before it finishes checking that the
      cache is empty, which can lead to use-after-free.
      
      So I propose to remove this async cache destruction from
      memcg_release_pages, and check if the cache is empty explicitly after
      calling kmem_cache_shrink instead.  This will simplify things a lot w/o
      introducing any functional changes.
      
      And regarding dead memcg caches (i.e.  those that are left hanging around
      after memcg offline because they still have objects), I suppose we should reap them
      either periodically or on vmpressure as Glauber suggested initially.  I'm
      going to implement this later.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: do not hang on OOM when killed by userspace OOM access to memory reserves · d8dc595c
      Michal Hocko authored
      Eric has reported that he can see task(s) stuck in memcg OOM handler
      regularly.  The only way out is to
      
      	echo 0 > $GROUP/memory.oom_control
      
      His usecase is:
      
      - Setup a hierarchy with memory and the freezer (disable kernel oom and
        have a process watch for oom).
      
      - In that memory cgroup add a process with one thread per cpu.
      
      - In one thread slowly allocate once per second I think it is 16M of ram
        and mlock and dirty it (just to force the pages into ram and stay
        there).
      
      - When oom is achieved loop:
        * attempt to freeze all of the tasks.
        * if frozen send every task SIGKILL, unfreeze, remove the directory in
          cgroupfs.
      
      Eric has then pinpointed the issue to be memcg specific.
      
      All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
      Those that have received fatal signal will bypass the charge and should
      continue on their way out.  The tricky part is that the exit path might
      trigger a page fault (e.g.  exit_robust_list), thus the memcg charge,
      while its memcg is still under OOM because nobody has released any charges
      yet.
      
      Unlike with the in-kernel OOM handler the exiting task doesn't get
      TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
      and falls to the memcg OOM again without any way out of it as there are no
      fatal signals pending anymore.
      
      This patch fixes the issue by checking PF_EXITING early in
      mem_cgroup_try_charge and bypassing the charge, the same as if the task
      had a fatal signal pending or TIF_MEMDIE set.
      
      Normally exiting tasks (aka not killed) will bypass the charge now but
      this should be OK as the task is leaving and will release memory and
      increasing the memory pressure just to release it in a moment seems
      dubious wasting of cycles.  Besides that charges after exit_signals should
      be rare.
      
      I am bringing this patch again (rebased on the current mmotm tree). I
      hope we can move forward finally. If there is still an opposition then
      I would really appreciate a concurrent approach so that we can discuss
      alternatives.
      
      http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
      to the followup discussion when the patch has been dropped from the mmotm
      last time.
      Reported-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
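      The bypass condition described in this commit can be modelled as a
      single predicate.  In this sketch struct task and should_bypass_charge()
      are hypothetical, and the two booleans stand in for TIF_MEMDIE and a
      pending fatal signal; only the addition of the PF_EXITING test reflects
      the change itself.

```c
#include <stdbool.h>

#define PF_EXITING 0x00000004u  /* used here only as a flag bit in the model */

struct task {
	unsigned int flags;
	bool memdie;        /* stands in for TIF_MEMDIE */
	bool fatal_signal;  /* stands in for fatal_signal_pending() */
};

/* True if the charge should be skipped instead of entering memcg OOM. */
static bool should_bypass_charge(const struct task *t)
{
	return t->memdie || t->fatal_signal || (t->flags & PF_EXITING) != 0;
}
```

      A task that was killed from userspace and has already consumed its
      fatal signal still carries PF_EXITING, so it no longer falls back into
      the memcg OOM wait.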
    • mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ZONE_NORMAL · 675becce
      Mel Gorman authored
      throttle_direct_reclaim() is meant to trigger during swap-over-network
      during which the min watermark is treated as a pfmemalloc reserve.  It
      throttles on the first node in the zonelist but this is flawed.
      
      The user-visible impact is that a process running on a CPU whose local
      memory node has no ZONE_NORMAL will stall for prolonged periods of time,
      possibly indefinitely.  This is due to throttle_direct_reclaim thinking the
      pfmemalloc reserves are depleted when in fact they don't exist on that
      node.
      
      On a NUMA machine running a 32-bit kernel (I know) allocation requests
      from CPUs on node 1 would detect no pfmemalloc reserves and the process
      gets throttled.  This patch adjusts throttling of direct reclaim to
      throttle based on the first node in the zonelist that has a usable
      ZONE_NORMAL or lower zone.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
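      The node selection rule can be sketched with a simplified model.  The
      zone layout, struct node_model and first_throttle_node() are
      hypothetical; the sketch only shows the idea of walking the zonelist
      until a node with a populated ZONE_NORMAL or lower zone is found, and
      treats "no such node" as "do not throttle".

```c
#include <stddef.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

struct node_model {
	unsigned long present_pages[MAX_NR_ZONES];
};

/*
 * Return the id of the first node in 'zonelist' that has pages in
 * ZONE_NORMAL or a lower zone, or -1 if no node qualifies.
 */
static int first_throttle_node(const struct node_model *nodes,
			       const int *zonelist, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		const struct node_model *n = &nodes[zonelist[i]];

		for (int z = ZONE_DMA; z <= ZONE_NORMAL; z++)
			if (n->present_pages[z] > 0)
				return zonelist[i];
	}
	return -1;
}
```

      On a two-node 32-bit layout where node 1 holds only highmem, a zonelist
      starting at node 1 resolves to node 0 rather than to the reserve-less
      local node.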
    • memcg: kill CONFIG_MM_OWNER · f98bafa0
      Oleg Nesterov authored
      CONFIG_MM_OWNER makes no sense.  It is not user-selectable, it is only
      selected by CONFIG_MEMCG automatically.  So we can kill this option in
      init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: remove the first mapping check · 64ac4940
      Huang Shijie authored
      Remove the first mapping check for vma_link.  Move the mutex_lock into the
      braces when vma->vm_file is true.
      Signed-off-by: Huang Shijie <b32955@freescale.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap.c: clean up *lru_cache_add* functions · 2329d375
      Jianyu Zhan authored
      In mm/swap.c, __lru_cache_add() is exported, but actually there are no
      users outside this file.
      
      This patch unexports __lru_cache_add(), and makes it static.  It also
      exports lru_cache_add_file(), as it is used by cifs and fuse, which can
      be loaded as modules.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drm/exynos: call find_vma with the mmap_sem held · cbe97414
      Jonathan Gonzalez V authored
      Performing vma lookups without taking the mm->mmap_sem is asking for
      trouble.  While doing the search, the vma in question can be modified or
      even removed before returning to the caller.  Take the lock (exclusively)
      in order to avoid races while iterating through the vmacache and/or
      rbtree.
      Signed-off-by: Jonathan Gonzalez V <zeus@gnu.org>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Inki Dae <inki.dae@samsung.com>
      Cc: Joonyoung Shim <jy0922.shim@samsung.com>
      Cc: David Airlie <airlied@linux.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arc: call find_vma with the mmap_sem held · 5040573e
      Davidlohr Bueso authored
      Performing vma lookups without taking the mm->mmap_sem is asking for
      trouble.  While doing the search, the vma in question can be modified or
      even removed before returning to the caller.  Take the lock (shared) in
      order to avoid races while iterating through the vmacache and/or rbtree.
      
      [akpm@linux-foundation.org: CSE current->active_mm, per Vineet]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation/memcg: warn about incomplete kmemcg state · 2ee06468
      Vladimir Davydov authored
      Kmemcg is currently under development and lacks some important features.
      In particular, it does not have support of kmem reclaim on memory pressure
      inside cgroup, which practically makes it unusable in real life.  Let's
      warn about it in both Kconfig and Documentation to prevent complaints
      arising.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: debug: make bad_range() output more usable and readable · 613813e8
      Dave Hansen authored
      Nobody outputs memory addresses in decimal.  PFNs are essentially
      addresses, and they're gibberish in decimal.  Output them in hex.
      
      Also, add the nid and zone name to give a little more context to the
      message.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
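      The readability argument is easy to see in a standalone example.  The
      helper format_bad_range() and its format string are illustrative
      assumptions, not a quote of the kernel message; the point is only that
      PFNs are printed with %lx and that the node id and zone name are
      included.

```c
#include <stdio.h>
#include <string.h>

/* Format an out-of-range report with hex PFNs plus node and zone context. */
static int format_bad_range(char *buf, size_t len, unsigned long pfn,
			    int nid, const char *zone_name,
			    unsigned long start_pfn, unsigned long end_pfn)
{
	return snprintf(buf, len,
			"page 0x%lx outside node %d zone %s [ 0x%lx - 0x%lx ]",
			pfn, nid, zone_name, start_pfn, end_pfn);
}
```

      PFN 0x100000 would have been printed as 1048576 in decimal, which hides
      the fact that it sits exactly on a 4 GiB boundary with 4 KiB pages.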
    • mm/compaction: cleanup isolate_freepages() · c96b9e50
      Vlastimil Babka authored
      isolate_freepages() is currently somewhat hard to follow thanks to many
      pfn variables.  In particular, the name 'high_pfn' looks like it is
      related to the 'low_pfn' variable, but in fact it is not.
      
      This patch renames the 'high_pfn' variable to a hopefully less confusing name,
      and slightly changes its handling without a functional change. A comment made
      obsolete by recent changes is also updated.
      
      [akpm@linux-foundation.org: comment fixes, per Minchan]
      [iamjoonsoo.kim@lge.com: cleanups]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: clean up unused code lines · 13fb44e4
      Heesub Shin authored
      Remove code lines currently not in use or never called.
      Signed-off-by: default avatarHeesub Shin <heesub.shin@samsung.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13fb44e4
    • Vlastimil Babka's avatar
      mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced · 5bcc9f86
      Vlastimil Babka authored
      For the MIGRATE_RESERVE pages, it is useful when they do not get
      misplaced on free_list of other migratetype, otherwise they might get
      allocated prematurely and e.g.  fragment the MIGRATE_RESERVE pageblocks.
      While this cannot be avoided completely when allocating new
      MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should
      prevent the misplacement where possible.
      
      Currently, it is possible for the misplacement to happen when a
      MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a
      fallback for other desired migratetype, and then later freed back
      through free_pcppages_bulk() without being actually used.  This happens
      because free_pcppages_bulk() uses get_freepage_migratetype() to choose
      the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
      the *desired* migratetype and not the page's original MIGRATE_RESERVE
      migratetype.
      
      This patch fixes the problem by moving the call to
      set_freepage_migratetype() from rmqueue_bulk() down to
      __rmqueue_smallest() and __rmqueue_fallback() where the actual page's
      migratetype (i.e.  the free_list from which the page is taken) is used.
      Note that this migratetype might be different from the pageblock's
      migratetype due to freepage stealing decisions.  This is OK, as page
      stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
      to leave all MIGRATE_CMA pages on the correct freelist.
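
      The misplacement described above can be illustrated with a toy model.
      This is only a sketch, not kernel code: every toy_* name is hypothetical,
      and the only thing modelled is which migratetype gets recorded on the
      page and therefore which free_list the page returns to when it is freed
      from the pcplist.

```c
#include <assert.h>

enum toy_migratetype { TOY_UNMOVABLE, TOY_MOVABLE, TOY_RESERVE };

/* Toy page: only remembers the migratetype recorded at allocation time. */
struct toy_page {
	enum toy_migratetype freepage_migratetype;
};

/* Old behaviour: the bulk refill records the *desired* migratetype. */
static inline void toy_tag_in_rmqueue_bulk(struct toy_page *page,
					   enum toy_migratetype taken_from,
					   enum toy_migratetype desired)
{
	(void)taken_from;	/* the list the page came from is ignored */
	page->freepage_migratetype = desired;
}

/* New behaviour: record the free_list the page was actually taken from. */
static inline void toy_tag_in_rmqueue_smallest(struct toy_page *page,
					       enum toy_migratetype taken_from)
{
	page->freepage_migratetype = taken_from;
}

/* Model of the free path: the recorded type selects the free_list. */
static inline enum toy_migratetype toy_free_list_for(const struct toy_page *page)
{
	assert(page);
	return page->freepage_migratetype;
}
```

      With the old tagging, a page taken from the reserve list as a fallback
      for a movable request is freed to the movable list; with the new
      tagging it goes back to the reserve list.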
      
      Therefore, as an additional benefit, the call to
      get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
      be removed completely.  This relies on the fact that MIGRATE_CMA
      pageblocks are created only during system init, and the above.  The
      related is_migrate_isolate() check is also unnecessary, as memory
      isolation has other ways to move pages between freelists, and drain pcp
      lists containing pages that should be isolated.  The buffered_rmqueue()
      can also benefit from calling get_freepage_migratetype() instead of
      get_pageblock_migratetype().
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
      Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Suggested-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Wang, Yalin" <Yalin.Wang@sonymobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5bcc9f86
    • Vladimir Davydov's avatar
      slab: get_online_mems for kmem_cache_{create,destroy,shrink} · 03afc0e2
      Vladimir Davydov authored
      When we create a sl[au]b cache, we allocate kmem_cache_node structures
      for each online NUMA node.  To handle nodes taken online/offline, we
      register a memory hotplug notifier and, for each kmem cache, allocate/free
      the kmem_cache_node corresponding to the node that changes its state.
      
      To synchronize between the two paths we hold the slab_mutex during both
      the cache creation/destruction path and while tuning per-node parts of
      kmem caches in memory hotplug handler, but that's not quite right,
      because it does not guarantee that a newly created cache will have all
      kmem_cache_nodes initialized in case it races with memory hotplug.  For
      instance, in case of slub:
      
          CPU0                            CPU1
          ----                            ----
          kmem_cache_create:              online_pages:
           __kmem_cache_create:            slab_memory_callback:
                                            slab_mem_going_online_callback:
                                             lock slab_mutex
                                             for each slab_caches list entry
                                                 allocate kmem_cache node
                                             unlock slab_mutex
            lock slab_mutex
            init_kmem_cache_nodes:
             for_each_node_state(node, N_NORMAL_MEMORY)
                 allocate kmem_cache node
            add kmem_cache to slab_caches list
            unlock slab_mutex
                                          online_pages (continued):
                                           node_states_set_node
      
      As a result we'll get a kmem cache with not all kmem_cache_nodes
      allocated.
      
      To avoid issues like that we should hold get/put_online_mems() during
      the whole kmem cache creation/destruction/shrink paths, just like we
      deal with cpu hotplug.  This patch does the trick.
      
      Note that after it's applied, there is no need to take the slab_mutex
      for kmem_cache_shrink any more, so it is removed from there.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03afc0e2
    • Vladimir Davydov's avatar
      mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov authored
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
        myself, because it used an rw semaphore for get/put_online_mems,
        making them deadlock prone.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
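
      The reader-side idea can be sketched with a single-threaded toy model.
      This is an assumption-laden illustration, not the kernel implementation:
      the toy_* names are hypothetical, locking is omitted, and where the real
      hotplug path would wait for readers to leave, the model only reports
      whether hotplug could proceed.  The point it shows is that several
      get/put sections may be active at once, unlike with a plain mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of currently active get/put sections (toy model, no locking). */
static int toy_online_mems_refcount;

static inline void toy_get_online_mems(void)
{
	toy_online_mems_refcount++;
}

static inline void toy_put_online_mems(void)
{
	assert(toy_online_mems_refcount > 0);
	toy_online_mems_refcount--;
}

/*
 * The real hotplug side would sleep until all sections are left; this
 * model only answers whether it would be allowed to start right now.
 */
static inline bool toy_mem_hotplug_may_begin(void)
{
	return toy_online_mems_refcount == 0;
}

/* Caller pattern: keep the section open across the whole per-node walk. */
static inline int toy_count_online_nodes(const bool *node_online, int nr_nodes)
{
	int i, n = 0;

	toy_get_online_mems();
	for (i = 0; i < nr_nodes; i++)
		if (node_online[i])
			n++;
	toy_put_online_mems();
	return n;
}
```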
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
    • Vladimir Davydov's avatar
      memcg: un-export __memcg_kmem_get_cache · e8d9df3a
      Vladimir Davydov authored
      It is only used in slab and should not be used anywhere else, so there is
      no need to export it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8d9df3a
    • Mel Gorman's avatar
      mm: page_alloc: do not cache reclaim distances · 5f7a75ac
      Mel Gorman authored
      pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
      by zone_reclaim due to its distance.  As it is expected that
      zone_reclaim_mode will be rarely enabled, it is unreasonable for all
      machines to take a penalty.  Fortunately, the zone_reclaim() path
      is already slow and it is the path that takes the hit.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f7a75ac
    • Mel Gorman's avatar
      mm: disable zone_reclaim_mode by default · 4f9b16a6
      Mel Gorman authored
      When it was introduced, zone_reclaim_mode made sense as NUMA distances
      were punishing and workloads were generally partitioned to fit into a NUMA
      node.  NUMA machines are now common but few of the workloads are
      NUMA-aware and it's routine to see major performance degradation due to
      zone_reclaim_mode being enabled but relatively few can identify the
      problem.
      
      Those that require zone_reclaim_mode are likely to be able to detect
      when it needs to be enabled and tune appropriately so let's have a
      sensible default for the bulk of users.
      
      This patch (of 2):
      
      zone_reclaim_mode causes processes to prefer reclaiming memory from
      local node instead of spilling over to other nodes.  This made sense
      initially when NUMA machines were almost exclusively HPC and the
      workload was partitioned into nodes.  The NUMA penalties were
      sufficiently high to justify reclaiming the memory.  On current machines
      and workloads it is often the case that zone_reclaim_mode destroys
      performance but not all users know how to detect this.  Favour the
      common case and disable it by default.  Users that are sophisticated
      enough to know they need zone_reclaim_mode will detect it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f9b16a6
    • Luiz Capitulino's avatar
      hugetlb: add support for gigantic page allocation at runtime · 944d9fec
      Luiz Capitulino authored
      HugeTLB is limited to allocating hugepages whose size is less than
      MAX_ORDER order.  This is so because HugeTLB allocates hugepages via the
      buddy allocator.  Gigantic pages (that is, pages whose size is greater
      than MAX_ORDER order) have to be allocated at boottime.
      
      However, boottime allocation has at least two serious problems.  First,
      it doesn't support NUMA and second, gigantic pages allocated at boottime
      can't be freed.
      
      This commit solves both issues by adding support for allocating gigantic
      pages during runtime.  It works just like regular sized hugepages,
      meaning that the interface in sysfs is the same, it supports NUMA, and
      gigantic pages can be freed.
      
      For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
      gigantic pages on node 1, one can do:
      
       # echo 2 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      And to free them all:
      
       # echo 0 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      The one problem with gigantic page allocation at runtime is that it
      can't be serviced by the buddy allocator.  To overcome that problem,
      this commit scans all zones from a node looking for a large enough
      contiguous region.  When one is found, it's allocated by using CMA, that
      is, we call alloc_contig_range() to do the actual allocation.  For
      example, on x86_64 we scan all zones looking for a 1GB contiguous
      region.  When one is found, it's allocated by alloc_contig_range().
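
      The scan described above can be sketched with a toy userspace model.
      This is an illustration under stated assumptions, not the code added by
      this commit: struct toy_zone, the page_busy[] array and all toy_* names
      are hypothetical, and marking pages busy stands in for the call to
      alloc_contig_range().  Candidates are tried at offsets aligned to the
      gigantic page size, one gigantic page apart.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_PFN (~0UL)

/* Toy zone: page_busy[i] describes page frame (start_pfn + i). */
struct toy_zone {
	unsigned long start_pfn;
	unsigned long spanned_pages;
	bool *page_busy;
};

/* Round pfn up to a multiple of nr_pages (nr_pages is a power of two). */
static inline unsigned long toy_align_up(unsigned long pfn, unsigned long nr_pages)
{
	return (pfn + nr_pages - 1) & ~(nr_pages - 1);
}

static inline bool toy_range_is_free(const struct toy_zone *z,
				     unsigned long pfn, unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		if (z->page_busy[pfn - z->start_pfn + i])
			return false;
	return true;
}

/* Return the first pfn of the claimed region, or TOY_NO_PFN if none fits. */
static inline unsigned long toy_alloc_gigantic(struct toy_zone *z,
					       unsigned long nr_pages)
{
	unsigned long end = z->start_pfn + z->spanned_pages;
	unsigned long pfn, i;

	assert(nr_pages && !(nr_pages & (nr_pages - 1)));

	for (pfn = toy_align_up(z->start_pfn, nr_pages);
	     pfn + nr_pages <= end; pfn += nr_pages) {
		if (!toy_range_is_free(z, pfn, nr_pages))
			continue;
		/* Stand-in for alloc_contig_range(): claim the whole region. */
		for (i = 0; i < nr_pages; i++)
			z->page_busy[pfn - z->start_pfn + i] = true;
		return pfn;
	}
	return TOY_NO_PFN;
}
```

      A caller would run this over each zone of the requested node and stop
      at the first zone that yields a region.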
      
      One expected issue with that approach is that such gigantic contiguous
      regions tend to vanish as runtime goes by.  The best way to avoid this
      for now is to make gigantic page allocations very early during system
      boot, say from an init script.  Other possible optimizations include using
      compaction, which is supported by CMA but is not explicitly used by this
      commit.
      
      It's also important to note the following:
      
       1. Gigantic pages allocated at boottime by the hugepages= command-line
          option can be freed at runtime just fine
      
       2. This commit adds support for gigantic pages only to x86_64. The
          reason is that I don't have access to nor experience with other archs.
          The code is arch independent though, so it should be simple to add
          support to different archs
      
       3. I didn't add support for hugepage overcommit, that is, allocating
          a gigantic page on demand when
          /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
          think it's reasonable to do the hard and long work required for
          allocating a gigantic page at fault time. But it should be simple
          to add this if wanted
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      944d9fec
    • Luiz Capitulino's avatar
      hugetlb: move helpers up in the file · 1cac6f2c
      Luiz Capitulino authored
      Next commit will add new code which will want to call
      for_each_node_mask_to_alloc() macro.  Move it, its buddy
      for_each_node_mask_to_free() and their dependencies up in the file so the
      new code can use them.  This is just code movement, no logic change.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cac6f2c
    • Luiz Capitulino's avatar
      hugetlb: update_and_free_page(): don't clear PG_reserved bit · a7407a27
      Luiz Capitulino authored
      Hugepages never get the PG_reserved bit set, so don't clear it.
      
      However, note that if the bit gets mistakenly set free_pages_check() will
      catch it.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7407a27
    • Luiz Capitulino's avatar
      hugetlb: add hstate_is_gigantic() · bae7f4ae
      Luiz Capitulino authored
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bae7f4ae
    • Luiz Capitulino's avatar
      hugetlb: prep_compound_gigantic_page(): drop __init marker · 2906dd52
      Luiz Capitulino authored
      The HugeTLB subsystem uses the buddy allocator to allocate hugepages
      during runtime.  This means that hugepages allocation during runtime is
      limited to MAX_ORDER order.  For archs supporting gigantic pages (that
      is, page sizes greater than MAX_ORDER), this in turn means that those
      pages can't be allocated at runtime.
      
      HugeTLB supports gigantic page allocation during boottime, via the boot
      allocator.  To this end the kernel provides the command-line options
      hugepagesz= and hugepages=, which can be used to instruct the kernel to
      allocate N gigantic pages during boot.
      
      For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages
      can be allocated and freed at runtime.  If one wants to allocate 1G
      gigantic pages, this has to be done at boot via the hugepagesz= and
      hugepages= command-line options.
      
      Now, gigantic page allocation at boottime has two serious problems:
      
       1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
          evenly distributes boottime allocated hugepages among nodes.
      
          For example, suppose you have a four-node NUMA machine and want
          to allocate four 1G gigantic pages at boottime. The kernel will
          allocate one gigantic page per node.
      
          On the other hand, we do have users who want to be able to specify
          which NUMA node gigantic pages should be allocated from, so that
          they can place virtual machines on a specific NUMA node.
      
       2. Gigantic pages allocated at boottime can't be freed
      
      At this point it's important to observe that regular hugepages allocated
      at runtime don't have those problems.  This is so because the HugeTLB
      interface for runtime allocation in sysfs supports NUMA and runtime
      allocated pages can be freed just fine via the buddy allocator.
      
      This series adds support for allocating gigantic pages at runtime.  It
      does so by allocating gigantic pages via CMA instead of the buddy
      allocator.  Releasing gigantic pages is also supported via CMA.  As this
      series builds on top of the existing HugeTLB interface, it makes gigantic
      page allocation and releasing just like regular sized hugepages.  This
      also means that NUMA support just works.
      
      For example, to allocate two 1G gigantic pages on node 1, one can do:
      
       # echo 2 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      And, to release all gigantic pages on the same node:
      
       # echo 0 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      Please refer to patch 5/5 for full technical details.
      
      Finally, please note that this series is a follow up for a previous series
      that tried to extend the command-line options set to be NUMA aware:
      
       http://marc.info/?l=linux-mm&m=139593335312191&w=2
      
      During the discussion of that series it was agreed that having runtime
      allocation support for gigantic pages was a better solution.
      
      This patch (of 5):
      
      This function is going to be used by non-init code in a future
      commit.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2906dd52
    • Duan Jiong's avatar
      mm/mmap.c: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · 14bd5b45
      Duan Jiong authored
      Fix a coccinelle error regarding usage of IS_ERR and PTR_ERR instead of
      PTR_ERR_OR_ZERO.
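
      For readers unfamiliar with the pattern, here is a minimal userspace
      sketch of the conversion.  The helpers below are stand-ins modelled on
      the kernel's error-pointer helpers so the example compiles outside the
      kernel; the two toy_check_* functions are hypothetical and are not the
      code touched in mm/mmap.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-ins modelled on the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

/* Open-coded form of the kind coccinelle reports. */
static inline int toy_check_open_coded(const void *ptr)
{
	if (IS_ERR(ptr))
		return (int)PTR_ERR(ptr);
	return 0;
}

/* Equivalent form after the conversion. */
static inline int toy_check_converted(const void *ptr)
{
	return PTR_ERR_OR_ZERO(ptr);
}
```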
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14bd5b45