1. 13 Nov, 2013 40 commits
    • Akira Takeuchi's avatar
      mm: ensure get_unmapped_area() returns higher address than mmap_min_addr · 2afc745f
      Akira Takeuchi authored
      This patch fixes the problem that get_unmapped_area() can return illegal
      address and result in failing mmap(2) etc.
      
      In case that the address higher than PAGE_SIZE is set to
      /proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be
      returned by get_unmapped_area(), even if you do not pass any virtual
      address hint (i.e.  the second argument).
      
      This is because the current get_unmapped_area() code does not take into
      account mmap_min_addr.
      
      This leads to two actual problems as follows:
      
      1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO,
         although any illegal parameter is not passed.
      
      2. The bottom-up search path after the top-down search might not work in
         arch_get_unmapped_area_topdown().
      
      Note: The first and third chunk of my patch, which changes "len" check,
      are for more precise check using mmap_min_addr, and not for solving the
      above problem.
      
      [How to reproduce]
      
      	--- test.c -------------------------------------------------
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/errno.h>
      
      	int main(int argc, char *argv[])
      	{
      		void *ret = NULL, *last_map;
      		size_t pagesize = sysconf(_SC_PAGESIZE);
      
      		do {
      			last_map = ret;
      			ret = mmap(0, pagesize, PROT_NONE,
      				MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      	//		printf("ret=%p\n", ret);
      		} while (ret != MAP_FAILED);
      
      		if (errno != ENOMEM) {
      			printf("ERR: unexpected errno: %d (last map=%p)\n",
      			errno, last_map);
      		}
      
      		return 0;
      	}
      	---------------------------------------------------------------
      
      	$ gcc -m32 -o test test.c
      	$ sudo sysctl -w vm.mmap_min_addr=65536
      	vm.mmap_min_addr = 65536
      	$ ./test  (run as non-priviledge user)
      	ERR: unexpected errno: 1 (last map=0x10000)
      Signed-off-by: default avatarAkira Takeuchi <takeuchi.akr@jp.panasonic.com>
      Signed-off-by: default avatarKiyoshi Owada <owada.kiyoshi@jp.panasonic.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2afc745f
    • KOSAKI Motohiro's avatar
      mm: __rmqueue_fallback() should respect pageblock type · 0cbef29a
      KOSAKI Motohiro authored
      When __rmqueue_fallback() doesn't find a free block with the required size
      it splits a larger page and puts the rest of the page onto the free list.
      
      But it has one serious mistake.  When putting back, __rmqueue_fallback()
      always use start_migratetype if type is not CMA.  However,
      __rmqueue_fallback() is only called when all of the start_migratetype
      queue is empty.  That said, __rmqueue_fallback always puts back memory to
      the wrong queue except try_to_steal_freepages() changed pageblock type
      (i.e.  requested size is smaller than half of page block).  The end result
      is that the antifragmentation framework increases fragmenation instead of
      decreasing it.
      
      Mel's original anti fragmentation does the right thing.  But commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.
      
      This patch restores sane and old behavior.  It also removes an incorrect
      comment which was introduced by commit fef903ef ("mm/page_alloc.c:
      restructure free-page stealing code and fix a bug").
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0cbef29a
    • KOSAKI Motohiro's avatar
      mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag() · 52c8f6a5
      KOSAKI Motohiro authored
      In general, every tracepoint should be zero overhead if it is disabled.
      However, trace_mm_page_alloc_extfrag() is one of exception.  It evaluate
      "new_type == start_migratetype" even if tracepoint is disabled.
      
      However, the code can be moved into tracepoint's TP_fast_assign() and
      TP_fast_assign exist exactly such purpose.  This patch does it.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52c8f6a5
    • KOSAKI Motohiro's avatar
      mm: fix page_group_by_mobility_disabled breakage · 5d0f3f72
      KOSAKI Motohiro authored
      Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
      MIGRATE_ISOLATE if page_group_by_mobility_disabled is true.  It rewrites
      the argument to MIGRATE_UNMOVABLE and we lost these attribute.
      
      The problem was introduced by commit 49255c61 ("page allocator: move
      check for disabled anti-fragmentation out of fastpath").  So a 4 year
      old issue may mean that nobody uses page_group_by_mobility_disabled.
      
      But anyway, this patch fixes the problem.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d0f3f72
    • Damien Ramonda's avatar
      readahead: fix sequential read cache miss detection · af248a0c
      Damien Ramonda authored
      The kernel's readahead algorithm sometimes interprets random read
      accesses as sequential and triggers unnecessary data prefecthing from
      storage device (impacting random read average latency).
      
      In order to identify sequential cache read misses, the readahead
      algorithm intends to check whether offset - previous offset == 1
      (trivial sequential reads) or offset - previous offset == 0 (sequential
      reads not aligned on page boundary):
      
        if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
      
      The current offset is stored in the "offset" variable of type "pgoff_t"
      (unsigned long), while previous offset is stored in "ra->prev_pos" of
      type "loff_t" (long long).  Therefore, operands of the if statement are
      implicitly converted to type long long.  Consequently, when previous
      offset > current offset (which happens on random pattern), the if
      condition is true and access is wrongly interpeted as sequential.  An
      unnecessary data prefetching is triggered, impacting the average random
      read latency.
      
      Storing the previous offset value in a "pgoff_t" variable (unsigned
      long) fixes the sequential read detection logic.
      Signed-off-by: default avatarDamien Ramonda <damien.ramonda@intel.com>
      Reviewed-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarPierre Tardy <pierre.tardy@intel.com>
      Acked-by: default avatarDavid Cohen <david.a.cohen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af248a0c
    • Mel Gorman's avatar
      mm: do not walk all of system memory during show_mem · c78e9363
      Mel Gorman authored
      It has been reported on very large machines that show_mem is taking almost
      5 minutes to display information.  This is a serious problem if there is
      an OOM storm.  The bulk of the cost is in show_mem doing a very expensive
      PFN walk to give us the following information
      
        Total RAM:       Also available as totalram_pages
        Highmem pages:   Also available as totalhigh_pages
        Reserved pages:  Can be inferred from the zone structure
        Shared pages:    PFN walk required
        Unshared pages:  PFN walk required
        Quick pages:     Per-cpu walk required
      
      Only the shared/unshared pages requires a full PFN walk but that
      information is useless.  It is also inaccurate as page pins of unshared
      pages would be accounted for as shared.  Even if the information was
      accurate, I'm struggling to think how the shared/unshared information
      could be useful for debugging OOM conditions.  Maybe it was useful before
      rmap existed when reclaiming shared pages was costly but it is less
      relevant today.
      
      The PFN walk could be optimised a bit but why bother as the information is
      useless.  This patch deletes the PFN walker and infers the total RAM,
      highmem and reserved pages count from struct zone.  It omits the
      shared/unshared page usage on the grounds that it is useless.  It also
      corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE
      has similar problems to HighMem with respect to lowmem/highmem exhaustion.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c78e9363
    • Daeseok Youn's avatar
    • Toshi Kani's avatar
      mm: clear N_CPU from node_states at CPU offline · 807a1bd2
      Toshi Kani authored
      vmstat_cpuup_callback() is a CPU notifier callback, which marks N_CPU to a
      node at CPU online event.  However, it does not update this N_CPU info at
      CPU offline event.
      
      Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
      node is put into offline, i.e.  the node no longer has any online CPU.
      Signed-off-by: default avatarToshi Kani <toshi.kani@hp.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      807a1bd2
    • Toshi Kani's avatar
      mm: set N_CPU to node_states during boot · d7e0b37a
      Toshi Kani authored
      After a system booted, N_CPU is not set to any node as has_cpu shows an
      empty line.
      
        # cat /sys/devices/system/node/has_cpu
        (show-empty-line)
      
      setup_vmstat() registers its CPU notifier callback,
      vmstat_cpuup_callback(), which marks N_CPU to a node when a CPU is put
      into online.  However, setup_vmstat() is called after all CPUs are
      launched in the boot sequence.
      
      Changed setup_vmstat() to mark N_CPU to the nodes with online CPUs at
      boot, which is consistent with other operations in
      vmstat_cpuup_callback(), i.e.  start_cpu_timer() and
      refresh_zone_stat_thresholds().
      
      Also added get_online_cpus() to protect the for_each_online_cpu() loop.
      Signed-off-by: default avatarToshi Kani <toshi.kani@hp.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7e0b37a
    • Tang Chen's avatar
      mem-hotplug: introduce movable_node boot option · c5320926
      Tang Chen authored
      The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
      As we mentioned before, if hotpluggable memory is used by the kernel, it
      cannot be hot-removed.  So memory hotplug users may want to set all
      hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
      
      Memory hotplug users may also set a node as movable node, which has
      ZONE_MOVABLE only, so that the whole node can be hot-removed.
      
      But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
      kernel cannot use memory in movable nodes.  This will cause NUMA
      performance down.  And other users may be unhappy.
      
      So we need a way to allow users to enable and disable this functionality.
      In this patch, we introduce movable_node boot option to allow users to
      choose to not to consume hotpluggable memory at early boot time and later
      we can set it as ZONE_MOVABLE.
      
      To achieve this, the movable_node boot option will control the memblock
      allocation direction.  That said, after memblock is ready, before SRAT is
      parsed, we should allocate memory near the kernel image as we explained in
      the previous patches.  So if movable_node boot option is set, the kernel
      does the following:
      
      1. After memblock is ready, make memblock allocate memory bottom up.
      2. After SRAT is parsed, make memblock behave as default, allocate memory
         top down.
      
      Users can specify "movable_node" in kernel commandline to enable this
      functionality.  For those who don't use memory hotplug or who don't want
      to lose their NUMA performance, just don't specify anything.  The kernel
      will work as before.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Suggested-by: default avatarKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: default avatarIngo Molnar <mingo@kernel.org>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c5320926
    • Tang Chen's avatar
      x86, acpi, crash, kdump: do reserve_crashkernel() after SRAT is parsed. · fa591c4a
      Tang Chen authored
      Memory reserved for crashkernel could be large.  So we should not allocate
      this memory bottom up from the end of kernel image.
      
      When SRAT is parsed, we will be able to know which memory is hotpluggable,
      and we can avoid allocating this memory for the kernel.  So reorder
      reserve_crashkernel() after SRAT is parsed.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa591c4a
    • Tang Chen's avatar
      x86/mem-hotplug: support initialize page tables in bottom-up · b959ed6c
      Tang Chen authored
      The Linux kernel cannot migrate pages used by the kernel.  As a result,
      kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
      memory for the kernel.
      
      In a memory hotplug system, any numa node the kernel resides in should be
      unhotpluggable.  And for a modern server, each node could have at least
      16GB memory.  So memory around the kernel image is highly likely
      unhotpluggable.
      
      ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
      info.  But before SRAT is parsed, memblock has already started to allocate
      memory for the kernel.  So we need to prevent memblock from doing this.
      
      So direct memory mapping page tables setup is the case.
      init_mem_mapping() is called before SRAT is parsed.  To prevent page
      tables being allocated within hotpluggable memory, we will use bottom-up
      direction to allocate page tables from the end of kernel image to the
      higher memory.
      
      Note:
      As for allocating page tables in lower memory, TJ said:
      
      : This is an optional behavior which is triggered by a very specific kernel
      : boot param, which I suspect is gonna need to stick around to support
      : memory hotplug in the current setup unless we add another layer of address
      : translation to support memory hotplug.
      
      As for page tables may occupy too much lower memory if using 4K mapping
      (CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both disable using >4k
      pages), TJ said:
      
      : But as I said in the same paragraph, parsing SRAT earlier doesn't solve
      : the problem in itself either.  Ignoring the option if 4k mapping is
      : required and memory consumption would be prohibitive should work, no?
      : Something like that would be necessary if we're gonna worry about cases
      : like this no matter how we implement it, but, frankly, I'm not sure this
      : is something worth worrying about.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b959ed6c
    • Tang Chen's avatar
      x86/mm: factor out of top-down direct mapping setup · 0167d7d8
      Tang Chen authored
      Create a new function memory_map_top_down to factor out of the top-down
      direct memory mapping pagetable setup.  This is also a preparation for the
      following patch, which will introduce the bottom-up memory mapping.  That
      said, we will put the two ways of pagetable setup into separate functions,
      and choose to use which way in init_mem_mapping, which makes the code more
      clear.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0167d7d8
    • Tang Chen's avatar
      mm/memblock.c: introduce bottom-up allocation mode · 79442ed1
      Tang Chen authored
      The Linux kernel cannot migrate pages used by the kernel.  As a result,
      kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
      memory for the kernel.
      
      ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
      info.  But before SRAT is parsed, memblock has already started to allocate
      memory for the kernel.  So we need to prevent memblock from doing this.
      
      In a memory hotplug system, any numa node the kernel resides in should be
      unhotpluggable.  And for a modern server, each node could have at least
      16GB memory.  So memory around the kernel image is highly likely
      unhotpluggable.
      
      So the basic idea is: Allocate memory from the end of the kernel image and
      to the higher memory.  Since memory allocation before SRAT is parsed won't
      be too much, it could highly likely be in the same node with kernel image.
      
      The current memblock can only allocate memory top-down.  So this patch
      introduces a new bottom-up allocation mode to allocate memory bottom-up.
      And later when we use this allocation direction to allocate memory, we
      will limit the start address above the kernel.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79442ed1
    • Tang Chen's avatar
      mm/memblock.c: factor out of top-down allocation · 1402899e
      Tang Chen authored
      [Problem]
      
      The current Linux cannot migrate pages used by the kernel because of the
      kernel direct mapping.  In Linux kernel space, va = pa + PAGE_OFFSET.
      When the pa is changed, we cannot simply update the pagetable and keep the
      va unmodified.  So the kernel pages are not migratable.
      
      There are also some other issues will cause the kernel pages not
      migratable.  For example, the physical address may be cached somewhere and
      will be used.  It is not to update all the caches.
      
      When doing memory hotplug in Linux, we first migrate all the pages in one
      memory device somewhere else, and then remove the device.  But if pages
      are used by the kernel, they are not migratable.  As a result, memory used
      by the kernel cannot be hot-removed.
      
      Modifying the kernel direct mapping mechanism is too difficult to do.  And
      it may cause the kernel performance down and unstable.  So we use the
      following way to do memory hotplug.
      
      [What we are doing]
      
      In Linux, memory in one numa node is divided into several zones.  One of
      the zones is ZONE_MOVABLE, which the kernel won't use.
      
      In order to implement memory hotplug in Linux, we are going to arrange all
      hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these
      memory.  To do this, we need ACPI's help.
      
      In ACPI, SRAT(System Resource Affinity Table) contains NUMA info.  The
      memory affinities in SRAT record every memory range in the system, and
      also, flags specifying if the memory range is hotpluggable.  (Please refer
      to ACPI spec 5.0 5.2.16)
      
      With the help of SRAT, we have to do the following two things to achieve our
      goal:
      
      1. When doing memory hot-add, allow the users arranging hotpluggable as
         ZONE_MOVABLE.
         (This has been done by the MOVABLE_NODE functionality in Linux.)
      
      2. when the system is booting, prevent bootmem allocator from allocating
         hotpluggable memory for the kernel before the memory initialization
         finishes.
      
      The problem 2 is the key problem we are going to solve. But before solving it,
      we need some preparation. Please see below.
      
      [Preparation]
      
      Bootloader has to load the kernel image into memory.  And this memory must
      be unhotpluggable.  We cannot prevent this anyway.  So in a memory hotplug
      system, we can assume any node the kernel resides in is not hotpluggable.
      
      Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
       But memblock has already started to work.  In the current kernel,
      memblock allocates the following memory before SRAT is parsed:
      
      setup_arch()
       |->memblock_x86_fill()            /* memblock is ready */
       |......
       |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
       |->reserve_real_mode()            /* allocate memory under 1MB */
       |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
       |->dma_contiguous_reserve()       /* specified by user, should be low */
       |->setup_log_buf()                /* specified by user, several mega bytes */
       |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
       |->acpi_initrd_override()         /* several mega bytes */
       |->reserve_crashkernel()          /* could be large, should reorder */
       |......
       |->initmem_init()                 /* Parse SRAT */
      
      According to Tejun's advice, before SRAT is parsed, we should try our best
      to allocate memory near the kernel image.  Since the whole node the kernel
      resides in won't be hotpluggable, and for a modern server, a node may have
      at least 16GB memory, allocating several mega bytes memory around the
      kernel image won't cross to hotpluggable memory.
      
      [About this patchset]
      
      So this patchset is the preparation for the problem 2 that we want to
      solve.  It does the following:
      
      1. Make memblock be able to allocate memory bottom up.
         1) Keep all the memblock APIs' prototype unmodified.
         2) When the direction is bottom up, keep the start address greater than the
            end of kernel image.
      
      2. Improve init_mem_mapping() to support allocate page tables in
         bottom up direction.
      
      3. Introduce "movable_node" boot option to enable and disable this
         functionality.
      
      This patch (of 6):
      
      Create a new function __memblock_find_range_top_down to factor out of
      top-down allocation from memblock_find_in_range_node.  This is a
      preparation because we will introduce a new bottom-up allocation mode in
      the following patch.
      Signed-off-by: default avatarTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1402899e
    • Heiko Carstens's avatar
      s390/mmap: randomize mmap base for bottom up direction · 7aba842f
      Heiko Carstens authored
      Implement mmap base randomization for the bottom up direction, so ASLR
      works for both mmap layouts on s390.  See also commit df54d6fa ("x86
      get_unmapped_area(): use proper mmap base for bottom-up direction").
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Radu Caragea <sinaelgl@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7aba842f
    • Heiko Carstens's avatar
      mmap: arch_get_unmapped_area(): use proper mmap base for bottom up direction · 4e99b021
      Heiko Carstens authored
      This is more or less the generic variant of commit 41aacc1e ("x86
      get_unmapped_area: Access mmap_legacy_base through mm_struct member").
      
      So effectively architectures which use an own arch_pick_mmap_layout()
      implementation but call the generic arch_get_unmapped_area() now can
      also randomize their mmap_base.
      
      All architectures which have an own arch_pick_mmap_layout() and call the
      generic arch_get_unmapped_area() (arm64, s390, tile) currently set
      mmap_base to TASK_UNMAPPED_BASE.  This is also true for the generic
      arch_pick_mmap_layout() function.  So this change is a no-op currently.
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Radu Caragea <sinaelgl@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e99b021
    • Weijie Yang's avatar
      mm/zswap: avoid unnecessary page scanning · b349acc7
      Weijie Yang authored
      Add SetPageReclaim() before __swap_writepage() so that page can be moved
      to the tail of the inactive list, which can avoid unnecessary page
      scanning as this page was reclaimed by swap subsystem before.
      Signed-off-by: default avatarWeijie Yang <weijie.yang@samsung.com>
      Reviewed-by: default avatarBob Liu <bob.liu@oracle.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarSeth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b349acc7
    • Jan Kara's avatar
      writeback: do not sync data dirtied after sync start · c4a391b5
      Jan Kara authored
      When there are processes heavily creating small files while sync(2) is
      running, it can easily happen that quite some new files are created
      between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
      especially if there are several busy filesystems (remember that sync
      traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
      fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
      (e.g.  causes a transaction commit and cache flush for each inode in
      ext3), resulting sync(2) times are rather large.
      
      The following script reproduces the problem:
      
        function run_writers
        {
          for (( i = 0; i < 10; i++ )); do
            mkdir $1/dir$i
            for (( j = 0; j < 40000; j++ )); do
              dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
            done &
          done
        }
      
        for dir in "$@"; do
          run_writers $dir
        done
      
        sleep 40
        time sync
      
      Fix the problem by disregarding inodes dirtied after sync(2) was called
      in the WB_SYNC_ALL pass.  To allow for this, sync_inodes_sb() now takes
      a time stamp when sync has started which is used for setting up work for
      flusher threads.
      
      To give some numbers, when above script is run on two ext4 filesystems
      on simple SATA drive, the average sync time from 10 runs is 267.549
      seconds with standard deviation 104.799426.  With the patched kernel,
      the average sync time from 10 runs is 2.995 seconds with standard
      deviation 0.096.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c4a391b5
    • Naoya Horiguchi's avatar
      tools/vm/page-types.c: support KPF_SOFTDIRTY bit · 46c77e2b
      Naoya Horiguchi authored
      Soft dirty bit allows us to track which pages are written since the last
      clear_ref (by "echo 4 > /proc/pid/clear_refs".) This is useful for
      userspace applications to know their memory footprints.
      
      Note that the kernel exposes this flag via bit[55] of /proc/pid/pagemap,
      and the semantics is not a default one (scheduled to be the default in the
      near future.) However, it shifts to the new semantics at the first
      clear_ref, and the users of soft dirty bit always do it before utilizing
      the bit, so that's not a big deal.  Users must avoid relying on the bit in
      page-types before the first clear_ref.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46c77e2b
    • Naoya Horiguchi's avatar
      /proc/pid/smaps: show VM_SOFTDIRTY flag in VmFlags line · ec8e41ae
      Naoya Horiguchi authored
      This flag shows that the VMA is "newly created" and thus represents
      "dirty" in the task's VM.
      
      You can clear it by "echo 4 > /proc/pid/clear_refs."
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec8e41ae
    • Zhang Yanfei's avatar
    • Krzysztof Kozlowski's avatar
      frontswap: enable call to invalidate area on swapoff · 58e97ba6
      Krzysztof Kozlowski authored
      During swapoff the frontswap_map was NULL-ified before calling
      frontswap_invalidate_area().  However the frontswap_invalidate_area()
      exits early if frontswap_map is NULL.  Invalidate was never called
      during swapoff.
      
      This patch moves frontswap_map_set() in swapoff just after calling
      frontswap_invalidate_area() so outside of locks (swap_lock and
      swap_info_struct->lock).  This shouldn't be a problem as during swapon
      the frontswap_map_set() is called also outside of any locks.
      Signed-off-by: default avatarKrzysztof Kozlowski <k.kozlowski@samsung.com>
      Reviewed-by: default avatarSeth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58e97ba6
    • Seth Jennings's avatar
    • Catalin Marinas's avatar
      mm: kmemleak: avoid false negatives on vmalloc'ed objects · 7f88f88f
      Catalin Marinas authored
      Commit 248ac0e1 ("mm/vmalloc: remove guard page from between vmap
      blocks") had the side effect of making vmap_area.va_end member point to
      the next vmap_area.va_start.  This was creating an artificial reference
      to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.
      
      This patch marks the vmap_area containing pointers explicitly and
      reduces the min ref_count to 2 as vm_struct still contains a reference
      to the vmalloc'ed object.  The kmemleak add_scan_area() function has
      been improved to allow a SIZE_MAX argument covering the rest of the
      object (for simpler calling sites).
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f88f88f
    • Zhang Yanfei's avatar
      mm/sparsemem: fix a bug in free_map_bootmem when CONFIG_SPARSEMEM_VMEMMAP · 81556b02
      Zhang Yanfei authored
      We pass the number of pages which hold page structs of a memory section
      to free_map_bootmem().  This is right when !CONFIG_SPARSEMEM_VMEMMAP but
      wrong when CONFIG_SPARSEMEM_VMEMMAP.  When CONFIG_SPARSEMEM_VMEMMAP, we
      should pass the number of pages of a memory section to free_map_bootmem.
      
      So the fix is removing the nr_pages parameter.  When
      CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco
      PAGES_PER_SECTION in free_map_bootmem.  When !CONFIG_SPARSEMEM_VMEMMAP,
      we calculate page numbers needed to hold the page structs for a memory
      section and use the value in free_map_bootmem().
      
      This was found by reading the code.  And I have no machine that support
      memory hot-remove to test the bug now.
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      81556b02
    • Zhang Yanfei's avatar
      mm/sparsemem: use PAGES_PER_SECTION to remove redundant nr_pages parameter · 85b35fea
      Zhang Yanfei authored
      For below functions,
      
      - sparse_add_one_section()
      - kmalloc_section_memmap()
      - __kmalloc_section_memmap()
      - __kfree_section_memmap()
      
      they are always invoked to operate on one memory section, so it is
      redundant to always pass a nr_pages parameter, which is the page numbers
      in one section.  So we can directly use predefined macro PAGES_PER_SECTION
      instead of passing the parameter.
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85b35fea
    • Ying Han's avatar
      memcg: support hierarchical memory.numa_stats · 071aee13
      Ying Han authored
      The memory.numa_stat file was not hierarchical.  Memory charged to the
      children was not shown in parent's numa_stat.
      
      This change adds the "hierarchical_" stats to the existing stats.  The
      new hierarchical stats include the sum of all children's values in
      addition to the value of the memcg.
      
      Tested: Create cgroup a, a/b and run workload under b.  The values of
      b are included in the "hierarchical_*" under a.
      
      $ cd /sys/fs/cgroup
      $ echo 1 > memory.use_hierarchy
      $ mkdir a a/b
      
      Run workload in a/b:
      $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &
      
      The hierarchical_ fields in parent (a) show use of workload in a/b:
      $ cat a/memory.numa_stat
      total=0 N0=0 N1=0 N2=0 N3=0
      file=0 N0=0 N1=0 N2=0 N3=0
      anon=0 N0=0 N1=0 N2=0 N3=0
      unevictable=0 N0=0 N1=0 N2=0 N3=0
      hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
      hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
      hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
      hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
      
      $ cat a/b/memory.numa_stat
      total=908 N0=552 N1=317 N2=39 N3=0
      file=850 N0=549 N1=301 N2=0 N3=0
      anon=58 N0=3 N1=16 N2=39 N3=0
      unevictable=0 N0=0 N1=0 N2=0 N3=0
      hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
      hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
      hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
      hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      071aee13
    • Greg Thelen's avatar
      memcg: refactor mem_control_numa_stat_show() · 25485de6
      Greg Thelen authored
      Refactor mem_control_numa_stat_show() to use a new stats structure for
      smaller and simpler code.  This consolidates nearly identical code.
      
          text      data      bss        dec      hex   filename
        8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before
        8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25485de6
    • Jianguo Wu's avatar
      mm/mempolicy: use NUMA_NO_NODE · b76ac7e7
      Jianguo Wu authored
      Use more appropriate NUMA_NO_NODE instead of -1
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b76ac7e7
    • Bob Liu's avatar
      mm: thp: khugepaged: add policy for finding target node · 9f1b868a
      Bob Liu authored
      Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace with a
      hugepage which is allocated from the node of the first scanned normal
      page, but this policy is too rough and may end with unexpected result to
      upper users.
      
      The problem is the original page-balancing among all nodes will be
      broken after hugepaged started.  Thinking about the case if the first
      scanned normal page is allocated from node A, most of other scanned
      normal pages are allocated from node B or C..  But hugepaged will always
      allocate hugepage from node A which will cause extra memory pressure on
      node A which is not the situation before khugepaged started.
      
      This patch try to fix this problem by making khugepaged allocate
      hugepage from the node which have max record of scaned normal pages hit,
      so that the effect to original page-balancing can be minimized.
      
      The other problem is if normal scanned pages are equally allocated from
      Node A,B and C, after khugepaged started Node A will still suffer extra
      memory pressure.
      
      Andrew Davidoff reported a related issue several days ago.  He wanted
      his application interleaving among all nodes and "numactl
      --interleave=all ./test" was used to run the testcase, but the result
      wasn't not as expected.
      
        cat /proc/2814/numa_maps:
        7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098
      
      The end result showed that most pages are from Node3 instead of
      interleave among node0-3 which was unreasonable.
      
      This patch also fix this issue by allocating hugepage round robin from
      all nodes have the same record, after this patch the result was as
      expected:
      
        7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722
      
      The simple testcase is like this:
      
      int main() {
      	char *p;
      	int i;
      	int j;
      
      	for (i=0; i < 200; i++) {
      		p = (char *)malloc(1048576);
      		printf("malloc done\n");
      
      		if (p == 0) {
      			printf("Out of memory\n");
      			return 1;
      		}
      		for (j=0; j < 1048576; j++) {
      			p[j] = 'A';
      		}
      		printf("touched memory\n");
      
      		sleep(1);
      	}
      	printf("enter sleep\n");
      	while(1) {
      		sleep(100);
      	}
      }
      
      [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
      Reported-by: default avatarAndrew Davidoff <davidoff@qedmf.net>
      Tested-by: default avatarAndrew Davidoff <davidoff@qedmf.net>
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9f1b868a
    • Bob Liu's avatar
      mm: thp: cleanup: mv alloc_hugepage to better place · 10dc4155
      Bob Liu authored
      Move alloc_hugepage() to a better place, no need for a seperate #ifndef
      CONFIG_NUMA
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Davidoff <davidoff@qedmf.net>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10dc4155
    • Christian Hesse's avatar
    • Wanpeng Li's avatar
      revert mm/vmalloc.c: emit the failure message before return · b82225f3
      Wanpeng Li authored
      Don't warn twice in __vmalloc_area_node and __vmalloc_node_range if
      __vmalloc_area_node allocation failure.  This patch reverts commit
      46c001a2 ("mm/vmalloc.c: emit the failure message before return").
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b82225f3
    • Wanpeng Li's avatar
      mm/vmalloc: revert "mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info" · af12346c
      Wanpeng Li authored
      The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e00 ("mm:
      avoid null pointer access in vm_struct via /proc/vmallocinfo") is used
      to avoid accessing the pages field with unallocated page when
      show_numa_info() is called.
      
      This patch moves the check just before show_numa_info in order that some
      messages still can be dumped via /proc/vmallocinfo.  This patch reverts
      commit d157a558 ("mm/vmalloc.c: check VM_UNINITIALIZED flag in
      s_show instead of show_numa_info");
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af12346c
    • Wanpeng Li's avatar
      mm/vmalloc: fix show vmap_area information race with vmap_area tear down · c2ce8c14
      Wanpeng Li authored
      There is a race window between vmap_area tear down and show vmap_area
      information.
      
      	A                                                B
      
      remove_vm_area
      spin_lock(&vmap_area_lock);
      va->vm = NULL;
      va->flags &= ~VM_VM_AREA;
      spin_unlock(&vmap_area_lock);
      						spin_lock(&vmap_area_lock);
      						if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEZING))
      							return 0;
      						if (!(va->flags & VM_VM_AREA)) {
      							seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
      								(void *)va->va_start, (void *)va->va_end,
      								va->va_end - va->va_start);
      							return 0;
      						}
      free_unmap_vmap_area(va);
      	flush_cache_vunmap
      	free_unmap_vmap_area_noflush
      		unmap_vmap_area
      		free_vmap_area_noflush
      			va->flags |= VM_LAZY_FREE
      
      The assumption !VM_VM_AREA represents vm_map_ram allocation is
      introduced by d4033afd ("mm, vmalloc: iterate vmap_area_list,
      instead of vmlist, in vmallocinfo()").
      
      However, !VM_VM_AREA also represents vmap_area is being tear down in
      race window mentioned above.  This patch fix it by don't dump any
      information for !VM_VM_AREA case and also remove (VM_LAZY_FREE |
      VM_LAZY_FREEING) check since they are not possible for !VM_VM_AREA case.
      Suggested-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c2ce8c14
    • Wanpeng Li's avatar
      mm/vmalloc: don't set area->caller twice · 3722e13c
      Wanpeng Li authored
      The caller address has already been set in set_vmalloc_vm(), there's no
      need to set it again in __vmalloc_area_node.
      Reviewed-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3722e13c
    • David Rientjes's avatar
      mm, mempolicy: make mpol_to_str robust and always succeed · 948927ee
      David Rientjes authored
      mpol_to_str() should not fail.  Currently, it either fails because the
      string buffer is too small or because a string hasn't been defined for a
      mempolicy mode.
      
      If a new mempolicy mode is introduced and no string is defined for it,
      just warn and return "unknown".
      
      If the buffer is too small, just truncate the string and return, the
      same behavior as snprintf().
      
      This also fixes a bug where there was no NULL-byte termination when doing
      *p++ = '=' and *p++ ':' and maxlen has been reached.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Chen Gang <gang.chen@asianux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      948927ee
    • Jianguo Wu's avatar
      mm/arch: use NUMA_NO_NODE · 40c3baa7
      Jianguo Wu authored
      Use more appropriate NUMA_NO_NODE instead of -1 in all archs' module_alloc()
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      40c3baa7
    • Naoya Horiguchi's avatar
      mm/memory-failure.c: move set_migratetype_isolate() outside get_any_page() · 03b61ff3
      Naoya Horiguchi authored
      Chen Gong pointed out that set/unset_migratetype_isolate() was done in
      different functions in mm/memory-failure.c, which makes the code less
      readable/maintainable.  So this patch does it in soft_offline_page().
      
      With this patch, we get to hold lock_memory_hotplug() longer but it's
      not a problem because races between memory hotplug and soft offline are
      very rare.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarChen, Gong <gong.chen@linux.intel.com>
      Acked-by: default avatarAndi Kleen <ak@linux.intel.com>
      Reviewed-by: default avatarWanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03b61ff3