1. 08 Oct, 2016 40 commits
    • oom: warn if we go OOM for higher order and compaction is disabled · 9254990f
      Michal Hocko authored
      Since the lumpy reclaim is gone there is no source of higher order pages
      if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is
      unreliable for that purpose, to say the least.  Hitting an OOM for
      !costly higher order requests is therefore not all that hard to imagine.
      We are trying hard to not invoke OOM killer as much as possible but
      there is simply no reliable way to detect whether more reclaim retries
      make sense.
      
      Disabling COMPACTION is not widespread but it seems that some users
      might have disabled the feature without realizing the full consequences
      (mostly along with disabling THP because compaction used to be mainly a
      THP thing).  This patch just adds a note if the OOM killer was
      triggered by a higher order request with compaction disabled.  This will
      help us identify possible misconfiguration right from the oom report,
      which is easier than always keeping in mind that somebody might have
      disabled COMPACTION without a good reason.
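
      For illustration, a minimal sketch of the kind of note described above, as
      it could sit in the OOM report path (placement and exact wording are
      illustrative, not the verbatim patch):

        /* somewhere in the oom report path, e.g. dump_header() */
        if (oc->order && !IS_ENABLED(CONFIG_COMPACTION))
                pr_warn("COMPACTION is disabled - higher order allocations cannot be satisfied reliably\n");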
      
      Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9254990f
    • mm: don't use radix tree writeback tags for pages in swap cache · 371a096e
      Huang Ying authored
      File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
      etc.) to accelerate finding the pages with a specific tag in the radix
      tree during inode writeback.  But for anonymous pages in the swap cache,
      there is no inode writeback.  So there is no need to find the pages with
      some writeback tags in the radix tree.  It is not necessary to touch
      radix tree writeback tags for pages in the swap cache.
      
      Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
      introduced for address spaces which don't need to update the writeback
      tags.  The flag is set for swap caches.  It may be used for DAX file
      systems, etc.
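
      As a sketch of the idea, with helper names modeled on the description above
      (treat them as illustrative rather than the exact API):

        /* mark an address space as not needing radix tree writeback tags */
        static inline void mapping_set_no_writeback_tags(struct address_space *mapping)
        {
                set_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
        }

        /* writeback paths skip tag maintenance for such mappings */
        static inline bool mapping_use_writeback_tags(struct address_space *mapping)
        {
                return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
        }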
      
      With this patch, the swap-out bandwidth improved by 22.3% (from ~1.2GB/s
      to ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
      The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
      simulated PMEM (persistent memory) device.  The improvement comes from
      the reduced contention on the swap cache radix tree lock.  To test
      sequential swapping out, the test case uses 8 processes, which
      sequentially allocate and write to the anonymous pages until RAM and
      part of the swap device is used up.
      
      Details of the comparison are as follows:
      
      base             base+patch
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
         2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
         1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
           10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
           10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
           10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
           10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
      
      Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      371a096e
    • mm/bootmem.c: replace kzalloc() by kzalloc_node() · 1d8bf926
      zijun_hu authored
      In ___alloc_bootmem_node_nopanic(), replace kzalloc() by kzalloc_node()
      in order to preferentially allocate memory within the given node when slab
      is available.
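
      Roughly, the change amounts to the following (the GFP flag is shown only
      for illustration):

        /* before: no node preference once slab is available */
        return kzalloc(size, GFP_NOWAIT);

        /* after: prefer memory on the node we are allocating for */
        return kzalloc_node(size, GFP_NOWAIT, pgdat->node_id);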
      
      Link: http://lkml.kernel.org/r/1f487f12-6af4-5e4f-a28c-1de2361cdcd8@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d8bf926
    • mm/nobootmem.c: remove duplicate macro ARCH_LOW_ADDRESS_LIMIT statements · 2382705f
      zijun_hu authored
      Fix the following bugs:
      
       - the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between
         header and relevant source
      
       - an ARCH_LOW_ADDRESS_LIMIT possibly defined by the architecture in
         asm/processor.h is not guaranteed to take precedence over the default
         in linux/bootmem.h, since the former header isn't included by the
         latter (see the sketch below)
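
      A sketch of the intended pattern in linux/bootmem.h (the fallback value
      shown is the generic default; placement is illustrative):

        #include <asm/processor.h>      /* may provide ARCH_LOW_ADDRESS_LIMIT */

        #ifndef ARCH_LOW_ADDRESS_LIMIT
        #define ARCH_LOW_ADDRESS_LIMIT  0xffffffffUL
        #endif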
      
      Link: http://lkml.kernel.org/r/e046aeaa-e160-6d9e-dc1b-e084c2fd999f@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2382705f
    • powerpc: implement arch_reserved_kernel_pages · 1e76609c
      Srikar Dronamraju authored
      Currently a significant amount of memory is reserved only in a kernel
      booted to capture a kernel dump using the fa_dump method.

      Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialize
      only a certain amount of memory per node.  That amount takes into account
      the dentry and inode cache sizes.  Currently the cache sizes are
      calculated based on the total system memory, including the reserved
      memory.  However, when such a kernel is booted as the fadump kernel, it
      will not be able to allocate the required amount of memory for the
      dentry and inode caches.  This results in crashes like the one below.

      Hence implement arch_reserved_kernel_pages() only for CONFIG_FA_DUMP
      configurations.  The amount reserved will be subtracted while calculating
      the large caches, which avoids crashes like the one below on large
      systems such as 32 TB systems.
      
        Dentry cache hash table entries: 536870912 (order: 16, 4294967296 bytes)
        vmalloc: allocation failure, allocated 4097114112 of 17179934720 bytes
        swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6-master+ #3
        Call Trace:
           dump_stack+0xb0/0xf0 (unreliable)
           warn_alloc_failed+0x114/0x160
           __vmalloc_node_range+0x304/0x340
           __vmalloc+0x6c/0x90
           alloc_large_system_hash+0x1b8/0x2c0
           inode_init+0x94/0xe4
           vfs_caches_init+0x8c/0x13c
           start_kernel+0x50c/0x578
           start_here_common+0x20/0xa8
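
      A minimal sketch of what the powerpc side could look like under these
      assumptions (illustrative; the fadump reservation is already accounted in
      memblock):

        #ifdef CONFIG_FA_DUMP
        /* report the fadump reservation so large hashes are sized on usable memory */
        unsigned long __init arch_reserved_kernel_pages(void)
        {
                return memblock_reserved_size() / PAGE_SIZE;
        }
        #endif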
      
      Link: http://lkml.kernel.org/r/1472476010-4709-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e76609c
    • mm/memblock.c: expose total reserved memory · 8907de5d
      Srikar Dronamraju authored
      The total reserved memory in a system is accounted but not available
      for use outside mm/memblock.c.  By exposing the total reserved memory,
      systems can better calculate the size of large hashes.
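
      A sketch of the accessor this describes (the total is already tracked
      inside struct memblock):

        phys_addr_t __init_memblock memblock_reserved_size(void)
        {
                /* sum of all memblock.reserved regions, maintained by memblock */
                return memblock.reserved.total_size;
        }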
      
      Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8907de5d
    • mm: introduce arch_reserved_kernel_pages() · f6f34b43
      Srikar Dronamraju authored
      Currently arch specific code can reserve memory blocks but
      alloc_large_system_hash() may not take it into consideration when sizing
      the hashes.  This can lead to bigger hashes than required and leave no
      memory available for other purposes.  This is specifically true for
      systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
      
      One approach to solve this problem would be to walk through the memblock
      regions, calculate the available memory, and base the size of the hashes
      on that available memory.
      
      The other approach would be to depend on the architecture to provide the
      number of pages that are reserved.  This change provides hooks to allow
      the architecture to provide the required info.
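
      One possible shape of such a hook, sketched with the common kernel idiom of
      a weak default that architectures override (the exact override mechanism is
      an assumption here):

        /* default: architectures with no special reservation report nothing */
        unsigned long __weak __init arch_reserved_kernel_pages(void)
        {
                return 0;
        }

        /*
         * alloc_large_system_hash() can then size hashes on usable memory only,
         * e.g. nr_pages = nr_kernel_pages - arch_reserved_kernel_pages();
         */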
      
      Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Suggested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6f34b43
    • mm: use zonelist name instead of using hardcoded index · c9634cf0
      Aneesh Kumar K.V authored
      Use the existing enums instead of hardcoded index when looking at the
      zonelist.  This makes it more readable.  No functionality change by this
      patch.
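
      For example, the kind of change this describes (the index names are the
      existing ZONELIST_* enum values):

        /* before: magic index into node_zonelists[] */
        zonelist = &NODE_DATA(nid)->node_zonelists[0];

        /* after: the intent is explicit */
        zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];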
      
      Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9634cf0
    • oom, oom_reaper: allow to reap mm shared by the kthreads · 1b51e65e
      Michal Hocko authored
      The oom reaper was skipped for an mm which is shared with a kernel thread
      (aka use_mm()).  The primary concern was that such a kthread might want
      to read from the userspace memory and see a zero page as a result of the
      oom reaper action.  This is no longer a problem after "mm: make sure
      that kthreads will not refault oom reaped memory" because any attempt to
      fault in while MMF_UNSTABLE is set will result in SIGBUS, so the
      target user should see an error.  This means that we can finally allow
      the oom reaper to act also on tasks which share their mm with kthreads.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b51e65e
    • mm: make sure that kthreads will not refault oom reaped memory · 3f70dc38
      Michal Hocko authored
      There are only a few use_mm() users in the kernel right now.  Most of them
      write to the target memory but vhost driver relies on
      copy_from_user/get_user from a kernel thread context.  This makes it
      impossible to reap the memory of an oom victim which shares the mm with
      the vhost kernel thread because it could see a zero page unexpectedly
      and theoretically make an incorrect decision visible outside of the
      killed task context.
      
      To quote Michael S. Tsirkin:
      : Getting an error from __get_user and friends is handled gracefully.
      : Getting zero instead of a real value will cause userspace
      : memory corruption.
      
      The vhost kernel thread is bound to an open fd of the vhost device which
      is not tied to the mm owner life cycle in general.  The device fd can
      be inherited or passed over to another process which means that we
      really have to be careful about unexpected memory corruption because
      unlike for normal oom victims the result will be visible outside of the
      oom victim context.
      
      Make sure that no kthread context (users of use_mm) can ever see
      corrupted data because of the oom reaper and hook into the page fault
      path by checking MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set the
      flag before it starts unmapping the address space while the flag is
      checked after the page fault has been handled.  If the flag is set then
      SIGBUS is triggered so any g-u-p user will get an error code.
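
      A rough sketch of that check as it might sit at the end of the fault path
      (the condition is simplified here and confined to kthread/use_mm contexts):

                /* the oom reaper may have already unmapped parts of this mm */
                if (unlikely((current->flags & PF_KTHREAD) &&
                             test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
                        ret = VM_FAULT_SIGBUS;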
      
      Regular tasks do not need this protection because all tasks which share
      the mm are killed when the mm is reaped, and so the corruption will not
      outlive them.
      
      This patch shouldn't have any visible effect at this moment because the
      OOM killer doesn't invoke oom reaper for tasks with mm shared with
      kthreads yet.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f70dc38
    • mm, oom: enforce exit_oom_victim on current task · 38531201
      Tetsuo Handa authored
      There are no users of exit_oom_victim on a !current task anymore, so
      enforce that the API always works on the current task.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38531201
    • oom, suspend: fix oom_killer_disable vs. pm suspend properly · 7d2e7a22
      Michal Hocko authored
      Commit 74070542 ("oom, suspend: fix oom_reaper vs.
      oom_killer_disable race") has workaround an existing race between
      oom_killer_disable and oom_reaper by adding another round of
      try_to_freeze_tasks after the oom killer was disabled.  This was the
      easiest thing to do for a late 4.7 fix.  Let's fix it properly now.
      
      After "oom: keep mm of the killed task available" we no longer have to
      call exit_oom_victim from the oom reaper because we have stable mm
      available and hide the oom_reaped mm by MMF_OOM_SKIP flag.  So let's
      remove exit_oom_victim and the race described in the above commit
      doesn't exist anymore if.
      
      Unfortunately this alone is not sufficient for the oom_killer_disable
      usecase because now we do not have any reliable way to reach
      exit_oom_victim (the victim might get stuck on a way to exit for an
      unbounded amount of time).  The OOM killer can cope with that by checking
      mm flags and moving on to another victim, but we cannot do the same for
      oom_killer_disable as we would lose the guarantee of no further
      interference of the victim with the rest of the system.  What we can do
      instead is to cap the maximum time the oom_killer_disable waits for
      victims.  The only current user of this function (pm suspend) already
      has a concept of timeout for back off so we can reuse the same value
      there.
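
      One way the bounded wait could look (a sketch only; the timeout is the one
      the pm suspend freezer already uses):

                /* wait for oom victims to exit, but only up to the given timeout */
                ret = wait_event_interruptible_timeout(oom_victims_wait,
                                                       !atomic_read(&oom_victims),
                                                       timeout);
                if (ret <= 0) {
                        oom_killer_enable();
                        return false;   /* could not guarantee exclusion in time */
                }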
      
      Let's drop set_freezable for the oom_reaper kthread because it is no
      longer needed as the reaper doesn't wake or thaw any processes.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d2e7a22
    • mm, oom: get rid of signal_struct::oom_victims · 862e3073
      Michal Hocko authored
      After "oom: keep mm of the killed task available" we can safely detect
      an oom victim by checking task->signal->oom_mm so we do not need the
      signal_struct counter anymore so let's get rid of it.
      
      This alone wouldn't be sufficient for nommu archs because
      exit_oom_victim doesn't hide the process from the oom killer anymore.
      We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
      MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      862e3073
    • kernel, oom: fix potential pgd_lock deadlock from __mmdrop · 7283094e
      Michal Hocko authored
      Lockdep complains that __mmdrop is not safe from the softirq context:
      
        =================================
        [ INFO: inconsistent lock state ]
        4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
        ---------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
         (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
        {SOFTIRQ-ON-W} state was registered at:
           __lock_acquire+0xa06/0x196e
           lock_acquire+0x139/0x1e1
           _raw_spin_lock+0x32/0x41
           __change_page_attr_set_clr+0x2a5/0xacd
           change_page_attr_set_clr+0x16f/0x32c
           set_memory_nx+0x37/0x3a
           free_init_pages+0x9e/0xc7
           alternative_instructions+0xa2/0xb3
           check_bugs+0xe/0x2d
           start_kernel+0x3ce/0x3ea
           x86_64_start_reservations+0x2a/0x2c
           x86_64_start_kernel+0x17a/0x18d
        irq event stamp: 105916
        hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
        hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
        softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
        softirqs last disabled at (105879): irq_exit+0x6f/0xd1
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(pgd_lock);
          <Interrupt>
            lock(pgd_lock);
      
         *** DEADLOCK ***
      
        1 lock held by swapper/1/0:
         #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800
      
        stack backtrace:
        CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
         <IRQ>
          print_usage_bug.part.25+0x259/0x268
          mark_lock+0x381/0x567
          __lock_acquire+0x993/0x196e
          lock_acquire+0x139/0x1e1
          _raw_spin_lock+0x32/0x41
          pgd_free+0x19/0x6b
          __mmdrop+0x25/0xb9
          __put_task_struct+0x103/0x11e
          delayed_put_task_struct+0x157/0x15e
          rcu_process_callbacks+0x660/0x800
          __do_softirq+0x1ec/0x4d5
          irq_exit+0x6f/0xd1
          smp_apic_timer_interrupt+0x42/0x4d
          apic_timer_interrupt+0x8e/0xa0
         <EOI>
          arch_cpu_idle+0xf/0x11
          default_idle_call+0x32/0x34
          cpu_startup_entry+0x20c/0x399
          start_secondary+0xfe/0x101
      
      Moreover, commit a79e53d8 ("x86/mm: Fix pgd_lock deadlock") was
      explicit that pgd_lock must not be taken from irq context.  This
      means that __mmdrop called from free_signal_struct has to be postponed
      to a user context.  We already have a similar mechanism for mmput_async
      so we can use it here as well.  This is safe because mm_count is pinned
      by mm_users.
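
      A sketch of the deferral, reusing the async work member already introduced
      for mmput_async (illustrative, not the verbatim change):

        static void mmdrop_async_fn(struct work_struct *work)
        {
                struct mm_struct *mm = container_of(work, struct mm_struct,
                                                    async_put_work);
                __mmdrop(mm);
        }

        static void mmdrop_async(struct mm_struct *mm)
        {
                if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
                        INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
                        schedule_work(&mm->async_put_work);
                }
        }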
      
      This fixes a bug introduced by "oom: keep mm of the killed task available".
      
      Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7283094e
    • oom: keep mm of the killed task available · 26db62f1
      Michal Hocko authored
      oom_reap_task has to call exit_oom_victim in order to make sure that the
      oom victim will not block the oom killer forever.  This is, however,
      opening new problems (e.g. oom_killer_disable exclusion - see commit
      74070542 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
      race")).  Ideally, exit_oom_victim should only be called from the victim's
      context.
      
      One way to achieve this would be to rely on per mm_struct flags.  We
      already have MMF_OOM_REAPED to hide a task from the oom killer since
      "mm, oom: hide mm which is shared with kthread or global init". The
      problem is that the exit path:
      
        do_exit
          exit_mm
            tsk->mm = NULL;
            mmput
              __mmput
            exit_oom_victim
      
      doesn't guarantee that exit_oom_victim will get called in a bounded
      amount of time.  At least exit_aio depends on IO which might get blocked
      due to lack of memory and who knows what else is lurking there.
      
      This patch takes a different approach.  We remember tsk->mm into the
      signal_struct and bind it to the signal struct life time for all oom
      victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
      have to rely on find_lock_task_mm anymore and they will have a reliable
      reference to the mm struct.  As a result all the oom specific
      communication inside the OOM killer can be done via tsk->signal->oom_mm.
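
      A sketch of that binding as it could be done when a task is marked as an
      oom victim (illustrative; the reference is dropped again when the
      signal_struct is freed):

                /* bind the victim's mm to the signal_struct life time */
                if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
                        atomic_inc(&mm->mm_count);      /* pin the mm_struct */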
      
      Increasing the signal_struct for something as unlikely as the oom killer
      is far from ideal but this approach will make the code much more
      reasonable, and long term we might even want to move task->mm into the
      signal_struct anyway.  In the next step we might want to make the oom
      killer exclusion and access to memory reserves completely independent
      which would be also nice.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26db62f1
    • mm,oom_reaper: do not attempt to reap a task twice · 8496afab
      Tetsuo Handa authored
      "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
      OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
      But the usefulness of the flag is rather limited and actually never
      shown in practice.  If the flag is set, it means that the holder of
      mm->mmap_sem cannot call up_write() due to presumably being blocked at
      unkillable wait waiting for other thread's memory allocation.  But since
      one of threads sharing that mm will queue that mm immediately via
      task_will_free_mem() shortcut (otherwise, oom_badness() will select the
      same mm again due to oom_score_adj value unchanged), retrying
      MMF_OOM_NOT_REAPABLE mm is unlikely helpful.
      
      Let's always set MMF_OOM_REAPED.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8496afab
    • mm,oom_reaper: reduce find_lock_task_mm() usage · 7ebffa45
      Tetsuo Handa authored
      Patch series "fortify oom killer even more", v2.
      
      This patch (of 9):
      
      __oom_reap_task() can be simplified a bit if it receives a valid mm from
      oom_reap_task() which also uses that mm when __oom_reap_task() failed.
      We can drop one find_lock_task_mm() call and also make the
      __oom_reap_task() code flow easier to follow.  Moreover, this will make a
      later patch in the series easier to review.  Pinning the mm's mm_count for
      a longer time is not really harmful because this will not pin much memory.
      
      This patch doesn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ebffa45
    • mm, swap: add swap_cluster_list · 6b534915
      Huang Ying authored
      This is a code clean up patch without functionality changes.  The
      swap_cluster_list data structure and its operations are introduced to
      provide some better encapsulation for the free cluster and discard
      cluster list operations.  This avoids some code duplication, improves
      code readability, and reduces the total line count.
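
      The encapsulation is essentially a head/tail pair of cluster indexes;
      roughly (a sketch of the shape described):

        struct swap_cluster_list {
                struct swap_cluster_info head;
                struct swap_cluster_info tail;
        };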
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1472067356-16004-1-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b534915
    • mm: unrig VMA cache hit ratio · 131ddc5c
      Alexey Dobriyan authored
      The current code doesn't count the first FIND operation after a VMA cache
      flush (which happens surprisingly often), artificially increasing the
      cache hit ratio.
      
      On my regular setup the difference is:
      
      		Before				After
      	==========================================================
      
      	* boot, login into KDE
      
      	vmacache_find_calls 446216	vmacache_find_calls 492741
      	vmacache_find_hits 277596	vmacache_find_hits 276096
      
      		~62.2%				~56.0%
      
      	* rebuild kernel (no changes to code, usual config)
      
      	vmacache_find_calls 1943007	vmacache_find_calls 2083718
      	vmacache_find_hits 1246123	vmacache_find_hits 1244146
      
      		~64.1%				~59.7%
      
      	* rebuild kernel (full rebuild, usual config)
      
      	vmacache_find_calls 32163155	vmacache_find_calls 33677183
      	vmacache_find_hits 27889956	vmacache_find_hits 27877591
      
      		~88.2%				~84.3%
      
      Total: the cache hit ratio is ~4% lower.
      
      If someone is counting _relative_ cache _miss_ ratio, misreporting is much
      higher.
      
      Link: http://lkml.kernel.org/r/20160822225009.GA3934@p183.telecom.by
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      131ddc5c
    • mm: pagewalk: fix the comment for test_walk · f7e2355f
      James Morse authored
      Modify the comment describing struct mm_walk->test_walk()'s behaviour to
      match the comment on walk_page_test() and the behaviour of
      walk_page_vma().
      
      Fixes: fafaa426 ("pagewalk: improve vma handling")
      Link: http://lkml.kernel.org/r/1471622518-21980-1-git-send-email-james.morse@arm.com
      Signed-off-by: James Morse <james.morse@arm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7e2355f
    • do_generic_file_read(): fail immediately if killed · c4b209a4
      Bart Van Assche authored
      If a fatal signal has been received, fail immediately instead of trying
      to read more data.
      
      If wait_on_page_locked_killable() was interrupted then this page most
      likely is not PageUptodate() and in this case do_generic_file_read()
      will fail after lock_page_killable().
      
      See also commit ebded027 ("mm: filemap: avoid unnecessary calls to
      lock_page when waiting for IO to complete during a read")
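
      A rough sketch of the change described, with error propagation added around
      the existing wait (illustrative):

                /* before: the result of the wait was ignored */
                wait_on_page_locked_killable(page);

                /* after: a fatal signal aborts the read immediately */
                error = wait_on_page_locked_killable(page);
                if (unlikely(error))
                        goto readpage_error;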
      
      [oleg@redhat.com: changelog addition]
      Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4b209a4
    • mm/page_owner: don't define fields on struct page_ext by hard-coding · 9300d8df
      Joonsoo Kim authored
      There is a memory waste problem if we define fields on struct page_ext by
      hard-coding.  The entry size of struct page_ext includes the size of those
      fields even if the feature is disabled at runtime.  Now that requesting
      extra memory at runtime is possible, page_owner doesn't need to define its
      own fields by hard-coding.

      This patch removes the hard-coded definition and uses the extra memory for
      storing page_owner information.  Most of the code changes are just
      mechanical.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-7-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9300d8df
    • mm/page_ext: support extra space allocation by page_ext user · 980ac167
      Joonsoo Kim authored
      Until now, if a page_ext user wants to use its own field on page_ext,
      the field has to be defined in struct page_ext by hard-coding.  This
      wastes memory in the following situation.
      
        struct page_ext {
         #ifdef CONFIG_A
        	int a;
         #endif
         #ifdef CONFIG_B
        	int b;
         #endif
        };
      
      Assume that the kernel is built with both CONFIG_A and CONFIG_B.  Even if
      we enable feature A and don't enable feature B at runtime, each entry of
      struct page_ext takes two ints rather than one.  This is an undesirable
      result, so this patch tries to fix it.
      
      To solve the above problem, this patch implements support for extra space
      allocation at runtime.  When a user's need() callback returns true, its
      extra memory requirement is added to the entry size of page_ext.  Also,
      the offset of each user's extra memory space is returned.  With this
      offset, the user can use the extra space, and there is no need to define
      the field in struct page_ext by hard-coding.
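
      A sketch of what a page_ext user's registration could look like under this
      scheme (the field layout is illustrative of the need()/size/offset contract
      described above):

        struct page_ext_operations {
                size_t offset;          /* filled in by the page_ext core */
                size_t size;            /* extra bytes requested per entry */
                bool (*need)(void);     /* is the feature enabled at runtime? */
                void (*init)(void);
        };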
      
      This patch only implements the infrastructure.  A following patch will use
      it for page_owner, which is the only user having its own fields on
      page_ext.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      980ac167
    • mm/page_ext: rename offset to index · 0b06bb3f
      Joonsoo Kim authored
      Here, 'offset' means the entry index in the page_ext array.  A following
      patch will use 'offset' for the field offset within each entry, so rename
      the current 'offset' to prevent confusion.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-5-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b06bb3f
    • mm/page_owner: move page_owner specific function to page_owner.c · e2f612e6
      Joonsoo Kim authored
      There is no reason for the page_owner-specific functions to reside in
      vmstat.c.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2f612e6
    • mm/debug_pagealloc.c: don't allocate page_ext if we don't use guard page · f1c1e9f7
      Joonsoo Kim authored
      What debug_pagealloc does is just mapping/unmapping page table entries.
      Basically, it doesn't need additional memory space to remember anything.
      But, with the guard page feature, it requires additional memory to
      distinguish whether a page is a guard page or not.  Guard pages are only
      used when debug_guardpage_minorder is non-zero, so this patch removes the
      additional memory allocation (page_ext) if debug_guardpage_minorder is
      zero.
      
      It saves memory if we just use debug_pagealloc and not guard page.
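
      A sketch of the kind of need() decision this describes (helper names follow
      the existing debug_pagealloc/debug_guardpage interfaces; illustrative):

        static bool need_debug_guardpage(void)
        {
                /* no guard pages, and hence no page_ext, without debug_pagealloc */
                if (!debug_pagealloc_enabled())
                        return false;

                /* ... and none unless a guard page order was actually requested */
                if (!debug_guardpage_minorder())
                        return false;

                return true;
        }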
      
      Link: http://lkml.kernel.org/r/1471315879-32294-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1c1e9f7
    • mm/debug_pagealloc.c: clean-up guard page handling code · acbc15a4
      Joonsoo Kim authored
      Patch series "Reduce memory waste by page extension user".
      
      This patchset tries to reduce memory waste by page extension users.

      The first case is architecture-supported debug_pagealloc.  It doesn't
      require additional memory if the guard page feature isn't used.  8 bytes
      per page will be saved in this case.

      The second case is related to the page owner feature.  Until now, if
      page_ext users want to use their own fields on page_ext, the fields have
      to be defined in struct page_ext by hard-coding.  That has the following
      problem.
      
        struct page_ext {
         #ifdef CONFIG_A
        	int a;
         #endif
         #ifdef CONFIG_B
      	int b;
         #endif
        };
      
      Assume that the kernel is built with both CONFIG_A and CONFIG_B.  Even if
      we enable feature A and don't enable feature B at runtime, each entry of
      struct page_ext takes two ints rather than one.  This is undesirable
      waste, so this patchset tries to reduce it.  With this patchset, we can
      save 20 bytes per page dedicated to the page owner feature in some
      configurations.
      
      This patch (of 6):
      
      We can make the code cleaner by moving the decision condition for
      set_page_guard() into set_page_guard() itself.  This helps code
      readability.  There is no functional change.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acbc15a4
    • mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Michal Hocko authored
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during the reclaim.  Too many pages could be
      put under writeback therefore LRUs would be full of unreclaimable pages
      until the IO completes and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced since then in the
      reclaim path though.  Writers are throttled by balance_dirty_pages when
      initiating the buffered IO and later during the memory pressure, the
      direct reclaim is throttled by wait_iff_congested if the node is
      considered congested by dirty pages on LRUs and the underlying bdi is
      congested by the queued IO.  The kswapd is throttled as well if it
      encounters pages marked for immediate reclaim or under writeback, which
      signals that there are too many pages under writeback already.
      Finally should_reclaim_retry does congestion_wait if the reclaim cannot
      make any progress and there are too many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  In a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus inefficient, which would
      just make the problem worse.
      
      These three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure, so yet another
      throttling point doesn't really seem helpful.  Quite the contrary, Mikulas
      Patocka has reported that swap backed by dm-crypt doesn't work properly
      because the swapout IO cannot make sufficient progress as the writeout
      path depends on dm_crypt worker which has to allocate memory to perform
      the encryption.  In order to guarantee a forward progress it relies on
      the mempool allocator.  mempool_alloc(), however, prefers to use the
      underlying (usually page) allocator before it grabs objects from the
      pool.  Such an allocation can dive into the memory reclaim and
      consequently into throttle_vm_writeout.  If there are too many dirty pages
      or pages under writeback, it will get throttled even though it is in fact
      a flusher trying to clear pending pages.
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very helpful
      anymore.
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which has introduced that [1].  Small
      virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
      rather slow NFS in a sync mode on the host) with 8 parallel writers each
      writing 1G worth of data.  As soon as the pagecache fills up and the
      direct reclaim hits then I start anon memory consumer in a loop
      (allocating 300M and exiting after populating it) in the background to
      make the memory pressure even stronger as well as to disrupt the steady
      state for the IO.  The direct reclaim is throttled because of the
      congestion as well as kswapd hitting congestion_wait due to nr_immediate
      but throttle_vm_writeout doesn't ever trigger the sleep throughout the
      test.  Dirty+writeback are close to nr_dirty_threshold with some
      fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf484383
    • mm: fix set pageblock migratetype in deferred struct page init · e780149b
      Xishi Qiu authored
      On x86_64, MAX_ORDER_NR_PAGES is usually 4M and a pageblock is usually
      2M, so we only set one pageblock's migratetype in deferred_free_range()
      if the pfn is aligned to MAX_ORDER_NR_PAGES.  That leaves blocks with an
      uninitialized migratetype; as you can see from "cat /proc/pagetypeinfo",
      almost half of the blocks are Unmovable.

      Also, we missed freeing the last block in deferred_init_memmap(), which
      causes a memory leak.
      
      Fixes: ac5d2539 ("mm: meminit: reduce number of times pageblocks are set during struct page init")
      Link: http://lkml.kernel.org/r/57A3260F.4050709@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e780149b
    • mem-hotplug: fix node spanned pages when we have a movable node · e506b996
      Xishi Qiu authored
      Commit 342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror
      option") rewrote the calculation of node spanned pages.  But when we
      have a movable node, the node's spanned pages are counted twice.
      That's because we have an empty Normal zone whose present pages are zero
      but whose spanned pages are not zero.
      
      e.g.
          Zone ranges:
            DMA      [mem 0x0000000000001000-0x0000000000ffffff]
            DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
            Normal   [mem 0x0000000100000000-0x0000007c7fffffff]
          Movable zone start for each node
            Node 1: 0x0000001080000000
            Node 2: 0x0000002080000000
            Node 3: 0x0000003080000000
            Node 4: 0x0000003c80000000
            Node 5: 0x0000004c80000000
            Node 6: 0x0000005c80000000
          Early memory node ranges
            node   0: [mem 0x0000000000001000-0x000000000009ffff]
            node   0: [mem 0x0000000000100000-0x000000007552afff]
            node   0: [mem 0x000000007bd46000-0x000000007bd46fff]
            node   0: [mem 0x000000007bdcd000-0x000000007bffffff]
            node   0: [mem 0x0000000100000000-0x000000107fffffff]
            node   1: [mem 0x0000001080000000-0x000000207fffffff]
            node   2: [mem 0x0000002080000000-0x000000307fffffff]
            node   3: [mem 0x0000003080000000-0x0000003c7fffffff]
            node   4: [mem 0x0000003c80000000-0x0000004c7fffffff]
            node   5: [mem 0x0000004c80000000-0x0000005c7fffffff]
            node   6: [mem 0x0000005c80000000-0x0000006c7fffffff]
            node   7: [mem 0x0000006c80000000-0x0000007c7fffffff]
      
        node1:
          Normal, start=0x1080000, present=0x0, spanned=0x1000000
          Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
          pgdat, start=0x1080000, present=0x1000000, spanned=0x2000000
      
      After this patch, the problem is fixed.
      
        node1:
          Normal, start=0x0, present=0x0, spanned=0x0
          Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
          pgdat, start=0x1080000, present=0x1000000, spanned=0x1000000
      
      Link: http://lkml.kernel.org/r/57A325E8.6070100@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e506b996
    • mm, vmscan: make compaction_ready() more accurate and readable · fdd4c614
      Vlastimil Babka authored
      The compaction_ready() is used during direct reclaim for costly order
      allocations to skip reclaim for zones where compaction should be
      attempted instead.  It's combining the standard compaction_suitable()
      check with its own watermark check based on high watermark with extra
      gap, and the result is confusing at best.
      
      This patch attempts to better structure and document the checks
      involved.  First, compaction_suitable() can determine that the
      allocation should either succeed already, or that compaction doesn't
      have enough free pages to proceed.  The third possibility is that
      compaction has enough free pages, but we still decide to reclaim first -
      unless we are already above the high watermark with gap.  This does not
      mean that the reclaim will actually reach this watermark during a single
      attempt; this is rather over-reclaim protection.  So document the
      code as such.  The check for compaction_deferred() is removed
      completely, as it in fact had no proper role here.
      
      The result after this patch is mainly less confusing code.  We also
      skip some over-reclaim in cases where the allocation should already
      succeed.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdd4c614
    • mm, compaction: require only min watermarks for non-costly orders · 8348faf9
      Vlastimil Babka authored
      The __compaction_suitable() function checks the low watermark plus a
      compact_gap() gap to decide if there's enough free memory to perform
      compaction.  Then __isolate_free_page uses the low watermark check to
      decide whether a particular free page can be isolated.  In the latter
      case, using the low
      watermark is needlessly pessimistic, as the free page isolations are
      only temporary.  For __compaction_suitable() the higher watermark makes
      sense for high-order allocations where more freepages increase the
      chance of success, and we can typically fail with some order-0 fallback
      when the system is struggling to reach that watermark.  But for
      low-order allocation, forming the page should not be that hard.  So
      using the low watermark here might just prevent compaction from even
      trying, and eventually lead to the OOM killer even if we are above min
      watermarks.
      
      So after this patch, we use min watermark for non-costly orders in
      __compaction_suitable(), and for all orders in __isolate_free_page().
      
      [vbabka@suse.cz: clarify __isolate_free_page() comment]
       Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
      Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8348faf9
    • mm, compaction: use proper alloc_flags in __compaction_suitable() · 984fdba6
      Vlastimil Babka authored
      The __compaction_suitable() function checks the low watermark plus a
      compact_gap() gap to decide if there's enough free memory to perform
      compaction.  This check uses the direct compactor's alloc_flags, but that's
      wrong, since these flags are not applicable for freepage isolation.
      
      For example, alloc_flags may indicate access to memory reserves, making
      compaction proceed, and then fail watermark check during the isolation.
      
      A similar problem exists for ALLOC_CMA, which may be part of
      alloc_flags, but not during freepage isolation.  In this case however it
      makes sense to use ALLOC_CMA both in __compaction_suitable() and
      __isolate_free_page(), since there's actually nothing preventing the
      freepage scanner from isolating from CMA pageblocks, with the assumption
      that a page that could be migrated once by compaction can be migrated
      also later by CMA allocation.  Thus we should count pages in CMA
      pageblocks when considering compaction suitability and when isolating
      freepages.
      
      To sum up, this patch should remove some false positives from
      __compaction_suitable(), and allow compaction to proceed when free pages
      required for compaction reside in the CMA pageblocks.
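
      A hedged sketch of the effect: counting CMA free pages changes the
      outcome of the suitability check (a simplified model, not the kernel
      diff; the flag value and struct fields below are illustrative):

        #include <stdbool.h>
        #include <stdio.h>

        #define ALLOC_CMA 0x1   /* illustrative flag value, not the kernel's */

        struct zone_model {
                unsigned long free_pages;      /* all free pages, incl. CMA */
                unsigned long free_cma_pages;  /* free pages in CMA pageblocks */
                unsigned long low_wmark;
        };

        /* Model of the fixed check: regardless of the caller's other
         * alloc_flags, CMA free pages are counted, because the freepage
         * scanner may isolate from CMA pageblocks. */
        static bool suitable(const struct zone_model *z, unsigned int order,
                             unsigned int wmark_flags)
        {
                unsigned long usable = z->free_pages;

                if (!(wmark_flags & ALLOC_CMA))
                        usable -= z->free_cma_pages;
                return usable >= z->low_wmark + (2UL << order); /* compact_gap() */
        }

        int main(void)
        {
                struct zone_model z = { .free_pages = 2000,
                                        .free_cma_pages = 900,
                                        .low_wmark = 1200 };

                printf("counting CMA:     %d\n", suitable(&z, 3, ALLOC_CMA)); /* 1 */
                printf("not counting CMA: %d\n", suitable(&z, 3, 0));         /* 0 */
                return 0;
        }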
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      984fdba6
    • Vlastimil Babka's avatar
      mm, compaction: create compact_gap wrapper · 9861a62c
      Vlastimil Babka authored
      Compaction uses a watermark gap of (2UL << order) pages at various
      places and it's not immediately obvious why.  Abstract it through a
      compact_gap() wrapper to create a single place with a thorough
      explanation.
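
      The wrapper itself is tiny; a minimal sketch of the shape it takes,
      wrapped here in a standalone program for illustration (the in-kernel
      version carries the full explanatory comment):

        #include <stdio.h>

        /* Sketch of the wrapper: the 2UL << order gap that compaction
         * requires above the watermark, now defined (and explained) in a
         * single place instead of being open-coded at each call site. */
        static inline unsigned long compact_gap(unsigned int order)
        {
                return 2UL << order;
        }

        int main(void)
        {
                printf("gap for an order-9 (THP-sized) request: %lu pages\n",
                       compact_gap(9));   /* 1024 */
                return 0;
        }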
      
      [vbabka@suse.cz: clarify the comment of compact_gap()]
       Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
      Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9861a62c
    • Vlastimil Babka's avatar
      mm, compaction: use correct watermark when checking compaction success · f2b8228c
      Vlastimil Babka authored
      The __compact_finished() function uses the low watermark in a check
      that has to pass if direct compaction is to finish and the allocation
      is to succeed.  This is too pessimistic, as the allocation will
      typically use the
      min watermark.  It may happen that during compaction, we drop below the
      low watermark (due to parallel activity), but still form the target
      high-order page.  By checking against low watermark, we might needlessly
      continue compaction.
      
      Similarly, __compaction_suitable() uses the low watermark in a check of
      whether the allocation can succeed without compaction.  Again, this is
      unnecessarily
      pessimistic.
      
      After this patch, these checks will use the direct compactor's
      alloc_flags to determine the watermark, which is effectively the min
      watermark.
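
      A simplified model of the scenario described above (illustrative only;
      in the kernel the compactor's alloc_flags are fed into the watermark
      check, which effectively selects the min watermark):

        #include <stdbool.h>
        #include <stdio.h>

        struct zone_model { unsigned long free_pages, min_wmark, low_wmark; };

        /* Crude stand-in for the watermark part of the success checks:
         * before the patch the low watermark was used; after it, the
         * watermark implied by the compactor's alloc_flags (effectively
         * the min watermark) is used. */
        static bool allocation_should_succeed(const struct zone_model *z,
                                              bool after_patch)
        {
                unsigned long wmark = after_patch ? z->min_wmark : z->low_wmark;

                return z->free_pages >= wmark;
        }

        int main(void)
        {
                /* Parallel activity dropped the zone between the min and low
                 * watermarks, but the target high-order page may already
                 * have been formed. */
                struct zone_model z = { .free_pages = 1100, .min_wmark = 1000,
                                        .low_wmark = 1250 };

                printf("old check lets compaction finish: %d\n",
                       allocation_should_succeed(&z, false));  /* 0 */
                printf("new check lets compaction finish: %d\n",
                       allocation_should_succeed(&z, true));   /* 1 */
                return 0;
        }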
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2b8228c
    • Vlastimil Babka's avatar
      mm, compaction: add the ultimate direct compaction priority · a8e025e5
      Vlastimil Babka authored
      During the reclaim/compaction loop, it's desirable to get a final
      answer
      from unsuccessful compaction so we can either fail the allocation or
      invoke the OOM killer.  However, heuristics such as deferred compaction
      or pageblock skip bits can cause compaction to skip parts of zones or
      whole zones and lead to premature OOMs, failures, or excessive
      reclaim/compaction
      retries.
      
      To remedy this, we introduce a new direct compaction priority called
      COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
      
       - ignore deferred compaction status for a zone
       - ignore pageblock skip hints
       - ignore cached scanner positions and scan the whole zone
      
      The new priority should eventually get picked up by
      should_compact_retry() and this should improve success rates for costly
      allocations using __GFP_REPEAT, such as hugetlbfs allocations, and
      reduce some corner-case OOMs for non-costly allocations.
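
      A hedged sketch of what the priority scale might look like with the new
      level at the top; COMPACT_PRIO_SYNC_FULL and MIN_COMPACT_PRIORITY come
      from this message, while the other entry names model the pre-existing
      sync/async modes and are assumptions here:

        #include <stdio.h>

        enum compact_priority {
                COMPACT_PRIO_SYNC_FULL,  /* new: ignore deferral, skip hints
                                            and cached scanner positions */
                MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
                COMPACT_PRIO_SYNC_LIGHT, /* assumed pre-existing sync mode */
                COMPACT_PRIO_ASYNC,      /* assumed pre-existing async mode */
                INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC,
        };

        int main(void)
        {
                /* should_compact_retry() keeps lowering the value until the
                 * strongest priority has been tried. */
                for (int prio = INIT_COMPACT_PRIORITY;
                     prio >= MIN_COMPACT_PRIORITY; prio--)
                        printf("trying compaction at priority %d\n", prio);
                return 0;
        }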
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
      [vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias]
        Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8e025e5
    • Vlastimil Babka's avatar
      mm, compaction: don't recheck watermarks after COMPACT_SUCCESS · 7ceb009a
      Vlastimil Babka authored
      Joonsoo has reminded me that in a later patch changing watermark checks
      throughout compaction I forgot to update checks in
      try_to_compact_pages() and kcompactd_do_work().  Closer inspection,
      however, shows that they are now redundant in the success case, because
      compact_zone() reliably reports it with COMPACT_SUCCESS.  So
      effectively the checks just repeat (a subset of) the checks that have
      just passed.  Instead of checking watermarks again, just test the
      return value.
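
      A hedged sketch of the caller-side idea (a simplified result type; the
      kernel's enum compact_result has more states than shown here):

        #include <stdio.h>

        enum compact_result { COMPACT_SKIPPED, COMPACT_CONTINUE, COMPACT_SUCCESS };

        int main(void)
        {
                enum compact_result status = COMPACT_SUCCESS; /* from compact_zone() */

                /* Before: re-run a watermark check here.
                 * After:  trust the return value, which already means the
                 *         equivalent check passed inside compact_zone(). */
                if (status == COMPACT_SUCCESS)
                        printf("high-order page should be allocatable now\n");
                return 0;
        }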
      
      Note it's also possible that compaction would declare failure, e.g.
      because its find_suitable_fallback() is more strict than a simple
      watermark check, in which case the watermark check we are removing
      would still succeed.  After this patch that is no longer possible,
      which is arguably better, because for long-term fragmentation avoidance
      we should rather try a different zone than allocate with the unsuitable
      fallback.  If compaction of all zones fails and the allocation is
      important enough, it will retry and succeed anyway.
      
      Also remove the stray "bool success" variable from kcompactd_do_work().
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ceb009a
    • Vlastimil Babka's avatar
      mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS · cf378319
      Vlastimil Babka authored
      COMPACT_PARTIAL has historically meant that compaction returned after
      doing some work without fully compacting a zone.  It did not, however,
      distinguish whether compaction terminated because it succeeded in
      creating the requested high-order page.  This has changed recently, and
      now we
      only return COMPACT_PARTIAL when compaction thinks it succeeded, or the
      high-order watermark check in compaction_suitable() passes and no
      compaction needs to be done.
      
      So at this point we can make the return value clearer by renaming it to
      COMPACT_SUCCESS.  The next patch will remove some redundant tests for
      success where compaction just returned COMPACT_SUCCESS.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf378319
    • Vlastimil Babka's avatar
      mm, compaction: cleanup unused functions · 791cae96
      Vlastimil Babka authored
      Since kswapd compaction moved to kcompactd, compact_pgdat() is not
      called anymore, so we remove it.  The only caller of __compact_pgdat()
      is compact_node(), so we merge them and remove code that was only
      reachable from kswapd.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      791cae96
    • Vlastimil Babka's avatar
      mm, compaction: make whole_zone flag ignore cached scanner positions · 06ed2998
      Vlastimil Babka authored
      Patch series "make direct compaction more deterministic")
      
      This is mostly a followup to Michal's OOM detection rework, which
      highlighted the need for direct compaction to provide better feedback
      in the reclaim/compaction loop, so that it can reliably recognize when
      compaction cannot make further progress and the allocation should
      invoke the OOM killer or fail.  We discussed this at LSF/MM [1], where
      I proposed
      expanding the async/sync migration mode used in compaction to more
      general "priorities".  This patchset adds one new priority that just
      overrides all the heuristics and makes compaction fully scan all zones.
      I don't currently think that we need more fine-grained priorities, but
      we'll see.  Other than that, there are some smaller fixes and cleanups,
      mainly related to the THP-specific hacks.
      
      I've tested this with stress-highalloc in GFP_KERNEL order-4 and
      THP-like order-9 scenarios.  There's some improvement in the compaction
      stats for order-4, which is likely due to the better watermark
      handling.  In the previous version I reported mostly noise in the
      compaction stats and decreased direct reclaim; now the reclaim shows no
      difference.  I believe this is due to the less aggressive compaction
      priority increase in patch 6.
      
      "before" is a mmotm tree prior to 4.7 release plus the first part of the
      series that was sent and merged separately
      
                                          before        after
      order-4:
      
      Compaction stalls                    27216       30759
      Compaction success                   19598       25475
      Compaction failures                   7617        5283
      Page migrate success                370510      464919
      Page migrate failure                 25712       27987
      Compaction pages isolated           849601     1041581
      Compaction migrate scanned       143146541   101084990
      Compaction free scanned          208355124   144863510
      Compaction cost                       1403        1210
      
      order-9:
      
      Compaction stalls                     7311        7401
      Compaction success                    1634        1683
      Compaction failures                   5677        5718
      Page migrate success                194657      183988
      Page migrate failure                  4753        4170
      Compaction pages isolated           498790      456130
      Compaction migrate scanned          565371      524174
      Compaction free scanned            4230296     4250744
      Compaction cost                        215         203
      
      [1] https://lwn.net/Articles/684611/
      
      This patch (of 11):
      
      A recent patch has added the whole_zone flag, which compaction sets
      when scanning starts from the zone boundary, in order to report that
      the zone has been fully scanned in one attempt.  For allocations that
      want to try really hard or cannot fail, we will want to introduce a
      mode where scanning the whole zone is guaranteed regardless of the
      cached positions.
      
      This patch reuses the whole_zone flag so that if it is passed to
      compaction already set to true, the cached scanner positions are
      ignored.  Employing this flag during the reclaim/compaction loop will
      be done in the next patch.  This patch, however, converts compaction
      invoked from userspace via procfs to use this flag.  Before this patch,
      the cached
      positions were first reset to zone boundaries and then read back from
      struct zone, so there was a window where a parallel compaction could
      replace the reset values, making the manual compaction less effective.
      Using the flag instead of performing the reset is more robust.
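
      A simplified model of the scanner start-position logic after the change
      (the struct and field names below are illustrative, not the kernel's):

        #include <stdbool.h>
        #include <stdio.h>

        struct zone_model {
                unsigned long start_pfn, end_pfn;
                unsigned long cached_migrate_pfn, cached_free_pfn;
        };

        struct compact_control_model {
                bool whole_zone;
                unsigned long migrate_pfn, free_pfn;
        };

        /* With whole_zone set, the cached positions are ignored outright,
         * so no parallel compaction can slip updated values in between a
         * reset and a read-back. */
        static void init_scanners(struct compact_control_model *cc,
                                  const struct zone_model *z)
        {
                if (cc->whole_zone) {
                        cc->migrate_pfn = z->start_pfn;
                        cc->free_pfn = z->end_pfn - 1;
                } else {
                        cc->migrate_pfn = z->cached_migrate_pfn;
                        cc->free_pfn = z->cached_free_pfn;
                }
        }

        int main(void)
        {
                struct zone_model z = { 0x1000, 0x8000, 0x4000, 0x6000 };
                struct compact_control_model cc = { .whole_zone = true };

                init_scanners(&cc, &z);
                printf("migrate scanner: %#lx, free scanner: %#lx\n",
                       cc.migrate_pfn, cc.free_pfn);
                return 0;
        }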
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06ed2998