1. 21 May, 2016 40 commits
    • mm: page_is_guard(): return false when page_ext arrays are not allocated yet · 0bb2fd13
      Yang Shi authored
      When enabling the below kernel configs:
      
      CONFIG_DEFERRED_STRUCT_PAGE_INIT
      CONFIG_DEBUG_PAGEALLOC
      CONFIG_PAGE_EXTENSION
      CONFIG_DEBUG_VM
      
      kernel bootup may fail due to the following oops:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
        Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
        task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
        RIP: 0010:[<ffffffff8118d982>]  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        RSP: 0000:ffff88017c087c48  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
        RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
        RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
        R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
        R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
        FS:  0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
        Call Trace:
          free_hot_cold_page+0x192/0x1d0
          __free_pages+0x5c/0x90
          __free_pages_boot_core+0x11a/0x14e
          deferred_free_range+0x50/0x62
          deferred_init_memmap+0x220/0x3c3
          kthread+0xf8/0x110
          ret_from_fork+0x22/0x40
        Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
        RIP  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
         RSP <ffff88017c087c48>
        CR2: 0000000000000000
      
      The problem is that lookup_page_ext() returns NULL and page_is_guard()
      then tries to dereference it while the page is being freed.
      
      page_is_guard() depends on PAGE_EXT_DEBUG_GUARD bit of page extension
      flag, but freeing page might reach here before the page_ext arrays are
      allocated when feeding a range of pages to the allocator for the first
      time during bootup or memory hotplug.
      
      When lookup_page_ext() returns NULL, page_is_guard() should just return
      false instead of checking PAGE_EXT_DEBUG_GUARD unconditionally.
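
      The check described above can be sketched in userspace as follows. This
      is a minimal model, not the kernel code: the struct layouts, the bit
      index chosen for PAGE_EXT_DEBUG_GUARD and the way lookup_page_ext()
      finds its entry are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; layout is illustrative only. */
#define PAGE_EXT_DEBUG_GUARD 0   /* hypothetical bit index for this sketch */

struct page_ext {
    unsigned long flags;
};

struct page {
    struct page_ext *ext;   /* NULL until the page_ext arrays are allocated */
};

/* Model of lookup_page_ext(): may return NULL early in boot or on hotplug. */
static struct page_ext *lookup_page_ext(const struct page *page)
{
    return page->ext;
}

/* Guard check that tolerates a missing page_ext entry. */
static bool page_is_guard(const struct page *page)
{
    struct page_ext *page_ext = lookup_page_ext(page);

    if (page_ext == NULL)
        return false;   /* arrays not allocated yet: cannot be a guard page */

    return (page_ext->flags >> PAGE_EXT_DEBUG_GUARD) & 1UL;
}
```

      With this shape, a page freed before its page_ext entry exists is simply
      treated as a non-guard page instead of triggering a NULL dereference.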
      
      Link: http://lkml.kernel.org/r/1463610225-29060-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: khugepaged should scan when sleep value is written · f0508977
      David Rientjes authored
      If a large value is written to scan_sleep_millisecs, for example, that
      period must lapse before khugepaged will wake up for periodic
      collapsing.
      
      If this value is tuned to 1 day, for example, and then re-tuned to its
      default 10s, khugepaged will still wait for a day before scanning again.
      
      This patch causes khugepaged to wake up immediately when the value is
      changed and then sleep until that value is rewritten or the new value
      lapses.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MM: increase safety margin provided by PF_LESS_THROTTLE · a53eaff8
      NeilBrown authored
      When nfsd is exporting a filesystem over NFS which is then NFS-mounted
      on the local machine there is a risk of deadlock.  This happens when
      there are lots of dirty pages in the NFS filesystem and they cause NFSD
      to be throttled, either in throttle_vm_writeout() or in
      balance_dirty_pages().
      
      To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
      and it provides a 25% increase to the limits that affect NFSD.  Any
      process writing to an NFS filesystem will be throttled well before the
      number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
      will not deadlock on pages that it needs to write out.  At least it
      shouldn't.
      
      All processes are allowed a small excess margin to avoid performing too
      many calculations: ratelimit_pages.
      
      ratelimit_pages is set so that if a thread on every CPU uses the entire
      margin, the total will only go 3% over the limit, and this is much less
      than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
      shouldn't be a problem.  But it is.
      
      The "total memory" that these 3% and 25% are calculated against is not
      really total memory but "global_dirtyable_memory()", which doesn't
      include anonymous memory, just free memory and page-cache memory.
      
      The "ratelimit_pages" number is based on whatever the
      global_dirtyable_memory was on the last CPU hot-plug, which might not be
      what you expect, but is probably close to the total freeable memory.
      
      The throttle threshold uses the global_dirtyable_memory at the moment
      when the throttling happens, which could be much less than at the last
      CPU hotplug.  So if lots of anonymous memory has been allocated, thus
      pushing out lots of page-cache pages, then NFSD might end up being
      throttled due to dirty NFS pages because the "25%" bonus it gets is
      calculated against a rather small amount of dirtyable memory, while the
      "3%" margin that other processes are allowed to dirty without penalty is
      calculated against a much larger number.
      
      To remove this possibility of deadlock we need to make sure that the
      margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
      Simply adding ratelimit_pages isn't enough as that should be multiplied
      by the number of cpus.
      
      So add "global_wb_domain.dirty_limit / 32" as that more accurately
      reflects the current total over-shoot margin.  This ensures that the
      number of dirty NFS pages never gets so high that nfsd will be throttled
      waiting for them to be written.
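
      The arithmetic above can be sketched as a small userspace function. It
      is a model under stated assumptions, not the kernel implementation:
      `thresh` stands for a task's dirty threshold in pages,
      `domain_dirty_limit` stands in for global_wb_domain.dirty_limit, and
      the function name and `less_throttle` parameter are invented for
      illustration.

```c
#include <stdbool.h>

/*
 * Userspace model of the threshold bump described in the commit message.
 * All names here are illustrative stand-ins, not kernel symbols.
 */
static unsigned long less_throttle_thresh(unsigned long thresh,
                                          unsigned long domain_dirty_limit,
                                          bool less_throttle)
{
    if (!less_throttle)
        return thresh;

    /* 25% bonus plus 1/32 (about 3%) of the current domain dirty limit. */
    return thresh + thresh / 4 + domain_dirty_limit / 32;
}
```

      Because the extra 1/32 term is taken from the current dirty limit
      rather than from a value sampled at the last CPU hotplug, the bonus
      keeps pace with the over-shoot margin that other tasks are allowed.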
      
      Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: check_new_page_bad() directly returns in __PG_HWPOISON case · e570f56c
      Naoya Horiguchi authored
      Currently we check page->flags twice for "HWPoisoned" case of
      check_new_page_bad(), which can cause a race with unpoisoning.
      
      This race unnecessarily taints kernel with "BUG: Bad page state".
      check_new_page_bad() is the only caller of bad_page() which is
      interested in __PG_HWPOISON, so let's move the hwpoison related code in
      bad_page() to it.
      
      Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kasan: fix to call kasan_free_pages() after poisoning page · 29b52de1
      seokhoon.yoon authored
      When CONFIG_PAGE_POISONING and CONFIG_KASAN are both enabled, the code
      flow of free_pages_prepare() is as follows:

        1) kmemcheck_free_shadow()
        2) kasan_free_pages()
           - marks the page's shadow bytes as freed
        3) kernel_poison_pages()
           3.1) KASAN checks whether the access to the page is valid
                ---> an error occurs: KASAN treats it as an invalid access
           3.2) the page is poisoned
        4) kernel_map_pages()
      
      So kasan_free_pages() should be called after poisoning the page.
      
      Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
      Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: disable fault around on emulated access bit architecture · d0834a6c
      Minchan Kim authored
      fault_around aims to reduce minor faults of file-backed pages via
      speculative ahead pte mapping and relying on readahead logic.  However,
      on non-HW access bit architecture the benefit is highly limited because
      they should emulate the young bit with minor faults for reclaim's page
      aging algorithm.  IOW, we cannot reduce minor faults on those
      architectures.
      
      I did a quick test on my ARM machine.
      
      512M file mmap sequential every word read on eSATA drive 4 times.
      stddev is stable.
      
        = fault_around 4096 =
        elapsed time(usec): 6747645
      
        = fault_around 65536 =
        elapsed time(usec): 6709263
      
        0.5% gain.
      
      Even when I tested it with eMMC there is no gain because I guess with
      slow storage the major fault is the dominant factor.
      
      Also, fault_around has the side effect of shrinking slab more
      aggressively and causes higher vmpressure, so if such speculation fails,
      it can evict slab more which can result in page I/O (e.g., inode cache).
      In the end, it would negate any benefit of fault_around.
      
      So let's make the default "disabled" on those architectures.
      
      Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make faultaround produce old ptes · 5c0a85fa
      Kirill A. Shutemov authored
      Currently, faultaround code produces young pte.  This can screw up
      vmscan behaviour[1], as it makes vmscan think that these pages are hot
      and not push them out on first round.
      
      During sparse file access faultaround gets more pages mapped and all of
      them are young.  Under memory pressure, this makes vmscan swap out anon
      pages instead, or to drop other page cache pages which otherwise stay
      resident.
      
      Modify faultaround to produce old ptes, so they can easily be reclaimed
      under memory pressure.
      
      This can to some extent defeat the purpose of faultaround on machines
      without hardware accessed bit as it will not help us with reducing the
      number of minor page faults.
      
      We may want to disable faultaround on such machines altogether, but
      that's subject for separate patchset.
      
      Minchan:
       "I tested 512M mmap sequential word read test on non-HW access bit
        system (i.e., ARM) and confirmed it doesn't increase minor fault any
        more.
      
        old: 4096 fault_around
        minor fault: 131291
        elapsed time: 6747645 usec
      
        new: 65536 fault_around
        minor fault: 131291
        elapsed time: 6709263 usec
      
        0.56% benefit"
      
      [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
      
      Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use phys_addr_t for reserve_bootmem_region() arguments · 4b50bcc7
      Stefan Bader authored
      Since commit 92923ca3 ("mm: meminit: only set page reserved in the
      memblock region") the reserved bit is set on reserved memblock regions.
      However start and end address are passed as unsigned long.  This is only
      32bit on i386, so it can end up marking the wrong pages reserved for
      ranges at 4GB and above.
      
      This was observed on a 32bit Xen dom0 which was booted with initial
      memory set to a value below 4G but allowing to balloon in memory
      (dom0_mem=1024M for example).  This would define a reserved bootmem
      region for the additional memory (for example on an 8GB system there was
      a reserved region covering the 4GB-8GB range).  But since the addresses
      were passed on as unsigned long, this was actually marking all pages
      from 0 to 4GB as reserved.
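
      The truncation can be illustrated with fixed-width types. This is a
      sketch only: the helper names and the 4 KiB page size are assumptions,
      and uint32_t is used to stand in for a 32-bit unsigned long so that the
      example behaves the same on any build host.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* 4 KiB pages, assumed for illustration */

/* Page frame number computed from a full 64-bit physical address. */
static uint64_t pfn_from_phys64(uint64_t phys)
{
    return phys >> SKETCH_PAGE_SHIFT;
}

/*
 * Same computation after the address has been squeezed through 32 bits,
 * as happens when it is passed in a 32-bit unsigned long.
 */
static uint64_t pfn_from_truncated(uint64_t phys)
{
    uint32_t narrowed = (uint32_t)phys;  /* upper 32 bits are lost here */

    return narrowed >> SKETCH_PAGE_SHIFT;
}
```

      A region starting at 4GB (0x100000000) maps to page frame 0x100000 with
      the 64-bit helper but wraps to page frame 0 once truncated, which is
      how low memory ends up being treated as reserved.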
      
      Fixes: 92923ca3 ("mm: meminit: only set page reserved in the memblock region")
      Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userfaultfd: don't pin the user memory in userfaultfd_file_create() · d2005e3f
      Oleg Nesterov authored
      userfaultfd_file_create() increments mm->mm_users; this means that the
      memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
      after that can populate the orphaned mm more.
      
      Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
      mm->mm_count to pin mm_struct.  This means that
      atomic_inc_not_zero(mm->mm_users) is needed when we are going to
      actually play with this memory.  Except handle_userfault() path doesn't
      need this, the caller must already have a reference.
      
      The patch adds the new trivial helper, mmget_not_zero(), it can have
      more users.
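
      The "increment unless zero" idea behind mmget_not_zero() can be
      modelled in userspace with C11 atomics. This is a sketch under
      assumptions: `struct mm_model`, `inc_not_zero()` and
      `mmget_not_zero_model()` are hypothetical names, and the kernel's own
      atomic_inc_not_zero() is not reproduced here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for mm_struct with the two counters discussed above. */
struct mm_model {
    atomic_int mm_users;  /* users of the address space itself */
    atomic_int mm_count;  /* references that keep the struct allocated */
};

/* Increment *v unless it is zero; returns true if the increment happened. */
static bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Model of the helper: take a user reference only if any users remain. */
static bool mmget_not_zero_model(struct mm_model *mm)
{
    return inc_not_zero(&mm->mm_users);
}
```

      A holder of an mm_count reference would call the helper before touching
      the address space and back off if it returns false, since the memory
      has already been torn down in that case.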
      
      Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c: remove unnecessary always-true comparison · cd33a76b
      Richard Leitner authored
      Comparing a u64 variable to >= 0 is always true, so the comparison can
      be removed.  This issue was detected using the -Wtype-limits gcc flag.
      
      This patch fixes following type-limits warning:
      
        mm/memblock.c: In function `__next_reserved_mem_region':
        mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
          if (*idx >= 0 && *idx < type->cnt) {
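
      A small sketch of the simplified bounds test, using uint64_t in place
      of the kernel's u64. The function name is hypothetical. It shows why
      the lower-bound clause adds nothing: a negative value stored in an
      unsigned 64-bit variable wraps to a very large number, which the upper
      bound already rejects.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds test with the always-true `>= 0` clause dropped. */
static bool idx_in_range(uint64_t idx, uint64_t cnt)
{
    return idx < cnt;
}
```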
      
      Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net
      Signed-off-by: Richard Leitner <dev@g0hl1n.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • z3fold: the 3-fold allocator for compressed pages · 9a001fc1
      Vitaly Wool authored
      This patch introduces z3fold, a special purpose allocator for storing
      compressed pages.  It is designed to store up to three compressed pages
      per physical page.  It is a ZBUD derivative which allows for higher
      compression ratio keeping the simplicity and determinism of its
      predecessor.
      
      This patch comes as a follow-up to the discussions at the Embedded Linux
      Conference in San-Diego related to the talk [1].  The outcome of these
      discussions was that it would be good to have a compressed page
      allocator as stable and deterministic as zbud with a higher
      compression ratio.
      
      To keep the determinism and simplicity, z3fold, just like zbud, always
      stores an integral number of compressed pages per page, but it can store
      up to 3 pages unlike zbud which can store at most 2.  Therefore the
      compression ratio goes to around 2.6x while zbud's one is around 1.7x.
      
      The patch is based on the latest linux.git tree.
      
      This version has been updated after testing on various simulators (e.g.
      ARM Versatile Express, MIPS Malta, x86_64/Haswell) and based on
      comments from Dan Streetman [3].
      
      [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
      [2] https://lkml.org/lkml/2016/4/21/799
      [3] https://lkml.org/lkml/2016/5/4/852
      
      Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: split_huge_pmd_address() comment improvement · d5ee7c3b
      Andrea Arcangeli authored
      Comment is partly wrong, this improves it by including the case of
      split_huge_pmd_address() called by try_to_unmap_one if TTU_SPLIT_HUGE_PMD
      is set.
      
      Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: microoptimize compound_mapcount() · 5f527c2b
      Andrea Arcangeli authored
      compound_mapcount() is only called after PageCompound() has already been
      checked by the caller, so there's no point to check it again.  Gcc may
      optimize it away too because it's inline but this will remove the
      runtime check for sure and add an assert instead.
      
      Link: http://lkml.kernel.org/r/1462547040-1737-3-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: get rid of the ugly cpu_stat_off variable · 7b8da4c7
      Christoph Lameter authored
      The cpu_stat_off variable is unnecessary since we can check if a
      workqueue request is pending otherwise.  Removal of cpu_stat_off makes
      it pretty easy for the vmstat shepherd to ensure that the proper things
      happen.
      
      Removing the state also removes all races related to it.  Should a
      workqueue not be scheduled as needed for vmstat_update then the shepherd
      will notice and schedule it as needed.  Should a workqueue be
      unnecessarily scheduled then the vmstat updater will disable it.
      
      [akpm@linux-foundation.org: fix indentation, per Michal]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix stale mem_cgroup_force_empty() comment · 51038171
      Greg Thelen authored
      Commit f61c42a7 ("memcg: remove tasks/children test from
      mem_cgroup_force_empty()") removed memory reparenting from the function.
      
      Fix the function's comment.
      
      Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use unsigned long constant for page flags · d2a1a1f0
      Yu Zhao authored
      struct page->flags is unsigned long, so when shifting bits we should use
      UL suffix to match it.
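
      The point about the UL suffix can be shown with a tiny mask helper. This
      is an illustrative sketch, not the kernel macro: `flag_mask()` and
      `SKETCH_BITS_PER_LONG` are hypothetical names. A plain `1` has type
      int, so shifting it far enough produces a value that int cannot
      represent, whereas `1UL` keeps the whole shift in unsigned long.

```c
#include <limits.h>

/* Number of bits in unsigned long on the build target. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Build a single-bit mask for a flags word of type unsigned long.
 * The 1UL constant makes the shift happen in unsigned long, so every
 * bit position of the flags word is reachable.
 */
static unsigned long flag_mask(unsigned int bit)
{
    return 1UL << bit;
}
```

      Callers are expected to pass a bit index below SKETCH_BITS_PER_LONG;
      larger shifts are undefined in C regardless of the suffix.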
      
      Found this problem after I added 64-bit CPU specific page flags and
      failed to compile the kernel:
      
        mm/page_alloc.c: In function '__free_one_page':
        mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]
      
      Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use existing helper to convert "on"/"off" to boolean · 2a138dc7
      Minfei Huang authored
      It's more convenient to use existing function helper to convert string
      "on/off" to boolean.
      
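      The commit message does not name the helper it switches to, so the
      following is only a userspace sketch of what such an "on"/"off" parser
      might look like. `parse_on_off()` and its return convention are
      hypothetical and are not a kernel API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: parse "on" or "off" into *res.
 * Returns 0 on success, -1 if the input is neither (and leaves *res alone).
 */
static int parse_on_off(const char *s, bool *res)
{
    if (s == NULL || res == NULL)
        return -1;

    if (strcmp(s, "on") == 0) {
        *res = true;
        return 0;
    }
    if (strcmp(s, "off") == 0) {
        *res = false;
        return 0;
    }
    return -1;
}
```

      Centralising the parsing in one helper means each boot parameter
      handler only has to forward its string and a pointer to its flag.
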
      Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,writeback: don't use memory reserves for wb_start_writeback · 78ebc2f7
      Tetsuo Handa authored
      When writeback operation cannot make forward progress because memory
      allocation requests needed for doing I/O cannot be satisfied (e.g.
      under OOM-livelock situation), we can observe flood of order-0 page
      allocation failure messages caused by complete depletion of memory
      reserves.
      
      This is caused by unconditionally allocating "struct wb_writeback_work"
      objects using GFP_ATOMIC from PF_MEMALLOC context.
      
      __alloc_pages_nodemask() {
        __alloc_pages_slowpath() {
          __alloc_pages_direct_reclaim() {
            __perform_reclaim() {
              current->flags |= PF_MEMALLOC;
              try_to_free_pages() {
                do_try_to_free_pages() {
                  wakeup_flusher_threads() {
                    wb_start_writeback() {
                      kzalloc(sizeof(*work), GFP_ATOMIC) {
                        /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                      }
                    }
                  }
                }
              }
              current->flags &= ~PF_MEMALLOC;
            }
          }
        }
      }
      
      Since I/O is stalling, allocating writeback requests forever shall
      deplete memory reserves.  Fortunately, since wb_start_writeback() can
      fall back to wb_wakeup() when allocating "struct wb_writeback_work"
      failed, we don't need to allow wb_start_writeback() to use memory
      reserves.
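
      The fallback pattern relied on above can be modelled in userspace. This
      is a sketch under assumptions: the struct names, the counters and the
      `alloc_ok` switch that simulates allocation failure are invented for
      illustration, and the allocation flags the patch actually uses are not
      shown because the commit message does not list them.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for "struct wb_writeback_work". */
struct work_item {
    int reason;
};

/* Counters that record which path was taken. */
struct wb_model {
    int queued;   /* detailed work items handed to the flusher */
    int wakeups;  /* bare wakeups used as the fallback */
};

/*
 * Try to queue a detailed work item; if the allocation fails, fall back
 * to a plain wakeup instead of insisting on memory.
 */
static void start_writeback_model(struct wb_model *wb, bool alloc_ok)
{
    struct work_item *work = alloc_ok ? calloc(1, sizeof(*work)) : NULL;

    if (work == NULL) {
        wb->wakeups++;   /* models the wb_wakeup() fallback */
        return;
    }

    wb->queued++;
    free(work);          /* a real flusher would consume it later */
}
```

      Since the failure path still makes progress, the allocation does not
      need access to emergency reserves in order to be safe.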
      
        Mem-Info:
        active_anon:289393 inactive_anon:2093 isolated_anon:29
         active_file:10838 inactive_file:113013 isolated_file:859
         unevictable:0 dirty:108531 writeback:5308 unstable:0
         slab_reclaimable:5526 slab_unreclaimable:7077
         mapped:9970 shmem:2159 pagetables:2387 bounce:0
         free:3042 free_pcp:0 free_cma:0
        Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
        Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        126847 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
        Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
        kthreadd: page allocation failure: order:0, mode:0x2200020
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          warn_alloc_failed+0xf7/0x150
          __alloc_pages_nodemask+0x23f/0xa60
          alloc_pages_current+0x87/0x110
          new_slab+0x3a1/0x440
          ___slab_alloc+0x3cf/0x590
          __slab_alloc.isra.64+0x18/0x1d
          kmem_cache_alloc+0x11c/0x150
          wb_start_writeback+0x39/0x90
          wakeup_flusher_threads+0x7f/0xf0
          do_try_to_free_pages+0x1f9/0x410
          try_to_free_pages+0x94/0xc0
          __alloc_pages_nodemask+0x566/0xa60
          alloc_pages_current+0x87/0x110
          __page_cache_alloc+0xaf/0xc0
          pagecache_get_page+0x88/0x260
          grab_cache_page_write_begin+0x21/0x40
          xfs_vm_write_begin+0x2f/0xf0
          generic_perform_write+0xca/0x1c0
          xfs_file_buffered_aio_write+0xcc/0x1f0
          xfs_file_write_iter+0x84/0x140
          __vfs_write+0xc7/0x100
          vfs_write+0x9d/0x190
          SyS_write+0x50/0xc0
          entry_SYSCALL_64_fastpath+0x12/0x6a
        Mem-Info:
        active_anon:293335 inactive_anon:2093 isolated_anon:0
         active_file:10829 inactive_file:110045 isolated_file:32
         unevictable:0 dirty:109275 writeback:822 unstable:0
         slab_reclaimable:5489 slab_unreclaimable:10070
         mapped:9999 shmem:2159 pagetables:2420 bounce:0
         free:3 free_pcp:0 free_cma:0
        Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        123086 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
          cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
          node 0: slabs: 3218, objs: 205952, free: 0
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
      
      Assuming that somebody will find a better solution, let's apply this
      patch for now to stop the bleeding, because this problem frequently
      prevents me from testing the OOM livelock condition.
      
      Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Documentation: vm: fix spelling mistakes · 89474d50
      Eric Engestrom authored
      Signed-off-by: Eric Engestrom <eric@engestrom.ch>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm fix commmets: if SPARSEMEM, pgdata doesn't have page_ext · 0c9ad804
      Weijie Yang authored
      If SPARSEMEM, use page_ext in mem_section
      if !SPARSEMEM, use page_ext in pgdata
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/hugetlb.h: use bool instead of int for hugepage_migration_supported() · d70c17d4
      Chen Gang authored
      It is used as a pure bool function throughout the kernel source.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/hugetlb*.h: clean up code · 7fab358d
      Chen Gang authored
      The HUGETLBFS_SB macro is clear enough, so a single statement is clearer
      than three lines of statements.
      
      Remove redundant return statements from void functions, which at least
      saves lines.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swap.c: put activate_page_pvecs and other pagevecs together · a4a921aa
      Ming Li authored
      Put the activate_page_pvecs definition next to those of the other
      pagevecs, for clarity.
      Signed-off-by: Ming Li <mingli199x@qq.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: tighten fault_in_pages_writeable() · b8ca9e3a
      Eric Dumazet authored
      copy_page_to_iter_iovec() is currently the only user of
      fault_in_pages_writeable(), and it definitely can use fragments from
      high order pages.
      
      Make sure fault_in_pages_writeable() is only touching two adjacent pages
      at most, as claimed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size · 297880f4
      David Rientjes authored
      The page_counter rounds limits down to page size values.  This makes
      sense, except in the case of hugetlb_cgroup where it's not possible to
      charge partial hugepages.  If the hugetlb_cgroup margin is less than the
      hugepage size being charged, it will fail as expected.
      
      Round the hugetlb_cgroup limit down to hugepage size, since it is the
      effective limit of the cgroup.
      
      For consistency, round down PAGE_COUNTER_MAX as well when a
      hugetlb_cgroup is created: this prevents error reports when a user
      cannot restore the value to the kernel default.
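      The rounding described above can be illustrated in plain user-space C.
      This is only a sketch: hugetlb_limit_round_down() is a hypothetical
      helper that works on byte values, and it assumes the hugepage size is
      a power of two; it is not the kernel's own code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a byte limit down to a whole number of hugepages.
 * Hypothetical helper for illustration; assumes hugepage_size is a
 * non-zero power of two (e.g. 2 MiB or 1 GiB).
 */
static uint64_t hugetlb_limit_round_down(uint64_t limit_bytes,
                                         uint64_t hugepage_size)
{
    assert(hugepage_size != 0 && (hugepage_size & (hugepage_size - 1)) == 0);
    /* Clearing the low bits drops any partial hugepage from the limit. */
    return limit_bytes & ~(hugepage_size - 1);
}
```

      With 2 MiB hugepages, a requested limit of 5 MiB becomes an effective
      limit of 4 MiB, since the last 1 MiB could never be charged.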
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs/ramfs: fix VM_MAYSHARE mappings for NOMMU · 63678c32
      Rich Felker authored
      The nommu do_mmap expects f_op->get_unmapped_area to either succeed or
      return -ENOSYS for VM_MAYSHARE (e.g. private read-only) mappings.
      Returning addr in the non-MAP_SHARED case was completely wrong, and only
      happened to work because addr was 0.  However, it prevented VM_MAYSHARE
      mappings from sharing backing with the fs cache, and forced such
      mappings (including shareable program text) to be copied whenever the
      number of mappings transitioned from 0 to 1, impacting performance and
      memory usage.  Subsequent mappings beyond the first still correctly
      shared memory with the first.
      
      Instead, treat VM_MAYSHARE identically to VM_SHARED at the file ops level;
      do_mmap already handles the semantic differences between them.
      Signed-off-by: Rich Felker <dalias@libc.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: enable RLIMIT_DATA by default with workaround for valgrind · f4fcd558
      Konstantin Khlebnikov authored
      Since commit 84638335 ("mm: rework virtual memory accounting"),
      RLIMIT_DATA limits both brk() and private mmap(), but this is disabled
      by default because of an incompatibility with older versions of valgrind.
      
      Valgrind always sets the limit to zero and fails if RLIMIT_DATA is
      enabled.  Fortunately, it changes only rlim_cur and keeps rlim_max so
      that it can restore the limit when needed.
      
      This patch also checks current usage against rlim_max if rlim_cur is
      zero.  This is safe because a task can raise rlim_cur up to rlim_max
      anyway.  The size of brk is still checked against rlim_cur, so this part
      is completely compatible - zero rlim_cur forbids brk() but allows
      private mmap().
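      The decision described above can be sketched as two small predicates.
      This is a user-space illustration only: data_mmap_allowed() and
      brk_allowed() are hypothetical names, and all values are plain byte
      counts rather than the kernel's page-based accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check for a private mapping that would raise data usage
 * to new_usage bytes. */
static bool data_mmap_allowed(uint64_t new_usage, uint64_t rlim_cur,
                              uint64_t rlim_max)
{
    assert(rlim_cur <= rlim_max); /* the soft limit never exceeds the hard limit */

    if (new_usage <= rlim_cur)
        return true;
    /* Valgrind workaround: a zero soft limit falls back to the hard limit. */
    if (rlim_cur == 0 && new_usage <= rlim_max)
        return true;
    return false;
}

/* Hypothetical check for brk(): only the soft limit is consulted. */
static bool brk_allowed(uint64_t new_brk_size, uint64_t rlim_cur)
{
    return new_brk_size <= rlim_cur;
}
```

      With rlim_cur set to zero, brk_allowed() rejects any growth while
      data_mmap_allowed() still admits mappings up to rlim_max, which matches
      the compatibility behaviour the message describes.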
      
      Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix incorrect pfn passed to untrack_pfn() in remap_pfn_range() · d5957d2f
      Yongji Xie authored
      We use generic hooks in remap_pfn_range() to help archs to track pfnmap
      regions.  The code is something like:
      
        int remap_pfn_range()
        {
      	...
      	track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
      	...
      	pfn -= addr >> PAGE_SHIFT;
      	...
      	untrack_pfn(vma, pfn, PAGE_ALIGN(size));
      	...
        }
      
      Here the pfn is modified but not restored before untrack_pfn() is
      called, so untrack_pfn() receives the wrong pfn.
      
      There are no known runtime effects - this is from inspection.
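      One plausible way to avoid the problem is to keep a copy of the
      caller's pfn and hand that copy to the untrack hook.  The sketch below
      is a user-space model under that assumption; the *_sketch functions,
      the recorded globals and the 4 KiB PAGE_SHIFT are all hypothetical
      stand-ins, not the kernel's code.

```c
#include <assert.h>

#define PAGE_SHIFT 12 /* assumed 4 KiB pages */

/* Stand-ins for the arch hooks: they only record the pfn they were given. */
static unsigned long tracked_pfn;
static unsigned long untracked_pfn;

static void track_pfn_sketch(unsigned long pfn)   { tracked_pfn = pfn; }
static void untrack_pfn_sketch(unsigned long pfn) { untracked_pfn = pfn; }

/* Returns 0 on success, -1 when the simulated page table setup fails. */
static int remap_pfn_range_sketch(unsigned long addr, unsigned long pfn,
                                  int simulate_failure)
{
    unsigned long remap_pfn = pfn; /* saved copy for the error path */

    assert((addr & ((1UL << PAGE_SHIFT) - 1)) == 0); /* page aligned */
    track_pfn_sketch(remap_pfn);

    pfn -= addr >> PAGE_SHIFT; /* biased value used by the mapping loop */
    (void)pfn;

    if (simulate_failure) {
        untrack_pfn_sketch(remap_pfn); /* the saved pfn, not the biased one */
        return -1;
    }
    return 0;
}
```

      After a simulated failure, tracked_pfn and untracked_pfn hold the same
      value, which is the property the fix is meant to restore.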
      Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: keep a separate lazy-free list · 80c4bd7a
      Chris Wilson authored
      When mixing lots of vmallocs and set_memory_*() (which calls
      vm_unmap_aliases()) I encountered situations where the performance
      degraded severely due to the walking of the entire vmap_area list on
      each invocation.
      
      One simple improvement is to add the lazily freed vmap_area to a
      separate lockless free list, such that we then avoid having to walk the
      full list on each purge.
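      The idea of a separate lockless list can be sketched with C11 atomics.
      This is a minimal user-space model, not the kernel's list
      implementation: struct lazy_node, lazy_add() and lazy_purge() are
      hypothetical names.  Producers push with a compare-and-swap loop, and
      the purge takes the whole list with a single atomic exchange, so it
      visits only the lazily freed entries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lazy_node {
    struct lazy_node *next;
    int id;
};

/* Head of the hypothetical lazy-free list. */
static _Atomic(struct lazy_node *) lazy_head = NULL;

/* Lock-free push: retry until the head is swapped to the new node. */
static void lazy_add(struct lazy_node *n)
{
    struct lazy_node *old;

    assert(n != NULL);
    old = atomic_load(&lazy_head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&lazy_head, &old, n));
}

/* Detach every queued node at once and return how many were purged. */
static int lazy_purge(void)
{
    struct lazy_node *n = atomic_exchange(&lazy_head, (struct lazy_node *)NULL);
    int purged = 0;

    for (; n != NULL; n = n->next)
        purged++; /* a real purge would release the area here */
    return purged;
}
```

      The cost of lazy_purge() is proportional to the number of queued
      entries rather than to the size of the full area list.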
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Pen <r.peniaev@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Shawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memblock.c: move memblock_{add,reserve}_region into memblock_{add,reserve} · f705ac4b
      Alexander Kuleshov authored
      memblock_add_region() and memblock_reserve_region() do nothing specific
      before the call of memblock_add_range(), only print debug output.
      
      We can do the same in memblock_add() and memblock_reserve() since both
      memblock_add_region() and memblock_reserve_region() are not used by
      anybody outside of memblock.c and memblock_{add,reserve}() have the same
      set of flags and nids.
      
      Since memblock_add_region() and memblock_reserve_region() will be
      inlined, there will be no functional change, but code readability will
      improve a little.
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory-failure.c: replace "MCE" with "Memory failure" · 495367c0
      Chen Yucong authored
      HWPoison was specific to some particular x86 platforms, and it is often
      seen as a high level machine check handler.  Therefore, 'MCE' is used
      as the format prefix of printk().  However, 'PowerNV' has also used
      HWPoison for handling memory errors[1], so 'MCE' is no longer a
      suitable prefix in memory-failure.c.
      
      Additionally, 'MCE' and 'Memory failure' belong to different contexts.
      The former belongs to exception context and the latter belongs to
      process context.  Furthermore, HWPoison can also be used for off-lining
      those sub-health pages that do not trigger any machine check exception.
      
      This patch aims to replace 'MCE' with a more appropriate prefix.
      
      [1] commit 75eb3d9b ("powerpc/powernv: Get FSP memory errors
      and plumb into memory poison infrastructure.")
      Signed-off-by: Chen Yucong <slaoub@gmail.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: simplify the implementation of mk_huge_pmd() · 340a43be
      Yang Shi authored
      The implementation of mk_huge_pmd() is verbose; it can be simplified to
      a single line of code.
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,oom: speed up select_bad_process() loop · f44666b0
      Tetsuo Handa authored
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct, it was quite inefficient because
      each thread group was scanned num_threads times, which can be a lot,
      especially for processes with many threads.  Even though OOM is an
      extremely cold path, it is always good to be as efficient as possible
      when we are inside rcu_read_lock() - aka unpreemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and check it at
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
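      
      The per-thread-group counter can be pictured with the following
      user-space model.  All names here (signal_sketch, process_sketch,
      mark_oom_victim_sketch, count_groups_with_victims) are hypothetical
      and only illustrate why one check per process is enough once the
      count lives in the shared structure.

```c
#include <assert.h>
#include <stddef.h>

/* Shared by all threads of one process, like signal_struct. */
struct signal_sketch {
    int oom_victims; /* number of threads currently marked as OOM victims */
};

struct process_sketch {
    struct signal_sketch signal;
    int nr_threads;
};

/* Called when one thread of the group is marked as a victim. */
static void mark_oom_victim_sketch(struct process_sketch *p)
{
    assert(p != NULL);
    p->signal.oom_victims++;
}

/*
 * Scan processes, not threads: the number of checks equals the number of
 * processes regardless of how many threads each one has.
 */
static size_t count_groups_with_victims(const struct process_sketch *procs,
                                        size_t nr_procs, size_t *checks)
{
    size_t found = 0;

    *checks = 0;
    for (size_t i = 0; i < nr_procs; i++) {
        (*checks)++;
        if (procs[i].signal.oom_victims > 0)
            found++;
    }
    return found;
}
```

      For two processes with 1000 and 4 threads, the scan performs two
      checks instead of 1004.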
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: consider multi-threaded tasks in task_will_free_mem · 98748bd7
      Michal Hocko authored
      task_will_free_mem is a misnomer for a more complex PF_EXITING test for
      early break out from the oom killer because it is believed that such a
      task would release its memory shortly and so we do not have to select an
      oom victim and perform a disruptive action.
      
      Currently we make sure that the given task is not participating in the
      core dumping because it might get blocked for a long time - see commit
      d003f371 ("oom: don't assume that a coredumping thread will exit
      soon").
      
      The check can still do better though.  We shouldn't consider the task
      unless the whole thread group is going down.  This is rather unlikely
      but not impossible.  A single exiting thread would surely leave all the
      address space behind.  If we are really unlucky it might get stuck on
      the exit path and keep its TIF_MEMDIE and so block the oom killer.
      
      Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Michal Hocko authored
      Tetsuo has properly noted that the mmput slow path might get blocked
      waiting for another party (e.g. exit_aio waits for an IO).  If that
      happens the oom_reaper would be put out of the way and would not be able
      to process the next oom victim.  We should strive to make this context
      as reliable and as independent of other subsystems as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has reclaimed the victim's address space
      for most cases as much as possible and the remaining context shouldn't
      bind too much memory anymore.  The only exception is when mmap_sem
      trylock has failed which shouldn't happen too often.
      
      The issue is only theoretical but not impossible.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully · bb8a4b7f
      Michal Hocko authored
      Commit 36324a99 ("oom: clear TIF_MEMDIE after oom_reaper managed to
      unmap the address space") not only clears TIF_MEMDIE for the oom reaped
      task but also sets OOM_SCORE_ADJ_MIN for the target task to hide it from the
      oom killer.  This works in simple cases but it is not sufficient for
      (unlikely) cases where the mm is shared between independent processes
      (as they do not share signal struct).  If the mm had only small amount
      of memory which could be reaped then another task sharing the mm could
      be selected and that wouldn't help to move out from the oom situation.
      
      Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same
      as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set.  Set the
      flag after __oom_reap_task is done with a task.  This forces
      select_bad_process() to ignore all already oom reaped tasks and ensures
      that no such task is sacrificed for its parent.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION · 31e49bfd
      Michal Hocko authored
      Joonsoo has reported that he is able to trigger OOM for !costly high
      order requests (heavy fork() workload close to OOM) with the new oom
      detection rework.  This is because we rely only on should_reclaim_retry
      when compaction is disabled, and it only checks watermarks for the
      requested order, so we might trigger OOM when there is a lot of free
      memory.
      
      It is not very clear what the usual workloads are when compaction is
      disabled.  Relying heavily on high order allocations without any
      mechanism to create those orders except for an unbounded amount of
      reclaim is certainly not a good idea.
      
      To prevent potential regressions, let's help this configuration
      somewhat.  We have to sacrifice determinism though, because there
      simply is none possible here.  The should_compact_retry implementation for
      !CONFIG_COMPACTION, which was empty so far, will do watermark check for
      order-0 on all eligible zones.  This will cause retrying until either
      the reclaim cannot make any further progress or all the zones are
      depleted even for order-0 pages.  This means that the number of retries
      is basically unbounded for !costly orders but that was the case before
      the rework as well so this shouldn't regress.
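      
      A rough user-space model of that fallback is sketched below.  It is
      illustrative only: struct zone_sketch and should_compact_retry_sketch()
      are hypothetical, the comparison against a per-zone min watermark is an
      assumption about which watermark is meant, and the early return for
      order-0 and costly orders is an assumption based on the text applying
      this to !costly high order requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical, heavily simplified view of a zone. */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long min_watermark; /* assumed to be the watermark checked */
};

/*
 * Keep retrying a !costly high order request while at least one zone still
 * passes an order-0 watermark check.
 */
static bool should_compact_retry_sketch(const struct zone_sketch *zones,
                                        size_t nr_zones, unsigned int order)
{
    assert(zones != NULL || nr_zones == 0);

    if (order == 0 || order > PAGE_ALLOC_COSTLY_ORDER)
        return false;

    for (size_t i = 0; i < nr_zones; i++) {
        if (zones[i].free_pages > zones[i].min_watermark)
            return true;
    }
    return false; /* every zone is depleted even for order-0 */
}
```

      The loop gives up only once no zone passes the order-0 check, which is
      why the number of retries is not bounded by a fixed count.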
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.org
      Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Michal Hocko authored
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable,
      the patch missed a bad interaction between reclaim, compaction and the
      retry logic.  Direct reclaim tries to get zones over the min watermark
      while compaction backs off and returns COMPACT_SKIPPED when all zones
      are below the low watermark + 1<<order gap.  If we are getting really
      close to OOM then __compaction_suitable can keep returning
      COMPACT_SKIPPED for a high order request (e.g. hugetlb order-9) while
      reclaim is not able to release enough pages to get us over the low
      watermark.  Reclaim is still able to make some progress (usually
      thrashing over the few remaining pages) so we are not able to break out
      of the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready" but it shows how things might go wrong when we
      approach the oom event horizon.
      
      The reason why compaction requires being over low rather than min
      watermark is not clear to me.  This check was there essentially since
      56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail though and
      we shouldn't pull it into the generic retry logic while we should be
      able to cope with such eventuality.  The only place in
      should_compact_retry where we retry without any upper bound is for
      compaction_withdrawn() case.
      
      Introduce the compaction_zonelist_suitable function, which checks the
      given zonelist and returns true only if there is at least one zone which
      would unblock __compaction_suitable if more memory got reclaimed.  In
      this implementation it checks __compaction_suitable with NR_FREE_PAGES
      plus part of the reclaimable memory as the target for the watermark
      check.  The reclaimable memory is reduced linearly by the allocation
      order.  The idea is that we do not want to reclaim all the remaining
      memory for a single allocation request just to unblock
      __compaction_suitable, which doesn't guarantee we will make further
      progress.
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided so we do not retry if there is no outlook for a further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require to have at least 64kB on the reclaimable LRUs while
      order-9 would need at least 32M which should be enough to not lock up.
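      
      The estimate used by the new helper can be sketched as follows.  This
      is a simplified user-space model: zone_could_unblock_compaction() is a
      hypothetical name, "reduced linearly by the allocation order" is read
      here as a division by the order, and the watermark is passed in as a
      plain page count rather than derived the way the kernel derives it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Would this zone pass the compaction watermark check if part of its
 * reclaimable memory were reclaimed?  All values are in pages.
 */
static bool zone_could_unblock_compaction(unsigned long nr_free,
                                          unsigned long nr_reclaimable,
                                          unsigned int order,
                                          unsigned long watermark)
{
    unsigned long available;

    assert(order > 0); /* compaction only matters for high order requests */

    /* Only a fraction of the reclaimable pages is counted; the fraction
     * shrinks as the order grows (assumed interpretation). */
    available = nr_free + nr_reclaimable / order;
    return available >= watermark;
}
```

      With 100 free pages and 900 reclaimable pages, an order-9 request
      counts 100 + 900/9 = 200 pages as available, while an order-2 request
      counts 100 + 900/2 = 550 pages.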
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: consider compaction feedback also for costly allocation · 7854ea6c
      Michal Hocko authored
      PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
      should_reclaim_retry currently where we decide to not retry after at
      least order worth of pages were reclaimed or the watermark check for at
      least one zone would succeed after reclaiming all pages if the reclaim
      hasn't made any progress.  Compaction feedback is mostly ignored and we
      just try to make sure that the compaction did at least something before
      giving up.
      
      The first condition was added by a41f24ea ("page allocator: smarter
      retry of costly-order allocations") and it assumed that lumpy reclaim
      could have created a page of the sufficient order.  Lumpy reclaim was
      removed quite some time ago, so the assumption doesn't hold anymore.
      Remove the check for the number of reclaimed pages and rely solely on
      the compaction feedback.  should_reclaim_retry now only makes sure
      that we keep retrying reclaim for high order pages only if they are
      hidden by watermarks so order-0 reclaim really makes sense.
      
      should_compact_retry now keeps retrying even for the costly allocations.
      The number of retries is reduced wrt. !costly requests because they are
      less important and harder to grant and so their pressure shouldn't cause
      contention for other requests or cause an over reclaim.  We also do not
      reset no_progress_loops for costly request to make sure we do not keep
      reclaiming too aggressively.
      
      This has been tested by running a process which fragments memory:
      	- compact memory
      	- mmap large portion of the memory (1920M on a machine with 2G of
      	  RAM and 2G of swap space)
      	- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
      	  steps until certain amount of memory is freed (250M in my test)
      	  and reduce the step to (step / 2) + 1 after reaching the end of
      	  the mapping
      	- then run a script which populates the page cache 2G (MemTotal)
      	  from /dev/zero to a new file
      And then tries to allocate
      nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
      huge pages.
      
      root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
      Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
      Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437
      
      * This is the /proc/buddyinfo after the compaction
      
      Done fragmenting. size=2013265920 freed=262144000
      Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0
      
      * /proc/buddyinfo after memory got fragmented
      
      Executing "/root/alloc_hugepages.sh"
      Eating some pagecache
      508623+0 records in
      508623+0 records out
      2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0
      
      * /proc/buddyinfo after page cache got eaten
      
      Trying to allocate 129
      129
      
      * 129 hugepages requested and all of them granted.
      
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0
      
      * /proc/buddyinfo after hugetlb allocation.
      
      10 runs will behave as follows:
      Trying to allocate 130
      130
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 132
      132
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      
      So basically 100% success for all 10 attempts.
      Without the patch numbers looked much worse:
      Trying to allocate 128
      12
      --
      Trying to allocate 129
      14
      --
      Trying to allocate 129
      7
      --
      Trying to allocate 129
      16
      --
      Trying to allocate 129
      30
      --
      Trying to allocate 129
      38
      --
      Trying to allocate 129
      19
      --
      Trying to allocate 129
      37
      --
      Trying to allocate 129
      28
      --
      Trying to allocate 129
      37
      
      Just for completeness, the base kernel without the oom detection rework
      looks as follows:
      Trying to allocate 127
      30
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      52
      --
      Trying to allocate 128
      32
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      10
      --
      Trying to allocate 129
      32
      --
      Trying to allocate 128
      14
      --
      Trying to allocate 128
      16
      --
      Trying to allocate 129
      8
      
      As we can see, the success rate is much more volatile and smaller without
      this patch.  So the patch not only makes the retry logic for costly
      requests more sensible, it also yields a higher success rate.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7854ea6c
    • mm, oom: protect !costly allocations some more · 33c2d214
      Michal Hocko authored
      should_reclaim_retry will give up retries for higher order allocations
      if none of the eligible zones has any requested or higher order pages
      available even if we pass the watermark check for order-0.  This is done
      because there is no guarantee that the reclaimable and currently free
      pages will form the required order.
      
      This can, however, lead to situations where a high-order request (e.g.
      order-2 required for the stack allocation during fork) will trigger OOM
      too early, e.g. after the first reclaim/compaction round.  Such a
      system would have to be highly fragmented, and there is no guarantee
      that further reclaim/compaction attempts would help.  Still, we should
      at least make sure that compaction was active before we go OOM, and
      keep retrying even if should_reclaim_retry tells us to OOM if
      
      	- the last compaction round backed off or
      	- we haven't completed at least MAX_COMPACT_RETRIES active
      	  compaction rounds.
      
      The first rule ensures that the very last attempt for compaction was not
      ignored while the second guarantees that the compaction has done some
      work.  Multiple retries might be needed to keep other contexts from
      occasionally piggybacking on the compaction and stealing the compacted
      pages before the current context manages to retry the allocation.
      
      compaction_failed() is taken as a final word from the compaction that
      the retry doesn't make much sense.  We have to be careful though because
      the first compaction round is MIGRATE_ASYNC which is rather weak as it
      ignores pages under writeback and gives up too easily in other
      situations.  We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
      has been used before we give up.  With this logic in place we do not
      have to increase the migration mode unconditionally and rather do it
      only if the compaction failed for the weaker mode.  A nice side effect
      is that the stronger migration mode is used only when really needed so
      this has a potential of smaller latencies in some cases.
      
      Please note that the compaction doesn't tell us much about how
      successful it was when returning compaction_made_progress so we just
      have to blindly trust that another retry is worthwhile and cap the
      number of retries at something reasonable to guarantee convergence.
      
      If the given number of successful retries is not sufficient for
      reasonable workloads we should focus on the collected compaction
      tracepoint data and try to address the issue in the compaction code.
      If this is not feasible we can increase the retries limit.
      
      [mhocko@suse.com: fix warning]
        Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33c2d214