1. 21 May, 2016 40 commits
    • Christoph Lameter's avatar
      vmstat: get rid of the ugly cpu_stat_off variable · 7b8da4c7
      Christoph Lameter authored
      The cpu_stat_off variable is unecessary since we can check if a
      workqueue request is pending otherwise.  Removal of cpu_stat_off makes
      it pretty easy for the vmstat shepherd to ensure that the proper things
      happen.
      
      Removing the state also removes all races related to it.  Should a
      workqueue not be scheduled as needed for vmstat_update then the shepherd
      will notice and schedule it as needed.  Should a workqueue be
      unecessarily scheduled then the vmstat updater will disable it.
      
      [akpm@linux-foundation.org: fix indentation, per Michal]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.orgSigned-off-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b8da4c7
    • Greg Thelen's avatar
      memcg: fix stale mem_cgroup_force_empty() comment · 51038171
      Greg Thelen authored
      Commit f61c42a7 ("memcg: remove tasks/children test from
      mem_cgroup_force_empty()") removed memory reparenting from the function.
      
      Fix the function's comment.
      
      Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.comSigned-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51038171
    • Yu Zhao's avatar
      mm: use unsigned long constant for page flags · d2a1a1f0
      Yu Zhao authored
      struct page->flags is unsigned long, so when shifting bits we should use
      UL suffix to match it.
      
      Found this problem after I added 64-bit CPU specific page flags and
      failed to compile the kernel:
      
        mm/page_alloc.c: In function '__free_one_page':
        mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]
      
      Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.comSigned-off-by: default avatarYu Zhao <yuzhao@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2a1a1f0
    • Minfei Huang's avatar
      mm: use existing helper to convert "on"/"off" to boolean · 2a138dc7
      Minfei Huang authored
      It's more convenient to use existing function helper to convert string
      "on/off" to boolean.
      
      Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.comSigned-off-by: default avatarMinfei Huang <mnghuan@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a138dc7
    • Tetsuo Handa's avatar
      mm,writeback: don't use memory reserves for wb_start_writeback · 78ebc2f7
      Tetsuo Handa authored
      When writeback operation cannot make forward progress because memory
      allocation requests needed for doing I/O cannot be satisfied (e.g.
      under OOM-livelock situation), we can observe flood of order-0 page
      allocation failure messages caused by complete depletion of memory
      reserves.
      
      This is caused by unconditionally allocating "struct wb_writeback_work"
      objects using GFP_ATOMIC from PF_MEMALLOC context.
      
      __alloc_pages_nodemask() {
        __alloc_pages_slowpath() {
          __alloc_pages_direct_reclaim() {
            __perform_reclaim() {
              current->flags |= PF_MEMALLOC;
              try_to_free_pages() {
                do_try_to_free_pages() {
                  wakeup_flusher_threads() {
                    wb_start_writeback() {
                      kzalloc(sizeof(*work), GFP_ATOMIC) {
                        /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                      }
                    }
                  }
                }
              }
              current->flags &= ~PF_MEMALLOC;
            }
          }
        }
      }
      
      Since I/O is stalling, allocating writeback requests forever shall
      deplete memory reserves.  Fortunately, since wb_start_writeback() can
      fall back to wb_wakeup() when allocating "struct wb_writeback_work"
      failed, we don't need to allow wb_start_writeback() to use memory
      reserves.
      
        Mem-Info:
        active_anon:289393 inactive_anon:2093 isolated_anon:29
         active_file:10838 inactive_file:113013 isolated_file:859
         unevictable:0 dirty:108531 writeback:5308 unstable:0
         slab_reclaimable:5526 slab_unreclaimable:7077
         mapped:9970 shmem:2159 pagetables:2387 bounce:0
         free:3042 free_pcp:0 free_cma:0
        Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
        Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        126847 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
        Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
        kthreadd: page allocation failure: order:0, mode:0x2200020
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          warn_alloc_failed+0xf7/0x150
          __alloc_pages_nodemask+0x23f/0xa60
          alloc_pages_current+0x87/0x110
          new_slab+0x3a1/0x440
          ___slab_alloc+0x3cf/0x590
          __slab_alloc.isra.64+0x18/0x1d
          kmem_cache_alloc+0x11c/0x150
          wb_start_writeback+0x39/0x90
          wakeup_flusher_threads+0x7f/0xf0
          do_try_to_free_pages+0x1f9/0x410
          try_to_free_pages+0x94/0xc0
          __alloc_pages_nodemask+0x566/0xa60
          alloc_pages_current+0x87/0x110
          __page_cache_alloc+0xaf/0xc0
          pagecache_get_page+0x88/0x260
          grab_cache_page_write_begin+0x21/0x40
          xfs_vm_write_begin+0x2f/0xf0
          generic_perform_write+0xca/0x1c0
          xfs_file_buffered_aio_write+0xcc/0x1f0
          xfs_file_write_iter+0x84/0x140
          __vfs_write+0xc7/0x100
          vfs_write+0x9d/0x190
          SyS_write+0x50/0xc0
          entry_SYSCALL_64_fastpath+0x12/0x6a
        Mem-Info:
        active_anon:293335 inactive_anon:2093 isolated_anon:0
         active_file:10829 inactive_file:110045 isolated_file:32
         unevictable:0 dirty:109275 writeback:822 unstable:0
         slab_reclaimable:5489 slab_unreclaimable:10070
         mapped:9999 shmem:2159 pagetables:2420 bounce:0
         free:3 free_pcp:0 free_cma:0
        Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        123086 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
          cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
          node 0: slabs: 3218, objs: 205952, free: 0
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
      
      Assuming that somebody will find a better solution, let's apply this
      patch for now to stop bleeding, for this problem frequently prevents me
      from testing OOM livelock condition.
      
      Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.czSigned-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      78ebc2f7
    • Eric Engestrom's avatar
      Documentation: vm: fix spelling mistakes · 89474d50
      Eric Engestrom authored
      Signed-off-by: default avatarEric Engestrom <eric@engestrom.ch>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      89474d50
    • Weijie Yang's avatar
      mm fix commmets: if SPARSEMEM, pgdata doesn't have page_ext · 0c9ad804
      Weijie Yang authored
      If SPARSEMEM, use page_ext in mem_section
      if !SPARSEMEM, use page_ext in pgdata
      Signed-off-by: default avatarWeijie Yang <weijie.yang@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c9ad804
    • Chen Gang's avatar
      include/linux/hugetlb.h: use bool instead of int for hugepage_migration_supported() · d70c17d4
      Chen Gang authored
      It is used as a pure bool function within kernel source wide.
      Signed-off-by: default avatarChen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d70c17d4
    • Chen Gang's avatar
      include/linux/hugetlb*.h: clean up code · 7fab358d
      Chen Gang authored
      Macro HUGETLBFS_SB is clear enough, so one statement is clearer than 3
      lines statements.
      
      Remove redundant return statements for non-return functions, which can
      save lines, at least.
      Signed-off-by: default avatarChen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7fab358d
    • Ming Li's avatar
      mm/swap.c: put activate_page_pvecs and other pagevecs together · a4a921aa
      Ming Li authored
      Put the activate_page_pvecs definition next to those of the other
      pagevecs, for clarity.
      Signed-off-by: default avatarMing Li <mingli199x@qq.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4a921aa
    • Eric Dumazet's avatar
      mm: tighten fault_in_pages_writeable() · b8ca9e3a
      Eric Dumazet authored
      copy_page_to_iter_iovec() is currently the only user of
      fault_in_pages_writeable(), and it definitely can use fragments from
      high order pages.
      
      Make sure fault_in_pages_writeable() is only touching two adjacent pages
      at most, as claimed.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8ca9e3a
    • David Rientjes's avatar
      mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size · 297880f4
      David Rientjes authored
      The page_counter rounds limits down to page size values.  This makes
      sense, except in the case of hugetlb_cgroup where it's not possible to
      charge partial hugepages.  If the hugetlb_cgroup margin is less than the
      hugepage size being charged, it will fail as expected.
      
      Round the hugetlb_cgroup limit down to hugepage size, since it is the
      effective limit of the cgroup.
      
      For consistency, round down PAGE_COUNTER_MAX as well when a
      hugetlb_cgroup is created: this prevents error reports when a user
      cannot restore the value to the kernel default.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      297880f4
    • Rich Felker's avatar
      tmpfs/ramfs: fix VM_MAYSHARE mappings for NOMMU · 63678c32
      Rich Felker authored
      The nommu do_mmap expects f_op->get_unmapped_area to either succeed or
      return -ENOSYS for VM_MAYSHARE (e.g. private read-only) mappings.
      Returning addr in the non-MAP_SHARED case was completely wrong, and only
      happened to work because addr was 0.  However, it prevented VM_MAYSHARE
      mappings from sharing backing with the fs cache, and forced such
      mappings (including shareable program text) to be copied whenever the
      number of mappings transitioned from 0 to 1, impacting performance and
      memory usage.  Subsequent mappings beyond the first still correctly
      shared memory with the first.
      
      Instead, treat VM_MAYSHARE identically to VM_SHARED at the file ops level;
      do_mmap already handles the semantic differences between them.
      Signed-off-by: default avatarRich Felker <dalias@libc.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63678c32
    • Konstantin Khlebnikov's avatar
      mm: enable RLIMIT_DATA by default with workaround for valgrind · f4fcd558
      Konstantin Khlebnikov authored
      Since commit 84638335 ("mm: rework virtual memory accounting")
      RLIMIT_DATA limits both brk() and private mmap() but this's disabled by
      default because of incompatibility with older versions of valgrind.
      
      Valgrind always set limit to zero and fails if RLIMIT_DATA is enabled.
      Fortunately it changes only rlim_cur and keeps rlim_max for reverting
      limit back when needed.
      
      This patch checks current usage also against rlim_max if rlim_cur is
      zero.  This is safe because task anyway can increase rlim_cur up to
      rlim_max.  Size of brk is still checked against rlim_cur, so this part
      is completely compatible - zero rlim_cur forbids brk() but allows
      private mmap().
      
      Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.comSigned-off-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4fcd558
    • Yongji Xie's avatar
      mm: fix incorrect pfn passed to untrack_pfn() in remap_pfn_range() · d5957d2f
      Yongji Xie authored
      We use generic hooks in remap_pfn_range() to help archs to track pfnmap
      regions.  The code is something like:
      
        int remap_pfn_range()
        {
      	...
      	track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
      	...
      	pfn -= addr >> PAGE_SHIFT;
      	...
      	untrack_pfn(vma, pfn, PAGE_ALIGN(size));
      	...
        }
      
      Here we can easily find the pfn is changed but not recovered before
      untrack_pfn() is called.  That's incorrect.
      
      There are no known runtime effects - this is from inspection.
      Signed-off-by: default avatarYongji Xie <xyjxie@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5957d2f
    • Chris Wilson's avatar
      mm/vmalloc: keep a separate lazy-free list · 80c4bd7a
      Chris Wilson authored
      When mixing lots of vmallocs and set_memory_*() (which calls
      vm_unmap_aliases()) I encountered situations where the performance
      degraded severely due to the walking of the entire vmap_area list each
      invocation.
      
      One simple improvement is to add the lazily freed vmap_area to a
      separate lockless free list, such that we then avoid having to walk the
      full list on each purge.
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarRoman Pen <r.peniaev@gmail.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Pen <r.peniaev@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Shawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80c4bd7a
    • Alexander Kuleshov's avatar
      mm/memblock.c: move memblock_{add,reserve}_region into memblock_{add,reserve} · f705ac4b
      Alexander Kuleshov authored
      memblock_add_region() and memblock_reserve_region() do nothing specific
      before the call of memblock_add_range(), only print debug output.
      
      We can do the same in memblock_add() and memblock_reserve() since both
      memblock_add_region() and memblock_reserve_region() are not used by
      anybody outside of memblock.c and memblock_{add,reserve}() have the same
      set of flags and nids.
      
      Since memblock_add_region() and memblock_reserve_region() will be
      inlined, there will not be functional changes, but will improve code
      readability a little.
      Signed-off-by: default avatarAlexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f705ac4b
    • Chen Yucong's avatar
      mm/memory-failure.c: replace "MCE" with "Memory failure" · 495367c0
      Chen Yucong authored
      HWPoison was specific to some particular x86 platforms.  And it is often
      seen as high level machine check handler.  And therefore, 'MCE' is used
      for the format prefix of printk().  However, 'PowerNV' has also used
      HWPoison for handling memory errors[1], so 'MCE' is no longer suitable
      to memory_failure.c.
      
      Additionally, 'MCE' and 'Memory failure' have different context.  The
      former belongs to exception context and the latter belongs to process
      context.  Furthermore, HWPoison can also be used for off-lining those
      sub-health pages that do not trigger any machine check exception.
      
      This patch aims to replace 'MCE' with a more appropriate prefix.
      
      [1] commit 75eb3d9b ("powerpc/powernv: Get FSP memory errors
      and plumb into memory poison infrastructure.")
      Signed-off-by: default avatarChen Yucong <slaoub@gmail.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      495367c0
    • Yang Shi's avatar
      mm: thp: simplify the implementation of mk_huge_pmd() · 340a43be
      Yang Shi authored
      The implementation of mk_huge_pmd looks verbose, it could be just
      simplified to one line code.
      Signed-off-by: default avatarYang Shi <yang.shi@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      340a43be
    • Tetsuo Handa's avatar
      mm,oom: speed up select_bad_process() loop · f44666b0
      Tetsuo Handa authored
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct it was quite inefficient because
      each thread group was scanned num_threads times which can be a lot
      especially with processes with many threads.  Even though the OOM is
      extremely cold path it is always good to be as effective as possible
      when we are inside rcu_read_lock() - aka unpreemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and check it at
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jpSigned-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f44666b0
    • Michal Hocko's avatar
      oom: consider multi-threaded tasks in task_will_free_mem · 98748bd7
      Michal Hocko authored
      task_will_free_mem is a misnomer for a more complex PF_EXITING test for
      early break out from the oom killer because it is believed that such a
      task would release its memory shortly and so we do not have to select an
      oom victim and perform a disruptive action.
      
      Currently we make sure that the given task is not participating in the
      core dumping because it might get blocked for a long time - see commit
      d003f371 ("oom: don't assume that a coredumping thread will exit
      soon").
      
      The check can still do better though.  We shouldn't consider the task
      unless the whole thread group is going down.  This is rather unlikely
      but not impossible.  A single exiting thread would surely leave all the
      address space behind.  If we are really unlucky it might get stuck on
      the exit path and keep its TIF_MEMDIE and so block the oom killer.
      
      Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      98748bd7
    • Michal Hocko's avatar
      mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Michal Hocko authored
      Tetsuo has properly noted that mmput slow path might get blocked waiting
      for another party (e.g.  exit_aio waits for an IO).  If that happens the
      oom_reaper would be put out of the way and will not be able to process
      next oom victim.  We should strive for making this context as reliable
      and independent on other subsystems as much as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has reclaimed the victim's address space
      for most cases as much as possible and the remaining context shouldn't
      bind too much memory anymore.  The only exception is when mmap_sem
      trylock has failed which shouldn't happen too often.
      
      The issue is only theoretical but not impossible.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec8d7c14
    • Michal Hocko's avatar
      mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully · bb8a4b7f
      Michal Hocko authored
      Commit 36324a99 ("oom: clear TIF_MEMDIE after oom_reaper managed to
      unmap the address space") not only clears TIF_MEMDIE for oom reaped task
      but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the
      oom killer.  This works in simple cases but it is not sufficient for
      (unlikely) cases where the mm is shared between independent processes
      (as they do not share signal struct).  If the mm had only small amount
      of memory which could be reaped then another task sharing the mm could
      be selected and that wouldn't help to move out from the oom situation.
      
      Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same
      as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set.  Set the
      flag after __oom_reap_task is done with a task.  This will force the
      select_bad_process() to ignore all already oom reaped tasks as well as
      no such task is sacrificed for its parent.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb8a4b7f
    • Michal Hocko's avatar
      mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION · 31e49bfd
      Michal Hocko authored
      Joonsoo has reported that he is able to trigger OOM for !costly high
      order requests (heavy fork() workload close the OOM) with the new oom
      detection rework.  This is because we rely only on should_reclaim_retry
      when the compaction is disabled and it only checks watermarks for the
      requested order and so we might trigger OOM when there is a lot of free
      memory.
      
      It is not very clear what are the usual workloads when the compaction is
      disabled.  Relying on high order allocations heavily without any
      mechanism to create those orders except for unbound amount of reclaim is
      certainly not a good idea.
      
      To prevent from potential regressions let's help this configuration
      some.  We have to sacrifice the determinsm though because there simply
      is none here possible.  should_compact_retry implementation for
      !CONFIG_COMPACTION, which was empty so far, will do watermark check for
      order-0 on all eligible zones.  This will cause retrying until either
      the reclaim cannot make any further progress or all the zones are
      depleted even for order-0 pages.  This means that the number of retries
      is basically unbounded for !costly orders but that was the case before
      the rework as well so this shouldn't regress.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.orgReported-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31e49bfd
    • Michal Hocko's avatar
      mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Michal Hocko authored
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable
      the patch did miss a mis interaction between reclaim, compaction and the
      retry logic.  The direct reclaim tries to get zones over min watermark
      while compaction backs off and returns COMPACT_SKIPPED when all zones
      are below low watermark + 1<<order gap.  If we are getting really close
      to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
      high order request (e.g.  hugetlb order-9) while the reclaim is not able
      to release enough pages to get us over low watermark.  The reclaim is
      still able to make some progress (usually trashing over few remaining
      pages) so we are not able to break out from the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready" but it shows how things might go wrong when we
      approach the oom event horizont.
      
      The reason why compaction requires being over low rather than min
      watermark is not clear to me.  This check was there essentially since
      56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail though and
      we shouldn't pull it into the generic retry logic while we should be
      able to cope with such eventuality.  The only place in
      should_compact_retry where we retry without any upper bound is for
      compaction_withdrawn() case.
      
      Introduce compaction_zonelist_suitable function which checks the given
      zonelist and returns true only if there is at least one zone which would
      would unblock __compaction_suitable if more memory got reclaimed.  In
      this implementation it checks __compaction_suitable with NR_FREE_PAGES
      plus part of the reclaimable memory as the target for the watermark
      check.  The reclaimable memory is reduced linearly by the allocation
      order.  The idea is that we do not want to reclaim all the remaining
      memory for a single allocation request just unblock
      __compaction_suitable which doesn't guarantee we will make a further
      progress.
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided so we do not retry if there is no outlook for a further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require to have at least 64kB on the reclaimable LRUs while
      order-9 would need at least 32M which should be enough to not lock up.
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86a294a8
    • Michal Hocko's avatar
      mm: consider compaction feedback also for costly allocation · 7854ea6c
      Michal Hocko authored
      PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
      should_reclaim_retry currently where we decide to not retry after at
      least order worth of pages were reclaimed or the watermark check for at
      least one zone would succeed after reclaiming all pages if the reclaim
      hasn't made any progress.  Compaction feedback is mostly ignored and we
      just try to make sure that the compaction did at least something before
      giving up.
      
      The first condition was added by a41f24ea ("page allocator: smarter
      retry of costly-order allocations) and it assumed that lumpy reclaim
      could have created a page of the sufficient order.  Lumpy reclaim, has
      been removed quite some time ago so the assumption doesn't hold anymore.
      Remove the check for the number of reclaimed pages and rely on the
      compaction feedback solely.  should_reclaim_retry now only makes sure
      that we keep retrying reclaim for high order pages only if they are
      hidden by watermaks so order-0 reclaim makes really sense.
      
      should_compact_retry now keeps retrying even for the costly allocations.
      The number of retries is reduced wrt.  !costly requests because they are
      less important and harder to grant and so their pressure shouldn't cause
      contention for other requests or cause an over reclaim.  We also do not
      reset no_progress_loops for costly request to make sure we do not keep
      reclaiming too agressively.
      
      This has been tested by running a process which fragments memory:
      	- compact memory
      	- mmap large portion of the memory (1920M on 2GRAM machine with 2G
      	  of swapspace)
      	- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
      	  steps until certain amount of memory is freed (250M in my test)
      	  and reduce the step to (step / 2) + 1 after reaching the end of
      	  the mapping
      	- then run a script which populates the page cache 2G (MemTotal)
      	  from /dev/zero to a new file
      And then tries to allocate
      nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
      huge pages.
      
      root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
      Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
      Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437
      
      * This is the /proc/buddyinfo after the compaction
      
      Done fragmenting. size=2013265920 freed=262144000
      Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0
      
      * /proc/buddyinfo after memory got fragmented
      
      Executing "/root/alloc_hugepages.sh"
      Eating some pagecache
      508623+0 records in
      508623+0 records out
      2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0
      
      * /proc/buddyinfo after page cache got eaten
      
      Trying to allocate 129
      129
      
      * 129 hugepages requested and all of them granted.
      
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0
      
      * /proc/buddyinfo after hugetlb allocation.
      
      10 runs will behave as follows:
      Trying to allocate 130
      130
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 132
      132
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      
      So basically 100% success for all 10 attempts.
      Without the patch numbers looked much worse:
      Trying to allocate 128
      12
      --
      Trying to allocate 129
      14
      --
      Trying to allocate 129
      7
      --
      Trying to allocate 129
      16
      --
      Trying to allocate 129
      30
      --
      Trying to allocate 129
      38
      --
      Trying to allocate 129
      19
      --
      Trying to allocate 129
      37
      --
      Trying to allocate 129
      28
      --
      Trying to allocate 129
      37
      
      Just for completness the base kernel without oom detection rework looks
      as follows:
      Trying to allocate 127
      30
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      52
      --
      Trying to allocate 128
      32
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      10
      --
      Trying to allocate 129
      32
      --
      Trying to allocate 128
      14
      --
      Trying to allocate 128
      16
      --
      Trying to allocate 129
      8
      
      As we can see the success rate is much more volatile and smaller without
      this patch. So the patch not only makes the retry logic for costly
      requests more sensible the success rate is even higher.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7854ea6c
    • Michal Hocko's avatar
      mm, oom: protect !costly allocations some more · 33c2d214
      Michal Hocko authored
      should_reclaim_retry will give up retries for higher order allocations
      if none of the eligible zones has any requested or higher order pages
      available even if we pass the watermak check for order-0.  This is done
      because there is no guarantee that the reclaimable and currently free
      pages will form the required order.
      
      This can, however, lead to situations where the high-order request (e.g.
      order-2 required for the stack allocation during fork) will trigger OOM
      too early - e.g.  after the first reclaim/compaction round.  Such a
      system would have to be highly fragmented and there is no guarantee
      further reclaim/compaction attempts would help but at least make sure
      that the compaction was active before we go OOM and keep retrying even
      if should_reclaim_retry tells us to oom if
      
      	- the last compaction round backed off or
      	- we haven't completed at least MAX_COMPACT_RETRIES active
      	  compaction rounds.
      
      The first rule ensures that the very last attempt for compaction was not
      ignored while the second guarantees that the compaction has done some
      work.  Multiple retries might be needed to prevent occasional pigggy
      backing of other contexts to steal the compacted pages before the
      current context manages to retry to allocate them.
      
      compaction_failed() is taken as a final word from the compaction that
      the retry doesn't make much sense.  We have to be careful though because
      the first compaction round is MIGRATE_ASYNC which is rather weak as it
      ignores pages under writeback and gives up too easily in other
      situations.  We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
      has been used before we give up.  With this logic in place we do not
      have to increase the migration mode unconditionally and rather do it
      only if the compaction failed for the weaker mode.  A nice side effect
      is that the stronger migration mode is used only when really needed so
      this has a potential of smaller latencies in some cases.
      
      Please note that the compaction doesn't tell us much about how
      successful it was when returning compaction_made_progress so we just
      have to blindly trust that another retry is worthwhile and cap the
      number to something reasonable to guarantee a convergence.
      
      If the given number of successful retries is not sufficient for a
      reasonable workloads we should focus on the collected compaction
      tracepoints data and try to address the issue in the compaction code.
      If this is not feasible we can increase the retries limit.
      
      [mhocko@suse.com: fix warning]
        Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.czSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33c2d214
    • Michal Hocko's avatar
      mm: throttle on IO only when there are too many dirty and writeback pages · ede37713
      Michal Hocko authored
      wait_iff_congested has been used to throttle allocator before it retried
      another round of direct reclaim to allow the writeback to make some
      progress and prevent reclaim from looping over dirty/writeback pages
      without making any progress.
      
      We used to do congestion_wait before commit 0e093d99 ("writeback: do
      not sleep on the congestion queue if there are no congested BDIs or if
      significant congestion is not being encountered in the current zone")
      but that led to undesirable stalls and sleeping for the full timeout
      even when the BDI wasn't congested.  Hence wait_iff_congested was used
      instead.
      
      But it seems that even wait_iff_congested doesn't work as expected.  We
      might have a small file LRU list with all pages dirty/writeback and yet
      the bdi is not congested so this is just a cond_resched in the end and
      can end up triggering pre mature OOM.
      
      This patch replaces the unconditional wait_iff_congested by
      congestion_wait which is executed only if we _know_ that the last round
      of direct reclaim didn't make any progress and dirty+writeback pages are
      more than a half of the reclaimable pages on the zone which might be
      usable for our target allocation.  This shouldn't reintroduce stalls
      fixed by 0e093d99 because congestion_wait is called only when we are
      getting hopeless when sleeping is a better choice than OOM with many
      pages under IO.
      
      We have to preserve logic introduced by commit 373ccbe5 ("mm,
      vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
      progress") into the __alloc_pages_slowpath now that wait_iff_congested
      is not used anymore.  As the only remaining user of wait_iff_congested
      is shrink_inactive_list we can remove the WQ specific short sleep from
      wait_iff_congested because the sleep is needed to be done only once in
      the allocation retry cycle.
      
      [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
       Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ede37713
    • Michal Hocko's avatar
      mm, oom: rework oom detection · 0a0337e0
      Michal Hocko authored
      __alloc_pages_slowpath has traditionally relied on the direct reclaim
      and did_some_progress as an indicator that it makes sense to retry
      allocation rather than declaring OOM.  shrink_zones had to rely on
      zone_reclaimable if shrink_zone didn't make any progress to prevent from
      a premature OOM killer invocation - the LRU might be full of dirty or
      writeback pages and direct reclaim cannot clean those up.
      
      zone_reclaimable allows to rescan the reclaimable lists several times
      and restart if a page is freed.  This is really subtle behavior and it
      might lead to a livelock when a single freed page keeps allocator
      looping but the current task will not be able to allocate that single
      page.  OOM killer would be more appropriate than looping without any
      progress for unbounded amount of time.
      
      This patch changes OOM detection logic and pulls it out from shrink_zone
      which is too low to be appropriate for any high level decisions such as
      OOM which is per zonelist property.  It is __alloc_pages_slowpath which
      knows how many attempts have been done and what was the progress so far
      therefore it is more appropriate to implement this logic.
      
      The new heuristic is implemented in should_reclaim_retry helper called
      from __alloc_pages_slowpath.  It tries to be more deterministic and
      easier to follow.  It builds on an assumption that retrying makes sense
      only if the currently reclaimable memory + free pages would allow the
      current allocation request to succeed (as per __zone_watermark_ok) at
      least for one zone in the usable zonelist.
      
      This alone wouldn't be sufficient, though, because the writeback might
      get stuck and reclaimable pages might be pinned for a really long time
      or even depend on the current allocation context.  Therefore there is a
      backoff mechanism implemented which reduces the reclaim target after
      each reclaim round without any progress.  This means that we should
      eventually converge to only NR_FREE_PAGES as the target and fail on the
      wmark check and proceed to OOM.  The backoff is simple and linear with
      1/16 of the reclaimable pages for each round without any progress.  We
      are optimistic and reset counter for successful reclaim rounds.
      
      Costly high order pages mostly preserve their semantic and those without
      __GFP_REPEAT fail right away while those which have the flag set will
      back off after the amount of reclaimable pages reaches equivalent of the
      requested order.  The only difference is that if there was no progress
      during the reclaim we rely on zone watermark check.  This is more
      logical thing to do than previous 1<<order attempts which were a result
      of zone_reclaimable faking the progress.
      
      [vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
      [hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
      [rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
      [rientjes@google.com: shrink_zones doesn't need to return anything]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a0337e0
    • Michal Hocko's avatar
      mm, compaction: abstract compaction feedback to helpers · cab1802b
      Michal Hocko authored
      Compaction can provide a wild variation of feedback to the caller.  Many
      of them are implementation specific and the caller of the compaction
      (especially the page allocator) shouldn't be bound to specifics of the
      current implementation.
      
      This patch abstracts the feedback into three basic types:
      	- compaction_made_progress - compaction was active and made some
      	  progress.
      	- compaction_failed - compaction failed and further attempts to
      	  invoke it would most probably fail and therefore it is not
      	  worth retrying
      	- compaction_withdrawn - compaction wasn't invoked for an
                implementation specific reasons. In the current implementation
                it means that the compaction was deferred, contended or the
                page scanners met too early without any progress. Retrying is
                still worthwhile.
      
      [vbabka@suse.cz: do not change thp back off behavior]
      [akpm@linux-foundation.org: fix typo in comment, per Hillf]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cab1802b
    • Michal Hocko's avatar
      mm, compaction: simplify __alloc_pages_direct_compact feedback interface · c5d01d0d
      Michal Hocko authored
      __alloc_pages_direct_compact communicates potential back off by two
      variables:
      	- deferred_compaction tells that the compaction returned
      	  COMPACT_DEFERRED
      	- contended_compaction is set when there is a contention on
      	  zone->lock resp. zone->lru_lock locks
      
      __alloc_pages_slowpath then backs of for THP allocation requests to
      prevent from long stalls. This is rather messy and it would be much
      cleaner to return a single compact result value and hide all the nasty
      details into __alloc_pages_direct_compact.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c5d01d0d
    • Michal Hocko's avatar
      mm, compaction: update compaction_result ordering · 4f9a358c
      Michal Hocko authored
      compaction_result will be used as the primary feedback channel for
      compaction users.  At the same time try_to_compact_pages (and
      potentially others) assume a certain ordering where a more specific
      feedback takes precendence.
      
      This gets a bit awkward when we have conflicting feedback from different
      zones.  E.g one returing COMPACT_COMPLETE meaning the full zone has been
      scanned without any outcome while other returns with COMPACT_PARTIAL aka
      made some progress.  The caller should get COMPACT_PARTIAL because that
      means that the compaction still can make some progress.  The same
      applies for COMPACT_PARTIAL vs COMPACT_PARTIAL_SKIPPED.
      
      Reorder PARTIAL to be the largest one so the larger the value is the
      more progress we have done.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f9a358c
    • Michal Hocko's avatar
      mm, compaction: distinguish between full and partial COMPACT_COMPLETE · c8f7de0b
      Michal Hocko authored
      COMPACT_COMPLETE now means that compaction and free scanner met.  This
      is not very useful information if somebody just wants to use this
      feedback and make any decisions based on that.  The current caller might
      be a poor guy who just happened to scan tiny portion of the zone and
      that could be the reason no suitable pages were compacted.  Make sure we
      distinguish the full and partial zone walks.
      
      Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
      and be optimistic in retrying.
      
      The existing users of COMPACT_COMPLETE are conservatively changed to use
      COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
      reconsidered and only defer the compaction only for COMPACT_COMPLETE
      with the new semantic.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8f7de0b
    • Michal Hocko's avatar
      mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED · 1d4746d3
      Michal Hocko authored
      try_to_compact_pages() can currently return COMPACT_SKIPPED even when
      the compaction is defered for some zone just because zone DMA is skipped
      in 99% of cases due to watermark checks.  This makes COMPACT_DEFERRED
      basically unusable for the page allocator as a feedback mechanism.
      
      Make sure we distinguish those two states properly and switch their
      ordering in the enum.  This would mean that the COMPACT_SKIPPED will be
      returned only when all eligible zones are skipped.
      
      As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
      will be more precise and we would bail out rather than reclaim.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d4746d3
    • Michal Hocko's avatar
      mm, compaction: cover all compaction mode in compact_zone · c46649de
      Michal Hocko authored
      The compiler is complaining after "mm, compaction: change COMPACT_
      constants into enum"
      
        mm/compaction.c: In function `compact_zone':
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
          switch (ret) {
          ^
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
        mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]
      
      compaction_suitable is allowed to return only COMPACT_PARTIAL,
      COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
      impossible.  Put a VM_BUG_ON to catch an impossible return value.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c46649de
    • Michal Hocko's avatar
      mm, compaction: change COMPACT_ constants into enum · ea7ab982
      Michal Hocko authored
      Compaction code is doing weird dances between COMPACT_FOO -> int ->
      unsigned long
      
      But there doesn't seem to be any reason for that.  All functions which
      return/use one of those constants are not expecting any other value so it
      really makes sense to define an enum for them and make it clear that no
      other values are expected.
      
      This is a pure cleanup and shouldn't introduce any functional changes.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea7ab982
    • Michal Hocko's avatar
      vmscan: consider classzone_idx in compaction_ready · b6459cc1
      Michal Hocko authored
      Motivation:
      As pointed out by Linus [2][3] relying on zone_reclaimable as a way to
      communicate the reclaim progress is rater dubious. I tend to agree,
      not only it is really obscure, it is not hard to imagine cases where a
      single page freed in the loop keeps all the reclaimers looping without
      getting any progress because their gfp_mask wouldn't allow to get that
      page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
      rare so it doesn't happen in the practice but the current logic which we
      have is rather obscure and hard to follow a also non-deterministic.
      
      This is an attempt to make the OOM detection more deterministic and
      easier to follow because each reclaimer basically tracks its own
      progress which is implemented at the page allocator layer rather spread
      out between the allocator and the reclaim.  The more on the
      implementation is described in the first patch.
      
      I have tested several different scenarios but it should be clear that
      testing OOM killer is quite hard to be representative.  There is usually
      a tiny gap between almost OOM and full blown OOM which is often time
      sensitive.  Anyway, I have tested the following 2 scenarios and I would
      appreciate if there are more to test.
      
      Testing environment: a virtual machine with 2G of RAM and 2CPUs without
      any swap to make the OOM more deterministic.
      
      1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
         file size, removes the files and starts over again) running in
         parallel for 10s to build up a lot of dirty pages when 100 parallel
         mem_eaters (anon private populated mmap which waits until it gets
         signal) with 80M each.
      
         This causes an OOM flood of course and I have compared both patched
         and unpatched kernels. The test is considered finished after there
         are no OOM conditions detected. This should tell us whether there are
         any excessive kills or some of them premature (e.g. due to dirty pages):
      
      I have performed two runs this time each after a fresh boot.
      
      * base kernel
      $ grep "Out of memory:" base-oom-run1.log | wc -l
      78
      $ grep "Out of memory:" base-oom-run2.log | wc -l
      78
      
      $ grep "Kill process" base-oom-run1.log | tail -n1
      [   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
      $ grep "Kill process" base-oom-run2.log | tail -n1
      [   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child
      
      $ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
      $ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52
      
      $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
      1
      $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
      3
      
      * patched kernel
      $ grep "Out of memory:" patched-oom-run1.log | wc -l
      78
      miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
      77
      
      e grep "Kill process" patched-oom-run1.log | tail -n1
      [  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
      $ grep "Kill process" patched-oom-run2.log | tail -n1
      [  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child
      
      $ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
      $ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77
      
      e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
      2
      $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
      3
      
      The patched kernel run noticeably longer while invoking OOM killer same
      number of times. This means that the original implementation is much
      more aggressive and triggers the OOM killer sooner. free pages stats
      show that neither kernels went OOM too early most of the time, though. I
      guess the difference is in the backoff when retries without any progress
      do sleep for a while if there is memory under writeback or dirty which
      is highly likely considering the parallel IO.
      Both kernels have seen races where zone wasn't marked unreclaimable
      and we still hit the OOM killer. This is most likely a race where
      a task managed to exit between the last allocation attempt and the oom
      killer invocation.
      
      2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
         memory as possible without triggering the OOM killer. This required a lot
         of tuning but I've considered 3 consecutive runs in three different boots
         without OOM as a success.
      
      * base kernel
      size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
      
      * patched kernel
      size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)
      
      That means 40M more memory was usable without triggering OOM killer. The
      base kernel sometimes managed to handle the same as patched but it
      wasn't consistent and failed in at least on of the 3 runs. This seems
      like a minor improvement.
      
      I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented
      memory and under memory pressure. The results are in patch 11 where the
      logic is implemented. In short I can see huge improvement there.
      
      I am certainly interested in other usecases as well as well as any
      feedback. Especially those which require higher order requests.
      
      This patch (of 14):
      
      While playing with the oom detection rework [1] I have noticed that my
      heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
      where the reclaim hasn't made any progress but did_some_progress didn't
      reflect that and compaction_suitable was backing off because no zone is
      above low wmark + 1 << order.
      
      It turned out that this is in fact an old standing bug in
      compaction_ready which ignores the requested_highidx and did the
      watermark check for 0 classzone_idx.  This succeeds for zone DMA most
      of the time as the zone is mostly unused because of lowmem protection.
      As a result costly high order allocatios always report a successfull
      progress even when there was none.  This wasn't a problem so far
      because these allocations usually fail quite early or retry only few
      times with __GFP_REPEAT but this will change after later patch in this
      series so make sure to not lie about the progress and propagate
      requested_highidx down to compaction_ready and use it for both the
      watermak check and compaction_suitable to fix this issue.
      
      [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
      [2] https://lkml.org/lkml/2015/10/12/808
      [3] https://lkml.org/lkml/2015/10/13/597Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6459cc1
    • Rik van Riel's avatar
      mm: vmscan: reduce size of inactive file list · 59dc76b0
      Rik van Riel authored
      The inactive file list should still be large enough to contain readahead
      windows and freshly written file data, but it no longer is the only
      source for detecting multiple accesses to file pages.  The workingset
      refault measurement code causes recently evicted file pages that get
      accessed again after a shorter interval to be promoted directly to the
      active list.
      
      With that mechanism in place, we can afford to (on a larger system)
      dedicate more memory to the active file list, so we can actually cache
      more of the frequently used file pages in memory, and not have them
      pushed out by streaming writes, once-used streaming file reads, etc.
      
      This can help things like database workloads, where only half the page
      cache can currently be used to cache the database working set.  This
      patch automatically increases that fraction on larger systems, using the
      same ratio that has already been used for anonymous memory.
      
      [hannes@cmpxchg.org: cgroup-awareness]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarAndres Freund <andres@anarazel.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59dc76b0
    • Johannes Weiner's avatar
      mm: filemap: only do access activations on reads · bbddabe2
      Johannes Weiner authored
      Andres observed that his database workload is struggling with the
      transaction journal creating pressure on frequently read pages.
      
      Access patterns like transaction journals frequently write the same
      pages over and over, but in the majority of cases those pages are never
      read back.  There are no caching benefits to be had for those pages, so
      activating them and having them put pressure on pages that do benefit
      from caching is a bad choice.
      
      Leave page activations to read accesses and don't promote pages based on
      writes alone.
      
      It could be said that partially written pages do contain cache-worthy
      data, because even if *userspace* does not access the unwritten part,
      the kernel still has to read it from the filesystem for correctness.
      However, a counter argument is that these pages enjoy at least *some*
      protection over other inactive file pages through the writeback cache,
      in the sense that dirty pages are written back with a delay and cache
      reclaim leaves them alone until they have been written back to disk.
      Should that turn out to be insufficient and we see increased read IO
      from partial writes under memory pressure, we can always go back and
      update grab_cache_page_write_begin() to take (pos, len) so that it can
      tell partial writes from pages that don't need partial reads.  But for
      now, keep it simple.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarAndres Freund <andres@anarazel.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bbddabe2
    • Rik van Riel's avatar
      mm: workingset: only do workingset activations on reads · f0281a00
      Rik van Riel authored
      This is a follow-up to
      
        http://www.spinics.net/lists/linux-mm/msg101739.html
      
      where Andres reported his database workingset being pushed out by the
      minimum size enforcement of the inactive file list - currently 50% of
      cache - as well as repeatedly written file pages that are never actually
      read.
      
      Two changes fell out of the discussions.  The first change observes that
      pages that are only ever written don't benefit from caching beyond what
      the writeback cache does for partial page writes, and so we shouldn't
      promote them to the active file list where they compete with pages whose
      cached data is actually accessed repeatedly.  This change comes in two
      patches - one for in-cache write accesses and one for refaults triggered
      by writes, neither of which should promote a cache page.
      
      Second, with the refault detection we don't need to set 50% of the cache
      aside for used-once cache anymore since we can detect frequently used
      pages even when they are evicted between accesses.  We can allow the
      active list to be bigger and thus protect a bigger workingset that isn't
      challenged by streamers.  Depending on the access patterns, this can
      increase major faults during workingset transitions for better
      performance during stable phases.
      
      This patch (of 3):
      
      When rewriting a page, the data in that page is replaced with new data.
      This means that evicting something else from the active file list, in
      order to cache data that will be replaced by something else, is likely
      to be a waste of memory.
      
      It is better to save the active list for frequently read pages, because
      reads actually use the data that is in the page.
      
      This patch ignores partial writes, because it is unclear whether the
      complexity of identifying those is worth any potential performance gain
      obtained from better caching pages that see repeated partial writes at
      large enough intervals to not get caught by the use-twice promotion code
      used for the inactive file list.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarAndres Freund <andres@anarazel.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0281a00