1. 28 Jul, 2016 31 commits
  2. 26 Jul, 2016 2 commits
  3. 21 May, 2016 3 commits
    • Michal Hocko's avatar
      mm, oom: rework oom detection · 0a0337e0
      Michal Hocko authored
      
      __alloc_pages_slowpath has traditionally relied on the direct reclaim
      and did_some_progress as an indicator that it makes sense to retry
      allocation rather than declaring OOM.  shrink_zones had to rely on
      zone_reclaimable if shrink_zone didn't make any progress to prevent from
      a premature OOM killer invocation - the LRU might be full of dirty or
      writeback pages and direct reclaim cannot clean those up.
      
      zone_reclaimable allows to rescan the reclaimable lists several times
      and restart if a page is freed.  This is really subtle behavior and it
      might lead to a livelock when a single freed page keeps allocator
      looping but the current task will not be able to allocate that single
      page.  OOM killer would be more appropriate than looping without any
      progress for unbounded amount of time.
      
      This patch changes OOM detection logic and pulls it out from shrink_zone
      which is too low to be appropriate for any high level decisions such as
      OOM which is per zonelist property.  It is __alloc_pages_slowpath which
      knows how many attempts have been done and what was the progress so far
      therefore it is more appropriate to implement this logic.
      
      The new heuristic is implemented in should_reclaim_retry helper called
      from __alloc_pages_slowpath.  It tries to be more deterministic and
      easier to follow.  It builds on an assumption that retrying makes sense
      only if the currently reclaimable memory + free pages would allow the
      current allocation request to succeed (as per __zone_watermark_ok) at
      least for one zone in the usable zonelist.
      
      This alone wouldn't be sufficient, though, because the writeback might
      get stuck and reclaimable pages might be pinned for a really long time
      or even depend on the current allocation context.  Therefore there is a
      backoff mechanism implemented which reduces the reclaim target after
      each reclaim round without any progress.  This means that we should
      eventually converge to only NR_FREE_PAGES as the target and fail on the
      wmark check and proceed to OOM.  The backoff is simple and linear with
      1/16 of the reclaimable pages for each round without any progress.  We
      are optimistic and reset counter for successful reclaim rounds.
      
      Costly high order pages mostly preserve their semantic and those without
      __GFP_REPEAT fail right away while those which have the flag set will
      back off after the amount of reclaimable pages reaches equivalent of the
      requested order.  The only difference is that if there was no progress
      during the reclaim we rely on zone watermark check.  This is more
      logical thing to do than previous 1<<order attempts which were a result
      of zone_reclaimable faking the progress.
      
      [vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
      [hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
      [rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
      [rientjes@google.com: shrink_zones doesn't need to return anything]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a0337e0
    • Michal Hocko's avatar
      vmscan: consider classzone_idx in compaction_ready · b6459cc1
      Michal Hocko authored
      Motivation:
      As pointed out by Linus [2][3] relying on zone_reclaimable as a way to
      communicate the reclaim progress is rater dubious. I tend to agree,
      not only it is really obscure, it is not hard to imagine cases where a
      single page freed in the loop keeps all the reclaimers looping without
      getting any progress because their gfp_mask wouldn't allow to get that
      page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
      rare so it doesn't happen in the practice but the current logic which we
      have is rather obscure and hard to follow a also non-deterministic.
      
      This is an attempt to make the OOM detection more deterministic and
      easier to follow because each reclaimer basically tracks its own
      progress which is implemented at the page allocator layer rather spread
      out between the allocator and the reclaim.  The more on the
      implementation is described in the first patch.
      
      I have tested several different scenarios but it should be clear that
      testing OOM killer is quite hard to be representative.  There is usually
      a tiny gap between almost OOM and full blown OOM which is often time
      sensitive.  Anyway, I have tested the following 2 scenarios and I would
      appreciate if there are more to test.
      
      Testing environment: a virtual machine with 2G of RAM and 2CPUs without
      any swap to make the OOM more deterministic.
      
      1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
         file size, removes the files and starts over again) running in
         parallel for 10s to build up a lot of dirty pages when 100 parallel
         mem_eaters (anon private populated mmap which waits until it gets
         signal) with 80M each.
      
         This causes an OOM flood of course and I have compared both patched
         and unpatched kernels. The test is considered finished after there
         are no OOM conditions detected. This should tell us whether there are
         any excessive kills or some of them premature (e.g. due to dirty pages):
      
      I have performed two runs this time each after a fresh boot.
      
      * base kernel
      $ grep "Out of memory:" base-oom-run1.log | wc -l
      78
      $ grep "Out of memory:" base-oom-run2.log | wc -l
      78
      
      $ grep "Kill process" base-oom-run1.log | tail -n1
      [   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
      $ grep "Kill process" base-oom-run2.log | tail -n1
      [   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child
      
      $ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
      $ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52
      
      $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
      1
      $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
      3
      
      * patched kernel
      $ grep "Out of memory:" patched-oom-run1.log | wc -l
      78
      miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
      77
      
      e grep "Kill process" patched-oom-run1.log | tail -n1
      [  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
      $ grep "Kill process" patched-oom-run2.log | tail -n1
      [  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child
      
      $ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
      $ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
      min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77
      
      e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
      2
      $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
      3
      
      The patched kernel run noticeably longer while invoking OOM killer same
      number of times. This means that the original implementation is much
      more aggressive and triggers the OOM killer sooner. free pages stats
      show that neither kernels went OOM too early most of the time, though. I
      guess the difference is in the backoff when retries without any progress
      do sleep for a while if there is memory under writeback or dirty which
      is highly likely considering the parallel IO.
      Both kernels have seen races where zone wasn't marked unreclaimable
      and we still hit the OOM killer. This is most likely a race where
      a task managed to exit between the last allocation attempt and the oom
      killer invocation.
      
      2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
         memory as possible without triggering the OOM killer. This required a lot
         of tuning but I've considered 3 consecutive runs in three different boots
         without OOM as a success.
      
      * base kernel
      size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
      
      * patched kernel
      size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)
      
      That means 40M more memory was usable without triggering OOM killer. The
      base kernel sometimes managed to handle the same as patched but it
      wasn't consistent and failed in at least on of the 3 runs. This seems
      like a minor improvement.
      
      I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented
      memory and under memory pressure. The results are in patch 11 where the
      logic is implemented. In short I can see huge improvement there.
      
      I am certainly interested in other usecases as well as well as any
      feedback. Especially those which require higher order requests.
      
      This patch (of 14):
      
      While playing with the oom detection rework [1] I have noticed that my
      heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
      where the reclaim hasn't made any progress but did_some_progress didn't
      reflect that and compaction_suitable was backing off because no zone is
      above low wmark + 1 << order.
      
      It turned out that this is in fact an old standing bug in
      compaction_ready which ignores the requested_highidx and did the
      watermark check for 0 classzone_idx.  This succeeds for zone DMA most
      of the time as the zone is mostly unused because of lowmem protection.
      As a result costly high order allocatios always report a successfull
      progress even when there was none.  This wasn't a problem so far
      because these allocations usually fail quite early or retry only few
      times with __GFP_REPEAT but this will change after later patch in this
      series so make sure to not lie about the progress and propagate
      requested_highidx down to compaction_ready and use it for both the
      watermak check and compaction_suitable to fix this issue.
      
      [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
      [2] https://lkml.org/lkml/2015/10/12/808
      [3] https://lkml.org/lkml/2015/10/13/597
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6459cc1
    • Rik van Riel's avatar
      mm: vmscan: reduce size of inactive file list · 59dc76b0
      Rik van Riel authored
      
      The inactive file list should still be large enough to contain readahead
      windows and freshly written file data, but it no longer is the only
      source for detecting multiple accesses to file pages.  The workingset
      refault measurement code causes recently evicted file pages that get
      accessed again after a shorter interval to be promoted directly to the
      active list.
      
      With that mechanism in place, we can afford to (on a larger system)
      dedicate more memory to the active file list, so we can actually cache
      more of the frequently used file pages in memory, and not have them
      pushed out by streaming writes, once-used streaming file reads, etc.
      
      This can help things like database workloads, where only half the page
      cache can currently be used to cache the database working set.  This
      patch automatically increases that fraction on larger systems, using the
      same ratio that has already been used for anonymous memory.
      
      [hannes@cmpxchg.org: cgroup-awareness]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarAndres Freund <andres@anarazel.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59dc76b0
  4. 20 May, 2016 2 commits
    • Hugh Dickins's avatar
      mm: update_lru_size do the __mod_zone_page_state · 9d5e6a9f
      Hugh Dickins authored
      
      Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
      reclaim was removed) that lru_size can be updated by -nr_taken once per
      call to isolate_lru_pages(), instead of page by page.
      
      Update it inside isolate_lru_pages(), or at its two callsites? I chose
      to update it at the callsites, rearranging and grouping the updates by
      nr_taken and nr_scanned together in both.
      
      With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
      __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
      some more calls in a future commit.  Make the code a little smaller and
      simpler by incorporating stat update in lru_size update.
      
      The exception was move_active_pages_to_lru(), which aggregated the
      pgmoved stat update separately from the individual lru_size updates; but
      I still think this a simplification worth making.
      
      However, the __mod_zone_page_state is not peculiar to mem_cgroups: so
      better use the name update_lru_size, calls mem_cgroup_update_lru_size
      when CONFIG_MEMCG.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d5e6a9f
    • Joonsoo Kim's avatar
      mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Joonsoo Kim authored
      
      Many developers already know that field for reference count of the
      struct page is _count and atomic type.  They would try to handle it
      directly and this could break the purpose of page reference count
      tracepoint.  To prevent direct _count modification, this patch rename it
      to _refcount and add warning message on the code.  After that, developer
      who need to handle reference count will find that field should not be
      accessed directly.
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0139aa7b
  5. 29 Apr, 2016 2 commits
    • Vlastimil Babka's avatar
      mm: wake kcompactd before kswapd's short sleep · fd901c95
      Vlastimil Babka authored
      When kswapd goes to sleep it checks if the node is balanced and at first
      it sleeps only for HZ/10 time, then rechecks if the node is still
      balanced and nobody has woken it during the initial sleep.  Only then it
      goes fully sleep until an allocation slowpath wakes it up again.
      
      For higher-order allocations, waking up kcompactd is done only before
      the full sleep.  This turns out to be an issue in case another
      high-order allocation fails during the initial sleep.  It will wake
      kswapd up, however kswapd considers the zone balanced from the order-0
      perspective, and will just quickly try to sleep again.  So if there's a
      longer stream of high-order allocations hitting the slowpath and waking
      up kswapd, it might never actually wake up kcompactd, which may be
      considered a regression from kswapd-based compaction.  In the worst
      case, it might be that a single allocation that cannot direct
      reclaim/compact itself is waking kswapd in the retry loop and preventing
      kcompactd...
      fd901c95
    • Minchan Kim's avatar
      mm: vmscan: reclaim highmem zone if buffer_heads is over limit · 7bf52fb8
      Minchan Kim authored
      We have been reclaimed highmem zone if buffer_heads is over limit but
      commit 6b4f7799 ("mm: vmscan: invoke slab shrinkers from
      shrink_zone()") changed the behavior so it doesn't reclaim highmem zone
      although buffer_heads is over the limit.  This patch restores the logic.
      
      Fixes: 6b4f7799
      
       ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7bf52fb8