• Johannes Weiner's avatar
    mm: vmscan: do not share cgroup iteration between reclaimers · 1ba6fc9a
    Johannes Weiner authored
    One of our services observed a high rate of cgroup OOM kills in the
    presence of large amounts of clean cache.  Debugging showed that the
    culprit is the shared cgroup iteration in page reclaim.
    
    Under high allocation concurrency, multiple threads enter reclaim at the
    same time.  Fearing overreclaim when we first switched from the single
    global LRU to cgrouped LRU lists, we introduced a shared iteration state
    for reclaim invocations - whether 1 or 20 reclaimers are active
    concurrently, we only walk the cgroup tree once: the 1st reclaimer
    reclaims the first cgroup, the second the second one etc.  With more
    reclaimers than cgroups, we start another walk from the top.
    
    This sounded reasonable at the time, but the problem is that reclaim
    concurrency doesn't scale with allocation concurrency.  As reclaim
    concurrency increases, the amount of memory individual reclaimers get to
    scan gets smaller and smaller.  Individual reclaimers may only see one
    cgroup per cycle, and that may not have much reclaimable memory.  We see
    individual reclaimers declare OOM when there is plenty of reclaimable
    memory available in cgroups they didn't visit.
    
    This patch does away with the shared iterator, and every reclaimer is
    allowed to scan the full cgroup tree and see all of reclaimable memory,
    just like it would on a non-cgrouped system.  This way, when OOM is
    declared, we know that the reclaimer actually had a chance.
    
    To still maintain fairness in reclaim pressure, disallow cgroup reclaim
    from bailing out of the tree walk early.  Kswapd and regular direct
    reclaim already don't bail, so it's not clear why limit reclaim would have
    to, especially since it only walks subtrees to begin with.
    
    This change completely eliminates the OOM kills on our service, while
    showing no signs of overreclaim - no increased scan rates, %sys time, or
    abrupt free memory spikes.  I tested across 100 machines that have 64G of
    RAM and host about 300 cgroups each.
    
    [ It's possible overreclaim never was a *practical* issue to begin
      with - it was simply a concern we had on the mailing lists at the
      time, with no real data to back it up. But we have also added more
      bail-out conditions deeper inside reclaim (e.g. the proportional
      exit in shrink_node_memcg) since. Regardless, now we have data that
      suggests full walks are more reliable and scale just fine. ]
    
    Link: http://lkml.kernel.org/r/20190812192316.13615-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
    Acked-by: default avatarMichal Hocko <mhocko@suse.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    1ba6fc9a
vmscan.c 122 KB