• Johannes Weiner's avatar
    mm: memcontrol: switch to rstat · 2d146aa3
    Johannes Weiner authored
    Replace the memory controller's custom hierarchical stats code with the
    generic rstat infrastructure provided by the cgroup core.
    
    The current implementation does batched upward propagation from the
    write side (i.e.  as stats change).  The per-cpu batches introduce an
    error, which is multiplied by the number of subgroups in a tree.  In
    systems with many CPUs and sizable cgroup trees, the error can be large
    enough to confuse users (e.g.  32 batch pages * 32 CPUs * 32 subgroups
    results in an error of up to 128M per stat item).  This can entirely
    swallow allocation bursts inside a workload that the user is expecting
    to see reflected in the statistics.
    
    In the past, we've done read-side aggregation, where a memory.stat read
    would have to walk the entire subtree and add up per-cpu counts.  This
    became problematic with lazily-freed cgroups: we could have large
    subtrees where most cgroups were entirely idle.  Hence the switch to
    change-driven upward propagation.  Unfortunately, it needed to trade
    accuracy for speed due to the write side being so hot.
    
    Rstat combines the best of both worlds: from the write side, it cheaply
    maintains a queue of cgroups that have pending changes, so that the read
    side can do selective tree aggregation.  This way the reported stats
    will always be precise and recent as can be, while the aggregation can
    skip over potentially large numbers of idle cgroups.
    
    The way rstat works is that it implements a tree for tracking cgroups
    with pending local changes, as well as a flush function that walks the
    tree upwards.  The controller then drives this by 1) telling rstat when
    a local cgroup stat changes (e.g.  mod_memcg_state) and 2) when a flush
    is required to get uptodate hierarchy stats for a given subtree (e.g.
    when memory.stat is read).  The controller also provides a flush
    callback that is called during the rstat flush walk for each cgroup and
    aggregates its local per-cpu counters and propagates them upwards.
    
    This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
    NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
    aggregation.  It removes 3 words from the per-cpu data.  It eliminates
    memcg_exact_page_state(), since memcg_page_state() is now exact.
    
    [akpm@linux-foundation.org: merge fix]
    [hannes@cmpxchg.org: fix a sleep in atomic section problem]
      Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org
    
    Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
    Acked-by: default avatarMichal Hocko <mhocko@suse.com>
    Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
    Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
    Acked-by: default avatarBalbir Singh <bsingharora@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    2d146aa3
memcontrol.c 189 KB