    mm: memcg: make stats flushing threshold per-memcg · 8d59d221
    Yosry Ahmed authored
    A global counter for the magnitude of memcg stats updates is maintained on
    the memcg side to avoid invoking rstat flushes when the pending updates
    are not significant.  This avoids unnecessary flushes, which are not very
    cheap even when there aren't many stats to flush.  It also avoids
    unnecessary lock contention on the underlying global rstat lock.
    
    Make this threshold per-memcg.  The same scheme is followed: percpu (now
    also per-memcg) counters are incremented in the update path, and only
    propagated to per-memcg atomics when they exceed a certain threshold.
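
    As an illustration only, the update-side bookkeeping can be sketched as
    follows (the function and field names here are assumptions made for the
    sketch, not necessarily the ones used in the patch):

      /*
       * Sketch: accumulate the magnitude of an update in a per-cpu, per-memcg
       * counter, and fold it into the memcg's atomic counter once it crosses
       * MEMCG_CHARGE_BATCH.
       */
      static void memcg_stats_updated(struct mem_cgroup *memcg, int val)
      {
              struct memcg_vmstats_percpu *statc;

              statc = this_cpu_ptr(memcg->vmstats_percpu);
              statc->stats_updates += abs(val);
              if (statc->stats_updates < MEMCG_CHARGE_BATCH)
                      return;

              atomic64_add(statc->stats_updates, &memcg->vmstats->stats_updates);
              statc->stats_updates = 0;
      }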
    
    This provides two benefits: (a) On large machines with a lot of memcgs,
    the global threshold can be reached relatively fast, so guarding the
    underlying lock becomes less effective.  Making the threshold per-memcg
    avoids this.
    
    (b) Having a global threshold makes it hard to do subtree flushes, as we
    cannot reset the global counter except for a full flush.  Per-memcg
    counters remove this blocker to doing subtree flushes, which helps avoid
    unnecessary work when only the stats of a small subtree are needed (see the
    sketch below).
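
    With a per-memcg counter, whether flushing a given memcg (or its subtree)
    is worthwhile can be decided by looking only at that memcg, roughly along
    these lines (a sketch; the helper name and the exact threshold are
    illustrative):

      static bool memcg_vmstats_needs_flush(struct mem_cgroup *memcg)
      {
              /* Flush only if enough updates have accumulated in this memcg. */
              return atomic64_read(&memcg->vmstats->stats_updates) >
                     MEMCG_CHARGE_BATCH * num_online_cpus();
      }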
    
    Nothing is free, of course.  This comes at a cost: (a) A new per-cpu
    counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes.  The extra
    memory usage is insignificant.
    
    (b) More work on the update side, although in the common case it will only
    be a percpu counter update.  The amount of work scales with the number of
    ancestors (i.e.  tree depth).  This is not a new concept: adding a cgroup
    to the rstat tree involves a parent loop, and so does charging (see the
    illustrative loop below).  Testing results below show no significant
    regressions.
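
    For illustration, the extra update-side work is essentially a walk up the
    memcg's ancestors doing the cheap percpu accumulation at each level,
    reusing the memcg_stats_updated() sketch above (the loop shape is
    illustrative, not the exact code):

      for (; memcg; memcg = parent_mem_cgroup(memcg))
              memcg_stats_updated(memcg, val);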
    
    (c) The error margin in the stats for the system as a whole increases from
    NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS. 
    This is probably fine because we have a similar per-memcg error in charges
    coming from percpu stocks, and we have a periodic flusher that makes sure
    we always flush all the stats every 2s anyway.
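
    For a rough sense of scale (numbers picked purely for illustration):
    assuming 256 CPUs, a MEMCG_CHARGE_BATCH of 64, and 1000 memcgs,

      old bound: NR_CPUS * MEMCG_CHARGE_BATCH             = 256 * 64        =  16,384
      new bound: NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS = 256 * 64 * 1000 = ~16.4M

    while the 2s periodic flush still bounds how stale the totals can get.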
    
    This patch was tested as follows to make sure no significant regressions
    are introduced on the update path.  The following benchmarks were run in a
    cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
    
    (1) Running 22 instances of netperf on a 44 cpu machine with
    hyperthreading disabled.  All instances, as well as netserver, are run in
    the level-2 cgroup:
      # netserver -6
      # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
    
    Averaging 20 runs, the numbers are as follows:
    Base: 40198.0 mbps
    Patched: 38629.7 mbps (-3.9%)
    
    The regression is minimal, especially for 22 instances in the same
    cgroup sharing all ancestors (so updating the same atomics).
    
    (2) will-it-scale page_fault tests.  These tests (specifically
    per_process_ops in the page_fault3 test) previously detected a 25.9%
    regression for a change in the stats update path [1].  These are the
    numbers from 10 runs (+ is good) on a machine with 256 cpus:
    
                 LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
    ------------------------------+-------------+-------------+-------------
      page_fault1_per_process_ops |             |             |            |
      (A) base                    | 270249.164  | 265437.000  | 13451.836  |
      (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
                                  | -3.29%      | -3.66%      |            |
      page_fault1_per_thread_ops  |             |             |            |
      (A) base                    | 242111.345  | 239737.000  | 10026.031  |
      (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
                                  | -2.09%      | -1.85%      |            |
      page_fault1_scalability     |             |             |            |
      (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
      (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
                                  | -1.16%      | -1.69%      |            |
      page_fault2_per_process_ops |             |             |            |
      (A) base                    | 203561.836  | 203301.000  | 2550.764   |
      (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
                                  | -3.13%      | -2.73%      |            |
      page_fault2_per_thread_ops  |             |             |            |
      (A) base                    | 171046.473  | 170776.000  | 1509.679   |
      (B) patched                 | 166626.327  | 166406.000  | 768.753    |
                                  | -2.58%      | -2.56%      |            |
      page_fault2_scalability     |             |             |            |
      (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
      (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
                                  | -1.29%      | -1.41%      |            |
      page_fault3_per_process_ops |             |             |            |
      (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
      (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
                                  | -1.56%      | -1.86%      |            |
      page_fault3_per_thread_ops  |             |             |            |
      (A) base                    | 391234.164  | 390860.000  | 1760.720   |
      (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
                                  | -3.58%      | -3.71%      |            |
      page_fault3_scalability     |             |             |            |
      (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
      (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
                                  | +2.26%      | +2.45%      |            |
    
    All regressions seem to be minimal, and within the normal variance for the
    benchmark.  The fix for [1] assumed that 3% is noise (and there were no
    further practical complaints), so hopefully this means that such variations
    in these microbenchmarks do not reflect on practical workloads.
    
    (3) I also ran stress-ng in a nested cgroup and did not observe any
    obvious regressions.
    
    [1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
    
    Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
    Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Cc: Chris Li <chrisl@kernel.org>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Ivan Babrou <ivan@cloudflare.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Michal Koutny <mkoutny@suse.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Waiman Long <longman@redhat.com>
    Cc: Wei Xu <weixugc@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>