    sched/psi: Per-cgroup PSI accounting disable/re-enable interface · 34f26a15
    Chengming Zhou authored
    PSI accounts stalls for each cgroup separately and aggregates them
    at each level of the hierarchy. This can cause non-negligible overhead
    for some workloads that run deep in the hierarchy.
    
    Commit 3958e2d0 ("cgroup: make per-cgroup pressure stall tracking configurable")
    made PSI skip per-cgroup stall accounting and account only system-wide,
    which avoids this per-level overhead.
    
    But for our use case, we also want PSI stats accounted for leaf cgroups,
    so that userspace can make adjustments on those cgroups rather than
    only system-wide.
    
    So this patch introduces a per-cgroup PSI accounting disable/re-enable
    interface, "cgroup.pressure", a read-write single-value file whose
    allowed values are "0" and "1". The default is "1", so per-cgroup
    PSI stats are enabled by default.
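
As an illustration of how userspace might drive this interface, here is a minimal C sketch that writes "0" or "1" to the "cgroup.pressure" file of a given cgroup directory. The helper name set_cgroup_pressure() and the directory argument are assumptions made for this example; only the file name and its two allowed values come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "0" or "1" to <cgroup_dir>/cgroup.pressure.
 * cgroup_dir is assumed to be a cgroup directory under a cgroup2 mount.
 * Returns 0 on success, -1 on failure.
 */
static int set_cgroup_pressure(const char *cgroup_dir, bool enable)
{
    char path[4096];
    int n;
    int rc;
    FILE *f;

    assert(cgroup_dir != NULL);

    n = snprintf(path, sizeof(path), "%s/cgroup.pressure", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1; /* path was truncated or formatting failed */

    f = fopen(path, "w");
    if (!f)
        return -1;

    rc = (fputs(enable ? "1" : "0", f) == EOF) ? -1 : 0;

    /* Buffered write errors surface at fclose(), so check it as well. */
    if (fclose(f) != 0)
        rc = -1;

    return rc;
}
```

A caller would pass the directory of the cgroup it wants to change, for example set_cgroup_pressure(dir, false) to stop PSI accounting for that cgroup and set_cgroup_pressure(dir, true) to resume it.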
    
    Implementation details:
    
    It should be relatively straightforward to disable and re-enable
    state aggregation, time tracking, and averaging on a per-cgroup level,
    if we can live with losing the history for the period it was disabled.
    I.e. the avgs will restart from 0 and total= will have gaps.
    
    But it is hard and complex to stop and restart groupc->tasks[] updates,
    so that is not implemented in this patch. We always update
    groupc->tasks[] and the PSI_ONCPU bit in psi_group_change(), even when
    the cgroup PSI stats are disabled.
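
The design choice above can be pictured with a small userspace toy model. This is not the kernel code: every name below (toy_group, toy_flush, toy_group_change, toy_set_enabled) is hypothetical, time is an abstract counter, and only one stall type is tracked. It only illustrates the idea that task counts are always maintained while stall time is charged only when accounting is enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified task-count indices for the toy model. */
enum toy_task_count { TOY_NR_IOWAIT, TOY_NR_MEMSTALL, TOY_NR_RUNNING, TOY_NR_COUNTS };

struct toy_group {
    bool enabled;                       /* stands in for the cgroup.pressure value */
    unsigned int tasks[TOY_NR_COUNTS];  /* always kept up to date */
    unsigned long long memstall_total;  /* grows only while enabled */
    unsigned long long last_update;     /* start of the current interval */
};

/* Charge the interval since last_update if enabled and a task is stalled. */
static void toy_flush(struct toy_group *g, unsigned long long now)
{
    if (g->enabled && g->tasks[TOY_NR_MEMSTALL] > 0)
        g->memstall_total += now - g->last_update;
    g->last_update = now;
}

/* Task state change; pass -1 as clear or set to leave that side unchanged. */
static void toy_group_change(struct toy_group *g, int clear, int set,
                             unsigned long long now)
{
    assert(clear < TOY_NR_COUNTS && set < TOY_NR_COUNTS);

    toy_flush(g, now);

    /* Task counts are updated unconditionally, even while disabled. */
    if (clear >= 0 && g->tasks[clear] > 0)
        g->tasks[clear]--;
    if (set >= 0)
        g->tasks[set]++;
}

/* Toggle accounting; the interval under the old setting is concluded first. */
static void toy_set_enabled(struct toy_group *g, bool enable,
                            unsigned long long now)
{
    toy_flush(g, now);
    g->enabled = enable;
}
```

In this model, if a task is stalled during [0, 30) and [40, 60) while accounting is disabled during [10, 50), memstall_total ends at 20: the disabled window shows up as a gap in the total. Because tasks[] was kept current throughout, accounting resumes from the correct state at re-enable time.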
    Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
    Suggested-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
cgroup-v2.rst 108 KB