• Johannes Weiner's avatar
    psi: fix aggregation idle shut-off · 1b69ac6b
    Johannes Weiner authored
    psi has provisions to shut off the periodic aggregation worker when
    there is a period of no task activity - and thus no data that needs
    aggregating.  However, while developing psi monitoring, Suren noticed
    that the aggregation clock currently won't stay shut off for good.
    
    Debugging this revealed a flaw in the idle design: an aggregation run
    will see no task activity and decide to go to sleep; shortly thereafter,
    the kworker thread that executed the aggregation will go idle and cause
    a scheduling change, during which the psi callback will kick the
    !pending worker again.  This will ping-pong forever, and is equivalent
    to having no shut-off logic at all (but with more code!)
    
    Fix this by exempting aggregation workers from psi's clock waking logic
    when the state change is them going to sleep.  To do this, tag workers
    with the last work function they executed, and if in psi we see a worker
    going to sleep after aggregating psi data, we will not reschedule the
    aggregation work item.
    
    What if the worker is also executing other items before or after?
    
    Any psi state times that were incurred by work items preceding the
    aggregation work will have been collected from the per-cpu buckets
    during the aggregation itself.  If there are work items following the
    aggregation work, the worker's last_func tag will be overwritten and the
    aggregator will be kept alive to process this genuine new activity.
    
    If the aggregation work is the last thing the worker does, and we decide
    to go idle, the brief period of non-idle time incurred between the
    aggregation run and the kworker's dequeue will be stranded in the
    per-cpu buckets until the clock is woken by later activity.  But that
    should not be a problem.  The buckets can hold 4s worth of time, and
    future activity will wake the clock with a 2s delay, giving us 2s worth
    of data we can leave behind when disabling aggregation.  If it takes a
    worker more than two seconds to go idle after it finishes its last work
    item, we likely have bigger problems in the system, and won't notice one
    sample that was averaged with a bogus per-CPU weight.
    
    Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
    Fixes: eb414681 ("psi: pressure stall information for CPU, memory, and IO")
    Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
    Reported-by: default avatarSuren Baghdasaryan <surenb@google.com>
    Acked-by: default avatarTejun Heo <tj@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    1b69ac6b
workqueue_internal.h 2.39 KB