• Yuyang Du's avatar
    sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
    Yuyang Du authored
    The idea of runnable load average (let runnable time contribute to weight)
    was proposed by Paul Turner and Ben Segall, and it is still followed by
    this rewrite. This rewrite aims to solve the following issues:
    
    1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
       updated at the granularity of an entity at a time, which results in the
       cfs_rq's load average is stale or partially updated: at any time, only
       one entity is up to date, all other entities are effectively lagging
       behind. This is undesirable.
    
       To illustrate, if we have n runnable entities in the cfs_rq, as time
       elapses, they certainly become outdated:
    
         t0: cfs_rq { e1_old, e2_old, ..., en_old }
    
       and when we update:
    
         t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
    
         t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
    
         ...
    
       We solve this by combining all runnable entities' load averages together
       in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
       on the fact that if we regard the update as a function, then:
    
       w * update(e) = update(w * e) and
    
       update(e1) + update(e2) = update(e1 + e2), then
    
       w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
    
       therefore, by this rewrite, we have an entirely updated cfs_rq at the
       time we update it:
    
         t1: update cfs_rq { e1_new, e2_new, ..., en_new }
    
         t2: update cfs_rq { e1_new, e2_new, ..., en_new }
    
         ...
    
    2. cfs_rq's load average is different between top rq->cfs_rq and other
       task_group's per CPU cfs_rqs in whether or not blocked_load_average
       contributes to the load.
    
       The basic idea behind runnable load average (the same for utilization)
       is that the blocked state is taken into account as opposed to only
       accounting for the currently runnable state. Therefore, the average
       should include both the runnable/running and blocked load averages.
       This rewrite does that.
    
       In addition, we also combine runnable/running and blocked averages
       of all entities into the cfs_rq's average, and update it together at
       once. This is based on the fact that:
    
         update(runnable) + update(blocked) = update(runnable + blocked)
    
       This significantly reduces the code as we don't need to separately
       maintain/update runnable/running load and blocked load.
    
    3. How task_group entities' share is calculated is complex and imprecise.
    
       We reduce the complexity in this rewrite to allow a very simple rule:
       the task_group's load_avg is aggregated from its per CPU cfs_rqs's
       load_avgs. Then group entity's weight is simply proportional to its
       own cfs_rq's load_avg / task_group's load_avg. To illustrate,
    
       if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
    
       task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
    
       cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
    
    To sum up, this rewrite in principle is equivalent to the current one, but
    fixes the issues described above. Turns out, it significantly reduces the
    code complexity and hence increases clarity and efficiency. In addition,
    the new averages are more smooth/continuous (no spurious spikes and valleys)
    and updated more consistently and quickly to reflect the load dynamics.
    
    As a result, we have less load tracking overhead, better performance,
    and especially better power efficiency due to more balanced load.
    Signed-off-by: default avatarYuyang Du <yuyang.du@intel.com>
    Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: arjan@linux.intel.com
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: fengguang.wu@intel.com
    Cc: len.brown@intel.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: rafael.j.wysocki@intel.com
    Cc: umgwanakikbuti@gmail.com
    Cc: vincent.guittot@linaro.org
    Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    9d89c257
core.c 202 KB