    sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies · a35b6466
    Peter Zijlstra authored
    Peter Portante reported that for large cgroup hierarchies (and/or on
    large CPU counts) we get immense lock contention on rq->lock and stuff
    stops working properly.
    
    His workload was a ton of processes, each in their own cgroup,
    everybody idling except for a sporadic wakeup once every so often.
    
    It was found that:
    
      schedule()
        idle_balance()
          load_balance()
            local_irq_save()
            double_rq_lock()
            update_h_load()
              walk_tg_tree(tg_load_down)
                tg_load_down()
    
    Results in an entire cgroup hierarchy walk under rq->lock for every
    new-idle balance, and since new-idle balance isn't throttled, this
    results in a lot of work while holding the rq->lock.
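
    For illustration, here is a minimal userspace model (not the kernel
    code; the struct and field names are made up for this sketch) of the
    top-down walk that tg_load_down() performs per CPU: each group's
    hierarchical load is derived from its parent's, scaled by the group's
    share of the parent's queue weight, so every call visits every group
    in the hierarchy.

      /*
       * Toy model: a chain of groups, each holding a weight in its parent
       * and its own queued weight.  walk_tree() visits every group once,
       * top-down, which is the per-CPU cost that ends up under rq->lock.
       */
      #include <stdio.h>

      struct group {
              unsigned long weight;        /* weight of this group in its parent */
              unsigned long queue_weight;  /* total weight queued in this group */
              unsigned long h_load;        /* hierarchical load, computed below */
              struct group *parent;
              struct group *child;         /* first child (toy single-chain tree) */
      };

      static void walk_tree(struct group *g)
      {
              for (; g; g = g->child) {
                      if (!g->parent)
                              g->h_load = g->queue_weight;    /* root: raw rq weight */
                      else
                              g->h_load = g->parent->h_load * g->weight /
                                          (g->parent->queue_weight + 1);
              }
      }

      int main(void)
      {
              struct group root = { .queue_weight = 3072 };
              struct group mid  = { .weight = 1024, .queue_weight = 2048, .parent = &root };
              struct group leaf = { .weight = 1024, .parent = &mid };

              root.child = &mid;
              mid.child  = &leaf;

              walk_tree(&root);
              printf("h_load: root=%lu mid=%lu leaf=%lu\n",
                     root.h_load, mid.h_load, leaf.h_load);
              return 0;
      }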
    
    This patch does two things: it removes the work from under rq->lock,
    based on the good principle of race-and-pray which is widely employed
    in the load-balancer as a whole, and secondly it throttles the
    update_h_load() calculation to at most once per jiffy.
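
    A minimal userspace sketch of the once-per-jiffy throttle, with
    jiffies simulated and the rq field name (h_load_throttle) being an
    assumption made for this illustration:

      #include <stdio.h>

      static unsigned long jiffies;           /* stand-in for the kernel tick count */

      struct rq {
              unsigned long h_load_throttle;  /* jiffy of the last h_load update */
      };

      static void expensive_hierarchy_walk(void)
      {
              printf("walking the full cgroup hierarchy at jiffy %lu\n", jiffies);
      }

      static void update_h_load(struct rq *rq)
      {
              unsigned long now = jiffies;

              /* At most one full hierarchy walk per rq per jiffy. */
              if (rq->h_load_throttle == now)
                      return;
              rq->h_load_throttle = now;

              expensive_hierarchy_walk();
      }

      int main(void)
      {
              struct rq rq = { .h_load_throttle = ~0UL };

              jiffies = 1;
              update_h_load(&rq);     /* walks the hierarchy       */
              update_h_load(&rq);     /* throttled: same jiffy     */
              jiffies = 2;
              update_h_load(&rq);     /* a new jiffy, walks again  */
              return 0;
      }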
    
    I considered excluding update_h_load() for new-idle balance
    altogether, but purely relying on regular balance passes to update
    this data might not work out under some rare circumstances where the
    new-idle busiest isn't the regular busiest for a while (unlikely, but
    a nightmare to debug if someone hits it and suffers).
    
    Cc: pjt@google.com
    Cc: Larry Woodman <lwoodman@redhat.com>
    Cc: Mike Galbraith <efault@gmx.de>
    Reported-by: Peter Portante <pportant@redhat.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>