Commit a35b6466 authored by Peter Zijlstra, committed by Thomas Gleixner

sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies

Peter Portante reported that for large cgroup hierarchies (and/or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.

His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.

It was found that the following call chain:

  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()

results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance, and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.
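
To make the cost concrete, here is a minimal standalone sketch (the
names tg_node and walk_groups are illustrative, not the kernel's):
walk_tg_tree() invokes its callback once per task group, so with N
cgroups every new-idle balance did O(N) work with rq->lock held and
interrupts disabled.

  /* Illustrative only: one callback per group in the hierarchy. */
  struct tg_node {
          struct tg_node *first_child;
          struct tg_node *next_sibling;
          unsigned long h_load;   /* hierarchical load, per cpu in the real code */
  };

  static void walk_groups(struct tg_node *tg, void (*down)(struct tg_node *))
  {
          for (; tg; tg = tg->next_sibling) {
                  down(tg);                               /* e.g. recompute h_load */
                  walk_groups(tg->first_child, down);     /* descend into children */
          }
  }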

This patch does two things: it removes the work from under rq->lock,
based on the good principle of race-and-pray which is widely employed
in the load-balancer as a whole, and it throttles the update_h_load()
calculation to at most once per jiffy.
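
In kernel terms the result looks like the sketch below, which is just
a paraphrase of the fair.c hunk further down: the per-rq jiffies stamp
short-circuits repeat calls within the same jiffy, and the walk itself
only needs rcu_read_lock(), not rq->lock.

  static void update_h_load(long cpu)
  {
          struct rq *rq = cpu_rq(cpu);
          unsigned long now = jiffies;

          if (rq->h_load_throttle == now)         /* at most once per jiffy */
                  return;
          rq->h_load_throttle = now;

          rcu_read_lock();                        /* no rq->lock needed for the walk */
          walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
          rcu_read_unlock();
  }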

I considered excluding update_h_load() for new-idle balance
altogether, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).

Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
parent b9403130
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3387,6 +3387,14 @@ static int tg_load_down(struct task_group *tg, void *data)
 
 static void update_h_load(long cpu)
 {
+        struct rq *rq = cpu_rq(cpu);
+        unsigned long now = jiffies;
+
+        if (rq->h_load_throttle == now)
+                return;
+
+        rq->h_load_throttle = now;
+
         rcu_read_lock();
         walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
         rcu_read_unlock();
@@ -4293,11 +4301,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
         env.src_rq    = busiest;
         env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+        update_h_load(env.src_cpu);
 more_balance:
         local_irq_save(flags);
         double_rq_lock(this_rq, busiest);
-        if (!env.loop)
-                update_h_load(env.src_cpu);
 
         /*
          * cur_ld_moved - load moved in current iteration
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -374,7 +374,11 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
         /* list of leaf cfs_rq on this cpu: */
         struct list_head leaf_cfs_rq_list;
-#endif
+#ifdef CONFIG_SMP
+        unsigned long h_load_throttle;
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_RT_GROUP_SCHED
         struct list_head leaf_rt_rq_list;
 #endif