    sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal · 3edecfef
    Peter Puhov authored
    In the slow path, when selecting the idlest group, if both groups have type
    group_has_spare, only the idle_cpus count is compared.
    As a result, if multiple tasks are created in a tight loop
    and go back to sleep immediately
    (while waiting for all tasks to be created),
    they may be scheduled on the same core, because the CPU is back to idle
    when the next fork happens.
    
    For example:
    sudo perf record -e sched:sched_wakeup_new -- \
                                      sysbench threads --threads=4 run
    ...
        total number of events:              61582
    ...
    sudo perf script
    sysbench 129378 [006] 74586.633466: sched:sched_wakeup_new:
                                sysbench:129380 [120] success=1 CPU:007
    sysbench 129378 [006] 74586.634718: sched:sched_wakeup_new:
                                sysbench:129381 [120] success=1 CPU:007
    sysbench 129378 [006] 74586.635957: sched:sched_wakeup_new:
                                sysbench:129382 [120] success=1 CPU:007
    sysbench 129378 [006] 74586.637183: sched:sched_wakeup_new:
                                sysbench:129383 [120] success=1 CPU:007
    
    This may have a negative impact on performance for workloads that
    frequently create multiple threads.
    
    This patch uses group_util to select the idlest group when both groups
    have an equal number of idle_cpus. Comparing the number of idle CPUs is
    not enough in this case, because the newly forked thread goes to sleep
    immediately, before we select the CPU for the next one.
    This is shown in the trace above, where the same CPU7 is selected for
    all wakeup_new events.
    That is why looking at utilization when the number of idle CPUs is
    the same is a good way to see where the previous task was placed. Using
    nr_running doesn't solve the problem: the newly forked task is not
    running, and if it were, the CPU would not have been idle and an idle
    CPU would have been selected instead.
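
    The tie-break described above can be sketched as a small standalone
    comparison. This is a simplified model, not the code in fair.c: the
    struct group_stats type and the candidate_is_idler() helper are
    hypothetical names introduced for illustration, and only the two
    fields discussed in this message (idle_cpus and group_util) are
    modelled, assuming both groups are already of type group_has_spare.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-group statistics (not the kernel's struct). */
struct group_stats {
	unsigned int idle_cpus;   /* number of idle CPUs in the group */
	unsigned long group_util; /* summed utilization of the group's CPUs */
};

/*
 * Return true if 'candidate' should replace 'idlest' as the chosen group.
 * Assumes both groups have already been classified as having spare capacity.
 */
static bool candidate_is_idler(const struct group_stats *idlest,
			       const struct group_stats *candidate)
{
	/* A different idle CPU count decides on its own: more idle CPUs wins. */
	if (idlest->idle_cpus != candidate->idle_cpus)
		return candidate->idle_cpus > idlest->idle_cpus;

	/* Same idle CPU count: prefer the group with strictly lower utilization. */
	return candidate->group_util < idlest->group_util;
}
```

    In this model, a group that just hosted a forked task which went back
    to sleep still carries some utilization, so a group with the same
    idle_cpus count but lower group_util is preferred for the next fork.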
    
    With this patch, newly created tasks are better distributed.
    
    With this patch:
    sudo perf record -e sched:sched_wakeup_new -- \
                                        sysbench threads --threads=4 run
    ...
        total number of events:              74401
    ...
    sudo perf script
    sysbench 129455 [006] 75232.853257: sched:sched_wakeup_new:
                                sysbench:129457 [120] success=1 CPU:008
    sysbench 129455 [006] 75232.854489: sched:sched_wakeup_new:
                                sysbench:129458 [120] success=1 CPU:009
    sysbench 129455 [006] 75232.855732: sched:sched_wakeup_new:
                                sysbench:129459 [120] success=1 CPU:010
    sysbench 129455 [006] 75232.856980: sched:sched_wakeup_new:
                                sysbench:129460 [120] success=1 CPU:011
    
    We tested this patch with the following benchmarks:
    master: 'commit b3a9e3b9 ("Linux 5.8-rc1")'
    
    100 iterations of: perf bench -f simple futex wake -s -t 128 -w 1
    Lower result is better
    |         |   BASELINE |   +PATCH |   DELTA (%) |
    |---------|------------|----------|-------------|
    | mean    |      0.33  |    0.313 |      +5.152 |
    | std (%) |     10.433 |    7.563 |             |
    
    100 iterations of: sysbench threads --threads=8 run
    Higher result is better
    |         |   BASELINE |   +PATCH |   DELTA (%) |
    |---------|------------|----------|-------------|
    | mean    |   5235.02  | 5863.73  |      +12.01 |
    | std (%) |      8.166 |   10.265 |             |
    
    100 iterations of: sysbench mutex --mutex-num=1 --threads=8 run
    Lower result is better
    |         |   BASELINE |   +PATCH |   DELTA (%) |
    |---------|------------|----------|-------------|
    | mean    |      0.413 |    0.404 |      +2.179 |
    | std (%) |      3.791 |    1.816 |             |
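
    The DELTA (%) column in the tables above is consistent with a relative
    change against the baseline mean, signed so that a positive value
    means the patched kernel did better whether the metric is "lower is
    better" or "higher is better". A minimal sketch of that arithmetic
    follows; delta_percent() and its lower_is_better parameter are
    hypothetical names, not part of perf or sysbench.

```c
/*
 * Relative change of 'patched' against 'baseline', in percent.
 * The sign is flipped for lower-is-better metrics so that a positive
 * result always means an improvement.
 */
static double delta_percent(double baseline, double patched, int lower_is_better)
{
	double change = (patched - baseline) / baseline * 100.0;

	return lower_is_better ? -change : change;
}
```

    For example, delta_percent(0.33, 0.313, 1) corresponds to the futex
    wake row and delta_percent(5235.02, 5863.73, 0) to the sysbench
    threads row.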

    Signed-off-by: Peter Puhov <peter.puhov@linaro.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20200714125941.4174-1-peter.puhov@linaro.org