    sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle · f3dd3f67
    Tianchen Ding authored
    The wakelist can help avoid cache bouncing and offload overhead from
    the waker cpu. So far, using the wakelist within the same llc only
    happens on WF_ON_CPU, and this limitation can be removed to further
    improve wakeup performance.
    
    Commit 518cd623 ("sched: Only queue remote wakeups when
    crossing cache boundaries") disabled queueing tasks on the wakelist
    when the cpus share an llc. This is because, at that time, the
    scheduler had to send an IPI to do ttwu_queue_wakelist. Nowadays,
    ttwu_queue_wakelist also supports TIF_POLLING, so this is no longer a
    problem when the wakee cpu is idle polling.
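
    In code terms, the patch boils down to relaxing the queueing
    condition. A minimal sketch of ttwu_queue_cond() after the change
    (paraphrased from kernel/sched/core.c; the exact guards in the merged
    patch may differ):

        static inline bool ttwu_queue_cond(int cpu)
        {
                /* Don't use the async wakelist across CPU hotplug. */
                if (!cpu_active(cpu))
                        return false;

                /*
                 * No shared cache: queue on the remote rq's wakelist to
                 * avoid touching remote data (pre-existing behavior).
                 */
                if (!cpus_share_cache(smp_processor_id(), cpu))
                        return true;

                /* Queueing to ourselves makes no sense. */
                if (cpu == smp_processor_id())
                        return false;

                /*
                 * New: within the same llc, use the wakelist whenever the
                 * wakee cpu has nothing running. Previously this was
                 * limited to the WF_ON_CPU case.
                 */
                if (!cpu_rq(cpu)->nr_running)
                        return true;

                return false;
        }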
    
    Benefits:
      Queuing the task on an idle cpu helps improve performance on the
      waker cpu and utilization on the wakee cpu, and further improves
      locality because the wakee cpu handles its own rq. This patch helps
      improve rt on our real Java workloads where wakeups happen
      frequently.
    
      Consider the normal condition (CPU0 and CPU1 share the same llc).
      Before this patch:
    
             CPU0                                       CPU1
    
        select_task_rq()                                idle
        rq_lock(CPU1->rq)
        enqueue_task(CPU1->rq)
        notify CPU1 (by sending IPI or CPU1 polling)
    
                                                        resched()
    
      After this patch:
    
             CPU0                                       CPU1
    
        select_task_rq()                                idle
        add to wakelist of CPU1
        notify CPU1 (by sending IPI or CPU1 polling)
    
                                                        rq_lock(CPU1->rq)
                                                        enqueue_task(CPU1->rq)
                                                        resched()
    
      We see that CPU0 can finish its work earlier: it only needs to put
      the task on the wakelist and return. CPU1 is idle anyway, so let it
      handle its own runqueue data.
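
      The "add to wakelist of CPU1" step is cheap for CPU0 because it is a
      lock-free llist append; the rq lock is only taken later, by CPU1
      itself. A rough sketch of the queueing side (simplified from
      __ttwu_queue_wakelist() in kernel/sched/core.c):

          static void __ttwu_queue_wakelist(struct task_struct *p, int cpu,
                                            int wake_flags)
          {
                  struct rq *rq = cpu_rq(cpu);

                  p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

                  /* Let the wakee know there is deferred wakeup work. */
                  WRITE_ONCE(rq->ttwu_pending, 1);

                  /*
                   * Lock-free append of the task to the wakee's wakelist,
                   * plus a notification (an IPI, or just a flag if the
                   * wakee is polling). No rq lock on the waker side.
                   */
                  __smp_call_single_queue(cpu, &p->wake_entry.llist);
          }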
    
    This patch makes no difference with respect to IPIs.
      It only takes effect when the wakee cpu is:
      1) idle polling
      2) idle not polling
    
      For 1), there will be no IPI with or without this patch (see the
      sketch after this list).

      For 2), there will always be exactly one IPI, both before and after
      this patch.
      Before this patch: the waker cpu enqueues the task and checks
      preemption. Since "idle" is sure to be preempted, the waker cpu must
      send a resched IPI.
      After this patch: the waker cpu puts the task on the wakelist of the
      wakee cpu and sends an IPI.
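
      Case 1) avoids the IPI via the TIF_POLLING path: a polling idle cpu
      watches its thread flags, so the waker can simply set
      TIF_NEED_RESCHED instead of interrupting it. Roughly (simplified
      from send_call_function_single_ipi() in kernel/sched/core.c):

          void send_call_function_single_ipi(int cpu)
          {
                  struct rq *rq = cpu_rq(cpu);

                  /*
                   * If the wakee's idle task polls on TIF_NEED_RESCHED
                   * (TIF_POLLING_NRFLAG), atomically set the flag and skip
                   * the IPI; the idle loop will notice and break out.
                   */
                  if (!set_nr_if_polling(rq->idle))
                          arch_send_call_function_single_ipi(cpu);
                  else
                          trace_sched_wake_idle_without_ipi(cpu);
          }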
    
    Benchmark:
    We've tested schbench, unixbench, and hackbench on both x86 and arm64.
    
    On x86 (Intel Xeon Platinum 8269CY):
      schbench -m 2 -t 8
    
        Latency percentiles (usec)              before        after
            50.0000th:                             8            6
            75.0000th:                            10            7
            90.0000th:                            11            8
            95.0000th:                            12            8
            *99.0000th:                           13           10
            99.5000th:                            15           11
            99.9000th:                            18           14
    
      Unixbench with full threads (104)
                                                before        after
        Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
        Double-Precision Whetstone              617119.3      617298.5   0.03%
        Execl Throughput                         27667.3       27627.3  -0.14%
        File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
        File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
        File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
        Pipe Throughput                      145535622.8   145323033.2  -0.15%
        Pipe-based Context Switching           3221686.4     3583975.4  11.25%
        Process Creation                        101347.1      103345.4   1.97%
        Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
        Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
        System Call Overhead                   5300604.8     5312213.6   0.22%
    
      hackbench -g 1 -l 100000
                                                before        after
        Time                                     3.246        2.251
    
    On arm64 (Ampere Altra):
      schbench -m 2 -t 8
    
        Latency percentiles (usec)              before        after
            50.0000th:                            14           10
            75.0000th:                            19           14
            90.0000th:                            22           16
            95.0000th:                            23           16
            *99.0000th:                           24           17
            99.5000th:                            24           17
            99.9000th:                            28           25
    
      Unixbench with full threads (80)
                                                before        after
        Dhrystone 2 using register variables  3536194249    3537019613   0.02%
        Double-Precision Whetstone              629383.6      629431.6   0.01%
        Execl Throughput                         65920.5       65846.2  -0.11%
        File Copy 1024 bufsize 2000 maxblocks  1063722.8      1064026.8   0.03%
        File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
        File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
        Pipe Throughput                      133542875.3   131619389.8  -1.44%
        Pipe-based Context Switching           3215356.1     3576945.1  11.25%
        Process Creation                        108520.5      120184.6  10.75%
        Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
        Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
        System Call Overhead                   4429998.9      4435006.7   0.11%
    
      hackbench -g 1 -l 100000
                                                before        after
        Time                                     4.217        2.916
    
    Our patch shows improvement on schbench, hackbench, and the Pipe-based
    Context Switching test of unixbench when idle cpus exist, and no
    obvious regression on the other unixbench tests. This can help improve
    rt in scenarios where wakeups happen frequently.
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com