    sched/fair: Introduce the burstable CFS controller · f4183717
    Huaixin Chang authored
    The CFS bandwidth controller limits CPU requests of a task group to
    quota during each period. However, parallel workloads might be bursty
    so that they get throttled even when their average utilization is under
    quota. And they are latency sensitive at the same time so that
    throttling them is undesired.
    
    We borrow time now against our future underrun, at the cost of increased
    interference against the other system users. All nicely bounded.
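    
    The borrowing described above can be sketched with a simplified
    token-bucket model. This is not the kernel code: the simulate() helper
    and the demand values are hypothetical, and the model assumes that each
    period the group's runtime is refilled by quota and capped at
    quota + burst, so unused quota carries over. Throttled demand is simply
    dropped rather than deferred.

```python
def simulate(demands_us, quota_us, burst_us):
    """Return the per-period demand that could not run (throttled time).

    Assumption: at each period boundary the runtime budget is refilled by
    quota_us and capped at quota_us + burst_us.
    """
    runtime = quota_us  # budget available in the first period
    throttled = []
    for demand in demands_us:
        used = min(demand, runtime)
        throttled.append(demand - used)
        runtime -= used
        # Refill for the next period; leftover carries over up to the cap.
        runtime = min(runtime + quota_us, quota_us + burst_us)
    return throttled


# Average demand is 80 ms per 100 ms period, but the third period is bursty.
demands = [60_000, 60_000, 140_000, 60_000]
no_burst = simulate(demands, quota_us=100_000, burst_us=0)
with_burst = simulate(demands, quota_us=100_000, burst_us=100_000)
print(no_burst, with_burst)
```

    In this model the bursty period is throttled for 40 ms without burst,
    while with burst the earlier underrun pays for it.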
    
    Traditional (UP-EDF) bandwidth control is something like:
    
      (U = \Sum u_i) <= 1
    
    This guarantees both that every deadline is met and that the system is
    stable. After all, if U were > 1, then for every second of walltime
    we'd have to run more than a second of program time and would obviously
    miss our deadline. The next deadline would be further out still, so there
    is never time to catch up: unbounded fail.
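    
    A minimal sketch of this bound, assuming each task is given as a
    hypothetical (runtime, period) pair on a single CPU. The helper names
    are illustrative only. With U > 1 the pending work grows linearly with
    walltime, which is the "unbounded fail" case.

```python
from fractions import Fraction


def total_utilization(tasks):
    """U = sum of u_i, where u_i = runtime_i / period_i."""
    return sum(Fraction(runtime, period) for runtime, period in tasks)


def backlog_after(seconds, utilization):
    """CPU-seconds of work still pending after `seconds` of walltime."""
    return max(Fraction(0), (utilization - 1) * seconds)


feasible = [(20, 100), (30, 100), (25, 50)]  # U = 0.2 + 0.3 + 0.5 = 1
overloaded = [(60, 100), (30, 50)]           # U = 0.6 + 0.6 = 1.2

for name, tasks in (("feasible", feasible), ("overloaded", overloaded)):
    u = total_utilization(tasks)
    print(name, float(u), "backlog after 10s:", float(backlog_after(10, u)))
```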
    
    This work observes that a workload doesn't always execute the full
    quota; this enables one to describe u_i as a statistical distribution.
    
    For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
    (the traditional WCET). This effectively allows u to be smaller,
    increasing the efficiency (we can pack more tasks in the system), but at
    the cost of missing deadlines when all the odds line up. However, it
    does maintain stability, since every overrun must be paired with an
    underrun as long as our x is above the average.
    
    That is, suppose we have 2 tasks, both specify a p(95) value, then we
    have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
    everything is good. At the same time we have a p(5)p(5) = 0.25% chance
    both tasks will exceed their quota at the same time (guaranteed deadline
    fail). Somewhere in between there's a threshold where one exceeds and
    the other doesn't underrun enough to compensate; this depends on the
    specific CDFs.
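    
    The two-task numbers can be reproduced with a few lines, assuming the
    two tasks overrun independently (an assumption of this sketch; the
    joint_outcomes() helper is hypothetical). The remaining 9.5% is the
    in-between region where exactly one task exceeds its quota, and whether
    a deadline is missed there depends on the CDFs.

```python
from fractions import Fraction


def joint_outcomes(p_within_a, p_within_b):
    """Split the joint probability for two independent tasks."""
    both_within = p_within_a * p_within_b
    both_exceed = (1 - p_within_a) * (1 - p_within_b)
    exactly_one = 1 - both_within - both_exceed
    return both_within, both_exceed, exactly_one


p95 = Fraction(95, 100)
both_within, both_exceed, exactly_one = joint_outcomes(p95, p95)
print(float(both_within), float(both_exceed), float(exactly_one))
```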
    
    At the same time, we can say that the worst case deadline miss will be
    \Sum e_i; that is, there is a bounded tardiness (under the assumption
    that x+e is indeed WCET).
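    
    As an illustration of the packing gain and the tardiness bound, here is
    a sketch that treats each task as a hypothetical (x, e) pair in
    microseconds within a common period on one CPU. The common period and
    the helper names are assumptions of the example, not part of the patch.

```python
def admit(tasks, period_us, use_wcet):
    """Admission test: the summed per-period budgets must fit in the period."""
    demand = sum(x + e if use_wcet else x for x, e in tasks)
    return demand <= period_us


def tardiness_bound_us(tasks):
    """Worst-case deadline miss: the sum of e_i over all tasks."""
    return sum(e for _, e in tasks)


PERIOD_US = 100_000
tasks = [(30_000, 20_000)] * 3  # three tasks with x = 30 ms, e = 20 ms

print("fits with p(95) budgets:", admit(tasks, PERIOD_US, use_wcet=False))
print("fits with WCET budgets: ", admit(tasks, PERIOD_US, use_wcet=True))
print("tardiness bound (usec): ", tardiness_bound_us(tasks))
```

    Budgeting by x admits all three tasks where WCET budgeting would reject
    them, at the price of a deadline miss bounded by 60 ms.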
    
    The benefit of burst is seen when testing with schbench. The default
    values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000)
    are used.
    
    	mkdir /sys/fs/cgroup/cpu/test
    	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
    	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
    	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
    
    	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
    
    The average CPU usage is at 80%. I ran this 10 times, and got long tail
    latency in 6 runs and throttling in 8 runs.
    
    Tail latencies from one run are shown below; it wasn't the worst case.
    
    	Latency percentiles (usec)
    		50.0000th: 19872
    		75.0000th: 21344
    		90.0000th: 22176
    		95.0000th: 22496
    		*99.0000th: 22752
    		99.5000th: 22752
    		99.9000th: 22752
    		min=0, max=22727
    	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
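    
    The 80% figure and the p95/cputime and p99/cputime ratios follow from
    the parameters above. The sketch below assumes that cpu.cfs_period_us is
    left at its default of 100000, that -c is the CPU time per request in
    usec, and that -R is the target requests per second.

```python
QUOTA_US = 100_000       # cpu.cfs_quota_us from the setup above
PERIOD_US = 100_000      # assumed default cpu.cfs_period_us
CPUTIME_US = 80_000      # schbench -c
REQUESTS_PER_SEC = 10    # schbench -R

cpu_limit = QUOTA_US / PERIOD_US                     # CPUs the group may use
avg_cpu_us_per_sec = CPUTIME_US * REQUESTS_PER_SEC   # CPU time requested per second
avg_utilization = avg_cpu_us_per_sec / (cpu_limit * 1_000_000)

p95_ratio = 22496 / CPUTIME_US
p99_ratio = 22752 / CPUTIME_US

print(f"average utilization: {avg_utilization:.0%}")
print(f"p95/cputime: {p95_ratio:.2%}, p99/cputime: {p99_ratio:.2%}")
```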
    
    The interference when using burst is measured by the probability of
    missing the deadline and the average WCET. Test results showed that when
    there are many cgroups or the CPU is underutilized, the interference is
    limited. More details are shown in:
    https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
    
    Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com