1. 25 Jul, 2018 9 commits
    • sched/numa: Evaluate move once per node · 305c1fac
      Srikar Dronamraju authored
      task_numa_compare() helps choose the best CPU to move or swap the
      selected task. To achieve this, task_numa_compare() is called for every
      CPU in the node. Currently it evaluates whether the task can be
      moved/swapped for each of the CPUs. However, the move evaluation is
      mostly independent of the CPU. Evaluating the move logic once per node
      provides scope for simplifying task_numa_compare().
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25705.2     25058.2     -2.51
      1     74433       72950       -1.99
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     96589.6     105930      9.670
      1     181830      178624      -1.76
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      440.65      941.32      758.98      189.17
      numa01.sh       Sys:      183.48      320.07      258.42       50.09
      numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
      numa02.sh      Real:       61.24       65.35       62.49        1.49
      numa02.sh       Sys:       16.83       24.18       21.40        2.60
      numa02.sh      User:     5219.59     5356.34     5264.03       49.07
      numa03.sh      Real:      822.04      912.40      873.55       37.35
      numa03.sh       Sys:      118.80      140.94      132.90        7.60
      numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
      numa04.sh      Real:      690.66      872.12      778.49       65.44
      numa04.sh       Sys:      459.26      563.03      494.03       42.39
      numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
      numa05.sh      Real:      418.37      562.28      525.77       54.27
      numa05.sh       Sys:      299.45      481.00      392.49       64.27
      numa05.sh      User:    34115.09    41324.02    39105.30     2627.68
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
      numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
      numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
      numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
      numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
      numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
      numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
      numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
      numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
      numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Remove redundant field · 6e303967
      Srikar Dronamraju authored
      'numa_entry' is a struct list_head defined in task_struct, but never used.
      
      No functional change.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Show the sum wait time of a task group · 3d6c50c2
      Yun Wang authored
      Although we can rely on cpuacct to present the CPU usage of task
      groups, it is hard to tell how intense the competition is between
      these groups for CPU resources.
      
      Monitoring the wait time or sched_debug of each process could be
      very expensive, and there is no good way to accurately represent the
      conflict with this information; we need the wait time at the group
      level.
      
      Thus we introduce group's wait_sum to represent the resource conflict
      between task groups, which is simply the sum of the wait time of
      the group's cfs_rq.
      
      The 'cpu.stat' is modified to show the statistic, like:
      
         nr_periods 0
         nr_throttled 0
         throttled_time 0
         wait_sum 2035098795584
      
      Now we can monitor the changes of wait_sum to tell how much a
      task group is suffering in the fight for CPU resources.
      
      For example:
      
         (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns) == X%
      
      means the task group spent X percent of the period waiting
      for the CPU.
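      As a rough illustration of the formula above, the following userspace
      sketch computes the waiting percentage from two wait_sum samples. The
      function name and its parameters are hypothetical, and it assumes the
      wait_sum values are nanosecond counters read from 'cpu.stat' at the
      start and end of a sampling period.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): percentage of the sampling
 * period that a task group spent waiting for a CPU, following
 *   (wait_sum - last_wait_sum) * 100 / (nr_cpu * period_ns)
 * Assumes both wait_sum samples are in nanoseconds.
 */
static double group_wait_percent(uint64_t last_wait_sum, uint64_t wait_sum,
                                 unsigned int nr_cpu, uint64_t period_ns)
{
    /* Guard against division by zero and a counter that went backwards. */
    if (nr_cpu == 0 || period_ns == 0 || wait_sum < last_wait_sum)
        return 0.0;

    return (double)(wait_sum - last_wait_sum) * 100.0 /
           ((double)nr_cpu * (double)period_ns);
}
```

      For instance, a group that accumulated 2 seconds of wait time over a
      1 second period on 4 CPUs would be reported as waiting 50% of the time.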
      Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/ff7dae3b-e5f9-7157-1caa-ff02c6b23dc1@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove #ifdefs from scale_rt_capacity() · 2e62c474
      Vincent Guittot authored
      Reuse cpu_util_irq() that has been defined for schedutil and set irq util
      to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
      
      But the compiler is not able to optimize the sequence (at least with
      aarch64 GCC 7.2.1):
      
      	free *= (max - irq);
      	free /= max;
      
      when irq is fixed to 0.
      
      Add a new inline function scale_irq_capacity() that will scale utilization
      when irq is accounted. Reuse this function in schedutil, which applies
      a similar formula.
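      A minimal userspace sketch of the scaling described above, built only
      from the two-line sequence quoted in this message. The "_sketch"
      function names are hypothetical and this is not the kernel's
      definition; it assumes max is non-zero and irq does not exceed max.

```c
/*
 * Illustrative only: scale a utilization value by the share of capacity
 * that is not consumed by IRQ time, i.e. util * (max - irq) / max.
 * Assumes max != 0 and irq <= max.
 */
static inline unsigned long scale_irq_capacity_sketch(unsigned long util,
                                                      unsigned long irq,
                                                      unsigned long max)
{
    util *= (max - irq);
    util /= max;
    return util;
}

/*
 * Variant for a build without IRQ time accounting: irq is always 0, so the
 * value is returned unchanged and no multiply/divide is emitted.
 */
static inline unsigned long scale_irq_capacity_noirq_sketch(unsigned long util,
                                                            unsigned long irq,
                                                            unsigned long max)
{
    (void)irq;
    (void)max;
    return util;
}
```

      Providing a separate trivial variant, rather than relying on the
      compiler to fold the arithmetic when irq is 0, reflects the
      optimization concern raised above.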
      Suggested-by: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE · f3d133ee
      Hailong Liu authored
      The NO_RT_RUNTIME_SHARE feature is used to prevent a CPU from borrowing
      enough runtime with a spin-rt-task.
      
      However, if the RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowed
      enough rt_runtime at the beginning, rt_runtime can't be restored to
      its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE.
      
      E.g. on my PC with 4 cores, the procedure to reproduce is:
      1) Make sure RT_RUNTIME_SHARE is enabled
       cat /sys/kernel/debug/sched_features
        GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
        CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
        LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
        NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
        ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
      2) Start a spin-rt-task
       ./loop_rr &
      3) Set affinity to the last CPU
       taskset -p 8 $pid_of_loop_rr
      4) Observe that the last CPU has borrowed enough runtime.
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      5) Disable RT_RUNTIME_SHARE
       echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
      6) Observe that rt_runtime has not been restored
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      
      This patch helps to restore rt_runtime after we disable
      RT_RUNTIME_SHARE.
      Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
      Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Update rq_clock of later_rq when pushing a task · 840d7196
      Daniel Bristot de Oliveira authored
      Daniel Casini got this warning while running a DL task here at RetisLab:
      
        [  461.137582] ------------[ cut here ]------------
        [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
        [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20
            [a ton of modules]
        [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
        [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
        [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
        [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
        [  461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082
        [  461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006
        [  461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0
        [  461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339
        [  461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20
        [  461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940
        [  461.137678] FS:  00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000
        [  461.137679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0
        [  461.137680] Call Trace:
        [  461.137684]  push_dl_task.part.46+0x3bc/0x460
        [  461.137686]  task_woken_dl+0x60/0x80
        [  461.137689]  ttwu_do_wakeup+0x4f/0x150
        [  461.137690]  ttwu_do_activate+0x77/0x80
        [  461.137692]  try_to_wake_up+0x1d6/0x4c0
        [  461.137693]  wake_up_q+0x32/0x70
        [  461.137696]  do_futex+0x7e7/0xb50
        [  461.137698]  __x64_sys_futex+0x8b/0x180
        [  461.137701]  do_syscall_64+0x5a/0x110
        [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [  461.137705] RIP: 0033:0x7efe4918ca26
        [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
        [  461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca
        [  461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26
        [  461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24
        [  461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001
        [  461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000
        [  461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10
        [  461.137743] ---[ end trace 187df4cad2bf7649 ]---
      
      This warning happened in push_dl_task(), because
      __add_running_bw()->cpufreq_update_util() is getting the rq_clock of
      the later_rq before its update, which takes place at activate_task().
      The fix then is to update the rq_clock before calling add_running_bw().
      
      To avoid a double rq_clock_update() call, we pass the ENQUEUE_NOCLOCK
      flag to activate_task().
      Reported-by: Daniel Casini <daniel.casini@santannapisa.it>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
      Fixes: e0367b12 ("sched/deadline: Move CPU frequency selection triggering points")
      Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • stop_machine: Disable preemption after queueing stopper threads · 2610e889
      Isaac J. Manjarres authored
      This commit:
      
        9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      
      does not fully address the race condition that can occur
      as follows:
      
      On one CPU, call it CPU 3, thread 1 invokes
      cpu_stop_queue_two_works(2, 3,...), and the execution is such
      that thread 1 queues the works for migration/2 and migration/3,
      and is preempted after releasing the locks for migration/2 and
      migration/3, but before waking the threads.
      
      Then, on CPU 2, a kworker, call it thread 2, is running,
      and it invokes cpu_stop_queue_two_works(1, 2,...), such that
      thread 2 queues the works for migration/1 and migration/2.
      Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
      migration/2 and migration/3. This means that when CPU 2
      releases the locks for migration/1 and migration/2, but before
      it wakes those threads, it can be preempted by migration/2.
      
      If thread 2 is preempted by migration/2, then migration/2 will
      execute the first work item successfully, since migration/3
      was woken up by CPU 3, but when it goes to execute the second
      work item, it disables preemption, calls multi_cpu_stop(),
      and thus, CPU 2 will wait forever for migration/1, which should
      have been woken up by thread 2. However migration/1 cannot be
      woken up by thread 2, since it is a kworker, so it is affine to
      CPU 2, but CPU 2 is running migration/2 with preemption
      disabled, so thread 2 will never run.
      
      Disable preemption after queueing works for stopper threads
      to ensure that the operation of queueing the works and waking
      the stopper threads is atomic.
      Co-Developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Co-Developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: gregkh@linuxfoundation.org
      Cc: matt@codeblueprint.co.uk
      Fixes: 9fb8d5dc ("stop_machine, Disable preemption when waking two stopper threads")
      Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Check variable group before dereferencing it · 6cd0c583
      Yi Wang authored
      The 'group' variable in sched_domain_debug_one() is not checked
      when first used in cpumask_test_cpu(cpu, sched_group_span(group)),
      but it might be NULL (it is checked later in the following while loop)
      and may cause a NULL pointer dereference.
      
      We need to check it before use to avoid the NULL dereference.
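      The pattern can be sketched in plain userspace C as follows. The
      struct and function here are hypothetical stand-ins (a simple bitmask
      instead of a real cpumask) and do not reproduce the kernel code; they
      only show the pointer being tested before its span is read.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduling group: one bit per CPU it spans. */
struct group_sketch {
    unsigned long span;
};

/*
 * Return true only if 'group' exists and its span contains 'cpu'.
 * The NULL test comes first, so the span is never read through a
 * NULL pointer.
 */
static bool group_contains_cpu(const struct group_sketch *group,
                               unsigned int cpu)
{
    if (!group)
        return false;

    /* Avoid shifting by more bits than the mask holds. */
    if (cpu >= sizeof(group->span) * CHAR_BIT)
        return false;

    return (group->span >> cpu) & 1UL;
}
```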
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1532319547-33335-1-git-send-email-wang.yi59@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 22 Jul, 2018 8 commits
  3. 21 Jul, 2018 12 commits
  4. 20 Jul, 2018 11 commits