  1. 19 May, 2020 1 commit
  2. 06 Mar, 2020 4 commits
  3. 28 Jan, 2020 1 commit
    • sched/rt: Optimize checking group RT scheduler constraints · b4fb015e
      Konstantin Khlebnikov authored
      Group RT scheduler contains protection against setting zero runtime for
      a cgroup with RT tasks. Right now tg_set_rt_bandwidth() iterates over
      all CPU cgroups and calls tg_has_rt_tasks() for any cgroup whose runtime
      is zero (not only for the changed one). The default RT runtime is zero,
      thus tg_has_rt_tasks() is called for almost all CPU cgroups.
      
      This protection is already slightly racy: the runtime limit can be
      changed between cpu_cgroup_can_attach() and cpu_cgroup_attach() because
      changing a cgroup attribute does not lock cgroup_mutex, while attach
      does not lock rt_constraints_mutex. Changing a task's scheduler class
      also races with changing the rt runtime: the check in
      __sched_setscheduler() isn't protected.
      
      Function tg_has_rt_tasks() iterates over all threads in the system.
      This gives NR_CGROUPS * NR_TASKS operations under a single tasklist_lock
      locked for read in tg_set_rt_bandwidth(). Any concurrent attempt to lock
      tasklist_lock for write (for example, fork) will get stuck with irqs
      disabled.
      
      This patch makes two optimizations:
      1) Remove the tasklist_lock and iterate only over the tasks in the cgroup
      2) Call tg_has_rt_tasks() only when the rt runtime changes from non-zero to zero
      
      All changed code is under CONFIG_RT_GROUP_SCHED.
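
      A minimal sketch of the cgroup-local iteration, assuming the
      css_task_iter API (details may differ from the final patch):

      	/* Sketch: walk only the tasks of this task_group's cgroup. */
      	static inline int tg_has_rt_tasks(struct task_group *tg)
      	{
      		struct css_task_iter it;
      		struct task_struct *task;
      		int ret = 0;

      		/* Autogroups never contain RT tasks; see autogroup_create(). */
      		if (task_group_is_autogroup(tg))
      			return 0;

      		css_task_iter_start(&tg->css, 0, &it);
      		while (!ret && (task = css_task_iter_next(&it)))
      			ret |= rt_task(task);
      		css_task_iter_end(&it);

      		return ret;
      	}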
      
      Testcase:
      
       # mkdir /sys/fs/cgroup/cpu/test{1..10000}
       # echo 0 | tee /sys/fs/cgroup/cpu/test*/cpu.rt_runtime_us
      
      While this runs, without the patch, fork time will be >100ms:
      
       # perf trace -e clone --duration 100 stress-ng --fork 1
      
      Remote ping will also show timings >100ms caused by the irq latency.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/157996383820.4651.11292439232549211693.stgit@buzz
  4. 25 Dec, 2019 1 commit
    • sched/rt: Make RT capacity-aware · 804d402f
      Qais Yousef authored
      Capacity awareness refers to the fact that on heterogeneous systems
      (like Arm big.LITTLE) the capacity of the CPUs is not uniform; hence,
      when placing tasks, we need to be aware of this difference in CPU
      capacities.
      
      In such scenarios we want to ensure that the selected CPU has enough
      capacity to meet the requirement of the running task, where enough
      capacity means capacity_orig_of(cpu) >= task.requirement.
      
      The definition of task.requirement is dependent on the scheduling class.
      
      For CFS, utilization is used to select a CPU that has capacity >=
      cfs_task.util:
      
      	capacity_orig_of(cpu) >= cfs_task.util
      
      DL isn't capacity-aware at the moment but can make use of its bandwidth
      reservation to implement this in a manner similar to how CFS uses
      utilization. The following patchset implements that:
      
      https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/
      
      	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime
      
      For RT we don't have a per-task utilization signal, and we lack any
      information in general about what performance requirement an RT task
      has. But with the introduction of uclamp, RT tasks can now control that
      by setting uclamp_min to guarantee a minimum performance point.

      At the moment the uclamp values are only used for frequency selection,
      but on heterogeneous systems this is not enough and we need to ensure
      that the capacity of the CPU is >= uclamp_min, which is what is
      implemented here:
      
      	capacity_orig_of(cpu) >= rt_task.uclamp_min
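
      A sketch of the resulting fitness check (hedged; modeled on the uclamp
      helpers, details may differ from the final patch):

      	/* Sketch: does this CPU satisfy the task's minimum uclamp request? */
      	static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
      	{
      		unsigned int min_cap, max_cap, cpu_cap;

      		/* Only heterogeneous systems need this check. */
      		if (!static_branch_unlikely(&sched_asym_cpucapacity))
      			return true;

      		min_cap = uclamp_eff_value(p, UCLAMP_MIN);
      		max_cap = uclamp_eff_value(p, UCLAMP_MAX);
      		cpu_cap = capacity_orig_of(cpu);

      		return cpu_cap >= min(min_cap, max_cap);
      	}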
      
      Note that by default uclamp.min is 1024, which means that RT tasks will
      always be biased towards the big CPUs, which makes for better, more
      predictable behavior in the default case.
      
      We must stress that the bias acts as a hint rather than a definite
      placement strategy. For example, if all big cores are busy executing
      other RT tasks, we can't guarantee that a new RT task will be placed
      there.
      
      On non-heterogeneous systems the original behavior of RT should be
      retained. Similarly if uclamp is not selected in the config.
      
      [ mingo: Minor edits to comments. ]
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 11 Nov, 2019 2 commits
  6. 08 Nov, 2019 1 commit
    • sched: Fix pick_next_task() vs 'change' pattern race · 6e2df058
      Peter Zijlstra authored
      Commit 67692435 ("sched: Rework pick_next_task() slow-path")
      inadvertently introduced a race because it changed a previously
      unexplored dependency between dropping the rq->lock and
      sched_class::put_prev_task().
      
      The comments about dropping rq->lock, in for example
      newidle_balance(), only mention the task being current and ->on_cpu
      being set. But when we look at the 'change' pattern (in for example
      sched_setnuma()):
      
      	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
      	running = task_current(rq, p); /* rq->curr == p */
      
      	if (queued)
      		dequeue_task(...);
      	if (running)
      		put_prev_task(...);
      
      	/* change task properties */
      
      	if (queued)
      		enqueue_task(...);
      	if (running)
      		set_next_task(...);
      
      It becomes obvious that if we do this after put_prev_task() has
      already been called on @p, things go sideways. This is exactly what
      the commit in question allows to happen when it does:
      
      	prev->sched_class->put_prev_task(rq, prev, rf);
      	if (!rq->nr_running)
      		newidle_balance(rq, rf);
      
      The newidle_balance() call will drop rq->lock after we've called
      put_prev_task() and that allows the above 'change' pattern to
      interleave and mess up the state.
      
      Furthermore, it turns out we lost the RT-pull when we put the last DL
      task.
      
      Fix both problems by extracting the balancing from put_prev_task() and
      doing a multi-class balance() pass before put_prev_task().
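
      Roughly, the slow path then becomes (a sketch; exact signatures may
      differ):

      	/* Balance classes from prev's class downwards, then switch prev out. */
      	for_class_range(class, prev->sched_class, &idle_sched_class) {
      		if (class->balance(rq, prev, rf))
      			break;
      	}

      	put_prev_task(rq, prev);

      	for_each_class(class) {
      		p = class->pick_next_task(rq, NULL, NULL);
      		if (p)
      			return p;
      	}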
      
      Fixes: 67692435 ("sched: Rework pick_next_task() slow-path")
      Reported-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Quentin Perret <qperret@google.com>
      Tested-by: Valentin Schneider <valentin.schneider@arm.com>
  7. 28 Aug, 2019 1 commit
  8. 08 Aug, 2019 4 commits
  9. 01 Aug, 2019 1 commit
  10. 24 Jun, 2019 1 commit
    • sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks · 982d9cdc
      Patrick Bellasi authored
      Each time a frequency update is required via schedutil, a frequency is
      selected to (possibly) satisfy the utilization reported by each
      scheduling class and by irqs. However, when utilization clamping is in
      use, the frequency selection should also consider userspace utilization
      clamping hints. This allows, for example, to:
      
       - boost tasks which are directly affecting the user experience
         by running them at least at a minimum "requested" frequency
      
       - cap low priority tasks not directly affecting the user experience
         by running them only up to a maximum "allowed" frequency
      
      These constraints are meant to support per-task tuning of the frequency
      selection, thus supporting a fine-grained definition of performance
      boosting vs. energy saving strategies in kernel space.
      
      Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
      within the boundaries defined by their aggregated utilization clamp
      constraints.
      
      Do that by considering max(min_util, max_util), to give boosted tasks
      the performance they need even when they happen to be co-scheduled with
      other capped tasks.
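
      A minimal sketch of such a clamping helper (hedged; the rq-level uclamp
      field names are assumptions):

      	/* Sketch: clamp a utilization value with the rq's aggregated bounds. */
      	static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
      	{
      		unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
      		unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

      		return clamp(util, min_util, max_util);
      	}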
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 03 Jun, 2019 1 commit
  12. 19 Apr, 2019 1 commit
  13. 04 Feb, 2019 1 commit
    • sched/fair: Update scale invariance of PELT · 23127296
      Vincent Guittot authored
      The current implementation of load tracking invariance scales the
      contribution with the current frequency and uarch performance (only for
      utilization) of the CPU. One main result of this formula is that the
      figures are capped by the current capacity of the CPU. Another is that
      the load_avg is not invariant because it is not scaled with uarch
      performance.
      
      The util_avg of a periodic task that runs r time slots every p time slots
      varies in the range:

          U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

      where U is the max util_avg value = SCHED_CAPACITY_SCALE.
      
      At a lower capacity, the range becomes:

          U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)

      with C reflecting the compute capacity ratio between the current
      capacity and the max capacity.

      So C tries to compensate for changes in (1-y^r'), but it can't be
      accurate.
      
      Instead of scaling the contribution value of the PELT algorithm, we
      should scale the running time. The PELT signal aims to track the amount
      of computation of tasks and/or rqs, so it seems more correct to scale
      the running time to reflect the effective amount of computation done
      since the last update.
      
      In order to be fully invariant, we need to apply the same amounts of
      running time and idle time whatever the current capacity. Because
      running at lower capacity implies that the task will run longer, we have
      to ensure that the same amount of idle time will be applied when the
      system becomes idle and no idle time has been "stolen". But reaching the
      maximum utilization value (SCHED_CAPACITY_SCALE) means that the task is
      seen as an always-running task whatever the capacity of the CPU (even at
      max compute capacity). In this case, we can discard this "stolen" idle
      time, which becomes meaningless.
      
      In order to achieve this time scaling, a new clock_pelt is created per
      rq. This clock advances at a rate scaled by the current capacity when
      something is running on the rq and synchronizes with clock_task when the
      rq is idle. With this mechanism, we ensure the same running and idle
      time whatever the current capacity. This also enables us to simplify the
      PELT algorithm by removing all references to uarch and frequency and
      applying the same contribution to utilization and load. Furthermore, the
      scaling is done only once per clock update (update_rq_clock_task())
      instead of during each update of the sched_entities and cfs/rt/dl_rq of
      the rq as in the current implementation. This is interesting when
      cgroups are involved, as shown in the results below:
      
      On a hikey (octo Arm64 platform), using the performance cpufreq governor
      and only the shallowest c-state to remove variance generated by those
      power features, so we only track the impact of the PELT algorithm.

      Each test runs 16 times:
      
      	./perf bench sched pipe
      	(higher is better)
      	kernel	tip/sched/core     + patch
      	        ops/seconds        ops/seconds         diff
      	cgroup
      	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
      	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
      	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%
      
      	hackbench -l 1000
      	(lower is better)
      	kernel	tip/sched/core     + patch
      	        duration(sec)      duration(sec)        diff
      	cgroup
      	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
      	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
      	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%
      
      The responsiveness of PELT is also improved by this new algorithm when
      the CPU is not running at max capacity. Below are some examples of the
      duration needed to reach some typical load values according to the
      capacity of the CPU, with the current implementation and with this
      patch. These values have been computed based on the geometric series and
      the half-period value:
      
        Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
        972 (95%)    138ms         not reachable            276ms
        486 (47.5%)  30ms          138ms                     60ms
        256 (25%)    13ms           32ms                     26ms
      
      On my hikey (octo Arm64 platform) with the schedutil governor, the time
      to reach the max OPP when starting from zero utilization decreases from
      223ms with the current scale invariance down to 121ms with the new
      algorithm.
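
      A sketch of the per-rq clock_pelt update described above (hedged; helper
      names may differ from the final patch):

      	/* Sketch: advance clock_pelt scaled by freq and uarch capacity. */
      	static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
      	{
      		if (unlikely(is_idle_task(rq->curr))) {
      			/* The rq is idle: sync clock_pelt back to clock_task. */
      			rq->clock_pelt = rq_clock_task(rq);
      			return;
      		}

      		/* Running: time passes "slower" at lower capacity/frequency. */
      		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
      		delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));

      		rq->clock_pelt += delta;
      	}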
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 11 Dec, 2018 1 commit
  15. 03 Nov, 2018 1 commit
    • sched/core: Introduce set_next_task() helper for better code readability · ff1cdc94
      Muchun Song authored
      When we pick the next task, we do the following for the task:

        1) p->se.exec_start = rq_clock_task(rq);
        2) dequeue_pushable(_dl)_task(rq, p);

      When we call set_curr_task(), we also need to do the same things
      above. In rt.c, the code for 1) is in _pick_next_task_rt() and the
      code for 2) is in pick_next_task_rt(). It would be better to put the
      two operations in one function. So, introduce a new function
      set_next_task(), which is responsible for doing the above.

      By introducing this function we can get rid of calling
      dequeue_pushable(_dl)_task() directly in pick_next_task() (we call
      set_next_task() instead) and gain better code readability and reuse.
      In set_curr_task_rt(), we can also call set_next_task().
      
      Do this such that we end up with:
      
        static struct task_struct *pick_next_task(struct rq *rq,
        					    struct task_struct *prev,
        					    struct rq_flags *rf)
        {
        	/* do something else ... */
      
        	put_prev_task(rq, prev);
      
        	/* pick next task p */
      
        	set_next_task(rq, p);
      
        	/* do something else ... */
        }
      
      put_prev_task() now matches set_next_task(), which makes the
      code more readable.
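
      In rt.c, the helper described above boils down to (sketch):

      	static inline void set_next_task(struct rq *rq, struct task_struct *p)
      	{
      		p->se.exec_start = rq_clock_task(rq);
      		/* The running task is never eligible for pushing. */
      		dequeue_pushable_task(rq, p);
      	}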
      Signed-off-by: Muchun Song <smuchun@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181026131743.21786-1-smuchun@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 29 Oct, 2018 1 commit
  17. 25 Jul, 2018 1 commit
    • sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHARE · f3d133ee
      Hailong Liu authored
      The NO_RT_RUNTIME_SHARE feature is used to prevent a CPU running a
      spin-rt-task from borrowing enough runtime to monopolize the CPU.

      However, if the RT_RUNTIME_SHARE feature is enabled and an rt_rq has
      borrowed enough rt_runtime at the beginning, rt_runtime can't be
      restored to its initial bandwidth after we disable RT_RUNTIME_SHARE.
      
      E.g. on my PC with 4 cores, the procedure to reproduce is:
      1) Make sure RT_RUNTIME_SHARE is enabled
       cat /sys/kernel/debug/sched_features
        GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
        CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
        LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
        NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
        ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
      2) Start a spin-rt-task
       ./loop_rr &
      3) Set affinity to the last CPU
       taskset -p 8 $pid_of_loop_rr
      4) Observe that the last CPU has borrowed enough runtime:
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      5) Disable RT_RUNTIME_SHARE
       echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
      6) Observe that rt_runtime cannot be restored:
       cat /proc/sched_debug | grep rt_runtime
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 900.000000
        .rt_runtime                    : 950.000000
        .rt_runtime                    : 1000.000000
      
      This patch helps restore rt_runtime after we disable
      RT_RUNTIME_SHARE.
      Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn>
      Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 15 Jul, 2018 2 commits
    • sched/core: Use PELT for scale_rt_capacity() · 523e979d
      Vincent Guittot authored
      The utilization of the CPU by RT, DL and IRQs is now tracked with PELT,
      so we can use these metrics instead of rt_avg to evaluate the remaining
      capacity available for the CFS class.
      
      scale_rt_capacity()'s behavior has been changed: it now returns the
      remaining capacity available for CFS instead of a scaling factor,
      because RT, DL and IRQ now provide absolute utilization values.
      
      The same formula as schedutil is used:

        IRQ util_avg + (1 - IRQ util_avg / max capacity) * /Sum rq util_avg

      but the implementation is different because it doesn't return the same
      value and doesn't benefit from the same optimization.
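
      A sketch of the reworked function (hedged; based on the helpers the
      series introduces, details may differ):

      	/* Sketch: capacity left for CFS = max - (RT + DL + IRQ pressure). */
      	static unsigned long scale_rt_capacity(int cpu)
      	{
      		struct rq *rq = cpu_rq(cpu);
      		unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
      		unsigned long used, free, irq;

      		irq = cpu_util_irq(rq);
      		if (unlikely(irq >= max))
      			return 1;

      		used = READ_ONCE(rq->avg_rt.util_avg);
      		used += READ_ONCE(rq->avg_dl.util_avg);
      		if (unlikely(used >= max))
      			return 1;

      		free = max - used;

      		/* Scale by the non-IRQ fraction of time: free * (max - irq) / max. */
      		return scale_irq_capacity(free, irq, max);
      	}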
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Add rt_rq utilization tracking · 371bf427
      Vincent Guittot authored
      The schedutil governor relies on cfs_rq's util_avg to choose the OPP
      when CFS tasks are running. When the CPU is overloaded by CFS and RT
      tasks, CFS tasks are preempted by RT tasks, and in this case util_avg
      reflects the remaining capacity but not what CFS wants to use. In such
      a case, schedutil can select a lower OPP even though the CPU is
      overloaded. In order to have a more accurate view of the utilization of
      the CPU, we track the utilization of RT tasks. Only util_avg is tracked;
      load_avg and runnable_load_avg are not, as they are useless for an
      rt_rq.
      
      rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
      the same at the root group level, so the PELT windows of the util_sum are
      aligned.
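
      A sketch of the rt_rq update, assuming the shared PELT helpers
      ___update_load_sum()/___update_load_avg() (signatures may differ):

      	/* Sketch: PELT update for rt_rq; only util is meaningful for RT. */
      	int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
      	{
      		if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
      				       running, running, running)) {
      			___update_load_avg(&rq->avg_rt, 1, 1);
      			return 1;
      		}

      		return 0;
      	}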
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  19. 03 Jul, 2018 1 commit
    • sched/rt: Fix call to cpufreq_update_util() · 296b2ffe
      Vincent Guittot authored
      With commit:
      
        8f111bc3 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
      
      the schedutil governor uses rq->rt.rt_nr_running to detect whether an
      RT task is currently running on the CPU and to set frequency to max
      if necessary.
      
      cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq(), but
      rq->rt.rt_nr_running has not yet been updated when dequeue_top_rt_rq()
      is called, so schedutil still considers that an RT task is running when
      the last task is dequeued. The update of rq->rt.rt_nr_running happens
      later, in dequeue_rt_stack().
      
      In fact, we can take advantage of the fact that rt entities are dequeued
      and then re-enqueued whenever an RT task is enqueued or dequeued. As a
      result, enqueue_top_rt_rq() is always called when a task is enqueued or
      dequeued, and also when groups are throttled or unthrottled. The only
      place that does not use enqueue_top_rt_rq() is when the root rt_rq is
      throttled.
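
      A hedged sketch of the resulting shape (elided bookkeeping marked in
      comments):

      	/* Sketch: kick schedutil only once rt_nr_running is consistent. */
      	static void dequeue_top_rt_rq(struct rt_rq *rt_rq)
      	{
      		struct rq *rq = rq_of_rt_rq(rt_rq);

      		/* ... clear rt_queued, sub_nr_running(), etc. ... */

      		/* Kick cpufreq now that we may be going idle. */
      		cpufreq_update_util(rq, 0);
      	}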
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: juri.lelli@redhat.com
      Cc: patrick.bellasi@arm.com
      Cc: viresh.kumar@linaro.org
      Fixes: 8f111bc3 ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
      Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 12 Jun, 2018 1 commit
    • treewide: kzalloc() -> kcalloc() · 6396bb22
      Kees Cook authored
      The kzalloc() function has a 2-factor argument form, kcalloc(). This
      patch replaces cases of:
      
              kzalloc(a * b, gfp)
      
      with:
              kcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kzalloc(a * b * c, gfp)
      
      with:
      
              kzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kzalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
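
      For example (hypothetical buf and count), the overflow-checked form
      becomes:

      	/* Before: open-coded multiplication can overflow. */
      	buf = kzalloc(count * sizeof(*buf), GFP_KERNEL);

      	/* After: kcalloc() checks count * size for overflow. */
      	buf = kcalloc(count, sizeof(*buf), GFP_KERNEL);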
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc
      + kcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(sizeof(THING) * C2, ...)
      |
        kzalloc(sizeof(TYPE) * C2, ...)
      |
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(C1 * C2, ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  21. 18 May, 2018 1 commit
  22. 05 Apr, 2018 2 commits
    • sched/core: Simplify helpers for rq clock update skip requests · adcc8da8
      Davidlohr Bueso authored
      By renaming the functions we can get rid of the skip parameter
      and have better code readability. It makes zero sense to have
      things such as:
      
        rq_clock_skip_update(rq, false)
      
      when the skip request is in fact not going to happen. Ever. Rename
      things such that we end up with:
      
        rq_clock_skip_update(rq)
        rq_clock_cancel_skipupdate(rq)
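
      For reference, a sketch of the two helpers (hedged; RQCF_* flag handling
      as in kernel/sched/sched.h):

      	/* Request that the next update_rq_clock() be skipped. */
      	static inline void rq_clock_skip_update(struct rq *rq)
      	{
      		lockdep_assert_held(&rq->lock);
      		rq->clock_update_flags |= RQCF_REQ_SKIP;
      	}

      	/* Cancel a previously requested skip. */
      	static inline void rq_clock_cancel_skipupdate(struct rq *rq)
      	{
      		lockdep_assert_held(&rq->lock);
      		rq->clock_update_flags &= ~RQCF_REQ_SKIP;
      	}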
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: matt@codeblueprint.co.uk
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning · d29a2064
      Davidlohr Bueso authored
      While running rt-tests' pi_stress program I got the following splat:
      
        rq->clock_update_flags < RQCF_ACT_SKIP
        WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20
      
        [...]
      
        <IRQ>
        enqueue_top_rt_rq+0xf4/0x150
        ? cpufreq_dbs_governor_start+0x170/0x170
        sched_rt_rq_enqueue+0x65/0x80
        sched_rt_period_timer+0x156/0x360
        ? sched_rt_rq_enqueue+0x80/0x80
        __hrtimer_run_queues+0xfa/0x260
        hrtimer_interrupt+0xcb/0x220
        smp_apic_timer_interrupt+0x62/0x120
        apic_timer_interrupt+0xf/0x20
        </IRQ>
      
        [...]
      
        do_idle+0x183/0x1e0
        cpu_startup_entry+0x5f/0x70
        start_secondary+0x192/0x1d0
        secondary_startup_64+0xa5/0xb0
      
      We can get rid of it by the "traditional" means of adding an
      update_rq_clock() call after acquiring the rq->lock in
      do_sched_rt_period_timer().
      
      The case of RT task throttling (which this workload also hits)
      can be ignored in that the skip_update call is actually bogus and
      quite the contrary (the request bits are removed/reverted).
      
      By setting RQCF_UPDATED we really don't care if the skip is happening
      or not and will therefore make the assert_clock_updated() check happy.
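
      A sketch of the fix (hedged), inside the per-CPU loop of
      do_sched_rt_period_timer():

      	raw_spin_lock(&rq->lock);
      	update_rq_clock(rq);	/* satisfies assert_clock_updated() downstream */

      	/* ... sched_rt_rq_enqueue() and throttling handling ... */

      	raw_spin_unlock(&rq->lock);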
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: linux-kernel@vger.kernel.org
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 09 Mar, 2018 1 commit
    • cpufreq/schedutil: Rewrite CPUFREQ_RT support · 8f111bc3
      Peter Zijlstra authored
      Instead of trying to duplicate scheduler state to track if an RT task
      is running, directly use the scheduler runqueue state for it.
      
      This vastly simplifies things and fixes a number of bugs related to
      sugov and the scheduler getting out of sync wrt this state.
      
      As a consequence we now also update the remote cfs/dl state when
      iterating the shared mask.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  24. 04 Mar, 2018 2 commits
    • sched/deadline, rt: Rename queue_push_tasks/queue_pull_task to create separate namespace · 02d8ec94
      Ingo Molnar authored
      There are similarly named functions in both of these modules:
      
        kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq)
        kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq)
        kernel/sched/deadline.c:	queue_push_tasks(rq);
        kernel/sched/deadline.c:	queue_pull_task(rq);
        kernel/sched/deadline.c:			queue_push_tasks(rq);
        kernel/sched/deadline.c:			queue_pull_task(rq);
        kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/rt.c:static inline void queue_pull_task(struct rq *rq)
        kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/rt.c:	queue_push_tasks(rq);
        kernel/sched/rt.c:	queue_pull_task(rq);
        kernel/sched/rt.c:			queue_push_tasks(rq);
        kernel/sched/rt.c:			queue_pull_task(rq);
      
      ... which makes it harder to grep for them. Prefix them with
      deadline_ and rt_, respectively.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/headers: Simplify and clean up header usage in the scheduler · 325ea10c
      Ingo Molnar authored
      Do the following cleanups and simplifications:
      
       - sched/sched.h already includes <asm/paravirt.h>, so no need to
         include it in sched/core.c again.
      
       - order the <linux/sched/*.h> headers alphabetically
      
       - add all <linux/sched/*.h> headers to kernel/sched/sched.h
      
       - remove all unnecessary includes from the .c files that
         are already included in kernel/sched/sched.h.
      
      Finally, make all scheduler .c files use a single common header:
      
        #include "sched.h"
      
      ... which now contains a union of the relied upon headers.
      
      This makes the various .c files easier to read and easier to handle.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  25. 03 Mar, 2018 1 commit
    • sched: Clean up and harmonize the coding style of the scheduler code base · 97fb7a0a
      Ingo Molnar authored
      A good number of small style inconsistencies have accumulated
      in the scheduler core, so do a pass over them to harmonize
      all these details:
      
       - fix spelling in comments,
      
       - use curly braces for multi-line statements,
      
       - remove unnecessary parentheses from integer literals,
      
       - capitalize consistently,
      
       - remove stray newlines,
      
       - add comments where necessary,
      
       - remove invalid/unnecessary comments,
      
       - align structure definitions and other data types vertically,
      
       - add missing newlines for increased readability,
      
       - fix vertical tabulation where it's misaligned,
      
       - harmonize preprocessor conditional block labeling
         and vertical alignment,
      
       - remove line-breaks where they uglify the code,
      
       - add newline after local variable definitions,
      
      No change in functionality:
      
        md5:
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  26. 21 Feb, 2018 1 commit
    • sched/isolation: Offload residual 1Hz scheduler tick · d84b3131
      Frederic Weisbecker authored
      When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
      keep the scheduler stats alive. However, this residual tick is a burden
      for bare-metal tasks that can't stand any interruption at all, or that
      want to minimize interruptions.

      The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
      outsource these scheduler ticks to the global workqueue so that a
      housekeeping CPU handles them remotely. The sched_class::task_tick()
      implementations have been audited and look safe to be called remotely,
      as the target runqueue and its current task are passed as parameters
      and don't seem to be accessed locally.

      Note that in the case of using isolcpus, it's still up to the user to
      affine the global workqueues to the housekeeping CPUs through
      /sys/devices/virtual/workqueue/cpumask or domain isolation
      "isolcpus=nohz,domain".
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  27. 13 Feb, 2018 1 commit
  28. 06 Feb, 2018 3 commits