An error occurred fetching the project authors.
  1. 30 Nov, 2017 1 commit
    • Steven Rostedt (Red Hat)'s avatar
      sched/rt: Simplify the IPI based RT balancing logic · 1c37ff78
      Steven Rostedt (Red Hat) authored
      commit 4bdced5c upstream.
      
      When a CPU lowers its priority (schedules out a high priority task for a
      lower priority one), a check is made to see if any other CPU has overloaded
      RT tasks (more than one). It checks the rto_mask to determine this and if so
      it will request to pull one of those tasks to itself if the non running RT
      task is of higher priority than the new priority of the next task to run on
      the current CPU.
      
      When we deal with large number of CPUs, the original pull logic suffered
      from large lock contention on a single CPU run queue, which caused a huge
      latency across all CPUs. This was caused by only having one CPU having
      overloaded RT tasks and a bunch of other CPUs lowering their priority. To
      solve this issue, commit:
      
        b6366f04 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
      
      changed the way to request a pull. Instead of grabbing the lock of the
      overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
      
      Although the IPI logic worked very well in removing the large latency build
      up, it still could suffer from a large number of IPIs being sent to a single
      CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
      when I tested this on a 120 CPU box, with a stress test that had lots of
      RT tasks scheduling on all CPUs, it actually triggered the hard lockup
      detector! One CPU had so many IPIs sent to it, and due to the restart
      mechanism that is triggered when the source run queue has a priority status
      change, the CPU spent minutes! processing the IPIs.
      
      Thinking about this further, I realized there's no reason for each run queue
      to send its own IPI. As all CPUs with overloaded tasks must be scanned
      regardless if there's one or many CPUs lowering their priority, because
      there's no current way to find the CPU with the highest priority task that
      can schedule to one of these CPUs, there really only needs to be one IPI
      being sent around at a time.
      
      This greatly simplifies the code!
      
      The new approach is to have each root domain have its own irq work, as the
      rto_mask is per root domain. The root domain has the following fields
      attached to it:
      
        rto_push_work	 - the irq work to process each CPU set in rto_mask
        rto_lock	 - the lock to protect some of the other rto fields
        rto_loop_start - an atomic that keeps contention down on rto_lock
      		    the first CPU scheduling in a lower priority task
      		    is the one to kick off the process.
        rto_loop_next	 - an atomic that gets incremented for each CPU that
      		    schedules in a lower priority task.
        rto_loop	 - a variable protected by rto_lock that is used to
      		    compare against rto_loop_next
        rto_cpu	 - The cpu to send the next IPI to, also protected by
      		    the rto_lock.
      
      When a CPU schedules in a lower priority task and wants to make sure
      overloaded CPUs know about it. It increments the rto_loop_next. Then it
      atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
      then it is done, as another CPU is kicking off the IPI loop. If the old
      value is "0", then it will take the rto_lock to synchronize with a possible
      IPI being sent around to the overloaded CPUs.
      
      If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
      IPI being sent around, or one is about to finish. Then rto_cpu is set to the
      first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
      in rto_mask, then there's nothing to be done.
      
      When the CPU receives the IPI, it will first try to push any RT tasks that is
      queued on the CPU but can't run because a higher priority RT task is
      currently running on that CPU.
      
      Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
      finds one, it simply sends an IPI to that CPU and the process continues.
      
      If there's no more CPUs in the rto_mask, then rto_loop is compared with
      rto_loop_next. If they match, everything is done and the process is over. If
      they do not match, then a CPU scheduled in a lower priority task as the IPI
      was being passed around, and the process needs to start again. The first CPU
      in rto_mask is sent the IPI.
      
      This change removes this duplication of work in the IPI logic, and greatly
      lowers the latency caused by the IPIs. This removed the lockup happening on
      the 120 CPU machine. It also simplifies the code tremendously. What else
      could anyone ask for?
      
      Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
      supplying me with the rto_start_trylock() and rto_start_unlock() helper
      functions.
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.homeSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c37ff78
  2. 31 Mar, 2017 1 commit
  3. 16 Aug, 2016 2 commits
    • Rafael J. Wysocki's avatar
      cpufreq / sched: Pass runqueue pointer to cpufreq_update_util() · 12bde33d
      Rafael J. Wysocki authored
      All of the callers of cpufreq_update_util() pass rq_clock(rq) to it
      as the time argument and some of them check whether or not cpu_of(rq)
      is equal to smp_processor_id() before calling it, so rework it to
      take a runqueue pointer as the argument and move the rq_clock(rq)
      evaluation into it.
      
      Additionally, provide a wrapper checking cpu_of(rq) against
      smp_processor_id() for the cpufreq_update_util() callers that
      need it.
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      12bde33d
    • Rafael J. Wysocki's avatar
      cpufreq / sched: Pass flags to cpufreq_update_util() · 58919e83
      Rafael J. Wysocki authored
      It is useful to know the reason why cpufreq_update_util() has just
      been called and that can be passed as flags to cpufreq_update_util()
      and to the ->func() callback in struct update_util_data.  However,
      doing that in addition to passing the util and max arguments they
      already take would be clumsy, so avoid it.
      
      Instead, use the observation that the schedutil governor is part
      of the scheduler proper, so it can access scheduler data directly.
      This allows the util and max arguments of cpufreq_update_util()
      and the ->func() callback in struct update_util_data to be replaced
      with a flags one, but schedutil has to be modified to follow.
      
      Thus make the schedutil governor obtain the CFS utilization
      information from the scheduler and use the "RT" and "DL" flags
      instead of the special utilization value of ULONG_MAX to track
      updates from the RT and DL sched classes.  Make it non-modular
      too to avoid having to export scheduler variables to modules at
      large.
      
      Next, update all of the other users of cpufreq_update_util()
      and the ->func() callback in struct update_util_data accordingly.
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      58919e83
  4. 12 May, 2016 1 commit
  5. 10 May, 2016 1 commit
    • Xunlei Pang's avatar
      sched/rt, sched/dl: Don't push if task's scheduling class was changed · 13b5ab02
      Xunlei Pang authored
      We got this warning:
      
          WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0
          [...]
          Call Trace:
      
          dump_stack+0x63/0x87
          __warn+0xd1/0xf0
          warn_slowpath_null+0x1d/0x20
          set_task_cpu+0x1af/0x1c0
          push_dl_task.part.34+0xea/0x180
          push_dl_tasks+0x17/0x30
          __balance_callback+0x45/0x5c
          __sched_setscheduler+0x906/0xb90
          SyS_sched_setattr+0x150/0x190
          do_syscall_64+0x62/0x110
          entry_SYSCALL64_slow_path+0x25/0x25
      
      This corresponds to:
      
          WARN_ON_ONCE(p->state == TASK_RUNNING &&
                   p->sched_class == &fair_sched_class &&
                   (p->on_rq && !task_on_rq_migrating(p)))
      
      It happens because in find_lock_later_rq(), the task whose scheduling
      class was changed to fair class is still pushed away as if it were
      a deadline task ...
      
      So, check in find_lock_later_rq() after double_lock_balance(), if the
      scheduling class of the deadline task was changed, break and retry.
      
      Apply the same logic to RT tasks.
      Signed-off-by: default avatarXunlei Pang <xlpang@redhat.com>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Link: http://lkml.kernel.org/r/1462767091-1215-1-git-send-email-xlpang@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      13b5ab02
  6. 05 May, 2016 1 commit
    • Peter Zijlstra's avatar
      locking/lockdep, sched/core: Implement a better lock pinning scheme · e7904a28
      Peter Zijlstra authored
      The problem with the existing lock pinning is that each pin is of
      value 1; this mean you can simply unpin if you know its pinned,
      without having any extra information.
      
      This scheme generates a random (16 bit) cookie for each pin and
      requires this same cookie to unpin. This means you have to keep the
      cookie in context.
      
      No objsize difference for !LOCKDEP kernels.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e7904a28
  7. 28 Apr, 2016 1 commit
  8. 09 Mar, 2016 1 commit
  9. 02 Mar, 2016 1 commit
    • Frederic Weisbecker's avatar
      sched: Account rr tasks · 01d36d0a
      Frederic Weisbecker authored
      In order to evaluate the scheduler tick dependency without probing
      context switches, we need to know how much SCHED_RR and SCHED_FIFO tasks
      are enqueued as those policies don't have the same preemption
      requirements.
      
      To prepare for that, let's account SCHED_RR tasks, we'll be able to
      deduce SCHED_FIFO tasks as well from it and the total RT tasks in the
      runqueue.
      Reviewed-by: default avatarChris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      01d36d0a
  10. 29 Feb, 2016 2 commits
    • Steven Rostedt's avatar
      sched/rt: Kick RT bandwidth timer immediately on start up · c3a990dc
      Steven Rostedt authored
      I've been debugging why deadline tasks can cause the RT scheduler to
      throttle, even when the deadline tasks are only taking up 50% of the
      CPU and RT tasks are not even using 1% of the CPU. Here's what I found.
      
      In order to keep a CPU from being hogged by RT tasks, the deadline
      scheduler adds its run time (delta_exec) to the rt_time of the RT
      bandwidth. That way, if the two use more than 95% of the CPU within one
      second (default settings), the RT tasks are throttled to allow non RT
      tasks to run.
      
      Although the deadline tasks add their run time to the RT bandwidth, it
      lets the RT tasks do the accounting. This is where the problem lies. If
      a deadline task runs for a bit, and no RT tasks are running, then it
      will continually add to the RT rt_time that is used to calculate how
      much CPU the RT tasks use. But no RT period is in play, and this
      accumulation of the runtime never gets reset.
      
      When an RT task finally gets to run, and the watchdog goes off, it can
      see that the RT task has used more than it should of, because the
      deadline task added all this runtime to its rt_time. Then the RT task
      that just woke up gets throttled for no good reason.
      
      I also noticed that when an RT task is queued, it starts the timer to
      account for overload and such. But that timer goes off one period
      later, which may be too late and the extra rt_time will trigger a
      throttle.
      
      This is a quick work around to the problem. When a new RT task is
      queued, the bandwidth timer is set to go off immediately. Then the
      timer can clear out the extra time added to the rt_time while there was
      no RT task running. This stops my tests from triggering the throttle,
      and it will still throttle if an RT task runs too much, even while a
      deadline task is running.
      
      A better solution may be to subtract the bandwidth that the deadline
      task uses from the rt_runtime, and add it back when its finished. Then
      there wont be a need for runtime tracking of the time used by deadline
      tasks.
      
      I may play with that solution tomorrow.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <juri.lelli@gmail.com>
      Cc: <williams@redhat.com>
      Cc: Clark Williams
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160216183746.349ec98b@gandalf.local.homeSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      c3a990dc
    • Peter Zijlstra's avatar
      sched/rt: Fix PI handling vs. sched_setscheduler() · ff77e468
      Peter Zijlstra authored
      Andrea Parri reported:
      
      > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
      > handled correctly:
      >
      >     T1 (prio = 20)
      >        lock(rtmutex);
      >
      >     T2 (prio = 20)
      >        blocks on rtmutex  (rt_nr_boosted = 0 on T1's rq)
      >
      >     T1 (prio = 20)
      >        sys_set_scheduler(prio = 0)
      >           [new_effective_prio == oldprio]
      >           T1 prio = 20    (rt_nr_boosted = 0 on T1's rq)
      >
      > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
      > in particular, if we continue with
      >
      >    T1 (prio = 20)
      >       unlock(rtmutex)
      >          wakeup(T2)
      >          adjust_prio(T1)
      >             [prio != rt_mutex_getprio(T1)]
      >	    dequeue(T1)
      >	       rt_nr_boosted = (unsigned long)(-1)
      >	       ...
      >             T1 prio = 0
      >
      > then we end up leaving rt_nr_boosted in an "inconsistent" state.
      >
      > The simple program attached could reproduce the previous scenario; note
      > that, as a consequence of the presence of this state, the "assertion"
      >
      >     WARN_ON(!rt_nr_running && rt_nr_boosted)
      >
      > from dec_rt_group() may trigger.
      
      So normally we dequeue/enqueue tasks in sched_setscheduler(), which
      would ensure the accounting stays correct. However in the early PI path
      we fail to do so.
      
      So this was introduced at around v3.14, by:
      
        c365c292 ("sched: Consider pi boosting in setscheduler()")
      
      which fixed another problem exactly because that dequeue/enqueue, joy.
      
      Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
      preserve runqueue location with that option. This requires decoupling
      the on_rt_rq() state from being on the list.
      
      In order to allow for explicit movement during the SAVE/RESTORE,
      introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
      cases to preserve other invariants.
      
      Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
      things like sys_nice()/sys_sched_setaffinity() also do not reorder
      FIFO tasks (whereas they used to before this patch).
      Reported-by: default avatarAndrea Parri <parri.andrea@gmail.com>
      Tested-by: default avatarAndrea Parri <parri.andrea@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ff77e468
  11. 23 Nov, 2015 1 commit
  12. 23 Sep, 2015 1 commit
  13. 12 Aug, 2015 2 commits
  14. 03 Aug, 2015 1 commit
  15. 18 Jun, 2015 4 commits
  16. 18 May, 2015 1 commit
    • Peter Zijlstra's avatar
      sched,perf: Fix periodic timers · 4cfafd30
      Peter Zijlstra authored
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      Further complications is that we want the timer handler to always do
      the forward such that it will always correctly deal with the overruns,
      and we do not want to race such that the handler has already decided
      to stop, but the (external) restart sees the timer still active and we
      end up with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually its easy to detect if you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
      4cfafd30
  17. 08 May, 2015 1 commit
  18. 22 Apr, 2015 1 commit
    • Peter Zijlstra's avatar
      sched: Cleanup bandwidth timers · 77a4d1a1
      Peter Zijlstra authored
      Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().
      
      The more I look at that code the more I'm convinced its crack, that
      entire __start_cfs_bandwidth() thing is brain melting, we don't need to
      cancel a timer before starting it, *hrtimer_start*() will happily remove
      the timer for you if its still enqueued.
      
      Removing that, removes a big part of the problem, no more ugly cancel
      loop to get stuck in.
      
      So now, if I understand things right, the entire reason you have this
      cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
      accidentally lose the timer.
      
      It appears to me that it should be possible to guarantee that same by
      unconditionally (re)starting the timer when !queued. Because regardless
      what hrtimer::function will return, if we beat it to (re)enqueue the
      timer, it doesn't matter.
      
      Now, because hrtimers don't come with any serialization guarantees we
      must ensure both handler and (re)start loop serialize their access to
      the hrtimer to avoid both trying to forward the timer at the same
      time.
      
      Update the rt bandwidth timer to match.
      
      This effectively reverts: 09dc4ab0 ("sched/fair: Fix
      tg_set_cfs_bandwidth() deadlock on rq->lock").
      Reported-by: default avatarRoman Gushchin <klamm@yandex-team.ru>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarBen Segall <bsegall@google.com>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      77a4d1a1
  19. 02 Apr, 2015 1 commit
  20. 23 Mar, 2015 1 commit
    • Steven Rostedt's avatar
      sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Steven Rostedt authored
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box was rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple affect with the
      rq locks especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balanceing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task wont work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, to only find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enabled or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
      
      Update: This original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
      Parts-suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.homeSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      b6366f04
  21. 30 Jan, 2015 1 commit
  22. 14 Jan, 2015 1 commit
  23. 16 Nov, 2014 2 commits
    • Wanpeng Li's avatar
      sched: Move p->nr_cpus_allowed check to select_task_rq() · 6c1d9410
      Wanpeng Li authored
      Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
      This change will make fair.c, rt.c, and deadline.c all start with the
      same logic.
      Suggested-and-Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "pang.xunlei" <pang.xunlei@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6c1d9410
    • Stanislaw Gruszka's avatar
      sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
      Stanislaw Gruszka authored
      Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
      test case in cost of breaking another one. After that commit, calling
      clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
      of Y time being smaller than X time.
      
      Reproducer/tester can be found further below, it can be compiled and ran by:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
      This reproducer, when running on a buggy kernel, will complain
      about "clock_gettime difference too small".
      
      Issue happens because on start in thread_group_cputimer() we initialize
      sum_exec_runtime of cputimer with threads runtime not yet accounted and
      then add the threads runtime to running cputimer again on scheduler
      tick, making it's sum_exec_runtime bigger than actual threads runtime.
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
      This patch takes different approach to cure the problem. It calls
      update_curr() when cputimer starts, that assure we will have updated
      stats of running threads and on the next schedule tick we will account
      only the runtime that elapsed from cputimer start. That also assure we
      have consistent state between cpu times of individual threads and cpu
      time of the process consisted by those threads.
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
      			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
      			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
      Signed-off-by: default avatarStanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6e998916
  24. 04 Nov, 2014 1 commit
  25. 24 Sep, 2014 1 commit
  26. 19 Sep, 2014 1 commit
  27. 28 Aug, 2014 1 commit
    • Christoph Lameter's avatar
      percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t · 4ba29684
      Christoph Lameter authored
      __get_cpu_var can paper over differences in the definitions of
      cpumask_var_t and either use the address of the cpumask variable
      directly or perform a fetch of the address of the struct cpumask
      allocated elsewhere. This is important particularly when using per cpu
      cpumask_var_t declarations because in one case we have an offset into
      a per cpu area to handle and in the other case we need to fetch a
      pointer from the offset.
      
      This patch introduces a new macro
      
      this_cpu_cpumask_var_ptr()
      
      that is defined where cpumask_var_t is defined and performs the proper
      actions. All use cases where __get_cpu_var is used with cpumask_var_t
      are converted to the use of this_cpu_cpumask_var_ptr().
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      4ba29684
  28. 20 Aug, 2014 1 commit
    • Kirill Tkhai's avatar
      sched: Add wrapper for checking task_struct::on_rq · da0c1e65
      Kirill Tkhai authored
      Implement task_on_rq_queued() and use it everywhere instead of
      on_rq check. No functional changes.
      
      The only exception is we do not use the wrapper in
      check_for_tasks(), because it requires to export
      task_on_rq_queued() in global header files. Next patch in series
      would return it back, so we do not twist it from here to there.
      Signed-off-by: default avatarKirill Tkhai <ktkhai@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhaiSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      da0c1e65
  29. 16 Jul, 2014 1 commit
    • Kirill Tkhai's avatar
      sched: Transform resched_task() into resched_curr() · 8875125e
      Kirill Tkhai authored
      We always use resched_task() with rq->curr argument.
      It's not possible to reschedule any task but rq's current.
      
      The patch introduces resched_curr(struct rq *) to
      replace all of the repeating patterns. The main aim
      is cleanup, but there is a little size profit too:
      
        (before)
      	$ size kernel/sched/built-in.o
      	   text	   data	    bss	    dec	    hex	filename
      	155274	  16445	   7042	 178761	  2ba49	kernel/sched/built-in.o
      
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	7411490	1178376	 991232	9581098	 92322a	vmlinux
      
        (after)
      	$ size kernel/sched/built-in.o
      	   text	   data	    bss	    dec	    hex	filename
      	155130	  16445	   7042	 178617	  2b9b9	kernel/sched/built-in.o
      
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	7411362	1178376	 991232	9580970	 9231aa	vmlinux
      
      	I was choosing between resched_curr() and resched_rq(),
      	and the first name looks better for me.
      
      A little lie in Documentation/trace/ftrace.txt. I have not
      actually collected the tracing again. With a hope the patch
      won't make execution times much worse :)
      Signed-off-by: default avatarKirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhostSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8875125e
  30. 05 Jul, 2014 1 commit
  31. 05 Jun, 2014 1 commit
  32. 04 Jun, 2014 2 commits