1. 04 Apr, 2017 3 commits
    • sched/rtmutex/deadline: Fix a PI crash for deadline tasks · e96a7705
      Xunlei Pang authored
      A crash happened while I was playing with deadline PI rtmutex.
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
          PGD 232a75067 PUD 230947067 PMD 0
          Oops: 0000 [#1] SMP
          CPU: 1 PID: 10994 Comm: a.out Not tainted
      
          Call Trace:
          [<ffffffff810b658c>] enqueue_task+0x2c/0x80
          [<ffffffff810ba763>] activate_task+0x23/0x30
          [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
          [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
          [<ffffffff8164e783>] __schedule+0xd3/0x900
          [<ffffffff8164efd9>] schedule+0x29/0x70
          [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
          [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
          [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
          [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
          [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
          [<ffffffff810edd50>] SyS_futex+0x80/0x180
      
      This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
      are only protected by pi_lock when operating on the pi waiters,
      while rt_mutex_get_top_task() accesses them with the rq lock held
      but without holding pi_lock.
      
      In order to tackle this, we introduce a new "pi_top_task" pointer
      cached in task_struct, and add a new rt_mutex_update_top_task()
      to update its value. It can be called by rt_mutex_setprio(), which
      holds both the owner's pi_lock and the rq lock, so "pi_top_task"
      can be safely accessed by enqueue_task_dl() under the rq lock.
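      
      For illustration, a minimal userspace sketch of this caching pattern
      (pthread mutexes stand in for pi_lock and the rq lock; all names are
      made up, this is not the kernel code): the cached pointer is only
      written while holding both locks, so a reader holding either lock
      alone sees a stable value.
      
        #include <pthread.h>
      
        struct task {
                int prio;
        };
      
        static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
        static struct task *top_task;           /* analogue of p->pi_top_task */
      
        static void update_top_task(struct task *t)
        {
                pthread_mutex_lock(&pi_lock);
                pthread_mutex_lock(&rq_lock);
                top_task = t;                   /* written under both locks */
                pthread_mutex_unlock(&rq_lock);
                pthread_mutex_unlock(&pi_lock);
        }
      
        static int top_task_prio(void)
        {
                int prio;
      
                pthread_mutex_lock(&rq_lock);   /* rq lock alone suffices to read */
                prio = top_task ? top_task->prio : -1;
                pthread_mutex_unlock(&rq_lock);
                return prio;
        }
      
        int main(void)
        {
                static struct task t = { .prio = 10 };
      
                update_top_task(&t);
                return top_task_prio() == 10 ? 0 : 1;
        }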
      
      Originally-From: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e96a7705
    • rtmutex: Deboost before waking up the top waiter · 2a1c6029
      Xunlei Pang authored
      We should deboost before waking the high-priority task, such that we
      don't run two tasks with the same "state" (priority, deadline,
      sched_class, etc).
      
      In order to make sure the boosting task doesn't start running between
      unlock and deboost (due to a 'spurious' wakeup), we move the deboost
      under the wait_lock; that way it is serialized against the wait loop
      in __rt_mutex_slowlock().
      
      Doing the deboost early can, however, lead to priority inversion if
      current gets preempted after the deboost but before waking our
      high-prio task; hence we disable preemption before the deboost and
      re-enable it after the wakeup is done.
      
      This gets us the right semantic order, but most importantly this
      change ensures pointer stability for the next patch, where we have
      rt_mutex_setprio() cache a pointer to the top-most waiter task. If
      we did the wakeup first and then deboosted, as before this change,
      that pointer might point into thin air.
      
      [peterz: Changelog + patch munging]
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      2a1c6029
    • Merge branch 'sched/core' into locking/core · 38bffdac
      Thomas Gleixner authored
      Required for the rtmutex/sched_deadline patches, which depend on both
      branches.
      38bffdac
  2. 30 Mar, 2017 4 commits
    • locking/ww-mutex: Limit stress test to 2 seconds · 57dd924e
      Chris Wilson authored
      Use a timeout rather than a fixed number of loops to avoid running for
      very long periods, such as under the kbuilder VMs.
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170310105733.6444-1-chris@chris-wilson.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      57dd924e
    • sched/fair: Optimize ___update_sched_avg() · a481db34
      Yuyang Du authored
      The main PELT function ___update_load_avg(), which implements the
      accumulation and progression of the geometric average series, is
      implemented along the following lines for the scenario where the time
      delta spans all 3 possible sections (see figure below):
      
        1. add the remainder of the last incomplete period
        2. decay old sum
        3. accumulate new sum in full periods since last_update_time
        4. accumulate the current incomplete period
        5. update averages
      
      Or:
      
                  d1          d2           d3
                  ^           ^            ^
                  |           |            |
                |<->|<----------------->|<--->|
        ... |---x---|------| ... |------|-----x (now)
      
        load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)
      
                                              p
      	      weight * scale * 1024 * \Sum y^n +		(3)
                                             n=1
      
      	      weight * scale * d3 * y^0				(4)
      
        load_avg' = load_sum' / LOAD_AVG_MAX				(5)
      
      Where:
      
       d1 - is the delta part completing the remainder of the last
            incomplete period,
       d2 - is the delta part spanning complete periods, and
       d3 - is the delta part starting the current incomplete period.
      
      We can simplify the code in two steps; the first step is to separate
      the first term into new and old parts like:
      
        (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
      					       weight * scale * d1 * y^(p+1)
      
      Once we've done that, it's easy to see that all new terms carry the
      common factors:
      
        weight * scale
      
      If we factor those out, we arrive at the form:
      
        load_sum' = load_sum * y^(p+1) +
      
      	      weight * scale * (d1 * y^(p+1) +
      
      					 p
      			        1024 * \Sum y^n +
      					n=1
      
      				d3 * y^0)
      
      Which results in a simpler, smaller and faster implementation.
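      
      For illustration only, here is a standalone C sketch of the factored
      accumulation above (not the kernel implementation: y and the 1024us
      period are hard-coded, floating point is used, and decay() merely
      stands in for the kernel's decay helper):
      
        #include <stdio.h>
        #include <math.h>
      
        /* y is chosen such that y^32 = 0.5, as in PELT */
        static const double y = 0.978572062;
      
        /* decay a running sum over n full 1024us periods */
        static double decay(double val, unsigned int n)
        {
                return val * pow(y, n);
        }
      
        /*
         * load_sum' = load_sum * y^(p+1) +
         *             weight * scale * (d1 * y^(p+1) + 1024 * \Sum_{n=1..p} y^n + d3)
         */
        static double update_load_sum(double load_sum, double weight, double scale,
                                      double d1, unsigned int p, double d3)
        {
                double series = 0.0;
                unsigned int n;
      
                for (n = 1; n <= p; n++)
                        series += 1024.0 * pow(y, n);
      
                return decay(load_sum, p + 1) +
                       weight * scale * (decay(d1, p + 1) + series + d3);
        }
      
        int main(void)
        {
                /* toy numbers: 300us remainder, 4 full periods, 200us of the new period */
                printf("%f\n", update_load_sum(5000.0, 1.0, 1.0, 300.0, 4, 200.0));
                return 0;
        }
      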
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: matt@codeblueprint.co.uk
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a481db34
    • sched/fair: Explicitly generate __update_load_avg() instances · 0ccb977f
      Peter Zijlstra authored
      The __update_load_avg() function is an __always_inline because it's
      used with constant propagation to generate different variants of the
      code without having to duplicate it (which would be prone to bugs).
      
      Explicitly instantiate the 3 variants.
      
      Note that most of this is called from rather hot paths, so reducing
      branches is good.
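      
      A toy sketch of this pattern (made-up flags and wrapper names, not
      the kernel code): each explicit wrapper passes compile-time constants
      into the always-inline helper, so constant propagation removes the
      dead branches in every variant.
      
        #include <stdio.h>
      
        /* always-inline helper; 'running' and 'cfs' are compile-time
           constants at every call site below */
        __attribute__((always_inline))
        static inline int __update(int val, int running, int cfs)
        {
                if (running)
                        val += 1;
                if (cfs)
                        val *= 2;
                return val;
        }
      
        /* the explicitly generated variants */
        static int update_blocked(int val) { return __update(val, 0, 0); }
        static int update_se(int val)      { return __update(val, 1, 0); }
        static int update_cfs_rq(int val)  { return __update(val, 1, 1); }
      
        int main(void)
        {
                printf("%d %d %d\n",
                       update_blocked(3), update_se(3), update_cfs_rq(3));
                return 0;
        }
      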
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ccb977f
    • locking/atomic: Fix atomic_try_cmpxchg() semantics · 44fe8445
      Peter Zijlstra authored
      Dmitry noted that the new atomic_try_cmpxchg() primitive is broken when
      the old pointer doesn't point to the local stack.
      
      He writes:
      
        "Consider a classical lock-free stack push:
      
          node->next = atomic_read(&head);
          do {
          } while (!atomic_try_cmpxchg(&head, &node->next, node));
      
        This code is broken with the current implementation; the problem is
        the unconditional update of *__po.
      
        In case of success it writes the same value back into *__po, but in
        case of cmpxchg success we might have lost ownership of some memory
        locations, and potentially of what __po has pointed to. The same
        holds for the re-read of *__po."
      
      He also points out that this makes it surprisingly different from the
      similar C/C++ atomic operation.
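      
      For reference, a minimal C11 sketch of the semantics the fix aligns
      with (the type and function names here are illustrative):
      atomic_compare_exchange_strong() only writes back to the expected
      value on failure, so a successful push never touches memory the
      pusher may no longer own.
      
        #include <stdatomic.h>
      
        struct node {
                struct node *next;
        };
      
        static _Atomic(struct node *) head;
      
        static void node_push(struct node *node)
        {
                node->next = atomic_load(&head);
                do {
                        /* on failure node->next is refreshed with the current
                           head; on success nothing is written back */
                } while (!atomic_compare_exchange_strong(&head, &node->next, node));
        }
      
        int main(void)
        {
                static struct node n;
      
                node_push(&n);
                return atomic_load(&head) == &n ? 0 : 1;
        }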
      
      After investigating the code-gen differences caused by this patch, and
      a number of alternatives (Linus dislikes this interface a lot), we
      arrived at these results (size of x86_64-defconfig/vmlinux):
      
        GCC-6.3.0:
      
        10735757        cmpxchg
        10726413        try_cmpxchg
        10730509        try_cmpxchg + patch
        10730445        try_cmpxchg-linus
      
        GCC-7 (20170327):
      
        10709514        cmpxchg
        10704266        try_cmpxchg
        10704266        try_cmpxchg + patch
        10704394        try_cmpxchg-linus
      
      From this we see that the patch has the advantage of better code-gen
      on GCC-7 and keeps the interface roughly consistent with the C
      language variant.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: a9ebf306 ("locking/atomic: Introduce atomic_try_cmpxchg()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      44fe8445
  3. 27 Mar, 2017 1 commit
    • sched/fair: Prefer sibiling only if local group is under-utilized · 05b40e05
      Srikar Dronamraju authored
      If the child domain prefers tasks to go to siblings, the local group
      could end up pulling tasks to itself even if it is almost as loaded
      as the source group.
      
      Let's assume a 4-core, SMT-2 machine running a 5-thread ebizzy workload.
      Every time the local group has capacity and the source group has at
      least 2 threads, the local group tries to pull a task. This causes the
      threads to constantly move between different cores. This is even more
      pronounced if the cores have more threads, as in POWER8 SMT-8 mode.
      
      Fix this by only allowing the local group to pull a task if the source
      group has more tasks than the local group.
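      
      A hedged sketch of the resulting check as a standalone predicate (the
      names are made up, not the exact kernel identifiers):
      
        #include <stdbool.h>
        #include <stdio.h>
      
        /* pull only if the source (busiest) group runs strictly more tasks
           than the local group */
        static bool should_pull(unsigned int busiest_nr, unsigned int local_nr)
        {
                return busiest_nr > local_nr;
        }
      
        int main(void)
        {
                printf("%d %d\n", should_pull(2, 2), should_pull(3, 2)); /* 0 1 */
                return 0;
        }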
      
      Here are the relevant perf stat numbers from a 22-core, SMT-8 POWER8 machine.
      
      Without patch:
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
                     366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
                   3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
                   3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
                   5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
                   2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
                  10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )
      
      With patch:
      
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
                     123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
                   3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
                     189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
                   5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
                     506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
                  10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )
      
      These numbers show that CPU migrations are reduced significantly in
      these workloads.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      05b40e05
  4. 26 Mar, 2017 1 commit
  5. 23 Mar, 2017 18 commits
    • futex: Drop hb->lock before enqueueing on the rtmutex · 56222b21
      Peter Zijlstra authored
      When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
      chain code will (falsely) report a deadlock and BUG.
      
      The problem is that it holds hb->lock (now an rt_mutex) while doing
      task_blocks_on_rt_mutex() on the futex's pi_state::rtmutex. This, when
      interleaved just right with futex_unlock_pi(), leads it to believe it
      sees an AB-BA deadlock.
      
        Task1 (holds rt_mutex,	Task2 (does FUTEX_LOCK_PI)
               does FUTEX_UNLOCK_PI)
      
      				lock hb->lock
      				lock rt_mutex (as per start_proxy)
        lock hb->lock
      
      Which is a trivial AB-BA.
      
      It is not an actual deadlock, because it won't be holding hb->lock by the
      time it actually blocks on the rt_mutex, but the chainwalk code doesn't
      know that and it would be a nightmare to handle this gracefully.
      
      To avoid this problem, do the same as in futex_unlock_pi() and drop
      hb->lock after acquiring wait_lock. This still fully serializes against
      futex_unlock_pi(), since adding to the wait_list does the very same lock
      dance, and removing it holds both locks.
      
      Aside from solving the RT problem, this makes the lock and unlock
      mechanism symmetric and reduces the hb->lock hold time.
      Reported-and-tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      56222b21
    • futex: Futex_unlock_pi() determinism · bebe5b51
      Peter Zijlstra authored
      The problem with returning -EAGAIN when the waiter state mismatches is
      that it becomes very hard to prove a bounded execution time for the
      operation. And seeing that this is an RT operation, that is somewhat
      important.
      
      While in practice, given the previous patch, it will be very unlikely
      to ever really take more than one or two rounds, proving so becomes
      rather hard.
      
      However, now that modifying the wait_list is done while holding both
      hb->lock and wait_lock, the scenario can be avoided entirely by
      acquiring wait_lock while still holding hb->lock, doing a hand-over
      without leaving a hole.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bebe5b51
    • futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() · cfafcd11
      Peter Zijlstra authored
      By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
      modifications are done under both hb->lock and wait_lock.
      
      This closes the obvious interleave pattern between futex_lock_pi() and
      futex_unlock_pi(), but not entirely so. See below:
      
      Before:
      
      futex_lock_pi()			futex_unlock_pi()
        unlock hb->lock
      
      				  lock hb->lock
      				  unlock hb->lock
      
      				  lock rt_mutex->wait_lock
      				  unlock rt_mutex_wait_lock
      				    -EAGAIN
      
        lock rt_mutex->wait_lock
        list_add
        unlock rt_mutex->wait_lock
      
        schedule()
      
        lock rt_mutex->wait_lock
        list_del
        unlock rt_mutex->wait_lock
      
      				  <idem>
      				    -EAGAIN
      
        lock hb->lock
      
      
      After:
      
      futex_lock_pi()			futex_unlock_pi()
      
        lock hb->lock
        lock rt_mutex->wait_lock
        list_add
        unlock rt_mutex->wait_lock
        unlock hb->lock
      
        schedule()
      				  lock hb->lock
      				  unlock hb->lock
        lock hb->lock
        lock rt_mutex->wait_lock
        list_del
        unlock rt_mutex->wait_lock
      
      				  lock rt_mutex->wait_lock
      				  unlock rt_mutex_wait_lock
      				    -EAGAIN
      
        unlock hb->lock
      
      
      It does, however, solve the earlier starvation/live-lock scenario
      introduced by the -EAGAIN: unlike the before scenario, where the
      -EAGAIN happens while futex_unlock_pi() doesn't hold any locks, in the
      after scenario it happens while futex_unlock_pi() actually holds a
      lock, and it is then serialized on that lock.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      cfafcd11
    • futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() · 38d589f2
      Peter Zijlstra authored
      With the ultimate goal of keeping the rt_mutex wait_list and futex_q
      waiters consistent, it's necessary to split 'rt_mutex_futex_lock()'
      into finer parts, such that only the actual blocking is done without
      hb->lock held.
      
      Split rt_mutex_finish_proxy_lock() into two parts: one that does the
      blocking and one that does remove_waiter() when the lock acquisition
      failed.
      
      When the rtmutex was acquired successfully, the waiter can be removed
      in the acquisition path safely, since there is no concurrency on the
      lock owner.
      
      This means that, except for futex_lock_pi(), all wait_list modifications
      are done with both hb->lock and wait_lock held.
      
      [bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      38d589f2
    • futex,rt_mutex: Introduce rt_mutex_init_waiter() · 50809358
      Peter Zijlstra authored
      Since there are already two copies of this code, introduce a helper
      now before adding a third one.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      50809358
    • futex: Pull rt_mutex_futex_unlock() out from under hb->lock · 16ffa12d
      Peter Zijlstra authored
      There are a number of 'interesting' problems, all caused by holding
      hb->lock while doing the rt_mutex_unlock() equivalent.
      
      Notably:
      
       - a PI inversion on hb->lock; and,
      
       - a SCHED_DEADLINE crash because of pointer instability.
      
      The previous changes:
      
       - changed the locking rules to cover {uval,pi_state} with wait_lock.
      
       - allowed rt_mutex_futex_unlock() to be done without dropping
         wait_lock, which in turn allows relying on wait_lock atomicity
         completely.
      
       - simplified the waiter conundrum.
      
      It's now sufficient to hold rtmutex::wait_lock and a reference on the
      pi_state to protect the state consistency, so hb->lock can be dropped
      before calling rt_mutex_futex_unlock().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      16ffa12d
    • futex: Rework inconsistent rt_mutex/futex_q state · 73d786bd
      Peter Zijlstra authored
      There is a weird state in the futex_unlock_pi() path when it interleaves
      with a concurrent futex_lock_pi() at the point where it drops hb->lock.
      
      In this case, it can happen that the rt_mutex wait_list and the futex_q
      disagree on pending waiters, in particular rt_mutex will find no pending
      waiters where futex_q thinks there are. In this case the rt_mutex unlock
      code cannot assign an owner.
      
      The futex side fixup code has to clean up the inconsistencies with
      quite a bunch of interesting corner cases.
      
      Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
      situation occurs. This then gives the futex_lock_pi() code the opportunity
      to continue and the retried futex_unlock_pi() will now observe a coherent
      state.
      
      The only problem is that this breaks RT timeliness guarantees. That
      is, consider the following scenario:
      
        T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)
      
          CPU0
      
          T1
            lock_pi()
            queue_me()  <- Waiter is visible
      
          preemption
      
          T2
            unlock_pi()
      	loops with -EAGAIN forever
      
      Which is undesirable for PI primitives. Future patches will rectify
      this.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      73d786bd
    • futex: Cleanup refcounting · bf92cf3a
      Peter Zijlstra authored
      Add a put_pi_state() as counterpart for get_pi_state() so the
      refcounting becomes consistent.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bf92cf3a
    • futex: Change locking rules · 734009e9
      Peter Zijlstra authored
      Currently futex-pi relies on hb->lock to serialize everything. But
      hb->lock creates another set of problems, especially priority
      inversions on RT where hb->lock becomes an rt_mutex itself.
      
      The rt_mutex::wait_lock is the most obvious protection for keeping the
      futex user space value and the kernel internal pi_state in sync.
      
      Rework and document the locking so rt_mutex::wait_lock is held across
      all operations which modify the user space value and the pi state.
      
      This allows invoking rt_mutex_unlock() (including deboost) without
      holding hb->lock as a next step.
      
      Nothing yet relies on the new locking rules.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      734009e9
    • futex,rt_mutex: Provide futex specific rt_mutex API · 5293c2ef
      Peter Zijlstra authored
      Part of what makes futex_unlock_pi() intricate is that
      rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
      rt_mutex::wait_lock.
      
      This means it cannot rely on the atomicity of wait_lock, which would
      be preferred in order to not rely on hb->lock so much.
      
      The reason rt_mutex_slowunlock() needs to drop wait_lock is that it
      can race with the rt_mutex fastpath; however, futexes have their own
      fast path.
      
      Since futexes already have a bunch of separate rt_mutex accessors,
      complete that set and implement an rt_mutex variant without a
      fastpath for them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5293c2ef
    • futex: Remove rt_mutex_deadlock_account_*() · fffa954f
      Peter Zijlstra authored
      These are unused and clutter up the code.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      fffa954f
    • futex: Use smp_store_release() in mark_wake_futex() · 1b367ece
      Peter Zijlstra authored
      Since the futex_q can disappear the instruction after assigning NULL,
      this really should be a RELEASE barrier. That also stops loads from
      hitting dead memory.
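      
      A C11 analogue of the idea (illustrative types, not the kernel
      structures): a release store keeps all earlier memory accesses
      ordered before the pointer update, so none of them can be reordered
      past the point where the structure may already be freed.
      
        #include <stdatomic.h>
        #include <stddef.h>
      
        struct waiter {
                _Atomic(void *) lock_ptr;
                int woken;
        };
      
        static void mark_woken(struct waiter *q)
        {
                q->woken = 1;                   /* ordinary accesses ... */
                atomic_store_explicit(&q->lock_ptr, NULL,
                                      memory_order_release); /* ... ordered before this */
        }
      
        int main(void)
        {
                static struct waiter q;
      
                mark_woken(&q);
                return q.woken ? 0 : 1;
        }
      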
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1b367ece
    • futex: Cleanup variable names for futex_top_waiter() · 499f5aca
      Peter Zijlstra authored
      futex_top_waiter() returns the top waiter on the pi_mutex. Assigning
      this to a variable called 'match' totally obscures the code.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: dvhart@infradead.org
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      499f5aca
    • locking/atomic/x86: Use atomic_try_cmpxchg() · e6790e4b
      Peter Zijlstra authored
      Better code generation:
      
            text           data  bss        name
        10665111        4530096  843776     defconfig-build/vmlinux.3
        10655703        4530096  843776     defconfig-build/vmlinux.4
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e6790e4b
    • locking/refcounts: Use atomic_try_cmpxchg() · b78c0d47
      Peter Zijlstra authored
      Generates better code (GCC-6.2.1):
      
        text        filename
        1576        defconfig-build/lib/refcount.o.pre
        1488        defconfig-build/lib/refcount.o.post
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b78c0d47
    • locking/atomic: Introduce atomic_try_cmpxchg() · a9ebf306
      Peter Zijlstra authored
      Add a new cmpxchg interface:
      
        bool try_cmpxchg(u{8,16,32,64} *ptr, u{8,16,32,64} *val, u{8,16,32,64} new);
      
      Where the boolean returns the result of the compare, and thus whether
      the exchange happened; in case of failure, the new value of *ptr is
      returned in *val.
      
      This allows simplification/improvement of loops like:
      
      	for (;;) {
      		new = val $op $imm;
      		old = cmpxchg(ptr, val, new);
      		if (old == val)
      			break;
      		val = old;
      	}
      
      into:
      
      	do {
      	} while (!try_cmpxchg(ptr, &val, val $op $imm));
      
      while also generating better code (GCC6 and onwards).
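      
      For comparison, a userspace analogue of that loop shape using GCC's
      __atomic builtins, which share the "expected value is refreshed on
      failure" behaviour (the counter example is made up):
      
        #include <stdio.h>
      
        static int counter;
      
        static void add_one(int *ptr)
        {
                int val = __atomic_load_n(ptr, __ATOMIC_RELAXED);
      
                /* retried until the CAS succeeds; 'val' is updated on failure */
                do {
                } while (!__atomic_compare_exchange_n(ptr, &val, val + 1, 0,
                                                      __ATOMIC_RELAXED,
                                                      __ATOMIC_RELAXED));
        }
      
        int main(void)
        {
                add_one(&counter);
                printf("%d\n", counter);        /* prints 1 */
                return 0;
        }
      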
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a9ebf306
    • sched/fair: Fix FTQ noise bench regression · bc427898
      Vincent Guittot authored
      A regression of the FTQ noise has been reported by Ying Huang,
      on the following hardware:
      
        8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory
      
      ... which was caused by this commit:
      
        commit 4e516076 ("sched/fair: Propagate asynchrous detach")
      
      The only part of the patch that can increase the noise is the update
      of the blocked load of group entities in update_blocked_averages().
      
      We can optimize this call and skip the update of a group entity if its
      load and utilization are already zero and there is no pending
      propagation of load in the task group.
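      
      A hedged sketch of that skip condition as a standalone predicate (the
      field names approximate the scheduler's averages and are not
      necessarily the exact ones used by the patch):
      
        #include <stdbool.h>
        #include <stdio.h>
      
        struct load { unsigned long load_avg, util_avg; };
      
        /* a group entity whose averages are already zero and which has no
           pending load propagation can be skipped entirely */
        static bool can_skip(const struct load *se_avg, bool propagate_pending)
        {
                return !se_avg->load_avg && !se_avg->util_avg && !propagate_pending;
        }
      
        int main(void)
        {
                struct load idle = { 0, 0 };
      
                printf("%d\n", can_skip(&idle, false));  /* prints 1 */
                return 0;
        }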
      
      This optimization partly restores the noise score. A more aggressive
      optimization was tried but showed a worse score.
      
      Reported-by: ying.huang@linux.intel.com
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: ying.huang@intel.com
      Fixes: 4e516076 ("sched/fair: Propagate asynchrous detach")
      Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
      [ Fixed typos, improved layout. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bc427898
    • sched/core: Fix rq lock pinning warning after call balance callbacks · d7921a5d
      Wanpeng Li authored
      This can be reproduced by running rt-migrate-test:
      
       WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock()
       unpinning an unpinned lock
       ...
       Call Trace:
        dump_stack()
        __warn()
        warn_slowpath_fmt()
        lock_unpin_lock()
        __balance_callback()
        __schedule()
        schedule()
        futex_wait_queue_me()
        futex_wait()
        do_futex()
        SyS_futex()
        do_syscall_64()
        entry_SYSCALL64_slow_path()
      
      Revert the rq_lock_irqsave() usage here; the whole point of
      balance_callback() was to allow dropping rq->lock.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 8a8c69c3 ("sched/core: Add rq->lock wrappers")
      Link: http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d7921a5d
  6. 16 Mar, 2017 13 commits