Commit 1a1fb985 authored by Thomas Gleixner's avatar Thomas Gleixner

futex: Handle early deadlock return correctly

commit 56222b21 ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is not longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_proxy_start() due to
a detected deadlock was not updated accordingly. That allows a concurrent
unlocker to observe inconsitent state which triggers the warning in the
unlock path.

futex_lock_pi()                         futex_unlock_pi()
  lock(hb->lock)
  queue(hb_waiter)				lock(hb->lock)
  lock(rtmutex->wait_lock)
  unlock(hb->lock)
                                        // acquired hb->lock
                                        hb_waiter = futex_top_waiter()
                                        lock(rtmutex->wait_lock)
  __rt_mutex_proxy_start()
     ---> fail
          remove(rtmutex_waiter);
     ---> returns -EDEADLOCK
  unlock(rtmutex->wait_lock)
                                        // acquired wait_lock
                                        wake_futex_pi()
                                        rt_mutex_next_owner()
					  --> returns NULL
                                          --> WARN

  lock(hb->lock)
  unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task reacquired the hash bucket lock.

Treat the failure case of __rt_mutex_proxy_start() in the same way and let
the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
removal for the failure case, so that the other callsites are still
operating correctly.

Add proper comments to the code so all these details are fully documented.

Thanks to Peter for helping with the analysis and writing the really
valuable code comments.

Fixes: 56222b21 ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: default avatarPeter Zijlstra <peterz@infradead.org>
Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
Tested-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Stefan Liebler <stli@linux.ibm.com>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de
parent 6f568ebe
...@@ -2861,35 +2861,39 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ...@@ -2861,35 +2861,39 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
* and BUG when futex_unlock_pi() interleaves with this. * and BUG when futex_unlock_pi() interleaves with this.
* *
* Therefore acquire wait_lock while holding hb->lock, but drop the * Therefore acquire wait_lock while holding hb->lock, but drop the
* latter before calling rt_mutex_start_proxy_lock(). This still fully * latter before calling __rt_mutex_start_proxy_lock(). This
* serializes against futex_unlock_pi() as that does the exact same * interleaves with futex_unlock_pi() -- which does a similar lock
* lock handoff sequence. * handoff -- such that the latter can observe the futex_q::pi_state
* before __rt_mutex_start_proxy_lock() is done.
*/ */
raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock); raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
spin_unlock(q.lock_ptr); spin_unlock(q.lock_ptr);
/*
* __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
* such that futex_unlock_pi() is guaranteed to observe the waiter when
* it sees the futex_q::pi_state.
*/
ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current); ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock); raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
if (ret) { if (ret) {
if (ret == 1) if (ret == 1)
ret = 0; ret = 0;
goto cleanup;
spin_lock(q.lock_ptr);
goto no_block;
} }
if (unlikely(to)) if (unlikely(to))
hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS); hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter); ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
cleanup:
spin_lock(q.lock_ptr); spin_lock(q.lock_ptr);
/* /*
* If we failed to acquire the lock (signal/timeout), we must * If we failed to acquire the lock (deadlock/signal/timeout), we must
* first acquire the hb->lock before removing the lock from the * first acquire the hb->lock before removing the lock from the
* rt_mutex waitqueue, such that we can keep the hb and rt_mutex * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
* wait lists consistent. * lists consistent.
* *
* In particular; it is important that futex_unlock_pi() can not * In particular; it is important that futex_unlock_pi() can not
* observe this inconsistency. * observe this inconsistency.
...@@ -3013,6 +3017,10 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags) ...@@ -3013,6 +3017,10 @@ static int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
* there is no point where we hold neither; and therefore * there is no point where we hold neither; and therefore
* wake_futex_pi() must observe a state consistent with what we * wake_futex_pi() must observe a state consistent with what we
* observed. * observed.
*
* In particular; this forces __rt_mutex_start_proxy() to
* complete such that we're guaranteed to observe the
* rt_waiter. Also see the WARN in wake_futex_pi().
*/ */
raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock); raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
spin_unlock(&hb->lock); spin_unlock(&hb->lock);
......
...@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock, ...@@ -1726,12 +1726,33 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
rt_mutex_set_owner(lock, NULL); rt_mutex_set_owner(lock, NULL);
} }
/**
* __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
* @lock: the rt_mutex to take
* @waiter: the pre-initialized rt_mutex_waiter
* @task: the task to prepare
*
* Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
* detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
*
* NOTE: does _NOT_ remove the @waiter on failure; must either call
* rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
*
* Returns:
* 0 - task blocked on lock
* 1 - acquired the lock for task, caller should wake it up
* <0 - error
*
* Special API call for PI-futex support.
*/
int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter, struct rt_mutex_waiter *waiter,
struct task_struct *task) struct task_struct *task)
{ {
int ret; int ret;
lockdep_assert_held(&lock->wait_lock);
if (try_to_take_rt_mutex(lock, task, NULL)) if (try_to_take_rt_mutex(lock, task, NULL))
return 1; return 1;
...@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, ...@@ -1749,9 +1770,6 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
ret = 0; ret = 0;
} }
if (unlikely(ret))
remove_waiter(lock, waiter);
debug_rt_mutex_print_deadlock(waiter); debug_rt_mutex_print_deadlock(waiter);
return ret; return ret;
...@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock, ...@@ -1763,12 +1781,18 @@ int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
* @waiter: the pre-initialized rt_mutex_waiter * @waiter: the pre-initialized rt_mutex_waiter
* @task: the task to prepare * @task: the task to prepare
* *
* Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
* detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
*
* NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
* on failure.
*
* Returns: * Returns:
* 0 - task blocked on lock * 0 - task blocked on lock
* 1 - acquired the lock for task, caller should wake it up * 1 - acquired the lock for task, caller should wake it up
* <0 - error * <0 - error
* *
* Special API call for FUTEX_REQUEUE_PI support. * Special API call for PI-futex support.
*/ */
int rt_mutex_start_proxy_lock(struct rt_mutex *lock, int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter, struct rt_mutex_waiter *waiter,
...@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock, ...@@ -1778,6 +1802,8 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
raw_spin_lock_irq(&lock->wait_lock); raw_spin_lock_irq(&lock->wait_lock);
ret = __rt_mutex_start_proxy_lock(lock, waiter, task); ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
if (unlikely(ret))
remove_waiter(lock, waiter);
raw_spin_unlock_irq(&lock->wait_lock); raw_spin_unlock_irq(&lock->wait_lock);
return ret; return ret;
...@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock, ...@@ -1845,7 +1871,8 @@ int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
* @lock: the rt_mutex we were woken on * @lock: the rt_mutex we were woken on
* @waiter: the pre-initialized rt_mutex_waiter * @waiter: the pre-initialized rt_mutex_waiter
* *
* Attempt to clean up after a failed rt_mutex_wait_proxy_lock(). * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
* rt_mutex_wait_proxy_lock().
* *
* Unless we acquired the lock; we're still enqueued on the wait-list and can * Unless we acquired the lock; we're still enqueued on the wait-list and can
* in fact still be granted ownership until we're removed. Therefore we can * in fact still be granted ownership until we're removed. Therefore we can
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment