    powerpc/qspinlock: Fix deadlock in MCS queue · 734ad0af
    Nysal Jan K.A. authored
    If an interrupt occurs in queued_spin_lock_slowpath() after we increment
    qnodesp->count and before node->lock is initialized, another CPU might
    see stale lock values in get_tail_qnode(). If the stale lock value happens
    to match the lock being acquired, then we write to the "next" pointer of
    the wrong qnode. This causes a deadlock as the former CPU, once it becomes
    the head of the MCS queue, will spin indefinitely until its "next" pointer
    is set by its successor in the queue.
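    
    The window is easiest to see in a simplified model of the per-CPU qnode
    bookkeeping used by the slowpath (illustrative C only, loosely based on
    arch/powerpc/lib/qspinlock.c -- the struct layout and the helper name
    used here are simplified assumptions, not verbatim kernel source):
    
       /* Illustrative sketch, not verbatim kernel code. */
       #define MAX_NODES 4
    
       struct qspinlock;                       /* opaque for this sketch */
    
       struct qnode {
               struct qnode     *next;
               struct qspinlock *lock;         /* lock this slot is (or was) queued on */
               int              locked;
       };
    
       struct qnodes {
               int          count;             /* nesting depth on this CPU */
               struct qnode nodes[MAX_NODES];  /* one slot per nesting level */
       };
    
       static void mcs_enqueue(struct qnodes *qnodesp, struct qspinlock *lock)
       {
               int idx = qnodesp->count++;
               /*
                * An interrupt taken at this point already sees the incremented
                * count, so a nested slowpath uses nodes[idx + 1] -- but
                * nodes[idx].lock still holds whatever lock this slot queued on
                * the last time it was used.
                */
               struct qnode *node = &qnodesp->nodes[idx];
    
               node->next   = NULL;
               node->lock   = lock;            /* only now is the stale value overwritten */
               node->locked = 0;
               /* ...publish this CPU as the new tail and spin... */
       }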
    
    Running stress-ng on a 16-core (16EC/16VP) shared LPAR results in
    occasional lockups similar to the following:
    
       $ stress-ng --all 128 --vm-bytes 80% --aggressive \
                   --maximize --oomable --verify  --syslog \
                   --metrics  --times  --timeout 5m
    
       watchdog: CPU 15 Hard LOCKUP
       ......
       NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490
       LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
       Call Trace:
        0xc000002cfffa3bf0 (unreliable)
        _raw_spin_lock+0x6c/0x90
        raw_spin_rq_lock_nested.part.135+0x4c/0xd0
        sched_ttwu_pending+0x60/0x1f0
        __flush_smp_call_function_queue+0x1dc/0x670
        smp_ipi_demux_relaxed+0xa4/0x100
        xive_muxed_ipi_action+0x20/0x40
        __handle_irq_event_percpu+0x80/0x240
        handle_irq_event_percpu+0x2c/0x80
        handle_percpu_irq+0x84/0xd0
        generic_handle_irq+0x54/0x80
        __do_irq+0xac/0x210
        __do_IRQ+0x74/0xd0
        0x0
        do_IRQ+0x8c/0x170
        hardware_interrupt_common_virt+0x29c/0x2a0
       --- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490
       ......
       NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490
       LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
       --- interrupt: 500
        0xc0000029c1a41d00 (unreliable)
        _raw_spin_lock+0x6c/0x90
        futex_wake+0x100/0x260
        do_futex+0x21c/0x2a0
        sys_futex+0x98/0x270
        system_call_exception+0x14c/0x2f0
        system_call_vectored_common+0x15c/0x2ec
    
    The following code flow illustrates how the deadlock occurs.
    For the sake of brevity, assume that both locks (A and B) are
    contended and we call the queued_spin_lock_slowpath() function.
    
            CPU0                                   CPU1
            ----                                   ----
      spin_lock_irqsave(A)                          |
      spin_unlock_irqrestore(A)                     |
        spin_lock(B)                                |
             |                                      |
             ▼                                      |
       id = qnodesp->count++;                       |
      (Note that nodes[0].lock == A)                |
             |                                      |
             ▼                                      |
          Interrupt                                 |
      (happens before "nodes[0].lock = B")          |
             |                                      |
             ▼                                      |
      spin_lock_irqsave(A)                          |
             |                                      |
             ▼                                      |
       id = qnodesp->count++                        |
       nodes[1].lock = A                            |
             |                                      |
             ▼                                      |
      Tail of MCS queue                             |
             |                             spin_lock_irqsave(A)
             ▼                                      |
      Head of MCS queue                             ▼
             |                             CPU0 is previous tail
             ▼                                      |
       Spin indefinitely                            ▼
      (until "nodes[1].next != NULL")      prev = get_tail_qnode(A, CPU0)
                                                    |
                                                    ▼
                                           prev == &qnodes[CPU0].nodes[0]
                                         (as qnodes[CPU0].nodes[0].lock == A)
                                                    |
                                                    ▼
                                           WRITE_ONCE(prev->next, node)
                                                    |
                                                    ▼
                                            Spin indefinitely
                                         (until nodes[0].locked == 1)
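    
    On CPU1's side, the wrong qnode is chosen because the tail lookup
    identifies the previous tail's qnode by scanning that CPU's slots for a
    matching lock pointer, starting at index 0. The following sketch of that
    matching logic reuses the structures from the sketch above; the helper
    name is hypothetical and the body is a simplified model of what
    get_tail_qnode() does, not verbatim kernel code:
    
       /*
        * The scan starts at index 0, so the stale nodes[0].lock == A is
        * matched before nodes[1], the slot that is genuinely queued on A.
        */
       static struct qnode *find_tail_qnode(struct qnodes *qnodesp,
                                            struct qspinlock *lock)
       {
               int idx;
    
               for (idx = 0; idx < MAX_NODES; idx++) {
                       struct qnode *qnode = &qnodesp->nodes[idx];
    
                       if (qnode->lock == lock)
                               return qnode;   /* may be the stale slot */
               }
               return NULL;                    /* the real code treats this as a bug */
       }
    
       /* CPU1 then appends itself to the wrong qnode:                  */
       /*   WRITE_ONCE(prev->next, node);                               */
       /* prev is &nodes[0], while CPU0's interrupt handler is waiting  */
       /* for nodes[1].next to become non-NULL.                         */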
    
    Thanks to Saket Kumar Bhaskar for help with recreating the issue
    
    Fixes: 84990b16 ("powerpc/qspinlock: add mcs queueing for contended waiters")
    Cc: stable@vger.kernel.org # v6.2+
    Reported-by: Geetika Moolchandani <geetika@linux.ibm.com>
    Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
    Reported-by: Jijo Varghese <vargjijo@in.ibm.com>
    Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
    Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://msgid.link/20240829022830.1164355-1-nysal@linux.ibm.com