• Douglas Anderson's avatar
    serial: qcom-geni: Don't cancel/abort if we can't get the port lock · 9e957a15
    Douglas Anderson authored
    As of commit d7402513 ("arm64: smp: IPI_CPU_STOP and
    IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI
    enabled then we'll use it to stop CPUs at panic time. This is nice,
    but it does mean that there's a pretty good chance that we'll end up
    stopping a CPU while it holds the port lock for the console
    UART. Specifically, I see a CPU get stopped while holding the port
    lock nearly 100% of the time on my sc7180-trogdor based Chromebook by
    enabling the "buddy" hardlockup detector and then doing:
    
      sysctl -w kernel.hardlockup_all_cpu_backtrace=1
      sysctl -w kernel.hardlockup_panic=1
      echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
    
    UART drivers are _supposed_ to handle this case OK and this is why
    UART drivers check "oops_in_progress" and only do a "trylock" in that
    case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a
    very well-tested situation.
    
    Now that we're testing the situation a lot, it can be seen that the
    Qualcomm GENI UART driver is pretty broken. Specifically, when I run
    my test case and look at the console output I just see a bunch of
    garbled output like:
    
      [  201.069084] NMI backtrace[  201.069084] NM[  201.069087] CPU: 6
      PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12
      #1 01112b9f14923cbd0b[  201.069090] Hardware name: Google Lazor
      ([  201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[
      201.069095] pc : smp_call_function_man[  201.069099]
    
    That's obviously not so great. This happens because each call to the
    console driver exits after the data has been written to the FIFO but
    before it's actually been flushed out of the serial port. When we have
    multiple calls into the console one after the other then (if we can't
    get the lock) each call tells the UART to throw away any data in the
    FIFO that hadn't been transferred yet.
    
    I've posted up a patch to change the arm64 core to avoid this
    situation most of the time [1] much like x86 seems to do, but even if
    that patch lands the GENI driver should still be fixed.
    
    >From testing, it appears that we can just delete the cancel/abort in
    the case where we weren't able to get the UART lock and the output
    looks good. It makes sense that we'd be able to do this since that
    means we'll just call into __qcom_geni_serial_console_write() and
    __qcom_geni_serial_console_write() looks much like
    qcom_geni_serial_poll_put_char() but with a loop. However, it seems
    safest to poll the FIFO and make sure it's empty before our
    transfer. This should reliably make sure that we're not
    interrupting/clobbering any existing transfers.
    
    As part of this change, we'll also avoid re-setting up a TX at the end
    of the console write function if we weren't able to get the lock,
    since accessing "port->tx_remaining" without the lock is not
    safe. This is only needed to re-start userspace initiated transfers.
    
    [1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid
    
    Signed-off-by: default avatarDouglas Anderson <dianders@chromium.org>
    Link: https://lore.kernel.org/r/20240112150307.2.Idb1553d1d22123c377f31eacb4486432f6c9ac8d@changeid
    
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    9e957a15
qcom_geni_serial.c 49 KB