  1. 19 Apr, 2022 1 commit
  2. 30 Mar, 2022 3 commits
  3. 15 Mar, 2022 9 commits
  4. 08 Feb, 2022 1 commit
  5. 15 Jan, 2022 1 commit
  6. 14 Dec, 2021 1 commit
  7. 07 Dec, 2021 1 commit
  8. 21 Oct, 2021 3 commits
  9. 05 Oct, 2021 1 commit
  10. 22 Sep, 2021 1 commit
  11. 15 Sep, 2021 4 commits
    • scsi: lpfc: Zero CGN stats only during initial driver load and stat reset · afd63fa5
      James Smart authored
      Currently, congestion management framework results are cleared whenever the
      framework settings change (such as it being turned off then back on). This
      unfortunately means prior stats, rolled up to higher time windows, lose
      meaning.
      
      Change such that stats are not cleared. Thus they pause and resume with
      prior values still being considered.
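
The pause-and-resume behaviour described above can be illustrated with a minimal userspace sketch. All names here (`cgn_stats_sketch`, `cgn_set_enabled`, and so on) are hypothetical stand-ins, not the lpfc structures or routines; the only point shown is that toggling the framework leaves the counters alone, while load and an explicit reset zero them.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the driver's congestion stats; not lpfc code. */
struct cgn_stats_sketch {
    bool enabled;           /* framework currently turned on? */
    unsigned long alarms;   /* accumulated events */
};

/* Initial driver load: the one place stats start from zero. */
static void cgn_init(struct cgn_stats_sketch *s)
{
    memset(s, 0, sizeof(*s));
}

/* Settings change: counters are deliberately left untouched. */
static void cgn_set_enabled(struct cgn_stats_sketch *s, bool on)
{
    s->enabled = on;
}

/* Events are only accumulated while the framework is enabled. */
static void cgn_record_alarm(struct cgn_stats_sketch *s)
{
    if (s->enabled)
        s->alarms++;
}

/* Explicit stat reset: the other place stats are zeroed. */
static void cgn_reset(struct cgn_stats_sketch *s)
{
    s->alarms = 0;
}
```

With this shape, a sequence of enable, two events, disable, enable, one event leaves three events recorded rather than one.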
      
      Link: https://lore.kernel.org/r/20210910233159.115896-13-jsmart2021@gmail.com
      Co-developed-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: lpfc: Fix EEH support for NVMe I/O · 25ac2c97
      James Smart authored
      Injecting errors on the PCI slot while the driver is handling NVMe I/O will
      cause crashes and hangs.
      
      There are several rather difficult scenarios occurring. The main issue is
      that the adapter can report a PCI error before or simultaneously with the PCI
      subsystem reporting the error. Both paths have different entry points and
      currently there is no interlock between them. Thus multiple teardown paths
      are competing and all heck breaks loose.
      
      Complicating things is the NVMe path. To a large degree, I/O was able to be
      shut down for a full FC port on the SCSI stack. But on NVMe, there isn't a
      similar call. At best, it works on a per-controller basis, but even at the
      controller level, it's a controller "reset" call. All of which means I/O is
      still flowing on different CPUs with reset paths expecting hw access
      (mailbox commands) to execute properly.
      
      The following modifications are made:
      
       - A new flag is set in PCI error entrypoints so the driver can track being
         called by that path.
      
       - An interlock is added in the SLI hw error path and the PCI error path
         such that only one of the paths proceeds with the teardown logic.
      
       - RPI cleanup is patched such that RPIs are marked unregistered w/o mbx
         cmds in cases of hw error.
      
       - When entering the SLI port re-init calls (the case where SLI error
         teardown was quick and beat the PCI calls that are now reporting the
         error), check whether the SLI port is still live on the PCI bus.
      
       - In the PCI reset code to bring the adapter back, recheck the IRQ
         settings. Different checks for SLI3 vs SLI4.
      
       - In I/O completions, which may be called as part of the cleanup or may
         have been underway just before the hw error, check the state of the
         adapter. If in error, shortcut handling that would expect further
         adapter completions, as the hw error means they won't be sent.
      
       - In routines waiting on I/O completions, which may have been in progress
         prior to the hw error, detect the device is being torn down and abort
         from their waits and just give up. This points to a larger issue in the
         driver on ref-counting for data structures, as it doesn't have
         ref-counting on q and port structures. We'll do this fix for now as it
         would be a major rework to be done differently.
      
       - Fix the NVMe cleanup to simulate NVMe I/O completions if I/O is being
         failed back due to hw error.
      
       - In I/O buf allocation, done at the start of new I/Os, check hw state and
         fail if hw error.
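
The interlock between the SLI hw error path and the PCI error path can be sketched in userspace C11 as a single test-and-set claim. This is an illustration under assumptions, not the patch itself: the struct, field, and function names are hypothetical, and the kernel driver would use its own flag and locking primitives rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical adapter state; not the lpfc_hba structure. */
struct adapter_sketch {
    atomic_flag teardown_claimed;   /* set by whichever path gets there first */
    bool in_pci_error_path;         /* tracks entry via the PCI error callbacks */
    int teardowns_run;              /* how many times teardown logic executed */
};

static void adapter_init(struct adapter_sketch *a)
{
    atomic_flag_clear(&a->teardown_claimed);
    a->in_pci_error_path = false;
    a->teardowns_run = 0;
}

/* Returns true only for the first caller; later callers see the flag set. */
static bool claim_teardown(struct adapter_sketch *a)
{
    return !atomic_flag_test_and_set(&a->teardown_claimed);
}

/* Entry point 1: the adapter itself reports a hw error. */
static bool sli_hw_error(struct adapter_sketch *a)
{
    if (!claim_teardown(a))
        return false;               /* the other path owns the teardown */
    a->teardowns_run++;
    return true;
}

/* Entry point 2: the PCI subsystem reports the error. */
static bool pci_error_detected(struct adapter_sketch *a)
{
    a->in_pci_error_path = true;    /* the "called by that path" flag */
    if (!claim_teardown(a))
        return false;
    a->teardowns_run++;
    return true;
}
```

Whichever entry point runs first performs the teardown; the second returns without touching it, so the two paths no longer compete.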
      
      Link: https://lore.kernel.org/r/20210910233159.115896-10-jsmart2021@gmail.com
      Co-developed-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: lpfc: Fix FCP I/O flush functionality for TMF routines · cd8a36a9
      James Smart authored
      A prior patch inadvertently caused lpfc_sli_sum_iocb() to exclude counting
      of outstanding aborted I/Os and ABORT IOCBs. Thus,
      lpfc_reset_flush_io_context() called from any TMF routine does not properly
      wait to flush all outstanding FCP IOCBs, leading to a block layer crash on
      an invalid scsi_cmnd->request pointer.
      
        kernel BUG at ../block/blk-core.c:1489!
        RIP: 0010:blk_requeue_request+0xaf/0xc0
        ...
        Call Trace:
        <IRQ>
        __scsi_queue_insert+0x90/0xe0 [scsi_mod]
        blk_done_softirq+0x7e/0x90
        __do_softirq+0xd2/0x280
        irq_exit+0xd5/0xe0
        do_IRQ+0x4c/0xd0
        common_interrupt+0x87/0x87
        </IRQ>
      
      Fix by separating out the LPFC_IO_FCP, LPFC_IO_ON_TXCMPLQ,
      LPFC_DRIVER_ABORTED, and CMD_ABORT_XRI_CN || CMD_CLOSE_XRI_CN checks into a
      new lpfc_sli_validate_fcp_iocb_for_abort() routine when determining to
      build an ABORT iocb.
      
      Restore lpfc_reset_flush_io_context() functionality by including counting
      of outstanding aborted IOCBs and ABORT IOCBs in lpfc_sli_sum_iocb().
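
The split between "should an ABORT be built for this IOCB?" and "is this IOCB still outstanding?" can be sketched as two separate predicates. The flag values, struct, and function names below are hypothetical simplifications, not the driver's definitions, and the exact conditions in lpfc may differ; the sketch only shows why the counting routine must not reuse the stricter abort-eligibility test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags loosely modelled on the ones named in the commit. */
#define SK_IO_FCP          0x1u
#define SK_ON_TXCMPLQ      0x2u
#define SK_DRIVER_ABORTED  0x4u

struct iocb_sketch {
    unsigned int flags;
    bool is_abort_cmd;  /* stands in for CMD_ABORT_XRI_CN / CMD_CLOSE_XRI_CN */
};

/* Strict test: only used when deciding whether to build a new ABORT. */
static bool eligible_for_abort(const struct iocb_sketch *io)
{
    return (io->flags & SK_IO_FCP) &&
           (io->flags & SK_ON_TXCMPLQ) &&
           !(io->flags & SK_DRIVER_ABORTED) &&
           !io->is_abort_cmd;
}

/*
 * Looser test: anything still on the completion queue counts as
 * outstanding, including already-aborted I/Os and ABORT requests.
 */
static size_t sum_outstanding(const struct iocb_sketch *ios, size_t n)
{
    size_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        if ((ios[i].flags & SK_IO_FCP) && (ios[i].flags & SK_ON_TXCMPLQ))
            sum++;
    }
    return sum;
}
```

A flush routine that waits for `sum_outstanding()` to reach zero therefore also waits for aborted I/Os and in-flight ABORTs, which is the behaviour the commit restores.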
      
      Link: https://lore.kernel.org/r/20210910233159.115896-9-jsmart2021@gmail.com
      Fixes: e1364711 ("scsi: lpfc: Fix illegal memory access on Abort IOCBs")
      Cc: <stable@vger.kernel.org> # v5.12+
      Co-developed-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    • scsi: lpfc: Fix list_add() corruption in lpfc_drain_txq() · 99154581
      James Smart authored
      When parsing the txq list in lpfc_drain_txq(), the driver attempts to pass
      the requests to the adapter. If such an attempt fails, a local "fail_msg"
      string is set and a log message is output. The job is then added to a
      completions list for cancellation.
      
      Processing of any further jobs from the txq list continues, but since
      "fail_msg" remains set, jobs are added to the completions list regardless
      of whether a wqe was passed to the adapter.  If successfully added to
      txcmplq, jobs are added to both lists resulting in list corruption.
      
      Fix by clearing the fail_msg string after adding a job to the completions
      list. This stops the subsequent jobs from being added to the completions
      list unless they had an appropriate failure.
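
The stale-flag problem described above can be reproduced in a small userspace sketch. The `job` struct and `drain_txq_sketch()` are hypothetical simplifications, not the lpfc code: list membership is modelled with two booleans, and a `clear_fail_msg` parameter switches between the buggy and fixed behaviour so both can be compared.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job; list membership is modelled with two booleans. */
struct job {
    bool submit_ok;         /* simulated result of passing the job to the adapter */
    bool on_txcmplq;        /* job was accepted and awaits completion */
    bool on_completions;    /* job was queued for cancellation */
};

/*
 * Walks the jobs in order and returns how many ended up on both lists,
 * which is the condition that corrupts the real linked lists.
 */
static size_t drain_txq_sketch(struct job *jobs, size_t n, bool clear_fail_msg)
{
    const char *fail_msg = NULL;
    size_t on_both = 0;

    for (size_t i = 0; i < n; i++) {
        if (jobs[i].submit_ok)
            jobs[i].on_txcmplq = true;
        else
            fail_msg = "submit to adapter failed";

        if (fail_msg) {
            jobs[i].on_completions = true;
            if (clear_fail_msg)
                fail_msg = NULL;    /* the fix: do not leak into later jobs */
        }

        if (jobs[i].on_txcmplq && jobs[i].on_completions)
            on_both++;
    }
    return on_both;
}
```

With one failing job followed by two successful ones, the buggy variant puts both later jobs on two lists, while the fixed variant cancels only the job that actually failed.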
      
      Link: https://lore.kernel.org/r/20210910233159.115896-2-jsmart2021@gmail.com
      Co-developed-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: Justin Tee <justin.tee@broadcom.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
  12. 14 Sep, 2021 1 commit
  13. 25 Aug, 2021 8 commits
  14. 27 Jul, 2021 1 commit
  15. 19 Jul, 2021 3 commits
  16. 16 Jun, 2021 1 commit