• Steffen Maier's avatar
    scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response · fc1809ea
    Steffen Maier authored
    BugLink: http://bugs.launchpad.net/bugs/1721477
    
    commit fdb7cee3 upstream.
    
    At the default trace level, we only trace unsuccessful events including
    FSF responses.
    
    zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
    decide on an unsuccessful response. However, this is only one of multiple
    possible sources determining a failed struct zfcp_fsf_req.
    
    An FSF request can also "fail" if its response runs into an ERP timeout
    or if it gets dismissed because a higher level recovery was triggered
    [trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
    FSF requests with ERP timeout are:
    FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
    FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
    FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
    FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
    One example is slow queue processing which can cause follow-on errors,
    e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
    In order to see the root cause, we need to see late responses even if the
    channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
    Example trace records formatted with zfcpdbf from the s390-tools package:
    
    Timestamp      : ...
    Area           : REC
    Subarea        : 00
    Level          : 1
    Exception      : -
    CPU ID         : ..
    Caller         : ...
    Record ID      : 1
    Tag            : fcegpf1
    LUN            : 0xffffffffffffffff
    WWPN           : 0x<WWPN>
    D_ID           : 0x00<D_ID>
    Adapter status : 0x5400050b
    Port status    : 0x41200000
    LUN status     : 0x00000000
    Ready count    : 0x00000001
    Running count  : 0x...
    ERP want       : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
    ERP need       : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
    |
    Timestamp      : ...				30 seconds later
    Area           : REC
    Subarea        : 00
    Level          : 1
    Exception      : -
    CPU ID         : ..
    Caller         : ...
    Record ID      : 2
    Tag            : erscf_2
    LUN            : 0xffffffffffffffff
    WWPN           : 0x<WWPN>
    D_ID           : 0x00<D_ID>
    Adapter status : 0x5400050b
    Port status    : 0x41200000
    LUN status     : 0x00000000
    Request ID     : 0x<request_ID>
    ERP status     : 0x10000000			ZFCP_STATUS_ERP_TIMEDOUT
    ERP step       : 0x0800				ZFCP_ERP_STEP_PORT_OPENING
    ERP action     : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
    ERP count      : 0x00
    |
    Timestamp      : ...				later than previous record
    Area           : HBA
    Subarea        : 00
    Level          : 5	> default level		=> 3	<= default level
    Exception      : -
    CPU ID         : 00
    Caller         : ...
    Record ID      : 1
    Tag            : fs_qtcb			=> fs_rerr
    Request ID     : 0x<request_ID>
    Request status : 0x00001010			ZFCP_STATUS_FSFREQ_DISMISSED
    						| ZFCP_STATUS_FSFREQ_CLEANUP
    FSF cmnd       : 0x00000005
    FSF sequence no: 0x...
    FSF issued     : ...				> 30 seconds ago
    FSF stat       : 0x00000000			FSF_GOOD
    FSF stat qual  : 00000000 00000000 00000000 00000000
    Prot stat      : 0x00000001			FSF_PROT_GOOD
    Prot stat qual : 00000000 00000000 00000000 00000000
    Port handle    : 0x...
    LUN handle     : 0x00000000
    QTCB log length: ...
    QTCB log info  : ...
    
    In case of problems detecting that new responses are waiting on the input
    queue, we sooner or later trigger adapter recovery due to an FSF request
    timeout (trace tag "fsrth_1").
    FSF requests with FSF request timeout are:
    typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
    FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
    FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
    FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
    One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
    because the channel filled in the response via DMA into the request's QTCB.
    
    In a theroretical case, inject code can create an erroneous FSF request
    on purpose. If data router is enabled, it uses deferred error reporting.
    A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
    SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
    it can still fail, e.g. if an intentionally wrong scatter list does not
    provide enough space. Rather than getting an unsuccessful response,
    we get a QDIO activate check which in turn triggers adapter recovery.
    One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
    because the channel filled in the response via DMA into the request's QTCB.
    Example trace records formatted with zfcpdbf from the s390-tools package:
    
    Timestamp      : ...
    Area           : HBA
    Subarea        : 00
    Level          : 6	> default level		=> 3	<= default level
    Exception      : -
    CPU ID         : ..
    Caller         : ...
    Record ID      : 1
    Tag            : fs_norm			=> fs_rerr
    Request ID     : 0x<request_ID2>
    Request status : 0x00001010			ZFCP_STATUS_FSFREQ_DISMISSED
    						| ZFCP_STATUS_FSFREQ_CLEANUP
    FSF cmnd       : 0x00000001
    FSF sequence no: 0x...
    FSF issued     : ...
    FSF stat       : 0x00000000			FSF_GOOD
    FSF stat qual  : 00000000 00000000 00000000 00000000
    Prot stat      : 0x00000001			FSF_PROT_GOOD
    Prot stat qual : ........ ........ 00000000 00000000
    Port handle    : 0x...
    LUN handle     : 0x...
    |
    Timestamp      : ...
    Area           : SCSI
    Subarea        : 00
    Level          : 3
    Exception      : -
    CPU ID         : ..
    Caller         : ...
    Record ID      : 1
    Tag            : rsl_err
    Request ID     : 0x<request_ID2>
    SCSI ID        : 0x...
    SCSI LUN       : 0x...
    SCSI result    : 0x000e0000			DID_TRANSPORT_DISRUPTED
    SCSI retries   : 0x00
    SCSI allowed   : 0x05
    SCSI scribble  : 0x<request_ID2>
    SCSI opcode    : 28...				Read(10)
    FCP rsp inf cod: 0x00
    FCP rsp IU     : 00000000 00000000 00000000 00000000
                                             ^^	SAM_STAT_GOOD
                     00000000 00000000
    
    Only with luck in both above cases, we could see a follow-on trace record
    of an unsuccesful event following a successful but late FSF response with
    FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
    resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
    [On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
    ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
    However, the reason for this follow-on trace was invisible because the
    corresponding HBA trace record was missing at the default trace level
    (by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").
    
    On adapter recovery, after we had shut down the QDIO queues, we perform
    unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
    for each pending FSF request in zfcp_fsf_req_dismiss_all().
    In order to find the root cause, we need to see all pseudo responses even
    if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
    
    Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
    or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".
    
    It does not matter that there are numerous places which set
    ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
    early. These cases are based on protocol status != FSF_PROT_GOOD or
    == FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
    as trace tag "fs_perr" or "fs_ferr" respectively.
    
    NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
    all remains. zfcp_fsf_req_complete() handles this and returns early.
    All other FSF request types are handled separately and as described above.
    Signed-off-by: default avatarSteffen Maier <maier@linux.vnet.ibm.com>
    Fixes: 8a36e453 ("[SCSI] zfcp: enhancement of zfcp debug features")
    Fixes: 2e261af8 ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
    Reviewed-by: default avatarBenjamin Block <bblock@linux.vnet.ibm.com>
    Signed-off-by: default avatarBenjamin Block <bblock@linux.vnet.ibm.com>
    Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: default avatarStefan Bader <stefan.bader@canonical.com>
    Signed-off-by: default avatarThadeu Lima de Souza Cascardo <cascardo@canonical.com>
    fc1809ea
zfcp_dbf.h 11.4 KB