    [SCSI] scsi: add transport host byte errors (v3) · a4dfaa6f
    Mike Christie authored
    Currently, if there is a transport problem the iscsi drivers will return
    outstanding commands (commands being executed by the driver/fw/hw) with
    DID_BUS_BUSY and block the session so no new commands can be queued.
    Commands that are caught between the failure handling and blocking are
    failed with DID_IMM_RETRY or one of the scsi-ml queuecommand return values.
    When the recovery_timeout fires, the iscsi drivers then fail IO with
    DID_NO_CONNECT.
    
    For fcp, some drivers will fail some outstanding IO (disk but possibly not
    tape) with DID_BUS_BUSY or DID_ERROR or some other value that causes a retry
    and hits the scsi_error.c failfast check, block the rport, and commands
    caught in the race are failed with DID_IMM_RETRY. Other drivers may
    hold onto all IO and wait for the terminate_rport_io or dev_loss_tmo_callbk
    to be called.
    
    The following patches attempt to unify what upper layers will see, so that
    drivers like multipath can make a good guess. This relies on drivers being
    hooked into their transport class.
    
    This first patch just defines two new host byte errors so drivers can
    return the same value for when a rport/session is blocked and for
    when the fast_io_fail_tmo fires.
    
    The idea is that if the LLD/class detects a problem and is going to block
    a rport/session, then if the LLD wants or must return the command to scsi-ml,
    then it can return it with DID_TRANSPORT_DISRUPTED. This will requeue
    the IO into the same scsi queue it came from, until the fast io fail timer
    fires and the class decides what to do.
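    
    As a rough illustration of the behaviour described above, the sketch below
    models how a mid-layer might map the host byte of a command result to a
    disposition. It is not the scsi_error.c code: the disposition enum and
    decide() are hypothetical names, and the numeric DID_* values are
    assumptions that should be checked against include/scsi/scsi.h.

```c
/* Assumed values for illustration; verify against include/scsi/scsi.h. */
#define DID_OK                  0x00
#define DID_TRANSPORT_DISRUPTED 0x0e
#define DID_TRANSPORT_FAILFAST  0x0f

/* The host byte occupies bits 16-23 of the 32-bit command result. */
#define host_byte(result) (((result) >> 16) & 0xff)

/* Hypothetical dispositions, not the kernel's SUCCESS/NEEDS_RETRY/etc. */
enum disposition { DISP_SUCCESS, DISP_REQUEUE, DISP_FAIL };

/* Decide what to do with a completed command based on its host byte. */
static enum disposition decide(int result)
{
    switch (host_byte(result)) {
    case DID_OK:
        return DISP_SUCCESS;
    case DID_TRANSPORT_DISRUPTED:
        /* Port/session is blocked: put the IO back on its queue. */
        return DISP_REQUEUE;
    case DID_TRANSPORT_FAILFAST:
        /* fast_io_fail_tmo fired: fail upward so multipath can react. */
        return DISP_FAIL;
    default:
        /* Other host bytes are out of scope for this sketch. */
        return DISP_FAIL;
    }
}
```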
    
    When using multipath and the fast_io_fail_tmo fires then the class
    can fail commands with DID_TRANSPORT_FAILFAST or drivers can use
    DID_TRANSPORT_FAILFAST in their terminate_rport_io callbacks or
    the equivalent in iscsi if we ever implement more advanced recovery methods.
    An LLD, like lpfc, could continue to return DID_ERROR and then it will hit
    the normal failfast path, so drivers do not have to be fully ported to
    work better. The point of the patches is that upper layers will
    not see a failure that could be recovered from while the rport/session is
    blocked until fast_io_fail_tmo/recovery_timeout fires.
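    
    From the LLD/class side, the choice of host byte could look roughly like
    the sketch below. The port_state enum and completion_result() are
    hypothetical and only model the two cases this patch introduces; the
    DID_* values are again assumptions to be checked against
    include/scsi/scsi.h.

```c
/* Assumed values for illustration; verify against include/scsi/scsi.h. */
#define DID_OK                  0x00
#define DID_TRANSPORT_DISRUPTED 0x0e
#define DID_TRANSPORT_FAILFAST  0x0f

/* Hypothetical view of a rport/session as seen by the LLD or class. */
enum port_state {
    PORT_ONLINE,       /* transport is healthy */
    PORT_BLOCKED,      /* problem detected, rport/session blocked */
    PORT_FAST_FAILED   /* fast_io_fail_tmo has fired */
};

/* Build the result value for a command being returned to scsi-ml. */
static int completion_result(enum port_state state)
{
    int host;

    switch (state) {
    case PORT_BLOCKED:
        host = DID_TRANSPORT_DISRUPTED;  /* requeue until the timer fires */
        break;
    case PORT_FAST_FAILED:
        host = DID_TRANSPORT_FAILFAST;   /* let upper layers fail over */
        break;
    default:
        host = DID_OK;
        break;
    }
    /* The host byte is stored in bits 16-23 of the result. */
    return host << 16;
}
```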
    
    V3
    Remove some comments.
    V2
    Fixed patch/diff errors and renamed DID_TRANSPORT_BLOCKED to
    DID_TRANSPORT_DISRUPTED.
    V1
    initial patch.
    Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
    Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>