• Niklas Cassel's avatar
    ata: libata: fix commands incorrectly not getting retried during NCQ error · 3d8a3ae3
    Niklas Cassel authored
    A NCQ error means that the device has aborted processing of all active
    commands.
    To get the single NCQ command that caused the NCQ error, host software has
    to read the NCQ error log, which also takes the device out of error state.
    
    When the device encounters a NCQ error, we receive an error interrupt from
    the HBA, and call ata_do_link_abort() to mark all outstanding commands on
    the link as ATA_QCFLAG_FAILED (which means that these commands are owned
    by libata EH), and then call ata_qc_complete() on them.
    
    ata_qc_complete() will call fill_result_tf() for all commands marked as
    ATA_QCFLAG_FAILED.
    
    The taskfile is simply the latest status/error as seen from the device's
    perspective. The taskfile will have ATA_ERR set in the status field and
    ATA_ABORTED set in the error field.
    
    When we fill the current taskfile values for all outstanding commands,
    that means that qc->result_tf will have ATA_ERR set for all commands
    owned by libata EH.
    
    When ata_eh_link_autopsy() later analyzes all commands owned by libata EH,
    it will call ata_eh_analyze_tf(), which will check if qc->result_tf has
    ATA_ERR set, if it does, it will set qc->err_mask (which marks the command
    as an error).
    
    When ata_eh_finish() later calls __ata_qc_complete() on all commands owned
    by libata EH, it will call qc->complete_fn() (ata_scsi_qc_complete()),
    ata_scsi_qc_complete() will call ata_gen_ata_sense() to generate sense
    data if qc->err_mask is set.
    
    This means that we will generate sense data for commands that should not
    have any sense data set. Having sense data set for the non-failed commands
    will cause SCSI to finish these commands instead of retrying them.
    
    While this incorrect behavior has existed for a long time, this first
    became a problem once we started reading the correct taskfile register in
    commit 4ba09d20 ("ata: libahci: read correct status and error field
    for NCQ commands").
    
    Before this commit, NCQ commands would read the taskfile values received
    from the last non-NCQ command completion, which most likely did not have
    ATA_ERR set, since the last non-NCQ command was most likely not an error.
    
    Fix this by changing ata_eh_analyze_ncq_error() to mark all non-failed
    commands as ATA_QCFLAG_RETRY, and change the loop in ata_eh_link_autopsy()
    to skip commands marked as ATA_QCFLAG_RETRY.
    
    While at it, make sure that we clear ATA_ERR and any error bits for all
    commands except the actual command that caused the NCQ error, so that no
    other libata code will be able to misinterpret these commands as errors.
    
    Fixes: 4ba09d20 ("ata: libahci: read correct status and error field for NCQ commands")
    Signed-off-by: default avatarNiklas Cassel <niklas.cassel@wdc.com>
    Signed-off-by: default avatarDamien Le Moal <damien.lemoal@opensource.wdc.com>
    3d8a3ae3
libata-sata.c 38.2 KB