    scsi: qla2xxx: Fix scsi scan hang triggered if adapter fails during init · 501c91ff
    Bill Kuzeja authored
    BugLink: http://bugs.launchpad.net/bugs/1642968
    
    commit a5dd506e upstream.
    
    A system can get hung task timeouts if a qlogic board fails during
    initialization (if the board breaks again or fails the init). The hang
    involves the scsi scan.
    
    In a nutshell, since commit beb9e315 ("qla2xxx: Prevent removal and
    board_disable race"):
    
    ...it is possible for ha (base_vha->hw) to be freed early by a call to
    qla2x00_remove_one when pdev->enable_cnt equals zero:
    
           if (!atomic_read(&pdev->enable_cnt)) {
                   scsi_host_put(base_vha->host);
                   kfree(ha);
                   pci_set_drvdata(pdev, NULL);
                   return;
           }
    
    Almost always, the scsi_host_put above frees the vha structure
    (attached to the end of the Scsi_Host we're putting) since it's the last
    put, and life is good.  However, if we are entering this routine because
    the adapter has broken sometime during initialization AND a scsi scan is
    already in progress (and has done its own scsi_host_get), vha will not
    be freed. What's worse, the scsi scan will access the freed ha structure
    through qla2xxx_scan_finished:
    
            if (time > vha->hw->loop_reset_delay * HZ)
                    return 1;
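
The lifetime mismatch described above can be modelled in user space. This is only a sketch: `model_host` and every function below are hypothetical stand-ins, not the driver's real structures or the SCSI midlayer API. It shows that vha dies with the last host reference, while ha is freed unconditionally.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one reference-counted host whose allocation also
 * holds vha, plus a separately allocated ha. Flags track liveness so no
 * memory is actually freed or misused here. */
struct model_host {
	int refcount;   /* stands in for the Scsi_Host reference count */
	bool vha_alive; /* vha lives at the end of the host allocation */
	bool ha_alive;  /* ha is a separate allocation */
};

/* Models a scan taking its own reference (scsi_host_get in the text). */
static void model_host_get(struct model_host *h)
{
	h->refcount++;
}

/* Models scsi_host_put: vha goes away only on the last put. */
static void model_host_put(struct model_host *h)
{
	if (--h->refcount == 0)
		h->vha_alive = false;
}

/* Models the early-exit path: put the host, then free ha regardless. */
static void model_remove_one_early(struct model_host *h)
{
	model_host_put(h);
	h->ha_alive = false; /* kfree(ha) */
}

/* True when a still-live vha points at an already-freed ha. */
static bool model_dangling_ha(const struct model_host *h)
{
	return h->vha_alive && !h->ha_alive;
}
```

With a single reference the early path leaves nothing dangling; once a scan holds a second reference, `model_dangling_ha` reports the hazard.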
    
    The scsi scan keeps checking to see if a scan is complete by calling
    qla2xxx_scan_finished. There is a timeout value that limits the length
    of time a scan can take (hw->loop_reset_delay, usually set to 5
    seconds), but this definition is in the data structure (hw) that can get
    freed early.
    
    This can yield unpredictable results, the worst of which is that the
    scsi scan can hang indefinitely. This happens when the freed structure
    gets reused and loop_reset_delay gets overwritten with garbage, which
    the scan obliviously uses as its timeout value.
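
To illustrate how a garbage delay stretches the timeout, here is a small sketch of the comparison. `MODEL_HZ` and `model_timed_out` are assumptions made for illustration; the real HZ is a kernel configuration value and the width of the real field is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 1000 ticks per second. The real HZ depends on the kernel
 * configuration. */
#define MODEL_HZ 1000UL

/* Same shape as the check quoted above: the scan gives up once the
 * elapsed ticks exceed loop_reset_delay seconds. */
static bool model_timed_out(unsigned long elapsed_ticks,
			    unsigned long loop_reset_delay)
{
	return elapsed_ticks > loop_reset_delay * MODEL_HZ;
}
```

With a delay of 5 the check trips just after 5000 ticks. With a delay overwritten by, say, 0xdead, an hour of elapsed ticks is still not enough for the check to trip.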
    
    The fix for this is simple: at the top of qla2xxx_scan_finished, check
    for the UNLOADING bit in the vha structure (vha is not freed at this
    point).  If UNLOADING is set, we exit the scan for this adapter
    immediately, before the freed ha structure is ever dereferenced, and
    the scan continues on.
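
A minimal user-space sketch of the ordering this fix relies on follows. It is not the actual patch: `model_vha`, `model_hw`, the `flags` field, `MODEL_UNLOADING` and `MODEL_HZ` are all hypothetical names chosen for illustration. The point is only that the flag test comes before any dereference of the hw pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HZ 1000UL             /* assumption: ticks per second */
#define MODEL_UNLOADING (1UL << 0)  /* hypothetical bit position */

struct model_hw {
	unsigned long loop_reset_delay;
};

struct model_vha {
	unsigned long flags;  /* hypothetical flag word carrying UNLOADING */
	struct model_hw *hw;  /* may already be freed while unloading */
};

/* Returns 1 when the scan should stop, 0 when it should keep polling. */
static int model_scan_finished(const struct model_vha *vha,
			       unsigned long time)
{
	/* Check the flag first: vha is still valid, hw may not be. */
	if (vha->flags & MODEL_UNLOADING)
		return 1;

	/* Only reached when hw is known to be alive. */
	if (time > vha->hw->loop_reset_delay * MODEL_HZ)
		return 1;

	return 0;
}
```

In this model an adapter marked as unloading ends its scan even when `hw` is NULL, because the second check is never reached.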
    
    This problem is hard to hit, but I have run into it doing negative
    testing many times now (with a test specifically designed to bring it
    out), so I can verify that this fix works. My testing has been against a
    RHEL7 driver variant, but the bug and patch are equally relevant to
    the upstream driver.
    
    Fixes: beb9e315 ("qla2xxx: Prevent removal and board_disable race")
    Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
    Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Tim Gardner <tim.gardner@canonical.com>