    habanalabs: block soft-reset on an unusable device · a6685b57
    Koby Elbaz authored
    A device whose status is "malfunction" can't be used.
    In that state we do not support certain reset types, namely
    all kinds of soft-reset (compute reset, inference soft-reset)
    and reset upon device release.
    
    A hard-reset is the only way that an unusable device can change its
    status. No other reset procedure may put the device into a reset
    flow, because doing so might ultimately cause the device to change
    its status, unintentionally, back to operational.
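
The rule described above can be sketched as a small status gate. This is
only an illustration under stated assumptions: the enum values, the
struct, and the functions reset_allowed() and request_reset() are
hypothetical names invented for this sketch, not the driver's actual
types, flags, or code in device.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the device status and the reset types. */
enum dev_status { DEV_OPERATIONAL, DEV_IN_RESET, DEV_MALFUNCTION };
enum reset_type {
	RESET_HARD,
	RESET_COMPUTE,        /* soft-reset */
	RESET_INFERENCE_SOFT, /* soft-reset */
	RESET_ON_RELEASE      /* reset upon device release */
};

struct device {
	enum dev_status status;
};

/* On a malfunctioning device only a hard-reset may proceed. */
static bool reset_allowed(const struct device *dev, enum reset_type type)
{
	if (dev->status == DEV_MALFUNCTION && type != RESET_HARD)
		return false;
	return true;
}

/*
 * Returns 0 if the reset ran, -1 if it was blocked.
 * A blocked request leaves the device status untouched, so a
 * malfunctioning device can't drift back to operational by accident.
 */
static int request_reset(struct device *dev, enum reset_type type)
{
	assert(dev != NULL);

	if (!reset_allowed(dev, type))
		return -1;

	dev->status = DEV_IN_RESET;
	/* ... the actual reset work would happen here ... */
	dev->status = DEV_OPERATIONAL;
	return 0;
}
```

In this sketch, a compute reset or a release-time reset requested while
the status is DEV_MALFUNCTION returns -1 and changes nothing, whereas a
hard-reset is still allowed to run and restore the status.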
    
    Such a scenario occurred recently, when a user requested a
    hard-reset while another heavy user workload was still running
    (so the reset request was queued).
    Since the workload couldn't finish within the reset's timeout, the
    reset failed and set the device status to malfunction.
    Eventually, when the user released the FD, an unsuccessful soft-reset
    occurred, followed by an additional hard-reset that changed the
    ASIC's status back to operational.
    Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
    Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
    Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
device.c 68.7 KB