Commit 246d8b6c authored by Tomer Tayar, committed by Oded Gabbay

accel/habanalabs: abort device reset for consecutive heartbeat failures

The mechanism for aborting device reset on consecutive fatal errors
currently covers only fatal errors that are reported by FW.
A non-responsive FW and consecutive heartbeat failures are also
considered fatal, so add them to this mechanism as well, to avoid
recurring device resets in such cases.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
parent d0df8a35
@@ -1769,14 +1769,16 @@ int hl_device_reset(struct hl_device *hdev, u32 flags)
 	hdev->device_cpu_disabled = false;
 	hdev->reset_info.hard_reset_pending = false;
 
+	/*
+	 * Put the device in an unusable state if there are 2 back to back resets due to
+	 * fatal errors.
+	 */
 	if (hdev->reset_info.reset_trigger_repeated &&
-			(hdev->reset_info.prev_reset_trigger ==
-					HL_DRV_RESET_FW_FATAL_ERR)) {
-		/* if there 2 back to back resets from FW,
-		 * ensure driver puts the driver in a unusable state
-		 */
+			(hdev->reset_info.prev_reset_trigger == HL_DRV_RESET_FW_FATAL_ERR ||
+			hdev->reset_info.prev_reset_trigger ==
+					HL_DRV_RESET_HEARTBEAT)) {
 		dev_crit(hdev->dev,
-			"%s Consecutive FW fatal errors received, stopping hard reset\n",
+			"%s Consecutive fatal errors, stopping hard reset\n",
 			dev_name(&(hdev)->pdev->dev));
 		rc = -EIO;
 		goto out_err;
......
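
The check in the hunk above can be illustrated outside the kernel with a minimal user-space sketch. Everything prefixed `sketch_` or `SKETCH_` is hypothetical: the flag values are illustrative stand-ins rather than the driver's real definitions, and the way `reset_trigger_repeated` is derived here (the current trigger equals the previous one) is an assumption, since that bookkeeping is not part of this diff. Only the abort condition itself mirrors the patched code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's reset-trigger flags (values are illustrative). */
enum sketch_trigger {
	SKETCH_RESET_OTHER        = 1u << 0,
	SKETCH_RESET_HEARTBEAT    = 1u << 1,
	SKETCH_RESET_FW_FATAL_ERR = 1u << 2,
};

/* Minimal model of the two reset_info fields that the diff reads. */
struct sketch_reset_info {
	unsigned int prev_reset_trigger;
	bool reset_trigger_repeated;
};

/*
 * Assumed bookkeeping (not shown in the diff): a trigger counts as
 * repeated when it matches the trigger of the previous reset.
 */
static void sketch_record_trigger(struct sketch_reset_info *info,
				  unsigned int trigger)
{
	assert(info != NULL);
	info->reset_trigger_repeated = (info->prev_reset_trigger == trigger);
	info->prev_reset_trigger = trigger;
}

/*
 * Mirrors the patched condition: abort with -EIO when the same fatal
 * trigger (FW fatal error or heartbeat failure) occurs back to back.
 */
static int sketch_check_abort(const struct sketch_reset_info *info)
{
	assert(info != NULL);
	if (info->reset_trigger_repeated &&
	    (info->prev_reset_trigger == SKETCH_RESET_FW_FATAL_ERR ||
	     info->prev_reset_trigger == SKETCH_RESET_HEARTBEAT))
		return -EIO;

	return 0;
}
```

Under these assumptions, recording `SKETCH_RESET_HEARTBEAT` twice in a row makes `sketch_check_abort()` return `-EIO`, while a repeated non-fatal trigger such as `SKETCH_RESET_OTHER` does not abort the reset.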