Commit e806e223 authored by Anthony DeRossi's avatar Anthony DeRossi Committed by Alex Williamson

vfio/pci: Check the device set open count on reset

vfio_pci_dev_set_needs_reset() inspects the open_count of every device
in the set to determine whether a reset is allowed. The current device
always has open_count == 1 within vfio_pci_core_disable(), effectively
disabling the reset logic. This field is also documented as private in
vfio_device, so it should not be used to determine whether other devices
in the set are open.

Checking for vfio_device_set_open_count() > 1 on the device set fixes
both issues.

After commit 2cd8b14a ("vfio/pci: Move to the device set
infrastructure"), failure to create a new file for a device would cause
the reset to be skipped due to open_count being decremented after
calling close_device() in the error path.

After commit eadd86f8 ("vfio: Remove calls to
vfio_group_add_container_user()"), releasing a device would always skip
the reset due to an ordering change in vfio_device_fops_release().

Failing to reset the device leaves it in an unknown state, potentially
causing errors when it is accessed later or bound to a different driver.

This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
assigned to a Windows guest. After shutting down the guest, unbinding
the device from vfio-pci, and binding the device to amdgpu:

[  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
[  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init

Fixes: 2cd8b14a ("vfio/pci: Move to the device set infrastructure")
Fixes: eadd86f8 ("vfio: Remove calls to vfio_group_add_container_user()")
Signed-off-by: default avatarAnthony DeRossi <ajderossi@gmail.com>
Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
Reviewed-by: default avatarKevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20221110014027.28780-4-ajderossi@gmail.comSigned-off-by: default avatarAlex Williamson <alex.williamson@redhat.com>
parent 5cd189e4
...@@ -2488,12 +2488,12 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set) ...@@ -2488,12 +2488,12 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
struct vfio_pci_core_device *cur; struct vfio_pci_core_device *cur;
bool needs_reset = false; bool needs_reset = false;
list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) { /* No other VFIO device in the set can be open. */
/* No VFIO device in the set can have an open device FD */ if (vfio_device_set_open_count(dev_set) > 1)
if (cur->vdev.open_count)
return false; return false;
list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
needs_reset |= cur->needs_reset; needs_reset |= cur->needs_reset;
}
return needs_reset; return needs_reset;
} }
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment