    vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT · cc2742fe
    Abhishek Sahu authored
    Currently, if runtime power management is enabled for a vfio-pci
    based device in the guest OS, the guest OS writes to the PCI_PM_CTRL
    register. This write request is handled in vfio_pm_config_write(),
    which performs the actual write to the PCI_PM_CTRL register. With
    this path, D3hot is the deepest low power state that can be reached.
    If we use the runtime PM framework instead, we can reach the D3cold
    state (on supported systems), which saves the most power.
    
    1. The D3cold state can't be reached by writing the standard PCI
       PM config registers. This patch implements the following
       newly added low power related device features:
        - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
        - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
    
       The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY feature will allow the
       device to make use of low power platform states on the host
       while the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT will prevent
       further use of those power states.
    
    2. The vfio-pci driver uses the runtime PM framework for low power
       entry and exit. On platforms where the D3cold state is supported,
       the runtime PM framework will put the device into D3cold;
       otherwise, D3hot or some other power state will be used.
    
       There are various cases where the device will not go into the runtime
       suspended state, for example:
    
       - Runtime power management is disabled on the host side for
         the device.
       - The user keeps the device busy after calling LOW_POWER_ENTRY.
       - There are dependent devices that are still in runtime active state.
    
       In these cases, the device stays in the power state that the user
       has configured through the PCI_PM_CTRL register.
    
    3. Hypervisors can implement virtual ACPI methods. For example, if
       the ACPI node of a PCI device in a Linux guest has _PR3 and _PR0
       power resources with _ON/_OFF methods, the Linux guest invokes
       the _OFF method during the D3cold transition and the _ON method
       during the D0 transition. The hypervisor can trap these virtual
       ACPI calls and then call the low power device feature ioctl.
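
The hypervisor side of point 3 might look roughly like the user-space sketch
below. It assumes a VFIO device file descriptor that is already open and a
<linux/vfio.h> recent enough to provide the VFIO_DEVICE_FEATURE ioctl. The
helper vfio_set_feature() and the two vmm_on_acpi_*() hooks are hypothetical
names, and the fallback feature numbers are an assumption that should be
checked against the uapi header of the target kernel.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Fallbacks for uapi headers that predate this patch series. The values
 * are an assumption; verify them against the kernel you build for.
 */
#ifndef VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
#endif
#ifndef VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
#endif

/* Hypothetical helper: issue a payload-free SET for one device feature. */
static int vfio_set_feature(int device_fd, unsigned int feature)
{
	struct vfio_device_feature req = {
		.argsz = sizeof(req),
		.flags = VFIO_DEVICE_FEATURE_SET | feature,
	};

	/* Return 0 on success, a negative errno value on failure. */
	return ioctl(device_fd, VFIO_DEVICE_FEATURE, &req) < 0 ? -errno : 0;
}

/* Hypothetical hook: the guest ran _OFF, so allow host low power states. */
int vmm_on_acpi_off(int device_fd)
{
	return vfio_set_feature(device_fd, VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY);
}

/* Hypothetical hook: the guest ran _ON, so stop using those states. */
int vmm_on_acpi_on(int device_fd)
{
	return vfio_set_feature(device_fd, VFIO_DEVICE_FEATURE_LOW_POWER_EXIT);
}
```

A VMM would call vmm_on_acpi_off() from its emulation of the _OFF method and
vmm_on_acpi_on() from its emulation of _ON. A negative return value means the
request was rejected.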
    
    4. The 'pm_runtime_engaged' flag tracks entry to and exit from
       runtime PM. This flag is protected by the 'memory_lock' semaphore.
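
The bookkeeping in point 4 can be pictured with a small user-space model. This
is not the driver code: struct model_dev and the model_*() functions are
hypothetical, a pthread mutex stands in for the 'memory_lock' semaphore, and a
plain integer stands in for the runtime PM usage counter. Rejecting a repeated
entry with -EINVAL is an assumption of this sketch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Simplified stand-in for the per-device state (hypothetical). */
struct model_dev {
	pthread_mutex_t memory_lock;	/* models the 'memory_lock' semaphore */
	bool pm_runtime_engaged;	/* true between ENTRY and EXIT */
	int usage_count;		/* models the runtime PM usage counter */
};

static void model_dev_init(struct model_dev *d)
{
	pthread_mutex_init(&d->memory_lock, NULL);
	d->pm_runtime_engaged = false;
	d->usage_count = 1;		/* device is held active while in use */
}

/* Models LOW_POWER_ENTRY: set the flag and drop the usage reference. */
static int model_low_power_entry(struct model_dev *d)
{
	int ret = 0;

	pthread_mutex_lock(&d->memory_lock);
	if (d->pm_runtime_engaged) {
		ret = -EINVAL;		/* assumption: repeated ENTRY is rejected */
	} else {
		d->pm_runtime_engaged = true;
		d->usage_count--;	/* runtime suspend becomes possible */
	}
	pthread_mutex_unlock(&d->memory_lock);
	return ret;
}

/* Models LOW_POWER_EXIT: clear the flag and take the reference back. */
static void model_low_power_exit(struct model_dev *d)
{
	pthread_mutex_lock(&d->memory_lock);
	if (d->pm_runtime_engaged) {
		d->pm_runtime_engaged = false;
		d->usage_count++;	/* device must stay active again */
	}
	pthread_mutex_unlock(&d->memory_lock);
}
```

Because the flag is only read and written under the lock, a second exit is a
no-op in this model and the usage counter stays balanced.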
    
    5. All config and other region accesses are wrapped in
       pm_runtime_resume_and_get() and pm_runtime_put(). So, if any
       device access happens while the device is in the runtime suspended
       state, the device is resumed before the access. Once the access
       has finished, the device goes back into the runtime suspended
       state.
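
The access pattern in point 5 can be illustrated with a user-space model. The
names struct model_pm, model_resume_and_get(), model_put() and
model_config_read() are hypothetical. The model suspends as soon as the usage
counter reaches zero, which is a simplification of how the real runtime PM
core schedules suspend.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the runtime PM state of one device (hypothetical). */
struct model_pm {
	int usage_count;	/* models the runtime PM usage counter */
	bool suspended;		/* true while runtime suspended */
	int resume_events;	/* how often the model had to resume */
};

/* Models pm_runtime_resume_and_get(): take a reference, resume if needed. */
static int model_resume_and_get(struct model_pm *pm)
{
	pm->usage_count++;
	if (pm->suspended) {
		pm->suspended = false;
		pm->resume_events++;
	}
	return 0;
}

/* Models pm_runtime_put(): drop the reference; at zero the model suspends. */
static void model_put(struct model_pm *pm)
{
	if (--pm->usage_count == 0)
		pm->suspended = true;
}

/* A region access bracketed by the two calls above. */
static int model_config_read(struct model_pm *pm, const uint32_t *regs,
			     unsigned int idx, uint32_t *val)
{
	int ret = model_resume_and_get(pm);

	if (ret)
		return ret;

	*val = regs[idx];	/* device is guaranteed active here */
	model_put(pm);
	return 0;
}
```

Starting from a suspended device with a usage count of zero, one read resumes
the device once, returns the register value, and leaves the device suspended
again.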
    
    6. Memory region access through mmap is not allowed in the low
       power state. Since __vfio_pci_memory_enabled() is a common function,
       a check for 'pm_runtime_engaged' has been added explicitly in
       vfio_pci_mmap_fault() to block only mmap'ed access.
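
The split described in point 6 can be sketched as a user-space model. All
model_*() names and struct model_vdev are hypothetical, and reporting the
blocked fault as a bus error is an assumption of this sketch. The point is
that the extra flag check lives only in the fault path, while the shared
memory-enabled check is left unchanged for other callers.

```c
#include <stdbool.h>

/* Simplified stand-in for the per-device state (hypothetical). */
struct model_vdev {
	bool memory_enabled;		/* result of the common memory check */
	bool pm_runtime_engaged;	/* true while low power is engaged */
};

enum model_fault { MODEL_FAULT_OK, MODEL_FAULT_SIGBUS };

/* Models the common check; it knows nothing about low power. */
static bool model_memory_enabled(const struct model_vdev *v)
{
	return v->memory_enabled;
}

/* Models the mmap fault path: the only place that also checks the flag. */
static enum model_fault model_mmap_fault(const struct model_vdev *v)
{
	if (v->pm_runtime_engaged || !model_memory_enabled(v))
		return MODEL_FAULT_SIGBUS;
	return MODEL_FAULT_OK;
}

/* Models a read/write region access: only the common check applies. */
static bool model_rw_allowed(const struct model_vdev *v)
{
	return model_memory_enabled(v);
}
```

With memory enabled and low power engaged, the model refuses the mmap fault
but still allows the read/write path, which can then resume the device as
described in point 5.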

    Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
    Link: https://lore.kernel.org/r/20220829114850.4341-5-abhsahu@nvidia.com
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_pci_core.c 65.6 KB