    iommufd: Algorithms for PFN storage
    Jason Gunthorpe authored
    An iopt_pages represents a logical linear list of full PFNs held in
    different storage tiers. Each area points to a slice of exactly one
    iopt_pages, and each iopt_pages can have multiple areas and accesses.
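
    As a rough illustration, the ownership relationship can be pictured with
    the sketch below. The structures and field names are hypothetical and
    simplified, not the actual layout in pages.c:

      #include <linux/interval_tree.h>
      #include <linux/xarray.h>

      /* Hypothetical sketch only; not the real struct layout in pages.c */
      struct example_iopt_pages {
              struct xarray pinned_pfns;            /* xarray storage tier */
              struct rb_root_cached access_itree;   /* in-kernel accesses, by index range */
              struct rb_root_cached domains_itree;  /* areas/domains, by index range */
              unsigned long npages;                 /* length of the logical PFN list */
      };

      struct example_iopt_area {
              struct interval_tree_node node;       /* the IOVA slice in the IOAS */
              struct example_iopt_pages *pages;     /* exactly one iopt_pages */
              unsigned long start_index;            /* where the slice begins in pages */
      };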
    
    The three storage tiers are managed to meet these objectives:
    
     - If no iommu_domain or in-kernel access exists then minimal memory
       should be consumed by iommufd
     - If a page has been pinned then an iopt_pages will not pin it again
     - If an in-kernel access exists then the xarray must provide the backing
       storage to avoid allocations on domain removals
     - Otherwise any iommu_domain will be used for storage
    
    In a common configuration with only an iommu_domain, the iopt_pages does
    not allocate significant memory itself.
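
    Which tier ends up backing the PFNs follows from these objectives; a
    minimal decision sketch is below (the enum and helper are made up for
    illustration and do not exist in pages.c):

      /* Illustrative only: which tier is expected to back a pinned PFN */
      enum example_tier { TIER_NONE, TIER_XARRAY, TIER_DOMAIN };

      static enum example_tier example_pick_tier(bool have_access, bool have_domain)
      {
              if (have_access)
                      return TIER_XARRAY;  /* allocation-free backing on domain removal */
              if (have_domain)
                      return TIER_DOMAIN;  /* reuse the PFNs already mapped there */
              return TIER_NONE;            /* nothing stored, nothing pinned */
      }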
    
    The external interface for pages has several logical operations:
    
      iopt_area_fill_domain() will load the PFNs from storage into a single
      domain. This is used when attaching a new domain to an existing IOAS.
    
      iopt_area_fill_domains() will load the PFNs from storage into multiple
      domains. This is used when creating a new IOVA map in an existing IOAS.
    
      iopt_pages_add_access() creates an iopt_pages_access that tracks an
      in-kernel access of PFNs. This is some external driver that might be
      accessing the IOVA using the CPU, or programming PFNs with the DMA
      API, e.g. a VFIO mdev.
    
      iopt_pages_rw_access() directly performs a memcpy on the PFNs, without
      the overhead of iopt_pages_add_access().
    
      iopt_pages_fill_xarray() will load PFNs into the xarray and return a
      'struct page *' array. It is used by iopt_pages_accesses to extract PFNs
      for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
      is known the xarray is already filled.
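
    A rough usage sketch for an in-kernel access follows. The "example_"
    function is hypothetical and the iopt_* arguments are simplified; the
    real prototypes live in pages.c:

      /*
       * Illustrative flow for an mdev-style driver: register the access so
       * the xarray tier backs the range, then extract struct pages for CPU
       * or DMA API use. Arguments are simplified for the sketch.
       */
      static int example_use_pages(struct iopt_pages *pages,
                                   unsigned long start_index,
                                   unsigned long last_index,
                                   struct page **out_pages)
      {
              int rc;

              rc = iopt_pages_add_access(pages, start_index, last_index);
              if (rc)
                      return rc;

              /* The xarray now provides backing; copy out the page pointers */
              return iopt_pages_fill_xarray(pages, start_index, last_index,
                                            out_pages);
      }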
    
    As an iopt_pages can be referred to in slices by many areas and accesses,
    it uses interval trees to keep track of which storage tiers currently hold
    the PFNs. On a page-by-page basis, any request for a PFN will be satisfied
    from one of the storage tiers and the PFN copied to the target domain/array.
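
    The page-by-page lookup can be sketched with the kernel interval tree
    API (the helper below is illustrative; only interval_tree_iter_first()
    is the real kernel interface):

      #include <linux/interval_tree.h>

      /* Does any recorded range in this tree cover the given page index? */
      static bool example_index_covered(struct rb_root_cached *itree,
                                        unsigned long index)
      {
              return interval_tree_iter_first(itree, index, index) != NULL;
      }

      /* A request is satisfied from whichever tier covers the index: the
       * xarray of pinned pages, an iommu_domain holding it, or, failing
       * both, by pinning the user page on demand. */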
    
    Unfill actions are similar: on a page-by-page basis, domains are unmapped,
    xarray entries are freed, or struct pages are fully put back.
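
    For the xarray tier, the release step looks roughly like this
    (an illustrative helper; xa_erase() and unpin_user_page() are real
    kernel APIs, but the surrounding structure is simplified):

      #include <linux/xarray.h>
      #include <linux/mm.h>

      /* Drop one index from an xarray that stores struct page pointers and
       * return the pin taken at fill time, assuming no other tier still
       * needs the page. */
      static void example_unfill_one(struct xarray *xa, unsigned long index)
      {
              struct page *page = xa_erase(xa, index);

              if (page)
                      unpin_user_page(page);
      }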
    
    Significant complexity is required to fully optimize all of these data
    motions. The implementation calculates the largest consecutive range of
    same-storage indexes and operates in blocks. The accumulation of PFNs
    always generates the largest contiguous PFN range possible, and this
    gathering can cross storage tier boundaries. For cases like 'fill
    domains', care is taken to avoid duplicated work: PFNs are read once and
    pushed into all domains.
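
    The accumulation can be pictured with a small sketch (illustrative only,
    not the batch structure used by pages.c):

      /* Accumulate physically contiguous PFNs into one run; the caller
       * flushes a completed run to the domain/array with a single operation. */
      struct example_batch {
              unsigned long start_pfn;
              unsigned long npfns;
      };

      /* Returns true if the PFN extended the current run; on false the
       * caller must flush the run and start a new one. */
      static bool example_batch_add(struct example_batch *b, unsigned long pfn)
      {
              if (!b->npfns) {
                      b->start_pfn = pfn;
                      b->npfns = 1;
                      return true;
              }
              if (pfn == b->start_pfn + b->npfns) {
                      b->npfns++;
                      return true;
              }
              return false;
      }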
    
    The map/unmap interaction with the iommu_domain always works in contiguous
    PFN blocks. The implementation does not require or benefit from any
    split/merge optimization in the iommu_domain driver.
    
    This design suggests several possible improvements in the IOMMU API that
    would greatly help performance, particularly a way for the driver to map
    and read PFN lists instead of working with one driver call per page to
    read, and one driver call per contiguous range to store.
    
    Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
    Reviewed-by: Kevin Tian <kevin.tian@intel.com>
    Tested-by: Nicolin Chen <nicolinc@nvidia.com>
    Tested-by: Yi Liu <yi.l.liu@intel.com>
    Tested-by: Lixiao Yang <lixiao.yang@intel.com>
    Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>