  1. 21 Dec, 2018 2 commits
    • powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Alexey Kardashevskiy authored
      This new memory does not have page structs as it is not plugged into
      the host, so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of the API can still be used;
      - mm_iommu_is_devmem() to know if a physical address is one of these
      new regions, which we must avoid unpinning.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed page dirtying.
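A minimal user-space sketch of the region check described above. The type and function names model mm_iommu_newdev()/mm_iommu_is_devmem() but are illustrative, not the kernel's actual definitions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of a preregistered device-memory region. */
struct devmem_region {
    unsigned long hpa;   /* host physical base of the device memory */
    unsigned long size;  /* size in bytes */
};

static struct devmem_region regions[4];
static size_t nr_regions;

/* Model of mm_iommu_newdev(): preregister device memory so the rest
 * of the API can be used without pinning any pages. */
static bool iommu_newdev(unsigned long hpa, unsigned long size)
{
    if (nr_regions >= 4)
        return false;
    regions[nr_regions].hpa = hpa;
    regions[nr_regions].size = size;
    nr_regions++;
    return true;
}

/* Model of mm_iommu_is_devmem(): does this physical address fall into
 * a preregistered device-memory region (so unpinning is skipped)? */
static bool iommu_is_devmem(unsigned long hpa)
{
    for (size_t i = 0; i < nr_regions; i++)
        if (hpa >= regions[i].hpa &&
            hpa < regions[i].hpa + regions[i].size)
            return true;
    return false;
}
```

The point of the check is that TCE teardown can consult the region list instead of calling pfn_to_page() on memory that has no page structs.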
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region · e0bf78b0
      Alexey Kardashevskiy authored
      Normally mm_iommu_get() should add a reference and mm_iommu_put() should
      remove it. Historically, however, mm_iommu_find() did the referencing
      while mm_iommu_get() did both allocation and referencing.
      
      We are going to add another helper to preregister device memory so
      instead of having mm_iommu_new() (which pre-registers the normal memory
      and references the region), we need separate helpers for pre-registering
      and referencing.
      
      This renames:
      - mm_iommu_get to mm_iommu_new;
      - mm_iommu_find to mm_iommu_get.
      
      This changes mm_iommu_get() to reference the region so the name now
      reflects what it does.
      
      This removes the check for exact match from mm_iommu_new() as we want it
      to fail on existing regions; mm_iommu_get() should be used instead.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 20 Dec, 2018 1 commit
  3. 20 Oct, 2018 1 commit
    • KVM: PPC: Optimize clearing TCEs for sparse tables · 6e301a8e
      Alexey Kardashevskiy authored
      The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
      table and a table with userspace addresses. These tables are radix trees;
      we allocate indirect levels when they are written to. Since
      memory allocation is problematic in real mode, we have 2 accessors
      to the entries:
      - for virtual mode: it allocates the memory and is always expected
      to return non-NULL;
      - for real mode: it does not allocate and can return NULL.
      
      Also, DMA windows can span up to 55 bits of the address space and since
      we never have this much RAM, such windows are sparse. However, currently
      the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory.
      
      Since we maintain a userspace address table for VFIO which is a mirror
      of the hardware table, we can use it to know which parts of the DMA
      window have not been mapped and skip those; this is what this patch does.
      
      The bare metal systems do not have this problem as they use a bypass mode
      of a PHB which maps RAM directly.
      
      This helps a lot with sparse DMA windows, reducing the shutdown time from
      about 3 minutes per 1 billion TCEs to a few seconds for a 32GB sparse
      guest. Just skipping the last level seems to be good enough.
      
      As the non-allocating accessor is now used in virtual mode as well, rename
      it from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only).
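The skipping strategy above can be sketched as a two-level radix walk that never descends into levels that were never allocated. This is a simplified user-space model with illustrative names, not the kernel's iommu_table code:

```c
#include <stdlib.h>

#define LEVEL_SIZE 512

/* Two-level TCE table model: a second-level page is only allocated for
 * parts of the window that were actually written, so a clearing pass
 * can skip NULL levels entirely. */
static unsigned long *level1[LEVEL_SIZE];
static unsigned long cleared; /* entries actually visited on clear */

static void tce_set(size_t idx, unsigned long val)
{
    size_t l1 = idx / LEVEL_SIZE, l2 = idx % LEVEL_SIZE;

    if (!level1[l1])
        level1[l1] = calloc(LEVEL_SIZE, sizeof(unsigned long));
    level1[l1][l2] = val;
}

/* Clear the whole window, skipping never-mapped (unallocated) levels. */
static void tce_clear_all(void)
{
    for (size_t l1 = 0; l1 < LEVEL_SIZE; l1++) {
        if (!level1[l1])
            continue; /* sparse part of the window: nothing to unpin */
        for (size_t l2 = 0; l2 < LEVEL_SIZE; l2++) {
            level1[l1][l2] = 0;
            cleared++;
        }
    }
}
```

With only two populated levels, a full clear touches 2 × 512 entries instead of 512 × 512, which is why shutdown of a sparse guest drops from minutes to seconds.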
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  4. 18 Jul, 2018 2 commits
    • KVM: PPC: Check if IOMMU page is contained in the pinned physical page · 76fa4975
      Alexey Kardashevskiy authored
      A VM which has:
       - a DMA capable device passed through to it (e.g. a network card);
       - a malicious kernel running that ignores H_PUT_TCE failure;
       - the capability of using IOMMU pages bigger than the physical pages
      can create an IOMMU mapping that exposes (for example) 16MB of
      the host physical memory to the device when only 64K was allocated to the VM.
      
      The remaining 16MB - 64K will be some other content of host memory, possibly
      including pages of the VM, but also pages of host kernel memory, host
      programs or other VMs.
      
      The attacking VM does not control the location of the page it can map,
      and is only allowed to map as many pages as it has pages of RAM.
      
      We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
      an IOMMU page is contained in the physical page so the PCI hardware won't
      get access to unassigned host memory; however this check is missing in
      the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far
      and have not hit this yet: the very first time a mapping happens we do
      not have tbl::it_userspace allocated yet and fall back to userspace,
      which in turn calls the VFIO IOMMU driver; this fails and the guest does
      not retry.
      
      This stores the smallest preregistered page size in the preregistered
      region descriptor and changes the mm_iommu_xxx API to check this against
      the IOMMU page size.
      
      This calculates the maximum page size as the minimum of the natural
      region alignment and the compound page size. For the page shift this uses
      the shift returned by find_linux_pte(), which indicates how the page is
      mapped in the current userspace: if the page is huge and the shift is
      non-zero, then it is a leaf pte and the page is mapped within the range.
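The containment condition above reduces to two checks: the IOMMU page must not be larger than the pinned host page, and the mapped address must be aligned to the IOMMU page size. A user-space sketch (function name and exact form are ours, not the kernel's):

```c
#include <stdbool.h>

/* Illustrative model: an IOMMU page of 2^iommu_shift bytes at host
 * physical address hpa is contained in the pinned physical page of
 * 2^host_shift bytes only if the host page is at least as large and
 * hpa is aligned to the IOMMU page size. */
static bool iommu_page_contained(unsigned long hpa,
                                 unsigned int iommu_shift,
                                 unsigned int host_shift)
{
    if (iommu_shift > host_shift)
        return false; /* e.g. a 16MB IOMMU page over 64K host pages */
    return (hpa & ((1UL << iommu_shift) - 1)) == 0;
}
```

The attack in the commit message is exactly the first case: a 16MB (shift 24) IOMMU page backed by a 64K (shift 16) pinned page fails the check.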
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio/spapr: Use IOMMU pageshift rather than pagesize · 1463edca
      Alexey Kardashevskiy authored
      The size is always equal to 1 page so let's use this. Later on this will
      be used for other checks which use page shifts to check the granularity
      of access.
      
      This should cause no behavioral change.
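The shift/size equivalence the commit relies on is one line; since each IOMMU page is exactly one page, the size is recoverable from the shift whenever it is needed (illustrative helper, not a kernel function):

```c
/* size == 1UL << shift: storing the shift loses no information and
 * makes granularity comparisons a simple integer compare of shifts. */
static unsigned long page_size_from_shift(unsigned int shift)
{
    return 1UL << shift;
}
```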
      
      Cc: stable@vger.kernel.org # v4.12+
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 16 Jul, 2018 3 commits
    • powerpc/powernv/ioda: Allocate indirect TCE levels on demand · a68bd126
      Alexey Kardashevskiy authored
      At the moment we allocate the entire TCE table, twice (the hardware part
      and the userspace translation cache). This normally works as memory is
      usually contiguous and the guest maps its entire RAM for 64-bit DMA.
      
      However if we have sparse RAM (one example is a memory device), then
      we will allocate TCEs which will never be used as the guest only maps
      actual memory for DMA. If it is a single level TCE table, there is nothing
      we can really do, but if it is a multilevel table, we can skip allocating
      TCEs we know we won't need.
      
      This adds the ability to allocate only the first level, saving memory.
      
      This changes iommu_table::free() to avoid allocating an extra level;
      iommu_table::set() will do this when needed.
      
      This adds an @alloc parameter to iommu_table::exchange() to tell the
      callback whether it can allocate an extra level; the flag is set to
      "false" for the realmode KVM handlers of H_PUT_TCE hcalls, and in that
      case the callback returns H_TOO_HARD.
      
      This still requires the entire table to be counted in mm::locked_vm.
      
      To be conservative, this only does on-demand allocation when
      the userspace cache table is requested, which is the case for VFIO.
      
      The example math for a system replicating a powernv setup with NVLink2
      in a guest:
      16GB RAM mapped at 0x0
      128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
      
      the table to cover all that with 64K pages takes:
      (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
      
      If we allocate only the necessary TCE levels, we will only need:
      (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
      levels).
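The sizing arithmetic above can be re-derived with a small helper (a user-space check of the commit's formula; the function name is ours): the table needs 8 bytes per TCE, one TCE per 2^shift bytes of window span, and the result is expressed in MB.

```c
/* MB of table needed to cover a window of `span` bytes with pages of
 * 2^shift bytes, at 8 bytes per TCE entry. */
static unsigned long table_mb(unsigned long long span, unsigned int shift)
{
    return (unsigned long)(((span >> shift) * 8ULL) >> 20);
}
```

Evaluating it for the full window versus only the actually mapped 16GB + 16GB reproduces the several-thousand-fold saving the commit claims.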
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Add indirect levels to it_userspace · 090bad39
      Alexey Kardashevskiy authored
      We want to support sparse memory and therefore huge chunks of DMA windows
      do not need to be mapped. If a DMA window is big enough to require 2 or
      more indirect levels, and the window is used to map all RAM (which is
      the default case for a 64-bit window), we can actually save some memory
      by not allocating TCEs for regions which we are not going to map anyway.
      
      The hardware tables already support indirect levels but we also keep a
      host-physical-to-userspace translation array which is allocated by
      vmalloc() and is a flat array which might use quite some memory.
      
      This converts it_userspace from vmalloc'ed array to a multi level table.
      
      As the format becomes platform dependent, this replaces the direct access
      to it_userspace with an iommu_table_ops::useraddrptr hook which returns
      a pointer to the userspace copy of a TCE; a future extension will return
      NULL if the level was not allocated.
      
      This should not change non-KVM handling of TCE tables and it_userspace
      will not be allocated for non-KVM tables.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Make iommu_table::it_userspace big endian · 00a5c58d
      Alexey Kardashevskiy authored
      We are going to reuse multilevel TCE code for the userspace copy of
      the TCE table and since it is big endian, let's make the copy big endian
      too.
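The conversion is the usual cpu-to-big-endian pair on store and load; a user-space sketch using glibc's htobe64/be64toh as stand-ins for the kernel's cpu_to_be64/be64_to_cpu:

```c
#define _DEFAULT_SOURCE
#include <endian.h>
#include <stdint.h>

/* The userspace copy of the TCE table is stored big endian to match
 * the (big endian) hardware table format. Model accessors: */
typedef uint64_t be64_model; /* stand-in for the kernel's __be64 */

static be64_model tce_store(uint64_t ua) { return htobe64(ua); }
static uint64_t   tce_load(be64_model v) { return be64toh(v); }
```

A round trip through store/load is the identity on any host endianness, which is what lets the multilevel TCE code be reused unchanged.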
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  6. 02 Oct, 2017 1 commit
  7. 11 Apr, 2017 2 commits
  8. 30 Mar, 2017 2 commits
  9. 02 Mar, 2017 2 commits
  10. 07 Feb, 2017 1 commit
  11. 01 Feb, 2017 1 commit
  12. 24 Jan, 2017 1 commit
    • vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null · bd00fdf1
      Greg Kurz authored
      The recently added mediated VFIO driver doesn't know about powerpc iommu.
      It thus doesn't register a struct iommu_table_group in the iommu group
      upon device creation. The iommu_data pointer hence remains null.
      
      This causes a kernel oops when userspace tries to set the iommu type of a
      container associated with a mediated device to VFIO_SPAPR_TCE_v2_IOMMU.
      
      [   82.585440] mtty mtty: MDEV: Registered
      [   87.655522] iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 10
      [   87.655527] vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 10
      [  116.297184] Unable to handle kernel paging request for data at address 0x00000030
      [  116.297389] Faulting instruction address: 0xd000000007870524
      [  116.297465] Oops: Kernel access of bad area, sig: 11 [#1]
      [  116.297611] SMP NR_CPUS=2048
      [  116.297611] NUMA
      [  116.297627] PowerNV
      ...
      [  116.297954] CPU: 33 PID: 7067 Comm: qemu-system-ppc Not tainted 4.10.0-rc5-mdev-test #8
      [  116.297993] task: c000000e7718b680 task.stack: c000000e77214000
      [  116.298025] NIP: d000000007870524 LR: d000000007870518 CTR: 0000000000000000
      [  116.298064] REGS: c000000e77217990 TRAP: 0300   Not tainted  (4.10.0-rc5-mdev-test)
      [  116.298103] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
      [  116.298107]   CR: 84004444  XER: 00000000
      [  116.298154] CFAR: c00000000000888c DAR: 0000000000000030 DSISR: 40000000 SOFTE: 1
                     GPR00: d000000007870518 c000000e77217c10 d00000000787b0ed c000000eed2103c0
                     GPR04: 0000000000000000 0000000000000000 c000000eed2103e0 0000000f24320000
                     GPR08: 0000000000000104 0000000000000001 0000000000000000 d0000000078729b0
                     GPR12: c00000000025b7e0 c00000000fe08400 0000000000000001 000001002d31d100
                     GPR16: 000001002c22c850 00003ffff315c750 0000000043145680 0000000043141bc0
                     GPR20: ffffffffffffffed fffffffffffff000 0000000020003b65 d000000007706018
                     GPR24: c000000f16cf0d98 d000000007706000 c000000003f42980 c000000003f42980
                     GPR28: c000000f1575ac00 c000000003f429c8 0000000000000000 c000000eed2103c0
      [  116.298504] NIP [d000000007870524] tce_iommu_attach_group+0x10c/0x360 [vfio_iommu_spapr_tce]
      [  116.298555] LR [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce]
      [  116.298601] Call Trace:
      [  116.298610] [c000000e77217c10] [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce] (unreliable)
      [  116.298671] [c000000e77217cb0] [d0000000077033a0] vfio_fops_unl_ioctl+0x278/0x3e0 [vfio]
      [  116.298713] [c000000e77217d40] [c0000000002a3ebc] do_vfs_ioctl+0xcc/0x8b0
      [  116.298745] [c000000e77217de0] [c0000000002a4700] SyS_ioctl+0x60/0xc0
      [  116.298782] [c000000e77217e30] [c00000000000b220] system_call+0x38/0xfc
      [  116.298812] Instruction dump:
      [  116.298828] 7d3f4b78 409effc8 3d220000 e9298020 3c800140 38a00018 608480c0 e8690028
      [  116.298869] 4800249d e8410018 7c7f1b79 41820230 <e93e0030> 2fa90000 419e0114 e9090020
      [  116.298914] ---[ end trace 1e10b0ced08b9120 ]---
      
      This patch fixes the oops.
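The shape of the fix is a guard at the top of the attach path: if the group carries no powerpc iommu_table_group, bail out with an error instead of dereferencing NULL. A minimal user-space sketch; the struct contents and the exact error code are illustrative, not the kernel's:

```c
#include <errno.h>
#include <stddef.h>

struct iommu_table_group { int dummy; }; /* placeholder contents */

/* Model of the guard: a mediated device never registered a
 * powerpc iommu_table_group, so iommu_data stays NULL and the
 * attach must fail cleanly rather than oops. */
static int tce_attach_group(struct iommu_table_group *table_group)
{
    if (!table_group)
        return -ENODEV; /* illustrative errno choice */
    /* ... proceed with the real attach logic ... */
    return 0;
}
```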
      Reported-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  13. 02 Dec, 2016 6 commits
    • powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown · 4b6fad70
      Alexey Kardashevskiy authored
      At the moment the userspace tool is expected to request pinning of
      the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
      When the userspace process finishes, all the pinned pages need to
      be put; this is done as a part of the userspace memory context (MM)
      destruction which happens on the very last mmdrop().
      
      This approach has a problem: the MM of the userspace process
      may live longer than the userspace process itself, as kernel threads
      use the MM of whatever userspace process was running on the CPU where
      the kernel thread was scheduled. If this happens, the MM remains
      referenced until this exact kernel thread wakes up again
      and releases the very last reference to the MM; on an idle system this
      can take hours.
      
      This moves preregistered region tracking from MM to VFIO; instead of
      using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
      added so each container releases the regions which it has pre-registered.
      
      This changes the userspace interface to return EBUSY if a memory
      region is already registered in a container. However it should not
      have any practical effect as the only userspace tool available now
      registers a memory region only once per container anyway.
      
      As tce_iommu_register_pages/tce_iommu_unregister_pages are called
      under container->lock, this does not need additional locking.
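The per-container tracking described above can be modeled as a small list owned by the container: duplicate registration fails with EBUSY, and releasing the container drops everything it registered (so nothing is left hanging off a long-lived mm). Names and fixed array sizes are illustrative:

```c
#include <errno.h>
#include <stddef.h>

struct prereg { unsigned long ua, entries; };

/* Model of tce_container with its own prereg_list. */
struct container {
    struct prereg list[8];
    size_t nr;
};

/* Model of tce_iommu_register_pages(): EBUSY on duplicates. */
static int register_pages(struct container *c, unsigned long ua,
                          unsigned long entries)
{
    for (size_t i = 0; i < c->nr; i++)
        if (c->list[i].ua == ua)
            return -EBUSY; /* already registered in this container */
    if (c->nr >= 8)
        return -ENOSPC;
    c->list[c->nr].ua = ua;
    c->list[c->nr].entries = entries;
    c->nr++;
    return 0;
}

/* On container shutdown, release every region it pre-registered. */
static void container_release(struct container *c)
{
    c->nr = 0;
}
```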
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio/spapr: Reference mm in tce_container · bc82d122
      Alexey Kardashevskiy authored
      In some situations the userspace memory context may live longer than
      the userspace process itself, so if we need to do proper memory context
      cleanup, we had better have tce_container take a reference to mm_struct
      and use it later when the process is gone (@current or @current->mm is
      NULL).
      
      This references mm and stores the pointer in the container; this is done
      in a new helper - tce_iommu_mm_set() - when one of the following happens:
      - a container is enabled (IOMMU v1);
      - a first attempt to pre-register memory is made (IOMMU v2);
      - a DMA window is created (IOMMU v2).
      The @mm stays referenced till the container is destroyed.
      
      This replaces current->mm with container->mm everywhere except debug
      prints.
      
      This adds a check that current->mm is the same as the one stored in
      the container to prevent userspace from making changes to a memory
      context of other processes.
      
      DMA map/unmap ioctls() do not check for @mm as they already check
      for @enabled which is set after tce_iommu_mm_set() is called.
      
      This does not reference a task as multiple threads within the same mm
      are allowed to ioctl() to vfio; supposedly they will have the same limits
      and capabilities, and if they do not, we'll just fail with no harm done.
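The tce_iommu_mm_set() behaviour described above reduces to: capture the mm once on the first qualifying operation, then reject callers with a different mm. A user-space model (the struct, error code, and names are illustrative):

```c
#include <errno.h>
#include <stddef.h>

struct mm { int id; }; /* stand-in for mm_struct */

static struct mm *container_mm; /* models container->mm */

/* Model of tce_iommu_mm_set(): store the caller's mm on first use
 * (enable / first preregistration / window creation) and verify that
 * later callers come from the same mm. */
static int tce_mm_set(struct mm *current_mm)
{
    if (container_mm)
        return current_mm == container_mm ? 0 : -EPERM;
    container_mm = current_mm; /* reference taken once, kept till destroy */
    return 0;
}
```

Threads sharing the same mm all pass the check, which is exactly why no task reference is needed.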
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio/spapr: Postpone default window creation · d9c72894
      Alexey Kardashevskiy authored
      We are going to allow userspace to configure a container in
      one memory context and pass the container fd to another, so
      we are postponing memory allocations accounted against
      the locked memory limit. One of the previous patches took care of
      it_userspace.
      
      At the moment we create the default DMA window when the first group is
      attached to a container; this is done for the userspace which is not
      DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
      pre-registration - such client expects the default DMA window to exist.
      
      This postpones the default DMA window allocation till one of
      the following happens:
      1. the first map/unmap request arrives;
      2. a new window is requested.
      This adds a no-op for the case when the userspace requests removal
      of the default window which has not been created yet.
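The lazy-creation logic above is a tiny state machine: create on first map/unmap or window request, and treat removal of a never-created default window as a no-op. A user-space sketch with illustrative names:

```c
#include <stdbool.h>

static bool def_window_created;
static int windows_created_total; /* counts real allocations */

static void create_default_window_if_needed(void)
{
    if (!def_window_created) {
        def_window_created = true;
        windows_created_total++; /* allocation happens here, lazily */
    }
}

/* Case 1: the first map/unmap request triggers creation. */
static int handle_map_request(void)
{
    create_default_window_if_needed();
    return 0;
}

/* Removing a default window that was never created is a no-op. */
static int handle_remove_window(void)
{
    if (!def_window_created)
        return 0;
    def_window_created = false;
    return 0;
}
```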
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio/spapr: Add a helper to create default DMA window · 6f01cc69
      Alexey Kardashevskiy authored
      There is already a helper to create a DMA window which does allocate
      a table and programs it to the IOMMU group. However
      tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
      itself to simplify the error path.
      
      Since we are going to delay the default window creation till
      the default window is accessed/removed or a new window is added,
      we need a helper to create the default window from all these call sites.
      
      This adds tce_iommu_create_default_window(). Since it relies on
      a VFIO container to have at least one IOMMU group (for future use),
      this changes tce_iommu_attach_group() to add a group to the container
      first and then call the new helper.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio/spapr: Postpone allocation of userspace version of TCE table · 39701e56
      Alexey Kardashevskiy authored
      The iommu_table struct manages a hardware TCE table and a vmalloc'd
      table with corresponding userspace addresses. Both are allocated when
      the default DMA window is created and this happens when the very first
      group is attached to a container.
      
      As we are going to allow userspace to configure a container in one
      memory context and pass the container fd to another, we have to postpone
      such allocations till the container fd is passed to the destination
      user process, so we account the locked memory limit against the actual
      container user's constraints.
      
      This postpones the it_userspace array allocation till it is used the
      first time for mapping. The unmapping path already checks if the array
      is allocated.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu: Stop using @current in mm_iommu_xxx · d7baee69
      Alexey Kardashevskiy authored
      This changes the mm_iommu_xxx helpers to take mm_struct as a parameter
      instead of getting it from @current, which in some situations may
      not have a valid reference to mm.
      
      This changes the helpers to receive @mm and moves all references to
      @current to the callers, including checks for !current and !current->mm;
      checks in mm_iommu_preregistered() are removed as there is no caller
      yet.
      
      This moves the mm_iommu_adjust_locked_vm() call to the caller as
      it receives mm_iommu_table_group_mem_t but it needs mm.
      
      This should cause no behavioral change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  14. 11 May, 2016 1 commit
  15. 28 Apr, 2016 1 commit
  16. 11 Jun, 2015 13 commits
    • vfio: powerpc/spapr: Support Dynamic DMA windows · e633bc86
      Alexey Kardashevskiy authored
      This adds create/remove window ioctls to create and remove DMA windows.
      sPAPR defines a Dynamic DMA windows capability which allows
      para-virtualized guests to create additional DMA windows on a PCI bus.
      Existing Linux kernels use this new window to map the entire guest
      memory and switch to direct DMA operations, saving time on map/unmap
      requests which would otherwise happen in big amounts.
      
      This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
      VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
      Up to 2 windows are supported now by the hardware and by this driver.
      
      This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
      information such as a number of supported windows and maximum number
      levels of TCE tables.
      
      DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
      as we still want to support v2 on platforms which cannot do DDW for
      the sake of TCE acceleration in KVM (coming soon).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: Register memory and define IOMMU v2 · 2157e7b8
      Alexey Kardashevskiy authored
      The existing implementation accounts the whole DMA window in
      the locked_vm counter. This is going to be worse with multiple
      containers and huge DMA windows. Also, real-time accounting would
      require additional tracking of accounted pages due to the page size
      difference - the IOMMU uses 4K pages and the system uses 4K or 64K pages.
      
      Another issue is that actual page pinning/unpinning happens on every
      DMA map/unmap request. This does not affect the performance much now as
      we spend way too much time on switching context between
      guest/userspace/host, but this will start to matter when we add in-kernel
      DMA map/unmap acceleration.
      
      This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
      New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
      2 new ioctls to register/unregister DMA memory -
      VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
      which receive user space address and size of a memory region which
      needs to be pinned/unpinned and counted in locked_vm.
      New IOMMU splits physical pages pinning and TCE table update
      into 2 different operations. It requires:
      1) guest pages to be registered first
      2) consequent map/unmap requests to work only with pre-registered memory.
      For the default single window case this means that the entire guest
      (instead of 2GB) needs to be pinned before using VFIO.
      When a huge DMA window is added, no additional pinning will be
      required, otherwise it would be guest RAM + 2GB.
      
      The new memory registration ioctls are not supported by
      VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
      will require memory to be preregistered in order to work.
      
      The accounting is done per user process.
      
      This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
      can do with v1 or v2 IOMMUs.
      
      In order to support memory pre-registration, we need a way to track
      the use of every registered memory region and only allow unregistration
      if a region is not in use anymore. So we need a way to tell from what
      region the just cleared TCE was from.
      
      This adds a userspace view of the TCE table into iommu_table struct.
      It contains userspace address, one per TCE entry. The table is only
      allocated when the ownership over an IOMMU group is taken which means
      it is only used from outside of the powernv code (such as VFIO).
      
      As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
      DDW API), this creates a default DMA window for IODA2 for consistency.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control · 46d3e1e1
      Alexey Kardashevskiy authored
      Previously, the IOMMU user (VFIO) would take control of the IOMMU table
      belonging to a specific IOMMU group. This approach did not allow sharing
      tables between IOMMU groups attached to the same container.
      
      This introduces a new IOMMU ownership flavour where the user does not
      just control the existing IOMMU table but can remove/create tables on
      demand.
      If an IOMMU implements take/release_ownership() callbacks, this lets
      the user have full control over the IOMMU group. When the ownership
      is taken, the platform code removes all the windows so the caller must
      create them.
      Before returning the ownership back to the platform code, VFIO
      unprograms and removes all the tables it created.
      
      This changes IODA2's ownership handler to remove the existing table
      rather than manipulating it. From now on,
      iommu_take_ownership() and iommu_release_ownership() are only called
      from the vfio_iommu_spapr_tce driver.
      
      Old-style ownership is still supported allowing VFIO to run on older
      P5IOC2 and IODA IO controllers.
      
      No change in userspace-visible behaviour is expected. Since it recreates
      TCE tables on each ownership change, related kernel traces will appear
      more often.
      
      This adds a pnv_pci_ioda2_setup_default_config() which is called
      when PE is being configured at boot time and when the ownership is
      passed from VFIO to the platform code.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API · 4793d65d
      Alexey Kardashevskiy authored
      This extends iommu_table_group_ops by a set of callbacks to support
      dynamic DMA windows management.
      
      create_table() creates a TCE table with specific parameters.
      It receives an iommu_table_group to know the nodeid in order to allocate
      TCE table memory closer to the PHB. The exact format of the allocated
      multi-level table might also be specific to the PHB model (not
      the case now though).
      This callback calculates the DMA window offset on the PCI bus from @num
      and stores it in the just created table.
      
      set_window() sets the window at specified TVT index + @num on PHB.
      
      unset_window() unsets the window from specified TVT.
      
      This adds a free() callback to iommu_table_ops to free the memory
      (potentially a tree of tables) allocated for the TCE table.
      
      create_table() and free() are supposed to be called once per
      VFIO container and set_window()/unset_window() are supposed to be
      called for every group in a container.
      
      This adds IOMMU capabilities to iommu_table_group such as default
      32bit window parameters and others. This makes use of new values in
      vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
      advertise pagemasks to the userspace.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4793d65d
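      The create_table()/set_window()/unset_window() flow described above can be sketched in a self-contained userspace model. This is illustrative only: the struct layouts, helper names, and the two-window limit are simplified assumptions, not the kernel's real iommu_table_group definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define IOMMU_TABLE_GROUP_MAX_TABLES 2  /* assumed limit for the sketch */

struct iommu_table {
    uint64_t it_offset;     /* DMA window offset on the PCI bus, in IOMMU pages */
    uint64_t it_size;       /* window size, in IOMMU pages */
    unsigned it_page_shift; /* IOMMU page size, as a shift */
};

struct iommu_table_group {
    int nodeid;             /* in the kernel, steers allocation close to the PHB */
    struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
};

/* create_table(): called once per VFIO container; the window's bus offset
 * is derived from the window number @num, so window 1 sits above window 0. */
int create_table(struct iommu_table_group *grp, int num,
                 unsigned page_shift, uint64_t window_size,
                 struct iommu_table **ptbl)
{
    struct iommu_table *tbl;

    (void)grp; /* a real implementation would use grp->nodeid for allocation */
    if (num < 0 || num >= IOMMU_TABLE_GROUP_MAX_TABLES)
        return -1;
    tbl = calloc(1, sizeof(*tbl));
    if (!tbl)
        return -1;
    tbl->it_page_shift = page_shift;
    tbl->it_size = window_size >> page_shift;
    tbl->it_offset = (uint64_t)num * tbl->it_size;
    *ptbl = tbl;
    return 0;
}

/* set_window()/unset_window(): called for every group in a container,
 * programming or clearing the table pointer at TVT index @num. */
int set_window(struct iommu_table_group *grp, int num, struct iommu_table *tbl)
{
    if (num < 0 || num >= IOMMU_TABLE_GROUP_MAX_TABLES || grp->tables[num])
        return -1;
    grp->tables[num] = tbl;
    return 0;
}

void unset_window(struct iommu_table_group *grp, int num)
{
    grp->tables[num] = NULL;
}

/* free() callback: releases the (potentially multi-level) table memory. */
void free_table(struct iommu_table *tbl)
{
    free(tbl);
}
```

      The pairing mirrors the commit text: create_table()/free_table() bracket a container's lifetime, while set_window()/unset_window() run per group.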
    • powerpc/iommu/powernv: Release replaced TCE · 05c6cfb9
      Alexey Kardashevskiy authored
      At the moment writing a new TCE value to the IOMMU table fails with EBUSY
      if there is a valid entry already. However the PAPR specification allows
      the guest to write a new TCE value without clearing the old one first.
      
      Another problem this patch addresses is the use of pool locks for
      external IOMMU users such as VFIO. The pool locks are there to protect
      the DMA page allocator rather than the entries, and since the host kernel
      does not control what pages are in use, there is no point in pool locks;
      exchange()+put_page(oldtce) is sufficient to avoid possible races.
      
      This adds an exchange() callback to iommu_table_ops which does the same
      thing as set() plus it returns the replaced TCE and DMA direction so
      the caller can release the pages afterwards. exchange() receives
      a physical address, unlike set() which receives a linear mapping address,
      and returns a physical address, as clear() does.
      
      This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
      for a platform to have exchange() implemented in order to support VFIO.
      
      This replaces iommu_tce_build() and iommu_clear_tce() with
      a single iommu_tce_xchg().
      
      This makes sure that TCE permission bits are not set in TCE passed to
      IOMMU API as those are to be calculated by platform code from
      DMA direction.
      
      This moves SetPageDirty() to the IOMMU code to make it work for both
      the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
      available later).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      05c6cfb9
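      The exchange() semantics described above — overwrite succeeds on a valid entry, and the old physical address plus DMA direction come back so the caller can put_page() the replaced page — can be modelled in a few lines. This is an assumed toy model, not the kernel's types: the real code uses enum dma_data_direction and hardware TCE encodings, while here a "table" is just an array of entries.

```c
#include <assert.h>
#include <stdint.h>

enum dma_dir { DMA_NONE, DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL };

struct tce {
    uint64_t hpa;     /* host physical address; no permission bits stored here */
    enum dma_dir dir; /* permissions are derived from the direction by platform code */
};

/* exchange(): like set(), but replacing a valid entry succeeds (no EBUSY)
 * and the previous address/direction are handed back for page release.
 * Both the input and the returned addresses are physical. */
void tce_xchg(struct tce *tbl, unsigned long idx,
              uint64_t new_hpa, enum dma_dir new_dir,
              uint64_t *old_hpa, enum dma_dir *old_dir)
{
    *old_hpa = tbl[idx].hpa;
    *old_dir = tbl[idx].dir;
    tbl[idx].hpa = new_hpa;
    tbl[idx].dir = new_dir;
}
```

      A caller would release the old page when *old_dir indicates it was mapped, which is exactly why set()-style "write only into empty slots" is not enough here.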
    • vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control · f87a8864
      Alexey Kardashevskiy authored
      This adds tce_iommu_take_ownership() and tce_iommu_release_ownership()
      which call iommu_take_ownership()/iommu_release_ownership() in a loop
      for every table in the group. As there is just one table now, no change
      in behaviour is expected.
      
      At the moment the iommu_table struct has a set_bypass() callback which
      enables/disables DMA bypass on the IODA2 PHB. This is exposed to the
      POWERPC IOMMU code which calls this callback when external IOMMU users
      such as VFIO are about to take control over a PHB.
      
      The set_bypass() callback is not really an iommu_table function but
      an IOMMU/PE function. This introduces an iommu_table_group_ops struct and
      adds take_ownership()/release_ownership() callbacks to it which are
      called when an external user takes/releases control over the IOMMU.
      
      This replaces set_bypass() with the ownership callbacks as the operation
      is not necessarily just enabling bypass; it can be something else or
      more, so let's give it a more generic name.
      
      The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
      IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
      The following patches will replace iommu_take_ownership/
      iommu_release_ownership calls in IODA2 with full IOMMU table release/
      create.
      
      While we are here touching bypass control, this removes
      pnv_pci_ioda2_setup_bypass_pe() as it does not do much
      more than pnv_pci_ioda2_set_bypass(). This moves tce_bypass_base
      initialization to pnv_pci_ioda2_setup_dma_pe.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f87a8864
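      The shift from a per-table set_bypass() to per-PE ownership callbacks can be sketched as follows. The struct fields and the ioda2_* implementations are hypothetical stand-ins for illustration; the real IODA2 callbacks do considerably more than toggle flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for a PE: just the state the ownership callbacks touch. */
struct pe {
    bool bypass_enabled;    /* what the old set_bypass() used to toggle */
    bool owned_by_external; /* an external user (e.g. VFIO) holds the PE */
};

/* The ops now live at the IOMMU/PE level, not on an individual table. */
struct iommu_table_group_ops {
    void (*take_ownership)(struct pe *pe);
    void (*release_ownership)(struct pe *pe);
};

/* IODA2-style sketch: taking ownership disables bypass, but the callback
 * is free to do something else or more, hence the generic naming. */
static void ioda2_take_ownership(struct pe *pe)
{
    pe->bypass_enabled = false;
    pe->owned_by_external = true;
}

static void ioda2_release_ownership(struct pe *pe)
{
    pe->owned_by_external = false;
    pe->bypass_enabled = true; /* restore the kernel-owned DMA setup */
}

struct iommu_table_group_ops ioda2_ops = {
    .take_ownership = ioda2_take_ownership,
    .release_ownership = ioda2_release_ownership,
};
```

      Platforms without these callbacks (P5IOC2, IODA1) would simply leave the ops pointers unset and fall back to the old per-table API.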
    • powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group · 0eaf4def
      Alexey Kardashevskiy authored
      So far one TCE table could only be used by one IOMMU group. However
      IODA2 hardware allows programming the same TCE table address into
      multiple PEs, allowing tables to be shared.
      
      This replaces the single pointer to a group in the iommu_table struct
      with a linked list of groups, which provides a way to invalidate the
      TCE cache for every PE when the actual TCE table is updated. This adds
      pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
      helpers to manage the list. However without VFIO, it is still going
      to be a single IOMMU group per iommu_table.
      
      This changes iommu_add_device() to add a device to the first group
      in the table's group list, as it is only called from the platform
      init code or the PCI bus notifier, and at those points there is only
      one group per table.
      
      This does not change TCE invalidation code to loop through all
      attached groups in order to simplify this patch and because
      it is not really needed in most cases. IODA2 is fixed in a later
      patch.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0eaf4def
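      The table-to-groups linked list described above can be sketched as a small model. The names loosely follow the commit (link/unlink helpers, first-group lookup for iommu_add_device()), but the struct contents are assumed for illustration and the kernel's real list is a list_head, not a hand-rolled singly linked list.

```c
#include <assert.h>
#include <stdlib.h>

struct iommu_table_group {
    int id; /* stand-in for the iommu_group and PE state */
};

struct table_group_link {
    struct iommu_table_group *table_group;
    struct table_group_link *next;
};

struct iommu_table {
    struct table_group_link *it_group_list; /* replaces the single group pointer */
};

/* pnv_pci_link_table_and_group()-style helper (simplified): prepend. */
int link_table_and_group(struct iommu_table *tbl, struct iommu_table_group *grp)
{
    struct table_group_link *tgl = calloc(1, sizeof(*tgl));

    if (!tgl)
        return -1;
    tgl->table_group = grp;
    tgl->next = tbl->it_group_list;
    tbl->it_group_list = tgl;
    return 0;
}

int unlink_table_and_group(struct iommu_table *tbl, struct iommu_table_group *grp)
{
    struct table_group_link **p;

    for (p = &tbl->it_group_list; *p; p = &(*p)->next) {
        if ((*p)->table_group == grp) {
            struct table_group_link *found = *p;
            *p = found->next;
            free(found);
            return 0;
        }
    }
    return -1;
}

/* A TCE update would walk it_group_list to invalidate every PE's cache;
 * iommu_add_device() simply uses the first group, as there is only one
 * at the times it is called. */
struct iommu_table_group *first_group(struct iommu_table *tbl)
{
    return tbl->it_group_list ? tbl->it_group_list->table_group : NULL;
}
```

      Without VFIO the list degenerates to a single entry, matching the "one group per table" behaviour the commit preserves.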
    • powerpc/spapr: vfio: Replace iommu_table with iommu_table_group · b348aa65
      Alexey Kardashevskiy authored
      Modern IBM POWERPC systems support multiple (currently two) TCE tables
      per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
      for TCE tables. Right now just one table is supported.
      
      This defines iommu_table_group struct which stores pointers to
      iommu_group and iommu_table(s). This replaces iommu_table with
      iommu_table_group where iommu_table was used to identify a group:
      - iommu_register_group();
      - iommudata of generic iommu_group;
      
      This removes @data from iommu_table as it_table_group provides
      the same access to pnv_ioda_pe.
      
      For IODA, instead of embedding iommu_table, the new iommu_table_group
      keeps pointers to them. The iommu_table structs are allocated
      dynamically.
      
      For P5IOC2, both iommu_table_group and iommu_table are embedded into
      PE struct. As there is no EEH and SRIOV support for P5IOC2,
      iommu_free_table() should not be called on iommu_table struct pointers
      so we can keep it embedded in pnv_phb::p5ioc2.
      
      For pSeries, this replaces multiple calls of kzalloc_node() with a new
      iommu_pseries_alloc_group() helper and stores the table group struct
      pointer in the pci_dn struct. For release, an iommu_table_free_group()
      helper is added.
      
      This moves iommu_table struct allocation from SR-IOV code to
      the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
      pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
      This change is here because those lines had to be changed anyway.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b348aa65
    • vfio: powerpc/spapr: Rework groups attaching · 22af4859
      Alexey Kardashevskiy authored
      This is to make extended ownership and multiple groups support patches
      simpler for review.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      22af4859
    • vfio: powerpc/spapr: Moving pinning/unpinning to helpers · 649354b7
      Alexey Kardashevskiy authored
      This is a pretty mechanical patch to make next patches simpler.
      
      The new tce_iommu_unuse_page() helper does put_page() now but it might
      skip that after the memory registering patch is applied.
      
      While we are here, this removes the unnecessary check of the value
      returned by pfn_to_page() as it cannot possibly return NULL.
      
      This moves tce_iommu_disable() later to let tce_iommu_clear() know if
      the container has been enabled, because if it has not been, then
      put_page() must not be called on TCEs from the TCE table. This situation
      is not yet possible but it will be after the KVM acceleration patchset
      is applied.
      
      This changes code to work with physical addresses rather than linear
      mapping addresses for better code readability. Following patches will
      add an xchg() callback for an IOMMU table which will accept/return
      physical addresses (unlike current tce_build()) which will eliminate
      redundant conversions.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      649354b7
    • vfio: powerpc/spapr: Disable DMA mappings on disabled container · 3c56e822
      Alexey Kardashevskiy authored
      At the moment DMA map/unmap requests are handled irrespective of
      the container's state. This allows userspace to pin memory which
      it might not be allowed to pin.
      
      This adds checks to MAP/UNMAP that the container is enabled; otherwise
      -EPERM is returned.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3c56e822
    • vfio: powerpc/spapr: Move locked_vm accounting to helpers · 2d270df8
      Alexey Kardashevskiy authored
      This moves locked pages accounting to helpers.
      Later they will be reused for Dynamic DMA windows (DDW).
      
      This reworks debug messages to show the current value and the limit.
      
      This stores the locked pages number in the container so the iommu
      table pointer won't be needed when unlocking. This has no effect
      now but it will with multiple tables per container, as we will then
      allow attaching/detaching groups on the fly and may end up having
      a container with no group attached but with the counter incremented.
      
      While we are here, update the comment explaining why RLIMIT_MEMLOCK
      might need to be bigger than the guest RAM. This also prints the
      pid of the current process in pr_warn/pr_debug.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2d270df8
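      The accounting helpers described above can be sketched as a minimal model: the counter lives in the container (so no iommu table pointer is needed when unlocking) and is checked against a limit standing in for RLIMIT_MEMLOCK. The names and the fixed limit are illustrative assumptions, not the kernel's real helpers.

```c
#include <assert.h>
#include <stdio.h>

struct tce_container {
    long locked_pages; /* counter survives group detach, per the commit */
};

/* Stand-in for current->rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT. */
static long memlock_limit_pages = 4096;

/* Try to account @npages; on failure the real code pr_warn()s the pid,
 * the current value, and the limit, and returns -ENOMEM. */
int try_increment_locked_vm(struct tce_container *c, long npages)
{
    if (c->locked_pages + npages > memlock_limit_pages) {
        fprintf(stderr, "RLIMIT_MEMLOCK exceeded: %ld + %ld > %ld\n",
                c->locked_pages, npages, memlock_limit_pages);
        return -1;
    }
    c->locked_pages += npages;
    return 0;
}

void decrement_locked_vm(struct tce_container *c, long npages)
{
    if (npages > c->locked_pages) /* clamp; the real helper warns here */
        npages = c->locked_pages;
    c->locked_pages -= npages;
}
```

      Because TCE tables pin pages on top of the guest RAM itself, the limit may indeed need to exceed the guest RAM size, which is what the updated comment in the commit explains.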
    • vfio: powerpc/spapr: Use it_page_size · 00663d4e
      Alexey Kardashevskiy authored
      This makes use of it_page_size from the iommu_table struct
      as the page size can differ.
      
      This replaces the missing IOMMU_PAGE_SHIFT macro in commented-out debug
      code, as the recently introduced IOMMU_PAGE_XXX macros do not include
      IOMMU_PAGE_SHIFT.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      00663d4e