1. 19 Jun, 2015 3 commits
  2. 17 Jun, 2015 4 commits
  3. 16 Jun, 2015 1 commit
  4. 15 Jun, 2015 2 commits
  5. 11 Jun, 2015 30 commits
    • powerpc: Don't use gcc specific options on clang · 238abecd
      Anton Blanchard authored
      We have code to choose between several options, e.g. -mabi=elfv2 vs
      -mcall-aixdesc, and -mcmodel=medium vs -mminimal-toc. But these are all
      GCC specific, so use cc-option on all of them.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Don't use -mno-strict-align on clang · 92d6cf2d
      Anton Blanchard authored
      We added -mno-strict-align in commit f036b368 (powerpc: Work around little
      endian gcc bug) to fix gcc bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57134
      
      Clang doesn't understand it. We need to use a conditional because we can't use the
      simpler call cc-option here.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it · a50a862e
      Anton Blanchard authored
      These options are not recognised on LLVM, so use call cc-option to check
      for support.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Only use -mabi=altivec if toolchain supports it · 1fb3f5a7
      Anton Blanchard authored
      The -mabi=altivec option is not recognised on LLVM, so use call cc-option
      to check for support.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Fix duplicate const clang warning in user access code · b91c1e3e
      Anton Blanchard authored
      We see a large number of duplicate const warnings in the user access
      code when building with llvm/clang:
      
        include/linux/pagemap.h:576:8: warning: duplicate 'const' declaration specifier
            [-Wduplicate-decl-specifier]
              ret = __get_user(c, uaddr);
      
      The problem is we are doing const __typeof__(*(ptr)), which will hit the
      warning if ptr is marked const.
      
      Removing const does not seem to have any effect on GCC code generation.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
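The pattern behind the warning can be sketched roughly as follows. This is a simplified illustration, not the kernel's user access code: `read_val_const`, `read_val` and `read_through_const` are hypothetical names. When `ptr` points to a const-qualified type, `__typeof__(*(ptr))` already carries `const`, so prefixing another `const` repeats the specifier.

```c
#include <assert.h>

/* Hypothetical stand-ins for the user access macros; not kernel code. */

/* Old pattern: 'const' is repeated when *(ptr) is already const,
 * which clang reports under -Wduplicate-decl-specifier. */
#define read_val_const(x, ptr)				\
	do {						\
		const __typeof__(*(ptr)) *p_ = (ptr);	\
		(x) = *p_;				\
	} while (0)

/* Fixed pattern: rely on the qualifiers __typeof__ already carries. */
#define read_val(x, ptr)				\
	do {						\
		__typeof__(*(ptr)) *p_ = (ptr);		\
		(x) = *p_;				\
	} while (0)

/* Reads through a pointer to const, as the pagemap.h caller does. */
int read_through_const(const int *src)
{
	int out = 0;

	read_val(out, src);	/* no duplicate 'const' here */
	return out;
}
```

Swapping `read_val` for `read_val_const` inside `read_through_const` should reproduce the diagnostic on compilers that report duplicate declaration specifiers.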
    • vfio: powerpc/spapr: Support Dynamic DMA windows · e633bc86
      Alexey Kardashevskiy authored
      This adds create/remove window ioctls to create and remove DMA windows.
      sPAPR defines a Dynamic DMA windows capability which allows
      para-virtualized guests to create additional DMA windows on a PCI bus.
      Existing Linux kernels use this new window to map the entire guest
      memory and switch to direct DMA operations, saving time on map/unmap
      requests which would otherwise happen in large numbers.
      
      This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
      VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
      Up to 2 windows are supported now by the hardware and by this driver.
      
      This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
      information such as the number of supported windows and the maximum number
      of TCE table levels.
      
      DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
      as we still want to support v2 on platforms which cannot do DDW for
      the sake of TCE acceleration in KVM (coming soon).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: Register memory and define IOMMU v2 · 2157e7b8
      Alexey Kardashevskiy authored
      The existing implementation accounts the whole DMA window in
      the locked_vm counter. This is going to be worse with multiple
      containers and huge DMA windows. Also, real-time accounting would require
      additional tracking of accounted pages due to the page size difference -
      IOMMU uses 4K pages and system uses 4K or 64K pages.
      
      Another issue is that actual pages pinning/unpinning happens on every
      DMA map/unmap request. This does not affect the performance much now as
      we spend way too much time now on switching context between
      guest/userspace/host but this will start to matter when we add in-kernel
      DMA map/unmap acceleration.
      
      This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
      New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
      2 new ioctls to register/unregister DMA memory -
      VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
      which receive user space address and size of a memory region which
      needs to be pinned/unpinned and counted in locked_vm.
      New IOMMU splits physical pages pinning and TCE table update
      into 2 different operations. It requires:
      1) guest pages to be registered first
      2) subsequent map/unmap requests to work only with pre-registered memory.
      For the default single window case this means that the entire guest
      (instead of 2GB) needs to be pinned before using VFIO.
      When a huge DMA window is added, no additional pinning will be
      required, otherwise it would be guest RAM + 2GB.
      
      The new memory registration ioctls are not supported by
      VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
      will require memory to be preregistered in order to work.
      
      The accounting is done per the user process.
      
      This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
      can do with v1 or v2 IOMMUs.
      
      In order to support memory pre-registration, we need a way to track
      the use of every registered memory region and only allow unregistration
      if a region is not in use anymore. So we need a way to tell which
      region a just-cleared TCE came from.
      
      This adds a userspace view of the TCE table into iommu_table struct.
      It contains userspace address, one per TCE entry. The table is only
      allocated when the ownership over an IOMMU group is taken which means
      it is only used from outside of the powernv code (such as VFIO).
      
      As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
      DDW API), this creates a default DMA window for IODA2 for consistency.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mmu: Add userspace-to-physical addresses translation cache · 15b244a8
      Alexey Kardashevskiy authored
      We are adding support for DMA memory pre-registration to be used in
      conjunction with VFIO. The idea is that the userspace which is going to
      run a guest may want to pre-register a user space memory region so
      it all gets pinned once and never goes away. Having this done,
      a hypervisor will not have to pin/unpin pages on every DMA map/unmap
      request. This is going to help with multiple pinning of the same memory.
      
      Another use of it is in-kernel real mode (mmu off) acceleration of
      DMA requests where real time translation of guest physical to host
      physical addresses is non-trivial and may fail as linux ptes may be
      temporarily invalid. Also, having cached host physical addresses
      (compared to just pinning at the start and then walking the page table
      again on every H_PUT_TCE), we can be sure that the addresses which we put
      into TCE table are the ones we already pinned.
      
      This adds a list of memory regions to mm_context_t. Each region consists
      of a header and a list of physical addresses. This adds API to:
      1. register/unregister memory regions;
      2. do final cleanup (which puts all pre-registered pages);
      3. do userspace to physical address translation;
      4. manage usage counters; multiple registration of the same memory
      is allowed (once per container).
      
      This implements 2 counters per registered memory region:
      - @mapped: incremented on every DMA mapping; decremented on unmapping;
      initialized to 1 when a region is just registered; once it becomes zero,
      no more mappings are allowed;
      - @used: incremented on every "register" ioctl; decremented on
      "unregister"; unregistration is allowed for DMA mapped regions unless
      it is the very last reference. For the very last reference this checks
      whether the region is still mapped and, if so, returns -EBUSY so the
      userspace gets to know that memory is still pinned and unregistration
      needs to be retried; @used remains 1.
      
      Host physical addresses are stored in vmalloc'ed array. In order to
      access these in the real mode (mmu off), there is a real_vmalloc_addr()
      helper. In-kernel acceleration patchset will move it from KVM to MMU code.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
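The interplay of the two counters described in the commit above can be modelled roughly like this. It is a single-threaded sketch with plain integers, not the kernel implementation: `struct region` and the `region_*` functions are hypothetical names, and the `-ENXIO` and `-ENOENT` return values are choices made for this illustration (only `-EBUSY` is stated in the commit message).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, single-threaded model of one pre-registered region. */
struct region {
	long used;	/* outstanding "register" calls */
	long mapped;	/* 1 while registered, plus one per active DMA mapping */
};

/* First registration initialises @mapped to 1; every call bumps @used. */
void region_register(struct region *r)
{
	if (r->used == 0)
		r->mapped = 1;
	r->used++;
}

/* Take a mapping reference; refused once @mapped has dropped to zero. */
int region_map(struct region *r)
{
	if (r->mapped == 0)
		return -ENXIO;	/* error code is an assumption of this sketch */
	r->mapped++;
	return 0;
}

/* Drop a mapping reference, never going below the initial 1. */
void region_unmap(struct region *r)
{
	if (r->mapped > 1)
		r->mapped--;
}

/* Unregister; the last reference fails with -EBUSY while DMA-mapped. */
int region_unregister(struct region *r)
{
	if (r->used == 0)
		return -ENOENT;	/* error code is an assumption of this sketch */
	if (r->used > 1) {
		r->used--;
		return 0;
	}
	if (r->mapped != 1)
		return -EBUSY;	/* still mapped; @used remains 1 */
	r->mapped = 0;		/* no more mappings allowed */
	r->used = 0;
	return 0;
}
```

A caller that gets `-EBUSY` is expected to unmap the outstanding DMA mappings and retry the unregistration.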
    • vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control · 46d3e1e1
      Alexey Kardashevskiy authored
      Previously, the IOMMU user (VFIO) would take control over the IOMMU table
      belonging to a specific IOMMU group. This approach did not allow sharing
      tables between IOMMU groups attached to the same container.
      
      This introduces a new IOMMU ownership flavour in which the user can not
      only control the existing IOMMU table but also remove/create tables on demand.
      If an IOMMU implements take/release_ownership() callbacks, this lets
      the user have full control over the IOMMU group. When the ownership
      is taken, the platform code removes all the windows so the caller must
      create them.
      Before returning the ownership back to the platform code, VFIO
      unprograms and removes all the tables it created.
      
      This changes IODA2's ownership handler to remove the existing table
      rather than manipulating the existing one. From now on,
      iommu_take_ownership() and iommu_release_ownership() are only called
      from the vfio_iommu_spapr_tce driver.
      
      Old-style ownership is still supported allowing VFIO to run on older
      P5IOC2 and IODA IO controllers.
      
      No change in userspace-visible behaviour is expected. Since it recreates
      TCE tables on each ownership change, related kernel traces will appear
      more often.
      
      This adds a pnv_pci_ioda2_setup_default_config() which is called
      when PE is being configured at boot time and when the ownership is
      passed from VFIO to the platform code.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table · 00547193
      Alexey Kardashevskiy authored
      This adds a way for the IOMMU user to know how much a new table will
      use so it can be accounted in the locked_vm limit before allocation
      happens.
      
      This stores the allocated table size in pnv_pci_ioda2_get_table_size()
      so the locked_vm counter can be updated correctly when a table is
      being disposed.
      
      This defines an iommu_table_group_ops callback to let VFIO know
      how much memory will be locked if a table is created.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release · c035e37b
      Alexey Kardashevskiy authored
      The existing code programmed TVT#0 with some address and then
      immediately released that memory.
      
      This makes use of pnv_pci_ioda2_unset_window() and
      pnv_pci_ioda2_set_bypass() which do correct resource release and
      TVT update.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API · 4793d65d
      Alexey Kardashevskiy authored
      This extends iommu_table_group_ops by a set of callbacks to support
      dynamic DMA windows management.
      
      create_table() creates a TCE table with specific parameters.
      It receives iommu_table_group to know nodeid in order to allocate
      TCE table memory closer to the PHB. The exact format of allocated
      multi-level table might be also specific to the PHB model (not
      the case now though).
      This callback calculates the DMA window offset on a PCI bus from @num
      and stores it in a just created table.
      
      set_window() sets the window at specified TVT index + @num on PHB.
      
      unset_window() unsets the window from specified TVT.
      
      This adds a free() callback to iommu_table_ops to free the memory
      (potentially a tree of tables) allocated for the TCE table.
      
      create_table() and free() are supposed to be called once per
      VFIO container and set_window()/unset_window() are supposed to be
      called for every group in a container.
      
      This adds IOMMU capabilities to iommu_table_group such as default
      32bit window parameters and others. This makes use of new values in
      vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
      advertise pagemasks to the userspace.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Implement multilevel TCE tables · bbb845c4
      Alexey Kardashevskiy authored
      TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
      on huge guests (hundreds of GB of RAM) so the kernel might be unable to
      allocate a contiguous chunk of physical memory to store the TCE table.
      
      To address this, POWER8 CPU (actually, IODA2) supports multi-level
      TCE tables, up to 5 levels, which split the table into a tree of
      smaller subtables.
      
      This adds multi-level TCE tables support to
      pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
      helpers.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
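To see why a flat table becomes a problem, and how a multi-level layout splits the lookup, consider the rough sketch below. It assumes 8-byte TCE entries and an equal number of index bits per level. Both function names are hypothetical and the actual split used by the hardware and by the powernv code may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed by a flat TCE table covering window_size bytes of DMA
 * space, assuming 8-byte entries (an assumption of this sketch). */
uint64_t tce_flat_table_bytes(uint64_t window_size, unsigned page_shift)
{
	return (window_size >> page_shift) * 8;
}

/* Index into the table at a given level (0 = top) for TCE number idx,
 * when every level resolves bits_per_level bits of the index. */
uint64_t tce_level_index(uint64_t idx, unsigned level, unsigned levels,
			 unsigned bits_per_level)
{
	unsigned shift;

	assert(levels >= 1 && level < levels);
	assert(bits_per_level > 0 && bits_per_level < 64);

	shift = (levels - 1 - level) * bits_per_level;
	return (idx >> shift) & ((1ULL << bits_per_level) - 1);
}
```

With 4K IOMMU pages, a 512 GiB window needs 2^27 entries, so `tce_flat_table_bytes(512ULL << 30, 12)` comes to 1 GiB of physically contiguous memory. Splitting the index across levels lets each subtable stay small.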
    • powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window · 43cb60ab
      Alexey Kardashevskiy authored
      This is a part of moving DMA window programming to an iommu_ops
      callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
      a first parameter (not pnv_ioda_pe) as it is going to be used as
      a callback for VFIO DDW code.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages · aca6913f
      Alexey Kardashevskiy authored
      This is a part of moving TCE table allocation into an iommu_ops
      callback to support multiple IOMMU groups per one VFIO container.
      
      This moves the code which allocates the actual TCE tables to helpers:
      pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
      These do not allocate/free the iommu_table struct.
      
      This enforces window size to be a power of two.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda2: Rework iommu_table creation · e5aad1e6
      Alexey Kardashevskiy authored
      This moves iommu_table creation to the beginning to make following changes
      easier to review. This starts using table parameters from the iommu_table
      struct.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu/powernv: Release replaced TCE · 05c6cfb9
      Alexey Kardashevskiy authored
      At the moment writing a new TCE value to the IOMMU table fails with EBUSY
      if there is a valid entry already. However, the PAPR specification allows
      the guest to write a new TCE value without clearing it first.
      
      Another problem this patch is addressing is the use of pool locks for
      external IOMMU users such as VFIO. The pool locks are to protect
      DMA page allocator rather than entries and since the host kernel does
      not control what pages are in use, there is no point in pool locks and
      exchange()+put_page(oldtce) is sufficient to avoid possible races.
      
      This adds an exchange() callback to iommu_table_ops which does the same
      thing as set() plus it returns replaced TCE and DMA direction so
      the caller can release the pages afterwards. The exchange() receives
      a physical address unlike set() which receives linear mapping address;
      and returns a physical address as the clear() does.
      
      This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
      for a platform to have exchange() implemented in order to support VFIO.
      
      This replaces iommu_tce_build() and iommu_clear_tce() with
      a single iommu_tce_xchg().
      
      This makes sure that TCE permission bits are not set in TCE passed to
      IOMMU API as those are to be calculated by platform code from
      DMA direction.
      
      This moves SetPageDirty() to the IOMMU code to make it work for both
      the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
      available later).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
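The exchange semantics described in the commit above (install a new entry, hand back the old address and direction) might look roughly like the sketch below. It is a non-atomic illustration over a plain array: `tce_exchange` and the `PERM_*` bits are hypothetical names for this example and not the kernel's callback signature, and a real implementation would need an atomic swap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical permission bits kept in the low bits of an entry. */
#define PERM_READ	0x1ULL
#define PERM_WRITE	0x2ULL
#define PERM_MASK	(PERM_READ | PERM_WRITE)

/*
 * Install a new entry and return what it replaced.
 * On entry *hpa and *perm describe the new mapping; on return they
 * describe the old one, so the caller can release the old page.
 */
void tce_exchange(uint64_t *table, unsigned long index,
		  uint64_t *hpa, uint64_t *perm)
{
	uint64_t newtce, oldtce;

	/* The address passed in must carry no permission bits. */
	assert((*hpa & PERM_MASK) == 0);

	newtce = *hpa | (*perm & PERM_MASK);
	oldtce = table[index];	/* not atomic: sketch only */
	table[index] = newtce;

	*hpa = oldtce & ~PERM_MASK;
	*perm = oldtce & PERM_MASK;
}
```

Because the old entry is returned rather than rejected, a caller can overwrite a valid entry in one step and then release the page the old entry pointed to.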
    • powerpc/powernv: Implement accessor to TCE entry · c5bb44ed
      Alexey Kardashevskiy authored
      This replaces direct accesses to the TCE table with a helper which
      returns a TCE entry address. This makes no difference now but will
      when multi-level TCE tables get introduced.
      
      No change in behavior is expected.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda2: Add TCE invalidation for all attached groups · e57080f1
      Alexey Kardashevskiy authored
      The iommu_table struct keeps a list of IOMMU groups it is used for.
      At the moment there is just a single group attached but further
      patches will add TCE table sharing. When sharing is enabled, the TCE cache
      in each PE needs to be invalidated, which is what this patch does.
      
      This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan
      to enable TCE table sharing on PHBs older than IODA2.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda2: Move TCE kill register address to PE · 5780fb04
      Alexey Kardashevskiy authored
      At the moment the DMA setup code looks for the "ibm,opal-tce-kill"
      property which contains the TCE kill register address. Writing to
      this register invalidates TCE cache on IODA/IODA2 hub.
      
      This moves the register address from iommu_table to pnv_phb as this
      register belongs to PHB and invalidates TCE cache for all tables of
      all attached PEs.
      
      This moves the property reading/remapping code to a helper which is
      called when DMA is being configured for PE and which does DMA setup
      for both IODA1 and IODA2.
      
      This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
      invalidates cache for the entire table. It should be called after
      every call to opal_pci_map_pe_dma_window(). It was not required before
      because there was just a single TCE table and 64bit DMA was handled via
      bypass window (which has no table so no cache was used) but this is going
      to change with Dynamic DMA windows (DDW).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu: Fix IOMMU ownership control functions · b82c75bf
      Alexey Kardashevskiy authored
      This adds missing locks in iommu_take_ownership()/
      iommu_release_ownership().
      
      This marks all pages busy in iommu_table::it_map in order to catch
      errors if there is an attempt to use this table while ownership over it
      is taken.
      
      This only clears TCE content if there is no page marked busy in it_map.
      Clearing must be done outside of the table locks as iommu_clear_tce()
      called from iommu_clear_tces_and_put_pages() does this.
      
      In order to use bitmap_empty(), the existing code clears bit#0 which
      is set even in an empty table if it is bus-mapped at 0 as
      iommu_init_table() reserves page#0 to prevent buggy drivers
      from crashing when allocated page is bus-mapped at zero
      (which is correct). This restores the bit in the case of failure
      to bring the it_map to the state it was in when we called
      iommu_take_ownership().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
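The bit#0 handling described in the commit above can be illustrated with the following sketch. It models the allocation map as an array of `bool` rather than a real bitmap and omits the locking entirely. `map_empty` and `take_ownership` are hypothetical names, not the kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one bool per IOMMU page instead of a real bitmap. */
static bool map_empty(const bool *map, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (map[i])
			return false;
	return true;
}

/* reserved0 says whether page#0 was reserved when the table was set up. */
int take_ownership(bool *map, size_t n, bool reserved0)
{
	size_t i;

	assert(n > 0);

	if (reserved0)
		map[0] = false;		/* ignore the reserved page in the check */

	if (!map_empty(map, n)) {
		if (reserved0)
			map[0] = true;	/* restore the original state on failure */
		return -EBUSY;
	}

	for (i = 0; i < n; i++)
		map[i] = true;		/* mark every page busy while owned */
	return 0;
}
```

Marking every page busy means any attempt to allocate from the table while it is owned elsewhere fails instead of silently succeeding.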
    • vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control · f87a8864
      Alexey Kardashevskiy authored
      This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
      which call in a loop iommu_take_ownership()/iommu_release_ownership()
      for every table on the group. As there is just one now, no change in
      behaviour is expected.
      
      At the moment the iommu_table struct has a set_bypass() which enables/
      disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
      which calls this callback when external IOMMU users such as VFIO are
      about to take over a PHB.
      
      The set_bypass() callback is not really an iommu_table function but
      an IOMMU/PE function. This introduces an iommu_table_group_ops struct and
      adds take_ownership()/release_ownership() callbacks to it which are
      called when an external user takes/releases control over the IOMMU.
      
      This replaces set_bypass() with ownership callbacks as it is not
      necessarily just bypass enabling, it can be something else/more
      so let's give it more generic name.
      
      The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
      IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
      The following patches will replace iommu_take_ownership/
      iommu_release_ownership calls in IODA2 with full IOMMU table release/
      create.
      
      As we are here and touching bypass control, this removes
      pnv_pci_ioda2_setup_bypass_pe() as it does not do much
      more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
      initialization to pnv_pci_ioda2_setup_dma_pe.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group · 0eaf4def
      Alexey Kardashevskiy authored
      So far one TCE table could only be used by one IOMMU group. However
      IODA2 hardware allows programming the same TCE table address to
      multiple PEs, allowing tables to be shared.
      
      This replaces a single pointer to a group in an iommu_table struct
      with a linked list of groups which provides the way of invalidating
      TCE cache for every PE when an actual TCE table is updated. This adds
      pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
      helpers to manage the list. However without VFIO, it is still going
      to be a single IOMMU group per iommu_table.
      
      This changes iommu_add_device() to add a device to the first group
      from the group list of a table as it is only called from the platform
      init code or PCI bus notifier and at these moments there is only
      one group per table.
      
      This does not change TCE invalidation code to loop through all
      attached groups in order to simplify this patch and because
      it is not really needed in most cases. IODA2 is fixed in a later
      patch.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/spapr: vfio: Replace iommu_table with iommu_table_group · b348aa65
      Alexey Kardashevskiy authored
      Modern IBM POWERPC systems support multiple (currently two) TCE tables
      per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
      for TCE tables. Right now just one table is supported.
      
      This defines iommu_table_group struct which stores pointers to
      iommu_group and iommu_table(s). This replaces iommu_table with
      iommu_table_group where iommu_table was used to identify a group:
      - iommu_register_group();
      - iommudata of generic iommu_group;
      
      This removes @data from iommu_table as it_table_group provides
      same access to pnv_ioda_pe.
      
      For IODA, instead of embedding iommu_table, the new iommu_table_group
      keeps pointers to those. The iommu_table structs are allocated
      dynamically.
      
      For P5IOC2, both iommu_table_group and iommu_table are embedded into
      PE struct. As there is no EEH and SRIOV support for P5IOC2,
      iommu_free_table() should not be called on iommu_table struct pointers
      so we can keep it embedded in pnv_phb::p5ioc2.
      
      For pSeries, this replaces multiple calls of kzalloc_node() with a new
      iommu_pseries_alloc_group() helper and stores the table group struct
      pointer into the pci_dn struct. For release, an iommu_table_free_group()
      helper is added.
      
      This moves iommu_table struct allocation from SR-IOV code to
      the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
      pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
      This change is here because those lines had to be changed anyway.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free() · decbda25
      Alexey Kardashevskiy authored
      The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is
      supposed to be called on IODA1/2 and not called on p5ioc2. It receives
      the start and end host addresses of the TCE table.
      
      IODA2 actually needs PCI addresses to invalidate the cache. Those
      can be calculated from host addresses but since we are going
      to implement multi-level TCE tables, calculating PCI address from
      a host address might get either tricky or ugly as TCE table remains flat
      on PCI bus but not in RAM.
      
      This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/
      pnv_tce_free and defines IODA1/2-specific callbacks which call generic
      ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps
      using generic callbacks as before.
      
      This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and
      number of pages, which are PCI addresses shifted by the IOMMU page shift.
      
      No change in behaviour is expected.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      decbda25
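      The index/page-count form described above can be illustrated with a
      minimal userspace sketch. It is not the kernel code: struct tce_range
      and pci_range_to_tce() are hypothetical names, and the sketch assumes
      the TCE index is simply the PCI bus address shifted right by the IOMMU
      page shift, as the commit message states.

```c
#include <stdint.h>

/* Hypothetical pair describing which TCE entries a PCI address range covers. */
struct tce_range {
    uint64_t index;   /* first TCE entry covered by the range */
    uint64_t npages;  /* number of IOMMU pages (TCE entries) covered */
};

/*
 * Convert the PCI bus address range [bus_addr, bus_addr + size) into a
 * TCE index and page count. size must be non-zero.
 */
static struct tce_range pci_range_to_tce(uint64_t bus_addr, uint64_t size,
                                         unsigned int page_shift)
{
    struct tce_range r;
    uint64_t first = bus_addr >> page_shift;
    uint64_t last = (bus_addr + size - 1) >> page_shift;

    r.index = first;
    r.npages = last - first + 1;
    return r;
}
```

      With a 4K IOMMU page (shift of 12), a range that starts in the middle
      of one page and spans a page boundary covers two entries, which is why
      the page count is derived from the first and last page rather than from
      the size alone.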
    • Alexey Kardashevskiy's avatar
      powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table · da004c36
      Alexey Kardashevskiy authored
      This adds an iommu_table_ops struct and puts a pointer to it into
      the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush
      callbacks from ppc_md to the new struct where they really belong.
      
      This adds the requirement for @it_ops to be initialized before calling
      iommu_init_table() to make sure that we do not leave any IOMMU table
      with iommu_table_ops uninitialized. This is not a parameter of
      iommu_init_table() though as there will be cases when iommu_init_table()
      will not be called on TCE tables, for example, VFIO.
      
      This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_"
      redundant prefixes.
      
      This removes tce_xxx_rm handlers from ppc_md but does not add
      them to iommu_table_ops as this will be done later if we decide to
      support TCE hypercalls in real mode. This removes _vm callbacks as
      only virtual mode is supported for now so this also removes the @rm parameter.
      
      For pSeries, this always uses tce_buildmulti_pSeriesLP/
      tce_freemulti_pSeriesLP. This changes multi callback to fall back to
      tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
      present. The reason for this is we still have to support "multitce=off"
      boot parameter in disable_multitce() and we do not want to walk through
      all IOMMU tables in the system and replace "multi" callbacks with single
      ones.
      
      For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2.
      This makes the callbacks for them public. Later patches will extend
      callbacks for IODA1/2.
      
      No change in behaviour is expected.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      da004c36
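      The per-table callback arrangement described above can be sketched in
      plain userspace C. This is an illustration only: every sketch_ and
      flat_ name is hypothetical, the callback signatures are simplified
      assumptions rather than the kernel's, and returning an error from the
      init function is just one way to model the "ops must be set first"
      requirement.

```c
#include <stddef.h>

#define SKETCH_TABLE_SIZE 16

struct sketch_iommu_table;

/* Per-table callbacks, using the set/clear/get/flush names from the text. */
struct sketch_iommu_table_ops {
    int (*set)(struct sketch_iommu_table *tbl, long index, long npages,
               unsigned long value);
    void (*clear)(struct sketch_iommu_table *tbl, long index, long npages);
    unsigned long (*get)(struct sketch_iommu_table *tbl, long index);
    void (*flush)(struct sketch_iommu_table *tbl); /* optional, may be NULL */
};

/* The table carries a pointer to its ops instead of using global hooks. */
struct sketch_iommu_table {
    unsigned long entries[SKETCH_TABLE_SIZE];
    const struct sketch_iommu_table_ops *it_ops;
};

static int flat_set(struct sketch_iommu_table *tbl, long index, long npages,
                    unsigned long value)
{
    long i;

    if (index < 0 || npages < 0 || index + npages > SKETCH_TABLE_SIZE)
        return -1;
    for (i = 0; i < npages; i++)
        tbl->entries[index + i] = value + (unsigned long)i;
    return 0;
}

static void flat_clear(struct sketch_iommu_table *tbl, long index, long npages)
{
    long i;

    if (index < 0 || npages < 0 || index + npages > SKETCH_TABLE_SIZE)
        return;
    for (i = 0; i < npages; i++)
        tbl->entries[index + i] = 0;
}

static unsigned long flat_get(struct sketch_iommu_table *tbl, long index)
{
    if (index < 0 || index >= SKETCH_TABLE_SIZE)
        return 0;
    return tbl->entries[index];
}

static const struct sketch_iommu_table_ops flat_ops = {
    .set = flat_set,
    .clear = flat_clear,
    .get = flat_get,
    .flush = NULL,
};

/* Refuses to initialise a table whose ops pointer has not been set. */
static int sketch_init_table(struct sketch_iommu_table *tbl)
{
    if (!tbl->it_ops || !tbl->it_ops->set || !tbl->it_ops->clear)
        return -1;
    tbl->it_ops->clear(tbl, 0, SKETCH_TABLE_SIZE);
    return 0;
}
```

      Keeping the pointer in the table rather than passing it to the init
      function matches the reasoning in the commit message: a table that
      never goes through init (the VFIO case) still needs its callbacks.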
    • Alexey Kardashevskiy's avatar
      powerpc/powernv: Do not set "read" flag if direction==DMA_NONE · 10b35b2b
      Alexey Kardashevskiy authored
      Normally a bitmap from the iommu_table is used to track what TCE entry
      is in use. Since we are going to use iommu_table without its locks and
      do xchg() instead, it becomes essential not to put bits which are not
      implied in the direction flag as the old TCE value (more precisely -
      the permission bits) will be used to decide whether to put the page or not.
      
      This adds iommu_direction_to_tce_perm() (its counterpart is there already)
      and uses it for powernv's pnv_tce_build().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      10b35b2b
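      A direction-to-permission mapping of the kind described above might look
      roughly like the following userspace sketch. The enum, the bit values
      and the function name are stand-ins chosen for illustration, not the
      kernel's definitions; the point is only that a "none" direction yields
      no permission bits at all.

```c
/* Stand-in for the kernel's DMA direction enum (names are hypothetical). */
enum sketch_dma_direction {
    SKETCH_DMA_BIDIRECTIONAL,
    SKETCH_DMA_TO_DEVICE,
    SKETCH_DMA_FROM_DEVICE,
    SKETCH_DMA_NONE
};

/* Assumed permission bits for this sketch only. */
#define SKETCH_TCE_READ  0x1UL
#define SKETCH_TCE_WRITE 0x2UL

/* Map a DMA direction to the permission bits a TCE entry should carry. */
static unsigned long sketch_direction_to_tce_perm(enum sketch_dma_direction dir)
{
    switch (dir) {
    case SKETCH_DMA_BIDIRECTIONAL:
        return SKETCH_TCE_READ | SKETCH_TCE_WRITE;
    case SKETCH_DMA_TO_DEVICE:
        /* The device reads from memory. */
        return SKETCH_TCE_READ;
    case SKETCH_DMA_FROM_DEVICE:
        /* The device writes to memory. */
        return SKETCH_TCE_WRITE;
    default:
        /* No direction: set no permission bits. */
        return 0;
    }
}
```

      Because the old TCE value is later used to decide whether a page must
      be released, an entry with no direction has to come out with both bits
      clear.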
    • Alexey Kardashevskiy's avatar
      vfio: powerpc/spapr: Rework groups attaching · 22af4859
      Alexey Kardashevskiy authored
      This is to make extended ownership and multiple groups support patches
      simpler for review.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      22af4859
    • Alexey Kardashevskiy's avatar
      vfio: powerpc/spapr: Moving pinning/unpinning to helpers · 649354b7
      Alexey Kardashevskiy authored
      This is a pretty mechanical patch to make next patches simpler.
      
      The new tce_iommu_unuse_page() helper does put_page() now but it might skip
      that after the memory registering patch is applied.
      
      As we are here, this removes unnecessary checks for a value returned
      by pfn_to_page() as it cannot possibly return NULL.
      
      This moves tce_iommu_disable() later to let tce_iommu_clear() know if
      the container has been enabled because if it has not been, then
      put_page() must not be called on TCEs from the TCE table. This situation
      is not yet possible but it will be once the KVM acceleration patchset is
      applied.
      
      This changes code to work with physical addresses rather than linear
      mapping addresses for better code readability. Following patches will
      add an xchg() callback for an IOMMU table which will accept/return
      physical addresses (unlike current tce_build()) which will eliminate
      redundant conversions.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      649354b7
    • Alexey Kardashevskiy's avatar
      vfio: powerpc/spapr: Disable DMA mappings on disabled container · 3c56e822
      Alexey Kardashevskiy authored
      At the moment DMA map/unmap requests are handled irrespective of
      the container's state. This allows the user space to pin memory which
      it might not be allowed to pin.
      
      This adds checks to MAP/UNMAP that the container is enabled, otherwise
      -EPERM is returned.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3c56e822
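      The enabled-state check described above can be sketched as follows. This
      is a simplified userspace model, not the VFIO driver code: struct
      sketch_container and both functions are hypothetical, and page
      accounting is reduced to a counter. Only the -EPERM behaviour on a
      disabled container is taken from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical container state: an enabled flag and a count of pinned pages. */
struct sketch_container {
    bool enabled;
    unsigned long mapped_pages;
};

/* Reject map requests until the container has been enabled. */
static int sketch_dma_map(struct sketch_container *c, unsigned long npages)
{
    if (!c->enabled)
        return -EPERM;
    c->mapped_pages += npages;
    return 0;
}

/* Unmap follows the same rule, then validates the request itself. */
static int sketch_dma_unmap(struct sketch_container *c, unsigned long npages)
{
    if (!c->enabled)
        return -EPERM;
    if (npages > c->mapped_pages)
        return -EINVAL;
    c->mapped_pages -= npages;
    return 0;
}
```

      Checking the state before touching any accounting means a disabled
      container can never end up with pinned pages, which is the problem the
      commit sets out to close.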