1. 21 Aug, 2019 7 commits
  2. 20 Aug, 2019 27 commits
  3. 19 Aug, 2019 6 commits
    • Cédric Le Goater's avatar
      powerpc/xmon: Add a dump of all XIVE interrupts · 39f14e79
      Cédric Le Goater authored
      Modify the xmon 'dxi' command to query all interrupts if no IRQ number
      is specified.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190814154754.23682-4-clg@kaod.org
      39f14e79
    • Cédric Le Goater's avatar
      powerpc/xive: Fix dump of XIVE interrupt under pseries · b4868ff5
      Cédric Le Goater authored
      The xmon 'dxi' command calls OPAL to query the XIVE configuration of a
      interrupt. This can only be done on baremetal (PowerNV) and it will
      crash a pseries machine.
      
      Introduce a new XIVE get_irq_config() operation which implements a
      different query depending on the platform, PowerNV or pseries, and
      modify xmon to use a top level wrapper.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190814154754.23682-3-clg@kaod.org
      b4868ff5
    • Cédric Le Goater's avatar
      powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL · c3e0dbd7
      Cédric Le Goater authored
      Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in
      the OPAL logs and also outputs some of the fields of the internal XIVE
      structures in Linux. The OPAL calls can only be done on baremetal
      (PowerNV) and they crash a pseries machine. Fix by checking the
      hypervisor feature of the CPU.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190814154754.23682-2-clg@kaod.org
      c3e0dbd7
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages · 201ed7f3
      Alexey Kardashevskiy authored
      At the moment we create a small window only for 32bit devices, the window
      maps 0..2GB of the PCI space only. For other devices we either use
      a sketchy bypass or hardware bypass but the former can only work if
      the amount of RAM is no bigger than the device's DMA mask and the latter
      requires devices to support at least 59bit DMA.
      
      This extends the default DMA window to the maximum size possible to allow
      a wider DMA mask than just 32bit. The default window size is now limited
      by the the iommu_table::it_map allocation bitmap which is a contiguous
      array, 1 bit per an IOMMU page.
      
      This increases the default IOMMU page size from hard coded 4K to
      the system page size to allow wider DMA masks.
      
      This increases the level number to not exceed the max order allocation
      limit per TCE level. By the same time, this keeps minimal levels number
      as 2 in order to save memory.
      
      As the extended window now overlaps the 32bit MMIO region, this adds
      an area reservation to iommu_init_table().
      
      After this change the default window size is 0x80000000000==1<<43 so
      devices limited to DMA mask smaller than the amount of system RAM can
      still use more than just 2GB of memory for DMA.
      
      This is an optimization and not a bug fix for DMA API usage.
      
      With the on-demand allocation of indirect TCE table levels enabled and
      2 levels, the first TCE level size is just
      1<<ceil((log2(0x7ffffffffff+1)-16)/2)=16384 TCEs or 2 system pages.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-5-aik@ozlabs.ru
      201ed7f3
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window · c37c792d
      Alexey Kardashevskiy authored
      We allocate only the first level of multilevel TCE tables for KVM
      already (alloc_userspace_copy==true), and the rest is allocated on demand.
      This is not enabled though for bare metal.
      
      This removes the KVM limitation (implicit, via the alloc_userspace_copy
      parameter) and always allocates just the first level. The on-demand
      allocation of missing levels is already implemented.
      
      As from now on DMA map might happen with disabled interrupts, this
      allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1].
      In practice just a single page is allocated there so chances for failure
      are quite low.
      
      To save time when creating a new clean table, this skips non-allocated
      indirect TCE entries in pnv_tce_free just like we already do in
      the VFIO IOMMU TCE driver.
      
      This changes the default level number from 1 to 2 to reduce the amount
      of memory required for the default 32bit DMA window at the boot time.
      The default window size is up to 2GB which requires 4MB of TCEs which is
      unlikely to be used entirely or at all as most devices these days are
      64bit capable so by switching to 2 levels by default we save 4032KB of
      RAM per a device.
      
      While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace
      can trigger this path via VFIO, see the failure and try creating a table
      again with different parameters which might succeed.
      
      [1]:
      ===
      BUG: sleeping function called from invalid context at mm/page_alloc.c:4596
      in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1
      2 locks held by scsi_eh_1/1038:
       #0: 000000005efd659a (&host->eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80
       #1: 0000000006cf56a6 (&(&host->lock)->rlock){....}, at: ata_exec_internal_sg+0xb0/0x5c0
      irq event stamp: 500
      hardirqs last  enabled at (499): [<c000000000cb8a74>] _raw_spin_unlock_irqrestore+0x94/0xd0
      hardirqs last disabled at (500): [<c000000000cb85c4>] _raw_spin_lock_irqsave+0x44/0x120
      softirqs last  enabled at (0): [<c000000000101120>] copy_process.isra.4.part.5+0x640/0x1a80
      softirqs last disabled at (0): [<0000000000000000>] 0x0
      CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634
      Call Trace:
      [c000003d064cef50] [c000000000c8e6c4] dump_stack+0xe8/0x164 (unreliable)
      [c000003d064cefa0] [c00000000014ed78] ___might_sleep+0x2f8/0x310
      [c000003d064cf020] [c0000000003ca084] __alloc_pages_nodemask+0x2a4/0x1560
      [c000003d064cf220] [c0000000000c2530] pnv_alloc_tce_level.isra.0+0x90/0x130
      [c000003d064cf290] [c0000000000c2888] pnv_tce+0x128/0x3b0
      [c000003d064cf360] [c0000000000c2c00] pnv_tce_build+0xb0/0xf0
      [c000003d064cf3c0] [c0000000000bbd9c] pnv_ioda2_tce_build+0x3c/0xb0
      [c000003d064cf400] [c00000000004cfe0] ppc_iommu_map_sg+0x210/0x550
      [c000003d064cf510] [c00000000004b7a4] dma_iommu_map_sg+0x74/0xb0
      [c000003d064cf530] [c000000000863944] ata_qc_issue+0x134/0x470
      [c000003d064cf5b0] [c000000000863ec4] ata_exec_internal_sg+0x244/0x5c0
      [c000003d064cf700] [c0000000008642d0] ata_exec_internal+0x90/0xe0
      [c000003d064cf780] [c0000000008650ac] ata_dev_read_id+0x2ec/0x640
      [c000003d064cf8d0] [c000000000878e28] ata_eh_recover+0x948/0x16d0
      [c000003d064cfa10] [c00000000087d760] sata_pmp_error_handler+0x480/0xbf0
      [c000003d064cfbc0] [c000000000884624] ahci_error_handler+0x74/0xe0
      [c000003d064cfbf0] [c000000000879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0
      [c000003d064cfca0] [c00000000087a544] ata_scsi_error+0xb4/0x100
      [c000003d064cfd00] [c000000000802450] scsi_error_handler+0x120/0x510
      [c000003d064cfdb0] [c000000000140c48] kthread+0x1b8/0x1c0
      [c000003d064cfe20] [c00000000000bd8c] ret_from_kernel_thread+0x5c/0x70
      ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      irq event stamp: 2305
      
      ========================================================
      hardirqs last  enabled at (2305): [<c00000000000e4c8>] fast_exc_return_irq+0x28/0x34
      hardirqs last disabled at (2303): [<c000000000cb9fd0>] __do_softirq+0x4a0/0x654
      WARNING: possible irq lock inversion dependency detected
      5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: G        W
      softirqs last  enabled at (2304): [<c000000000cba054>] __do_softirq+0x524/0x654
      softirqs last disabled at (2297): [<c00000000010f278>] irq_exit+0x128/0x180
      --------------------------------------------------------
      swapper/0/0 just changed the state of lock:
      0000000006cf56a6 (&(&host->lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (fs_reclaim){+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
       Possible interrupt unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(fs_reclaim);
                                     local_irq_disable();
                                     lock(&(&host->lock)->rlock);
                                     lock(fs_reclaim);
        <Interrupt>
          lock(&(&host->lock)->rlock);
      
       *** DEADLOCK ***
      
      no locks held by swapper/0/0.
      
      the shortest dependencies between 2nd lock and 1st lock:
       -> (fs_reclaim){+.+.} ops: 167579 {
          HARDIRQ-ON-W at:
                            lock_acquire+0xf8/0x2a0
                            fs_reclaim_acquire.part.23+0x44/0x60
                            kmem_cache_alloc_node_trace+0x80/0x590
                            alloc_desc+0x64/0x270
                            __irq_alloc_descs+0x2e4/0x3a0
                            irq_domain_alloc_descs+0xb0/0x150
                            irq_create_mapping+0x168/0x2c0
                            xics_smp_probe+0x2c/0x98
                            pnv_smp_probe+0x40/0x9c
                            smp_prepare_cpus+0x524/0x6c4
                            kernel_init_freeable+0x1b4/0x650
                            kernel_init+0x2c/0x148
                            ret_from_kernel_thread+0x5c/0x70
          SOFTIRQ-ON-W at:
                            lock_acquire+0xf8/0x2a0
                            fs_reclaim_acquire.part.23+0x44/0x60
                            kmem_cache_alloc_node_trace+0x80/0x590
                            alloc_desc+0x64/0x270
                            __irq_alloc_descs+0x2e4/0x3a0
                            irq_domain_alloc_descs+0xb0/0x150
                            irq_create_mapping+0x168/0x2c0
                            xics_smp_probe+0x2c/0x98
                            pnv_smp_probe+0x40/0x9c
                            smp_prepare_cpus+0x524/0x6c4
                            kernel_init_freeable+0x1b4/0x650
                            kernel_init+0x2c/0x148
                            ret_from_kernel_thread+0x5c/0x70
          INITIAL USE at:
                           lock_acquire+0xf8/0x2a0
                           fs_reclaim_acquire.part.23+0x44/0x60
                           kmem_cache_alloc_node_trace+0x80/0x590
                           alloc_desc+0x64/0x270
                           __irq_alloc_descs+0x2e4/0x3a0
                           irq_domain_alloc_descs+0xb0/0x150
                           irq_create_mapping+0x168/0x2c0
                           xics_smp_probe+0x2c/0x98
                           pnv_smp_probe+0x40/0x9c
                           smp_prepare_cpus+0x524/0x6c4
                           kernel_init_freeable+0x1b4/0x650
                           kernel_init+0x2c/0x148
                           ret_from_kernel_thread+0x5c/0x70
        }
      ===
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarAlistair Popple <alistair@popple.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-4-aik@ozlabs.ru
      c37c792d
    • Alexey Kardashevskiy's avatar
      powerpc/iommu: Allow bypass-only for DMA · 4f7e0bab
      Alexey Kardashevskiy authored
      POWER8 and newer support a bypass mode which maps all host memory to
      PCI buses so an IOMMU table is not always required. However if we fail to
      create such a table, the DMA setup fails and the kernel does not boot.
      
      This skips the 32bit DMA setup check if the bypass is selected.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-3-aik@ozlabs.ru
      4f7e0bab