1. 21 Aug, 2019 2 commits
  2. 20 Aug, 2019 27 commits
  3. 19 Aug, 2019 10 commits
    • Cédric Le Goater's avatar
      powerpc/xmon: Add a dump of all XIVE interrupts · 39f14e79
      Cédric Le Goater authored
      Modify the xmon 'dxi' command to query all interrupts if no IRQ number
      is specified.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190814154754.23682-4-clg@kaod.org
      39f14e79
    • Cédric Le Goater's avatar
      powerpc/xive: Fix dump of XIVE interrupt under pseries · b4868ff5
      Cédric Le Goater authored
      The xmon 'dxi' command calls OPAL to query the XIVE configuration of a
      interrupt. This can only be done on baremetal (PowerNV) and it will
      crash a pseries machine.
      
      Introduce a new XIVE get_irq_config() operation which implements a
      different query depending on the platform, PowerNV or pseries, and
      modify xmon to use a top level wrapper.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190814154754.23682-3-clg@kaod.org
      b4868ff5
    • Cédric Le Goater's avatar
      powerpc/xmon: Check for HV mode when dumping XIVE info from OPAL · c3e0dbd7
      Cédric Le Goater authored
      Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in
      the OPAL logs and also outputs some of the fields of the internal XIVE
      structures in Linux. The OPAL calls can only be done on baremetal
      (PowerNV) and they crash a pseries machine. Fix by checking the
      hypervisor feature of the CPU.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190814154754.23682-2-clg@kaod.org
      c3e0dbd7
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages · 201ed7f3
      Alexey Kardashevskiy authored
      At the moment we create a small window only for 32bit devices, the window
      maps 0..2GB of the PCI space only. For other devices we either use
      a sketchy bypass or hardware bypass but the former can only work if
      the amount of RAM is no bigger than the device's DMA mask and the latter
      requires devices to support at least 59bit DMA.
      
      This extends the default DMA window to the maximum size possible to allow
      a wider DMA mask than just 32bit. The default window size is now limited
      by the the iommu_table::it_map allocation bitmap which is a contiguous
      array, 1 bit per an IOMMU page.
      
      This increases the default IOMMU page size from hard coded 4K to
      the system page size to allow wider DMA masks.
      
      This increases the level number to not exceed the max order allocation
      limit per TCE level. By the same time, this keeps minimal levels number
      as 2 in order to save memory.
      
      As the extended window now overlaps the 32bit MMIO region, this adds
      an area reservation to iommu_init_table().
      
      After this change the default window size is 0x80000000000==1<<43 so
      devices limited to DMA mask smaller than the amount of system RAM can
      still use more than just 2GB of memory for DMA.
      
      This is an optimization and not a bug fix for DMA API usage.
      
      With the on-demand allocation of indirect TCE table levels enabled and
      2 levels, the first TCE level size is just
      1<<ceil((log2(0x7ffffffffff+1)-16)/2)=16384 TCEs or 2 system pages.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-5-aik@ozlabs.ru
      201ed7f3
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window · c37c792d
      Alexey Kardashevskiy authored
      We allocate only the first level of multilevel TCE tables for KVM
      already (alloc_userspace_copy==true), and the rest is allocated on demand.
      This is not enabled though for bare metal.
      
      This removes the KVM limitation (implicit, via the alloc_userspace_copy
      parameter) and always allocates just the first level. The on-demand
      allocation of missing levels is already implemented.
      
      As from now on DMA map might happen with disabled interrupts, this
      allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1].
      In practice just a single page is allocated there so chances for failure
      are quite low.
      
      To save time when creating a new clean table, this skips non-allocated
      indirect TCE entries in pnv_tce_free just like we already do in
      the VFIO IOMMU TCE driver.
      
      This changes the default level number from 1 to 2 to reduce the amount
      of memory required for the default 32bit DMA window at the boot time.
      The default window size is up to 2GB which requires 4MB of TCEs which is
      unlikely to be used entirely or at all as most devices these days are
      64bit capable so by switching to 2 levels by default we save 4032KB of
      RAM per a device.
      
      While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace
      can trigger this path via VFIO, see the failure and try creating a table
      again with different parameters which might succeed.
      
      [1]:
      ===
      BUG: sleeping function called from invalid context at mm/page_alloc.c:4596
      in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1
      2 locks held by scsi_eh_1/1038:
       #0: 000000005efd659a (&host->eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80
       #1: 0000000006cf56a6 (&(&host->lock)->rlock){....}, at: ata_exec_internal_sg+0xb0/0x5c0
      irq event stamp: 500
      hardirqs last  enabled at (499): [<c000000000cb8a74>] _raw_spin_unlock_irqrestore+0x94/0xd0
      hardirqs last disabled at (500): [<c000000000cb85c4>] _raw_spin_lock_irqsave+0x44/0x120
      softirqs last  enabled at (0): [<c000000000101120>] copy_process.isra.4.part.5+0x640/0x1a80
      softirqs last disabled at (0): [<0000000000000000>] 0x0
      CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634
      Call Trace:
      [c000003d064cef50] [c000000000c8e6c4] dump_stack+0xe8/0x164 (unreliable)
      [c000003d064cefa0] [c00000000014ed78] ___might_sleep+0x2f8/0x310
      [c000003d064cf020] [c0000000003ca084] __alloc_pages_nodemask+0x2a4/0x1560
      [c000003d064cf220] [c0000000000c2530] pnv_alloc_tce_level.isra.0+0x90/0x130
      [c000003d064cf290] [c0000000000c2888] pnv_tce+0x128/0x3b0
      [c000003d064cf360] [c0000000000c2c00] pnv_tce_build+0xb0/0xf0
      [c000003d064cf3c0] [c0000000000bbd9c] pnv_ioda2_tce_build+0x3c/0xb0
      [c000003d064cf400] [c00000000004cfe0] ppc_iommu_map_sg+0x210/0x550
      [c000003d064cf510] [c00000000004b7a4] dma_iommu_map_sg+0x74/0xb0
      [c000003d064cf530] [c000000000863944] ata_qc_issue+0x134/0x470
      [c000003d064cf5b0] [c000000000863ec4] ata_exec_internal_sg+0x244/0x5c0
      [c000003d064cf700] [c0000000008642d0] ata_exec_internal+0x90/0xe0
      [c000003d064cf780] [c0000000008650ac] ata_dev_read_id+0x2ec/0x640
      [c000003d064cf8d0] [c000000000878e28] ata_eh_recover+0x948/0x16d0
      [c000003d064cfa10] [c00000000087d760] sata_pmp_error_handler+0x480/0xbf0
      [c000003d064cfbc0] [c000000000884624] ahci_error_handler+0x74/0xe0
      [c000003d064cfbf0] [c000000000879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0
      [c000003d064cfca0] [c00000000087a544] ata_scsi_error+0xb4/0x100
      [c000003d064cfd00] [c000000000802450] scsi_error_handler+0x120/0x510
      [c000003d064cfdb0] [c000000000140c48] kthread+0x1b8/0x1c0
      [c000003d064cfe20] [c00000000000bd8c] ret_from_kernel_thread+0x5c/0x70
      ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
      irq event stamp: 2305
      
      ========================================================
      hardirqs last  enabled at (2305): [<c00000000000e4c8>] fast_exc_return_irq+0x28/0x34
      hardirqs last disabled at (2303): [<c000000000cb9fd0>] __do_softirq+0x4a0/0x654
      WARNING: possible irq lock inversion dependency detected
      5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: G        W
      softirqs last  enabled at (2304): [<c000000000cba054>] __do_softirq+0x524/0x654
      softirqs last disabled at (2297): [<c00000000010f278>] irq_exit+0x128/0x180
      --------------------------------------------------------
      swapper/0/0 just changed the state of lock:
      0000000006cf56a6 (&(&host->lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (fs_reclaim){+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
       Possible interrupt unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(fs_reclaim);
                                     local_irq_disable();
                                     lock(&(&host->lock)->rlock);
                                     lock(fs_reclaim);
        <Interrupt>
          lock(&(&host->lock)->rlock);
      
       *** DEADLOCK ***
      
      no locks held by swapper/0/0.
      
      the shortest dependencies between 2nd lock and 1st lock:
       -> (fs_reclaim){+.+.} ops: 167579 {
          HARDIRQ-ON-W at:
                            lock_acquire+0xf8/0x2a0
                            fs_reclaim_acquire.part.23+0x44/0x60
                            kmem_cache_alloc_node_trace+0x80/0x590
                            alloc_desc+0x64/0x270
                            __irq_alloc_descs+0x2e4/0x3a0
                            irq_domain_alloc_descs+0xb0/0x150
                            irq_create_mapping+0x168/0x2c0
                            xics_smp_probe+0x2c/0x98
                            pnv_smp_probe+0x40/0x9c
                            smp_prepare_cpus+0x524/0x6c4
                            kernel_init_freeable+0x1b4/0x650
                            kernel_init+0x2c/0x148
                            ret_from_kernel_thread+0x5c/0x70
          SOFTIRQ-ON-W at:
                            lock_acquire+0xf8/0x2a0
                            fs_reclaim_acquire.part.23+0x44/0x60
                            kmem_cache_alloc_node_trace+0x80/0x590
                            alloc_desc+0x64/0x270
                            __irq_alloc_descs+0x2e4/0x3a0
                            irq_domain_alloc_descs+0xb0/0x150
                            irq_create_mapping+0x168/0x2c0
                            xics_smp_probe+0x2c/0x98
                            pnv_smp_probe+0x40/0x9c
                            smp_prepare_cpus+0x524/0x6c4
                            kernel_init_freeable+0x1b4/0x650
                            kernel_init+0x2c/0x148
                            ret_from_kernel_thread+0x5c/0x70
          INITIAL USE at:
                           lock_acquire+0xf8/0x2a0
                           fs_reclaim_acquire.part.23+0x44/0x60
                           kmem_cache_alloc_node_trace+0x80/0x590
                           alloc_desc+0x64/0x270
                           __irq_alloc_descs+0x2e4/0x3a0
                           irq_domain_alloc_descs+0xb0/0x150
                           irq_create_mapping+0x168/0x2c0
                           xics_smp_probe+0x2c/0x98
                           pnv_smp_probe+0x40/0x9c
                           smp_prepare_cpus+0x524/0x6c4
                           kernel_init_freeable+0x1b4/0x650
                           kernel_init+0x2c/0x148
                           ret_from_kernel_thread+0x5c/0x70
        }
      ===
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarAlistair Popple <alistair@popple.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-4-aik@ozlabs.ru
      c37c792d
    • Alexey Kardashevskiy's avatar
      powerpc/iommu: Allow bypass-only for DMA · 4f7e0bab
      Alexey Kardashevskiy authored
      POWER8 and newer support a bypass mode which maps all host memory to
      PCI buses so an IOMMU table is not always required. However if we fail to
      create such a table, the DMA setup fails and the kernel does not boot.
      
      This skips the 32bit DMA setup check if the bypass is selected.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-3-aik@ozlabs.ru
      4f7e0bab
    • Alexey Kardashevskiy's avatar
      powerpc/powernv/ioda: Fix race in TCE level allocation · 56090a39
      Alexey Kardashevskiy authored
      pnv_tce() returns a pointer to a TCE entry and originally a TCE table
      would be pre-allocated. For the default case of 2GB window the table
      needs only a single level and that is fine. However if more levels are
      requested, it is possible to get a race when 2 threads want a pointer
      to a TCE entry from the same page of TCEs.
      
      This adds cmpxchg to handle the race. Note that once TCE is non-zero,
      it cannot become zero again.
      
      Fixes: a68bd126 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
      CC: stable@vger.kernel.org # v4.19+
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-2-aik@ozlabs.ru
      56090a39
    • Gautham R. Shenoy's avatar
      powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt() · c784be43
      Gautham R. Shenoy authored
      The calls to arch_add_memory()/arch_remove_memory() are always made
      with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin().
      On pSeries, arch_add_memory()/arch_remove_memory() eventually call
      resize_hpt() which in turn calls stop_machine() which acquires the
      read-side cpu_hotplug_lock again, thereby resulting in the recursive
      acquisition of this lock.
      
      In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
      lockup during a memory hotplug operation because cpus_read_lock() is a
      per-cpu rwsem read, which, in the fast-path (in the absence of the
      writer, which in our case is a CPU-hotplug operation) simply
      increments the read_count on the semaphore. Thus a recursive read in
      the fast-path doesn't cause any problems.
      
      However, we can hit this problem in practice if there is a concurrent
      CPU-Hotplug operation in progress which is waiting to acquire the
      write-side of the lock. This will cause the second recursive read to
      block until the writer finishes. While the writer is blocked since the
      first read holds the lock. Thus both the reader as well as the writers
      fail to make any progress thereby blocking both CPU-Hotplug as well as
      Memory Hotplug operations.
      
      Memory-Hotplug				CPU-Hotplug
      CPU 0					CPU 1
      ------                                  ------
      
      1. down_read(cpu_hotplug_lock.rw_sem)
         [memory_hotplug_begin]
      					2. down_write(cpu_hotplug_lock.rw_sem)
      					[cpu_up/cpu_down]
      3. down_read(cpu_hotplug_lock.rw_sem)
         [stop_machine()]
      
      Lockdep complains as follows in these code-paths.
      
       swapper/0/1 is trying to acquire lock:
       (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60
      
      but task is already holding lock:
      (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(cpu_hotplug_lock.rw_sem);
         lock(cpu_hotplug_lock.rw_sem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       3 locks held by swapper/0/1:
        #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
        #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
        #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0
      
      stack backtrace:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
       Call Trace:
         dump_stack+0xe8/0x164 (unreliable)
         __lock_acquire+0x1110/0x1c70
         lock_acquire+0x240/0x290
         cpus_read_lock+0x64/0xf0
         stop_machine+0x2c/0x60
         pseries_lpar_resize_hpt+0x19c/0x2c0
         resize_hpt_for_hotplug+0x70/0xd0
         arch_add_memory+0x58/0xfc
         devm_memremap_pages+0x5e8/0x8f0
         pmem_attach_disk+0x764/0x830
         nvdimm_bus_probe+0x118/0x240
         really_probe+0x230/0x4b0
         driver_probe_device+0x16c/0x1e0
         __driver_attach+0x148/0x1b0
         bus_for_each_dev+0x90/0x130
         driver_attach+0x34/0x50
         bus_add_driver+0x1a8/0x360
         driver_register+0x108/0x170
         __nd_driver_register+0xd0/0xf0
         nd_pmem_driver_init+0x34/0x48
         do_one_initcall+0x1e0/0x45c
         kernel_init_freeable+0x540/0x64c
         kernel_init+0x2c/0x160
         ret_from_kernel_thread+0x5c/0x68
      
      Fix this issue by
        1) Requiring all the calls to pseries_lpar_resize_hpt() be made
           with cpu_hotplug_lock held.
      
        2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
           as a consequence of 1)
      
        3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
           with cpu_hotplug_lock held.
      
      Fixes: dbcf929c ("powerpc/pseries: Add support for hash table resizing")
      Cc: stable@vger.kernel.org # v4.11+
      Reported-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com
      c784be43
    • Michael Ellerman's avatar
      Merge branch 'topic/ppc-kvm' into next · 1a47908e
      Michael Ellerman authored
      Merge our ppc-kvm topic branch. This contains several fixes for the XIVE
      interrupt controller that we are sharing with the KVM tree.
      1a47908e
    • Michael Ellerman's avatar
      Merge branch 'fixes' into next · 4215fa2d
      Michael Ellerman authored
      Merge in our fixes branch, which brings in clone3() as well as some
      implicit fallthrough fixes we want in next.
      4215fa2d
  4. 16 Aug, 2019 1 commit
    • Paul Mackerras's avatar
      powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race · da15c03b
      Paul Mackerras authored
      Testing has revealed the existence of a race condition where a XIVE
      interrupt being shut down can be in one of the XIVE interrupt queues
      (of which there are up to 8 per CPU, one for each priority) at the
      point where free_irq() is called.  If this happens, can return an
      interrupt number which has been shut down.  This can lead to various
      symptoms:
      
      - irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
        function gets called, resulting in the CPU's elevated interrupt
        priority (numerically lowered CPPR) never gets reset.  That then
        means that the CPU stops processing interrupts, causing device
        timeouts and other errors in various device drivers.
      
      - The irq descriptor or related data structures can be in the process
        of being freed as the interrupt code is using them.  This typically
        leads to crashes due to bad pointer dereferences.
      
      This race is basically what commit 62e04686 ("genirq: Add optional
      hardware synchronization for shutdown", 2019-06-28) is intended to
      fix, given a get_irqchip_state() method for the interrupt controller
      being used.  It works by polling the interrupt controller when an
      interrupt is being freed until the controller says it is not pending.
      
      With XIVE, the PQ bits of the interrupt source indicate the state of
      the interrupt source, and in particular the P bit goes from 0 to 1 at
      the point where the hardware writes an entry into the interrupt queue
      that this interrupt is directed towards.  Normally, the code will then
      process the interrupt and do an end-of-interrupt (EOI) operation which
      will reset PQ to 00 (assuming another interrupt hasn't been generated
      in the meantime).  However, there are situations where the code resets
      P even though a queue entry exists (for example, by setting PQ to 01,
      which disables the interrupt source), and also situations where the
      code leaves P at 1 after removing the queue entry (for example, this
      is done for escalation interrupts so they cannot fire again until
      they are explicitly re-enabled).
      
      The code already has a 'saved_p' flag for the interrupt source which
      indicates that a queue entry exists, although it isn't maintained
      consistently.  This patch adds a 'stale_p' flag to indicate that
      P has been left at 1 after processing a queue entry, and adds code
      to set and clear saved_p and stale_p as necessary to maintain a
      consistent indication of whether a queue entry may or may not exist.
      
      With this, we can implement xive_get_irqchip_state() by looking at
      stale_p, saved_p and the ESB PQ bits for the interrupt.
      
      There is some additional code to handle escalation interrupts
      properly; because they are enabled and disabled in KVM assembly code,
      which does not have access to the xive_irq_data struct for the
      escalation interrupt.  Hence, stale_p may be incorrect when the
      escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
      Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
      with some careful attention to barriers in order to ensure the correct
      result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().
      
      Finally, this adds code to make noise on the console (pr_crit and
      WARN_ON(1)) if we find an interrupt queue entry for an interrupt
      which does not have a descriptor.  While this won't catch the race
      reliably, if it does get triggered it will be an indication that
      the race is occurring and needs to be debugged.
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
      da15c03b