1. 05 Sep, 2019 5 commits
    • powerpc/eeh: Defer printing stack trace · 25baf3d8
      Oliver O'Halloran authored
      Currently we print a stack trace in the event handler to help with
      debugging EEH issues. In the case of surprise hot-unplug this is unneeded,
      so we want to prevent printing the stack trace unless we know it's due to
      an actual device error. To accomplish this, we can save a stack trace at
      the point of detection and only print it once the EEH recovery handler has
      determined the freeze was due to an actual error.
      
      Since the whole point of this is to prevent spurious EEH output, we also
      move a few prints out of the detection thread, or mark them as pr_debug
      so anyone interested can still get output from eeh_check_dev_failure()
      if they want.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190903101605.2890-6-oohall@gmail.com
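A minimal user-space sketch of the deferral described above, not the kernel code: the trace is copied into the event at detection time and only printed once the recovery side has decided the freeze is a real error. The names `freeze_event`, `save_trace()` and `report_trace()` are hypothetical, and the frames are plain addresses supplied by the caller rather than a real stack walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_FRAMES 16

/* Hypothetical event record: holds the trace captured at detection time. */
struct freeze_event {
    unsigned long frames[MAX_FRAMES];
    size_t nr_frames;
};

/* Detection path: copy the trace into the event and print nothing yet. */
static void save_trace(struct freeze_event *ev,
                       const unsigned long *frames, size_t n)
{
    assert(ev != NULL && frames != NULL);
    if (n > MAX_FRAMES)
        n = MAX_FRAMES;          /* truncate rather than overflow */
    memcpy(ev->frames, frames, n * sizeof(frames[0]));
    ev->nr_frames = n;
}

/*
 * Recovery path: print the saved trace only if the freeze turned out to be
 * a real device error. Returns the number of frames written to `out`.
 */
static size_t report_trace(const struct freeze_event *ev,
                           bool real_error, FILE *out)
{
    if (!real_error)
        return 0;                /* e.g. surprise hot-unplug: stay quiet */

    for (size_t i = 0; i < ev->nr_frames; i++)
        fprintf(out, "  [%zu] 0x%lx\n", i, ev->frames[i]);
    return ev->nr_frames;
}
```

The point of splitting the two functions is that the detection path stays silent and cheap, while the decision to print is made where the cause of the freeze is actually known.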
    • powerpc/eeh: Check slot presence state in eeh_handle_normal_event() · b104af5a
      Oliver O'Halloran authored
      When a device is surprise removed while undergoing IO we will probably
      get an EEH PE freeze due to MMIO timeouts and other errors. When a freeze
      is detected we send a recovery event to the EEH worker thread which will
      notify drivers, and perform recovery as needed.
      
      In the event of a hot-remove we don't want recovery to occur since there
      isn't a device to recover. The recovery process is fairly long due to
      the number of wait states (required by PCIe) which causes problems when
      devices are removed and replaced (e.g. hot swapping of U.2 NVMe drives).
      
      To decide whether to skip the recovery process we can use the
      get_adapter_state() operation of the hotplug_slot to check whether the
      slot contains a device. If the slot is empty we can skip recovery
      entirely.
      
      One thing to note is that the slot being EEH frozen does not prevent the
      hotplug driver from working. We don't have the EEH recovery thread
      remove any of the devices since it's assumed that the hotplug driver
      will handle tearing down the slot state.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190903101605.2890-5-oohall@gmail.com
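The presence check can be illustrated with a small stand-alone model. This is a sketch under assumptions, not the kernel implementation: `struct slot`, `decide_recovery()` and the callback signature are hypothetical, and the choice to fall back to running recovery when the slot cannot be queried is an assumption made here for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a hotplug slot with a presence query. */
struct slot {
    /* Stores 1 in *present if occupied, 0 if empty; returns 0 on success. */
    int (*get_adapter_state)(struct slot *slot, unsigned char *present);
    unsigned char occupied;      /* backing state for the demo callback */
};

enum recovery_action { RECOVERY_SKIP, RECOVERY_RUN };

/* Demo callback: reports whatever the `occupied` field says. */
static int demo_get_state(struct slot *slot, unsigned char *present)
{
    assert(slot != NULL && present != NULL);
    *present = slot->occupied;
    return 0;
}

static enum recovery_action decide_recovery(struct slot *slot)
{
    unsigned char present = 0;

    /* No slot or no query available: assume the device is still there. */
    if (!slot || !slot->get_adapter_state)
        return RECOVERY_RUN;

    /* If the query itself fails, err on the side of recovering. */
    if (slot->get_adapter_state(slot, &present) != 0)
        return RECOVERY_RUN;

    /* Empty slot: nothing to recover, so skip the long recovery path. */
    return present ? RECOVERY_RUN : RECOVERY_SKIP;
}
```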
    • powerpc/eeh: Make permanently failed devices non-actionable · 38ddc011
      Oliver O'Halloran authored
      If a device is torn down by a hotplug slot driver it's marked as removed
      and marked as permanently failed. There's no point in trying to recover a
      permanently failed device so it should be considered un-actionable.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190903101605.2890-4-oohall@gmail.com
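As a rough illustration of what "non-actionable" means here, the check can be modelled as a predicate that the recovery code consults before touching a device. The types and names below (`device_model`, `channel_state`, `device_actionable()`) are hypothetical and only mirror the idea in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error states for a device's I/O channel. */
enum channel_state { CHANNEL_NORMAL, CHANNEL_FROZEN, CHANNEL_PERM_FAILURE };

/* Hypothetical per-device state, set by the hotplug teardown path. */
struct device_model {
    bool removed;
    enum channel_state state;
};

/* Returns true only if recovery should act on this device. */
static bool device_actionable(const struct device_model *dev)
{
    if (!dev)
        return false;            /* nothing to act on */
    if (dev->removed)
        return false;            /* already torn down */
    if (dev->state == CHANNEL_PERM_FAILURE)
        return false;            /* recovery cannot bring it back */
    return true;
}
```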
    • powerpc/eeh: Fix race when freeing PDNs · 5ef753ae
      Oliver O'Halloran authored
      When hot-adding devices we rely on the hotplug driver to create pci_dn's
      for the devices under the hotplug slot. Conversely, when hot-removing the
      driver will remove the pci_dn's that it created. This is a problem because
      the pci_dev is still live until its refcount drops to zero, which can take
      a while if the driver is slow to tear down its internal state. Ideally, the
      driver would not attempt to perform any config accesses to the device once
      it's been marked as removed, but sometimes it happens. As a result, we
      might attempt to access the pci_dn for a device that has been torn down and
      the kernel may crash as a result.
      
      To fix this, don't free the pci_dn unless the corresponding pci_dev has
      been released.  If the pci_dev is still live, then we mark the pci_dn with
      a flag that indicates the pci_dev's release function should free it.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190903101605.2890-3-oohall@gmail.com
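The flag-based hand-off of ownership described above can be sketched in user space as follows. This is only a model: `struct node` stands in for the pci_dn, `struct device` for the pci_dev, and `NODE_FLAG_DEAD`, `node_remove()` and `device_release()` are hypothetical names. The `nodes_freed` counter exists purely so the effect is observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NODE_FLAG_DEAD 0x1u      /* hypothetical: "free me on device release" */

struct node {                    /* stands in for the pci_dn */
    unsigned int flags;
};

struct device {                  /* stands in for the pci_dev */
    struct node *node;
    bool live;                   /* true while references are still held */
};

static int nodes_freed;          /* demo counter, not part of the technique */

static void node_free(struct node *n)
{
    free(n);
    nodes_freed++;
}

/* Hotplug removal path: free now only if no live device still uses the node. */
static void node_remove(struct node *n, struct device *dev)
{
    assert(n != NULL);
    if (dev && dev->live) {
        n->flags |= NODE_FLAG_DEAD;   /* defer: the release path frees it */
        return;
    }
    node_free(n);
}

/* Device release path: runs once the last reference has been dropped. */
static void device_release(struct device *dev)
{
    assert(dev != NULL);
    dev->live = false;
    if (dev->node && (dev->node->flags & NODE_FLAG_DEAD))
        node_free(dev->node);
    dev->node = NULL;
}
```

Whichever side finishes last ends up doing the free, so a late config access from a slow driver still finds a valid node.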
    • powerpc/eeh: Clean up EEH PEs after recovery finishes · 799abe28
      Oliver O'Halloran authored
      When the last device in an eeh_pe is removed the eeh_pe structure itself
      (and any empty parents) are freed since they are no longer needed. This
      results in a crash when a hotplug driver is involved since the following
      may occur:
      
      1. Device is surprise removed.
      2. Driver performs an MMIO, which fails and queues an eeh_event.
      3. Hotplug driver receives a hotplug interrupt and removes any
         pci_devs that were under the slot.
      4. pci_dev is torn down and the eeh_pe is freed.
      5. The EEH event handler thread processes the eeh_event and crashes
         since the eeh_pe pointer in the eeh_event structure is no
         longer valid.
      
      Crashing is generally considered poor form. Instead of doing that, use
      the fact that PEs are marked as EEH_PE_INVALID to keep them around until
      the end of the recovery cycle, at which point we can safely prune any
      empty PEs.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190903101605.2890-2-oohall@gmail.com
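The mark-now, prune-later approach from this commit can be modelled with a toy PE tree. This is a sketch under assumptions rather than the kernel code: `struct pe`, `PE_FLAG_INVALID`, `pe_remove_device()` and `pe_prune()` are hypothetical names, and unlike a real system the toy will also free the root once it is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PE_FLAG_INVALID 0x1u     /* hypothetical: empty, waiting to be pruned */

struct pe {
    struct pe *parent;
    unsigned int nr_devices;
    unsigned int nr_children;
    unsigned int flags;
};

static int pes_freed;            /* demo counter, not part of the technique */

static struct pe *pe_new(struct pe *parent)
{
    struct pe *pe = calloc(1, sizeof(*pe));

    if (!pe)
        return NULL;
    pe->parent = parent;
    if (parent)
        parent->nr_children++;
    return pe;
}

/* Device removal: mark the PE invalid when it empties, but do not free it. */
static void pe_remove_device(struct pe *pe)
{
    assert(pe != NULL && pe->nr_devices > 0);
    if (--pe->nr_devices == 0 && pe->nr_children == 0)
        pe->flags |= PE_FLAG_INVALID;
}

/* End of recovery: free this PE and any ancestors that became empty. */
static void pe_prune(struct pe *pe)
{
    while (pe && (pe->flags & PE_FLAG_INVALID)) {
        struct pe *parent = pe->parent;

        free(pe);
        pes_freed++;
        if (parent && --parent->nr_children == 0 && parent->nr_devices == 0)
            parent->flags |= PE_FLAG_INVALID;
        pe = parent;
    }
}
```

Between `pe_remove_device()` and `pe_prune()` the structure stays allocated, so a queued event that still holds a pointer to it can be processed safely.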
  2. 30 Aug, 2019 35 commits