Commit 3c3ab37f authored by Bjorn Helgaas

Merge branch 'pci/aer'

  - Decode AER errors with names similar to "lspci" (Tyler Baicar)

  - Expose AER statistics in sysfs (Rajat Jain)

  - Clear AER status bits selectively based on the type of recovery (Oza
    Pawandeep)

  - Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
    Gagniuc)

  - Don't clear AER status bits if we're using the "Firmware-First"
    strategy where firmware owns the registers (Alexandru Gagniuc)

* pci/aer:
  PCI/AER: Don't clear AER bits if error handling is Firmware-First
  PCI/AER: Remove duplicate PCI_EXP_AER_FLAGS definition
  PCI/portdrv: Remove pcie_portdrv_err_handler.slot_reset
  PCI/AER: Clear device status bits during ERR_COR handling
  PCI/AER: Clear device status bits during ERR_FATAL and ERR_NONFATAL
  PCI/AER: Remove ERR_FATAL code from ERR_NONFATAL path
  PCI/AER: Factor out ERR_NONFATAL status bit clearing
  PCI/AER: Clear only ERR_NONFATAL bits during non-fatal recovery
  PCI/AER: Clear only ERR_FATAL status bits during fatal recovery
  PCI/AER: Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST
  PCI/AER: Add sysfs attributes for rootport cumulative stats
  PCI/AER: Add sysfs attributes to provide AER stats and breakdown
  PCI/AER: Define aer_stats structure for AER capable devices
  PCI/AER: Move internal declarations to drivers/pci/pci.h
  PCI/AER: Adopt lspci names for AER error decoding
  PCI/AER: Expose internal API for obtaining AER information

# Conflicts:
#	drivers/pci/pci.h
parents af863d18 45687f96
==========================
PCIe Device AER statistics
==========================
These attributes show up under all the devices that are AER capable. These
statistical counters indicate the errors "as seen/reported by the device".
Note that this may mean that if an endpoint is causing problems, the AER
counters may increment at its link partner (e.g. root port) because the
errors may be "seen" / reported by the link partner and not the
problematic endpoint itself (which may report all counters as 0 as it never
saw any problems).
Where: /sys/bus/pci/devices/<dev>/aer_dev_correctable
Date: July 2018
Kernel Version: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of correctable errors seen and reported by this
PCI device using ERR_COR. Because multiple errors may be
reported using a single ERR_COR message, TOTAL_ERR_COR at the
end of the file may not match the actual total of all the
errors in the file. Sample output:
-------------------------------------------------------------------------
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
Receiver Error 2
Bad TLP 0
Bad DLLP 0
RELAY_NUM Rollover 0
Replay Timer Timeout 0
Advisory Non-Fatal 0
Corrected Internal Error 0
Header Log Overflow 0
TOTAL_ERR_COR 2
-------------------------------------------------------------------------
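The sample output above is a list of "<error name> <count>" lines in which the name itself may contain spaces. As a minimal userspace sketch (not part of this commit), one line could be split as shown here; the function name aer_parse_stat_line() and the 128-byte line limit are assumptions made for illustration only.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split one "<error name> <count>" line, e.g. "Receiver Error 2".
 * The name may contain spaces, so the count is taken to be the last
 * space-separated token. Hypothetical helper, not a kernel API.
 * Returns 0 on success, -1 on malformed or oversized input.
 */
int aer_parse_stat_line(const char *line, char *name, size_t name_len,
                        unsigned long long *count)
{
    char buf[128];
    size_t len = strlen(line);
    char *sep, *end;

    /* Drop the trailing newline and any trailing blanks. */
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;
    if (len == 0 || len >= sizeof(buf))
        return -1;
    memcpy(buf, line, len);
    buf[len] = '\0';

    /* The count is the last token; everything before it is the name. */
    sep = strrchr(buf, ' ');
    if (!sep || sep == buf || !isdigit((unsigned char)sep[1]))
        return -1;

    errno = 0;
    *count = strtoull(sep + 1, &end, 10);
    if (errno || *end != '\0')
        return -1;

    /* Trim blanks between the name and the count. */
    while (sep > buf && isspace((unsigned char)sep[-1]))
        sep--;
    if (sep == buf)
        return -1;
    *sep = '\0';

    if (strlen(buf) >= name_len)
        return -1;
    strcpy(name, buf);
    return 0;
}
```

A caller would feed it each line read from aer_dev_correctable, aer_dev_fatal or aer_dev_nonfatal. As the description notes, summing the per-error counts is not guaranteed to reproduce the TOTAL_ERR_* line.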
Where: /sys/bus/pci/devices/<dev>/aer_dev_fatal
Date: July 2018
Kernel Version: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable fatal errors seen and reported by this
PCI device using ERR_FATAL. Because multiple errors may be
reported using a single ERR_FATAL message, TOTAL_ERR_FATAL at
the end of the file may not match the actual total of all the
errors in the file. Sample output:
-------------------------------------------------------------------------
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_FATAL 0
-------------------------------------------------------------------------
Where: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
Date: July 2018
Kernel Version: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable nonfatal errors seen and reported by this
PCI device using ERR_NONFATAL. Because multiple errors may be
reported using a single ERR_NONFATAL message,
TOTAL_ERR_NONFATAL at the end of the file may not match the
actual total of all the errors in the file. Sample output:
-------------------------------------------------------------------------
localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
Poisoned TLP 0
Flow Control Protocol 0
Completion Timeout 0
Completer Abort 0
Unexpected Completion 0
Receiver Overflow 0
Malformed TLP 0
ECRC 0
Unsupported Request 0
ACS Violation 0
Uncorrectable Internal Error 0
MC Blocked TLP 0
AtomicOp Egress Blocked 0
TLP Prefix Blocked Error 0
TOTAL_ERR_NONFATAL 0
-------------------------------------------------------------------------
============================
PCIe Rootport AER statistics
============================
These attributes show up under only the rootports (or root complex event
collectors) that are AER capable. These indicate the number of error messages as
"reported to" the rootport. Please note that the rootports also transmit
(internally) the ERR_* messages for errors seen by the internal rootport PCI
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.
Where: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_cor
Date: July 2018
Kernel Version: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.
Where: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_fatal
Date: July 2018
Kernel Version: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.
Where: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_nonfatal
Date: July 2018
Kernel Version: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.
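The three rootport attributes above differ only in their suffix. A minimal userspace sketch (not part of this commit) of building the documented path and reading one counter follows. The function names are hypothetical, and the assumption that each attribute holds a single decimal number is inferred from the descriptions rather than stated in them.

```c
#include <stdio.h>

/*
 * Build the path of one rootport total counter as documented above.
 * 'dev' is a PCI address such as "0000:00:1c.0"; 'kind' is "cor",
 * "fatal" or "nonfatal". Hypothetical helper, not a kernel API.
 * Returns 0 on success, -1 if 'buf' is too small.
 */
int aer_rootport_total_path(char *buf, size_t len, const char *dev,
                            const char *kind)
{
    int n = snprintf(buf, len,
                     "/sys/bus/pci/devices/%s/aer_stats/aer_rootport_total_err_%s",
                     dev, kind);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the counter, assuming the attribute contains one decimal value.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 */
int aer_read_total(const char *path, unsigned long long *total)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", total) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

On a device that is not an AER-capable rootport the attribute is expected to be absent, so a failed open is treated as an ordinary error rather than a zero count.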
@@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends
 the error message to root port. Pls. refer to pci express specs for
 other fields.
+2.4 AER Statistics / Counters
+When PCIe AER errors are captured, the counters / statistics are also exposed
+in the form of sysfs attributes which are documented at
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
 3. Developer Guide
......
@@ -1746,6 +1746,9 @@ static const struct attribute_group *pci_dev_attr_groups[] = {
 #endif
 &pci_bridge_attr_group,
 &pcie_dev_attr_group,
+#ifdef CONFIG_PCIEAER
+&aer_stats_attr_group,
+#endif
 NULL,
 };
......
@@ -311,6 +311,34 @@ static inline bool pci_dev_is_added(const struct pci_dev *dev)
 return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
 }
+#ifdef CONFIG_PCIEAER
+#include <linux/aer.h>
+#define AER_MAX_MULTI_ERR_DEVICES 5 /* Not likely to have more */
+struct aer_err_info {
+struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+int error_dev_num;
+unsigned int id:16;
+unsigned int severity:2; /* 0:NONFATAL | 1:FATAL | 2:COR */
+unsigned int __pad1:5;
+unsigned int multi_error_valid:1;
+unsigned int first_error:5;
+unsigned int __pad2:2;
+unsigned int tlp_header_valid:1;
+unsigned int status; /* COR/UNCOR Error Status */
+unsigned int mask; /* COR/UNCOR Error Mask */
+struct aer_header_log_regs tlp; /* TLP Header */
+};
+int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
+void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
+#endif /* CONFIG_PCIEAER */
 #ifdef CONFIG_PCI_ATS
 void pci_restore_ats_state(struct pci_dev *dev);
 #else
@@ -467,4 +495,19 @@ static inline int devm_of_pci_get_host_bridge_resources(struct device *dev,
 }
 #endif
+#ifdef CONFIG_PCIEAER
+void pci_no_aer(void);
+void pci_aer_init(struct pci_dev *dev);
+void pci_aer_exit(struct pci_dev *dev);
+extern const struct attribute_group aer_stats_attr_group;
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
+void pci_aer_clear_device_status(struct pci_dev *dev);
+#else
+static inline void pci_no_aer(void) { }
+static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
+static inline void pci_aer_exit(struct pci_dev *d) { }
+static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
+static inline void pci_aer_clear_device_status(struct pci_dev *dev) { }
+#endif
 #endif /* DRIVERS_PCI_H */
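The severity bitfield of struct aer_err_info in the hunks above is documented as 0:NONFATAL, 1:FATAL, 2:COR. As an illustrative sketch only, the following standalone function pairs each value with the per-device statistics file described earlier in this commit. The function name is hypothetical, and the mapping is an assumption about how severities relate to the sysfs files, not code taken from the collapsed diff.

```c
#include <stddef.h>

/*
 * Map an aer_err_info-style severity value to the sysfs attribute that
 * lists errors of that class. Hypothetical helper for illustration;
 * the value encoding follows the struct comment (0, 1, 2).
 * Returns NULL for values outside that range.
 */
const char *aer_severity_attr(unsigned int severity)
{
    static const char *const attrs[] = {
        "aer_dev_nonfatal",     /* 0: NONFATAL */
        "aer_dev_fatal",        /* 1: FATAL */
        "aer_dev_correctable",  /* 2: COR */
    };

    if (severity >= sizeof(attrs) / sizeof(attrs[0]))
        return NULL;
    return attrs[severity];
}
```

The field is two bits wide, so the value 3 is representable but undocumented; the sketch returns NULL for it instead of guessing.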
This diff is collapsed.
@@ -252,6 +252,7 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 dev->error_state = state;
 pci_walk_bus(dev->subordinate, cb, &result_data);
 if (cb == report_resume) {
+pci_aer_clear_device_status(dev);
 pci_cleanup_aer_uncorrect_error_status(dev);
 dev->error_state = pci_channel_io_normal;
 }
@@ -259,15 +260,10 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 /*
 * If the error is reported by an end point, we think this
 * error is related to the upstream link of the end point.
-*/
-if (state == pci_channel_io_normal)
-/*
-* the error is non fatal so the bus is ok, just invoke
+* The error is non fatal so the bus is ok; just invoke
 * the callback for the function that logged the error.
 */
 cb(dev, &result_data);
-else
-pci_walk_bus(dev->bus, cb, &result_data);
 }
 return result_data.result;
@@ -317,7 +313,8 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
 * do error recovery on all subordinates of the bridge instead
 * of the bridge and clear the error status of the bridge.
 */
-pci_cleanup_aer_uncorrect_error_status(dev);
+pci_aer_clear_fatal_status(dev);
+pci_aer_clear_device_status(dev);
 }
 if (result == PCI_ERS_RESULT_RECOVERED) {
......
@@ -42,17 +42,6 @@ __setup("pcie_ports=", pcie_port_setup);
 /* global data */
-static int pcie_portdrv_restore_config(struct pci_dev *dev)
-{
-int retval;
-retval = pci_enable_device(dev);
-if (retval)
-return retval;
-pci_set_master(dev);
-return 0;
-}
 #ifdef CONFIG_PM
 static int pcie_port_runtime_suspend(struct device *dev)
 {
@@ -160,19 +149,6 @@ static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
 return PCI_ERS_RESULT_RECOVERED;
 }
-static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
-{
-/* If fatal, restore cfg space for possible link reset at upstream */
-if (dev->error_state == pci_channel_io_frozen) {
-dev->state_saved = true;
-pci_restore_state(dev);
-pcie_portdrv_restore_config(dev);
-pci_enable_pcie_error_reporting(dev);
-}
-return PCI_ERS_RESULT_RECOVERED;
-}
 static int resume_iter(struct device *device, void *data)
 {
 struct pcie_device *pcie_device;
@@ -208,7 +184,6 @@ static const struct pci_device_id port_pci_ids[] = { {
 static const struct pci_error_handlers pcie_portdrv_err_handler = {
 .error_detected = pcie_portdrv_error_detected,
 .mmio_enabled = pcie_portdrv_mmio_enabled,
-.slot_reset = pcie_portdrv_slot_reset,
 .resume = pcie_portdrv_err_resume,
 };
......
@@ -2064,6 +2064,7 @@ static void pci_configure_device(struct pci_dev *dev)
 static void pci_release_capabilities(struct pci_dev *dev)
 {
+pci_aer_exit(dev);
 pci_vpd_release(dev);
 pci_iov_release(dev);
 pci_free_cap_save_buffers(dev);
......
@@ -299,6 +299,7 @@ struct pci_dev {
 u8 hdr_type; /* PCI header type (`multi' flag masked out) */
 #ifdef CONFIG_PCIEAER
 u16 aer_cap; /* AER capability offset */
+struct aer_stats *aer_stats; /* AER stats for this device */
 #endif
 u8 pcie_cap; /* PCIe capability offset */
 u8 msi_cap; /* MSI capability offset */
@@ -1469,13 +1470,9 @@ static inline bool pcie_aspm_support_enabled(void) { return false; }
 #endif
 #ifdef CONFIG_PCIEAER
-void pci_no_aer(void);
 bool pci_aer_available(void);
-int pci_aer_init(struct pci_dev *dev);
 #else
-static inline void pci_no_aer(void) { }
 static inline bool pci_aer_available(void) { return false; }
-static inline int pci_aer_init(struct pci_dev *d) { return -ENODEV; }
 #endif
 #ifdef CONFIG_PCIE_ECRC
......