    x86/mce: Add per-bank CMCI storm mitigation · 7eae17c4
    Tony Luck authored
    This is the core functionality to track CMCI storms at machine check
    bank granularity. Subsequent patches will add the vendor-specific hooks
    that supply input to the storm detection and take action at the start
    and end of a storm.
    
    machine_check_poll() is called both by the CMCI interrupt code and for
    periodic polls from a timer. Add a hook in this routine to maintain
    a bitmap history for each bank showing whether the bank logged a
    corrected error or not each time it is polled.
    
    In normal operation, the interval between polls of these banks determines
    how far to shift the history. The 64-bit width corresponds to about one
    second.
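
    As an illustration only, the history update described above can be
    sketched in userspace C. The name history_update() and its parameters
    are hypothetical and are not taken from this patch; how the shift
    count is derived from the elapsed time between polls is deliberately
    left to the caller.

```c
#include <stdint.h>

#define HISTORY_BITS 64	/* width of the per-bank history word */

/*
 * Hypothetical helper, not the kernel's code: age the history by
 * 'shift' poll slots, then record the newest poll result in bit 0.
 * Shifting a 64-bit value by 64 or more is undefined behaviour in C,
 * so a shift that large is treated as "history fully expired".
 */
static uint64_t history_update(uint64_t history, unsigned int shift,
			       int logged_corrected)
{
	history = (shift < HISTORY_BITS) ? history << shift : 0;

	if (logged_corrected)
		history |= 1;

	return history;
}
```

    With this layout bit 0 is always the most recent poll, and older
    results move toward bit 63 until they fall off the end.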
    
    When a storm is observed, a CPU-vendor-specific action is taken to reduce
    or stop CMCI from the bank that is the source of the storm. The bank is
    added to the bitmap of banks for this CPU to poll, and the polling rate
    is increased to once per second. During a storm, each bit in the history
    records the status of the bank at one poll, so the 64-bit history covers
    just over a minute.
    
    Declare a storm for that bank if the number of corrected interrupts seen
    in that history is above some threshold (defined as 5 in this series,
    could be tuned later if there is data to suggest a better value).
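
    A minimal sketch of the storm-begin check, assuming the history word
    keeps one bit per poll. The names count_set_bits() and
    storm_should_begin() are hypothetical, and treating "above some
    threshold" as "at least 5 bits set" is an assumption of this sketch
    rather than something the text above states.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_BEGIN_THRESHOLD 5	/* value quoted in the text above */

/* Portable population count: clears the lowest set bit per iteration. */
static unsigned int count_set_bits(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}
	return n;
}

/*
 * Hypothetical check: a storm begins once enough polls in the history
 * window logged a corrected error. The ">=" comparison is an assumption.
 */
static bool storm_should_begin(uint64_t history)
{
	return count_set_bits(history) >= STORM_BEGIN_THRESHOLD;
}
```

    Counting set bits rather than looking for a consecutive run means
    that errors scattered across the window still add up to a storm.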
    
    A storm on a bank ends if enough consecutive polls of the bank show no
    corrected errors (defined as 30, may also change). That calls the CPU
    vendor specific function to revert to normal operational mode, and
    changes the polling rate back to the default.
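
    The storm-end condition can be sketched the same way, assuming bit 0
    of the history is the most recent poll. STORM_END_POLLS and
    storm_should_end() are hypothetical names; the value 30 is the one
    quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_END_POLLS 30	/* consecutive clean polls needed to end a storm */

/*
 * Hypothetical check: the storm is over when none of the most recent
 * STORM_END_POLLS history bits (bits 0..29) is set. Older bits are
 * ignored.
 */
static bool storm_should_end(uint64_t history)
{
	const uint64_t recent = (UINT64_C(1) << STORM_END_POLLS) - 1;

	return (history & recent) == 0;
}
```

    Because only the low bits are tested, a single new corrected error
    restarts the count of clean polls.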
    
      [ bp: Massage. ]
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Link: https://lore.kernel.org/r/20231115195450.12963-3-tony.luck@intel.com