    RAS: Introduce a FRU memory poison manager · 6f15e617
    Yazen Ghannam authored
    Memory errors are an expected occurrence on systems with high memory
    density. Generally, errors within a small number of unique physical
    locations are acceptable, based on manufacturer and/or admin policy.
    During run time, memory with errors may be retired so it is no longer
    used by the system. This is done in mm through page poisoning, and the
    effect will remain until the system is restarted.
    
    If a memory location is consistently faulty, then the same run time
    error handling may occur in the next reboot cycle, terminating jobs
    because of memory that is already known to be bad. This could be
    prevented if information from the previous boot were not lost.
    
    Some add-in cards with driver-managed memory have on-board persistent
    storage. Their driver saves memory error information to the persistent
    storage during run time. The information is then restored after reset,
    and known bad memory will be retired before the hardware is used.
    A running log of bad memory locations is kept across multiple resets.
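
The save-and-restore flow described above can be sketched in userspace C. This is a minimal illustration only: the `bad_mem_log` structure, the function names, the `MAX_BAD_ADDRS` limit, and the use of an ordinary file as a stand-in for on-board persistent storage are all assumptions made for this sketch, not how any particular add-in card driver works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_BAD_ADDRS 8 /* hypothetical policy limit on unique bad locations */

/* Hypothetical running log of bad locations, kept across resets. */
struct bad_mem_log {
	uint32_t count;
	uint32_t reserved; /* explicit padding keeps the stored layout fixed */
	uint64_t addr[MAX_BAD_ADDRS];
};

/* The log is stored byte-for-byte, so its size must not vary. */
static_assert(sizeof(struct bad_mem_log) == 8 + 8 * MAX_BAD_ADDRS,
	      "bad_mem_log must have no implicit padding");

/* Run time: note a new bad address. Returns false if known or log is full. */
static bool log_add(struct bad_mem_log *log, uint64_t addr)
{
	for (uint32_t i = 0; i < log->count; i++)
		if (log->addr[i] == addr)
			return false;
	if (log->count >= MAX_BAD_ADDRS)
		return false;
	log->addr[log->count++] = addr;
	return true;
}

/* Run time: write the whole log to the (stand-in) persistent storage. */
static int log_save(const struct bad_mem_log *log, const char *path)
{
	FILE *f = fopen(path, "wb");
	size_t written;

	if (!f)
		return -1;
	written = fwrite(log, sizeof(*log), 1, f);
	if (fclose(f) != 0 || written != 1)
		return -1;
	return 0;
}

/* After reset: read the log back. A missing file means an empty log. */
static int log_restore(struct bad_mem_log *log, const char *path)
{
	FILE *f = fopen(path, "rb");
	size_t got;

	memset(log, 0, sizeof(*log));
	if (!f)
		return 0;
	got = fread(log, sizeof(*log), 1, f);
	fclose(f);
	if (got != 1 || log->count > MAX_BAD_ADDRS) {
		memset(log, 0, sizeof(*log)); /* reject truncated or corrupt data */
		return -1;
	}
	return 0;
}

/* After reset: retire every known-bad location before the memory is used. */
static uint32_t log_retire_all(const struct bad_mem_log *log,
			       void (*retire)(uint64_t addr))
{
	for (uint32_t i = 0; i < log->count; i++)
		retire(log->addr[i]);
	return log->count;
}
```

In this sketch, a caller would run `log_restore()` followed by `log_retire_all()` early after reset, then call `log_add()` and `log_save()` each time a new error is handled at run time. Rewriting the whole log on every save keeps the example short; a real implementation would also need to consider write endurance and power loss during the write.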
    
    A similar solution is desirable for CPUs. However, this solution should
    leverage industry-standard components as much as possible, rather than
    a bespoke platform driver.
    
    Two components are needed: a record format and a persistent storage
    interface.
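
To make the split between the two components concrete, here is a small userspace C sketch. Everything in it is hypothetical and chosen for illustration: the `poison_record` layout, the `REC_MAGIC` tag, the zero-sum checksum, and the `storage_ops` interface with its RAM-backed stand-in are assumptions, not the record format or storage interface that fmpm.c itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_MAGIC   0x4D505246u /* arbitrary tag chosen for this sketch */
#define REC_VERSION 1u
#define REC_ENTRIES 4

/* Component 1: a fixed-layout record (hypothetical format). */
struct poison_record {
	uint32_t magic;
	uint16_t version;
	uint16_t nr_entries;
	uint32_t checksum; /* makes the 32-bit word sum of the record zero */
	uint32_t reserved;
	uint64_t entry[REC_ENTRIES];
};

static_assert(sizeof(struct poison_record) % sizeof(uint32_t) == 0,
	      "record must be a whole number of 32-bit words");

/* Sum the record as 32-bit words; memcpy avoids aliasing problems. */
static uint32_t record_sum(const struct poison_record *rec)
{
	uint32_t sum = 0, word;

	for (size_t off = 0; off < sizeof(*rec); off += sizeof(word)) {
		memcpy(&word, (const unsigned char *)rec + off, sizeof(word));
		sum += word;
	}
	return sum;
}

/* Set the checksum so that record_sum() wraps to zero. */
static void record_seal(struct poison_record *rec)
{
	rec->checksum = 0;
	rec->checksum = 0u - record_sum(rec);
}

static int record_valid(const struct poison_record *rec)
{
	return rec->magic == REC_MAGIC && rec->version == REC_VERSION &&
	       rec->nr_entries <= REC_ENTRIES && record_sum(rec) == 0;
}

/* Component 2: a storage interface the record code is written against. */
struct storage_ops {
	int (*read)(void *ctx, void *buf, size_t len);
	int (*write)(void *ctx, const void *buf, size_t len);
};

/* RAM-backed stand-in for real persistent storage. */
struct ram_store {
	unsigned char bytes[sizeof(struct poison_record)];
};

static int ram_read(void *ctx, void *buf, size_t len)
{
	struct ram_store *s = ctx;

	if (len > sizeof(s->bytes))
		return -1;
	memcpy(buf, s->bytes, len);
	return 0;
}

static int ram_write(void *ctx, const void *buf, size_t len)
{
	struct ram_store *s = ctx;

	if (len > sizeof(s->bytes))
		return -1;
	memcpy(s->bytes, buf, len);
	return 0;
}

static const struct storage_ops ram_ops = {
	.read = ram_read,
	.write = ram_write,
};

/* Seal the record, then hand it to whichever backend is plugged in. */
static int record_save(const struct storage_ops *ops, void *ctx,
		       struct poison_record *rec)
{
	record_seal(rec);
	return ops->write(ctx, rec, sizeof(*rec));
}

/* Read a record back and reject it unless it validates. */
static int record_load(const struct storage_ops *ops, void *ctx,
		       struct poison_record *rec)
{
	if (ops->read(ctx, rec, sizeof(*rec)))
		return -1;
	return record_valid(rec) ? 0 : -1;
}
```

Keeping the record code behind `storage_ops` means the format logic does not need to know which storage backend is in use, which is one way the two components could be developed and abstracted separately.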
    
    Implement a new module to manage the record formats on persistent
    storage. Use the requirements for an AMD MI300-based system to start.
    Vendor- and platform-specific details can be abstracted later as needed.
    
      [ bp: Massage commit message and code, squash 30-ish more fixes from
        Yazen and me. ]
    Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
    Co-developed-by: <naveenkrishna.chatradhi@amd.com>
    Signed-off-by: <naveenkrishna.chatradhi@amd.com>
    Co-developed-by: <muralidhara.mk@amd.com>
    Signed-off-by: <muralidhara.mk@amd.com>
    Tested-by: <sathyapriya.k@amd.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
fmpm.c 18.4 KB