    mm: memory-failure: add memory failure stats to sysfs · 44b8f8bf
    Jiaqi Yan authored
    Patch series "Introduce per NUMA node memory error statistics", v2.
    
    Background
    ==========
    
    In the RFC for Kernel Support of Memory Error Detection [1], one advantage
    of software-based scanning over the hardware patrol scrubber is the
    ability to make statistics visible to system administrators.  The
    statistics fall into two categories:
    
    * Memory error statistics, for example, how many memory errors are
      encountered and how many of them are recovered by the kernel.  Note
      that these memory errors are non-fatal to the kernel: during machine
      check exception (MCE) handling, the kernel has already classified the
      MCE's severity as not requiring a panic (either action required or
      action optional).
    
    * Scanner statistics, for example, how many times the scanner has fully
      scanned a NUMA node and how many errors are first detected by the
      scanner.
    
    The memory error statistics are useful to userspace and are not specific
    to scanner-detected memory errors; they are the focus of this patchset.
    
    Motivation
    ==========
    
    Memory error stats are important to userspace, but the kernel does not
    expose enough of them today.  Datacenter administrators can better
    monitor a machine's memory health when the stats are visible.  For
    example, while memory errors are inevitable on servers with 10+ TB of
    memory, starting server maintenance when there are only one or two
    recovered memory errors could be an overreaction; in a cloud production
    environment, maintenance usually means live migrating all the workloads
    running on the server, which usually causes nontrivial disruption to the
    customer.  Providing insight into the scope of memory errors on a system
    helps to determine the appropriate follow-up action.  In addition, the
    kernel's existing memory error stats need to be standardized so that
    userspace can reliably count on their usefulness.
    
    Today the kernel provides the following memory error info to userspace,
    but each source is insufficient or has disadvantages:
    * HardwareCorrupted in /proc/meminfo: the total number of bytes poisoned,
      but not broken down per NUMA node
    * ras:memory_failure_event: only available after being explicitly enabled
    * /dev/mcelog: provides a lot of useful info about the MCEs, but doesn't
      capture how memory_failure recovered memory MCEs
    * kernel logs: userspace needs to process log text
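
    To illustrate the first item, the system-wide value can be pulled out of
    /proc/meminfo as sketched below.  This is only a sketch, not part of the
    series: the helper name hardware_corrupted_kb is hypothetical, and it
    assumes the usual "HardwareCorrupted:  N kB" line format, which is only
    present when the kernel is built with memory failure handling.  It shows
    that the result is a single number with no per NUMA node breakdown.

```python
import re


def hardware_corrupted_kb(meminfo_text: str) -> int:
    """Return the HardwareCorrupted value, in kB as printed, from meminfo text.

    Hypothetical helper for illustration; assumes a line of the form
    "HardwareCorrupted:     N kB".
    """
    match = re.search(
        r"^HardwareCorrupted:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE
    )
    if match is None:
        # The field is absent when memory failure handling is not built in.
        raise ValueError("HardwareCorrupted field not found")
    return int(match.group(1))
```

    On a live system the text would come from reading /proc/meminfo; the
    function takes the text as an argument so it can be exercised without
    one.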
    
    Exposing memory error stats is also a good start for the in-kernel memory
    error detector.  Today the data sources of memory error stats are either
    direct memory error consumption or hardware patrol scrubber detection
    (signaled as either UCNA or SRAO).  Once the in-kernel memory scanner is
    implemented, it will be the main source, as it is usually configured to
    scan memory DIMMs constantly and faster than the hardware patrol
    scrubber.
    
    How Implemented
    ===============
    
    As Naoya pointed out [2], exposing memory error statistics to userspace is
    useful independent of software or hardware scanner.  Therefore we
    implement the memory error statistics independently of the in-kernel
    memory error detector.  The series exposes the following per NUMA node
    memory error counters:
    
      /sys/devices/system/node/node${X}/memory_failure/total
      /sys/devices/system/node/node${X}/memory_failure/recovered
      /sys/devices/system/node/node${X}/memory_failure/ignored
      /sys/devices/system/node/node${X}/memory_failure/failed
      /sys/devices/system/node/node${X}/memory_failure/delayed
    
    These counters describe how many raw pages are poisoned and, after the
    kernel's recovery attempts, how they were resolved: how many are
    recovered, ignored, failed, or delayed, respectively.  This approach can
    be easier to extend for future use cases than /proc/meminfo, trace
    events, and logs.  The following math holds for the statistics:
    
    * total = recovered + ignored + failed + delayed
    
    These memory error stats are reset during machine boot.
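
    A userspace monitor could read the counters listed above and check the
    stated identity roughly as follows.  This is a sketch under assumptions,
    not code from the series: the function names are hypothetical, each
    file is assumed to hold a single decimal integer, and the sysfs root is
    a parameter so the code can be tried against a scratch directory.

```python
from pathlib import Path

# Counter file names as listed in the sysfs paths above.
COUNTERS = ("total", "recovered", "ignored", "failed", "delayed")


def read_node_stats(node_dir: Path) -> dict[str, int]:
    """Read the five memory_failure counters of one node directory."""
    stats_dir = node_dir / "memory_failure"
    return {name: int((stats_dir / name).read_text().strip()) for name in COUNTERS}


def read_all_nodes(
    root: Path = Path("/sys/devices/system/node"),
) -> dict[str, dict[str, int]]:
    """Map node name (e.g. "node0") to its counters.

    Nodes without a memory_failure directory are skipped.
    """
    result = {}
    for node_dir in sorted(root.glob("node[0-9]*")):
        if (node_dir / "memory_failure").is_dir():
            result[node_dir.name] = read_node_stats(node_dir)
    return result


def is_consistent(stats: dict[str, int]) -> bool:
    """Check total == recovered + ignored + failed + delayed."""
    return stats["total"] == sum(stats[name] for name in COUNTERS[1:])
```

    The five files are read one at a time, so a check like is_consistent
    may fail transiently if an error is handled between two reads; a
    monitor would likely retry before reporting a mismatch.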
    
    The 1st commit introduces these sysfs entries.  The 2nd commit populates
    memory error stats every time memory_failure attempts memory error
    recovery.  The 3rd commit adds documentation for the introduced stats.
    
    [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
    [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6
    
    
    This patch (of 3):
    
    Today the kernel provides the following memory error info to userspace,
    but each has its own disadvantage:
    
    * HardwareCorrupted in /proc/meminfo: the total number of bytes poisoned,
      but not broken down per NUMA node
    
    * ras:memory_failure_event: only available after being explicitly enabled
    
    * /dev/mcelog: provides a lot of useful info about the MCEs, but
      doesn't capture how memory_failure recovered memory MCEs
    
    * kernel logs: userspace needs to process log text
    
    Expose per NUMA node memory error stats as sysfs entries:
    
      /sys/devices/system/node/node${X}/memory_failure/total
      /sys/devices/system/node/node${X}/memory_failure/recovered
      /sys/devices/system/node/node${X}/memory_failure/ignored
      /sys/devices/system/node/node${X}/memory_failure/failed
      /sys/devices/system/node/node${X}/memory_failure/delayed
    
    These counters describe how many raw pages are poisoned and, after the
    kernel's recovery attempts, how they were resolved: how many are
    recovered, ignored, failed, or delayed, respectively.  The following math
    holds for the statistics:
    
    * total = recovered + ignored + failed + delayed
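
    Since the counters are cumulative until the next boot, a poller will
    usually care about the change between two samples of one node.  The
    sketch below is one possible way to compute that; counter_delta is a
    hypothetical name, and treating a counter that went backwards as a
    reboot is an assumption of this example, not behavior defined by the
    patch.

```python
COUNTERS = ("total", "recovered", "ignored", "failed", "delayed")


def counter_delta(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Return the events seen between two samples of one node's counters."""
    if any(current[name] < previous[name] for name in COUNTERS):
        # A counter went backwards: assume the machine rebooted and the
        # counters restarted from zero, so the current values are the delta.
        return {name: current[name] for name in COUNTERS}
    return {name: current[name] - previous[name] for name in COUNTERS}
```

    If both samples satisfy the identity above, the delta satisfies it as
    well, so the same check can be applied to per-interval numbers.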
    
    Link: https://lkml.kernel.org/r/20230120034622.2698268-1-jiaqiyan@google.com
    Link: https://lkml.kernel.org/r/20230120034622.2698268-2-jiaqiyan@google.com
    
    Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>