• Qiuxu Zhuo's avatar
    EDAC, sb_edac: Don't create a second memory controller if HA1 is not present · 15cc3ae0
    Qiuxu Zhuo authored
    Yi Zhang reported the following failure on a 2-socket Haswell (E5-2603v3)
    server (DELL PowerEdge 730xd):
    
      EDAC sbridge: Some needed devices are missing
      EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
      EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
      EDAC sbridge: Couldn't find mci handler
      EDAC sbridge: Couldn't find mci handler
      EDAC sbridge: Failed to register device with error -19.
    
    The refactored sb_edac driver creates the IMC1 (the 2nd memory
    controller) if any IMC1 device is present. In this case only
    HA1_TA of IMC1 was present, but the driver expected to find
    HA1/HA1_TM/HA1_TAD[0-3] devices too, leading to the above failure.
    
    The document [1] says the 'E5-2603 v3' CPU has 4 memory channels max. Yi
    Zhang inserted one DIMM per channel for each CPU, and did random error
    address injection test with this patch:
    
          4024  addresses fell in TOLM hole area
         12715  addresses fell in CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
         12774  addresses fell in CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
         12798  addresses fell in CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
         12913  addresses fell in CPU_SrcID#0_Ha#0_Chan#3_DIMM#0
         12674  addresses fell in CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
         12686  addresses fell in CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
         12882  addresses fell in CPU_SrcID#1_Ha#0_Chan#2_DIMM#0
         12934  addresses fell in CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
        106400  addresses were injected totally.
    
    The test result shows that all the 4 channels belong to IMC0 per CPU, so
    the server really only has one IMC per CPU.
    
    In the 1st page of chapter 2 in datasheet [2], it also says 'E5-2600 v3'
    implements either one or two IMCs. For CPUs with one IMC, IMC1 is not
    used and should be ignored.
    
    Thus, do not create a second memory controller if the key HA1 is absent.
    
    [1] http://ark.intel.com/products/83349/Intel-Xeon-Processor-E5-2603-v3-15M-Cache-1_60-GHz
    [2] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdfReported-and-tested-by: default avatarYi Zhang <yizhan@redhat.com>
    Signed-off-by: default avatarQiuxu Zhuo <qiuxu.zhuo@intel.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: linux-edac <linux-edac@vger.kernel.org>
    Fixes: e2f747b1 ("EDAC, sb_edac: Assign EDAC memory controller per h/w controller")
    Link: http://lkml.kernel.org/r/20170913104214.7325-1-qiuxu.zhuo@intel.com
    [ Massage commit message. ]
    Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
    15cc3ae0
sb_edac.c 92 KB