• Dave Hansen's avatar
    x86, sched: Add new topology for multi-NUMA-node CPUs · cebf15eb
    Dave Hansen authored
    I'm getting the spew below when booting with Haswell (Xeon
    E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
    in the BIOS.  It seems similar to the issue that some folks from
    AMD ran in to on their systems and addressed in this commit:
    
      161270fc ("x86/smp: Fix topology checks on AMD MCM CPUs")
    
    Both these Intel and AMD systems break an assumption which is
    being enforced by topology_sane(): a socket may not contain more
    than one NUMA node.
    
    AMD special-cased their system by looking for a cpuid flag.  The
    Intel mode is dependent on BIOS options and I do not know of a
    way which it is enumerated other than the tables being parsed
    during the CPU bringup process.  In other words, we have to trust
    the ACPI tables <shudder>.
    
    This detects the situation where a NUMA node occurs at a place in
    the middle of the "CPU" sched domains.  It replaces the default
    topology with one that relies on the NUMA information from the
    firmware (SRAT table) for all levels of sched domains above the
    hyperthreads.
    
    This also fixes a sysfs bug.  We used to freak out when we saw
    the "mc" group cross a node boundary, so we stopped building the
    MC group.  MC gets exported as the 'core_siblings_list' in
    /sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
    the same 'physical_package_id' to not be listed together in
    'core_siblings_list'.  This violates a statement from
    Documentation/ABI/testing/sysfs-devices-system-cpu:
    
    	core_siblings: internal kernel map of cpu#'s hardware threads
    	within the same physical_package_id.
    
    	core_siblings_list: human-readable list of the logical CPU
    	numbers within the same physical_package_id as cpu#.
    
    The sysfs effects here cause an issue with the hwloc tool where
    it gets confused and thinks there are more sockets than are
    physically present.
    
    Before this patch, there are two packages:
    
    # cd /sys/devices/system/cpu/
    # cat cpu*/topology/physical_package_id | sort | uniq -c
         18 0
         18 1
    
    But 4 _sets_ of core siblings:
    
    # cat cpu*/topology/core_siblings_list | sort | uniq -c
          9 0-8
          9 18-26
          9 27-35
          9 9-17
    
    After this set, there are only 2 sets of core siblings, which
    is what we expect for a 2-socket system.
    
    # cat cpu*/topology/physical_package_id | sort | uniq -c
         18 0
         18 1
    # cat cpu*/topology/core_siblings_list | sort | uniq -c
         18 0-17
         18 18-35
    
    Example spew:
    ...
    	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    	 #2  #3  #4  #5  #6  #7  #8
    	.... node  #1, CPUs:    #9
    	------------[ cut here ]------------
    	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
    	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
    	Modules linked in:
    	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
    	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
    	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
    	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
    	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
    	Call Trace:
    	[<ffffffff8172e485>] dump_stack+0x45/0x56
    	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
    	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
    	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
    	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
    	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
    	---[ end trace 3fe5f587a9fcde61 ]---
    	#10 #11 #12 #13 #14 #15 #16 #17
    	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
    	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35
    Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
    [ Added LLC domain and s/match_mc/match_die/ ]
    Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Igor Mammedov <imammedo@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Prarit Bhargava <prarit@redhat.com>
    Cc: Toshi Kani <toshi.kani@hp.com>
    Cc: brice.goglin@gmail.com
    Cc: "H. Peter Anvin" <hpa@linux.intel.com>
    Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
    cebf15eb
smpboot.c 36.4 KB