Commit 8fc63a91 authored by Michael Ellerman's avatar Michael Ellerman

Merge branch 'smp-topo' into next

Merge a branch containing SMP topology updates from Srikar, purely so we can
include the cover letter which has a lot of good detail here:

PowerVM systems configured in shared processors mode have some unique
challenges. Some device-tree properties will be missing on a shared
processor. Hence some sched domains may not make sense for shared processor
systems.

Most shared processor systems are over-provisioned. Underlying PowerVM
Hypervisor would schedule at a Big Core (SMT8) granularity. The most recent
power processors support two almost independent cores. In a lightly loaded
condition, it helps the overall system performance if we pack to lesser number
of Big Cores.

Since each thread-group is independent, running threads on both the
thread-groups of a SMT8 core, should have a minimal adverse impact in
non over provisioned scenarios. These changes in this patchset will not
affect in the over provisioned scenario.  If there are more threads than
SMT domains, then asym_packing will not kick-in.

System Configuration
type=Shared mode=Uncapped smt=8 lcpu=96 mem=1066409344 kB cpus=96 ent=64.00
So *64 Entitled cores/ 96 Virtual processor* Scenario

lscpu
Architecture:                       ppc64le
Byte Order:                         Little Endian
CPU(s):                             768
On-line CPU(s) list:                0-767
Model name:                         POWER10 (architected), altivec supported
Model:                              2.0 (pvr 0080 0200)
Thread(s) per core:                 8
Core(s) per socket:                 16
Socket(s):                          6
Hypervisor vendor:                  pHyp
Virtualization type:                para
L1d cache:                          6 MiB (192 instances)
L1i cache:                          9 MiB (192 instances)
NUMA node(s):                       6
NUMA node0 CPU(s):                  0-7,32-39,80-87,128-135,176-183,224-231,272-279,320-327,368-375,416-423,464-471,512-519,560-567,608-615,656-663,704-711,752-759
NUMA node1 CPU(s):                  8-15,40-47,88-95,136-143,184-191,232-239,280-287,328-335,376-383,424-431,472-479,520-527,568-575,616-623,664-671,712-719,760-767
NUMA node4 CPU(s):                  64-71,112-119,160-167,208-215,256-263,304-311,352-359,400-407,448-455,496-503,544-551,592-599,640-647,688-695,736-743
NUMA node5 CPU(s):                  16-23,48-55,96-103,144-151,192-199,240-247,288-295,336-343,384-391,432-439,480-487,528-535,576-583,624-631,672-679,720-727
NUMA node6 CPU(s):                  72-79,120-127,168-175,216-223,264-271,312-319,360-367,408-415,456-463,504-511,552-559,600-607,648-655,696-703,744-751
NUMA node7 CPU(s):                  24-31,56-63,104-111,152-159,200-207,248-255,296-303,344-351,392-399,440-447,488-495,536-543,584-591,632-639,680-687,728-735

ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5  3840178  4059268  3978042  3973936.6  84264.456
+patch     5  3768393  3927901  3874994  3854046    71532.926  -3.01692

>From lparstat (when the workload stabilized)
Kernel     %user  %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw       phint
6.6.0-rc3  4.16   0.00  0.00   95.84  26.06  40.72  4.16   69.88  276906989  578
+patch     4.16   0.00  0.00   95.83  17.70  27.66  4.17   78.26  70436663   119

ebizzy -t 128 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5 5520692  5981856  5717709  5727053.2  176093.2
+patch     5 5305888  6259610  5854590  5843311    375917.03  2.02998

>From lparstat (when the workload stabilized)
Kernel     %user  %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw       phint
6.6.0-rc3  16.66  0.00  0.00   83.33  45.49  71.08  16.67  50.50  288778533  581
+patch     16.65  0.00  0.00   83.35  30.15  47.11  16.65  65.76  85196150   133

ebizzy -t 512 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min       Max       Median    Avg       Stddev     %Change
6.6.0-rc3  5  19563921  20049955  19701510  19728733  198295.18
+patch     5  19455992  20176445  19718427  19832017  304094.05  0.523521

>From lparstat (when the workload stabilized)
%Kernel     user  %sys  %wait  %idle  physc  %entc   lbusy  app   vcsw       phint
66.6.0-rc3  6.44  0.01  0.00   33.55  94.14  147.09  66.45  1.33  313345175  621
6+patch     6.44  0.01  0.00   33.55  94.15  147.11  66.45  1.33  109193889  309

System Configuration
type=Shared mode=Uncapped smt=8 lcpu=40 mem=1067539392 kB cpus=96 ent=40.00
So *40 Entitled cores/ 40 Virtual processor* Scenario

lscpu
Architecture:                       ppc64le
Byte Order:                         Little Endian
CPU(s):                             320
On-line CPU(s) list:                0-319
Model name:                         POWER10 (architected), altivec supported
Model:                              2.0 (pvr 0080 0200)
Thread(s) per core:                 8
Core(s) per socket:                 10
Socket(s):                          4
Hypervisor vendor:                  pHyp
Virtualization type:                para
L1d cache:                          2.5 MiB (80 instances)
L1i cache:                          3.8 MiB (80 instances)
NUMA node(s):                       4
NUMA node0 CPU(s):                  0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295
NUMA node1 CPU(s):                  8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303
NUMA node4 CPU(s):                  16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279,304-311
NUMA node5 CPU(s):                  24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287,312-319

ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N   Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5   3535518  3864532  3745967  3704233.2  130216.76
+patch     5   3608385  3708026  3649379  3651596.6  37862.163  -1.42099

%Kernel    user   %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw     phint
6.6.0-rc3  10.00  0.01  0.00   89.99  22.98  57.45  10.01  41.01  1135139  262
+patch     10.00  0.00  0.00   90.00  16.95  42.37  10.00  47.05  925561   19

ebizzy -t 64 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N   Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5   4434984  4957281  4548786  4591298.2  211770.2
+patch     5   4461115  4835167  4544716  4607795.8  151474.85  0.359323

%Kernel    user   %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw     phint
6.6.0-rc3  20.01  0.00  0.00   79.99  38.22  95.55  20.01  25.77  1287553  265
+patch     19.99  0.00  0.00   80.01  25.55  63.88  19.99  38.44  1077341  20

ebizzy -t 256 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N   Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5   8850648  8982659  8951911  8936869.2  52278.031
+patch     5   8751038  9060510  8981409  8942268.4  117070.6   0.0604149

%Kernel    user   %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw     phint
6.6.0-rc3  80.02  0.01  0.01   19.96  40.00  100.00  80.03  24.00  1597665  276
+patch     80.02  0.01  0.01   19.96  40.00  100.00  80.03  23.99  1383921  63

Observation:
We are able to see Improvement in ebizzy throughput even with lesser
core utilization (almost half the core utilization) in low utilization
scenarios while still retaining throughput in mid and higher utilization
scenarios.
Note: The numbers are with Uncapped + no-noise case. In the Capped and/or
noise case, due to contention on the Cores, the numbers are expected to
further improve.

Note: The numbers included (sched/fair: Enable group_asym_packing in find_idlest_group)
https://lore.kernel.org/all/20231018155036.2314342-1-srikar@linux.vnet.ibm.com/
parents 6f4b7052 c4697571
......@@ -77,10 +77,10 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
#endif
struct task_struct *secondary_current;
bool has_big_cores;
bool coregroup_enabled;
bool thread_group_shares_l2;
bool thread_group_shares_l3;
bool has_big_cores __ro_after_init;
bool coregroup_enabled __ro_after_init;
bool thread_group_shares_l2 __ro_after_init;
bool thread_group_shares_l3 __ro_after_init;
DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
......@@ -93,15 +93,6 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
EXPORT_PER_CPU_SYMBOL(cpu_core_map);
EXPORT_SYMBOL_GPL(has_big_cores);
enum {
#ifdef CONFIG_SCHED_SMT
smt_idx,
#endif
cache_idx,
mc_idx,
die_idx,
};
#define MAX_THREAD_LIST_SIZE 8
#define THREAD_GROUP_SHARE_L1 1
#define THREAD_GROUP_SHARE_L2_L3 2
......@@ -987,7 +978,7 @@ static int __init init_thread_group_cache_map(int cpu, int cache_property)
return 0;
}
static bool shared_caches;
static bool shared_caches __ro_after_init;
#ifdef CONFIG_SCHED_SMT
/* cpumask of CPUs with asymmetric SMT dependency */
......@@ -1003,6 +994,13 @@ static int powerpc_smt_flags(void)
}
#endif
/*
* On shared processor LPARs scheduled on a big core (which has two or more
* independent thread groups per core), prefer lower numbered CPUs, so
* that workload consolidates to lesser number of cores.
*/
static __ro_after_init DEFINE_STATIC_KEY_FALSE(splpar_asym_pack);
/*
* P9 has a slightly odd architecture where pairs of cores share an L2 cache.
* This topology makes it *much* cheaper to migrate tasks between adjacent cores
......@@ -1011,9 +1009,20 @@ static int powerpc_smt_flags(void)
*/
static int powerpc_shared_cache_flags(void)
{
if (static_branch_unlikely(&splpar_asym_pack))
return SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING;
return SD_SHARE_PKG_RESOURCES;
}
static int powerpc_shared_proc_flags(void)
{
if (static_branch_unlikely(&splpar_asym_pack))
return SD_ASYM_PACKING;
return 0;
}
/*
* We can't just pass cpu_l2_cache_mask() directly because
* returns a non-const pointer and the compiler barfs on that.
......@@ -1037,6 +1046,10 @@ static struct cpumask *cpu_coregroup_mask(int cpu)
static bool has_coregroup_support(void)
{
/* Coregroup identification not available on shared systems */
if (is_shared_processor())
return 0;
return coregroup_enabled;
}
......@@ -1045,16 +1058,6 @@ static const struct cpumask *cpu_mc_mask(int cpu)
return cpu_coregroup_mask(cpu);
}
static struct sched_domain_topology_level powerpc_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
#endif
{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
{ cpu_mc_mask, SD_INIT_NAME(MC) },
{ cpu_cpu_mask, SD_INIT_NAME(PKG) },
{ NULL, },
};
static int __init init_big_cores(void)
{
int cpu;
......@@ -1682,43 +1685,45 @@ void start_secondary(void *unused)
BUG();
}
static void __init fixup_topology(void)
static struct sched_domain_topology_level powerpc_topology[6];
static void __init build_sched_topology(void)
{
int i;
int i = 0;
if (is_shared_processor() && has_big_cores)
static_branch_enable(&splpar_asym_pack);
#ifdef CONFIG_SCHED_SMT
if (has_big_cores) {
pr_info("Big cores detected but using small core scheduling\n");
powerpc_topology[smt_idx].mask = smallcore_smt_mask;
powerpc_topology[i++] = (struct sched_domain_topology_level){
smallcore_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT)
};
} else {
powerpc_topology[i++] = (struct sched_domain_topology_level){
cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT)
};
}
#endif
if (shared_caches) {
powerpc_topology[i++] = (struct sched_domain_topology_level){
shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE)
};
}
if (has_coregroup_support()) {
powerpc_topology[i++] = (struct sched_domain_topology_level){
cpu_mc_mask, powerpc_shared_proc_flags, SD_INIT_NAME(MC)
};
}
powerpc_topology[i++] = (struct sched_domain_topology_level){
cpu_cpu_mask, powerpc_shared_proc_flags, SD_INIT_NAME(PKG)
};
if (!has_coregroup_support())
powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask;
/*
* Try to consolidate topology levels here instead of
* allowing scheduler to degenerate.
* - Dont consolidate if masks are different.
* - Dont consolidate if sd_flags exists and are different.
*/
for (i = 1; i <= die_idx; i++) {
if (powerpc_topology[i].mask != powerpc_topology[i - 1].mask)
continue;
if (powerpc_topology[i].sd_flags && powerpc_topology[i - 1].sd_flags &&
powerpc_topology[i].sd_flags != powerpc_topology[i - 1].sd_flags)
continue;
if (!powerpc_topology[i - 1].sd_flags)
powerpc_topology[i - 1].sd_flags = powerpc_topology[i].sd_flags;
/* There must be one trailing NULL entry left. */
BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
powerpc_topology[i].mask = powerpc_topology[i + 1].mask;
powerpc_topology[i].sd_flags = powerpc_topology[i + 1].sd_flags;
#ifdef CONFIG_SCHED_DEBUG
powerpc_topology[i].name = powerpc_topology[i + 1].name;
#endif
}
set_sched_topology(powerpc_topology);
}
void __init smp_cpus_done(unsigned int max_cpus)
......@@ -1733,9 +1738,20 @@ void __init smp_cpus_done(unsigned int max_cpus)
smp_ops->bringup_done();
dump_numa_cpu_topology();
build_sched_topology();
}
fixup_topology();
set_sched_topology(powerpc_topology);
/*
* For asym packing, by default lower numbered CPU has higher priority.
* On shared processors, pack to lower numbered core. However avoid moving
* between thread_groups within the same core.
*/
int arch_asym_cpu_priority(int cpu)
{
if (static_branch_unlikely(&splpar_asym_pack))
return -cpu / threads_per_core;
return -cpu;
}
#ifdef CONFIG_HOTPLUG_CPU
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment