Commit 6f9d71c9 authored by Linus Torvalds

Merge branch 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Waiman's cgroup2 cpuset support has finally been merged, closing one
   of the last remaining feature gaps.

 - cgroup.procs could show non-leader threads when cgroup2 threaded mode
   was used in certain ways. I forgot to push the fix during the last
   cycle.

 - A patch to fix mount option parsing when all mount options have been
   consumed by someone else (LSM).

 - cgroup_no_v1 boot param can now block named cgroup1 hierarchies too.

* 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param
  cgroup: fix parsing empty mount option string
  cpuset: Remove set but not used variable 'cs'
  cgroup: fix CSS_TASK_ITER_PROCS
  cgroup: Add .__DEBUG__. prefix to debug file names
  cpuset: Minor cgroup2 interface updates
  cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug
  cpuset: Add documentation about the new "cpuset.sched.partition" flag
  cpuset: Use descriptive text when reading/writing cpuset.sched.partition
  cpuset: Expose cpus.effective and mems.effective on cgroup v2 root
  cpuset: Make generate_sched_domains() work with partition
  cpuset: Make CPU hotplug work with partition
  cpuset: Track cpusets that use parent's effective_cpus
  cpuset: Add an error state to cpuset.sched.partition
  cpuset: Add new v2 cpuset.sched.partition flag
  cpuset: Simply allocation and freeing of cpumasks
  cpuset: Define data structures to support scheduling partition
  cpuset: Enable cpuset controller in default hierarchy
  cgroup: remove unnecessary unlikely()
parents 55db91fb 3fc9c12d
......@@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-3-2. IO Latency Interface Files
5-4. PID
5-4-1. PID Interface Files
5-5. Device
5-6. RDMA
5-6-1. RDMA Interface Files
5-7. Misc
5-7-1. perf_event
5-5. Cpuset
5-5-1. Cpuset Interface Files
5-6. Device
5-7. RDMA
5-7-1. RDMA Interface Files
5-8. Misc
5-8-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
......@@ -1610,6 +1612,176 @@ through fork() or clone(). These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.
Cpuset
------
The "cpuset" controller provides a mechanism for constraining
the CPU and memory node placement of tasks to only the resources
specified in the cpuset interface files in a task's current cgroup.
This is especially valuable on large NUMA systems where placing jobs
on properly sized subsets of the systems with careful processor and
memory placement to reduce cross-node memory access and contention
can improve overall system performance.
The "cpuset" controller is hierarchical. That means the controller
cannot use CPUs or memory nodes not allowed in its parent.
Cpuset Interface Files
~~~~~~~~~~~~~~~~~~~~~~
cpuset.cpus
A read-write multiple values file which exists on non-root
cpuset-enabled cgroups.
It lists the requested CPUs to be used by tasks within this
cgroup. The actual list of CPUs to be granted, however, is
subject to constraints imposed by its parent and can differ
from the requested CPUs.
The CPU numbers are comma-separated numbers or ranges.
For example:
# cat cpuset.cpus
0-4,6,8-10
An empty value indicates that the cgroup is using the same
setting as the nearest cgroup ancestor with a non-empty
"cpuset.cpus" or all the available CPUs if none is found.
The value of "cpuset.cpus" stays constant until the next update
and won't be affected by any CPU hotplug events.
cpuset.cpus.effective
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
It lists the onlined CPUs that are actually granted to this
cgroup by its parent. These CPUs are allowed to be used by
tasks within the current cgroup.
If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
all the CPUs from the parent cgroup that can be available to
be used by this cgroup. Otherwise, it should be a subset of
"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
can be granted. In this case, it will be treated just like an
empty "cpuset.cpus".
Its value will be affected by CPU hotplug events.
cpuset.mems
A read-write multiple values file which exists on non-root
cpuset-enabled cgroups.
It lists the requested memory nodes to be used by tasks within
this cgroup. The actual list of memory nodes granted, however,
is subject to constraints imposed by its parent and can differ
from the requested memory nodes.
The memory node numbers are comma-separated numbers or ranges.
For example:
# cat cpuset.mems
0-1,3
An empty value indicates that the cgroup is using the same
setting as the nearest cgroup ancestor with a non-empty
"cpuset.mems" or all the available memory nodes if none
is found.
The value of "cpuset.mems" stays constant until the next update
and won't be affected by any memory nodes hotplug events.
cpuset.mems.effective
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
It lists the onlined memory nodes that are actually granted to
this cgroup by its parent. These memory nodes are allowed to
be used by tasks within the current cgroup.
If "cpuset.mems" is empty, it shows all the memory nodes from the
parent cgroup that will be available to be used by this cgroup.
Otherwise, it should be a subset of "cpuset.mems" unless none of
the memory nodes listed in "cpuset.mems" can be granted. In this
case, it will be treated just like an empty "cpuset.mems".
Its value will be affected by memory nodes hotplug events.
cpuset.cpus.partition
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
and is not delegatable.
It accepts only the following input values when written to.
"root" - a paritition root
"member" - a non-root member of a partition
When set to be a partition root, the current cgroup is the
root of a new partition or scheduling domain that comprises
itself and all its descendants except those that are separate
partition roots themselves and their descendants. The root
cgroup is always a partition root.
There are constraints on where a partition root can be set.
It can only be set in a cgroup if all the following conditions
are true.
1) The "cpuset.cpus" is not empty and the list of CPUs are
exclusive, i.e. they are not shared by any of its siblings.
2) The parent cgroup is a partition root.
3) The "cpuset.cpus" is also a proper subset of the parent's
"cpuset.cpus.effective".
4) There is no child cgroups with cpuset enabled. This is for
eliminating corner cases that have to be handled if such a
condition is allowed.
Setting it to partition root will take the CPUs away from the
effective CPUs of the parent cgroup. Once it is set, this
file cannot be reverted back to "member" if there are any child
cgroups with cpuset enabled.
A parent partition cannot distribute all its CPUs to its
child partitions. There must be at least one cpu left in the
parent partition.
Once it becomes a partition root, changes to "cpuset.cpus" are
generally allowed as long as the first condition above remains
true, the change does not take away all the CPUs from the parent
partition, and the new "cpuset.cpus" value is a superset of its
children's "cpuset.cpus" values.
Sometimes, external factors like changes to ancestors'
"cpuset.cpus" or cpu hotplug can cause the state of the partition
root to change. On read, the "cpuset.cpus.partition" file
can show the following values.
"member" Non-root member of a partition
"root" Partition root
"root invalid" Invalid partition root
It is a partition root if the first 2 partition root conditions
above are true and at least one CPU from "cpuset.cpus" is
granted by the parent cgroup.
A partition root can become invalid if none of the CPUs requested
in "cpuset.cpus" can be granted by the parent cgroup or the
parent cgroup is no longer a partition root itself. In this
case, it is not a real partition even though the restriction
of the first partition root condition above still applies.
The cpu affinity of all the tasks in the cgroup will then be
associated with CPUs in the nearest ancestor partition.
An invalid partition root can be transitioned back to a
real partition root if at least one of the requested CPUs
can now be granted by its parent. In this case, the cpu
affinity of all the tasks in the formerly invalid partition
will be associated with the CPUs of the newly formed partition.
Changing the partition state of an invalid partition root to
"member" is always allowed even if child cpusets are present.
Device controller
-----------------
......
......@@ -486,10 +486,14 @@
cut the overhead, others just disable the usage. So
only cgroup_disable=memory is actually worthy}
cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1
Format: { controller[,controller...] | "all" }
cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1
Format: { { controller | "all" | "named" }
[,{ controller | "all" | "named" }...] }
Like cgroup_disable, but only applies to cgroup v1;
the blacklisted controllers remain available in cgroup2.
"all" blacklists all controllers and "named" disables
named mounts. Specifying both "all" and "named" disables
all v1 hierarchies.
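For example (illustrative values, not from this patch):
	cgroup_no_v1=memory,cpu
	cgroup_no_v1=all,named
The first line keeps the memory and cpu controllers off v1 while
leaving them usable on cgroup2; the second disables every v1
hierarchy, named and unnamed.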
cgroup.memory= [KNL] Pass options to the cgroup memory controller.
Format: <string>
......
......@@ -92,6 +92,7 @@ enum {
CFTYPE_NO_PREFIX = (1 << 3), /* (DON'T USE FOR NEW FILES) no subsys prefix */
CFTYPE_WORLD_WRITABLE = (1 << 4), /* (DON'T USE FOR NEW FILES) S_IWUGO */
CFTYPE_DEBUG = (1 << 5), /* create when cgroup_debug */
/* internal flags, do not use outside cgroup core proper */
__CFTYPE_ONLY_ON_DFL = (1 << 16), /* only on default hierarchy */
......
......@@ -11,6 +11,8 @@
#define TRACE_CGROUP_PATH_LEN 1024
extern spinlock_t trace_cgroup_path_lock;
extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
extern bool cgroup_debug;
extern void __init enable_debug_cgroup(void);
/*
* cgroup_path() takes a spin lock. It is good practice not to take
......
......@@ -27,6 +27,9 @@
/* Controllers blocked by the commandline in v1 */
static u16 cgroup_no_v1_mask;
/* disable named v1 mounts */
static bool cgroup_no_v1_named;
/*
* pidlist destructions need to be flushed on cgroup destruction. Use a
* separate workqueue as flush domain.
......@@ -963,6 +966,10 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
}
if (!strncmp(token, "name=", 5)) {
const char *name = token + 5;
/* blocked by boot param? */
if (cgroup_no_v1_named)
return -ENOENT;
/* Can't specify an empty name */
if (!strlen(name))
return -EINVAL;
......@@ -1292,7 +1299,12 @@ static int __init cgroup_no_v1(char *str)
if (!strcmp(token, "all")) {
cgroup_no_v1_mask = U16_MAX;
break;
continue;
}
if (!strcmp(token, "named")) {
cgroup_no_v1_named = true;
continue;
}
for_each_subsys(ss, i) {
......
......@@ -86,6 +86,7 @@ EXPORT_SYMBOL_GPL(css_set_lock);
DEFINE_SPINLOCK(trace_cgroup_path_lock);
char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
bool cgroup_debug __read_mostly;
/*
* Protects cgroup_idr and css_idr so that IDs can be released without
......@@ -1429,12 +1430,15 @@ static char *cgroup_file_name(struct cgroup *cgrp, const struct cftype *cft,
struct cgroup_subsys *ss = cft->ss;
if (cft->ss && !(cft->flags & CFTYPE_NO_PREFIX) &&
!(cgrp->root->flags & CGRP_ROOT_NOPREFIX))
snprintf(buf, CGROUP_FILE_NAME_MAX, "%s.%s",
cgroup_on_dfl(cgrp) ? ss->name : ss->legacy_name,
!(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
const char *dbg = (cft->flags & CFTYPE_DEBUG) ? ".__DEBUG__." : "";
snprintf(buf, CGROUP_FILE_NAME_MAX, "%s%s.%s",
dbg, cgroup_on_dfl(cgrp) ? ss->name : ss->legacy_name,
cft->name);
else
} else {
strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
}
return buf;
}
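A stand-alone sketch (not part of the patch) of the name format built
above: with cgroup_debug on the boot command line, a CFTYPE_DEBUG file
such as cpuset's "cpus.subpartitions" is expected to show up as
".__DEBUG__.cpuset.cpus.subpartitions" on the default hierarchy.
	/* Hypothetical user-space re-creation of the snprintf() format used above. */
	#include <stdio.h>

	int main(void)
	{
		char buf[64];
		const char *dbg = ".__DEBUG__.";		/* CFTYPE_DEBUG is set */
		const char *ss_name = "cpuset";			/* subsystem name on cgroup2 */
		const char *cft_name = "cpus.subpartitions";	/* cft->name */

		snprintf(buf, sizeof(buf), "%s%s.%s", dbg, ss_name, cft_name);
		printf("%s\n", buf);	/* prints .__DEBUG__.cpuset.cpus.subpartitions */
		return 0;
	}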
......@@ -1774,7 +1778,7 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
*root_flags = 0;
if (!data)
if (!data || *data == '\0')
return 0;
while ((token = strsep(&data, ",")) != NULL) {
......@@ -3669,7 +3673,8 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
continue;
if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
continue;
if ((cft->flags & CFTYPE_DEBUG) && !cgroup_debug)
continue;
if (is_add) {
ret = cgroup_add_file(css, cgrp, cft);
if (ret) {
......@@ -4232,10 +4237,11 @@ static void css_task_iter_advance(struct css_task_iter *it)
lockdep_assert_held(&css_set_lock);
repeat:
if (it->task_pos) {
/*
* Advance iterator to find next entry. cset->tasks is consumed
* first and then ->mg_tasks. After ->mg_tasks, we move onto the
* next cset.
* Advance iterator to find next entry. cset->tasks is
* consumed first and then ->mg_tasks. After ->mg_tasks,
* we move onto the next cset.
*/
next = it->task_pos->next;
......@@ -4246,6 +4252,10 @@ static void css_task_iter_advance(struct css_task_iter *it)
css_task_iter_advance_css_set(it);
else
it->task_pos = next;
} else {
/* called from start, proceed to the first cset */
css_task_iter_advance_css_set(it);
}
/* if PROCS, skip over tasks which aren't group leaders */
if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos &&
......@@ -4285,7 +4295,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
it->cset_head = it->cset_pos;
css_task_iter_advance_css_set(it);
css_task_iter_advance(it);
spin_unlock_irq(&css_set_lock);
}
......@@ -5773,6 +5783,16 @@ static int __init cgroup_disable(char *str)
}
__setup("cgroup_disable=", cgroup_disable);
void __init __weak enable_debug_cgroup(void) { }
static int __init enable_cgroup_debug(char *str)
{
cgroup_debug = true;
enable_debug_cgroup();
return 1;
}
__setup("cgroup_debug", enable_cgroup_debug);
/**
* css_tryget_online_from_dir - get corresponding css from a cgroup dentry
* @dentry: directory dentry of interest
......@@ -6008,11 +6028,9 @@ static ssize_t show_delegatable_files(struct cftype *files, char *buf,
ret += snprintf(buf + ret, size - ret, "%s\n", cft->name);
if (unlikely(ret >= size)) {
WARN_ON(1);
if (WARN_ON(ret >= size))
break;
}
}
return ret;
}
......
......@@ -109,6 +109,16 @@ struct cpuset {
cpumask_var_t effective_cpus;
nodemask_t effective_mems;
/*
* CPUs allocated to child sub-partitions (default hierarchy only)
* - CPUs granted by the parent = effective_cpus U subparts_cpus
* - effective_cpus and subparts_cpus are mutually exclusive.
*
* effective_cpus contains only onlined CPUs, but subparts_cpus
* may have offlined ones.
*/
cpumask_var_t subparts_cpus;
/*
* This is old Memory Nodes tasks took on.
*
......@@ -134,6 +144,47 @@ struct cpuset {
/* for custom sched domain */
int relax_domain_level;
/* number of CPUs in subparts_cpus */
int nr_subparts_cpus;
/* partition root state */
int partition_root_state;
/*
* Default hierarchy only:
* use_parent_ecpus - set if using parent's effective_cpus
* child_ecpus_count - # of children with use_parent_ecpus set
*/
int use_parent_ecpus;
int child_ecpus_count;
};
/*
* Partition root states:
*
* 0 - not a partition root
*
* 1 - partition root
*
* -1 - invalid partition root
* None of the cpus in cpus_allowed can be put into the parent's
* subparts_cpus. In this case, the cpuset is not a real partition
* root anymore. However, the CPU_EXCLUSIVE bit will still be set
* and the cpuset can be restored back to a partition root if the
* parent cpuset can give more CPUs back to this child cpuset.
*/
#define PRS_DISABLED 0
#define PRS_ENABLED 1
#define PRS_ERROR -1
/*
* Temporary cpumasks for working with partitions that are passed among
* functions to avoid memory allocation in inner functions.
*/
struct tmpmasks {
cpumask_var_t addmask, delmask; /* For partition root */
cpumask_var_t new_cpus; /* For update_cpumasks_hier() */
};
static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
......@@ -218,9 +269,15 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
static inline int is_partition_root(const struct cpuset *cs)
{
return cs->partition_root_state > 0;
}
static struct cpuset top_cpuset = {
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
(1 << CS_MEM_EXCLUSIVE)),
.partition_root_state = PRS_ENABLED,
};
/**
......@@ -418,6 +475,65 @@ static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
is_mem_exclusive(p) <= is_mem_exclusive(q);
}
/**
* alloc_cpumasks - allocate three cpumasks for cpuset
* @cs: the cpuset that has cpumasks to be allocated.
* @tmp: the tmpmasks structure pointer
* Return: 0 if successful, -ENOMEM otherwise.
*
* Only one of the two input arguments should be non-NULL.
*/
static inline int alloc_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
{
cpumask_var_t *pmask1, *pmask2, *pmask3;
if (cs) {
pmask1 = &cs->cpus_allowed;
pmask2 = &cs->effective_cpus;
pmask3 = &cs->subparts_cpus;
} else {
pmask1 = &tmp->new_cpus;
pmask2 = &tmp->addmask;
pmask3 = &tmp->delmask;
}
if (!zalloc_cpumask_var(pmask1, GFP_KERNEL))
return -ENOMEM;
if (!zalloc_cpumask_var(pmask2, GFP_KERNEL))
goto free_one;
if (!zalloc_cpumask_var(pmask3, GFP_KERNEL))
goto free_two;
return 0;
free_two:
free_cpumask_var(*pmask2);
free_one:
free_cpumask_var(*pmask1);
return -ENOMEM;
}
/**
* free_cpumasks - free cpumasks in a cpuset or a tmpmasks structure
* @cs: the cpuset that has cpumasks to be freed.
* @tmp: the tmpmasks structure pointer
*/
static inline void free_cpumasks(struct cpuset *cs, struct tmpmasks *tmp)
{
if (cs) {
free_cpumask_var(cs->cpus_allowed);
free_cpumask_var(cs->effective_cpus);
free_cpumask_var(cs->subparts_cpus);
}
if (tmp) {
free_cpumask_var(tmp->new_cpus);
free_cpumask_var(tmp->addmask);
free_cpumask_var(tmp->delmask);
}
}
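A minimal sketch (not from the patch) of the intended pairing: callers
that only need scratch masks pass a NULL cpuset together with a
stack-allocated tmpmasks, which is how update_prstate() and
cpuset_hotplug_workfn() use these helpers later in the series.
	static int example_scratch_masks(void)
	{
		struct tmpmasks tmp;

		/* Allocates tmp.new_cpus, tmp.addmask and tmp.delmask. */
		if (alloc_cpumasks(NULL, &tmp))
			return -ENOMEM;

		/* ... pass &tmp down to update_cpumasks_hier() and friends ... */

		/* cs == NULL: free only the three temporary masks. */
		free_cpumasks(NULL, &tmp);
		return 0;
	}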
/**
* alloc_trial_cpuset - allocate a trial cpuset
* @cs: the cpuset that the trial cpuset duplicates
......@@ -430,31 +546,24 @@ static struct cpuset *alloc_trial_cpuset(struct cpuset *cs)
if (!trial)
return NULL;
if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL))
goto free_cs;
if (!alloc_cpumask_var(&trial->effective_cpus, GFP_KERNEL))
goto free_cpus;
if (alloc_cpumasks(trial, NULL)) {
kfree(trial);
return NULL;
}
cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
cpumask_copy(trial->effective_cpus, cs->effective_cpus);
return trial;
free_cpus:
free_cpumask_var(trial->cpus_allowed);
free_cs:
kfree(trial);
return NULL;
}
/**
* free_trial_cpuset - free the trial cpuset
* @trial: the trial cpuset to be freed
* free_cpuset - free the cpuset
* @cs: the cpuset to be freed
*/
static void free_trial_cpuset(struct cpuset *trial)
static inline void free_cpuset(struct cpuset *cs)
{
free_cpumask_var(trial->effective_cpus);
free_cpumask_var(trial->cpus_allowed);
kfree(trial);
free_cpumasks(cs, NULL);
kfree(cs);
}
/*
......@@ -660,13 +769,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
int ndoms = 0; /* number of sched domains in result */
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
bool root_load_balance = is_sched_load_balance(&top_cpuset);
doms = NULL;
dattr = NULL;
csa = NULL;
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
if (root_load_balance && !top_cpuset.nr_subparts_cpus) {
ndoms = 1;
doms = alloc_sched_domains(ndoms);
if (!doms)
......@@ -689,6 +799,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
csn = 0;
rcu_read_lock();
if (root_load_balance)
csa[csn++] = &top_cpuset;
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
......@@ -699,6 +811,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
* parent's cpus, so just skip them, and then we call
* update_domain_attr_tree() to calc relax_domain_level of
* the corresponding sched domain.
*
* If root is load-balancing, we can skip @cp if it
* is a subset of the root's effective_cpus.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
......@@ -706,10 +821,15 @@ static int generate_sched_domains(cpumask_var_t **domains,
housekeeping_cpumask(HK_FLAG_DOMAIN))))
continue;
if (root_load_balance &&
cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
continue;
if (is_sched_load_balance(cp))
csa[csn++] = cp;
/* skip @cp's subtree */
/* skip @cp's subtree if not a partition root */
if (!is_partition_root(cp))
pos_css = css_rightmost_descendant(pos_css);
}
rcu_read_unlock();
......@@ -838,7 +958,12 @@ static void rebuild_sched_domains_locked(void)
* passing doms with offlined cpu to partition_sched_domains().
* Anyways, hotplug work item will rebuild sched domains.
*/
if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
if (!top_cpuset.nr_subparts_cpus &&
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
if (top_cpuset.nr_subparts_cpus &&
!cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
goto out;
/* Generate domain masks and attrs */
......@@ -881,10 +1006,248 @@ static void update_tasks_cpumask(struct cpuset *cs)
css_task_iter_end(&it);
}
/**
* compute_effective_cpumask - Compute the effective cpumask of the cpuset
* @new_cpus: the temp variable for the new effective_cpus mask
* @cs: the cpuset whose new effective_cpus mask needs to be computed
* @parent: the parent cpuset
*
* If the parent has subpartition CPUs, include them in the list of
* allowable CPUs in computing the new effective_cpus mask. Since offlined
* CPUs are not removed from subparts_cpus, we have to use cpu_active_mask
* to mask those out.
*/
static void compute_effective_cpumask(struct cpumask *new_cpus,
struct cpuset *cs, struct cpuset *parent)
{
if (parent->nr_subparts_cpus) {
cpumask_or(new_cpus, parent->effective_cpus,
parent->subparts_cpus);
cpumask_and(new_cpus, new_cpus, cs->cpus_allowed);
cpumask_and(new_cpus, new_cpus, cpu_active_mask);
} else {
cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
}
}
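A stand-alone illustration (hypothetical values, plain bitmasks instead
of cpumask_t) of the arithmetic above for the case where the parent has
subpartition CPUs:
	#include <stdio.h>

	int main(void)
	{
		unsigned long parent_eff = 0x0f;	/* parent's effective CPUs 0-3 */
		unsigned long parent_sub = 0x30;	/* CPUs 4-5 handed to child partitions */
		unsigned long cs_allowed = 0x3c;	/* this cpuset requested CPUs 2-5 */
		unsigned long active     = 0x2f;	/* CPU 4 is offline */

		/* include subparts_cpus, then mask out offline CPUs */
		unsigned long eff = ((parent_eff | parent_sub) & cs_allowed) & active;

		printf("effective = 0x%lx\n", eff);	/* 0x2c -> CPUs 2, 3 and 5 */
		return 0;
	}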
/*
* Commands for update_parent_subparts_cpumask
*/
enum subparts_cmd {
partcmd_enable, /* Enable partition root */
partcmd_disable, /* Disable partition root */
partcmd_update, /* Update parent's subparts_cpus */
};
/**
* update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
* @cpuset: The cpuset that requests change in partition root state
* @cmd: Partition root state change command
* @newmask: Optional new cpumask for partcmd_update
* @tmp: Temporary addmask and delmask
* Return: 0, 1 or an error code
*
* For partcmd_enable, the cpuset is being transformed from a non-partition
* root to a partition root. The cpus_allowed mask of the given cpuset will
* be put into parent's subparts_cpus and taken away from parent's
* effective_cpus. The function will return 0 if all the CPUs listed in
* cpus_allowed can be granted or an error code will be returned.
*
* For partcmd_disable, the cpuset is being transformed from a partition
* root back to a non-partition root. Any CPUs in cpus_allowed that are in
* parent's subparts_cpus will be taken away from that cpumask and put back
* into parent's effective_cpus. 0 should always be returned.
*
* For partcmd_update, if the optional newmask is specified, the cpu
* list is to be changed from cpus_allowed to newmask. Otherwise,
* cpus_allowed is assumed to remain the same. The cpuset should either
* be a partition root or an invalid partition root. The partition root
* state may change if newmask is NULL and none of the requested CPUs can
* be granted by the parent. The function will return 1 if changes to
* parent's subparts_cpus and effective_cpus happen or 0 otherwise.
* Error code should only be returned when newmask is non-NULL.
*
* The partcmd_enable and partcmd_disable commands are used by
* update_prstate(). The partcmd_update command is used by
* update_cpumasks_hier() with newmask NULL and update_cpumask() with
* newmask set.
*
* The checking is more strict when enabling partition root than the
* other two commands.
*
* Because of the implicit cpu exclusive nature of a partition root,
* cpumask changes that violate the cpu exclusivity rule will not be
* permitted when checked by validate_change(). The validate_change()
* function will also prevent any changes to the cpu list if it is not
* a superset of children's cpu lists.
*/
static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
struct cpumask *newmask,
struct tmpmasks *tmp)
{
struct cpuset *parent = parent_cs(cpuset);
int adding; /* Moving cpus from effective_cpus to subparts_cpus */
int deleting; /* Moving cpus from subparts_cpus to effective_cpus */
bool part_error = false; /* Partition error? */
lockdep_assert_held(&cpuset_mutex);
/*
* The parent must be a partition root.
* The new cpumask, if present, or the current cpus_allowed must
* not be empty.
*/
if (!is_partition_root(parent) ||
(newmask && cpumask_empty(newmask)) ||
(!newmask && cpumask_empty(cpuset->cpus_allowed)))
return -EINVAL;
/*
* Enabling/disabling partition root is not allowed if there are
* online children.
*/
if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
return -EBUSY;
/*
* Enabling partition root is not allowed if not all the CPUs
* can be granted from parent's effective_cpus or at least one
* CPU will be left after that.
*/
if ((cmd == partcmd_enable) &&
(!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
return -EINVAL;
/*
* A cpumask update cannot make parent's effective_cpus become empty.
*/
adding = deleting = false;
if (cmd == partcmd_enable) {
cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
adding = true;
} else if (cmd == partcmd_disable) {
deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
parent->subparts_cpus);
} else if (newmask) {
/*
* partcmd_update with newmask:
*
* delmask = cpus_allowed & ~newmask & parent->subparts_cpus
* addmask = newmask & parent->effective_cpus
* & ~parent->subparts_cpus
*/
cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
deleting = cpumask_and(tmp->delmask, tmp->delmask,
parent->subparts_cpus);
cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
* Return error if the new effective_cpus could become empty.
*/
if (adding &&
cpumask_equal(parent->effective_cpus, tmp->addmask)) {
if (!deleting)
return -EINVAL;
/*
* As some of the CPUs in subparts_cpus might have
* been offlined, we need to compute the real delmask
* to confirm that.
*/
if (!cpumask_and(tmp->addmask, tmp->delmask,
cpu_active_mask))
return -EINVAL;
cpumask_copy(tmp->addmask, parent->effective_cpus);
}
} else {
/*
* partcmd_update w/o newmask:
*
* addmask = cpus_allowed & parent->effective_cpus
*
* Note that parent's subparts_cpus may have been
* pre-shrunk in case there is a change in the cpu list.
* So no deletion is needed.
*/
adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
parent->effective_cpus);
part_error = cpumask_equal(tmp->addmask,
parent->effective_cpus);
}
if (cmd == partcmd_update) {
int prev_prs = cpuset->partition_root_state;
/*
* Check for possible transition between PRS_ENABLED
* and PRS_ERROR.
*/
switch (cpuset->partition_root_state) {
case PRS_ENABLED:
if (part_error)
cpuset->partition_root_state = PRS_ERROR;
break;
case PRS_ERROR:
if (!part_error)
cpuset->partition_root_state = PRS_ENABLED;
break;
}
/*
* Set part_error if previously in invalid state.
*/
part_error = (prev_prs == PRS_ERROR);
}
if (!part_error && (cpuset->partition_root_state == PRS_ERROR))
return 0; /* Nothing needs to be done */
if (cpuset->partition_root_state == PRS_ERROR) {
/*
* Remove all its cpus from parent's subparts_cpus.
*/
adding = false;
deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
parent->subparts_cpus);
}
if (!adding && !deleting)
return 0;
/*
* Change the parent's subparts_cpus.
* Newly added CPUs will be removed from effective_cpus and
* newly deleted ones will be added back to effective_cpus.
*/
spin_lock_irq(&callback_lock);
if (adding) {
cpumask_or(parent->subparts_cpus,
parent->subparts_cpus, tmp->addmask);
cpumask_andnot(parent->effective_cpus,
parent->effective_cpus, tmp->addmask);
}
if (deleting) {
cpumask_andnot(parent->subparts_cpus,
parent->subparts_cpus, tmp->delmask);
/*
* Some of the CPUs in subparts_cpus might have been offlined.
*/
cpumask_and(tmp->delmask, tmp->delmask, cpu_active_mask);
cpumask_or(parent->effective_cpus,
parent->effective_cpus, tmp->delmask);
}
parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
spin_unlock_irq(&callback_lock);
return cmd == partcmd_update;
}
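The delmask/addmask formulas for partcmd_update with a new cpumask can
likewise be illustrated stand-alone with hypothetical plain bitmasks
(bit i stands for CPU i):
	#include <stdio.h>

	int main(void)
	{
		unsigned long cpus_allowed = 0x0c;	/* partition currently asks for CPUs 2-3 */
		unsigned long newmask      = 0x18;	/* new request: CPUs 3-4 */
		unsigned long parent_sub   = 0x0c;	/* parent gave CPUs 2-3 to this partition */
		unsigned long parent_eff   = 0x33;	/* parent's remaining effective CPUs 0,1,4,5 */

		unsigned long delmask = cpus_allowed & ~newmask & parent_sub;	/* CPU 2 goes back */
		unsigned long addmask = newmask & parent_eff & ~parent_sub;	/* CPU 4 is taken */

		printf("del=0x%lx add=0x%lx\n", delmask, addmask);	/* del=0x4 add=0x10 */
		return 0;
	}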
/*
* update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
* @cs: the cpuset to consider
* @new_cpus: temp variable for calculating new effective_cpus
* @tmp: temp variables for calculating effective_cpus & partition setup
*
* When the configured cpumask is changed, the effective cpumasks of this cpuset
* and all its descendants need to be updated.
......@@ -893,7 +1256,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
*
* Called with cpuset_mutex held
*/
static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
{
struct cpuset *cp;
struct cgroup_subsys_state *pos_css;
......@@ -903,27 +1266,115 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
struct cpuset *parent = parent_cs(cp);
cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
compute_effective_cpumask(tmp->new_cpus, cp, parent);
/*
* If it becomes empty, inherit the effective mask of the
* parent, which is guaranteed to have some CPUs.
*/
if (is_in_v2_mode() && cpumask_empty(new_cpus))
cpumask_copy(new_cpus, parent->effective_cpus);
if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
cpumask_copy(tmp->new_cpus, parent->effective_cpus);
if (!cp->use_parent_ecpus) {
cp->use_parent_ecpus = true;
parent->child_ecpus_count++;
}
} else if (cp->use_parent_ecpus) {
cp->use_parent_ecpus = false;
WARN_ON_ONCE(!parent->child_ecpus_count);
parent->child_ecpus_count--;
}
/* Skip the whole subtree if the cpumask remains the same. */
if (cpumask_equal(new_cpus, cp->effective_cpus)) {
/*
* Skip the whole subtree if the cpumask remains the same
* and has no partition root state.
*/
if (!cp->partition_root_state &&
cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
}
/*
* update_parent_subparts_cpumask() should have been called
* for cs already in update_cpumask(). We should also call
* update_tasks_cpumask() again for tasks in the parent
* cpuset if the parent's subparts_cpus changes.
*/
if ((cp != cs) && cp->partition_root_state) {
switch (parent->partition_root_state) {
case PRS_DISABLED:
/*
* If parent is not a partition root or an
* invalid partition root, clear the partition
* state and the CS_CPU_EXCLUSIVE flag.
*/
WARN_ON_ONCE(cp->partition_root_state
!= PRS_ERROR);
cp->partition_root_state = 0;
/*
* clear_bit() is an atomic operation and
* readers aren't interested in the state
* of CS_CPU_EXCLUSIVE anyway. So we can
* just update the flag without holding
* the callback_lock.
*/
clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
break;
case PRS_ENABLED:
if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
update_tasks_cpumask(parent);
break;
case PRS_ERROR:
/*
* When the parent is invalid, the child has to be invalid too.
*/
cp->partition_root_state = PRS_ERROR;
if (cp->nr_subparts_cpus) {
cp->nr_subparts_cpus = 0;
cpumask_clear(cp->subparts_cpus);
}
break;
}
}
if (!css_tryget_online(&cp->css))
continue;
rcu_read_unlock();
spin_lock_irq(&callback_lock);
cpumask_copy(cp->effective_cpus, new_cpus);
cpumask_copy(cp->effective_cpus, tmp->new_cpus);
if (cp->nr_subparts_cpus &&
(cp->partition_root_state != PRS_ENABLED)) {
cp->nr_subparts_cpus = 0;
cpumask_clear(cp->subparts_cpus);
} else if (cp->nr_subparts_cpus) {
/*
* Make sure that effective_cpus & subparts_cpus
* are mutually exclusive.
*
* In the unlikely event that effective_cpus
* becomes empty, we clear cp->nr_subparts_cpus and
* let its child partition roots compete for
* CPUs again.
*/
cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
cp->subparts_cpus);
if (cpumask_empty(cp->effective_cpus)) {
cpumask_copy(cp->effective_cpus, tmp->new_cpus);
cpumask_clear(cp->subparts_cpus);
cp->nr_subparts_cpus = 0;
} else if (!cpumask_subset(cp->subparts_cpus,
tmp->new_cpus)) {
cpumask_andnot(cp->subparts_cpus,
cp->subparts_cpus, tmp->new_cpus);
cp->nr_subparts_cpus
= cpumask_weight(cp->subparts_cpus);
}
}
spin_unlock_irq(&callback_lock);
WARN_ON(!is_in_v2_mode() &&
......@@ -932,11 +1383,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
update_tasks_cpumask(cp);
/*
* If the effective cpumask of any non-empty cpuset is changed,
* we need to rebuild sched domains.
* On legacy hierarchy, if the effective cpumask of any non-
* empty cpuset is changed, we need to rebuild sched domains.
* On default hierarchy, the cpuset needs to be a partition
* root as well.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
is_sched_load_balance(cp))
is_sched_load_balance(cp) &&
(!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
is_partition_root(cp)))
need_rebuild_sched_domains = true;
rcu_read_lock();
......@@ -948,6 +1403,35 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
rebuild_sched_domains_locked();
}
/**
* update_sibling_cpumasks - Update siblings' cpumasks
* @parent: Parent cpuset
* @cs: Current cpuset
* @tmp: Temp variables
*/
static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
struct tmpmasks *tmp)
{
struct cpuset *sibling;
struct cgroup_subsys_state *pos_css;
/*
* Check all its siblings and call update_cpumasks_hier()
* if their use_parent_ecpus flag is set in order for them
* to use the right effective_cpus value.
*/
rcu_read_lock();
cpuset_for_each_child(sibling, pos_css, parent) {
if (sibling == cs)
continue;
if (!sibling->use_parent_ecpus)
continue;
update_cpumasks_hier(sibling, tmp);
}
rcu_read_unlock();
}
/**
* update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
* @cs: the cpuset to consider
......@@ -958,6 +1442,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
const char *buf)
{
int retval;
struct tmpmasks tmp;
/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
if (cs == &top_cpuset)
......@@ -989,12 +1474,50 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (retval < 0)
return retval;
#ifdef CONFIG_CPUMASK_OFFSTACK
/*
* Use the cpumasks in trialcs for tmpmasks when they are pointers
* to allocated cpumasks.
*/
tmp.addmask = trialcs->subparts_cpus;
tmp.delmask = trialcs->effective_cpus;
tmp.new_cpus = trialcs->cpus_allowed;
#endif
if (cs->partition_root_state) {
/* Cpumask of a partition root cannot be empty */
if (cpumask_empty(trialcs->cpus_allowed))
return -EINVAL;
if (update_parent_subparts_cpumask(cs, partcmd_update,
trialcs->cpus_allowed, &tmp) < 0)
return -EINVAL;
}
spin_lock_irq(&callback_lock);
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
/*
* Make sure that subparts_cpus is a subset of cpus_allowed.
*/
if (cs->nr_subparts_cpus) {
cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
cs->cpus_allowed);
cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
}
spin_unlock_irq(&callback_lock);
/* use trialcs->cpus_allowed as a temp variable */
update_cpumasks_hier(cs, trialcs->cpus_allowed);
update_cpumasks_hier(cs, &tmp);
if (cs->partition_root_state) {
struct cpuset *parent = parent_cs(cs);
/*
* For partition root, update the cpumasks of sibling
* cpusets if they use parent's effective_cpus.
*/
if (parent->child_ecpus_count)
update_sibling_cpumasks(parent, cs, &tmp);
}
return 0;
}
......@@ -1348,7 +1871,95 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
if (spread_flag_changed)
update_tasks_flags(cs);
out:
free_trial_cpuset(trialcs);
free_cpuset(trialcs);
return err;
}
/*
* update_prstate - update partition_root_state
* cs: the cpuset to update
* val: 0 - disabled, 1 - enabled
*
* Call with cpuset_mutex held.
*/
static int update_prstate(struct cpuset *cs, int val)
{
int err;
struct cpuset *parent = parent_cs(cs);
struct tmpmasks tmp;
if ((val != 0) && (val != 1))
return -EINVAL;
if (val == cs->partition_root_state)
return 0;
/*
* Cannot force a partial or invalid partition root to a full
* partition root.
*/
if (val && cs->partition_root_state)
return -EINVAL;
if (alloc_cpumasks(NULL, &tmp))
return -ENOMEM;
err = -EINVAL;
if (!cs->partition_root_state) {
/*
* Turning on partition root requires setting the
* CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
* cannot be empty.
*/
if (cpumask_empty(cs->cpus_allowed))
goto out;
err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
if (err)
goto out;
err = update_parent_subparts_cpumask(cs, partcmd_enable,
NULL, &tmp);
if (err) {
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
goto out;
}
cs->partition_root_state = PRS_ENABLED;
} else {
/*
* Turning off partition root will clear the
* CS_CPU_EXCLUSIVE bit.
*/
if (cs->partition_root_state == PRS_ERROR) {
cs->partition_root_state = 0;
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
err = 0;
goto out;
}
err = update_parent_subparts_cpumask(cs, partcmd_disable,
NULL, &tmp);
if (err)
goto out;
cs->partition_root_state = 0;
/* Turning off CS_CPU_EXCLUSIVE will not return error */
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
}
/*
* Update cpumask of parent's tasks except when it is the top
* cpuset as some system daemons cannot be mapped to other CPUs.
*/
if (parent != &top_cpuset)
update_tasks_cpumask(parent);
if (parent->child_ecpus_count)
update_sibling_cpumasks(parent, cs, &tmp);
rebuild_sched_domains_locked();
out:
free_cpumasks(NULL, &tmp);
return err;
}
......@@ -1498,10 +2109,8 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
cgroup_taskset_first(tset, &css);
cs = css_cs(css);
mutex_lock(&cpuset_mutex);
css_cs(css)->attach_in_progress--;
......@@ -1593,10 +2202,12 @@ typedef enum {
FILE_MEMLIST,
FILE_EFFECTIVE_CPULIST,
FILE_EFFECTIVE_MEMLIST,
FILE_SUBPARTS_CPULIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
FILE_SCHED_LOAD_BALANCE,
FILE_PARTITION_ROOT,
FILE_SCHED_RELAX_DOMAIN_LEVEL,
FILE_MEMORY_PRESSURE_ENABLED,
FILE_MEMORY_PRESSURE,
......@@ -1732,7 +2343,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
break;
}
free_trial_cpuset(trialcs);
free_cpuset(trialcs);
out_unlock:
mutex_unlock(&cpuset_mutex);
kernfs_unbreak_active_protection(of->kn);
......@@ -1770,6 +2381,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
case FILE_EFFECTIVE_MEMLIST:
seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
break;
case FILE_SUBPARTS_CPULIST:
seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->subparts_cpus));
break;
default:
ret = -EINVAL;
}
......@@ -1824,12 +2438,60 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
return 0;
}
static int sched_partition_show(struct seq_file *seq, void *v)
{
struct cpuset *cs = css_cs(seq_css(seq));
switch (cs->partition_root_state) {
case PRS_ENABLED:
seq_puts(seq, "root\n");
break;
case PRS_DISABLED:
seq_puts(seq, "member\n");
break;
case PRS_ERROR:
seq_puts(seq, "root invalid\n");
break;
}
return 0;
}
static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
struct cpuset *cs = css_cs(of_css(of));
int val;
int retval = -ENODEV;
buf = strstrip(buf);
/*
* Convert "root" to ENABLED, and convert "member" to DISABLED.
*/
if (!strcmp(buf, "root"))
val = PRS_ENABLED;
else if (!strcmp(buf, "member"))
val = PRS_DISABLED;
else
return -EINVAL;
css_get(&cs->css);
mutex_lock(&cpuset_mutex);
if (!is_cpuset_online(cs))
goto out_unlock;
retval = update_prstate(cs, val);
out_unlock:
mutex_unlock(&cpuset_mutex);
css_put(&cs->css);
return retval ?: nbytes;
}
/*
* for the common functions, 'private' gives the type of file
*/
static struct cftype files[] = {
static struct cftype legacy_files[] = {
{
.name = "cpus",
.seq_show = cpuset_common_seq_show,
......@@ -1931,6 +2593,60 @@ static struct cftype files[] = {
{ } /* terminate */
};
/*
* This is currently a minimal set for the default hierarchy. It can be
* expanded later on by migrating more features and control files from v1.
*/
static struct cftype dfl_files[] = {
{
.name = "cpus",
.seq_show = cpuset_common_seq_show,
.write = cpuset_write_resmask,
.max_write_len = (100U + 6 * NR_CPUS),
.private = FILE_CPULIST,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "mems",
.seq_show = cpuset_common_seq_show,
.write = cpuset_write_resmask,
.max_write_len = (100U + 6 * MAX_NUMNODES),
.private = FILE_MEMLIST,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "cpus.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_CPULIST,
},
{
.name = "mems.effective",
.seq_show = cpuset_common_seq_show,
.private = FILE_EFFECTIVE_MEMLIST,
},
{
.name = "cpus.partition",
.seq_show = sched_partition_show,
.write = sched_partition_write,
.private = FILE_PARTITION_ROOT,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "cpus.subpartitions",
.seq_show = cpuset_common_seq_show,
.private = FILE_SUBPARTS_CPULIST,
.flags = CFTYPE_DEBUG,
},
{ } /* terminate */
};
/*
* cpuset_css_alloc - allocate a cpuset css
* cgrp: control group that the new cpuset will be part of
......@@ -1947,26 +2663,19 @@ cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
cs = kzalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
goto free_cs;
if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
goto free_cpus;
if (alloc_cpumasks(cs, NULL)) {
kfree(cs);
return ERR_PTR(-ENOMEM);
}
set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
cpumask_clear(cs->effective_cpus);
nodes_clear(cs->effective_mems);
fmeter_init(&cs->fmeter);
cs->relax_domain_level = -1;
return &cs->css;
free_cpus:
free_cpumask_var(cs->cpus_allowed);
free_cs:
kfree(cs);
return ERR_PTR(-ENOMEM);
}
static int cpuset_css_online(struct cgroup_subsys_state *css)
......@@ -1993,6 +2702,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
if (is_in_v2_mode()) {
cpumask_copy(cs->effective_cpus, parent->effective_cpus);
cs->effective_mems = parent->effective_mems;
cs->use_parent_ecpus = true;
parent->child_ecpus_count++;
}
spin_unlock_irq(&callback_lock);
......@@ -2035,7 +2746,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
/*
* If the cpuset being removed has its flag 'sched_load_balance'
* enabled, then simulate turning sched_load_balance off, which
* will call rebuild_sched_domains_locked().
* will call rebuild_sched_domains_locked(). That is not needed
* in the default hierarchy where only changes in partition
* will cause repartitioning.
*
* If the cpuset has the 'sched.partition' flag enabled, simulate
* turning 'sched.partition" off.
*/
static void cpuset_css_offline(struct cgroup_subsys_state *css)
......@@ -2044,9 +2760,20 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
mutex_lock(&cpuset_mutex);
if (is_sched_load_balance(cs))
if (is_partition_root(cs))
update_prstate(cs, 0);
if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
is_sched_load_balance(cs))
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
if (cs->use_parent_ecpus) {
struct cpuset *parent = parent_cs(cs);
cs->use_parent_ecpus = false;
parent->child_ecpus_count--;
}
cpuset_dec();
clear_bit(CS_ONLINE, &cs->flags);
......@@ -2057,9 +2784,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
{
struct cpuset *cs = css_cs(css);
free_cpumask_var(cs->effective_cpus);
free_cpumask_var(cs->cpus_allowed);
kfree(cs);
free_cpuset(cs);
}
static void cpuset_bind(struct cgroup_subsys_state *root_css)
......@@ -2105,8 +2830,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
.post_attach = cpuset_post_attach,
.bind = cpuset_bind,
.fork = cpuset_fork,
.legacy_cftypes = files,
.legacy_cftypes = legacy_files,
.dfl_cftypes = dfl_files,
.early_init = true,
.threaded = true,
};
/**
......@@ -2121,6 +2848,7 @@ int __init cpuset_init(void)
BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
......@@ -2227,20 +2955,29 @@ hotplug_update_tasks(struct cpuset *cs,
update_tasks_nodemask(cs);
}
static bool force_rebuild;
void cpuset_force_rebuild(void)
{
force_rebuild = true;
}
/**
* cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
* @cs: cpuset in interest
* @tmp: the tmpmasks structure pointer
*
* Compare @cs's cpu and mem masks against top_cpuset and if some have gone
* offline, update @cs accordingly. If @cs ends up with no CPU or memory,
* all its tasks are moved to the nearest ancestor with both resources.
*/
static void cpuset_hotplug_update_tasks(struct cpuset *cs)
static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
{
static cpumask_t new_cpus;
static nodemask_t new_mems;
bool cpus_updated;
bool mems_updated;
struct cpuset *parent;
retry:
wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
......@@ -2255,9 +2992,60 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
goto retry;
}
cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);
nodes_and(new_mems, cs->mems_allowed, parent_cs(cs)->effective_mems);
parent = parent_cs(cs);
compute_effective_cpumask(&new_cpus, cs, parent);
nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
if (cs->nr_subparts_cpus)
/*
* Make sure that CPUs allocated to child partitions
* do not show up in effective_cpus.
*/
cpumask_andnot(&new_cpus, &new_cpus, cs->subparts_cpus);
if (!tmp || !cs->partition_root_state)
goto update_tasks;
/*
* In the unlikely event that a partition root has empty
* effective_cpus or its parent becomes erroneous, we have to
* transition it to the erroneous state.
*/
if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
(parent->partition_root_state == PRS_ERROR))) {
if (cs->nr_subparts_cpus) {
cs->nr_subparts_cpus = 0;
cpumask_clear(cs->subparts_cpus);
compute_effective_cpumask(&new_cpus, cs, parent);
}
/*
* If the effective_cpus is empty because the child
* partitions take away all the CPUs, we can keep
* the current partition and let the child partitions
* fight for available CPUs.
*/
if ((parent->partition_root_state == PRS_ERROR) ||
cpumask_empty(&new_cpus)) {
update_parent_subparts_cpumask(cs, partcmd_disable,
NULL, tmp);
cs->partition_root_state = PRS_ERROR;
}
cpuset_force_rebuild();
}
/*
* On the other hand, an erroneous partition root may be transitioned
* back to a regular one, or a partition root with no CPU allocated
* from the parent may change to erroneous.
*/
if (is_partition_root(parent) &&
((cs->partition_root_state == PRS_ERROR) ||
!cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
cpuset_force_rebuild();
update_tasks:
cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
mems_updated = !nodes_equal(new_mems, cs->effective_mems);
......@@ -2271,13 +3059,6 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
mutex_unlock(&cpuset_mutex);
}
static bool force_rebuild;
void cpuset_force_rebuild(void)
{
force_rebuild = true;
}
/**
* cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
*
......@@ -2300,6 +3081,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
static nodemask_t new_mems;
bool cpus_updated, mems_updated;
bool on_dfl = is_in_v2_mode();
struct tmpmasks tmp, *ptmp = NULL;
if (on_dfl && !alloc_cpumasks(NULL, &tmp))
ptmp = &tmp;
mutex_lock(&cpuset_mutex);
......@@ -2307,6 +3092,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
cpumask_copy(&new_cpus, cpu_active_mask);
new_mems = node_states[N_MEMORY];
/*
* If subparts_cpus is populated, it is likely that the check below
* will produce a false positive on cpus_updated when the cpu list
* isn't changed. It is extra work, but it is better to be safe.
*/
cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
......@@ -2315,6 +3105,22 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
spin_lock_irq(&callback_lock);
if (!on_dfl)
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
/*
* Make sure that CPUs allocated to child partitions
* do not show up in effective_cpus. If no CPU is left,
* we clear the subparts_cpus & let the child partitions
* fight for the CPUs again.
*/
if (top_cpuset.nr_subparts_cpus) {
if (cpumask_subset(&new_cpus,
top_cpuset.subparts_cpus)) {
top_cpuset.nr_subparts_cpus = 0;
cpumask_clear(top_cpuset.subparts_cpus);
} else {
cpumask_andnot(&new_cpus, &new_cpus,
top_cpuset.subparts_cpus);
}
}
cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
spin_unlock_irq(&callback_lock);
/* we don't mess with cpumasks of tasks in top_cpuset */
......@@ -2343,7 +3149,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
continue;
rcu_read_unlock();
cpuset_hotplug_update_tasks(cs);
cpuset_hotplug_update_tasks(cs, ptmp);
rcu_read_lock();
css_put(&cs->css);
......@@ -2356,6 +3162,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
force_rebuild = false;
rebuild_sched_domains();
}
free_cpumasks(NULL, ptmp);
}
void cpuset_update_active_cpus(void)
......
......@@ -373,11 +373,9 @@ struct cgroup_subsys debug_cgrp_subsys = {
* On v2, debug is an implicit controller enabled by "cgroup_debug" boot
* parameter.
*/
static int __init enable_cgroup_debug(char *str)
void __init enable_debug_cgroup(void)
{
debug_cgrp_subsys.dfl_cftypes = debug_files;
debug_cgrp_subsys.implicit_on_dfl = true;
debug_cgrp_subsys.threaded = true;
return 1;
}
__setup("cgroup_debug", enable_cgroup_debug);