Commit 895b9b12 authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Added Michal Koutný as a maintainer

 - Counters in pids.events were behaving inconsistently. pids.events
   made properly hierarchical and pids.events.local added

 - misc.peak and misc.events.local added

 - cpuset remote partition creation and cpuset.cpus.exclusive handling
   improved

 - Code cleanups, non-critical fixes, doc updates

 - for-6.10-fixes is merged in to receive two non-critical fixes that
   didn't trigger pull

* tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
  cgroup: Add Michal Koutný as a maintainer
  cgroup/misc: Introduce misc.events.local
  cgroup/rstat: add force idle show helper
  cgroup: Protect css->cgroup write under css_set_lock
  cgroup/misc: Introduce misc.peak
  cgroup_misc: add kernel-doc comments for enum misc_res_type
  cgroup/cpuset: Prevent UAF in proc_cpuset_show()
  selftest/cgroup: Update test_cpuset_prs.sh to match changes
  cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus
  cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition
  selftest/cgroup: Fix test_cpuset_prs.sh problems reported by test robot
  cgroup/cpuset: Fix remote root partition creation problem
  cgroup: avoid the unnecessary list_add(dying_tasks) in cgroup_exit()
  cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls
  cgroup/cpuset: Reduce the lock protecting CS_SCHED_LOAD_BALANCE
  kernel/cgroup: cleanup cgroup_base_files when fail to add cgroup_psi_files
  selftests: cgroup: Add basic tests for pids controller
  selftests: cgroup: Lexicographic order in Makefile
  cgroup/pids: Add pids.events.local
  cgroup/pids: Make event counters hierarchical
  ...
parents f97b956b 9283ff5b
......@@ -36,7 +36,8 @@ superset of parent/child/pids.current.
The pids.events file contains event counters:
- max: Number of times fork failed because limit was hit.
- max: Number of times fork failed in the cgroup because limit was hit in
self or ancestors.
Example
-------
......
......@@ -239,6 +239,13 @@ cgroup v2 currently supports the following mount options.
will not be tracked by the memory controller (even if cgroup
v2 is remounted later on).
pids_localevents
The option restores v1-like behavior of pids.events:max, that is only
local (inside cgroup proper) fork failures are counted. Without this
option pids.events.max represents any pids.max enforcemnt across
cgroup's subtree.
Organizing Processes and Threads
--------------------------------
......@@ -2205,12 +2212,18 @@ PID Interface Files
descendants has ever reached.
pids.events
A read-only flat-keyed file which exists on non-root cgroups. The
following entries are defined. Unless specified otherwise, a value
change in this file generates a file modified event.
A read-only flat-keyed file which exists on non-root cgroups. Unless
specified otherwise, a value change in this file generates a file
modified event. The following entries are defined.
max
Number of times fork failed because limit was hit.
The number of times the cgroup's total number of processes hit the pids.max
limit (see also pids_localevents).
pids.events.local
Similar to pids.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
Organisational operations are not blocked by cgroup policies, so it is
possible to have pids.current > pids.max. This can be done by either
......@@ -2346,8 +2359,12 @@ Cpuset Interface Files
is always a subset of it.
Users can manually set it to a value that is different from
"cpuset.cpus". The only constraint in setting it is that the
list of CPUs must be exclusive with respect to its sibling.
"cpuset.cpus". One constraint in setting it is that the list of
CPUs must be exclusive with respect to "cpuset.cpus.exclusive"
of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup
isn't set, its "cpuset.cpus" value, if set, cannot be a subset
of it to leave at least one CPU available when the exclusive
CPUs are taken away.
For a parent cgroup, any one of its exclusive CPUs can only
be distributed to at most one of its child cgroups. Having an
......@@ -2363,8 +2380,8 @@ Cpuset Interface Files
cpuset-enabled cgroups.
This file shows the effective set of exclusive CPUs that
can be used to create a partition root. The content of this
file will always be a subset of "cpuset.cpus" and its parent's
can be used to create a partition root. The content
of this file will always be a subset of its parent's
"cpuset.cpus.exclusive.effective" if its parent is not the root
cgroup. It will also be a subset of "cpuset.cpus.exclusive"
if it is set. If "cpuset.cpus.exclusive" is not set, it is
......@@ -2625,6 +2642,15 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
res_a 3
res_b 0
misc.peak
A read-only flat-keyed file shown in all cgroups. It shows the
historical maximum usage of the resources in the cgroup and its
children.::
$ cat misc.peak
res_a 10
res_b 8
misc.max
A read-write flat-keyed file shown in the non root cgroups. Allowed
maximum usage of the resources in the cgroup and its children.::
......@@ -2654,6 +2680,11 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
The number of times the cgroup's resource usage was
about to go over the max boundary.
misc.events.local
Similar to misc.events but the fields in the file are local to the
cgroup i.e. not hierarchical. The file modified event generated on
this file reflects only the local events.
Migration and Ownership
~~~~~~~~~~~~~~~~~~~~~~~
......
......@@ -5528,6 +5528,7 @@ CONTROL GROUP (CGROUP)
M: Tejun Heo <tj@kernel.org>
M: Zefan Li <lizefan.x@bytedance.com>
M: Johannes Weiner <hannes@cmpxchg.org>
M: Michal Koutný <mkoutny@suse.com>
L: cgroups@vger.kernel.org
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
......
......@@ -119,7 +119,12 @@ enum {
/*
* Enable hugetlb accounting for the memory controller.
*/
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
/*
* Enable legacy local pids.events.
*/
CGRP_ROOT_PIDS_LOCAL_EVENTS = (1 << 20),
};
/* cftype->flags */
......
......@@ -9,15 +9,16 @@
#define _MISC_CGROUP_H_
/**
* Types of misc cgroup entries supported by the host.
* enum misc_res_type - Types of misc cgroup entries supported by the host.
*/
enum misc_res_type {
#ifdef CONFIG_KVM_AMD_SEV
/* AMD SEV ASIDs resource */
/** @MISC_CG_RES_SEV: AMD SEV ASIDs resource */
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
/** @MISC_CG_RES_SEV_ES: AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
#endif
/** @MISC_CG_RES_TYPES: count of enum misc_res_type constants */
MISC_CG_RES_TYPES
};
......@@ -30,13 +31,16 @@ struct misc_cg;
/**
* struct misc_res: Per cgroup per misc type resource
* @max: Maximum limit on the resource.
* @watermark: Historical maximum usage of the resource.
* @usage: Current usage of the resource.
* @events: Number of times, the resource limit exceeded.
*/
struct misc_res {
u64 max;
atomic64_t watermark;
atomic64_t usage;
atomic64_t events;
atomic64_t events_local;
};
/**
......@@ -50,6 +54,8 @@ struct misc_cg {
/* misc.events */
struct cgroup_file events_file;
/* misc.events.local */
struct cgroup_file events_local_file;
struct misc_res res[MISC_CG_RES_TYPES];
};
......
......@@ -1744,8 +1744,11 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
if (cgroup_psi_enabled()) {
ret = cgroup_addrm_files(css, cgrp,
cgroup_psi_files, true);
if (ret < 0)
if (ret < 0) {
cgroup_addrm_files(css, cgrp,
cgroup_base_files, false);
return ret;
}
}
} else {
ret = cgroup_addrm_files(css, cgrp,
......@@ -1839,9 +1842,9 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
RCU_INIT_POINTER(scgrp->subsys[ssid], NULL);
rcu_assign_pointer(dcgrp->subsys[ssid], css);
ss->root = dst_root;
css->cgroup = dcgrp;
spin_lock_irq(&css_set_lock);
css->cgroup = dcgrp;
WARN_ON(!list_empty(&dcgrp->e_csets[ss->id]));
list_for_each_entry_safe(cset, cset_pos, &scgrp->e_csets[ss->id],
e_cset_node[ss->id]) {
......@@ -1922,6 +1925,7 @@ enum cgroup2_param {
Opt_memory_localevents,
Opt_memory_recursiveprot,
Opt_memory_hugetlb_accounting,
Opt_pids_localevents,
nr__cgroup2_params
};
......@@ -1931,6 +1935,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
fsparam_flag("memory_localevents", Opt_memory_localevents),
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
fsparam_flag("pids_localevents", Opt_pids_localevents),
{}
};
......@@ -1960,6 +1965,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
case Opt_memory_hugetlb_accounting:
ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
return 0;
case Opt_pids_localevents:
ctx->flags |= CGRP_ROOT_PIDS_LOCAL_EVENTS;
return 0;
}
return -EINVAL;
}
......@@ -1989,6 +1997,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
if (root_flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
cgrp_dfl_root.flags |= CGRP_ROOT_PIDS_LOCAL_EVENTS;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_PIDS_LOCAL_EVENTS;
}
}
......@@ -2004,6 +2017,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
seq_puts(seq, ",memory_recursiveprot");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
seq_puts(seq, ",memory_hugetlb_accounting");
if (cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
seq_puts(seq, ",pids_localevents");
return 0;
}
......@@ -6686,8 +6701,10 @@ void cgroup_exit(struct task_struct *tsk)
WARN_ON_ONCE(list_empty(&tsk->cg_list));
cset = task_css_set(tsk);
css_set_move_task(tsk, cset, NULL, false);
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
cset->nr_tasks--;
/* matches the signal->live check in css_task_iter_advance() */
if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
if (dl_task(tsk))
dec_dl_tasks_cs(tsk);
......@@ -6714,10 +6731,12 @@ void cgroup_release(struct task_struct *task)
ss->release(task);
} while_each_subsys_mask();
spin_lock_irq(&css_set_lock);
css_set_skip_task_iters(task_css_set(task), task);
list_del_init(&task->cg_list);
spin_unlock_irq(&css_set_lock);
if (!list_empty(&task->cg_list)) {
spin_lock_irq(&css_set_lock);
css_set_skip_task_iters(task_css_set(task), task);
list_del_init(&task->cg_list);
spin_unlock_irq(&css_set_lock);
}
}
void cgroup_free(struct task_struct *task)
......@@ -7062,7 +7081,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
"favordynmods\n"
"memory_localevents\n"
"memory_recursiveprot\n"
"memory_hugetlb_accounting\n");
"memory_hugetlb_accounting\n"
"pids_localevents\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
......
This diff is collapsed.
......@@ -121,6 +121,30 @@ static void misc_cg_cancel_charge(enum misc_res_type type, struct misc_cg *cg,
misc_res_name[type]);
}
static void misc_cg_update_watermark(struct misc_res *res, u64 new_usage)
{
u64 old;
while (true) {
old = atomic64_read(&res->watermark);
if (new_usage <= old)
break;
if (atomic64_cmpxchg(&res->watermark, old, new_usage) == old)
break;
}
}
static void misc_cg_event(enum misc_res_type type, struct misc_cg *cg)
{
atomic64_inc(&cg->res[type].events_local);
cgroup_file_notify(&cg->events_local_file);
for (; parent_misc(cg); cg = parent_misc(cg)) {
atomic64_inc(&cg->res[type].events);
cgroup_file_notify(&cg->events_file);
}
}
/**
* misc_cg_try_charge() - Try charging the misc cgroup.
* @type: Misc res type to charge.
......@@ -159,14 +183,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
ret = -EBUSY;
goto err_charge;
}
misc_cg_update_watermark(res, new_usage);
}
return 0;
err_charge:
for (j = i; j; j = parent_misc(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}
misc_cg_event(type, i);
for (j = cg; j != i; j = parent_misc(j))
misc_cg_cancel_charge(type, j, amount);
......@@ -307,6 +329,29 @@ static int misc_cg_current_show(struct seq_file *sf, void *v)
return 0;
}
/**
* misc_cg_peak_show() - Show the peak usage of the misc cgroup.
* @sf: Interface file
* @v: Arguments passed
*
* Context: Any context.
* Return: 0 to denote successful print.
*/
static int misc_cg_peak_show(struct seq_file *sf, void *v)
{
int i;
u64 watermark;
struct misc_cg *cg = css_misc(seq_css(sf));
for (i = 0; i < MISC_CG_RES_TYPES; i++) {
watermark = atomic64_read(&cg->res[i].watermark);
if (READ_ONCE(misc_res_capacity[i]) || watermark)
seq_printf(sf, "%s %llu\n", misc_res_name[i], watermark);
}
return 0;
}
/**
* misc_cg_capacity_show() - Show the total capacity of misc res on the host.
* @sf: Interface file
......@@ -331,20 +376,33 @@ static int misc_cg_capacity_show(struct seq_file *sf, void *v)
return 0;
}
static int misc_events_show(struct seq_file *sf, void *v)
static int __misc_events_show(struct seq_file *sf, bool local)
{
struct misc_cg *cg = css_misc(seq_css(sf));
u64 events;
int i;
for (i = 0; i < MISC_CG_RES_TYPES; i++) {
events = atomic64_read(&cg->res[i].events);
if (local)
events = atomic64_read(&cg->res[i].events_local);
else
events = atomic64_read(&cg->res[i].events);
if (READ_ONCE(misc_res_capacity[i]) || events)
seq_printf(sf, "%s.max %llu\n", misc_res_name[i], events);
}
return 0;
}
static int misc_events_show(struct seq_file *sf, void *v)
{
return __misc_events_show(sf, false);
}
static int misc_events_local_show(struct seq_file *sf, void *v)
{
return __misc_events_show(sf, true);
}
/* Misc cgroup interface files */
static struct cftype misc_cg_files[] = {
{
......@@ -357,6 +415,10 @@ static struct cftype misc_cg_files[] = {
.name = "current",
.seq_show = misc_cg_current_show,
},
{
.name = "peak",
.seq_show = misc_cg_peak_show,
},
{
.name = "capacity",
.seq_show = misc_cg_capacity_show,
......@@ -368,6 +430,12 @@ static struct cftype misc_cg_files[] = {
.file_offset = offsetof(struct misc_cg, events_file),
.seq_show = misc_events_show,
},
{
.name = "events.local",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct misc_cg, events_local_file),
.seq_show = misc_events_local_show,
},
{}
};
......
......@@ -38,6 +38,14 @@
#define PIDS_MAX (PID_MAX_LIMIT + 1ULL)
#define PIDS_MAX_STR "max"
enum pidcg_event {
/* Fork failed in subtree because this pids_cgroup limit was hit. */
PIDCG_MAX,
/* Fork failed in this pids_cgroup because ancestor limit was hit. */
PIDCG_FORKFAIL,
NR_PIDCG_EVENTS,
};
struct pids_cgroup {
struct cgroup_subsys_state css;
......@@ -49,11 +57,12 @@ struct pids_cgroup {
atomic64_t limit;
int64_t watermark;
/* Handle for "pids.events" */
/* Handles for pids.events[.local] */
struct cgroup_file events_file;
struct cgroup_file events_local_file;
/* Number of times fork failed because limit was hit. */
atomic64_t events_limit;
atomic64_t events[NR_PIDCG_EVENTS];
atomic64_t events_local[NR_PIDCG_EVENTS];
};
static struct pids_cgroup *css_pids(struct cgroup_subsys_state *css)
......@@ -148,12 +157,13 @@ static void pids_charge(struct pids_cgroup *pids, int num)
* pids_try_charge - hierarchically try to charge the pid count
* @pids: the pid cgroup state
* @num: the number of pids to charge
* @fail: storage of pid cgroup causing the fail
*
* This function follows the set limit. It will fail if the charge would cause
* the new value to exceed the hierarchical limit. Returns 0 if the charge
* succeeded, otherwise -EAGAIN.
*/
static int pids_try_charge(struct pids_cgroup *pids, int num)
static int pids_try_charge(struct pids_cgroup *pids, int num, struct pids_cgroup **fail)
{
struct pids_cgroup *p, *q;
......@@ -166,9 +176,10 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
* p->limit is %PIDS_MAX then we know that this test will never
* fail.
*/
if (new > limit)
if (new > limit) {
*fail = p;
goto revert;
}
/*
* Not technically accurate if we go over limit somewhere up
* the hierarchy, but that's tolerable for the watermark.
......@@ -229,6 +240,36 @@ static void pids_cancel_attach(struct cgroup_taskset *tset)
}
}
static void pids_event(struct pids_cgroup *pids_forking,
struct pids_cgroup *pids_over_limit)
{
struct pids_cgroup *p = pids_forking;
bool limit = false;
/* Only log the first time limit is hit. */
if (atomic64_inc_return(&p->events_local[PIDCG_FORKFAIL]) == 1) {
pr_info("cgroup: fork rejected by pids controller in ");
pr_cont_cgroup_path(p->css.cgroup);
pr_cont("\n");
}
cgroup_file_notify(&p->events_local_file);
if (!cgroup_subsys_on_dfl(pids_cgrp_subsys) ||
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
return;
for (; parent_pids(p); p = parent_pids(p)) {
if (p == pids_over_limit) {
limit = true;
atomic64_inc(&p->events_local[PIDCG_MAX]);
cgroup_file_notify(&p->events_local_file);
}
if (limit)
atomic64_inc(&p->events[PIDCG_MAX]);
cgroup_file_notify(&p->events_file);
}
}
/*
* task_css_check(true) in pids_can_fork() and pids_cancel_fork() relies
* on cgroup_threadgroup_change_begin() held by the copy_process().
......@@ -236,7 +277,7 @@ static void pids_cancel_attach(struct cgroup_taskset *tset)
static int pids_can_fork(struct task_struct *task, struct css_set *cset)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
struct pids_cgroup *pids, *pids_over_limit;
int err;
if (cset)
......@@ -244,16 +285,10 @@ static int pids_can_fork(struct task_struct *task, struct css_set *cset)
else
css = task_css_check(current, pids_cgrp_id, true);
pids = css_pids(css);
err = pids_try_charge(pids, 1);
if (err) {
/* Only log the first time events_limit is incremented. */
if (atomic64_inc_return(&pids->events_limit) == 1) {
pr_info("cgroup: fork rejected by pids controller in ");
pr_cont_cgroup_path(css->cgroup);
pr_cont("\n");
}
cgroup_file_notify(&pids->events_file);
}
err = pids_try_charge(pids, 1, &pids_over_limit);
if (err)
pids_event(pids, pids_over_limit);
return err;
}
......@@ -337,15 +372,68 @@ static s64 pids_peak_read(struct cgroup_subsys_state *css,
return READ_ONCE(pids->watermark);
}
static int pids_events_show(struct seq_file *sf, void *v)
static int __pids_events_show(struct seq_file *sf, bool local)
{
struct pids_cgroup *pids = css_pids(seq_css(sf));
enum pidcg_event pe = PIDCG_MAX;
atomic64_t *events;
if (!cgroup_subsys_on_dfl(pids_cgrp_subsys) ||
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS) {
pe = PIDCG_FORKFAIL;
local = true;
}
events = local ? pids->events_local : pids->events;
seq_printf(sf, "max %lld\n", (s64)atomic64_read(&events[pe]));
return 0;
}
seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pids->events_limit));
static int pids_events_show(struct seq_file *sf, void *v)
{
__pids_events_show(sf, false);
return 0;
}
static int pids_events_local_show(struct seq_file *sf, void *v)
{
__pids_events_show(sf, true);
return 0;
}
static struct cftype pids_files[] = {
{
.name = "max",
.write = pids_max_write,
.seq_show = pids_max_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "current",
.read_s64 = pids_current_read,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = pids_peak_read,
},
{
.name = "events",
.seq_show = pids_events_show,
.file_offset = offsetof(struct pids_cgroup, events_file),
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "events.local",
.seq_show = pids_events_local_show,
.file_offset = offsetof(struct pids_cgroup, events_local_file),
.flags = CFTYPE_NOT_ON_ROOT,
},
{ } /* terminate */
};
static struct cftype pids_files_legacy[] = {
{
.name = "max",
.write = pids_max_write,
......@@ -371,6 +459,7 @@ static struct cftype pids_files[] = {
{ } /* terminate */
};
struct cgroup_subsys pids_cgrp_subsys = {
.css_alloc = pids_css_alloc,
.css_free = pids_css_free,
......@@ -379,7 +468,7 @@ struct cgroup_subsys pids_cgrp_subsys = {
.can_fork = pids_can_fork,
.cancel_fork = pids_cancel_fork,
.release = pids_release,
.legacy_cftypes = pids_files,
.legacy_cftypes = pids_files_legacy,
.dfl_cftypes = pids_files,
.threaded = true,
};
......@@ -594,49 +594,46 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
}
}
static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat *bstat)
{
#ifdef CONFIG_SCHED_CORE
u64 forceidle_time = bstat->forceidle_sum;
do_div(forceidle_time, NSEC_PER_USEC);
seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
#endif
}
void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
u64 usage, utime, stime;
struct cgroup_base_stat bstat;
#ifdef CONFIG_SCHED_CORE
u64 forceidle_time;
#endif
if (cgroup_parent(cgrp)) {
cgroup_rstat_flush_hold(cgrp);
usage = cgrp->bstat.cputime.sum_exec_runtime;
cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
&utime, &stime);
#ifdef CONFIG_SCHED_CORE
forceidle_time = cgrp->bstat.forceidle_sum;
#endif
cgroup_rstat_flush_release(cgrp);
} else {
root_cgroup_cputime(&bstat);
usage = bstat.cputime.sum_exec_runtime;
utime = bstat.cputime.utime;
stime = bstat.cputime.stime;
#ifdef CONFIG_SCHED_CORE
forceidle_time = bstat.forceidle_sum;
#endif
/* cgrp->bstat of root is not actually used, reuse it */
root_cgroup_cputime(&cgrp->bstat);
usage = cgrp->bstat.cputime.sum_exec_runtime;
utime = cgrp->bstat.cputime.utime;
stime = cgrp->bstat.cputime.stime;
}
do_div(usage, NSEC_PER_USEC);
do_div(utime, NSEC_PER_USEC);
do_div(stime, NSEC_PER_USEC);
#ifdef CONFIG_SCHED_CORE
do_div(forceidle_time, NSEC_PER_USEC);
#endif
seq_printf(seq, "usage_usec %llu\n"
"user_usec %llu\n"
"system_usec %llu\n",
usage, utime, stime);
#ifdef CONFIG_SCHED_CORE
seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
#endif
cgroup_force_idle_show(seq, &cgrp->bstat);
}
/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
......
# SPDX-License-Identifier: GPL-2.0-only
test_memcontrol
test_core
test_freezer
test_kmem
test_kill
test_cpu
test_cpuset
test_zswap
test_freezer
test_hugetlb_memcg
test_kill
test_kmem
test_memcontrol
test_pids
test_zswap
wait_inotify
......@@ -6,26 +6,29 @@ all: ${HELPER_PROGS}
TEST_FILES := with_stress.sh
TEST_PROGS := test_stress.sh test_cpuset_prs.sh test_cpuset_v1_hp.sh
TEST_GEN_FILES := wait_inotify
TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core
TEST_GEN_PROGS += test_freezer
TEST_GEN_PROGS += test_kill
# Keep the lists lexicographically sorted
TEST_GEN_PROGS = test_core
TEST_GEN_PROGS += test_cpu
TEST_GEN_PROGS += test_cpuset
TEST_GEN_PROGS += test_zswap
TEST_GEN_PROGS += test_freezer
TEST_GEN_PROGS += test_hugetlb_memcg
TEST_GEN_PROGS += test_kill
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_memcontrol
TEST_GEN_PROGS += test_pids
TEST_GEN_PROGS += test_zswap
LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h
include ../lib.mk
$(OUTPUT)/test_memcontrol: cgroup_util.c
$(OUTPUT)/test_kmem: cgroup_util.c
$(OUTPUT)/test_core: cgroup_util.c
$(OUTPUT)/test_freezer: cgroup_util.c
$(OUTPUT)/test_kill: cgroup_util.c
$(OUTPUT)/test_cpu: cgroup_util.c
$(OUTPUT)/test_cpuset: cgroup_util.c
$(OUTPUT)/test_zswap: cgroup_util.c
$(OUTPUT)/test_freezer: cgroup_util.c
$(OUTPUT)/test_hugetlb_memcg: cgroup_util.c
$(OUTPUT)/test_kill: cgroup_util.c
$(OUTPUT)/test_kmem: cgroup_util.c
$(OUTPUT)/test_memcontrol: cgroup_util.c
$(OUTPUT)/test_pids: cgroup_util.c
$(OUTPUT)/test_zswap: cgroup_util.c
......@@ -28,6 +28,14 @@ CPULIST=$(cat $CGROUP2/cpuset.cpus.effective)
NR_CPUS=$(lscpu | grep "^CPU(s):" | sed -e "s/.*:[[:space:]]*//")
[[ $NR_CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
# Check to see if /dev/console exists and is writable
if [[ -c /dev/console && -w /dev/console ]]
then
CONSOLE=/dev/console
else
CONSOLE=/dev/null
fi
# Set verbose flag and delay factor
PROG=$1
VERBOSE=0
......@@ -103,8 +111,8 @@ console_msg()
{
MSG=$1
echo "$MSG"
echo "" > /dev/console
echo "$MSG" > /dev/console
echo "" > $CONSOLE
echo "$MSG" > $CONSOLE
pause 0.01
}
......@@ -161,6 +169,14 @@ test_add_proc()
# T = put a task into cgroup
# O<c>=<v> = Write <v> to CPU online file of <c>
#
# ECPUs - effective CPUs of cpusets
# Pstate - partition root state
# ISOLCPUS - isolated CPUs (<icpus>[,<icpus2>])
#
# Note that if there are 2 fields in ISOLCPUS, the first one is for
# sched-debug matching which includes offline CPUs and single-CPU partitions
# while the second one is for matching cpuset.cpus.isolated.
#
SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
TEST_MATRIX=(
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
......@@ -220,23 +236,29 @@ TEST_MATRIX=(
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3:P2 . . 0 A1:0-1,A2:2-3,A3:2-3 A1:P0,A2:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X3:P2 . . 0 A1:0-2,A2:3,A3:3 A1:P0,A2:P2 3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2 . 0 A1:0-1,A2:1,A3:2-3 A1:P0,A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:C3 . 0 A1:0-2,A2:1-2,A3:3 A1:P0,A3:P2 3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:C3 . 0 A1:0-1,A2:1,A3:2-3 A1:P0,A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 C2-3 . . . P2 0 A1:0-3,A2:1-3,A3:2-3,B1:2-3 A1:P0,A3:P0,B1:P-2"
" C0-3:S+ C1-3:S+ C2-3 C4-5 . . . P2 0 B1:4-5 B1:P2 4-5"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2 P2 0 A3:2-3,B1:4 A3:P2,B1:P2 2-4"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2:C1-3 P2 0 A3:2-3,B1:4 A3:P2,B1:P2 2-4"
" C0-3:S+ C1-3:S+ C2-3 C4 X1-3 X1-3:P2 P2 . 0 A2:1,A3:2-3 A2:P2,A3:P2 1-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2 P2:C4-5 0 A3:2-3,B1:4-5 A3:P2,B1:P2 2-5"
" C4:X0-3:S+ X1-3:S+ X2-3 . . P2 . . 0 A1:4,A2:1-3,A3:1-3 A2:P2 1-3"
" C4:X0-3:S+ X1-3:S+ X2-3 . . . P2 . 0 A1:4,A2:4,A3:2-3 A3:P2 2-3"
# Nested remote/local partition tests
" C0-3:S+ C1-3:S+ C2-3 C4-5 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4-5 \
A1:P0,A2:P1,A3:P2,B1:P1 2-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4,2-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 . P1 0 A1:0-1,A2:2-3,A3:2-3,B1:4 \
A1:P0,A2:P1,A3:P0,B1:P1"
" C0-3:S+ C1-3:S+ C3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:2,A3:3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4,3"
" C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X4:P1 . 0 A1:0-1,A2:2-3,A3:4 \
A1:P0,A2:P2,A3:P1 2-4,2-3"
" C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X3-4:P1 . 0 A1:0-1,A2:2,A3:3-4 \
A1:P0,A2:P2,A3:P1 2"
" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
. . X5 . . 0 A1:0-4,A2:1-4,A3:2-4 \
A1:P0,A2:P-2,A3:P-1"
......@@ -262,8 +284,8 @@ TEST_MATRIX=(
. . X2-3 P2 . . 0 A1:0-2,A2:3,XA2:3 A2:P2 3"
# Invalid to valid local partition direct transition tests
" C1-3:S+:P2 C2-3:X1:P2 . . . . . . 0 A1:1-3,XA1:1-3,A2:2-3:XA2: A1:P2,A2:P-2 1-3"
" C1-3:S+:P2 C2-3:X1:P2 . . . X3:P2 . . 0 A1:1-2,XA1:1-3,A2:3:XA2:3 A1:P2,A2:P2 1-3"
" C1-3:S+:P2 X4:P2 . . . . . . 0 A1:1-3,XA1:1-3,A2:1-3:XA2: A1:P2,A2:P-2 1-3"
" C1-3:S+:P2 X4:P2 . . . X3:P2 . . 0 A1:1-2,XA1:1-3,A2:3:XA2:3 A1:P2,A2:P2 1-3"
" C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4,B1:4-6 A1:P-2,B1:P0"
" C0-3:P2 . . C4-6 C0-4:C0-3 . . . 0 A1:0-3,B1:4-6 A1:P2,B1:P0 0-3"
" C0-3:P2 . . C3-5:C4-5 . . . . 0 A1:0-3,B1:4-5 A1:P2,B1:P0 0-3"
......@@ -274,21 +296,18 @@ TEST_MATRIX=(
" C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \
. . X4 . . 0 A1:1-3,A2:1-3,A3:2-3,XA2:,XA3: A1:P2,A2:P-2,A3:P-2 1-3"
" C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \
. . C4 . . 0 A1:1-3,A2:1-3,A3:2-3,XA2:,XA3: A1:P2,A2:P-2,A3:P-2 1-3"
. . C4:X . . 0 A1:1-3,A2:1-3,A3:2-3,XA2:,XA3: A1:P2,A2:P-2,A3:P-2 1-3"
# Local partition CPU change tests
" C0-5:S+:P2 C4-5:S+:P1 . . . C3-5 . . 0 A1:0-2,A2:3-5 A1:P2,A2:P1 0-2"
" C0-5:S+:P2 C4-5:S+:P1 . . C1-5 . . . 0 A1:1-3,A2:4-5 A1:P2,A2:P1 1-3"
# cpus_allowed/exclusive_cpus update tests
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. C4 . P2 . 0 A1:4,A2:4,XA2:,XA3:,A3:4 \
. X:C4 . P2 . 0 A1:4,A2:4,XA2:,XA3:,A3:4 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. X1 . P2 . 0 A1:0-3,A2:1-3,XA1:1,XA2:,XA3:,A3:2-3 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. . C3 P2 . 0 A1:0-2,A2:0-2,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P2 3"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. . X3 P2 . 0 A1:0-2,A2:1-2,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P2 3"
......@@ -296,10 +315,7 @@ TEST_MATRIX=(
. . X3 . . 0 A1:0-3,A2:1-3,XA2:3,XA3:3,A3:2-3 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
. . C3 . . 0 A1:0-3,A2:3,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
. C4 . . . 0 A1:4,A2:4,A3:4,XA1:,XA2:,XA3 \
. X4 . . . 0 A1:0-3,A2:1-3,A3:2-3,XA1:4,XA2:,XA3 \
A1:P0,A3:P-2"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
......@@ -346,6 +362,9 @@ TEST_MATRIX=(
" C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P-1"
" C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P0,B1:P-1"
# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
" C0-3 . . C4-5 X5 . . . 0 A1:0-3,B1:4-5"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
# Failure cases:
......@@ -355,6 +374,9 @@ TEST_MATRIX=(
# Changes to cpuset.cpus.exclusive that violate exclusivity rule is rejected
" C0-3 . . C4-5 X0-3 . . X3-5 1 A1:0-3,B1:4-5"
# cpuset.cpus cannot be a subset of sibling cpuset.cpus.exclusive
" C0-3 . . C4-5 X3-5 . . . 1 A1:0-3,B1:4-5"
)
#
......@@ -556,14 +578,15 @@ check_cgroup_states()
do
set -- $(echo $CHK | sed -e "s/:/ /g")
CGRP=$1
CGRP_DIR=$CGRP
STATE=$2
FILE=
EVAL=$(expr substr $STATE 2 2)
[[ $CGRP = A2 ]] && CGRP=A1/A2
[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
[[ $CGRP = A2 ]] && CGRP_DIR=A1/A2
[[ $CGRP = A3 ]] && CGRP_DIR=A1/A2/A3
case $STATE in
P*) FILE=$CGRP/cpuset.cpus.partition
P*) FILE=$CGRP_DIR/cpuset.cpus.partition
;;
*) echo "Unknown state: $STATE!"
exit 1
......@@ -587,6 +610,16 @@ check_cgroup_states()
;;
esac
[[ $EVAL != $VAL ]] && return 1
#
# For root partition, dump sched-domains info to console if
# verbose mode set for manual comparison with sched debug info.
#
[[ $VAL -eq 1 && $VERBOSE -gt 0 ]] && {
DOMS=$(cat $CGRP_DIR/cpuset.cpus.effective)
[[ -n "$DOMS" ]] &&
echo " [$CGRP] sched-domain: $DOMS" > $CONSOLE
}
done
return 0
}
......@@ -694,9 +727,9 @@ null_isolcpus_check()
[[ $VERBOSE -gt 0 ]] || return 0
# Retry a few times before printing error
RETRY=0
while [[ $RETRY -lt 5 ]]
while [[ $RETRY -lt 8 ]]
do
pause 0.01
pause 0.02
check_isolcpus "."
[[ $? -eq 0 ]] && return 0
((RETRY++))
......@@ -726,7 +759,7 @@ run_state_test()
while [[ $I -lt $CNT ]]
do
echo "Running test $I ..." > /dev/console
echo "Running test $I ..." > $CONSOLE
[[ $VERBOSE -gt 1 ]] && {
echo ""
eval echo \${$TEST[$I]}
......@@ -783,7 +816,7 @@ run_state_test()
while [[ $NEWLIST != $CPULIST && $RETRY -lt 8 ]]
do
# Wait a bit longer & recheck a few times
pause 0.01
pause 0.02
((RETRY++))
NEWLIST=$(cat cpuset.cpus.effective)
done
......
// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
#include <errno.h>
#include <linux/limits.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include "../kselftest.h"
#include "cgroup_util.h"
static int run_success(const char *cgroup, void *arg)
{
return 0;
}
static int run_pause(const char *cgroup, void *arg)
{
return pause();
}
/*
* This test checks that pids.max prevents forking new children above the
* specified limit in the cgroup.
*/
static int test_pids_max(const char *root)
{
int ret = KSFT_FAIL;
char *cg_pids;
int pid;
cg_pids = cg_name(root, "pids_test");
if (!cg_pids)
goto cleanup;
if (cg_create(cg_pids))
goto cleanup;
if (cg_read_strcmp(cg_pids, "pids.max", "max\n"))
goto cleanup;
if (cg_write(cg_pids, "pids.max", "2"))
goto cleanup;
if (cg_enter_current(cg_pids))
goto cleanup;
pid = cg_run_nowait(cg_pids, run_pause, NULL);
if (pid < 0)
goto cleanup;
if (cg_run_nowait(cg_pids, run_success, NULL) != -1 || errno != EAGAIN)
goto cleanup;
if (kill(pid, SIGINT))
goto cleanup;
ret = KSFT_PASS;
cleanup:
cg_enter_current(root);
cg_destroy(cg_pids);
free(cg_pids);
return ret;
}
/*
* This test checks that pids.events are counted in cgroup associated with pids.max
*/
static int test_pids_events(const char *root)
{
int ret = KSFT_FAIL;
char *cg_parent = NULL, *cg_child = NULL;
int pid;
cg_parent = cg_name(root, "pids_parent");
cg_child = cg_name(cg_parent, "pids_child");
if (!cg_parent || !cg_child)
goto cleanup;
if (cg_create(cg_parent))
goto cleanup;
if (cg_write(cg_parent, "cgroup.subtree_control", "+pids"))
goto cleanup;
if (cg_create(cg_child))
goto cleanup;
if (cg_write(cg_parent, "pids.max", "2"))
goto cleanup;
if (cg_read_strcmp(cg_child, "pids.max", "max\n"))
goto cleanup;
if (cg_enter_current(cg_child))
goto cleanup;
pid = cg_run_nowait(cg_child, run_pause, NULL);
if (pid < 0)
goto cleanup;
if (cg_run_nowait(cg_child, run_success, NULL) != -1 || errno != EAGAIN)
goto cleanup;
if (kill(pid, SIGINT))
goto cleanup;
if (cg_read_key_long(cg_child, "pids.events", "max ") != 0)
goto cleanup;
if (cg_read_key_long(cg_parent, "pids.events", "max ") != 1)
goto cleanup;
ret = KSFT_PASS;
cleanup:
cg_enter_current(root);
if (cg_child)
cg_destroy(cg_child);
if (cg_parent)
cg_destroy(cg_parent);
free(cg_child);
free(cg_parent);
return ret;
}
#define T(x) { x, #x }
struct pids_test {
int (*fn)(const char *root);
const char *name;
} tests[] = {
T(test_pids_max),
T(test_pids_events),
};
#undef T
int main(int argc, char **argv)
{
char root[PATH_MAX];
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(tests));
if (cg_find_unified_root(root, sizeof(root), NULL))
ksft_exit_skip("cgroup v2 isn't mounted\n");
/*
* Check that pids controller is available:
* pids is listed in cgroup.controllers
*/
if (cg_read_strstr(root, "cgroup.controllers", "pids"))
ksft_exit_skip("pids controller isn't available\n");
if (cg_read_strstr(root, "cgroup.subtree_control", "pids"))
if (cg_write(root, "cgroup.subtree_control", "+pids"))
ksft_exit_skip("Failed to set pids controller\n");
for (int i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
ksft_test_result_pass("%s\n", tests[i].name);
break;
case KSFT_SKIP:
ksft_test_result_skip("%s\n", tests[i].name);
break;
default:
ksft_test_result_fail("%s\n", tests[i].name);
break;
}
}
ksft_finished();
}
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment