cgroup: convert to kernfs

cgroup filesystem code was derived from the original sysfs implementation which was heavily intertwined with vfs objects and locking with the goal of re-using the existing vfs infrastructure. That experiment turned out rather disastrous and sysfs switched, a long time ago, to distributed filesystem model where a separate representation is maintained which is queried by vfs. Unfortunately, cgroup stuck with the failed experiment all these years and accumulated even more problems over time. Locking and object lifetime management being entangled with vfs is probably the most egregious. vfs is never designed to be misused like this and cgroup ends up jumping through various convoluted dancing to make things work. Even then, operations across multiple cgroups can't be done safely as it'll deadlock with rename locking. Recently, kernfs is separated out from sysfs so that it can be used by users other than sysfs. This patch converts cgroup to use kernfs, which will bring the following benefits. * Separation from vfs internals. Locking and object lifetime management is contained in cgroup proper making things a lot simpler. This removes significant amount of locking convolutions, hairy object lifetime rules and the restriction on multi-cgroup operations. * Can drop a lot of code to implement filesystem interface as most are provided by kernfs. * Proper "severing" semantics, which allows controllers to not worry about lingering file accesses after offline. While the preceding patches did as much as possible to make the transition less painful, large part of the conversion has to be one discrete step making this patch rather large. The rest of the commit message lists notable changes in different areas. Overall ------- * vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn, cgroupfs_root->sb w/ ->kf_root. * All dentry accessors are removed. Helpers to map from kernfs constructs are added. * All vfs plumbing around dentry, inode and bdi removed. * cgroup_mount() now directly looks for matching root and then proceeds to create a new one if not found. Synchronization and object lifetime ----------------------------------- * vfs inode locking removed. Among other things, this removes the need for the convolution in cgroup_cfts_commit(). Future patches will further simplify it. * vfs refcnting replaced with cgroup internal ones. cgroup->refcnt, cgroupfs_root->refcnt added. cgroup_put_root() now directly puts root->refcnt and when it reaches zero proceeds to destroy it thus merging cgroup_put_root() and the former cgroup_kill_sb(). Simliarly, cgroup_put() now directly schedules cgroup_free_rcu() when refcnt reaches zero. * Unlike before, kernfs objects don't hold onto cgroup objects. When cgroup destroys a kernfs node, all existing operations are drained and the association is broken immediately. The same for cgroupfs_roots and mounts. * All operations which come through kernfs guarantee that the associated cgroup is and stays valid for the duration of operation; however, there are two paths which need to find out the associated cgroup from dentry without going through kernfs - css_tryget_from_dir() and cgroupstats_build(). For these two, kernfs_node->priv is RCU managed so that they can dereference it under RCU read lock. File and directory handling --------------------------- * File and directory operations converted to kernfs_ops and kernfs_syscall_ops. * xattrs is implicitly supported by kernfs. No need to worry about it from cgroup. This means that "xattr" mount option is no longer necessary. A future patch will add a deprecated warning message when sane_behavior. * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a private copy of one of the kernfs_ops to set its atomic_write_len. cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated to handle it. * cftype->lockdep_key added so that kernfs lockdep annotation can be per cftype. * Inidividual file entries and open states are now managed by kernfs. No need to worry about them from cgroup. cfent, cgroup_open_file and their friends are removed. * kernfs_nodes are created deactivated and kernfs_activate() invocations added to places where creation of new nodes are committed. * cgroup_rmdir() uses kernfs_[un]break_active_protection() for self-removal. v2: - Li pointed out in an earlier patch that specifying "name=" during mount without subsystem specification should succeed if there's an existing hierarchy with a matching name although it should fail with -EINVAL if a new hierarchy should be created. Prior to the conversion, this used by handled by deferring failure from NULL return from cgroup_root_from_opts(), which was necessary because root was being created before checking for existing ones. Note that cgroup_root_from_opts() returned an ERR_PTR() value for error conditions which require immediate mount failure. As we now have separate search and creation steps, deferring failure from cgroup_root_from_opts() is no longer necessary. cgroup_root_from_opts() is updated to always return ERR_PTR() value on failure. - The logic to match existing roots is updated so that a mount attempt with a matching name but different subsys_mask are rejected. This was handled by a separate matching loop under the comment "Check for name clashes with existing mounts" but got lost during conversion. Merge the check into the main search loop. - Add __rcu __force casting in RCU_INIT_POINTER() in cgroup_destroy_locked() to avoid the sparse address space warning reported by kbuild test bot. Maybe we want an explicit interface to use kn->priv as RCU protected pointer? v3: Make CONFIG_CGROUPS select CONFIG_KERNFS. v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: kbuild test robot fengguang.wu@intel.com>

cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs implementation which was heavily intertwined with vfs objects and locking with the goal of re-using the existing vfs infrastructure. That experiment turned out rather disastrous and sysfs switched, a long time ago, to distributed filesystem model where a separate representation is maintained which is queried by vfs. Unfortunately, cgroup stuck with the failed experiment all these years and accumulated even more problems over time. Locking and object lifetime management being entangled with vfs is probably the most egregious. vfs is never designed to be misused like this and cgroup ends up jumping through various convoluted dancing to make things work. Even then, operations across multiple cgroups can't be done safely as it'll deadlock with rename locking. Recently, kernfs is separated out from sysfs so that it can be used by users other than sysfs. This patch converts cgroup to use kernfs, which will bring the following benefits. * Separation from vfs internals. Locking and object lifetime management is contained in cgroup proper making things a lot simpler. This removes significant amount of locking convolutions, hairy object lifetime rules and the restriction on multi-cgroup operations. * Can drop a lot of code to implement filesystem interface as most are provided by kernfs. * Proper "severing" semantics, which allows controllers to not worry about lingering file accesses after offline. While the preceding patches did as much as possible to make the transition less painful, large part of the conversion has to be one discrete step making this patch rather large. The rest of the commit message lists notable changes in different areas. Overall ------- * vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn, cgroupfs_root->sb w/ ->kf_root. * All dentry accessors are removed. Helpers to map from kernfs constructs are added. * All vfs plumbing around dentry, inode and bdi removed. * cgroup_mount() now directly looks for matching root and then proceeds to create a new one if not found. Synchronization and object lifetime ----------------------------------- * vfs inode locking removed. Among other things, this removes the need for the convolution in cgroup_cfts_commit(). Future patches will further simplify it. * vfs refcnting replaced with cgroup internal ones. cgroup->refcnt, cgroupfs_root->refcnt added. cgroup_put_root() now directly puts root->refcnt and when it reaches zero proceeds to destroy it thus merging cgroup_put_root() and the former cgroup_kill_sb(). Simliarly, cgroup_put() now directly schedules cgroup_free_rcu() when refcnt reaches zero. * Unlike before, kernfs objects don't hold onto cgroup objects. When cgroup destroys a kernfs node, all existing operations are drained and the association is broken immediately. The same for cgroupfs_roots and mounts. * All operations which come through kernfs guarantee that the associated cgroup is and stays valid for the duration of operation; however, there are two paths which need to find out the associated cgroup from dentry without going through kernfs - css_tryget_from_dir() and cgroupstats_build(). For these two, kernfs_node->priv is RCU managed so that they can dereference it under RCU read lock. File and directory handling --------------------------- * File and directory operations converted to kernfs_ops and kernfs_syscall_ops. * xattrs is implicitly supported by kernfs. No need to worry about it from cgroup. This means that "xattr" mount option is no longer necessary. A future patch will add a deprecated warning message when sane_behavior. * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a private copy of one of the kernfs_ops to set its atomic_write_len. cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated to handle it. * cftype->lockdep_key added so that kernfs lockdep annotation can be per cftype. * Inidividual file entries and open states are now managed by kernfs. No need to worry about them from cgroup. cfent, cgroup_open_file and their friends are removed. * kernfs_nodes are created deactivated and kernfs_activate() invocations added to places where creation of new nodes are committed. * cgroup_rmdir() uses kernfs_[un]break_active_protection() for self-removal. v2: - Li pointed out in an earlier patch that specifying "name=" during mount without subsystem specification should succeed if there's an existing hierarchy with a matching name although it should fail with -EINVAL if a new hierarchy should be created. Prior to the conversion, this used by handled by deferring failure from NULL return from cgroup_root_from_opts(), which was necessary because root was being created before checking for existing ones. Note that cgroup_root_from_opts() returned an ERR_PTR() value for error conditions which require immediate mount failure. As we now have separate search and creation steps, deferring failure from cgroup_root_from_opts() is no longer necessary. cgroup_root_from_opts() is updated to always return ERR_PTR() value on failure. - The logic to match existing roots is updated so that a mount attempt with a matching name but different subsys_mask are rejected. This was handled by a separate matching loop under the comment "Check for name clashes with existing mounts" but got lost during conversion. Merge the check into the main search loop. - Add __rcu __force casting in RCU_INIT_POINTER() in cgroup_destroy_locked() to avoid the sparse address space warning reported by kbuild test bot. Maybe we want an explicit interface to use kn->priv as RCU protected pointer? v3: Make CONFIG_CGROUPS select CONFIG_KERNFS. v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: kbuild test robot fengguang.wu@intel.com>
2bd59d48 · Tejun Heo · f2e85d57 · 2bd59d48 · 2bd59d48 · 2bd59d48
Commit 2bd59d48 authored Feb 11, 2014 by Tejun Heo
Expand all Show whitespace changes
Inline Side-by-side

Showing with 383 additions and 785 deletions

include/linux/cgroup.h include/linux/cgroup.h +20 -32

init/Kconfig init/Kconfig +1 -0

kernel/cgroup.c kernel/cgroup.c +362 -753

No files found.
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -18,10 +18,10 @@
 #include <linux/rwsem.h>
 #include <linux/idr.h>
 #include <linux/workqueue.h>
-#include <linux/xattr.h>
 #include <linux/fs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/seq_file.h>
+#include <linux/kernfs.h>
 #ifdef CONFIG_CGROUPS
@@ -159,16 +159,17 @@ struct cgroup {
 	/* the number of attached css's */
 	int nr_css;
+	atomic_t refcnt;
 	/*
 	 * We link our 'sibling' struct into our parent's 'children'.
 	 * Our children link their 'sibling' into our 'children'.
 	 */
 	struct list_head sibling;	/* my parent's children */
 	struct list_head children;	/* my children */
-	struct list_head files;		/* my files */
 	struct cgroup *parent;		/* my parent */
-	struct dentry *dentry;		/* cgroup fs entry, RCU protected */
+	struct kernfs_node *kn;		/* cgroup kernfs entry */
 	/*
 	 * Monotonically increasing unique serial number which defines a
@@ -222,9 +223,6 @@ struct cgroup {
 	/* For css percpu_ref killing and RCU-protected deletion */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
-	/* directory xattrs */
-	struct simple_xattrs xattrs;
 };
 #define MAX_CGROUP_ROOT_NAMELEN 64
@@ -291,15 +289,17 @@ enum {
 /*
 * A cgroupfs_root represents the root of a cgroup hierarchy, and may be
- * associated with a superblock to form an active hierarchy.  This is
+ * associated with a kernfs_root to form an active hierarchy.  This is
 * internal to cgroup core.  Don't access directly from controllers.
 */
 struct cgroupfs_root {
-	struct super_block *sb;
+	struct kernfs_root *kf_root;
 	/* The bitmask of subsystems attached to this hierarchy */
 	unsigned long subsys_mask;
+	atomic_t refcnt;
 	/* Unique id for this hierarchy. */
 	int hierarchy_id;
@@ -415,6 +415,9 @@ struct cftype {
 	 */
 	struct cgroup_subsys *ss;
+	/* kernfs_ops to use, initialized automatically during registration */
+	struct kernfs_ops *kf_ops;
 	/*
 	 * read_u64() is a shortcut for the common case of returning a
 	 * single integer. Use it in place of read()
@@ -460,6 +463,10 @@ struct cftype {
 	 * kick type for multiplexing.
 	 */
 	int (*trigger)(struct cgroup_subsys_state *css, unsigned int event);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lock_class_key	lockdep_key;
+#endif
 };
 /*
@@ -472,26 +479,6 @@ struct cftype_set {
 	struct cftype			*cfts;
 };
-/*
- * cgroupfs file entry, pointed to from leaf dentry->d_fsdata.  Don't
- * access directly.
- */
-struct cfent {
-	struct list_head		node;
-	struct dentry			*dentry;
-	struct cftype			*type;
-	struct cgroup_subsys_state	*css;
-	/* file xattrs */
-	struct simple_xattrs		xattrs;
-};
-/* seq_file->private points to the following, only ->priv is public */
-struct cgroup_open_file {
-	struct cfent			*cfe;
-	void				*priv;
-};
 /*
 * See the comment above CGRP_ROOT_SANE_BEHAVIOR for details.  This
 * function can be called as long as @cgrp is accessible.
@@ -510,16 +497,17 @@ static inline const char *cgroup_name(const struct cgroup *cgrp)
 /* returns ino associated with a cgroup, 0 indicates unmounted root */
 static inline ino_t cgroup_ino(struct cgroup *cgrp)
 {
-	if (cgrp->dentry)
+	if (cgrp->kn)
-		return cgrp->dentry->d_inode->i_ino;
+		return cgrp->kn->ino;
 	else
 		return 0;
 }
 static inline struct cftype *seq_cft(struct seq_file *seq)
 {
-	struct cgroup_open_file *of = seq->private;
+	struct kernfs_open_file *of = seq->private;
-	return of->cfe->type;
+	return of->kn->priv;
 }
 struct cgroup_subsys_state *seq_css(struct seq_file *seq);

--- a/init/Kconfig
+++ b/init/Kconfig
@@ -854,6 +854,7 @@ config NUMA_BALANCING
 menuconfig CGROUPS
 	boolean "Control Group support"
+	select KERNFS
 	help
 	  This option adds support for grouping sets of processes together, for
 	  use with process control subsystems such as Cpusets, CFS, memory

--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c