Commit 2bd59d48 authored by Tejun Heo

cgroup: convert to kernfs

cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to a distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.

Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs was never designed to be misused like
this and cgroup ends up jumping through various convoluted hoops to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.

Recently, kernfs was separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.

* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes a significant amount of locking convolution,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.

* A lot of the code implementing the filesystem interface can be
  dropped as most of it is provided by kernfs.

* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.

While the preceding patches did as much as possible to make the
transition less painful, a large part of the conversion has to be one
discrete step, making this patch rather large.  The rest of the commit
message lists notable changes in different areas.

Overall
-------

* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.

* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.

* All vfs plumbing around dentry, inode and bdi removed.

* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.
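
The search-then-create flow can be sketched in userspace C.  This is
a simplified model, not the kernel code: sketch_root, sketch_opts and
sketch_find_root() are hypothetical names, the "none" mount option is
left out, and -EBUSY for a name match with a different subsys_mask is
an assumption (the v2 notes in this message only say such mounts are
rejected, and that creating a new hierarchy without a subsystem
specification fails with -EINVAL).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for cgroupfs_root; only the matching fields. */
struct sketch_root {
	const char *name;		/* "" when the hierarchy is unnamed */
	unsigned long subsys_mask;	/* bitmask of attached subsystems */
};

/*
 * Hypothetical stand-in for parsed mount options.  The sketch assumes
 * that at least one of the two fields is set.
 */
struct sketch_opts {
	const char *name;		/* NULL when "name=" was not given */
	unsigned long subsys_mask;	/* 0 when no subsystem was given */
};

/*
 * Search step.  Returns 0 with *found set when an existing root
 * matches, 0 with *found == NULL when the caller may go on to create
 * a new root, and a negative errno when the mount must fail.
 */
static int sketch_find_root(const struct sketch_root *roots, size_t nr,
			    const struct sketch_opts *opts,
			    const struct sketch_root **found)
{
	size_t i;

	*found = NULL;
	for (i = 0; i < nr; i++) {
		const struct sketch_root *root = &roots[i];
		int name_match = 0;

		/* If a name was asked for, it must match. */
		if (opts->name) {
			if (strcmp(opts->name, root->name))
				continue;
			name_match = 1;
		}

		/* If subsystems were asked for, they must match too. */
		if (opts->subsys_mask &&
		    opts->subsys_mask != root->subsys_mask) {
			if (!name_match)
				continue;
			return -EBUSY;	/* assumed error code, see above */
		}

		*found = root;
		return 0;
	}

	/* Nothing matched: a new root needs a subsystem specification. */
	if (!opts->subsys_mask)
		return -EINVAL;
	return 0;
}
```

Because the search finishes before anything is created, every failure
can be reported as a plain error code and no half-built root has to be
torn down.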

Synchronization and object lifetime
-----------------------------------

* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.

* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Similarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.

* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.

* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.
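
The "put directly destroys on zero" pattern described for
cgroup_put_root() can be modelled in userspace with C11 atomics.  This
is only a sketch under assumptions: ref_root and its helpers are
hypothetical, the destroyed flag exists purely so the effect can be
observed, and kernel code would use atomic_t helpers such as
atomic_dec_and_test() rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object with an internal refcount. */
struct ref_root {
	atomic_int refcnt;
	int *destroyed;		/* illustration only: set on destruction */
};

/* Allocate a root holding one initial reference for the caller. */
static struct ref_root *ref_root_alloc(int *destroyed)
{
	struct ref_root *root = malloc(sizeof(*root));

	if (!root)
		return NULL;
	atomic_init(&root->refcnt, 1);
	root->destroyed = destroyed;
	return root;
}

/* Take an extra reference; the caller must already hold one. */
static void ref_root_get(struct ref_root *root)
{
	int old = atomic_fetch_add(&root->refcnt, 1);

	assert(old > 0);
	(void)old;
}

/*
 * Drop a reference.  Whoever drops the last one destroys the object
 * directly; there is no separate teardown entry point.
 */
static void ref_root_put(struct ref_root *root)
{
	if (atomic_fetch_sub(&root->refcnt, 1) != 1)
		return;

	*root->destroyed = 1;
	free(root);
}
```

Folding destruction into the final put is what lets the message
describe cgroup_put_root() and the former cgroup_kill_sb() as merged.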

File and directory handling
---------------------------

* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.

* xattrs are implicitly supported by kernfs.  No need to worry about
  them from cgroup.  This means that the "xattr" mount option is no
  longer necessary.  A future patch will add a deprecation warning
  when it is specified with sane_behavior.

* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.

* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.

* Individual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.

* kernfs_nodes are created deactivated and kernfs_activate()
  invocations are added to places where creation of new nodes is
  committed.

* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.
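
The private-copy handling for cftype->max_write_len > PAGE_SIZE can
be sketched as follows.  This is a userspace model under assumptions:
ops_model, cft_model, the shared ops instance with its default write
length, and the fixed SKETCH_PAGE_SIZE of 4096 are all hypothetical
stand-ins, not the kernfs_ops or cftype definitions.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: real PAGE_SIZE varies */

/* Hypothetical stand-in for kernfs_ops; only the relevant field. */
struct ops_model {
	size_t atomic_write_len;
};

/* Hypothetical stand-in for cftype. */
struct cft_model {
	size_t max_write_len;
	struct ops_model *kf_ops;
};

/* Shared ops used by every file that fits within the default limit. */
static struct ops_model shared_ops = {
	.atomic_write_len = SKETCH_PAGE_SIZE,
};

/* Pick the ops for @cft, copying them only when a larger limit is needed. */
static int cft_model_init(struct cft_model *cft)
{
	struct ops_model *ops;

	if (cft->max_write_len <= SKETCH_PAGE_SIZE) {
		cft->kf_ops = &shared_ops;
		return 0;
	}

	ops = malloc(sizeof(*ops));
	if (!ops)
		return -1;

	*ops = shared_ops;			/* private copy */
	ops->atomic_write_len = cft->max_write_len;
	cft->kf_ops = ops;
	return 0;
}

/* Undo cft_model_init(), freeing the ops only if they were private. */
static void cft_model_exit(struct cft_model *cft)
{
	if (cft->kf_ops != &shared_ops)
		free(cft->kf_ops);
	cft->kf_ops = NULL;
}
```

Copying only in the oversized case keeps the common path free of
allocation, and the exit helper must tell the shared and private cases
apart, which is why both the init and exit sides need updating.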

v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used to be handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.

      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.

    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but a different subsys_mask is
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.

    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?

v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.

v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
parent f2e85d57
@@ -18,10 +18,10 @@
 #include <linux/rwsem.h>
 #include <linux/idr.h>
 #include <linux/workqueue.h>
-#include <linux/xattr.h>
 #include <linux/fs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/seq_file.h>
+#include <linux/kernfs.h>
 #ifdef CONFIG_CGROUPS
@@ -159,16 +159,17 @@ struct cgroup {
 	/* the number of attached css's */
 	int nr_css;
+	atomic_t refcnt;
 	/*
 	 * We link our 'sibling' struct into our parent's 'children'.
 	 * Our children link their 'sibling' into our 'children'.
 	 */
 	struct list_head sibling;	/* my parent's children */
 	struct list_head children;	/* my children */
-	struct list_head files;		/* my files */
 	struct cgroup *parent;		/* my parent */
-	struct dentry *dentry;		/* cgroup fs entry, RCU protected */
+	struct kernfs_node *kn;		/* cgroup kernfs entry */
 	/*
 	 * Monotonically increasing unique serial number which defines a
@@ -222,9 +223,6 @@ struct cgroup {
 	/* For css percpu_ref killing and RCU-protected deletion */
 	struct rcu_head rcu_head;
 	struct work_struct destroy_work;
-	/* directory xattrs */
-	struct simple_xattrs xattrs;
 };
 #define MAX_CGROUP_ROOT_NAMELEN 64
@@ -291,15 +289,17 @@ enum {
 /*
  * A cgroupfs_root represents the root of a cgroup hierarchy, and may be
- * associated with a superblock to form an active hierarchy. This is
+ * associated with a kernfs_root to form an active hierarchy. This is
  * internal to cgroup core. Don't access directly from controllers.
  */
 struct cgroupfs_root {
-	struct super_block *sb;
+	struct kernfs_root *kf_root;
 	/* The bitmask of subsystems attached to this hierarchy */
 	unsigned long subsys_mask;
+	atomic_t refcnt;
 	/* Unique id for this hierarchy. */
 	int hierarchy_id;
@@ -415,6 +415,9 @@ struct cftype {
 	 */
 	struct cgroup_subsys *ss;
+	/* kernfs_ops to use, initialized automatically during registration */
+	struct kernfs_ops *kf_ops;
 	/*
 	 * read_u64() is a shortcut for the common case of returning a
 	 * single integer. Use it in place of read()
@@ -460,6 +463,10 @@ struct cftype {
 	 * kick type for multiplexing.
 	 */
 	int (*trigger)(struct cgroup_subsys_state *css, unsigned int event);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lock_class_key lockdep_key;
+#endif
 };
 /*
@@ -472,26 +479,6 @@ struct cftype_set {
 	struct cftype *cfts;
 };
-/*
- * cgroupfs file entry, pointed to from leaf dentry->d_fsdata. Don't
- * access directly.
- */
-struct cfent {
-	struct list_head node;
-	struct dentry *dentry;
-	struct cftype *type;
-	struct cgroup_subsys_state *css;
-	/* file xattrs */
-	struct simple_xattrs xattrs;
-};
-/* seq_file->private points to the following, only ->priv is public */
-struct cgroup_open_file {
-	struct cfent *cfe;
-	void *priv;
-};
 /*
  * See the comment above CGRP_ROOT_SANE_BEHAVIOR for details. This
  * function can be called as long as @cgrp is accessible.
@@ -510,16 +497,17 @@ static inline const char *cgroup_name(const struct cgroup *cgrp)
 /* returns ino associated with a cgroup, 0 indicates unmounted root */
 static inline ino_t cgroup_ino(struct cgroup *cgrp)
 {
-	if (cgrp->dentry)
-		return cgrp->dentry->d_inode->i_ino;
+	if (cgrp->kn)
+		return cgrp->kn->ino;
 	else
 		return 0;
 }
 static inline struct cftype *seq_cft(struct seq_file *seq)
 {
-	struct cgroup_open_file *of = seq->private;
-	return of->cfe->type;
+	struct kernfs_open_file *of = seq->private;
+	return of->kn->priv;
 }
 struct cgroup_subsys_state *seq_css(struct seq_file *seq);
......
@@ -854,6 +854,7 @@ config NUMA_BALANCING
 menuconfig CGROUPS
 	boolean "Control Group support"
+	select KERNFS
 	help
 	  This option adds support for grouping sets of processes together, for
 	  use with process control subsystems such as Cpusets, CFS, memory
......
This diff is collapsed.