1. 29 Jun, 2013 1 commit
    • Tejun Heo's avatar
      cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an existing hierarchy · c7ba8287
      Tejun Heo authored
      0ce6cba3 ("cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when
      comparing mount options") only updated the remount path but
      CGRP_ROOT_SUBSYS_BOUND should also be ignored when comparing options
      while mounting an existing hierarchy.  As option mismatch triggers a
      warning but doesn't fail the mount without sane_behavior, this only
      triggers a spurious warning message.
      
      Fix it by only comparing CGRP_ROOT_OPTION_MASK bits when comparing new
      and existing root options.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      c7ba8287
  2. 28 Jun, 2013 2 commits
    • Tejun Heo's avatar
      cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options · 0ce6cba3
      Tejun Heo authored
      1672d040 ("cgroup: fix cgroupfs_root early destruction path")
      introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
      subsys binding on a new root; however, this broke remounts.
      cgroup_remount() doesn't allow changing root options via remount and
      CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
      makes the function reject all remounts.
      
      Fix it by putting the options part in the lower 16 bits of root->flags
      and masking the comparions.  While at it, make cgroup_remount() emit
      an error message explaining why it's rejecting a remount request, so
      that it's less of a mystery.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      0ce6cba3
    • Tejun Heo's avatar
      cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts() · e2bd416f
      Tejun Heo authored
      eb178d06 ("cgroup: grab cgroup_mutex in
      drop_parsed_module_refcounts()") made drop_parsed_module_refcounts()
      grab cgroup_mutex to make lockdep assertion in for_each_subsys()
      happy.  Unfortunately, cgroup_remount() calls the function while
      holding cgroup_mutex in its failure path leading to the following
      deadlock.
      
      # mount -t cgroup -o remount,memory,blkio cgroup blkio
      
       cgroup: option changes via remount are deprecated (pid=525 comm=mount)
      
       =============================================
       [ INFO: possible recursive locking detected ]
       3.10.0-rc4-work+ #1 Not tainted
       ---------------------------------------------
       mount/525 is trying to acquire lock:
        (cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
      
       but task is already holding lock:
        (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
      	CPU0
      	----
         lock(cgroup_mutex);
         lock(cgroup_mutex);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       4 locks held by mount/525:
        #0:  (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30
        #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200
        #2:  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
        #3:  (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200
      
       stack backtrace:
       CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8
        ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056
        ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640
       Call Trace:
        [<ffffffff81c24bb1>] dump_stack+0x19/0x1b
        [<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60
        [<ffffffff810f531c>] lock_acquire+0x9c/0x1f0
        [<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410
        [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
        [<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200
        [<ffffffff811c9bb2>] do_remount_sb+0x82/0x190
        [<ffffffff811e9d41>] do_mount+0x5f1/0xa30
        [<ffffffff811ea203>] SyS_mount+0x83/0xc0
        [<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b
      
      Fix it by moving the drop_parsed_module_refcounts() invocation outside
      cgroup_mutex.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      e2bd416f
  3. 26 Jun, 2013 5 commits
    • Tejun Heo's avatar
      cgroup: always use RCU accessors for protected accesses · a4ea1cc9
      Tejun Heo authored
      kernel/cgroup.c still has places where a RCU pointer is set and
      accessed directly without going through RCU_INIT_POINTER() or
      rcu_dereference_protected().  They're all properly protected accesses
      so nothing is broken but it leads to spurious sparse RCU address space
      warnings.
      
      Substitute direct accesses with RCU_INIT_POINTER() and
      rcu_dereference_protected().  Note that %true is specified as the
      extra condition for all derference updates.  This isn't ideal as all
      it does is suppressing warning without actually policing
      synchronization rules; however, most are scheduled to be removed
      pretty soon along with css_id itself, so no reason to be more
      elaborate.
      
      Combined with the previous changes, this removes all RCU related
      sparse warnings from cgroup.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by; Li Zefan <lizefan@huawei.com>
      a4ea1cc9
    • Tejun Heo's avatar
      cgroup: fix RCU accesses around task->cgroups · a8ad805c
      Tejun Heo authored
      There are several places in kernel/cgroup.c where task->cgroups is
      accessed and modified without going through proper RCU accessors.
      None is broken as they're all lock protected accesses; however, this
      still triggers sparse RCU address space warnings.
      
      * Consistently use task_css_set() for task->cgroups dereferencing.
      
      * Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on
        exit.
      
      * Remove unnecessary rcu_dereference_raw() from cset->subsys[]
        dereference in cgroup_exit().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      a8ad805c
    • Tejun Heo's avatar
      cgroup: fix RCU accesses to task->cgroups · 14611e51
      Tejun Heo authored
      task->cgroups is a RCU pointer pointing to struct css_set.  A task
      switches to a different css_set on cgroup migration but a css_set
      doesn't change once created and its pointers to cgroup_subsys_states
      aren't RCU protected.
      
      task_subsys_state[_check]() is the macro to acquire css given a task
      and subsys_id pair.  It RCU-dereferences task->cgroups->subsys[] not
      task->cgroups, so the RCU pointer task->cgroups ends up being
      dereferenced without read_barrier_depends() after it.  It's broken.
      
      Fix it by introducing task_css_set[_check]() which does
      RCU-dereference on task->cgroups.  task_subsys_state[_check]() is
      reimplemented to directly dereference ->subsys[] of the css_set
      returned from task_css_set[_check]().
      
      This removes some of sparse RCU warnings in cgroup.
      
      v2: Fixed unbalanced parenthsis and there's no need to use
          rcu_dereference_raw() when !CONFIG_PROVE_RCU.  Both spotted by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      14611e51
    • Tejun Heo's avatar
      cgroup: grab cgroup_mutex in drop_parsed_module_refcounts() · eb178d06
      Tejun Heo authored
      This isn't strictly necessary as all subsystems specified in
      @subsys_mask are guaranteed to be pinned; however, it does spuriously
      trigger lockdep warning.  Let's grab cgroup_mutex around it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      eb178d06
    • Tejun Heo's avatar
      cgroup: fix cgroupfs_root early destruction path · 1672d040
      Tejun Heo authored
      cgroupfs_root used to have ->actual_subsys_mask in addition to
      ->subsys_mask.  a8a648c4 ("cgroup: remove
      cgroup->actual_subsys_mask") removed it noting that the subsys_mask is
      essentially temporary and doesn't belong in cgroupfs_root; however,
      the patch made it impossible to tell whether a cgroupfs_root actually
      has the subsystems bound or just have the bits set leading to the
      following BUG when trying to mount with subsystems which are already
      mounted elsewhere.
      
       kernel BUG at kernel/cgroup.c:1038!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       ...
       CPU: 1 PID: 7973 Comm: mount Tainted: G        W    3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
       task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
       RIP: 0010:[<ffffffff81249b29>]  [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
       ...
       Call Trace:
        [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
        [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
        [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
        [<ffffffff81257169>] cpuset_mount+0xd9/0x110
        [<ffffffff813d2580>] mount_fs+0xb0/0x2d0
        [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
        [<ffffffff814070b5>] do_new_mount+0x145/0x2c0
        [<ffffffff814085d6>] do_mount+0x356/0x3c0
        [<ffffffff8140873d>] SyS_mount+0xfd/0x140
        [<ffffffff854eb600>] tracesys+0xdd/0xe2
      
      We still want rebind_subsystems() to take added/removed masks, so
      let's fix it by marking whether a cgroupfs_root has finished binding
      or not.  Also, document what's going on around ->subsys_mask
      initialization so that similar mistakes aren't repeated.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      1672d040
  4. 25 Jun, 2013 3 commits
    • Tejun Heo's avatar
      cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy · fc76df70
      Tejun Heo authored
      Before 1a574231 ("cgroup: make hierarchy_id use cyclic idr"),
      hierarchy IDs were allocated from 0.  As the dummy hierarchy was
      always the one first initialized, it got assigned 0 and all other
      hierarchies from 1.  The patch accidentally changed the minimum
      useable ID to 2.
      
      Let's restore ID 0 for dummy_root and while at it reserve 1 for
      unified hierarchy.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      fc76df70
    • Tejun Heo's avatar
      cgroup: implement for_each_[builtin_]subsys() · 30159ec7
      Tejun Heo authored
      There are quite a few places where all loaded [builtin] subsys are
      iterated.  Implement for_each_[builtin_]subsys() and replace manual
      iterations with those to simplify those places a bit.  The new
      iterators automatically skip NULL subsystems.  This shouldn't cause
      any functional difference.
      
      Iteration loops which scan all subsystems and then skipping modular
      ones explicitly are converted to use for_each_builtin_subsys().
      
      While at it, reorder variable declarations and adjust whitespaces a
      bit in the affected functions.
      
      v2: Add lockdep_assert_held() in for_each_subsys() and add comments
          about synchronization as suggested by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      30159ec7
    • Tejun Heo's avatar
      cgroup: move init_css_set initialization inside cgroup_mutex · 82fe9b0d
      Tejun Heo authored
      cgroup_init() was doing init_css_set initialization outside
      cgroup_mutex, which is fine but we want to add lockdep annotation on
      subsystem iterations and cgroup_init() will trigger it spuriously.
      Move init_css_set initialization inside cgroup_mutex.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      82fe9b0d
  5. 24 Jun, 2013 5 commits
    • Tejun Heo's avatar
      cgroup: s/for_each_subsys()/for_each_root_subsys()/ · 5549c497
      Tejun Heo authored
      for_each_subsys() walks over subsystems attached to a hierarchy and
      we're gonna add iterators which walk over all available subsystems.
      Rename for_each_subsys() to for_each_root_subsys() so that it's more
      appropriately named and for_each_subsys() can be used to iterate all
      subsystems.
      
      While at it, remove unnecessary underbar prefix from macro arguments,
      put them inside parentheses, and adjust indentation for the two
      for_each_*() macros.
      
      This patch is purely cosmetic.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      5549c497
    • Tejun Heo's avatar
      cgroup: clean up find_css_set() and friends · b326f9d0
      Tejun Heo authored
      find_css_set() passes uninitialized on-stack template[] array to
      find_existing_css_set() which sets the entries for all subsystems.
      Passing around an uninitialized array is a bit icky and we want to
      introduce an iterator which only iterates loaded subsystems.  Let's
      initialize it on definition.
      
      While at it, also make the following cosmetic cleanups.
      
      * Convert to proper /** comments.
      
      * Reorder variable declarations.
      
      * Replace comment on synchronization with lockdep_assert_held().
      
      This patch doesn't make any functional differences.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      b326f9d0
    • Tejun Heo's avatar
      cgroup: remove cgroup->actual_subsys_mask · a8a648c4
      Tejun Heo authored
      cgroup curiously has two subsystem masks, ->subsys_mask and
      ->actual_subsys_mask.  The latter only exists because the new target
      subsys_mask is passed into rebind_subsystems() via @root>subsys_mask.
      rebind_subsystems() needs to know what the current mask is to decide
      how to reach the target mask so ->actual_subsys_mask is used as the
      temp location to remember the current state.
      
      Adding a temporary field to a permanent data structure is rather silly
      and can be misleading.  Update rebind_subsystems() to take @added_mask
      and @removed_mask instead and remove @root->actual_subsys_mask.
      
      This patch shouldn't introduce any behavior changes.
      
      v2: Comment and description updated as suggested by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      a8a648c4
    • Tejun Heo's avatar
      cgroup: prefix global variables with "cgroup_" · 9871bf95
      Tejun Heo authored
      Global variable names in kernel/cgroup.c are asking for trouble -
      subsys, roots, rootnode and so on.  Rename them to have "cgroup_"
      prefix.
      
      * s/subsys/cgroup_subsys/
      
      * s/rootnode/cgroup_dummy_root/
      
      * s/dummytop/cgroup_cummy_top/
      
      * s/roots/cgroup_roots/
      
      * s/root_count/cgroup_root_count/
      
      This patch is purely cosmetic.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      9871bf95
    • Tejun Heo's avatar
      cgroup: convert CFTYPE_* flags to enums · 02c402d9
      Tejun Heo authored
      Purely cosmetic.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      02c402d9
  6. 19 Jun, 2013 1 commit
  7. 18 Jun, 2013 7 commits
    • Tejun Heo's avatar
      cgroup: clean up cgroup_serial_nr_cursor · 00356bd5
      Tejun Heo authored
      cgroup_serial_nr_cursor was created atomic64_t because I thought it
      was never gonna used for anything other than assigning unique numbers
      to cgroups and didn't want to worry about synchronization; however,
      now we're using it as an event-stamp to distinguish cgroups created
      before and after certain point which assumes that it's protected by
      cgroup_mutex.
      
      Let's make it clear by making it a u64.  Also, rename it to
      cgroup_serial_nr_next and make it point to the next nr to allocate so
      that where it's pointing to is clear and more conventional.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      00356bd5
    • Li Zefan's avatar
      cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre() · e8c82d20
      Li Zefan authored
      We used root->allcg_list to iterate cgroup hierarchy because at that time
      cgroup_for_each_descendant_pre() hasn't been invented.
      
      tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the
          assignment right above releasing cgroup_mutex and explain what's
          going on there.
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      e8c82d20
    • Li Zefan's avatar
      cgroup: make serial_nr_cursor available throughout cgroup.c · 794611a1
      Li Zefan authored
      The next patch will use it to determine if a cgroup is newly created
      while we're iterating the cgroup hierarchy.
      
      tj: Rephrased the comment on top of cgroup_serial_nr_cursor.
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      794611a1
    • Li Zefan's avatar
      cgroup: fix memory leak in cgroup_rm_cftypes() · f57947d2
      Li Zefan authored
      The memory allocated in cgroup_add_cftypes() should be freed. The
      effect of this bug is we leak a bit memory everytime we unload
      cfq-iosched module if blkio cgroup is enabled.
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f57947d2
    • Li Zefan's avatar
      cgroup: fix umount vs cgroup_event_remove() race · 1c8158ee
      Li Zefan authored
       commit 5db9a4d9
       Author: Tejun Heo <tj@kernel.org>
       Date:   Sat Jul 7 16:08:18 2012 -0700
      
           cgroup: fix cgroup hierarchy umount race
      
      This commit fixed a race caused by the dput() in css_dput_fn(), but
      the dput() in cgroup_event_remove() can also lead to the same BUG().
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      1c8158ee
    • Li Zefan's avatar
      cgroup: fix umount vs cgroup_cfts_commit() race · 084457f2
      Li Zefan authored
      cgroup_cfts_commit() uses dget() to keep cgroup alive after cgroup_mutex
      is dropped, but dget() won't prevent cgroupfs from being umounted. When
      the race happens, vfs will see some dentries with non-zero refcnt while
      umount is in process.
      
      Keep running this:
        mount -t cgroup -o blkio xxx /cgroup
        umount /cgroup
      
      And this:
        modprobe cfq-iosched
        rmmod cfs-iosched
      
      After a while, the BUG() in shrink_dcache_for_umount_subtree() may
      be triggered:
      
        BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup]
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      084457f2
    • Tejun Heo's avatar
      cgroup: disallow rename(2) if sane_behavior · 6db8e85c
      Tejun Heo authored
      cgroup's rename(2) isn't a proper migration implementation - it can't
      move the cgroup to a different parent in the hierarchy.  All it can do
      is swapping the name string for that cgroup.  This isn't useful and
      can mislead users to think that cgroup supports proper cgroup-level
      migration.  Disallow rename(2) if sane_behavior.
      
      v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs
          return value when ->rename is not implemented as suggested by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      6db8e85c
  8. 14 Jun, 2013 6 commits
    • Tejun Heo's avatar
      cgroup: update sane_behavior documentation · f63674fd
      Tejun Heo authored
      f12dc020 ("cgroup: mark "tasks" cgroup file as insane") and
      cc5943a7 ("cgroup: mark "notify_on_release" and "release_agent"
      cgroup files insane") forgot to update the changed behavior
      documentation in cgroup.h.  Update it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f63674fd
    • Tejun Heo's avatar
      cgroup: use percpu refcnt for cgroup_subsys_states · d3daf28d
      Tejun Heo authored
      A css (cgroup_subsys_state) is how each cgroup is represented to a
      controller.  As such, it can be used in hot paths across the various
      subsystems different controllers are associated with.
      
      One of the common operations is reference counting, which up until now
      has been implemented using a global atomic counter and can have
      significant adverse impact on scalability.  For example, css refcnt
      can be gotten and put multiple times by blkcg for each IO request.
      For highops configurations which try to do as much per-cpu as
      possible, the global frequent refcnting can be very expensive.
      
      In general, given the various and hugely diverse paths css's end up
      being used from, we need to make it cheap and highly scalable.  In its
      usage, css refcnting isn't very different from module refcnting.
      
      This patch converts css refcnting to use the recently added
      percpu_ref.  css_get/tryget/put() directly maps to the matching
      percpu_ref operations and the deactivation logic is no longer
      necessary as percpu_ref already has refcnt killing.
      
      The only complication is that as the refcnt is per-cpu,
      percpu_ref_kill() in itself doesn't ensure that further tryget
      operations will fail, which we need to guarantee before invoking
      ->css_offline()'s.  This is resolved collecting kill confirmation
      using percpu_ref_kill_and_confirm() and initiating the offline phase
      of destruction after all css refcnt's are confirmed to be seen as
      killed on all CPUs.  The previous patches already splitted destruction
      into two phases, so percpu_ref_kill_and_confirm() can be hooked up
      easily.
      
      This patch removes css_refcnt() which is used for rcu dereference
      sanity check in css_id().  While we can add a percpu refcnt API to ask
      the same question, css_id() itself is scheduled to be removed fairly
      soon, so let's not bother with it.  Just drop the sanity check and use
      rcu_dereference_raw() instead.
      
      v2: - init_cgroup_css() was calling percpu_ref_init() without checking
            the return value.  This causes two problems - the obvious lack
            of error handling and percpu_ref_init() being called from
            cgroup_init_subsys() before the allocators are up, which
            triggers warnings but doesn't cause actual problems as the
            refcnt isn't used for roots anyway.  Fix both by moving
            percpu_ref_init() to cgroup_create().
      
          - The base references were put too early by
            percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
            refs one extra time.  This wasn't noticeable because css's go
            through another RCU grace period before being freed.  Update
            cgroup_destroy_locked() to grab an extra reference before
            killing the refcnts.  This problem was noticed by Kent.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarKent Overstreet <koverstreet@google.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Alasdair G. Kergon" <agk@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      d3daf28d
    • Tejun Heo's avatar
      Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.11 · 2b0e53a7
      Tejun Heo authored
      This is to receive percpu_refcount which will replace atomic_t
      reference count in cgroup_subsys_state.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      2b0e53a7
    • Tejun Heo's avatar
      cgroup: split cgroup destruction into two steps · ea15f8cc
      Tejun Heo authored
      Split cgroup_destroy_locked() into two steps and put the latter half
      into cgroup_offline_fn() which is executed from a work item.  The
      latter half is responsible for offlining the css's, removing the
      cgroup from internal lists, and propagating release notification to
      the parent.  The separation is to allow using percpu refcnt for css.
      
      Note that this allows for other cgroup operations to happen between
      the first and second halves of destruction, including creating a new
      cgroup with the same name.  As the target cgroup is marked DEAD in the
      first half and cgroup internals don't care about the names of cgroups,
      this should be fine.  A comment explaining this will be added by the
      next patch which implements the actual percpu refcnting.
      
      As RCU freeing is guaranteed to happen after the second step of
      destruction, we can use the same work item for both.  This patch
      renames cgroup->free_work to ->destroy_work and uses it for both
      purposes.  INIT_WORK() is now performed right before queueing the work
      item.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      ea15f8cc
    • Tejun Heo's avatar
      cgroup: reorder the operations in cgroup_destroy_locked() · 455050d2
      Tejun Heo authored
      This patch reorders the operations in cgroup_destroy_locked() such
      that the userland visible parts happen before css offlining and
      removal from the ->sibling list.  This will be used to make css use
      percpu refcnt.
      
      While at it, split out CGRP_DEAD related comment from the refcnt
      deactivation one and correct / clarify how different guarantees are
      met.
      
      While this patch changes the specific order of operations, it
      shouldn't cause any noticeable behavior difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      455050d2
    • Tejun Heo's avatar
      percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm() · dbece3a0
      Tejun Heo authored
      Implement percpu_tryget() which stops giving out references once the
      percpu_ref is visible as killed.  Because the refcnt is per-cpu,
      different CPUs will start to see a refcnt as killed at different
      points in time and tryget() may continue to succeed on subset of cpus
      for a while after percpu_ref_kill() returns.
      
      For use cases where it's necessary to know when all CPUs start to see
      the refcnt as dead, percpu_ref_kill_and_confirm() is added.  The new
      function takes an extra argument @confirm_kill which is invoked when
      the refcnt is guaranteed to be viewed as killed on all CPUs.
      
      While this isn't the prettiest interface, it doesn't force synchronous
      wait and is much safer than requiring the caller to do its own
      call_rcu().
      
      v2: Patch description rephrased to emphasize that tryget() may
          continue to succeed on some CPUs after kill() returns as suggested
          by Kent.
      
      v3: Function comment in percpu_ref_kill_and_confirm() updated warning
          people to not depend on the implied RCU grace period from the
          confirm callback as it's an implementation detail.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Slightly-Grumpily-Acked-by: default avatarKent Overstreet <koverstreet@google.com>
      dbece3a0
  9. 13 Jun, 2013 10 commits
    • Tejun Heo's avatar
      percpu-refcount: implement percpu_ref_cancel_init() · bc497bd3
      Tejun Heo authored
      Normally, percpu_ref_init() initializes and percpu_ref_kill()
      initiates destruction which completes asynchronously.  The
      asynchronous destruction can be problematic in init failure path where
      the caller wants to destroy half-constructed object - distinguishing
      half-constructed objects from the usual release method can be painful
      for complex objects.
      
      This patch implements percpu_ref_cancel_init() which synchronously
      destroys the percpu_ref without invoking release.  To avoid
      unintentional misuses, the function requires the ref to have finished
      percpu_ref_init() but never used and triggers WARN otherwise.
      
      v2: Explain the weird name and usage restriction in the function
          comment.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarKent Overstreet <koverstreet@google.com>
      bc497bd3
    • Tejun Heo's avatar
      percpu-refcount: add __must_check to percpu_ref_init() and don't use... · acac7883
      Tejun Heo authored
      percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
      
      Two small changes.
      
      * Unlike most init functions, percpu_ref_init() allocates memory and
        may fail.  Let's mark it with __must_check in case the caller
        forgets.
      
      * percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
        dereference @ref->pcpu_count, which can be misleading.  The pointer
        is guaranteed to be valid and visible and can't change underneath
        the function.  Drop ACCESS_ONCE().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      acac7883
    • Tejun Heo's avatar
      cgroup: remove cgroup->count and use · 6f3d828f
      Tejun Heo authored
      cgroup->count tracks the number of css_sets associated with the cgroup
      and used only to verify that no css_set is associated when the cgroup
      is being destroyed.  It's superflous as the destruction path can
      simply check whether cgroup->cset_links is empty instead.
      
      Drop cgroup->count and check ->cset_links directly from
      cgroup_destroy_locked().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      6f3d828f
    • Tejun Heo's avatar
      cgroup: drop unnecessary RCU dancing from __put_css_set() · ddd69148
      Tejun Heo authored
      __put_css_set() does RCU read access on @cgrp across dropping
      @cgrp->count so that it can continue accessing @cgrp even if the count
      reached zero and destruction of the cgroup commenced.  Given that both
      sides - __css_put() and cgroup_destroy_locked() - are cold paths, this
      is unnecessary.  Just making cgroup_destroy_locked() grab css_set_lock
      while checking @cgrp->count is enough.
      
      Remove the RCU read locking from __put_css_set() and make
      cgroup_destroy_locked() read-lock css_set_lock when checking
      @cgrp->count.  This will also allow removing @cgrp->count.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      ddd69148
    • Tejun Heo's avatar
      cgroup: rename CGRP_REMOVED to CGRP_DEAD · 54766d4a
      Tejun Heo authored
      We will add another flag indicating that the cgroup is in the process
      of being killed.  REMOVING / REMOVED is more difficult to distinguish
      and cgroup_is_removing()/cgroup_is_removed() are a bit awkward.  Also,
      later percpu_ref usage will involve "kill"ing the refcnt.
      
       s/CGRP_REMOVED/CGRP_DEAD/
       s/cgroup_is_removed()/cgroup_is_dead()
      
      This patch is purely cosmetic.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      54766d4a
    • Tejun Heo's avatar
      cgroup: clean up css_[try]get() and css_put() · 5de0107e
      Tejun Heo authored
      * __css_get() isn't used by anyone.  Fold it into css_get().
      
      * Add proper function comments to all css reference functions.
      
      This patch is purely cosmetic.
      
      v2: Typo fix as per Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      5de0107e
    • Tejun Heo's avatar
      cgroup: use kzalloc() instead of kmalloc() · f4f4be2b
      Tejun Heo authored
      There's no point in using kmalloc() instead of the clearing variant
      for trivial stuff.  We can live dangerously elsewhere.  Use kzalloc()
      instead and drop 0 inits.
      
      While at it, do trivial code reorganization in cgroup_file_open().
      
      This patch doesn't introduce any functional changes.
      
      v2: I was caught in the very distant past where list_del() didn't
          poison and the initial version converted list_del()s to
          list_del_init()s too.  Li and Kent took me out of the stasis
          chamber.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <koverstreet@google.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      f4f4be2b
    • Tejun Heo's avatar
      cgroup: bring some sanity to naming around cg_cgroup_link · 69d0206c
      Tejun Heo authored
      cgroups and css_sets are mapped M:N and this M:N mapping is
      represented by struct cg_cgroup_link which forms linked lists on both
      sides.  The naming around this mapping is already confusing and struct
      cg_cgroup_link exacerbates the situation quite a bit.
      
      >From cgroup side, it starts off ->css_sets and runs through
      ->cgrp_link_list.  From css_set side, it starts off ->cg_links and
      runs through ->cg_link_list.  This is rather reversed as
      cgrp_link_list is used to iterate css_sets and cg_link_list cgroups.
      Also, this is the only place which is still using the confusing "cg"
      for css_sets.  This patch cleans it up a bit.
      
      * s/cgroup->css_sets/cgroup->cset_links/
        s/css_set->cg_links/css_set->cgrp_links/
        s/cgroup_iter->cg_link/cgroup_iter->cset_link/
      
      * s/cg_cgroup_link/cgrp_cset_link/
      
      * s/cgrp_cset_link->cg/cgrp_cset_link->cset/
        s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
        s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/
      
      * s/init_css_set_link/init_cgrp_cset_link/
        s/free_cg_links/free_cgrp_cset_links/
        s/allocate_cg_links/allocate_cgrp_cset_links/
      
      * s/cgl[12]/link[12]/ in compare_css_sets()
      
      * s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
        adustments.
      
      * Comment and whiteline adjustments.
      
      After the changes, we have
      
      	list_for_each_entry(link, &cont->cset_links, cset_link) {
      		struct css_set *cset = link->cset;
      
      instead of
      
      	list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
      		struct css_set *cset = link->cg;
      
      This patch is purely cosmetic.
      
      v2: Fix broken sentences in the patch description.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      69d0206c
    • Tejun Heo's avatar
      cgroup: consistently use @cset for struct css_set variables · 5abb8855
      Tejun Heo authored
      cgroup.c uses @cg for most struct css_set variables, which in itself
      could be a bit confusing, but made much worse by the fact that there
      are places which use @cg for struct cgroup variables.
      compare_css_sets() epitomizes this confusion - @[old_]cg are struct
      css_set while @cg[12] are struct cgroup.
      
      It's not like the whole deal with cgroup, css_set and cg_cgroup_link
      isn't already confusing enough.  Let's give it some sanity by
      uniformly using @cset for all struct css_set variables.
      
      * s/cg/cset/ for all css_set variables.
      
      * s/oldcg/old_cset/ s/oldcgrp/old_cgrp/.  The same for the ones
        prefixed with "new".
      
      * s/cg/cgrp/ for cgroup variables in compare_css_sets().
      
      * s/css/cset/ for the cgroup variable in task_cgroup_from_root().
      
      * Whiteline adjustments.
      
      This patch is purely cosmetic.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      5abb8855
    • Tejun Heo's avatar
      cgroup: remove now unused css_depth() · 3fc3db9a
      Tejun Heo authored
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      3fc3db9a