1. 06 Dec, 2013 5 commits
    • cgroup: reorder operations in cgroup_create() · 0d80255e
      Tejun Heo authored
      cgroup_create() currently does the following:
      
      1. alloc cgroup
      2. alloc css's
      3. create the directory and commit to cgroup creation
      4. online css's
      5. create cgroup and css files
      
      The sequence performs allocations before other operations but it
      doesn't buy anything because each of the above steps may fail and
      should be unrollable.  Reorganize the sequence such that cgroup
      operations are done before css operations.
      
      1. alloc cgroup
      2. create the directory and files and commit to cgroup creation
      3. alloc css's
      4. create files for and online css's
      
      This simplifies the code a bit and enables further simplification and
      separating out css creation from cgroup creation which is necessary
      for the planned unified hierarchy where css's will be created and
      destroyed dynamically across the lifetime of a cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
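The reordered sequence in 0d80255e can be sketched in userspace C. This is a minimal illustration, not the kernel implementation: `cgrp_create_sketch()`, `css_alloc_sketch()`, `NR_SUBSYS` and the `fail_css_at` test hook are hypothetical names, and directory creation is reduced to a flag. It shows cgroup-level steps completing before any css work, with one unwind path undoing whatever succeeded.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_SUBSYS 3		/* hypothetical number of controllers */

struct css_sketch { int online; };

struct cgrp_sketch {
	int dir_created;
	struct css_sketch *css[NR_SUBSYS];
};

static int fail_css_at = -1;	/* test hook: css index whose alloc fails */
static int live_objects;	/* outstanding allocations, for leak checks */

static struct css_sketch *css_alloc_sketch(int ssid)
{
	struct css_sketch *css;

	if (ssid == fail_css_at)
		return NULL;
	css = calloc(1, sizeof(*css));
	if (css)
		live_objects++;
	return css;
}

static void cgrp_destroy_sketch(struct cgrp_sketch *cg)
{
	int ssid;

	for (ssid = 0; ssid < NR_SUBSYS; ssid++) {
		if (cg->css[ssid]) {
			free(cg->css[ssid]);
			live_objects--;
		}
	}
	free(cg);
	live_objects--;
}

static struct cgrp_sketch *cgrp_create_sketch(void)
{
	struct cgrp_sketch *cg;
	int ssid;

	cg = calloc(1, sizeof(*cg));		/* 1. alloc cgroup */
	if (!cg)
		return NULL;
	live_objects++;

	cg->dir_created = 1;			/* 2. directory and files */

	for (ssid = 0; ssid < NR_SUBSYS; ssid++) {
		cg->css[ssid] = css_alloc_sketch(ssid);	/* 3. alloc css */
		if (!cg->css[ssid])
			goto err_unwind;
		cg->css[ssid]->online = 1;	/* 4. css files and online */
	}
	return cg;

err_unwind:
	/* slots at and after the failing index are still NULL */
	cgrp_destroy_sketch(cg);
	return NULL;
}
```

Because the css slots start out NULL and are filled in order, the same teardown helper serves both the normal destroy path and the error path in this sketch.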
    • cgroup: make for_each_subsys() useable under cgroup_root_mutex · 780cd8b3
      Tejun Heo authored
      We want to use for_each_subsys() in cgroupfs_root handling where only
      cgroup_root_mutex is held.  The only way cgroup_subsys[] can change is
      through module load/unload, so make cgroup_[un]load_subsys() grab
      cgroup_root_mutex too and update the lockdep annotation in
      for_each_subsys() to allow either cgroup_mutex or cgroup_root_mutex.
      
      * Lockdep annotation is moved from inner 'if' condition to outer 'for'
        init clause.  There's no reason to execute the assertion every loop.
      
      * Loop index @i is renamed to @ssid.  Indices iterating through subsys
        will be [re]named to @ssid gradually.
      
      v2: cgroup_assert_mutex_or_root_locked() caused build failure if
          !CONFIG_LOCKDEP.  Conditionalize its definition.  The build failure
          was reported by kbuild test bot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: kbuild test robot <fengguang.wu@intel.com>
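The point in 780cd8b3 about moving the assertion into the 'for' init clause can be illustrated with a small userspace macro. This is a sketch under assumptions: `for_each_subsys_sketch()`, `subsys_table[]` and the boolean "lock held" flags are hypothetical stand-ins for the kernel's table and lockdep state, and the comma operator replaces whatever construct the kernel macro actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBSYS_COUNT 4		/* hypothetical table size */

struct subsys_sketch { const char *name; };

static struct subsys_sketch *subsys_table[SUBSYS_COUNT];  /* NULL = unused */

static bool cgroup_mutex_held;	/* stand-ins for lockdep state */
static bool root_mutex_held;
static int assert_calls;	/* counts how often the check ran */

static void assert_mutex_or_root_locked(void)
{
	assert_calls++;
	assert(cgroup_mutex_held || root_mutex_held);
}

/*
 * The assertion sits in the init clause, so it runs once per loop
 * rather than once per iteration.  Empty slots are skipped.
 */
#define for_each_subsys_sketch(ss, ssid)				\
	for ((ssid) = (assert_mutex_or_root_locked(), 0);		\
	     (ssid) < SUBSYS_COUNT; (ssid)++)				\
		if (!((ss) = subsys_table[(ssid)])) { }			\
		else

static int count_named_subsys(void)
{
	struct subsys_sketch *ss;
	int ssid, n = 0;

	for_each_subsys_sketch(ss, ssid)
		if (ss->name)
			n++;
	return n;
}
```

With two slots populated and either flag set, `count_named_subsys()` visits both entries while `assert_calls` advances by exactly one.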
    • cgroup: css iterations and css_from_dir() are safe under cgroup_mutex · 87fb54f1
      Tejun Heo authored
      Currently, all css iterations and css_from_dir() require RCU read lock
      whether the caller is holding cgroup_mutex or not, which is
      unnecessarily restrictive.  They are all safe to use under
      cgroup_mutex without holding RCU read lock.
      
      Factor out cgroup_assert_mutex_or_rcu_locked() from css_from_id() and
      apply it to all css iteration functions and css_from_dir().
      
      v2: cgroup_assert_mutex_or_rcu_locked() definition doesn't need to be
          inside CONFIG_PROVE_RCU ifdef as rcu_lockdep_assert() is always
          defined and conditionalized.  Move it outside of the ifdef block.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • Merge branch 'for-3.13-fixes' into for-3.14 · e58e1ca4
      Tejun Heo authored
      Pulling in as patches depending on 266ccd50 ("cgroup: fix
      cgroup_create() error handling path") are scheduled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: fix cgroup_create() error handling path · 266ccd50
      Tejun Heo authored
      ae7f164a ("cgroup: move cgroup->subsys[] assignment to
      online_css()") moved cgroup->subsys[] assignments later in
      cgroup_create() but didn't update error handling path accordingly
      leading to the following oops and leaking later css's after an
      online_css() failure.  The oops is from cgroup destruction path being
      invoked on the partially constructed cgroup which is not ready to
      handle empty slots in cgrp->subsys[] array.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        PGD a780a067 PUD aadbe067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
        Hardware name:
        task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
        RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
        RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
        RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
        RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
        R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
        R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
        FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
        Stack:
         ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
         ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
         ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
        Call Trace:
         [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
         [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
         [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
         [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
         [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b
      
      This patch moves reference bumping inside online_css() loop, clears
      css_ar[] as css's are brought online successfully, and updates
      err_destroy path so that either a css is fully online and destroyed by
      cgroup_destroy_locked() or the error path frees it.  This creates a
      duplicate css free logic in the error path but it will be cleaned up
      soon.
      
      v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
          invoked with a cgroup which doesn't have all css's populated.
          Update cgroup_destroy_locked() so that it skips NULL css's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Reported-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: stable@vger.kernel.org # v3.12+
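The ownership rule described in 266ccd50 (a css is either online and torn down by the destroy path, or still pending and freed by the error path) can be sketched as follows. This is a userspace approximation with hypothetical names (`create_sketch()`, `destroy_locked_sketch()`, `fail_online_at`); it only models who frees what, not reference counting or locking.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_SUBSYS 3		/* hypothetical number of controllers */

struct css_sketch { int online; };
struct cgrp_sketch { struct css_sketch *subsys[NR_SUBSYS]; };

static int live_css;		/* outstanding css allocations */
static int fail_online_at = -1;	/* test hook: index whose online fails */

/* Destroy path: tolerate empty slots instead of dereferencing them. */
static void destroy_locked_sketch(struct cgrp_sketch *cg)
{
	for (int i = 0; i < NR_SUBSYS; i++) {
		if (!cg->subsys[i])
			continue;
		free(cg->subsys[i]);
		cg->subsys[i] = NULL;
		live_css--;
	}
}

static int create_sketch(struct cgrp_sketch *cg)
{
	struct css_sketch *css_ar[NR_SUBSYS] = { 0 };
	int i;

	for (i = 0; i < NR_SUBSYS; i++) {
		css_ar[i] = calloc(1, sizeof(*css_ar[i]));
		if (!css_ar[i])
			goto err;
		live_css++;
	}

	for (i = 0; i < NR_SUBSYS; i++) {
		if (i == fail_online_at)
			goto err;
		css_ar[i]->online = 1;
		cg->subsys[i] = css_ar[i];
		css_ar[i] = NULL;	/* now owned by the cgroup */
	}
	return 0;

err:
	/* free what never came online ... */
	for (i = 0; i < NR_SUBSYS; i++) {
		if (css_ar[i]) {
			free(css_ar[i]);
			live_css--;
		}
	}
	/* ... and let the destroy path handle what did */
	destroy_locked_sketch(cg);
	return -1;
}
```

Clearing each `css_ar[]` slot at the moment of handoff is what keeps the two cleanup loops from ever touching the same object.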
  2. 05 Dec, 2013 12 commits
    • cgroup: unify pidlist and other file handling · 6612f05b
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  With the previous
      changes, the difference between pidlist and other files is very
      small.  Both are served by seq_file in a pretty standard way with the
      only difference being !pidlist files use single_open().
      
      This patch adds cftype->seq_start(), ->seq_next and ->seq_stop() and
      implements the matching cgroup_seqfile_start/next/stop() which either
      emulates single_open() behavior or invokes cftype->seq_*() operations
      if specified.  This allows using single seq_operations for both
      pidlist and other files and makes cgroup_pidlist_operations and
      cgroup_pidlist_open() no longer necessary.  As cgroup_pidlist_open()
      was the only user of cftype->open(), the method is dropped together.
      
      This brings cftype file interface very close to kernfs interface and
      mapping shouldn't be too difficult.  Once converted to kernfs, most of
      the plumbing code including cgroup_seqfile_*() will be removed as
      kernfs provides those facilities.
      
      This patch does not introduce any behavior changes.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      
      v3: Refreshed on top of the updated "cgroup: attach cgroup_open_file
          to all cgroup files".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
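The dispatch described in 6612f05b, where generic start/next/stop helpers either emulate single_open() behavior or defer to per-file callbacks, can be sketched in userspace C. The types and names here (`cft_sketch`, `sketch_seq_start()`, `count_records()`, `int_list`) are hypothetical and the callback signatures are simplified; the emulation assumes that a single-record file yields one token at position 0 and nothing afterwards.

```c
#include <assert.h>
#include <stddef.h>

struct cft_sketch {
	void *(*seq_start)(void *priv, long *pos);
	void *(*seq_next)(void *priv, void *v, long *pos);
	void (*seq_stop)(void *priv, void *v);
};

static int single_token;	/* address used as the one-and-only record */

static void *sketch_seq_start(const struct cft_sketch *cft, void *priv,
			      long *pos)
{
	if (cft->seq_start)
		return cft->seq_start(priv, pos);
	/* emulate a single-record file: one record at position 0 */
	return *pos == 0 ? &single_token : NULL;
}

static void *sketch_seq_next(const struct cft_sketch *cft, void *priv,
			     void *v, long *pos)
{
	if (cft->seq_next)
		return cft->seq_next(priv, v, pos);
	++*pos;
	return NULL;
}

static void sketch_seq_stop(const struct cft_sketch *cft, void *priv, void *v)
{
	if (cft->seq_stop)
		cft->seq_stop(priv, v);
}

/* Drive one full pass and report how many records would be shown. */
static int count_records(const struct cft_sketch *cft, void *priv)
{
	long pos = 0;
	int shown = 0;
	void *v = sketch_seq_start(cft, priv, &pos);

	while (v) {
		shown++;
		v = sketch_seq_next(cft, priv, v, &pos);
	}
	sketch_seq_stop(cft, priv, v);
	return shown;
}

/* Example multi-record file, loosely analogous to a pid list. */
struct int_list { int *items; long len; };

static void *list_start(void *priv, long *pos)
{
	struct int_list *l = priv;

	return *pos < l->len ? &l->items[*pos] : NULL;
}

static void *list_next(void *priv, void *v, long *pos)
{
	(void)v;
	++*pos;
	return list_start(priv, pos);
}
```

A file type with no callbacks set produces exactly one record, while one that supplies `list_start()` and `list_next()` produces one record per list element, all through the same driver.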
    • cgroup: replace cftype->read_seq_string() with cftype->seq_show() · 2da8ca82
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      replaces cftype->read_seq_string() with cftype->seq_show() which is
      not limited to single_open() operation and will map directly to
      kernfs seq_file interface.
      
      The conversions are mechanical.  As ->seq_show() doesn't have @css and
      @cft, the functions which make use of them are converted to use
      seq_css() and seq_cft() respectively.  On several occasions, e.g. if
      it has seq_string in its name, the function name is updated to fit the
      new method better.
      
      This patch does not introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
    • cgroup: attach cgroup_open_file to all cgroup files · 7da11279
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      attaches cgroup_open_file, which used to be attached to pidlist files,
      to all cgroup files, introduces seq_css/cft() accessors to determine
      the cgroup_subsys_state and cftype associated with a given cgroup
      seq_file, and exports them as public interface.
      
      This doesn't cause any behavior changes but unifies cgroup file
      handling across different file types and will help converting them to
      kernfs seq_show() interface.
      
      v2: Li pointed out that the original patch was using
          single_open_size() incorrectly assuming that the size param is
          private data size.  Fix it by allocating @of separately and
          passing it to single_open() and explicitly freeing it in the
          release path.  This isn't the prettiest but this path is gonna be
          restructured by the following patches pretty soon.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: generalize cgroup_pidlist_open_file · 5d22444f
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch renames
      cgroup_pidlist_open_file to cgroup_open_file and updates it so that it
      only contains a field to identify the specific file, ->cfe, and an
      opaque ->priv pointer.  When cgroup is converted to kernfs, this will
      be replaced by kernfs_open_file which contains about the same
      information.
      
      As whether the file is "cgroup.procs" or "tasks" should now be
      determined from cgroup_open_file->cfe, the cftype->private for the two
      files now carry the file type and cgroup_pidlist_start() reads the
      type through cfe->type->private.  This makes the distinction between
      cgroup_tasks_open() and cgroup_procs_open() unnecessary.
      cgroup_pidlist_open() is now directly used as the open method.
      
      This patch doesn't make any behavior changes.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: unify read path so that seq_file is always used · 896f5199
      Tejun Heo authored
      With the recent removal of cftype->read() and ->read_map(), only three
      operations are remaining, ->read_u64(), ->read_s64() and
      ->read_seq_string().  Currently, the first two are handled directly
      while the last is handled through seq_file.
      
      It is trivial to serve the first two through the seq_file path too.
      This patch restructures read path so that all operations are served
      through cgroup_seqfile_show().  This makes all cgroup files seq_file -
      single_open/release() are now used by default,
      cgroup_seqfile_operations is dropped, and cgroup_file_operations uses
      seq_read() for read.
      
      This simplifies the code and makes the read path easy to convert to
      use kernfs.
      
      Note that, while cgroup_file_operations uses seq_read() for read, it
      still uses generic_file_llseek() for seeking instead of seq_lseek().
      This is different from cgroup_seqfile_operations but shouldn't break
      anything and brings the seeking behavior aligned with kernfs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
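The unified read path from 896f5199 amounts to one show routine that formats whichever read method a file provides. The following userspace sketch assumes simplified callback signatures (the kernel versions take css and cftype arguments) and uses a fixed-size `out_buf` as a stand-in for a seq_file; `sketch_seqfile_show()` and the demo callbacks are hypothetical names.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct out_buf {		/* stand-in for a seq_file buffer */
	char data[64];
	size_t len;
};

struct cft_sketch {
	uint64_t (*read_u64)(void);
	int64_t (*read_s64)(void);
	int (*read_seq_string)(struct out_buf *m);
};

/* One entry point serves all three read methods. */
static int sketch_seqfile_show(const struct cft_sketch *cft, struct out_buf *m)
{
	int n;

	if (cft->read_seq_string)
		return cft->read_seq_string(m);

	if (cft->read_u64)
		n = snprintf(m->data, sizeof(m->data), "%" PRIu64 "\n",
			     cft->read_u64());
	else if (cft->read_s64)
		n = snprintf(m->data, sizeof(m->data), "%" PRId64 "\n",
			     cft->read_s64());
	else
		return -1;	/* no read method at all */

	if (n < 0 || (size_t)n >= sizeof(m->data))
		return -1;
	m->len = (size_t)n;
	return 0;
}

/* Demo callbacks used only to exercise the dispatcher. */
static uint64_t demo_u64(void) { return 42; }
static int64_t demo_s64(void) { return -7; }
```

Routing the integer methods through the same formatter is what lets every file share a single set of file operations in this model.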
    • cgroup: unify cgroup_write_X64() and cgroup_write_string() · a742c59d
      Tejun Heo authored
      cgroup_write_X64() and cgroup_write_string() both implement about the
      same buffering logic.  Unify the two into cgroup_file_write() which
      always allocates dynamic buffer for simplicity and uses kstrto*()
      instead of simple_strto*().
      
      This patch doesn't make any visible behavior changes except for
      possibly different error value from kstrto*().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
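The note in a742c59d about possibly different error values comes down to strict versus lenient parsing. The sketch below is a userspace approximation built on `strtoull()`; `strict_parse_u64()` is a hypothetical helper, not a kernel function. It assumes the strict behavior of rejecting trailing garbage (apart from one trailing newline) and reporting overflow, where a lenient parser would stop at the first non-digit without complaint.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse the whole string as an unsigned 64-bit number.
 * Returns 0 on success, -EINVAL for malformed input and
 * -ERANGE on overflow.  Base 0 accepts decimal, 0x hex and
 * leading-0 octal.
 */
static int strict_parse_u64(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	/* strtoull() would skip whitespace and accept '-'; refuse both */
	if (!isdigit((unsigned char)s[0]) && s[0] != '+')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;

	if (*end == '\n')	/* tolerate a single trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;

	*out = v;
	return 0;
}
```

Under this scheme a write of "12abc" fails outright, whereas a lenient parser would have accepted it as 12.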
    • cgroup: remove cftype->read(), ->read_map() and ->write() · 6e0755b0
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      After recent updates, ->read() and ->read_map() don't have any user
      left and ->write() never had any user.  Remove them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • hugetlb_cgroup: convert away from cftype->read() · 716f479d
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      All users of cftype->read() can be easily served, usually better, by
      seq_file and other methods.  Update hugetlb_cgroup_read() to return
      u64 instead of printing itself and rename it to
      hugetlb_cgroup_read_u64().
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
    • netprio_cgroup: convert away from cftype->read_map() · e92e113c
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string().  Update read_priomap() to use ->read_seq_string()
      instead.
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • memcg: convert away from cftype->read() and ->read_map() · 791badbd
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string(), and all users of cftype->read() can be easily
      served, usually better, by seq_file and other methods.
      
      Update mem_cgroup_read() to return u64 instead of printing itself and
      rename it to mem_cgroup_read_u64(), and update
      mem_cgroup_oom_control_read() to use ->read_seq_string() instead of
      ->read_map().
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    • cpuset: convert away from cftype->read() · 51ffe411
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      All users of cftype->read() can be easily served, usually better, by
      seq_file and other methods.  Rename cpuset_common_file_read() to
      cpuset_common_read_seq_string() and convert it to use
      read_seq_string() interface instead.  This not only simplifies the
      code but also makes it more versatile.  Before, the file couldn't
      output if the result is longer than PAGE_SIZE.  After the conversion,
      seq_file automatically grows the buffer until the output can fit.
      
      This patch doesn't make any visible behavior changes except for being
      able to handle output larger than PAGE_SIZE.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
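The buffer-growing behavior that 51ffe411 relies on can be modeled in userspace as a retry loop: render into a buffer, and if the output did not fit, discard it and try again with twice the space. This is a sketch of the general technique, not seq_file code; `render_growing()`, `show_fn` and `show_wide()` are hypothetical, and the callback is assumed to report its required length the way `snprintf()` does.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Formats into buf and returns the length needed, like snprintf(). */
typedef int (*show_fn)(char *buf, size_t size, void *arg);

/*
 * Call show() with progressively larger buffers until the output
 * fits.  Returns a malloc()ed string the caller must free, or NULL
 * on failure.  *final_size receives the size that finally worked.
 */
static char *render_growing(show_fn show, void *arg, size_t initial,
			    size_t *final_size)
{
	size_t size = initial ? initial : 1;

	for (;;) {
		char *buf = malloc(size);
		int need;

		if (!buf)
			return NULL;
		need = show(buf, size, arg);
		if (need >= 0 && (size_t)need < size) {
			*final_size = size;
			return buf;
		}
		free(buf);
		if (need < 0)
			return NULL;
		size *= 2;	/* did not fit: retry with double the room */
	}
}

/* Demo callback: a zero-padded number *arg characters wide. */
static int show_wide(char *buf, size_t size, void *arg)
{
	int width = *(int *)arg;

	return snprintf(buf, size, "%0*d", width, 7);
}
```

Starting from 16 bytes, a 100-character result succeeds on the fourth attempt with a 128-byte buffer, which is the same shape of behavior that removes a fixed PAGE_SIZE ceiling.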
    • cgroup, sched: convert away from cftype->read_map() · 44ffc75b
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string().  Update cpu_stats_show() and cpuacct_stats_show()
      accordingly.
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
  3. 29 Nov, 2013 9 commits
    • cgroup: don't guarantee cgroup.procs is sorted if sane_behavior · afb2bc14
      Tejun Heo authored
      For some reason, tasks and cgroup.procs guarantee that the result is
      sorted.  This is the only reason this whole pidlist logic is necessary
      instead of just iterating through sorted member tasks.  We can't do
      anything about the existing interface but at least ensure that such
      expectation doesn't exist for the new interface so that pidlist logic
      may be removed in the distant future.
      
      This patch scrambles the sort order if sane_behavior so that the
      output is usually not sorted in the new interface.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
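The message for afb2bc14 does not say how the sort order is scrambled. One simple possibility, shown here purely as an assumption, is to sort on a key derived from the pid by swapping each pair of adjacent bits. That mapping is one-to-one (applying it twice gives the original value back), so distinct pids keep distinct keys while the numeric order is disturbed. `scramble_pid()` and `cmp_scrambled()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Swap every pair of adjacent bits: bit 0 with 1, 2 with 3, ... */
static uint32_t scramble_pid(uint32_t pid)
{
	uint32_t even = pid & 0x55555555u;	/* bits 0, 2, 4, ... */
	uint32_t odd = pid & 0xAAAAAAAAu;	/* bits 1, 3, 5, ... */

	return (even << 1) | (odd >> 1);
}

/* qsort() comparator ordering pids by their scrambled keys. */
static int cmp_scrambled(const void *a, const void *b)
{
	uint32_t x = scramble_pid(*(const uint32_t *)a);
	uint32_t y = scramble_pid(*(const uint32_t *)b);

	return (x > y) - (x < y);
}
```

Sorting {1, 2, 3, 4} with `cmp_scrambled()` yields {2, 1, 3, 4}: still a stable, deduplicable order for the implementation, but not one a reader could mistake for ascending pids.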
    • cgroup: remove cgroup_pidlist->use_count · 04502365
      Tejun Heo authored
      After the recent changes, pidlist ref is held only between
      cgroup_pidlist_start() and cgroup_pidlist_stop() during which
      cgroup->pidlist_mutex is also held.  IOW, the reference count is
      redundant now.  While in use, it's always one and pidlist_mutex is
      held - holding the mutex has exactly the same protection.
      
      This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
      so that pidlist_mutex is not released in between and drops
      pidlist->use_count.
      
      This patch shouldn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: load and release pidlists from seq_file start and stop respectively · 4bac00d1
      Tejun Heo authored
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      The previous patches implemented delayed release and restructured
      pidlist handling so that pidlists can be loaded and released from
      seq_file start / stop.  This patch actually moves pidlist load to
      start and release to stop.
      
      This means that pidlist is pinned only between start and stop and may
      go away between two consecutive read calls if the two calls are apart
      by more than CGROUP_PIDLIST_DESTROY_DELAY.  cgroup_pidlist_start()
      thus can't re-use the stored cgroup_pidlist_open_file->pidlist
      directly.  During start, it's only used as a hint indicating whether
      this is the first start after open or not and pidlist is always looked
      up or created.
      
      pidlist_mutex locking and reference counting are moved out of
      pidlist_array_load() so that pidlist_array_load() can perform lookup
      and creation atomically.  While this enlarges the area covered by
      pidlist_mutex, given how the lock is used, it's highly unlikely to be
      noticeable.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: remove cgroup_pidlist->rwsem · 069df3b7
      Tejun Heo authored
      cgroup_pidlist locking is needlessly complicated.  It has outer
      cgroup->pidlist_mutex to protect the list of pidlists associated with
      a cgroup and then each pidlist has rwsem to synchronize updates and
      reads.  Given that the only read access is from seq_file operations
      which are always invoked back-to-back, the rwsem is a giant overkill.
      All it does is add unnecessary complexity.
      
      This patch removes cgroup_pidlist->rwsem and protects all accesses to
      pidlists belonging to a cgroup with cgroup->pidlist_mutex.
      pidlist->rwsem locking is removed if it's nested inside
      cgroup->pidlist_mutex; otherwise, it's replaced with
      cgroup->pidlist_mutex locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: refactor cgroup_pidlist_find() · e6b81710
      Tejun Heo authored
      Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
      separate out finding proper to cgroup_pidlist_find().  Also, move
      locking to the caller.
      
      This patch is preparation for pidlist restructure and doesn't
      introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: introduce struct cgroup_pidlist_open_file · 62236858
      Tejun Heo authored
      For pidlist files, seq_file->private pointed to the loaded
      cgroup_pidlist; however, pidlist loading is planned to be moved to
      cgroup_pidlist_start() for kernfs conversion and seq_file->private
      needs to carry more information from open to allow that.
      
      This patch introduces struct cgroup_pidlist_open_file which contains
      type, cgrp and pidlist and updates pidlist seq_file->private to point
      to it using seq_open_private() and seq_release_private().  Note that
      this eventually will be replaced by kernfs_open_file.
      
      While this patch makes more information available to seq_file
      operations, they don't use it yet and this patch doesn't introduce any
      behavior changes except for allocation of the extra private struct.
      
      v2: use __seq_open_private() instead of seq_open_private() for brevity
          as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: implement delayed destruction for cgroup_pidlist · b1a21367
      Tejun Heo authored
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      This patch implements delayed release of pidlist.  As pidlists could
      be lingering on cgroup removal waiting for the timer to expire, cgroup
      free path needs to queue the destruction work item immediately and
      flush.  As those work items are self-destroying, each work item can't
      be flushed directly.  A new workqueue - cgroup_pidlist_destroy_wq - is
      added to serve as flush domain.
      
      Note that this patch just adds delayed release on top of the current
      implementation and doesn't change where pidlist is loaded and
      released.  Following patches will make those changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: remove cftype->release() · b9f3ceca
      Tejun Heo authored
      Now that pidlist files don't use cftype->release(), it doesn't have
      any user left.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: don't skip seq_open on write only opens on pidlist files · ac1e69aa
      Tejun Heo authored
      Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
      if the file is opened write-only, which is a sensible optimization as
      pidlist loading can be costly and there often are occasions where
      tasks or cgroup.procs is opened write-only.  However, pidlist init and
      release are planned to be moved to cgroup_pidlist_start/stop()
      respectively which would make this optimization unnecessary.
      
      This patch removes the optimization and always fully initializes
      pidlist files regardless of open mode.  This will help moving pidlist
      handling to start/stop by unifying rw paths and removes the need for
      specifying cftype->release() in addition to .release in
      cgroup_pidlist_operations as file->f_op is now always overridden.  As
      pidlist files were the only user of cftype->release(), the next patch
      will remove the method.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  4. 27 Nov, 2013 3 commits
    • cgroup: Merge branch 'for-3.13-fixes' into for-3.14 · c729b11e
      Tejun Heo authored
      Pull to receive e605b365 ("cgroup: fix cgroup_subsys_state leak
      for seq_files") as for-3.14 is scheduled to have a lot of changes
      which depend on it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Tejun Heo authored
      If a cgroup file implements either read_map() or read_seq_string(),
      such file is served using seq_file by overriding file->f_op to
      cgroup_seqfile_operations, which also overrides the release method to
      single_release() from cgroup_file_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files and
      using it for seq_file release path too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
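The leak fixed by e605b365 follows a common pattern: open takes a reference, but one class of files is released through a different function that never drops it. The userspace sketch below models the repaired shape, where a single release routine handles both plain and seq_file-backed files. All names (`file_open_sketch()`, `file_release_sketch()`, the structs) are hypothetical, and a plain integer stands in for the css reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct css_sketch { int refcnt; };

struct file_sketch {
	struct css_sketch *css;
	bool is_seq_file;
	void *seq_state;	/* per-open seq_file state, if any */
};

/* Open pins the css for every file type. */
static int file_open_sketch(struct file_sketch *f, struct css_sketch *css,
			    bool is_seq_file)
{
	f->css = css;
	f->is_seq_file = is_seq_file;
	f->seq_state = NULL;
	css->refcnt++;

	if (is_seq_file) {
		f->seq_state = malloc(16);
		if (!f->seq_state) {
			css->refcnt--;	/* undo the pin on failure */
			f->css = NULL;
			return -1;
		}
	}
	return 0;
}

/*
 * One release routine for both kinds of file: tear down seq state
 * when present, then always drop the reference taken at open.
 */
static void file_release_sketch(struct file_sketch *f)
{
	if (f->is_seq_file) {
		free(f->seq_state);
		f->seq_state = NULL;
	}
	f->css->refcnt--;
	f->css = NULL;
}
```

Had the seq_file case been released by a routine that only freed `seq_state`, each open/release cycle would leave `refcnt` one higher, which is the leak the commit describes.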
    • cpuset: Fix memory allocator deadlock · 0fc0287c
      Peter Zijlstra authored
      Juri hit the below lockdep report:
      
      [    4.303391] ======================================================
      [    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      [    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
      [    4.303395] ------------------------------------------------------
      [    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
      [    4.303417]
      [    4.303417] and this task is already holding:
      [    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
      [    4.303431] which would create a new lock dependency:
      [    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
      [    4.303436]
      
      [    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
      [    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
      [    4.303922]    HARDIRQ-ON-W at:
      [    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
      [    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303933]    SOFTIRQ-ON-W at:
      [    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
      [    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303959]    INITIAL USE at:
      [    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
      [    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303972]  }
      
      Which reports that we take mems_allowed_seq with interrupts enabled. A
      little digging found that this can only be from
      cpuset_change_task_nodemask().
      
      This is an actual deadlock because an interrupt doing an allocation will
      hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
      forever waiting for the write side to complete.
      
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reported-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Juri Lelli <juri.lelli@gmail.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
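The deadlock described in 0fc0287c follows from how a sequence counter works: the writer makes the count odd while an update is in flight, and a reader must wait until it is even again. The single-threaded model below, with hypothetical names and no memory barriers, only shows the state a reader would observe. If that reader is an interrupt handler running on the CPU that holds the counter odd, the writer can never resume to finish the update.

```c
#include <assert.h>
#include <stdbool.h>

struct seqcount_sketch { unsigned sequence; };

/* Writer side: odd while an update is in progress, even otherwise. */
static void write_begin_sketch(struct seqcount_sketch *s)
{
	s->sequence++;
}

static void write_end_sketch(struct seqcount_sketch *s)
{
	s->sequence++;
}

/*
 * Reader side: a real reader loops for as long as this returns true.
 * If the reader has interrupted the writer on the same CPU, that
 * loop never ends.
 */
static bool reader_must_wait(const struct seqcount_sketch *s)
{
	return s->sequence & 1;
}

/* After reading, a changed count means the data may be inconsistent. */
static bool read_retry_sketch(const struct seqcount_sketch *s, unsigned start)
{
	return s->sequence != start;
}
```

Between `write_begin_sketch()` and `write_end_sketch()` the counter is odd and `reader_must_wait()` is true, which is the window in which an allocating interrupt would spin.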
  5. 22 Nov, 2013 11 commits