1. 05 Dec, 2013 3 commits
    • memcg: convert away from cftype->read() and ->read_map() · 791badbd
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string(), and all users of cftype->read() can be easily
      served, usually better, by seq_file and other methods.
      
      Update mem_cgroup_read() to return u64 instead of printing itself and
      rename it to mem_cgroup_read_u64(), and update
      mem_cgroup_oom_control_read() to use ->read_seq_string() instead of
      ->read_map().
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      791badbd
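The split described in this commit, where a handler returns a u64 and a generic layer does the printing, can be sketched in userspace C. This is only an illustration: `struct file_type`, `usage_read_u64()` and `render_u64()` are hypothetical names invented for the sketch and are not the kernel's cftype interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical file description: the handler only returns a value. */
struct file_type {
    const char *name;
    uint64_t (*read_u64)(const void *state);
};

/* Example handler: reports a counter, does no formatting itself. */
static uint64_t usage_read_u64(const void *state)
{
    return *(const uint64_t *)state;
}

/* Generic layer: formats whatever the handler returned.
 * Returns the number of characters written (snprintf semantics). */
static int render_u64(const struct file_type *ft, const void *state,
                      char *buf, size_t size)
{
    return snprintf(buf, size, "%llu\n",
                    (unsigned long long)ft->read_u64(state));
}
```

Keeping the formatting in one generic function means every u64 file prints the same way, and the handler no longer needs to know about buffers at all.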
    • cpuset: convert away from cftype->read() · 51ffe411
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      All users of cftype->read() can be easily served, usually better, by
      seq_file and other methods.  Rename cpuset_common_file_read() to
      cpuset_common_read_seq_string() and convert it to use
      read_seq_string() interface instead.  This not only simplifies the
      code but also makes it more versatile.  Before, the file couldn't
      output if the result is longer than PAGE_SIZE.  After the conversion,
      seq_file automatically grows the buffer until the output can fit.
      
      This patch doesn't make any visible behavior changes except for being
      able to handle output larger than PAGE_SIZE.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      51ffe411
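The "grow the buffer until the output fits" behavior mentioned in this commit can be modeled in userspace C. The sketch below is an assumption-laden illustration, not the seq_file implementation: `render_growing()`, `show_fn` and `show_range()` are hypothetical names, and the doubling policy is chosen here for simplicity.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical "show" callback: writes into buf and returns the number of
 * bytes the complete output needs, excluding the NUL (snprintf semantics). */
typedef int (*show_fn)(char *buf, size_t size, void *arg);

/* Call show() again with a doubled buffer until the output fits.
 * Returns a malloc'd NUL-terminated string, or NULL on failure. */
static char *render_growing(show_fn show, void *arg, size_t initial)
{
    size_t size = initial ? initial : 1;
    char *buf = malloc(size);

    while (buf) {
        int need = show(buf, size, arg);

        if (need < 0) {
            free(buf);
            return NULL;
        }
        if ((size_t)need < size)
            return buf;          /* output fit, including the NUL */
        free(buf);
        size *= 2;
        buf = malloc(size);
    }
    return NULL;
}

/* Example callback: prints "0 1 2 ... count-1 ". */
static int show_range(char *buf, size_t size, void *arg)
{
    int count = *(int *)arg;
    size_t used = 0;

    if (size)
        buf[0] = '\0';
    for (int i = 0; i < count; i++) {
        /* Once the buffer is full, keep counting with a NULL/0 destination. */
        char *dst = used < size ? buf + used : NULL;
        size_t room = used < size ? size - used : 0;
        int n = snprintf(dst, room, "%d ", i);

        if (n < 0)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

The callback never has to know the final size in advance; it is simply re-run against a larger buffer, which is why a fixed PAGE_SIZE-style limit stops being a constraint in this model.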
    • cgroup, sched: convert away from cftype->read_map() · 44ffc75b
      Tejun Heo authored
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string().  Update cpu_stats_show() and cpuacct_stats_show()
      accordingly.
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      44ffc75b
  2. 29 Nov, 2013 9 commits
    • cgroup: don't guarantee cgroup.procs is sorted if sane_behavior · afb2bc14
      Tejun Heo authored
      For some reason, tasks and cgroup.procs guarantee that the result is
      sorted.  This is the only reason this whole pidlist logic is necessary
      instead of just iterating through sorted member tasks.  We can't do
      anything about the existing interface but at least ensure that such
      expectation doesn't exist for the new interface so that pidlist logic
      may be removed in the distant future.
      
      This patch scrambles the sort order if sane_behavior so that the
      output is usually not sorted in the new interface.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      afb2bc14
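One way to make output "usually not sorted" while still keeping a deterministic total order is to sort on a scrambled key. The following userspace C sketch assumes a bit-swapping permutation picked for illustration; `fry()`, `cmp_fried()` and `scramble_sort()` are hypothetical names and the commit message does not say which permutation the patch uses.

```c
#include <stdlib.h>

/* Swap every pair of adjacent bits. The mapping is its own inverse, so
 * distinct inputs always map to distinct keys. */
static unsigned int fry(unsigned int v)
{
    unsigned int even = v & 0x55555555u;
    unsigned int odd  = v & 0xAAAAAAAAu;

    return (even << 1) | (odd >> 1);
}

/* Compare two ids by their scrambled keys rather than their values. */
static int cmp_fried(const void *a, const void *b)
{
    unsigned int x = fry(*(const unsigned int *)a);
    unsigned int y = fry(*(const unsigned int *)b);

    return (x > y) - (x < y);
}

/* Sort ids into scrambled-key order. */
static void scramble_sort(unsigned int *ids, size_t n)
{
    qsort(ids, n, sizeof(*ids), cmp_fried);
}
```

Because the key function is a bijection, duplicates can still be detected by adjacency after sorting, yet a consumer that relies on ascending numeric order will notice quickly that no such guarantee exists.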
    • cgroup: remove cgroup_pidlist->use_count · 04502365
      Tejun Heo authored
      After the recent changes, pidlist ref is held only between
      cgroup_pidlist_start() and cgroup_pidlist_stop() during which
      cgroup->pidlist_mutex is also held.  IOW, the reference count is
      redundant now.  While in use, it's always one and pidlist_mutex is
      held - holding the mutex has exactly the same protection.
      
      This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
      so that pidlist_mutex is not released in between and drops
      pidlist->use_count.
      
      This patch shouldn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      04502365
    • cgroup: load and release pidlists from seq_file start and stop respectively · 4bac00d1
      Tejun Heo authored
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      The previous patches implemented delayed release and restructured
      pidlist handling so that pidlists can be loaded and released from
      seq_file start / stop.  This patch actually moves pidlist load to
      start and release to stop.
      
      This means that pidlist is pinned only between start and stop and may
      go away between two consecutive read calls if the two calls are apart
      by more than CGROUP_PIDLIST_DESTROY_DELAY.  cgroup_pidlist_start()
      thus can't re-use the stored cgroup_pidlist_open_file->pidlist
      directly.  During start, it's only used as a hint indicating whether
      this is the first start after open or not and pidlist is always looked
      up or created.
      
      pidlist_mutex locking and reference counting are moved out of
      pidlist_array_load() so that pidlist_array_load() can perform lookup
      and creation atomically.  While this enlarges the area covered by
      pidlist_mutex, given how the lock is used, it's highly unlikely to be
      noticeable.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      4bac00d1
    • cgroup: remove cgroup_pidlist->rwsem · 069df3b7
      Tejun Heo authored
      cgroup_pidlist locking is needlessly complicated.  It has outer
      cgroup->pidlist_mutex to protect the list of pidlists associated with
      a cgroup and then each pidlist has rwsem to synchronize updates and
      reads.  Given that the only read access is from seq_file operations
      which are always invoked back-to-back, the rwsem is a giant overkill.
      All it does is add unnecessary complexity.
      
      This patch removes cgroup_pidlist->rwsem and protects all accesses to
      pidlists belonging to a cgroup with cgroup->pidlist_mutex.
      pidlist->rwsem locking is removed if it's nested inside
      cgroup->pidlist_mutex; otherwise, it's replaced with
      cgroup->pidlist_mutex locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      069df3b7
    • cgroup: refactor cgroup_pidlist_find() · e6b81710
      Tejun Heo authored
      Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
      separate out finding proper to cgroup_pidlist_find().  Also, move
      locking to the caller.
      
      This patch is preparation for pidlist restructure and doesn't
      introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      e6b81710
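The refactor above separates a pure lookup from a lookup-or-create helper and leaves locking to the caller. A minimal userspace C sketch of that shape follows; the structures and the `ns_id` key are simplified stand-ins invented for illustration, and locking is represented only by a comment because the caller is assumed to hold the relevant lock.

```c
#include <stdlib.h>

enum list_type { LIST_PROCS, LIST_TASKS };

/* Simplified stand-in for a keyed pid list. */
struct pidlist {
    enum list_type type;
    int ns_id;                /* stand-in for the namespace key */
    struct pidlist *next;
};

/* Owner of the lists. The caller is assumed to hold the owner's lock
 * around both helpers below. */
struct owner {
    struct pidlist *lists;
};

/* Pure lookup: never allocates and never modifies the owner. */
static struct pidlist *pidlist_find(struct owner *o, enum list_type type,
                                    int ns_id)
{
    for (struct pidlist *l = o->lists; l; l = l->next)
        if (l->type == type && l->ns_id == ns_id)
            return l;
    return NULL;
}

/* Lookup, creating and linking a new entry only when none exists.
 * Atomic with respect to other callers because the lock is held outside. */
static struct pidlist *pidlist_find_create(struct owner *o,
                                           enum list_type type, int ns_id)
{
    struct pidlist *l = pidlist_find(o, type, ns_id);

    if (l)
        return l;
    l = calloc(1, sizeof(*l));
    if (!l)
        return NULL;
    l->type = type;
    l->ns_id = ns_id;
    l->next = o->lists;
    o->lists = l;
    return l;
}
```

With the lock taken by the caller, a later step can combine the lookup with other work under a single critical section, which is what the follow-up restructuring in this series relies on.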
    • cgroup: introduce struct cgroup_pidlist_open_file · 62236858
      Tejun Heo authored
      For pidlist files, seq_file->private pointed to the loaded
      cgroup_pidlist; however, pidlist loading is planned to be moved to
      cgroup_pidlist_start() for kernfs conversion and seq_file->private
      needs to carry more information from open to allow that.
      
      This patch introduces struct cgroup_pidlist_open_file which contains
      type, cgrp and pidlist and updates pidlist seq_file->private to point
      to it using seq_open_private() and seq_release_private().  Note that
      this eventually will be replaced by kernfs_open_file.
      
      While this patch makes more information available to seq_file
      operations, they don't use it yet and this patch doesn't introduce any
      behavior changes except for allocation of the extra private struct.
      
      v2: use __seq_open_private() instead of seq_open_private() for brevity
          as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      62236858
    • cgroup: implement delayed destruction for cgroup_pidlist · b1a21367
      Tejun Heo authored
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      This patch implements delayed release of pidlist.  As pidlists could
      be lingering on cgroup removal waiting for the timer to expire, cgroup
      free path needs to queue the destruction work item immediately and
      flush.  As those work items are self-destroying, each work item can't
      be flushed directly.  A new workqueue - cgroup_pidlist_destroy_wq - is
      added to serve as flush domain.
      
      Note that this patch just adds delayed release on top of the current
      implementation and doesn't change where pidlist is loaded and
      released.  Following patches will make those changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      b1a21367
    • cgroup: remove cftype->release() · b9f3ceca
      Tejun Heo authored
      Now that pidlist files don't use cftype->release(), it doesn't have
      any user left.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      b9f3ceca
    • cgroup: don't skip seq_open on write only opens on pidlist files · ac1e69aa
      Tejun Heo authored
      Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
      if the file is opened write-only, which is a sensible optimization as
      pidlist loading can be costly and there often are occasions where
      tasks or cgroup.procs is opened write-only.  However, pidlist init and
      release are planned to be moved to cgroup_pidlist_start/stop()
      respectively which would make this optimization unnecessary.
      
      This patch removes the optimization and always fully initializes
      pidlist files regardless of open mode.  This will help moving pidlist
      handling to start/stop by unifying rw paths and removes the need for
      specifying cftype->release() in addition to .release in
      cgroup_pidlist_operations as file->f_op is now always overridden.  As
      pidlist files were the only user of cftype->release(), the next patch
      will remove the method.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ac1e69aa
  3. 27 Nov, 2013 3 commits
    • cgroup: Merge branch 'for-3.13-fixes' into for-3.14 · c729b11e
      Tejun Heo authored
      Pull to receive e605b365 ("cgroup: fix cgroup_subsys_state leak
      for seq_files") as for-3.14 is scheduled to have a lot of changes
      which depend on it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c729b11e
    • cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Tejun Heo authored
      If a cgroup file implements either read_map() or read_seq_string(),
      such file is served using seq_file by overriding file->f_op to
      cgroup_seqfile_operations, which also overrides the release method to
      single_release() from cgroup_file_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files and
      using it for seq_file release path too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
      e605b365
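The leak described in this commit comes from an open path that takes a reference and a second release path that never drops it. The userspace C model below illustrates only that shape; `struct css`, `struct open_file` and the three functions are simplified, hypothetical stand-ins rather than the kernel's definitions.

```c
/* Simplified stand-ins: a refcounted object and an open file pinning it. */
struct css {
    int refcnt;
};

struct open_file {
    struct css *css;
    int is_seq;               /* nonzero when served through the seq path */
};

/* Open pins the object for the lifetime of the open file. */
static void file_open(struct open_file *f, struct css *css, int is_seq)
{
    css->refcnt++;
    f->css = css;
    f->is_seq = is_seq;
}

/* Buggy shape: the seq path has its own release and skips the put,
 * so every open/release cycle on such a file leaks one reference. */
static void release_buggy(struct open_file *f)
{
    if (f->is_seq)
        return;               /* seq-specific teardown only */
    f->css->refcnt--;
}

/* Fixed shape: a single release handles both kinds and always puts. */
static void release_fixed(struct open_file *f)
{
    /* seq-specific teardown would go here when f->is_seq is set */
    f->css->refcnt--;
}
```

Routing both kinds of file through one release function keeps the get and the put paired in a single place, so adding a new resource to open cannot silently miss one of the release paths.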
    • cpuset: Fix memory allocator deadlock · 0fc0287c
      Peter Zijlstra authored
      Juri hit the below lockdep report:
      
      [    4.303391] ======================================================
      [    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      [    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
      [    4.303395] ------------------------------------------------------
      [    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
      [    4.303417]
      [    4.303417] and this task is already holding:
      [    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
      [    4.303431] which would create a new lock dependency:
      [    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
      [    4.303436]
      
      [    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
      [    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
      [    4.303922]    HARDIRQ-ON-W at:
      [    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
      [    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303933]    SOFTIRQ-ON-W at:
      [    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
      [    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303959]    INITIAL USE at:
      [    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
      [    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303972]  }
      
      Which reports that we take mems_allowed_seq with interrupts enabled. A
      little digging found that this can only be from
      cpuset_change_task_nodemask().
      
      This is an actual deadlock because an interrupt doing an allocation will
      hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
      forever waiting for the write side to complete.
      
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reported-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Juri Lelli <juri.lelli@gmail.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      0fc0287c
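The spin described in this commit follows from how a sequence counter works: the writer makes the count odd while it updates, and a reader waits for an even count. The single-threaded C model below is a simplified sketch with no memory barriers; the names are hypothetical, and the reader is deliberately bounded so the example terminates where a real reader would keep spinning.

```c
#include <stdbool.h>

/* Simplified sequence counter: odd means a write is in progress. */
struct seqcount {
    unsigned int sequence;
};

static void write_begin(struct seqcount *s)
{
    s->sequence++;            /* count becomes odd */
}

static void write_end(struct seqcount *s)
{
    s->sequence++;            /* count becomes even again */
}

/* Bounded stand-in for a reader's begin step. An unbounded reader loops
 * until the count is even; this one gives up after max_spins and returns
 * false so that the stuck case can be observed. */
static bool read_begin_bounded(const struct seqcount *s,
                               unsigned int max_spins,
                               unsigned int *seq_out)
{
    for (unsigned int i = 0; i < max_spins; i++) {
        unsigned int seq = s->sequence;

        if (!(seq & 1)) {
            *seq_out = seq;
            return true;
        }
    }
    return false;
}
```

If the reader runs in an interrupt that arrived on the same CPU between `write_begin()` and `write_end()`, the writer cannot resume until the interrupt returns, so the count never becomes even and an unbounded reader never finishes.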
  4. 22 Nov, 2013 25 commits
    • cgroup: Merge branch 'memcg_event' into for-3.14 · edab9510
      Tejun Heo authored
      Merge v3.12 based patch series to move cgroup_event implementation to
      memcg into for-3.14.  The following two commits cause a conflict in
      kernel/cgroup.c
      
        2ff2a7d0 ("cgroup: kill css_id")
        79bd9814 ("cgroup, memcg: move cgroup_event implementation to memcg")
      
      Each patch removes a struct definition from kernel/cgroup.c.  As the
      two are adjacent, they cause a context conflict.  Easily resolved by
      removing both structs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      edab9510
    • cgroup: unexport cgroup_css() and remove __file_cft() · b36824c7
      Tejun Heo authored
      Now that cgroup_event is made memcg specific, the temporarily exported
      functions are no longer necessary.  Unexport cgroup_css() and remove
      __file_cft() which doesn't have any user left.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      b36824c7
    • memcg: rename cgroup_event to mem_cgroup_event · 3bc942f3
      Tejun Heo authored
      cgroup_event is only available in memcg now.  Let's brand it that way.
      While at it, add a comment encouraging deprecation of the feature and
      remove the respective section from cgroup documentation.
      
      This patch is cosmetic.
      
      v3: Typo update as per Li Zefan.
      
      v2: Index in cgroups.txt updated accordingly as suggested by Li Zefan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      3bc942f3
    • memcg: make cgroup_event deal with mem_cgroup instead of cgroup_subsys_state · 59b6f873
      Tejun Heo authored
      cgroup_event is now memcg specific.  Replace cgroup_event->css with
      ->memcg and convert [un]register_event() callbacks to take mem_cgroup
      pointer instead of cgroup_subsys_state one.  This simplifies the code
      slightly and makes css_to_vmpressure() unnecessary which is removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      59b6f873
    • memcg: remove cgroup_event->cft · 347c4a87
      Tejun Heo authored
      The only use of cgroup_event->cft is distinguishing "usage_in_bytes"
      and "memsw.usage_in_bytes" for mem_cgroup_usage_[un]register_event(),
      which can be done by adding an explicit argument to the function and
      implementing two wrappers so that the two cases can be distinguished
      from the function alone.
      
      Remove cgroup_event->cft and the related code including
      [un]register_events() methods.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      347c4a87
    • cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg · fba94807
      Tejun Heo authored
      cgroup_event is being moved from cgroup core to memcg and the
      implementation is already moved by the previous patch.  This patch
      moves the data fields and callbacks.
      
      * cgroup->event_list[_lock] are moved to mem_cgroup.
      
      * cftype->[un]register_event() are moved to cgroup_event.  This makes
        it impossible for individual cftype definitions to specify their
        event callbacks.  This is worked around by simply hard-coding
        filename to event callback mapping in cgroup_write_event_control().
        This is awkward and inflexible, which is actually desirable given
        that we don't want to grow more usages of this feature.
      
      * eventfd_ctx declaration is removed from cgroup.h, which makes
        vmpressure.h miss eventfd_ctx declaration.  Include eventfd.h from
        vmpressure.h.
      
      v2: Use file name from dentry instead of cftype.  This will allow
          removing all cftype handling in the function.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      fba94807
    • memcg: cgroup_write_event_control() now knows @css is for memcg · b5557c4c
      Tejun Heo authored
      @css for cgroup_write_event_control() is now always for memcg and the
      target file should be a memcg file too.  Drop code which assumes @css
      is dummy_css and the target file may belong to different subsystems.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      b5557c4c
    • cgroup, memcg: move cgroup_event implementation to memcg · 79bd9814
      Tejun Heo authored
      cgroup_event is way over-designed and tries to build a generic
      flexible event mechanism into cgroup - fully customizable event
      specification for each user of the interface.  This is utterly
      unnecessary and overboard especially in the light of the planned
      unified hierarchy as there's gonna be single agent.  Simply generating
      events at fixed points, or if that's too restrictive, configurable
      cadence or single set of configurable points should be enough.
      
      Thankfully, memcg is the only user and gets to keep it.  Replacing it
      with something simpler on sane_behavior is strongly recommended.
      
      This patch moves cgroup_event and "cgroup.event_control"
      implementation to mm/memcontrol.c.  Clearing of events on cgroup
      destruction is moved from cgroup_destroy_locked() to
      mem_cgroup_css_offline(), which shouldn't make any noticeable
      difference.
      
      cgroup_css() and __file_cft() are exported to enable the move;
      however, this will soon be reverted once the event code is updated to
      be memcg specific.
      
      Note that "cgroup.event_control" will now exist only on the hierarchy
      with memcg attached to it.  While this change is visible to userland,
      it is unlikely to be noticeable as the file has never been meaningful
      outside memcg.
      
      Aside from the above change, this is pure code relocation.
      
      v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
          poll.h inclusion moved from cgroup.c to memcontrol.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      79bd9814
    • cgroup: use a dedicated workqueue for cgroup destruction · e5fca243
      Tejun Heo authored
      Since be445626 ("cgroup: remove synchronize_rcu() from
      cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
      freeing is performed from a work item from that point on and a later
      commit, ea15f8cc ("cgroup: split cgroup destruction into two
      steps"), moves css offlining to workqueue too.
      
      As cgroup destruction isn't depended upon for memory reclaim, the
      destruction work items were put on the system_wq; unfortunately, some
      controller may block in the destruction path for considerable duration
      while holding cgroup_mutex.  As large part of destruction path is
      synchronized through cgroup_mutex, when combined with high rate of
      cgroup removals, this has potential to fill up system_wq's max_active
      of 256.
      
      Also, it turns out that memcg's css destruction path ends up queueing
      and waiting for work items on system_wq through work_on_cpu().  If
      such operation happens while system_wq is fully occupied by cgroup
      destruction work items, work_on_cpu() can't make forward progress
      because system_wq is full and other destruction work items on
      system_wq can't make forward progress because the work item waiting
      for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
      
      This can be fixed by queueing destruction work items on a separate
      workqueue.  This patch creates a dedicated workqueue -
      cgroup_destroy_wq - for this purpose.  As these work items shouldn't
      have inter-dependencies and are mostly serialized by cgroup_mutex anyway,
      giving high concurrency level doesn't buy anything and the workqueue's
      @max_active is set to 1 so that destruction work items are executed
      one by one on each CPU.
      
      Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
      cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
      separate core_initcall().  In the future, we probably want to reorder
      so that workqueue init happens before cgroup_init().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
      Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
      Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
      Cc: stable@vger.kernel.org # v3.9+
      e5fca243
    • Linux 3.13-rc1 · 6ce4eac1
      Linus Torvalds authored
      6ce4eac1
    • Merge tag 'ecryptfs-3.13-rc1-quiet-checkers' of... · 57498f9c
      Linus Torvalds authored
      Merge tag 'ecryptfs-3.13-rc1-quiet-checkers' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
      
      Pull minor eCryptfs fix from Tyler Hicks:
       "Quiet static checkers by removing unneeded conditionals"
      
      * tag 'ecryptfs-3.13-rc1-quiet-checkers' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
        eCryptfs: file->private_data is always valid
      57498f9c
    • Merge tag 'sound-fix2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · e48f88a3
      Linus Torvalds authored
      Pull second set of sound fixes from Takashi Iwai:
       "A collection of small fixes in HD-audio quirks and runtime PM, ASoC
        rcar, ab8500 and other codecs.  Most of the commits are for stable
        kernels, too"
      
      * tag 'sound-fix2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda - Set current_headset_type to ALC_HEADSET_TYPE_ENUM (janitorial)
        ALSA: hda - Provide missing pin configs for VAIO with ALC260
        ALSA: hda - Add headset quirk for Dell Inspiron 3135
        ALSA: hda - Fix the headphone jack detection on Sony VAIO TX
        ALSA: hda - Fix missing bass speaker on ASUS N550
        ALSA: hda - Fix unbalanced runtime PM notification at resume
        ASoC: arizona: Set FLL to free-run before disabling
        ALSA: hda - A casual Dell Headset quirk
        ASoC: rcar: fixup dma_async_issue_pending() timing
        ASoC: rcar: off by one in rsnd_scu_set_route()
        ASoC: wm5110: Add post SYSCLK register patch for rev D chip
        ASoC: ab8500: Revert to using custom I/O functions
        ALSA: hda - Also enable mute/micmute LED control for "Lenovo dock" fixup
        ALSA: firewire-lib: include sound/asound.h to refer to snd_pcm_format_t
        ALSA: hda - Select FW_LOADER from CONFIG_SND_HDA_CODEC_CA0132_DSP
        ALSA: hda - Enable mute/mic-mute LEDs for more Thinkpads with Realtek codec
        ASoC: rcar: fixup mod access before checking
      e48f88a3
    • Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · aecde27c
      Linus Torvalds authored
      Pull DRM fixes from Dave Airlie:
       "I was going to leave this until post -rc1 but sysfs fixes broke
        hotplug in userspace, so I had to fix it harder, otherwise a set of
        pulls from intel, radeon and vmware,
      
        The vmware/ttm changes are a bit larger but since it's early and they are
        unlikely to break anything else I put them in, it lets vmware work
        with dri3"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (36 commits)
        drm/sysfs: fix hotplug regression since lifetime changes
        drm/exynos: g2d: fix memory leak to userptr
        drm/i915: Fix gen3 self-refresh watermarks
        drm/ttm: Remove set_need_resched from the ttm fault handler
        drm/ttm: Don't move non-existing data
        drm/radeon: hook up backlight functions for CI and KV family.
        drm/i915: Replicate BIOS eDP bpp clamping hack for hsw
        drm/i915: Do not enable package C8 on unsupported hardware
        drm/i915: Hold pc8 lock around toggling pc8.gpu_idle
        drm/i915: encoder->get_config is no longer optional
        drm/i915/tv: add ->get_config callback
        drm/radeon/cik: Add macrotile mode array query
        drm/radeon/cik: Return backend map information to userspace
        drm/vmwgfx: Make vmwgfx dma buffers prime aware
        drm/vmwgfx: Make surfaces prime-aware
        drm/vmwgfx: Hook up the prime ioctls
        drm/ttm: Add a minimal prime implementation for ttm base objects
        drm/vmwgfx: Fix false lockdep warning
        drm/ttm: Allow execbuf util reserves without ticket
        drm/i915: restore the early forcewake cleanup
        ...
      aecde27c
    • Merge tag 'pci-v3.13-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · e3414786
      Linus Torvalds authored
      Pull PCI updates from Bjorn Helgaas:
       "Miscellaneous
         - Remove duplicate disable from pcie_portdrv_remove() (Yinghai Lu)
         - Fix whitespace, capitalization, and spelling errors (Bjorn Helgaas)"
      
      * tag 'pci-v3.13-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: Remove duplicate pci_disable_device() from pcie_portdrv_remove()
        PCI: Fix whitespace, capitalization, and spelling errors
      e3414786
    • Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · b0e3636f
      Linus Torvalds authored
      Pull SCSI target updates from Nicholas Bellinger:
       "Things have been quiet this round with mostly bugfixes, percpu
        conversions, and other minor iscsi-target conformance testing changes.
      
        The highlights include:
      
         - Add demo_mode_discovery attribute for iscsi-target (Thomas)
         - Convert tcm_fc(FCoE) to use percpu-ida pre-allocation
         - Add send completion interrupt coalescing for ib_isert
         - Convert target-core to use percpu-refcounting for se_lun
         - Fix mutex_trylock usage bug in iscsit_increment_maxcmdsn
         - tcm_loop updates (Hannes)
         - target-core ALUA cleanups + prep for v3.14 SCSI Referrals support (Hannes)
      
        v3.14 is currently shaping up to be a busy development cycle in target
        land, with initial support for T10 Referrals and T10 DIF currently on
        the roadmap"
      
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (40 commits)
        iscsi-target: chap auth shouldn't match username with trailing garbage
        iscsi-target: fix extract_param to handle buffer length corner case
        iscsi-target: Expose default_erl as TPG attribute
        target_core_configfs: split up ALUA supported states
        target_core_alua: Make supported states configurable
        target_core_alua: Store supported ALUA states
        target_core_alua: Rename ALUA_ACCESS_STATE_OPTIMIZED
        target_core_alua: spellcheck
        target core: rename (ex,im)plict -> (ex,im)plicit
        percpu-refcount: Add percpu-refcount.o to obj-y
        iscsi-target: Do not reject non-immediate CmdSNs exceeding MaxCmdSN
        iscsi-target: Convert iscsi_session statistics to atomic_long_t
        target: Convert se_device statistics to atomic_long_t
        target: Fix delayed Task Aborted Status (TAS) handling bug
        iscsi-target: Reject unsupported multi PDU text command sequence
        ib_isert: Avoid duplicate iscsit_increment_maxcmdsn call
        iscsi-target: Fix mutex_trylock usage in iscsit_increment_maxcmdsn
        target: Core does not need blkdev.h
        target: Pass through I/O topology for block backstores
        iser-target: Avoid using FRMR for single dma entry requests
        ...
      b0e3636f
    • Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging · 0032cdef
      Linus Torvalds authored
      Pull hwmon fixes from Guenter Roeck:
       - acpi_power_meter: Fix return value check from call to
         acpi_bus_get_device
       - nct6775: Fix/improve NCT6791 support
       - lm75: Add support for GMT G751
      
      * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        hwmon: (acpi_power_meter) Fix acpi_bus_get_device() return value check
        hwmon: (nct6775) NCT6791 supports weight control only for CPUFAN
        hwmon: (nct6775) Monitor additional temperature registers
        hwmon: (lm75) Add support for GMT G751 chip
      0032cdef
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · d2c2ad54
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix memory leaks and other issues in mwifiex driver, from Amitkumar
          Karwar.
      
       2) skb_segment() can choke on packets using frag lists, fix from
          Herbert Xu with help from Eric Dumazet and others.
      
       3) IPv4 output cached route instantiation properly handles races
          involving two threads trying to install the same route, but we
          forgot to propagate this logic to input routes as well.  Fix from
          Alexei Starovoitov.
      
       4) Put protections in place to make sure that recvmsg() paths never
          accidentally copy uninitialized memory back into userspace and also
          make sure that we never try to use more than sockaddr_storage for
          building the on-kernel-stack copy of a sockaddr.  Fixes from Hannes
          Frederic Sowa.
      
       5) R8152 driver transmit flow bug fixes from Hayes Wang.
      
       6) Fix some minor fallouts from genetlink changes, from Johannes Berg
          and Michael Opdenacker.
      
       7) AF_PACKET sendmsg path can race with netdevice unregister notifier,
          fix by using RCU to make sure the network device doesn't go away
          from under us.  Fix from Daniel Borkmann.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
        gso: handle new frag_list of frags GRO packets
        genetlink: fix genl_set_err() group ID
        genetlink: fix genlmsg_multicast() bug
        packet: fix use after free race in send path when dev is released
        xen-netback: stop the VIF thread before unbinding IRQs
        wimax: remove dead code
        net/phy: Add the autocross feature for forced links on VSC82x4
        net/phy: Add VSC8662 support
        net/phy: Add VSC8574 support
        net/phy: Add VSC8234 support
        net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage)
        net: rework recvmsg handler msg_name and msg_namelen logic
        bridge: flush br's address entry in fdb when remove the
        net: core: Always propagate flag changes to interfaces
        ipv4: fix race in concurrent ip_route_input_slow()
        r8152: fix incorrect type in assignment
        r8152: support stopping/waking tx queue
        r8152: modify the tx flow
        r8152: fix tx/rx memory overflow
        netfilter: ebt_ip6: fix source and destination matching
        ...
      d2c2ad54
    • Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm · 7fa850ab
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
       "Some small fixes for this merge window, most of them quite
        self-explanatory - the biggest thing here is a fix for the ARMv7 LPAE
        suspend/resume support"
      
      * 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
        ARM: 7894/1: kconfig: select GENERIC_CLOCKEVENTS if HAVE_ARM_ARCH_TIMER
        ARM: 7893/1: bitops: only emit .arch_extension mp if CONFIG_SMP
        ARM: 7892/1: Fix warning for V7M builds
        ARM: 7888/1: seccomp: not compatible with ARM OABI
        ARM: 7886/1: make OABI default to off
        ARM: 7885/1: Save/Restore 64-bit TTBR registers on LPAE suspend/resume
        ARM: 7884/1: mm: Fix ECC mem policy printk
        ARM: 7883/1: fix mov to mvn conversion in case of 64 bit phys_addr_t and BE
        ARM: 7882/1: mm: fix __phys_to_virt to work with 64 bit phys_addr_t in BE case
        ARM: 7881/1: __fixup_smp read of SCU config should do byteswap in BE case
        ARM: Fix nommu.c build warning
      7fa850ab
    • Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm · c874e6fc
      Linus Torvalds authored
      Pull KVM fixes from Gleb Natapov.
      
      * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: kvm_clear_guest_page(): fix empty_zero_page usage
        kvm: mmu: delay mmu audit activation
        arm/arm64: KVM: Fix hyp mappings of vmalloc regions
      c874e6fc
    • Merge git://git.kvack.org/~bcrl/aio-next · d0f278c1
      Linus Torvalds authored
      Pull aio fixes from Benjamin LaHaise.
      
      * git://git.kvack.org/~bcrl/aio-next:
        aio: nullify aio->ring_pages after freeing it
        aio: prevent double free in ioctx_alloc
        aio: Fix a trinity splat
      d0f278c1
    • Merge branch 'for-3.13' of git://linux-nfs.org/~bfields/linux · 533db9b3
      Linus Torvalds authored
      Pull nfsd bugfixes from Bruce Fields:
       "A couple nfsd bugfixes"
      
      * 'for-3.13' of git://linux-nfs.org/~bfields/linux:
        nfsd4: fix xdr decoding of large non-write compounds
        nfsd: make sure to balance get/put_write_access
        nfsd: split up nfsd_setattr
      533db9b3
    • Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes · c85e0727
      Linus Torvalds authored
      Pull GFS2 fixes from Steven Whitehouse:
       "A couple of small, but important bug fixes for GFS2.  The first one
        fixes a possible NULL pointer dereference, and the second one resolves
        a reference counting issue in one of the lesser used paths through
        atomic_open"
      
      * tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
        GFS2: Fix ref count bug relating to atomic_open
        GFS2: fix potential NULL pointer dereference
      c85e0727
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · fb0d1eb8
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "Almost all of these are bug fixes.  Dave Sterba's documentation update
        is the big exception because he removed our promises to set any
        machine running Btrfs on fire"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Documentation: filesystems: update btrfs tools section
        Documentation: filesystems: add new btrfs mount options
        btrfs: update kconfig help text
        btrfs: fix bio_size_ok() for max_sectors > 0xffff
        btrfs: Use trace condition for get_extent tracepoint
        btrfs: fix typo in the log message
        Btrfs: fix list delete warning when removing ordered root from the list
        Btrfs: print bytenr instead of page pointer in check-int
        Btrfs: remove dead codes from ctree.h
        Btrfs: don't wait for ordered data outside desired range
        Btrfs: fix lockdep error in async commit
        Btrfs: avoid heavy operations in btrfs_commit_super
        Btrfs: fix __btrfs_start_workers retval
        Btrfs: disable online raid-repair on ro mounts
        Btrfs: do not inc uncorrectable_errors counter on ro scrubs
        Btrfs: only drop modified extents if we logged the whole inode
        Btrfs: make sure to copy everything if we rename
        Btrfs: don't BUG_ON() if we get an error walking backrefs
      fb0d1eb8
    • Merge tag 'xfs-for-linus-v3.13-rc1-2' of git://oss.sgi.com/xfs/xfs · 6ea9786e
      Linus Torvalds authored
      Pull second xfs update from Ben Myers:
       "There are a couple of patches that I wasn't quite sure about in time
        for our initial 3.13 pull request, a bugfix, and an update to add Dave
        to MAINTAINERS:
      
        Here we have a performance fix for inode iversion, increased inode
        cluster size for v5 superblock filesystems, a fix for error handling
        in xfs_bmap_add_attrfork, and a MAINTAINERS update to add Dave"
      
      * tag 'xfs-for-linus-v3.13-rc1-2' of git://oss.sgi.com/xfs/xfs:
        xfs: open code inc_inode_iversion when logging an inode
        xfs: increase inode cluster size for v5 filesystems
        xfs: fix unlock in xfs_bmap_add_attrfork
        xfs: update maintainers
      6ea9786e
    • Merge branch 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux · 24f971ab
      Linus Torvalds authored
      Pull SLAB changes from Pekka Enberg:
       "The patches from Joonsoo Kim switch mm/slab.c to use 'struct page' for
        slab internals similar to mm/slub.c.  This reduces memory usage and
        improves performance:
      
          https://lkml.org/lkml/2013/10/16/155
      
        Rest of the changes are bug fixes from various people"
      
      * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (21 commits)
        mm, slub: fix the typo in mm/slub.c
        mm, slub: fix the typo in include/linux/slub_def.h
        slub: Handle NULL parameter in kmem_cache_flags
        slab: replace non-existing 'struct freelist *' with 'void *'
        slab: fix to calm down kmemleak warning
        slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabled
        slab: rename slab_bufctl to slab_freelist
        slab: remove useless statement for checking pfmemalloc
        slab: use struct page for slab management
        slab: replace free and inuse in struct slab with newly introduced active
        slab: remove SLAB_LIMIT
        slab: remove kmem_bufctl_t
        slab: change the management method of free objects of the slab
        slab: use __GFP_COMP flag for allocating slab pages
        slab: use well-defined macro, virt_to_slab()
        slab: overloading the RCU head over the LRU for RCU free
        slab: remove cachep in struct slab_rcu
        slab: remove nodeid in struct slab
        slab: remove colouroff in struct slab
        slab: change return type of kmem_getpages() to struct page
        ...
      24f971ab