1. 01 Apr, 2013 17 commits
    • workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity · d55262c4
      Tejun Heo authored
      
      Unbound workqueues are now NUMA aware.  Let's add some control knobs
      and update sysfs interface accordingly.
      
      * Add kernel param workqueue.numa_disable which disables NUMA affinity
        globally.
      
      * Replace sysfs file "pool_id" with "pool_ids" which contain
        node:pool_id pairs.  This change is userland-visible but "pool_id"
        hasn't seen a release yet, so this is okay.
      
      * Add a new sysfs file "numa" which can toggle NUMA affinity on
        individual workqueues.  This is implemented as attrs->no_numa which
        is special in that it isn't part of a pool's attributes.  It only
        affects how apply_workqueue_attrs() picks which pools to use.
      
      After "pool_ids" change, first_pwq() doesn't have any user left.
      Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      d55262c4
    • workqueue: implement NUMA affinity for unbound workqueues · 4c16bd32
      Tejun Heo authored
      Currently, an unbound workqueue has a single current, or first, pwq
      (pool_workqueue) to which all new work items are queued.  This often
      isn't optimal on NUMA machines as workers may jump around across node
      boundaries and work items get assigned to workers without any regard
      to NUMA affinity.
      
      This patch implements NUMA affinity for unbound workqueues.  Instead
      of mapping all entries of numa_pwq_tbl[] to the same pwq,
      apply_workqueue_attrs() now creates a separate pwq covering the
      intersecting CPUs for each NUMA node which has online CPUs in
      @attrs->cpumask.  Nodes which don't have intersecting possible CPUs
      are mapped to pwqs covering whole @attrs->cpumask.
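
The per-node mask selection described above can be sketched in user-space C. This is a minimal illustration, not the kernel code: `cpumask_t` is modeled as a 64-bit word and `node_pwq_cpumask()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_t;

/*
 * Pick the CPU mask that the pwq for one node should cover.
 * - If the node has online CPUs inside the requested mask, use the
 *   CPUs that the node and the requested mask have in common.
 * - Otherwise fall back to the whole requested mask (the default pwq).
 */
static cpumask_t node_pwq_cpumask(cpumask_t attrs_mask,
                                  cpumask_t node_possible,
                                  cpumask_t online)
{
    cpumask_t want = attrs_mask & node_possible;

    assert(attrs_mask != 0);    /* an empty request makes no sense here */
    if ((want & online) == 0)
        return attrs_mask;      /* no usable CPU on this node */
    return want;
}
```

For example, with a requested mask of CPUs 0-3 and a node owning CPUs 0-1, the node's pwq covers CPUs 0-1; a node owning only CPUs 4-5 falls back to the full CPUs 0-3 mask.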
      
      As CPUs come up and go down, the pool association is changed
      accordingly.  Changing pool association may involve allocating new
      pools which may fail.  To avoid failing CPU_DOWN, each workqueue
      always keeps a default pwq which covers whole attrs->cpumask which is
      used as fallback if pool creation fails during a CPU hotplug
      operation.
      
      This ensures that all work items issued on a NUMA node are executed on
      the same node as long as the workqueue allows execution on the CPUs of
      the node.
      
      As this maps a workqueue to multiple pwqs and max_active is per-pwq,
      this changes the behavior of max_active.  The limit is now per NUMA
      node instead of global.  While this is an actual change, max_active is
      already per-cpu for per-cpu workqueues and primarily used as safety
      mechanism rather than for active concurrency control.  Concurrency is
      usually limited from workqueue users by the number of concurrently
      active work items and this change shouldn't matter much.
      
      v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
          by Lai.
      
      v3: The previous version incorrectly made a workqueue spanning
          multiple nodes spread work items over all online CPUs when some of
          its nodes don't have any desired cpus.  Reimplemented so that NUMA
          affinity is properly updated as CPUs go up and down.  This problem
          was spotted by Lai Jiangshan.
      
      v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
          however, wq may be freed at any time after dfl_pwq is put making
          the clearing use-after-free.  Clear wq->dfl_pwq before putting it.
      
      v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
          @pwq_tbl after success.  Fixed.
      
          Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
          application of new attrs is excluded via CPU hotplug.  Removed.
      
          Documentation on CPU affinity guarantee on CPU_DOWN added.
      
          All changes are suggested by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      4c16bd32
    • workqueue: introduce put_pwq_unlocked() · dce90d47
      Tejun Heo authored
      Factor out lock pool, put_pwq(), unlock sequence into
      put_pwq_unlocked().  The two existing places are converted and there
      will be more with NUMA affinity support.
      
      This is to prepare for NUMA affinity support for unbound workqueues
      and doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      dce90d47
    • workqueue: introduce numa_pwq_tbl_install() · 1befcf30
      Tejun Heo authored
      Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
      from apply_workqueue_attrs() into numa_pwq_tbl_install().  link_pwq()
      is made safe to call multiple times.  numa_pwq_tbl_install() links the
      pwq, installs it into numa_pwq_tbl[] at the specified node and returns
      the old entry.
      
      @last_pwq is removed from link_pwq() as the return value of the new
      function can be used instead.
      
      This is to prepare for NUMA affinity support for unbound workqueues.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      1befcf30
    • workqueue: use NUMA-aware allocation for pool_workqueues · e50aba9a
      Tejun Heo authored
      Use kmem_cache_alloc_node() with @pool->node instead of
      kmem_cache_zalloc() when allocating a pool_workqueue so that it's
      allocated on the same node as the associated worker_pool.  As there's
      no kmem_cache_zalloc_node(), move zeroing to init_pwq().
      
      This was suggested by Lai Jiangshan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      e50aba9a
    • workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() · f147f29e
      Tejun Heo authored
      Break init_and_link_pwq() into init_pwq() and link_pwq() and move
      unbound-workqueue specific handling into apply_workqueue_attrs().
      Also, factor out unbound pool and pool_workqueue allocation into
      alloc_unbound_pwq().
      
      This reorganization is to prepare for NUMA affinity and doesn't
      introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      f147f29e
    • workqueue: map an unbound workqueues to multiple per-node pool_workqueues · df2d5ae4
      Tejun Heo authored
      Currently, an unbound workqueue has only one "current" pool_workqueue
      associated with it.  It may have multiple pool_workqueues but only the
      first pool_workqueue services new work items.  For NUMA affinity, we
      want to change this so that there are multiple current pool_workqueues
      serving different NUMA nodes.
      
      Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
      points to the pool_workqueue to use for each possible node.  This
      replaces first_pwq() in __queue_work() and workqueue_congested().
      
      numa_pwq_tbl[] is currently initialized to point to the same
      pool_workqueue as first_pwq() so this patch doesn't make any behavior
      changes.
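
A node-indexed dispatch table of this kind might look like the following user-space sketch. The struct and function names (`wq_sketch`, `wq_init_tbl()`, `pwq_by_node()`) and the fixed `NR_NODES` are assumptions made for illustration; the kernel sizes the table dynamically and protects it with RCU.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4      /* hypothetical node count */

struct pwq_sketch { int id; };

struct wq_sketch {
    struct pwq_sketch *numa_pwq_tbl[NR_NODES];   /* indexed by NUMA node */
};

/* Point every node at the same pwq, which keeps the old behavior. */
static void wq_init_tbl(struct wq_sketch *wq, struct pwq_sketch *first)
{
    for (int node = 0; node < NR_NODES; node++)
        wq->numa_pwq_tbl[node] = first;
}

/* Look up the pwq that should service work queued from a given node. */
static struct pwq_sketch *pwq_by_node(const struct wq_sketch *wq, int node)
{
    assert(node >= 0 && node < NR_NODES);
    return wq->numa_pwq_tbl[node];
}
```

Because every slot initially points to the same pwq, a lookup from any node returns the pwq that first_pwq() would have returned, which is why no behavior changes at this step.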
      
      v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
          may be called only with wq->mutex held.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      df2d5ae4
    • workqueue: move hot fields of workqueue_struct to the end · 2728fd2f
      Tejun Heo authored
      Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
      them to the cacheline.  These two fields are used in the work item
      issue path and thus hot.  The scheduled NUMA affinity support will add a
      dispatch table at the end of workqueue_struct, and relocating these two
      fields will let us touch only a single cacheline on hot paths.
      
      Note that wq->pwqs isn't moved although it currently is being used in
      the work item issue path for unbound workqueues.  The dispatch table
      mentioned above will replace its use in the issue path, so it will
      become cold once NUMA support is implemented.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      2728fd2f
    • workqueue: make workqueue->name[] fixed len · ecf6881f
      Tejun Heo authored
      Currently workqueue->name[] is of flexible length.  We want to use the
      flexible field for something more useful and there isn't much benefit
      in allowing arbitrary name length anyway.  Make it fixed len capping
      at 24 bytes.
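
As a rough illustration of the fixed-length name, the sketch below truncates anything past 23 characters plus the terminating NUL. `WQ_NAME_LEN`, `wq_name_sketch` and `wq_set_name()` are names chosen for this example, not necessarily those used in the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WQ_NAME_LEN 24   /* cap taken from the commit message */

struct wq_name_sketch {
    char name[WQ_NAME_LEN];   /* fixed length, always NUL-terminated */
};

/* Copy a name, silently truncating anything past WQ_NAME_LEN - 1 chars. */
static void wq_set_name(struct wq_name_sketch *wq, const char *name)
{
    snprintf(wq->name, sizeof(wq->name), "%s", name);
}
```

With the name no longer occupying the flexible tail of the struct, that tail is free to hold a variable-length array such as a per-node table.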
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      ecf6881f
    • workqueue: add workqueue->unbound_attrs · 6029a918
      Tejun Heo authored
      Currently, when exposing attrs of an unbound workqueue via sysfs, the
      workqueue_attrs of first_pwq() is used as that should equal the
      current state of the workqueue.
      
      The planned NUMA affinity support will make unbound workqueues make
      use of multiple pool_workqueues for different NUMA nodes and the above
      assumption will no longer hold.  Introduce workqueue->unbound_attrs
      which records the current attrs in effect and use it for sysfs instead
      of first_pwq()->attrs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      6029a918
    • workqueue: determine NUMA node of workers accourding to the allowed cpumask · f3f90ad4
      Tejun Heo authored
      When worker tasks are created using kthread_create_on_node(),
      currently only per-cpu ones have the matching NUMA node specified.
      All unbound workers are always created with NUMA_NO_NODE.
      
      Now that an unbound worker pool may have an arbitrary cpumask
      associated with it, this isn't optimal.  Add pool->node which is
      determined by the pool's cpumask.  If the pool's cpumask is contained
      inside a NUMA node proper, the pool is associated with that node, and
      all workers of the pool are created on that node.
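
The node selection rule can be sketched as a subset test over per-node CPU masks. This is a user-space illustration under the assumption of at most 64 CPUs; `pool_node()` and the `node_masks` array are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)

typedef uint64_t cpumask_t;   /* hypothetical: one bit per CPU */

/*
 * Return the node whose CPU set fully contains pool_mask, or
 * NUMA_NO_NODE if the mask is empty or spans more than one node.
 */
static int pool_node(cpumask_t pool_mask,
                     const cpumask_t *node_masks, int nr_nodes)
{
    assert(nr_nodes >= 0);
    if (pool_mask == 0)
        return NUMA_NO_NODE;
    for (int node = 0; node < nr_nodes; node++) {
        /* subset test: no bit of pool_mask lies outside the node */
        if ((pool_mask & ~node_masks[node]) == 0)
            return node;
    }
    return NUMA_NO_NODE;
}
```

A pool restricted to CPUs 0-1 on a machine whose node 0 owns CPUs 0-3 is associated with node 0, while a pool spanning CPUs 3-4 crosses the node boundary and stays at NUMA_NO_NODE.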
      
      This currently only makes a difference for unbound worker pools with
      a cpumask contained inside a single NUMA node, but this will serve as
      foundation for making all unbound pools NUMA-affine.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      f3f90ad4
    • workqueue: drop 'H' from kworker names of unbound worker pools · e3c916a4
      Tejun Heo authored
      Currently, all workqueue workers which have a negative nice value have
      'H' postfixed to their names.  This is necessary for per-cpu workers
      as they use the CPU number instead of pool->id to identify the pool
      and the 'H' postfix is the only thing distinguishing normal and
      highpri workers.
      
      As workers for unbound pools use pool->id, the 'H' postfix is purely
      informational.  TASK_COMM_LEN is 16 and after the static part and
      delimiters, there are only five characters left for the pool and
      worker IDs.  We're expecting to have more unbound pools with the
      scheduled NUMA awareness support.  Let's drop the non-essential 'H'
      postfix from unbound kworker name.
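
The length budget can be checked with a small sketch. It assumes the unbound name format is "kworker/u<pool>:<worker>", which is how such threads commonly appear but is not stated in the message above; `unbound_worker_name()` is a hypothetical helper. The static part "kworker/u", the ':' delimiter and the terminating NUL use 11 of the 16 bytes, leaving five for the two IDs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16   /* includes the terminating NUL */

/*
 * Format an unbound worker name into a TASK_COMM_LEN buffer.
 * Returns the length the full name would need; a value of
 * TASK_COMM_LEN or more means the name was truncated.
 */
static int unbound_worker_name(char comm[TASK_COMM_LEN],
                               int pool_id, int worker_id)
{
    /* assumed format, see the note above */
    return snprintf(comm, TASK_COMM_LEN, "kworker/u%d:%d",
                    pool_id, worker_id);
}
```

A two-digit pool ID with a three-digit worker ID uses exactly the five available characters; one more digit, or an extra 'H', would be cut off.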
      
      While at it, restructure kthread_create*() invocation to help future
      NUMA related changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      e3c916a4
    • workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] · bce90380
      Tejun Heo authored
      Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
      and wq_numa_possible_cpumask[] in preparation.  The former is the
      highest NUMA node ID + 1 and the latter is masks of possible CPUs for
      each NUMA node.
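
A user-space sketch of how these two values could be derived from a CPU-to-node map follows. The `cpu_to_node` array, the 64-CPU limit and both function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpumask_t;   /* hypothetical: one bit per CPU, max 64 */

/* Table length: highest NUMA node ID seen, plus one. */
static int numa_tbl_len(const int *cpu_to_node, int nr_cpus)
{
    int max_node = -1;

    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (cpu_to_node[cpu] > max_node)
            max_node = cpu_to_node[cpu];
    return max_node + 1;
}

/* Fill masks[node] with the possible CPUs that belong to each node. */
static void build_node_masks(const int *cpu_to_node, int nr_cpus,
                             cpumask_t *masks, int tbl_len)
{
    for (int node = 0; node < tbl_len; node++)
        masks[node] = 0;
    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        assert(cpu_to_node[cpu] >= 0 && cpu_to_node[cpu] < tbl_len);
        masks[cpu_to_node[cpu]] |= (cpumask_t)1 << cpu;
    }
}
```

Node IDs need not be contiguous: with CPUs on nodes 0 and 2 only, the table still has three entries and the entry for node 1 is simply an empty mask.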
      
      This patch only introduces these.  Future patches will make use of
      them.
      
      v2: NUMA initialization moved into wq_numa_init().  Also, the possible
          cpumask array is not created if there aren't multiple nodes on the
          system.  wq_numa_enabled bool added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      bce90380
    • workqueue: move pwq_pool_locking outside of get/put_unbound_pool() · a892cacc
      Tejun Heo authored
      The scheduled NUMA affinity support for unbound workqueues will need
      to walk the workqueues list and perform pool related operations on
      each workqueue.
      
      Move wq_pool_mutex locking out of get/put_unbound_pool() to their
      callers so that pool operations can be performed while walking the
      workqueues list, which is also protected by wq_pool_mutex.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      a892cacc
    • workqueue: fix memory leak in apply_workqueue_attrs() · 4862125b
      Tejun Heo authored
      apply_workqueue_attrs() wasn't freeing temp attrs variable @new_attrs
      in its success path.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      4862125b
    • workqueue: fix unbound workqueue attrs hashing / comparison · 13e2e556
      Tejun Heo authored
      29c91e99 ("workqueue: implement attribute-based unbound worker_pool
      management") implemented attrs based worker_pool matching.  It tried
      to avoid false negative when comparing cpumasks with custom hash
      function; unfortunately, the hash and comparison functions fail to
      ignore CPUs which are not possible.  It incorrectly assumed that
      bitmap_copy() skips leftover bits in the last word of bitmap and
      cpumask_equal() ignores impossible CPUs.
      
      This patch updates attrs->cpumask handling such that impossible CPUs
      are properly ignored.
      
      * Hash and copy functions no longer do anything special.  They expect
        their callers to clear impossible CPUs.
      
      * alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask
        instead of setting all bits and explicit cpumask_setall() for
        unbound_std_wq_attrs[] in init_workqueues() is dropped.
      
      * apply_workqueue_attrs() is now responsible for ignoring impossible
        CPUs.  It makes a copy of @attrs and clears impossible CPUs before
        doing anything else.
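
The division of labor above can be sketched as follows: the caller strips impossible CPUs once, after which comparison needs no special handling. The 64-bit `cpumask_t` and both helper names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t;   /* hypothetical: one bit per CPU */

/* Drop CPUs that can never exist so stray bits cannot affect matching. */
static cpumask_t clear_impossible(cpumask_t requested, cpumask_t possible)
{
    return requested & possible;
}

/* Plain comparison; callers are expected to have sanitized both masks. */
static bool attrs_mask_equal(cpumask_t a, cpumask_t b)
{
    return a == b;
}
```

On a four-CPU system, an all-ones mask and an explicit CPUs 0-3 mask describe the same set of CPUs; they only compare equal once the impossible bits have been cleared.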
      Signed-off-by: Tejun Heo <tj@kernel.org>
      13e2e556
    • workqueue: fix race condition in unbound workqueue free path · bc0caf09
      Tejun Heo authored
      8864b4e5 ("workqueue: implement get/put_pwq()") implemented pwq
      (pool_workqueue) refcnting which frees workqueue when the last pwq
      goes away.  It determined whether it was the last pwq by testing
      whether wq->pwqs is empty.  Unfortunately, the test was done outside
      wq->mutex and multiple pwq releases could race and try to free wq
      multiple times, leading to an oops.
      
      Test wq->pwqs emptiness while holding wq->mutex.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bc0caf09
  2. 25 Mar, 2013 6 commits
    • workqueue: remove pwq_lock which is no longer used · b5927605
      Lai Jiangshan authored
      To simplify locking, the previous patches expanded wq->mutex to
      protect all fields of each workqueue instance including the pwqs list
      leaving pwq_lock without any user.  Remove the unused pwq_lock.
      
      tj: Rebased on top of the current dev branch.  Updated description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b5927605
    • workqueue: protect wq->saved_max_active with wq->mutex · a357fc03
      Lai Jiangshan authored
      We're expanding wq->mutex to cover all fields specific to each
      workqueue with the end goal of replacing pwq_lock which will make
      locking simpler and easier to understand.
      
      This patch makes wq->saved_max_active protected by wq->mutex instead
      of pwq_lock.  As pwq_lock locking around pwq_adjust_max_active() is no
      longer necessary, this patch also replaces the pwq_lock locking of
      for_each_pwq() around pwq_adjust_max_active() with wq->mutex.
      
      tj: Rebased on top of the current dev branch.  Updated description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a357fc03
    • workqueue: protect wq->pwqs and iteration with wq->mutex · b09f4fd3
      Lai Jiangshan authored
      We're expanding wq->mutex to cover all fields specific to each
      workqueue with the end goal of replacing pwq_lock which will make
      locking simpler and easier to understand.
      
      init_and_link_pwq() and pwq_unbound_release_workfn() already grab
      wq->mutex when adding or removing a pwq from wq->pwqs list.  This
      patch makes it official that the list is wq->mutex protected for
      writes and updates readers accordingly.  Explicit IRQ toggles for
      sched-RCU read-locking in flush_workqueue_prep_pwqs() and
      drain_workqueues() are removed as the surrounding wq->mutex can
      provide sufficient synchronization.
      
      Also, assert_rcu_or_pwq_lock() is renamed to assert_rcu_or_wq_mutex()
      and checks for wq->mutex too.
      
      pwq_lock locking and assertion are not removed by this patch and a
      couple of for_each_pwq() iterations are still protected by it.
      They'll be removed by future patches.
      
      tj: Rebased on top of the current dev branch.  Updated description.
          Folded in assert_rcu_or_wq_mutex() renaming from a later patch
          along with associated comment updates.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b09f4fd3
    • workqueue: protect wq->nr_drainers and ->flags with wq->mutex · 87fc741e
      Lai Jiangshan authored
      We're expanding wq->mutex to cover all fields specific to each
      workqueue with the end goal of replacing pwq_lock which will make
      locking simpler and easier to understand.
      
      wq->nr_drainers and ->flags are specific to each workqueue.  Protect
      ->nr_drainers and ->flags with wq->mutex instead of wq_pool_mutex.
      
      tj: Rebased on top of the current dev branch.  Updated description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      87fc741e
    • workqueue: rename wq->flush_mutex to wq->mutex · 3c25a55d
      Lai Jiangshan authored
      Currently wq->flush_mutex protects many fields of a workqueue
      including, especially, the pwqs list.  We're going to expand this
      mutex to protect most of a workqueue and eventually replace pwq_lock,
      which will make locking simpler and easier to understand.
      
      Drop the "flush_" prefix in preparation.
      
      This patch is pure rename.
      
      tj: Rebased on top of the current dev branch.  Updated description.
          Use WQ: and WR: instead of Q: and QR: for synchronization labels.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3c25a55d
    • workqueue: rename wq_mutex to wq_pool_mutex · 68e13a67
      Lai Jiangshan authored
      wq->flush_mutex will be renamed to wq->mutex and cover all fields
      specific to each workqueue and eventually replace pwq_lock, which will
      make locking simpler and easier to understand.
      
      Rename wq_mutex to wq_pool_mutex to avoid confusion with wq->mutex.
      After the scheduled changes, wq_pool_mutex won't be protecting
      anything specific to each workqueue instance anyway.
      
      This patch is pure rename.
      
      tj: s/wqs_mutex/wq_pool_mutex/.  Rewrote description.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      68e13a67
  3. 20 Mar, 2013 5 commits
  4. 19 Mar, 2013 5 commits
    • workqueue: restore CPU affinity of unbound workers on CPU_ONLINE · 7dbc725e
      Tejun Heo authored
      With the recent addition of the custom attributes support, unbound
      pools may have allowed cpumask which isn't full.  As long as some of
      CPUs in the cpumask are online, its workers will maintain cpus_allowed
      as set on worker creation; however, once no online CPU is left in
      cpus_allowed, the scheduler will reset cpus_allowed of any workers
      which get scheduled so that they can execute.
      
      To remain compliant with the user-specified configuration, CPU affinity
      needs to be restored when a CPU comes online for an unbound pool
      which didn't have any online CPUs before.
      
      This patch implements restore_unbound_workers_cpumask(), which is
      called from CPU_ONLINE for all unbound pools, checks whether the
      coming up CPU is the first allowed online one, and, if so, invokes
      set_cpus_allowed_ptr() with the configured cpumask on all workers.
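
The "first allowed online CPU" check can be sketched with plain bit masks. This assumes the online mask already includes the CPU that is coming up; `is_first_allowed_online()` is a hypothetical name and `cpumask_t` a 64-bit stand-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t;   /* hypothetical: one bit per CPU */

/*
 * Return true if cpu is allowed by the pool and is the only CPU in
 * the pool's mask that is currently online, i.e. the pool had no
 * online CPU before this one came up.
 */
static bool is_first_allowed_online(cpumask_t pool_mask,
                                    cpumask_t online, int cpu)
{
    cpumask_t bit;

    assert(cpu >= 0 && cpu < 64);
    bit = (cpumask_t)1 << cpu;
    if (!(pool_mask & bit))
        return false;             /* this CPU is not part of the pool */
    return (pool_mask & online) == bit;
}
```

Only when the check succeeds does affinity need restoring; if another allowed CPU was already online, the workers never lost their configured mask.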
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      7dbc725e
    • workqueue: directly restore CPU affinity of workers from CPU_ONLINE · a9ab775b
      Tejun Heo authored
      Rebinding workers of a per-cpu pool after a CPU comes online involves
      a lot of back-and-forth mostly because only the task itself could
      adjust CPU affinity if PF_THREAD_BOUND was set.
      
      As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
      coerce the workers themselves to perform set_cpus_allowed_ptr().  Due
      to the various states a worker can be in, this led to three different
      paths by which a worker may be rebound.  worker->rebind_work is queued
      to busy workers.  Idle ones are signaled by unlinking worker->entry
      and then call idle_worker_rebind().  The manager isn't covered by either and
      implements its own mechanism.
      
      PF_THREAD_BOUND has been replaced with PF_NO_SETAFFINITY and CPU_ONLINE
      itself now can manipulate CPU affinity of workers.  This patch
      replaces the existing rebind mechanism with a direct one where
      CPU_ONLINE iterates over all workers using for_each_pool_worker(),
      restores CPU affinity, and clears WORKER_UNBOUND.
      
      There are a couple subtleties.  All bound idle workers should have
      their runqueues set to that of the bound CPU; however, if the target
      task isn't running, set_cpus_allowed_ptr() just updates the
      cpus_allowed mask deferring the actual migration to when the task
      wakes up.  This is worked around by waking up idle workers after
      restoring CPU affinity before any workers can become bound.
      
      Another subtlety stems from matching @pool->nr_running with the
      number of running unbound workers.  While DISASSOCIATED, all workers
      are unbound and nr_running is zero.  As workers become bound again,
      nr_running needs to be adjusted accordingly; however, there is no good
      way to tell whether a given worker is running without poking into
      scheduler internals.  Instead of clearing UNBOUND directly,
      rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
      REBOUND, which will later be cleared by the workers themselves while
      preparing for the next round of work item execution.  The only change
      needed for the workers is clearing REBOUND along with PREP.
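
The REBOUND hand-off can be sketched as below. The flag values, struct names and the simplified NOT_RUNNING set are assumptions for illustration; the kernel's flag set is larger and its accounting is tied to the scheduler.

```c
#include <assert.h>

/* Hypothetical flag values; the kernel's real bit assignments differ. */
enum {
    WORKER_PREP        = 1 << 0,   /* preparing to run work items */
    WORKER_UNBOUND     = 1 << 1,   /* not bound to its CPU */
    WORKER_REBOUND     = 1 << 2,   /* rebound, not yet counted as running */
    WORKER_NOT_RUNNING = WORKER_PREP | WORKER_UNBOUND | WORKER_REBOUND,
};

struct pool_sketch   { int nr_running; };
struct worker_sketch { unsigned int flags; };

/* Hotplug side: swap UNBOUND for REBOUND without touching nr_running. */
static void rebind_worker(struct worker_sketch *w)
{
    assert(w->flags & WORKER_UNBOUND);
    w->flags |= WORKER_REBOUND;
    w->flags &= ~(unsigned int)WORKER_UNBOUND;
}

/*
 * Worker side: clear flags and count the worker as running once its
 * last NOT_RUNNING flag is gone.
 */
static void worker_clr_flags(struct pool_sketch *pool,
                             struct worker_sketch *w, unsigned int flags)
{
    unsigned int old = w->flags;

    w->flags &= ~flags;
    if ((old & WORKER_NOT_RUNNING) && !(w->flags & WORKER_NOT_RUNNING))
        pool->nr_running++;
}
```

Because the worker itself clears REBOUND, the counter is only bumped by a task that is known to be running, so the hotplug path never has to guess another task's state.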
      
      * This patch leaves for_each_busy_worker() without any user.  Removed.
      
      * idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
        and rebind logic in manage_workers() removed.
      
      * worker_thread() now looks at WORKER_DIE instead of testing whether
        @worker->entry is empty to determine whether it needs to do
        something special as dying is the only special thing now.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      a9ab775b
    • workqueue: relocate rebind_workers() · bd7c089e
      Tejun Heo authored
      rebind_workers() will be reimplemented in a way which makes it mostly
      decoupled from the rest of worker management.  Move rebind_workers()
      so that it's located with other CPU hotplug related functions.
      
      This patch is pure function relocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      bd7c089e
    • workqueue: convert worker_pool->worker_ida to idr and implement for_each_pool_worker() · 822d8405
      Tejun Heo authored
      Make worker_ida an idr - worker_idr and use it to implement
      for_each_pool_worker() which will be used to simplify worker rebinding
      on CPU_ONLINE.
      
      pool->worker_idr is protected by both pool->manager_mutex and
      pool->lock so that it can be iterated while holding either lock.
      
      * create_worker() allocates ID without installing worker pointer and
        installs the pointer later using idr_replace().  This is because
        worker ID is needed when creating the actual task to name it and the
        new worker shouldn't be visible to iterations before fully
        initialized.
      
      * In destroy_worker(), ID removal is moved before kthread_stop().
        This is again to guarantee that only fully working workers are
        visible to for_each_pool_worker().
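
The reserve-then-publish ordering described in the two points above can be sketched with a small array-backed table standing in for the idr. All names here are hypothetical, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 8   /* hypothetical table size */

struct worker_sketch { int id; };

struct worker_table {
    struct worker_sketch *slot[MAX_WORKERS];
    unsigned char reserved[MAX_WORKERS];
};

/* Reserve the lowest free ID but leave the slot pointer NULL. */
static int table_reserve(struct worker_table *t)
{
    for (int id = 0; id < MAX_WORKERS; id++) {
        if (!t->reserved[id]) {
            t->reserved[id] = 1;
            t->slot[id] = NULL;
            return id;
        }
    }
    return -1;   /* table full */
}

/* Publish a fully initialized worker under an ID reserved earlier. */
static void table_install(struct worker_table *t, int id,
                          struct worker_sketch *w)
{
    assert(id >= 0 && id < MAX_WORKERS && t->reserved[id]);
    t->slot[id] = w;
}

/* Unpublish before teardown so iteration never sees a dying worker. */
static void table_remove(struct worker_table *t, int id)
{
    assert(id >= 0 && id < MAX_WORKERS);
    t->slot[id] = NULL;
    t->reserved[id] = 0;
}

/* Count the workers an iterator would visit; reserved-only IDs are skipped. */
static int table_count_visible(const struct worker_table *t)
{
    int n = 0;

    for (int id = 0; id < MAX_WORKERS; id++)
        if (t->slot[id])
            n++;
    return n;
}
```

Between reservation and installation the ID is taken, so it can be used to name the task, yet an iterator still sees no worker in that slot.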
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      822d8405
    • sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY · 14a40ffc
      Tejun Heo authored
      PF_THREAD_BOUND was originally used to mark kernel threads which were
      bound to a specific CPU using kthread_bind() and a task with the flag
      set allows cpus_allowed modifications only to itself.  Workqueue is
      currently abusing it to prevent userland from meddling with
      cpus_allowed of workqueue workers.
      
      What we need is a flag to prevent userland from messing with
      cpus_allowed of certain kernel tasks.  In kernel, anyone can
      (incorrectly) squash the flag, and, for worker-type usages,
      restricting cpus_allowed modification to the task itself doesn't
      provide meaningful extra protection as other tasks can inject work
      items to the task anyway.
      
      This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
      sched_setaffinity() checks the flag and returns -EINVAL if set.
      set_cpus_allowed_ptr() is no longer affected by the flag.
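
The split between the two paths can be sketched as follows. The flag value, the `task_sketch` struct and both function names are hypothetical; only the behavior (userland-facing path rejects with -EINVAL, in-kernel path ignores the flag) comes from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PF_NO_SETAFFINITY_SKETCH 0x1u   /* hypothetical bit value */

typedef uint64_t cpumask_t;             /* hypothetical: one bit per CPU */

struct task_sketch {
    unsigned int flags;
    cpumask_t cpus_allowed;
};

/* In-kernel path: the flag is not consulted here. */
static int kernel_set_cpus_allowed(struct task_sketch *t, cpumask_t mask)
{
    if (mask == 0)
        return -EINVAL;                 /* an empty mask is never valid */
    t->cpus_allowed = mask;
    return 0;
}

/* Userland-facing path: refuse to touch tasks that carry the flag. */
static int user_setaffinity(struct task_sketch *t, cpumask_t mask)
{
    if (t->flags & PF_NO_SETAFFINITY_SKETCH)
        return -EINVAL;
    return kernel_set_cpus_allowed(t, mask);
}
```

This is what lets a hotplug callback adjust a worker's affinity directly while userland requests against the same task keep failing.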
      
      This will allow simplifying workqueue worker CPU affinity management.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      14a40ffc
  5. 14 Mar, 2013 7 commits
    • workqueue: rename workqueue_lock to wq_mayday_lock · 2e109a28
      Tejun Heo authored
      With the recent locking updates, the only thing protected by
      workqueue_lock is workqueue->maydays list.  Rename workqueue_lock to
      wq_mayday_lock.
      
      This patch is pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      2e109a28
    • workqueue: separate out pool_workqueue locking into pwq_lock · 794b18bc
      Tejun Heo authored
      This patch continues locking cleanup from the previous patch.  It
      breaks out pool_workqueue synchronization from workqueue_lock into a
      new spinlock - pwq_lock.  The following are protected by pwq_lock.
      
      * workqueue->pwqs
      * workqueue->saved_max_active
      
      The conversion is straight-forward.  workqueue_lock usages which cover
      the above two are converted to pwq_lock.  New locking label PW added
      for things protected by pwq_lock and FR is updated to mean flush_mutex
      + pwq_lock + sched-RCU.
      
      This patch shouldn't introduce any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      794b18bc
    • workqueue: separate out pool and workqueue locking into wq_mutex · 5bcab335
      Tejun Heo authored
      Currently, workqueue_lock protects most shared workqueue resources -
      the pools, workqueues, pool_workqueues, draining, ID assignments,
      mayday handling and so on.  The coverage has grown organically and
      there is no identified bottleneck coming from workqueue_lock, but it
      has grown a bit too much and scheduled rebinding changes need the
      pools and workqueues to be protected by a mutex instead of a spinlock.
      
      This patch breaks out pool and workqueue synchronization from
      workqueue_lock into a new mutex - wq_mutex.  The following are
      protected by wq_mutex.
      
      * worker_pool_idr and unbound_pool_hash
      * pool->refcnt
      * workqueues list
      * workqueue->flags, ->nr_drainers
      
      Most changes are straightforward.  workqueue_lock is replaced
      with wq_mutex where applicable and workqueue_lock lock/unlocks are
      added where wq_mutex conversion leaves data structures not protected
      by wq_mutex without locking.  irq / preemption flippings were added
      where the conversion affects them.  Things worth noting are
      
      * New WQ and WR locking labels added along with
        assert_rcu_or_wq_mutex().
      
      * worker_pool_assign_id() now expects to be called under wq_mutex.
      
      * create_mutex is removed from get_unbound_pool().  It now just holds
        wq_mutex.
      
      This patch shouldn't introduce any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5bcab335
    • workqueue: relocate global variable defs and function decls in workqueue.c · 7d19c5ce
      Tejun Heo authored
      They're split across debugobj code for some reason.  Collect them.
      
      This patch is pure relocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      7d19c5ce
    • workqueue: better define locking rules around worker creation / destruction · cd549687
      Tejun Heo authored
      When a manager creates or destroys workers, the operations are always
      done with the manager_mutex held; however, initial worker creation or
      worker destruction during pool release don't grab the mutex.  They are
      still correct as initial worker creation doesn't require
      synchronization and grabbing manager_arb provides enough exclusion for
      pool release path.
      
      Still, let's make everyone follow the same rules for consistency and
      such that lockdep annotations can be added.
      
      Update create_and_start_worker() and put_unbound_pool() to grab
      manager_mutex around thread creation and destruction respectively and
      add lockdep assertions to create_worker() and destroy_worker().
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      cd549687
    • workqueue: factor out initial worker creation into create_and_start_worker() · ebf44d16
      Tejun Heo authored
      get_unbound_pool(), workqueue_cpu_up_callback() and init_workqueues()
      have similar code pieces to create and start the initial worker.  Factor
      those out into create_and_start_worker().
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ebf44d16
    • workqueue: rename worker_pool->assoc_mutex to ->manager_mutex · bc3a1afc
      Tejun Heo authored
      Manager operations are currently governed by two mutexes -
      pool->manager_arb and ->assoc_mutex.  The former is used to decide who
      gets to be the manager and the latter to exclude the actual manager
      operations including creation and destruction of workers.  Anyone who
      grabs ->manager_arb must perform manager role; otherwise, the pool
      might stall.
      
      Grabbing ->assoc_mutex blocks everyone else from performing manager
      operations but doesn't require the holder to perform manager duties as
      it's merely blocking manager operations without becoming the manager.
      
      Because the blocking was necessary when [dis]associating per-cpu
      workqueues during CPU hotplug events, the latter was named
      assoc_mutex.  The mutex is scheduled to be used for other purposes, so
      this patch gives it a more fitting generic name - manager_mutex - and
      updates / adds comments to explain synchronization around the manager
      role and operations.
      
      This patch is pure rename / doc update.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bc3a1afc