1. 08 Aug, 2023 18 commits
    • workqueue: Generalize unbound CPU pods · 84193c07
      Tejun Heo authored
      While renamed to pod, the code still assumes that the pods are defined by
      NUMA boundaries. Let's generalize it:
      
      * workqueue_attrs->affn_scope is added. Each enum represents the type of
        boundaries that define the pods. There are currently two scopes -
        WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
        - one pod per NUMA node. The latter defines one global pod across the
        whole system.
      
      * struct wq_pod_type is added which describes how pods are configured for
        each affinity scope. For each pod, it lists the member CPUs and the
        preferred NUMA node for memory allocations. The reverse mapping from CPU
        to pod is also available.
      
      * wq_pod_enabled is dropped. Pod is now always enabled. The previously
        disabled behavior is now implemented through WQ_AFFN_SYSTEM.
      
      * get_unbound_pool() wants to determine the NUMA node to allocate memory
        from for the new pool. The variables are renamed from node to pod but the
        logic still assumes they're one and the same. Clearly distinguish them -
        walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
        NUMA node.
      
      * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
        node. Take @cpu instead and determine the cpumask to use from the pod_type
        matching @attrs.
      
      * apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
        NULL so that it can indicate -EINVAL on invalid affinity scopes.
      
      This patch allows CPUs to be grouped into pods however desired per type.
      While this patch causes some internal behavior changes, nothing material
      should change for workqueue users.
      
      v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
          WQ_AFFN_NR_TYPES which indicates that the function is called with a
          worker_pool's attrs instead of a workqueue's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
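The pod description above can be pictured with a small userspace model. This is only a sketch, not the kernel's struct wq_pod_type: the `sketch_` names, the bitmask representation of CPU sets, the fixed 8-CPU limit and the use of -1 for "no preferred node" are all assumptions made for illustration.

```c
#include <stdint.h>

#define SKETCH_MAX_CPUS 8 /* hypothetical machine size for the sketch */

/* The two scopes named in the commit message; enum names are illustrative. */
enum sketch_affn_scope { SKETCH_AFFN_NUMA, SKETCH_AFFN_SYSTEM };

/* Simplified stand-in for struct wq_pod_type. Field names are assumptions. */
struct sketch_pod_type {
	int nr_pods;
	uint64_t pod_cpus[SKETCH_MAX_CPUS]; /* pod -> member CPUs as a bitmask */
	int pod_node[SKETCH_MAX_CPUS];      /* pod -> preferred NUMA node, -1 if none */
	int cpu_pod[SKETCH_MAX_CPUS];       /* cpu -> pod reverse mapping */
};

/*
 * Group nr_cpus CPUs into pods. cpu_node[cpu] gives each CPU's NUMA node;
 * node IDs are assumed dense and smaller than SKETCH_MAX_CPUS, and nr_cpus
 * is assumed not to exceed SKETCH_MAX_CPUS.
 */
void sketch_pod_type_init(struct sketch_pod_type *pt,
			  enum sketch_affn_scope scope,
			  const int *cpu_node, int nr_cpus)
{
	pt->nr_pods = 0;
	for (int i = 0; i < SKETCH_MAX_CPUS; i++) {
		pt->pod_cpus[i] = 0;
		pt->pod_node[i] = -1;
		pt->cpu_pod[i] = -1;
	}

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		/* NUMA scope: one pod per node. SYSTEM scope: a single pod 0. */
		int pod = (scope == SKETCH_AFFN_NUMA) ? cpu_node[cpu] : 0;

		pt->pod_cpus[pod] |= UINT64_C(1) << cpu;
		pt->cpu_pod[cpu] = pod;
		if (scope == SKETCH_AFFN_NUMA)
			pt->pod_node[pod] = cpu_node[cpu];
		if (pod + 1 > pt->nr_pods)
			pt->nr_pods = pod + 1;
	}
}
```

With a CPU-to-node table of {0, 0, 1, 1}, the NUMA scope yields two pods of two CPUs each, while the system scope yields one pod covering all four CPUs with no preferred node.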
    • workqueue: Factor out clearing of workqueue-only attrs fields · 5de7a03c
      Tejun Heo authored
      workqueue_attrs can be used for both workqueues and worker_pools. However,
      some fields, currently only ->ordered, only apply to workqueues and should
      be cleared to the default / invalid values.
      
      Currently, an unbound workqueue explicitly clears attrs->ordered in
      get_unbound_pool() after copying the source workqueue attrs, while per-cpu
      workqueues rely on the fact that zeroing on allocation gives us the desired
      default value for pool->attrs->ordered.
      
      This is fragile. Let's add wqattrs_clear_for_pool() which clears
      attrs->ordered and is called from both init_worker_pool() and
      get_unbound_pool(). This will ease adding more workqueue-only attrs fields.
      
      In get_unbound_pool(), pool->node initialization is moved upwards for
      readability. This shouldn't cause any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod() · 0f36ee24
      Tejun Heo authored
      For an unbound pool, multiple cpumasks are involved.
      
      U: The user-specified cpumask (may be filtered with cpu_possible_mask).
      
      A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
         leaves no CPU, wq_unbound_cpumask is used.
      
      P: Per-pod subsets of #A.
      
      wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
      wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.
      
      wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
      calculate the new #P for each workqueue, it needs to call
      wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
      wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
      wq->dfl_pwq->pool->attrs.
      
      This is rather fragile because we're calling wq_calc_pod_cpumask() with
      @attrs of a worker_pool rather than the workqueue's actual attrs when what
      we want to calculate is the workqueue's cpumask on the pod. While this works
      fine currently, future changes will add fields which are used differently
      between workqueues and worker_pools and this subtlety will bite us.
      
      This patch factors out #U -> #A calculation from apply_wqattrs_prepare()
      into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy
      wq->unbound_attrs and use the new helper to obtain #A freshly instead of
      abusing wq->dfl_pwq->pool->attrs.
      
      This shouldn't cause any behavior changes in the current code.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
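The #U -> #A -> #P chain described above can be sketched with plain bitmasks. This is a userspace illustration under stated assumptions, not the kernel helpers: the `sketch_` names and the 64-bit mask type are hypothetical, and the fallback to #A when a pod has no usable CPU is an assumption that the commit message does not spell out.

```c
#include <stdint.h>

typedef uint64_t sketch_cpumask_t; /* one bit per CPU; hypothetical type */

/*
 * #U -> #A: filter the user-specified mask by the unbound cpumask.
 * If the filtering leaves no CPU, the unbound cpumask itself is used.
 */
sketch_cpumask_t sketch_actualize_cpumask(sketch_cpumask_t user,
					  sketch_cpumask_t unbound)
{
	sketch_cpumask_t actual = user & unbound;

	return actual ? actual : unbound;
}

/*
 * #A -> #P: restrict the actual mask to one pod's CPUs.
 * Falling back to #A for an empty intersection is an assumption here.
 */
sketch_cpumask_t sketch_calc_pod_cpumask(sketch_cpumask_t actual,
					 sketch_cpumask_t pod_cpus)
{
	sketch_cpumask_t pod = actual & pod_cpus;

	return pod ? pod : actual;
}
```

Recomputing #A from #U on each update, as the patch does, avoids treating a worker_pool's stored mask as if it were the workqueue's own attributes.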
    • workqueue: Initialize unbound CPU pods later in the boot · 2930155b
      Tejun Heo authored
      During boot, to initialize unbound CPU pods, wq_pod_init() was called from
      workqueue_init(). This is early enough for NUMA nodes to be set up but
      before SMP is brought up and CPU topology information is populated.
      
      Workqueue is in the process of improving CPU locality for unbound workqueues
      and will need access to topology information during pod init. This adds a
      new init function workqueue_init_topology() which is called after CPU
      topology information is available and replaces wq_pod_init().
      
      As unbound CPU pods are now initialized after workqueues are activated, we
      need to revisit the workqueues to apply the pod configuration. Workqueues
      which are created before workqueue_init_topology() are set up so that they
      always use the default worker pool. After pods are set up in
      workqueue_init_topology(), wq_update_pod() is called on all existing
      workqueues to update the pool associations accordingly.
      
      Note that wq_update_pod_attrs_buf allocation is moved to
      workqueue_init_early(). This isn't necessary right now but enables further
      generalization of pod handling in the future.
      
      This patch changes the initialization sequence but the end result should be
      the same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Move wq_pod_init() below workqueue_init() · a86feae6
      Tejun Heo authored
      wq_pod_init() is called from workqueue_init() and responsible for
      initializing unbound CPU pods according to NUMA node. Workqueue is in the
      process of improving affinity awareness and wants to use other topology
      information to initialize unbound CPU pods; however, unlike NUMA nodes,
      other topology information isn't yet available in workqueue_init().
      
      The next patch will introduce a later stage init function for workqueue
      which will be responsible for initializing unbound CPU pods. Relocate
      wq_pod_init() below workqueue_init() where the new init function is going to
      be located so that the diff can show the content differences.
      
      Just a relocation. No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Rename NUMA related names to use pod instead · fef59c9c
      Tejun Heo authored
      Workqueue is in the process of improving CPU affinity awareness. It will
      become more flexible and won't be tied to NUMA node boundaries. This patch
      renames all NUMA related names in workqueue.c to use "pod" instead.
      
      While "pod" isn't a very common term, it's short and captures the grouping of
      CPUs well enough. These names are only going to be used within workqueue
      implementation proper, so the specific naming doesn't matter that much.
      
      * wq_numa_possible_cpumask -> wq_pod_cpus
      
      * wq_numa_enabled -> wq_pod_enabled
      
      * wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf
      
      * workqueue_select_cpu_near -> select_numa_node_cpu
      
        This rename is different from others. The function is only used by
        queue_work_node() and specifically tries to find a CPU in the specified
        NUMA node. As workqueue affinity will become more flexible and untied from
        NUMA, this function's name should specifically describe that it's for
        NUMA.
      
      * wq_calc_node_cpumask -> wq_calc_pod_cpumask
      
      * wq_update_unbound_numa -> wq_update_pod
      
      * wq_numa_init -> wq_pod_init
      
      * node -> pod in local variables
      
      Only renames. No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Rename workqueue_attrs->no_numa to ->ordered · af73f5c9
      Tejun Heo authored
      With the recent removal of NUMA related module param and sysfs knob,
      workqueue_attrs->no_numa is now only used to implement ordered workqueues.
      Let's rename the field so that it's less confusing especially with the
      planned CPU affinity awareness improvements.
      
      Just a rename. No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Make unbound workqueues to use per-cpu pool_workqueues · 636b927e
      Tejun Heo authored
      A pwq (pool_workqueue) represents an association between a workqueue and a
      worker_pool. When a work item is queued, the workqueue selects the pwq to
      use, which in turn determines the pool, and queues the work item to the pool
      through the pwq. pwq is also what implements the maximum concurrency limit -
      @max_active.
      
      As a per-cpu workqueue should be associated with a different worker_pool on
      each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
      However, unbound workqueues were sharing a pwq within each NUMA node by
      default. The sharing has several downsides:
      
      * Because @max_active is per-pwq, the meaning of @max_active changes
        depending on the machine configuration and whether workqueue NUMA locality
        support is enabled.
      
      * Makes per-cpu and unbound code deviate.
      
      * Gets in the way of making workqueue CPU locality awareness more flexible.
      
      This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
      workqueues do by making the following changes:
      
      * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
        just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound
        workqueues.
      
      * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
        the specified pwq to the target CPU's wq->cpu_pwq.
      
      * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
        unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
        This makes the return value of wq_calc_node_cpumask() unnecessary. It now
        returns void.
      
      * @max_active now means the same thing for both per-cpu and unbound
        workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
        documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
        used in workqueue implementation and will be removed later.
      
      * All unbound pwq operations which used to be per-numa-node are now per-cpu.
      
      For most unbound workqueue users, this shouldn't cause noticeable changes.
      Work item issue and completion will be a small bit faster, flush_workqueue()
      would become a bit more expensive, and the total concurrency limit would
      likely become higher. All @max_active==1 use cases are currently being
      audited for conversion into alloc_ordered_workqueue() and they shouldn't be
      affected once the audit and conversion is complete.
      
      One area where the behavior change may be more noticeable is
      workqueue_congested() as the reported congestion state is now per CPU
      instead of NUMA node. There are only two users of this interface -
      drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
      cc'd. Inputs on the behavior change would be very much appreciated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: Karsten Graul <kgraul@linux.ibm.com>
      Cc: Wenjia Zhang <wenjia@linux.ibm.com>
      Cc: Jan Karcher <jaka@linux.ibm.com>
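Because @max_active is enforced per pwq, the effect of moving from per-node to per-CPU pwq's on the total concurrency limit is simple arithmetic. The helper below is a hypothetical illustration, not a kernel function, and it only gives an upper bound that ignores every other constraint on concurrency.

```c
/*
 * Upper bound on simultaneously active work items for one unbound
 * workqueue: each pwq independently admits up to max_active items.
 * Hypothetical helper for illustration only.
 */
int sketch_total_active_limit(int nr_pwqs, int max_active)
{
	return nr_pwqs * max_active;
}
```

On an assumed machine with 2 NUMA nodes of 8 CPUs each and @max_active of 16, sharing one pwq per node bounds the total at 2 * 16 = 32, while one pwq per CPU bounds it at 16 * 16 = 256. An ordered workqueue keeps a single pwq (wq->dfl_pwq) for all CPUs, so its bound does not change.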
    • workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug · 4cbfd3de
      Tejun Heo authored
      When a CPU went online or offline, wq_update_unbound_numa() was called only
      on the CPU which was going up or down. This works fine because all CPUs on
      the same NUMA node share the same pool_workqueue slot - one CPU updating it
      updates it for everyone in the node.
      
      However, future changes will make each CPU use a separate pool_workqueue
      even when they're sharing the same worker_pool, which requires updating
      pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
      on hotplug.
      
      To accommodate the planned changes, this patch updates
      workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
      all CPUs sharing the same NUMA node as the CPU going up or down. In the
      current code, the second+ calls would be noops and there shouldn't be any
      behavior changes.
      
      * As wq_update_unbound_numa() is now called on multiple CPUs per each
        hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
        is added. The former indicates the CPU being hot[un]plugged and the latter
        the CPU whose pool_workqueue is being updated.
      
      * In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
        with the new @hotplug_cpu.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones · 687a9aa5
      Tejun Heo authored
      Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
      through a per-cpu allocation and thus, unlike unbound workqueues, not
      reference counted. This difference in lifetime management between the two
      types is a bit confusing.
      
      Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
      isn't suitable for the planned CPU locality related improvements. The plan
      is to unify pwq handling across per-cpu and unbound workqueues so that
      they're always accessed through wq->cpu_pwq.
      
      In preparation, this patch allocates, reference counts and releases per-cpu
      pwq's the same way as unbound pwq's. wq->cpu_pwq now holds
      pointers to pwq's instead of containing them directly.
      
      pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
      also used for per-cpu work items.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Use a kthread_worker to release pool_workqueues · 967b494e
      Tejun Heo authored
      pool_workqueue release path is currently bounced to system_wq; however, this
      is a bit tricky because this bouncing occurs while holding a pool lock and
      thus risks causing an A-A deadlock. This is currently addressed by the
      fact that only unbound workqueues use this bouncing path and system_wq is a
      per-cpu workqueue.
      
      While this works, it's brittle and requires a work-around like setting the
      lockdep subclass for the lock of unbound pools. Besides, future changes will
      use the bouncing path for per-cpu workqueues too making the current approach
      unusable.
      
      Let's just use a dedicated kthread_worker to untangle the dependency. This
      is just one more kthread for all workqueues and makes the pwq release logic
      simpler and more robust.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa · fcecfa8f
      Tejun Heo authored
      Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
      specific knobs won't make sense anymore. Remove them. Also, the pool_ids
      knob was used for debugging and not really meaningful given that there is no
      visibility into the pools associated with those IDs. Remove it too. A future
      patch will improve overall visibility.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Relocate worker and work management functions · 797e8345
      Tejun Heo authored
      Collect first_idle_worker(), worker_enter/leave_idle(),
      find_worker_executing_work(), move_linked_works() and wake_up_worker() into
      one place. These functions will later be used to implement higher level
      worker management logic.
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq · ee1ceef7
      Tejun Heo authored
      wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue.
      The field name being plural is unusual and confusing. Rename it to singular.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Not all work insertion needs to wake up a worker · fe089f87
      Tejun Heo authored
      insert_work() always tried to wake up a worker; however, the only time it
      needs to try to wake up a worker is when a new active work item is queued.
      When a work item goes on the inactive list or a flush work item is queued,
      there's no reason to try to wake up a worker.
      
      This patch moves the worker wakeup logic out of insert_work() and places it
      in the active new work item queueing path in __queue_work().
      
      While at it:
      
      * __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
        pool.
      
      * Every caller of insert_work() calls debug_work_activate(). Consolidate the
        invocations into insert_work().
      
      * In __queue_work() pool->watchdog_ts update is relocated slightly. This is
        to better accommodate future changes.
      
      This makes wakeups more precise and will help the planned change to assign
      work items to workers before waking them up. No behavior changes intended.
      
      v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
          suggested by Lai.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
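The wake-up rule in this commit can be modeled in a few lines. This is a toy userspace sketch, not __queue_work(): the struct, its counters and the function name are hypothetical, and a real wake-up is replaced by incrementing a counter.

```c
#include <stdbool.h>

/* Toy model of a pool_workqueue; all fields are illustrative assumptions. */
struct sketch_pwq {
	int nr_active;   /* work items currently active */
	int max_active;  /* per-pwq concurrency limit */
	int nr_inactive; /* work items parked on the inactive list */
	int wakeups;     /* how many times a worker wake-up was attempted */
};

/*
 * Queue one work item. Only an item that becomes active triggers a
 * wake-up attempt; an item parked on the inactive list does not.
 * Returns true when the item became active.
 */
bool sketch_queue_work(struct sketch_pwq *pwq)
{
	if (pwq->nr_active < pwq->max_active) {
		pwq->nr_active++;
		pwq->wakeups++; /* wake-up lives in the active queueing path */
		return true;
	}

	pwq->nr_inactive++; /* no wake-up: nothing new for a worker to run */
	return false;
}
```

Queueing three items against a limit of two attempts two wake-ups and parks the third item, which matches the intent of making wake-ups more precise.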
    • workqueue: Cleanups around process_scheduled_works() · c0ab017d
      Tejun Heo authored
      * Drop the trivial optimization in worker_thread() where it bypasses calling
        process_scheduled_works() if the first work item isn't linked. This is a
        mostly pointless micro optimization and gets in the way of improving the
        work processing path.
      
      * Consolidate pool->watchdog_ts updates in the two callers into
        process_scheduled_works().
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Drop the special locking rule for worker->flags and worker_pool->flags · bc8b50c2
      Tejun Heo authored
      worker->flags used to be accessed from scheduler hooks without grabbing
      pool->lock for concurrency management. This is no longer true since
      6d25be57 ("sched/core, workqueues: Distangle worker accounting from rq
      lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
      All relevant users are accessing it under the pool lock.
      
      Let's drop the special "X" rule and use the "L" rule for these flag fields
      instead. While at it, replace the CONTEXT comment with
      lockdep_assert_held().
      
      This allows worker_set/clr_flags() to be used from context which isn't the
      worker itself. This will be used later to implement assigning work items to
      workers before waking them up so that workqueue can have better control over
      which worker executes which work item on which CPU.
      
      The only actual changes are sanity checks. There shouldn't be any visible
      behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: Merge branch 'for-6.5-fixes' into for-6.6 · 87437656
      Tejun Heo authored
      Unbound workqueue execution locality improvement patchset is about to be
      applied which will cause merge conflicts with changes in for-6.5-fixes.
      Let's avoid future merge conflicts by pulling in for-6.5-fixes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. 07 Aug, 2023 1 commit
  3. 25 Jul, 2023 1 commit
    • workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000 · aa6fde93
      Tejun Heo authored
      wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
      Once detected, they're excluded from concurrency management to prevent them
      from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
      enabled, repeat offenders are also reported so that the code can be updated.
      
      The default threshold is 10ms which is long enough to do a fair bit of work
      on modern CPUs while short enough to be usually not noticeable. This
      unfortunately leads to a lot of arguably spurious detections on very slow
      CPUs. Using the same threshold across CPUs whose performance levels may be
      apart by multiple orders of magnitude doesn't make a whole lot of sense.
      
      This patch scales up wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS
      is below 4000. This is obviously very inaccurate but it doesn't have to be
      accurate to be useful. The mechanism is still useful when the threshold is
      fully scaled up and the benefits of reports are usually shared with everyone
      regardless of who's reporting, so as long as there are a sufficient number
      of fast machines reporting, we don't lose much.
      
      Some (or is it all?) ARM CPUs systematically report significantly lower
      BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
      are, it's at least a missed opportunity and it probably would be a good idea
      to teach workqueue about it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
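One way to read the scaling rule is as an inverse-proportional adjustment capped at 1 second. The sketch below assumes that shape. The commit message only states the 10ms default, the 4000 BogoMIPS cutoff and the 1 second ceiling, so the linear formula and the function name are assumptions rather than the patch's actual code.

```c
#define SKETCH_DEFAULT_THRESH_US 10000UL   /* 10ms default from the commit */
#define SKETCH_MAX_THRESH_US     1000000UL /* 1 second ceiling */
#define SKETCH_BOGOMIPS_CUTOFF   4000UL    /* scale only below this value */

/*
 * Return the CPU-intensive detection threshold in microseconds for a
 * machine with the given BogoMIPS. Assumes inverse-linear scaling below
 * the cutoff, clamped to the ceiling.
 */
unsigned long sketch_scaled_thresh_us(unsigned long bogomips)
{
	unsigned long thresh = SKETCH_DEFAULT_THRESH_US;

	if (bogomips == 0)
		bogomips = 1; /* guard against division by zero */

	if (bogomips < SKETCH_BOGOMIPS_CUTOFF) {
		thresh = thresh * SKETCH_BOGOMIPS_CUTOFF / bogomips;
		if (thresh > SKETCH_MAX_THRESH_US)
			thresh = SKETCH_MAX_THRESH_US;
	}

	return thresh;
}
```

Under this assumed formula a machine reporting 2000 BogoMIPS gets a 20ms threshold, and anything at or below 40 BogoMIPS is fully scaled up to the 1 second ceiling.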
  4. 11 Jul, 2023 1 commit
  5. 10 Jul, 2023 4 commits
  6. 09 Jul, 2023 10 commits
  7. 08 Jul, 2023 5 commits