1. 03 Sep, 2019 7 commits
    • Patrick Bellasi's avatar
      sched/uclamp: Use TG's clamps to restrict TASK's clamps · 3eac870a
      Patrick Bellasi authored
      When a task specific clamp value is configured via sched_setattr(2), this
      value is accounted in the corresponding clamp bucket every time the task is
      {en,de}qeued. However, when cgroups are also in use, the task specific
      clamp values could be restricted by the task_group (TG) clamp values.
      
      Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
      task is enqueued, it's accounted in the clamp bucket tracking the smaller
      clamp between the task specific value and its TG effective value. This
      allows to:
      
      1. ensure cgroup clamps are always used to restrict task specific requests,
         i.e. boosted not more than its TG effective protection and capped at
         least as its TG effective limit.
      
      2. implement a "nice-like" policy, where tasks are still allowed to request
         less than what enforced by their TG effective limits and protections
      
      Do this by exploiting the concept of "effective" clamp, which is already
      used by a TG to track parent enforced restrictions.
      
      Apply task group clamp restrictions only to tasks belonging to a child
      group. While, for tasks in the root group or in an autogroup, system
      defaults are still enforced.
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarMichal Koutny <mkoutny@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      3eac870a
    • Patrick Bellasi's avatar
      sched/uclamp: Propagate system defaults to the root group · 7274a5c1
      Patrick Bellasi authored
      The clamp values are not tunable at the level of the root task group.
      That's for two main reasons:
      
       - the root group represents "system resources" which are always
         entirely available from the cgroup standpoint.
      
       - when tuning/restricting "system resources" makes sense, tuning must
         be done using a system wide API which should also be available when
         control groups are not.
      
      When a system wide restriction is available, cgroups should be aware of
      its value in order to know exactly how much "system resources" are
      available for the subgroups.
      
      Utilization clamping supports already the concepts of:
      
       - system defaults: which define the maximum possible clamp values
         usable by tasks.
      
       - effective clamps: which allows a parent cgroup to constraint (maybe
         temporarily) its descendants without losing the information related
         to the values "requested" from them.
      
      Exploit these two concepts and bind them together in such a way that,
      whenever system default are tuned, the new values are propagated to
      (possibly) restrict or relax the "effective" value of nested cgroups.
      
      When cgroups are in use, force an update of all the RUNNABLE tasks.
      Otherwise, keep things simple and do just a lazy update next time each
      task will be enqueued.
      Do that since we assume a more strict resource control is required when
      cgroups are in use. This allows also to keep "effective" clamp values
      updated in case we need to expose them to user-space.
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarMichal Koutny <mkoutny@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7274a5c1
    • Patrick Bellasi's avatar
      sched/uclamp: Propagate parent clamps · 0b60ba2d
      Patrick Bellasi authored
      In order to properly support hierarchical resources control, the cgroup
      delegation model requires that attribute writes from a child group never
      fail but still are locally consistent and constrained based on parent's
      assigned resources. This requires to properly propagate and aggregate
      parent attributes down to its descendants.
      
      Implement this mechanism by adding a new "effective" clamp value for each
      task group. The effective clamp value is defined as the smaller value
      between the clamp value of a group and the effective clamp value of its
      parent. This is the actual clamp value enforced on tasks in a task group.
      
      Since it's possible for a cpu.uclamp.min value to be bigger than the
      cpu.uclamp.max value, ensure local consistency by restricting each
      "protection" (i.e. min utilization) with the corresponding "limit"
      (i.e. max utilization).
      
      Do that at effective clamps propagation to ensure all user-space write
      never fails while still always tracking the most restrictive values.
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarMichal Koutny <mkoutny@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0b60ba2d
    • Patrick Bellasi's avatar
      sched/uclamp: Extend CPU's cgroup controller · 2480c093
      Patrick Bellasi authored
      The cgroup CPU bandwidth controller allows to assign a specified
      (maximum) bandwidth to the tasks of a group. However this bandwidth is
      defined and enforced only on a temporal base, without considering the
      actual frequency a CPU is running on. Thus, the amount of computation
      completed by a task within an allocated bandwidth can be very different
      depending on the actual frequency the CPU is running that task.
      The amount of computation can be affected also by the specific CPU a
      task is running on, especially when running on asymmetric capacity
      systems like Arm's big.LITTLE.
      
      With the availability of schedutil, the scheduler is now able
      to drive frequency selections based on actual task utilization.
      Moreover, the utilization clamping support provides a mechanism to
      bias the frequency selection operated by schedutil depending on
      constraints assigned to the tasks currently RUNNABLE on a CPU.
      
      Giving the mechanisms described above, it is now possible to extend the
      cpu controller to specify the minimum (or maximum) utilization which
      should be considered for tasks RUNNABLE on a cpu.
      This makes it possible to better defined the actual computational
      power assigned to task groups, thus improving the cgroup CPU bandwidth
      controller which is currently based just on time constraints.
      
      Extend the CPU controller with a couple of new attributes uclamp.{min,max}
      which allow to enforce utilization boosting and capping for all the
      tasks in a group.
      
      Specifically:
      
      - uclamp.min: defines the minimum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run at least at a
      	      minimum frequency which corresponds to the uclamp.min
      	      utilization
      
      - uclamp.max: defines the maximum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run up to a
      	      maximum frequency which corresponds to the uclamp.max
      	      utilization
      
      These attributes:
      
      a) are available only for non-root nodes, both on default and legacy
         hierarchies, while system wide clamps are defined by a generic
         interface which does not depends on cgroups. This system wide
         interface enforces constraints on tasks in the root node.
      
      b) enforce effective constraints at each level of the hierarchy which
         are a restriction of the group requests considering its parent's
         effective constraints. Root group effective constraints are defined
         by the system wide interface.
         This mechanism allows each (non-root) level of the hierarchy to:
         - request whatever clamp values it would like to get
         - effectively get only up to the maximum amount allowed by its parent
      
      c) have higher priority than task-specific clamps, defined via
         sched_setattr(), thus allowing to control and restrict task requests.
      
      Add two new attributes to the cpu controller to collect "requested"
      clamp values. Allow that at each non-root level of the hierarchy.
      Keep it simple by not caring now about "effective" values computation
      and propagation along the hierarchy.
      
      Update sysctl_sched_uclamp_handler() to use the newly introduced
      uclamp_mutex so that we serialize system default updates with cgroup
      relate updates.
      Signed-off-by: default avatarPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarMichal Koutny <mkoutny@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2480c093
    • Matt Fleming's avatar
      sched/topology: Improve load balancing on AMD EPYC systems · a55c7454
      Matt Fleming authored
      SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
      for any sched domains with a NUMA distance greater than 2 hops
      (RECLAIM_DISTANCE). The idea being that it's expensive to balance
      across domains that far apart.
      
      However, as is rather unfortunately explained in:
      
        commit 32e45ff4 ("mm: increase RECLAIM_DISTANCE to 30")
      
      the value for RECLAIM_DISTANCE is based on node distance tables from
      2011-era hardware.
      
      Current AMD EPYC machines have the following NUMA node distances:
      
       node distances:
       node   0   1   2   3   4   5   6   7
         0:  10  16  16  16  32  32  32  32
         1:  16  10  16  16  32  32  32  32
         2:  16  16  10  16  32  32  32  32
         3:  16  16  16  10  32  32  32  32
         4:  32  32  32  32  10  16  16  16
         5:  32  32  32  32  16  10  16  16
         6:  32  32  32  32  16  16  10  16
         7:  32  32  32  32  16  16  16  10
      
      where 2 hops is 32.
      
      The result is that the scheduler fails to load balance properly across
      NUMA nodes on different sockets -- 2 hops apart.
      
      For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
      (CPUs 32-39) like so,
      
        $ numactl -C 0-7,32-39 ./spinner 16
      
      causes all threads to fork and remain on node 0 until the active
      balancer kicks in after a few seconds and forcibly moves some threads
      to node 4.
      
      Override node_reclaim_distance for AMD Zen.
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.ukSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a55c7454
    • Matt Fleming's avatar
      arch, ia64: Make NUMA select SMP · a2cbfd46
      Matt Fleming authored
      While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
      theory, it doesn't make much sense in practice.
      
      Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.
      
      The motivation for this patch is to allow a new NUMA variable to be
      initialised in kernel/sched/topology.c.
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-2-matt@codeblueprint.co.ukSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a2cbfd46
    • Peter Zijlstra's avatar
      sched, perf: MAINTAINERS update, add submaintainers and reviewers · bb874816
      Peter Zijlstra authored
      The below entries are a little unorthodox; I've not found other entries in
      MAINTAINER that subdivide responsibilities like this, and certainly the lovely
      get_maintainers.pl script will not get it, but I'm thinking to a human it
      should be plenty clear and we're all very good at ignoring email anyway.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarJuri Lelli <juri.lelli@redhat.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      bb874816
  2. 12 Aug, 2019 1 commit
    • Phil Auld's avatar
      sched/fair: Use rq_lock/unlock in online_fair_sched_group · a46d14ec
      Phil Auld authored
      Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
      warning to fire in update_rq_clock. This seems to be caused by onlining
      a new fair sched group not using the rq lock wrappers.
      
        [] rq->clock_update_flags & RQCF_UPDATED
        [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150
      
        [] Call Trace:
        []  online_fair_sched_group+0x53/0x100
        []  cpu_cgroup_css_online+0x16/0x20
        []  online_css+0x1c/0x60
        []  cgroup_apply_control_enable+0x231/0x3b0
        []  cgroup_mkdir+0x41b/0x530
        []  kernfs_iop_mkdir+0x61/0xa0
        []  vfs_mkdir+0x108/0x1a0
        []  do_mkdirat+0x77/0xe0
        []  do_syscall_64+0x55/0x1d0
        []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Using the wrappers in online_fair_sched_group instead of the raw locking
      removes this warning.
      
      [ tglx: Use rq_*lock_irq() ]
      Signed-off-by: default avatarPhil Auld <pauld@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
      a46d14ec
  3. 08 Aug, 2019 12 commits
  4. 25 Jul, 2019 20 commits