1. 03 Sep, 2019 5 commits
    • sched/uclamp: Propagate parent clamps · 0b60ba2d
      Patrick Bellasi authored
      In order to properly support hierarchical resource control, the cgroup
      delegation model requires that attribute writes from a child group
      never fail, while still remaining locally consistent and constrained
      by the parent's assigned resources. This requires properly propagating
      and aggregating parent attributes down to descendants.
      
      Implement this mechanism by adding a new "effective" clamp value for each
      task group. The effective clamp value is defined as the smaller of a
      group's own clamp value and its parent's effective clamp value. This is
      the actual clamp value enforced on tasks in a task group.
      
      Since it's possible for a cpu.uclamp.min value to be larger than the
      cpu.uclamp.max value, ensure local consistency by restricting each
      "protection" (i.e. min utilization) with the corresponding "limit"
      (i.e. max utilization).
      
      Do this during effective clamps propagation, so that user-space writes
      never fail while the most restrictive values are still always tracked.
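      
      A minimal C sketch of this propagation, with illustrative names rather
      than the kernel's actual data structures:
      
        /* Sketch only: field and function names are illustrative. */
        enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };
        
        struct task_group {
                unsigned int uclamp_req[UCLAMP_CNT]; /* requested via cgroupfs */
                unsigned int uclamp_eff[UCLAMP_CNT]; /* actually enforced */
                struct task_group *parent;
        };
        
        static unsigned int min_u(unsigned int a, unsigned int b)
        {
                return a < b ? a : b;
        }
        
        static void uclamp_update_effective(struct task_group *tg)
        {
                /* Effective value: the smaller of the group's own request
                 * and the parent's effective value. */
                for (int id = 0; id < UCLAMP_CNT; id++)
                        tg->uclamp_eff[id] = tg->parent
                                ? min_u(tg->uclamp_req[id],
                                        tg->parent->uclamp_eff[id])
                                : tg->uclamp_req[id];
        
                /* Local consistency: the protection (min) never exceeds
                 * the limit (max). */
                tg->uclamp_eff[UCLAMP_MIN] = min_u(tg->uclamp_eff[UCLAMP_MIN],
                                                   tg->uclamp_eff[UCLAMP_MAX]);
        }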
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/uclamp: Extend CPU's cgroup controller · 2480c093
      Patrick Bellasi authored
      The cgroup CPU bandwidth controller makes it possible to assign a
      specified (maximum) bandwidth to the tasks of a group. However, this
      bandwidth is defined and enforced only on a temporal basis, without
      considering the actual frequency a CPU is running at. Thus, the amount
      of computation completed by a task within an allocated bandwidth can
      vary widely depending on the frequency the CPU is running at while
      executing that task.
      The amount of computation is also affected by the specific CPU a
      task runs on, especially on asymmetric-capacity systems such as
      Arm's big.LITTLE.
      
      With the availability of schedutil, the scheduler is now able
      to drive frequency selections based on actual task utilization.
      Moreover, the utilization clamping support provides a mechanism to
      bias the frequency selection performed by schedutil depending on
      constraints assigned to the tasks currently RUNNABLE on a CPU.
      
      Given the mechanisms described above, it is now possible to extend the
      cpu controller to specify the minimum (or maximum) utilization which
      should be considered for tasks RUNNABLE on a cpu.
      This makes it possible to better define the actual computational
      power assigned to task groups, thus improving the cgroup CPU bandwidth
      controller, which is currently based only on time constraints.
      
      Extend the CPU controller with two new attributes, uclamp.{min,max},
      which allow enforcing utilization boosting and capping for all the
      tasks in a group.
      
      Specifically:
      
      - uclamp.min: defines the minimum utilization which should be considered,
      	      i.e. the RUNNABLE tasks of this group will run at least at the
      	      minimum frequency corresponding to the uclamp.min utilization
      
      - uclamp.max: defines the maximum utilization which should be considered,
      	      i.e. the RUNNABLE tasks of this group will run up to the
      	      maximum frequency corresponding to the uclamp.max utilization
      
      These attributes:
      
      a) are available only for non-root nodes, both on default and legacy
         hierarchies, while system-wide clamps are defined by a generic
         interface which does not depend on cgroups. This system-wide
         interface enforces constraints on tasks in the root node.
      
      b) enforce effective constraints at each level of the hierarchy, which
         restrict the group's own requests according to its parent's
         effective constraints. Root group effective constraints are defined
         by the system-wide interface.
         This mechanism allows each (non-root) level of the hierarchy to:
         - request whatever clamp values it would like to get
         - effectively get only up to the maximum amount allowed by its parent
      
      c) have higher priority than task-specific clamps, defined via
         sched_setattr(), thus making it possible to control and restrict
         task requests.
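      
      For illustration, a minimal user-space sketch that sets both attributes
      for a child group; the cgroup v2 mount point and the "app" group name
      are assumptions, not part of this patch:
      
        /* Hypothetical example: boost tasks in the "app" group to at least
         * 20% utilization and cap them at 80%, assuming cgroup v2 is
         * mounted at /sys/fs/cgroup and the group already exists. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        
        static void write_attr(const char *path, const char *val)
        {
                int fd = open(path, O_WRONLY);
        
                if (fd < 0) {
                        perror(path);
                        return;
                }
                if (write(fd, val, strlen(val)) < 0)
                        perror(path);
                close(fd);
        }
        
        int main(void)
        {
                write_attr("/sys/fs/cgroup/app/cpu.uclamp.min", "20");
                write_attr("/sys/fs/cgroup/app/cpu.uclamp.max", "80");
                return 0;
        }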
      
      Add two new attributes to the cpu controller to collect "requested"
      clamp values. Allow that at each non-root level of the hierarchy.
      Keep it simple by not dealing, for now, with the computation and
      propagation of "effective" values along the hierarchy.
      
      Update sysctl_sched_uclamp_handler() to use the newly introduced
      uclamp_mutex, so that system default updates are serialized with
      cgroup-related updates.
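      
      A simplified sketch of the intended serialization, with abbreviated
      signatures (the real handlers take more parameters):
      
        /* One mutex serializes writers of the system defaults (sysctl)
         * and of the per-group attributes (cgroup). Sketch only. */
        static DEFINE_MUTEX(uclamp_mutex);
        
        int sysctl_sched_uclamp_handler(void /* real arguments elided */)
        {
                mutex_lock(&uclamp_mutex);
                /* validate and update the system default clamps */
                mutex_unlock(&uclamp_mutex);
                return 0;
        }
        
        static int cpu_uclamp_write(void /* real arguments elided */)
        {
                mutex_lock(&uclamp_mutex);
                /* update this group's requested clamp value */
                mutex_unlock(&uclamp_mutex);
                return 0;
        }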
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Improve load balancing on AMD EPYC systems · a55c7454
      Matt Fleming authored
      SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
      for any sched domains with a NUMA distance greater than 2 hops
      (RECLAIM_DISTANCE). The idea being that it's expensive to balance
      across domains that far apart.
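      
      The stripping happens roughly as follows (simplified from the check in
      kernel/sched/topology.c; surrounding context elided):
      
        /* In sd_init(): for NUMA domains whose distance exceeds
         * RECLAIM_DISTANCE, drop the flags that enable fork/exec
         * balancing and affine wakeups across the domain. */
        if (sd->flags & SD_NUMA) {
                if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
                        sd->flags &= ~(SD_BALANCE_EXEC |
                                       SD_BALANCE_FORK |
                                       SD_WAKE_AFFINE);
                }
        }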
      
      However, as is rather unfortunately explained in:
      
        commit 32e45ff4 ("mm: increase RECLAIM_DISTANCE to 30")
      
      the value for RECLAIM_DISTANCE is based on node distance tables from
      2011-era hardware.
      
      Current AMD EPYC machines have the following NUMA node distances:
      
       node distances:
       node   0   1   2   3   4   5   6   7
         0:  10  16  16  16  32  32  32  32
         1:  16  10  16  16  32  32  32  32
         2:  16  16  10  16  32  32  32  32
         3:  16  16  16  10  32  32  32  32
         4:  32  32  32  32  10  16  16  16
         5:  32  32  32  32  16  10  16  16
         6:  32  32  32  32  16  16  10  16
         7:  32  32  32  32  16  16  16  10
      
      where 2 hops is 32.
      
      The result is that the scheduler fails to load balance properly across
      NUMA nodes on different sockets -- 2 hops apart.
      
      For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
      (CPUs 32-39) like so,
      
        $ numactl -C 0-7,32-39 ./spinner 16
      
      causes all threads to fork and remain on node 0 until the active
      balancer kicks in after a few seconds and forcibly moves some threads
      to node 4.
      
      Override node_reclaim_distance for AMD Zen.
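      
      The change amounts to replacing the compile-time constant with a runtime
      tunable, sketched below (exact placement in the patch may differ):
      
        /* kernel/sched/topology.c: tunable distance threshold, keeping
         * RECLAIM_DISTANCE as the default for everyone else. */
        int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
        
        /* arch/x86/kernel/cpu/amd.c, in the Zen init path: on EPYC,
         * 2 hops is 32, so raise the threshold to keep FORK/EXEC
         * balancing and affine wakeups across sockets. */
        node_reclaim_distance = 32;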
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • arch, ia64: Make NUMA select SMP · a2cbfd46
      Matt Fleming authored
      While it does make sense to allow CONFIG_NUMA and !CONFIG_SMP in
      theory, it doesn't make much sense in practice.
      
      Follow other architectures and make CONFIG_NUMA select CONFIG_SMP.
      
      The motivation for this patch is to allow a new NUMA variable to be
      initialised in kernel/sched/topology.c.
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-2-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched, perf: MAINTAINERS update, add submaintainers and reviewers · bb874816
      Peter Zijlstra authored
      The below entries are a little unorthodox; I've not found other entries in
      MAINTAINERS that subdivide responsibilities like this, and certainly the lovely
      get_maintainers.pl script will not understand it, but I'm thinking to a human it
      should be plenty clear and we're all very good at ignoring email anyway.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 12 Aug, 2019 1 commit
    • sched/fair: Use rq_lock/unlock in online_fair_sched_group · a46d14ec
      Phil Auld authored
      Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes a
      warning to fire in update_rq_clock(). This seems to be caused by onlining
      a new fair sched group without using the rq lock wrappers.
      
        [] rq->clock_update_flags & RQCF_UPDATED
        [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150
      
        [] Call Trace:
        []  online_fair_sched_group+0x53/0x100
        []  cpu_cgroup_css_online+0x16/0x20
        []  online_css+0x1c/0x60
        []  cgroup_apply_control_enable+0x231/0x3b0
        []  cgroup_mkdir+0x41b/0x530
        []  kernfs_iop_mkdir+0x61/0xa0
        []  vfs_mkdir+0x108/0x1a0
        []  do_mkdirat+0x77/0xe0
        []  do_syscall_64+0x55/0x1d0
        []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Using the wrappers in online_fair_sched_group() instead of the raw
      locking removes this warning.
      
      [ tglx: Use rq_*lock_irq() ]
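      
      The resulting function, roughly as applied (minor details may differ):
      
        /* online_fair_sched_group() after the fix: the rq_lock wrappers
         * maintain rq->clock_update_flags, so the update_rq_clock() call
         * no longer trips the WARN_DOUBLE_CLOCK check. */
        void online_fair_sched_group(struct task_group *tg)
        {
                struct sched_entity *se;
                struct rq_flags rf;
                struct rq *rq;
                int i;
        
                for_each_possible_cpu(i) {
                        rq = cpu_rq(i);
                        se = tg->se[i];
                        rq_lock_irq(rq, &rf);
                        update_rq_clock(rq);
                        attach_entity_cfs_rq(se);
                        sync_throttle(tg, i);
                        rq_unlock_irq(rq, &rf);
                }
        }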
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
  3. 08 Aug, 2019 12 commits
  4. 25 Jul, 2019 22 commits