1. 24 Nov, 2020 7 commits
  2. 19 Nov, 2020 10 commits
    • sched/topology: Condition EAS enablement on FIE support · fa50e2b4
      Ionela Voinescu authored
      In order to make accurate predictions across CPUs and for all performance
      states, Energy Aware Scheduling (EAS) needs frequency-invariant load
      tracking signals.
      
      EAS task placement aims to minimize energy consumption, and does so in
      part by limiting the search space to only CPUs with the highest spare
      capacity (CPU capacity - CPU utilization) in their performance domain.
      Those candidates are the placement choices that will keep frequency at
      its lowest possible level and therefore save the most energy.
      
      But without frequency invariance, a CPU's utilization is relative to the
      CPU's current performance level, and not relative to its maximum
      performance level, which determines its capacity. As a result, it will
      fail to correctly indicate any potential spare capacity obtained by an
      increase in a CPU's performance level. Therefore, a non-invariant
      utilization signal would render the EAS task placement logic invalid.
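
      As a rough illustration of the paragraph above, the sketch below rescales
      a utilization value measured against the current frequency so that it is
      relative to the maximum frequency, then derives spare capacity from it.
      The helper names and the kHz parameters are assumptions made for this
      example; they are not the kernel's FIE implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: rescale a utilization value that is relative to the
 * current frequency so that it is relative to the maximum frequency.
 */
static uint32_t scale_util_to_max_freq(uint32_t util, uint32_t cur_khz,
				       uint32_t max_khz)
{
	assert(max_khz > 0 && cur_khz <= max_khz);
	return (uint32_t)(((uint64_t)util * cur_khz) / max_khz);
}

/* Spare capacity as described above: CPU capacity - CPU utilization. */
static uint32_t spare_capacity(uint32_t capacity, uint32_t util)
{
	return util < capacity ? capacity - util : 0;
}
```

      For a CPU of capacity 1024 that is busy half the time while running at
      half its maximum frequency, the non-invariant signal reads 512 and
      suggests a spare capacity of 512. Rescaled to the maximum frequency the
      utilization is 256, so the spare capacity is actually 768.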
      
      Now that we properly report support for the Frequency Invariance Engine
      (FIE) through arch_scale_freq_invariant() for arm and arm64 systems,
      while also ensuring a re-evaluation of the EAS use conditions for
      possible invariance status change, we can assert this is the case when
      initializing EAS. Warn and bail out otherwise.
      Suggested-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201027180713.7642-4-ionela.voinescu@arm.com
    • arm64: Rebuild sched domains on invariance status changes · ecec9e86
      Ionela Voinescu authored
      Task scheduler behavior depends on frequency invariance (FI) support and
      the resulting invariant load tracking signals. For example, in order to
      make accurate predictions across CPUs for all performance states, Energy
      Aware Scheduling (EAS) needs frequency-invariant load tracking signals
      and therefore it has a direct dependency on FI. This dependency is known,
      but EAS enablement is not yet conditioned on the presence of FI during
      the build of the scheduling domain hierarchy.
      
      Before this is done, the following must be considered: while
      arch_scale_freq_invariant() will see changes in FI support and could
      be used to condition the use of EAS, it could return different values
      during system initialisation.
      
      For arm64, such a scenario will happen for a system that does not support
      cpufreq driven FI, but does support counter-driven FI. For such a system,
      arch_scale_freq_invariant() will return false if called before counter
      based FI initialisation, but change its status to true after it.
      If EAS becomes explicitly dependent on FI this would affect the task
      scheduler behavior which builds its scheduling domain hierarchy well
      before the late counter-based FI init. During that process, EAS would be
      disabled due to its dependency on FI.
      
      Two points of future early calls to arch_scale_freq_invariant() which
      would determine EAS enablement are:
       - (1) drivers/base/arch_topology.c:126 <<update_topology_flags_workfn>>
      		rebuild_sched_domains();
             This will happen after CPU capacity initialisation.
       - (2) kernel/sched/cpufreq_schedutil.c:917 <<rebuild_sd_workfn>>
      		rebuild_sched_domains_energy();
      		-->rebuild_sched_domains();
             This will happen during sched_cpufreq_governor_change() for the
             schedutil cpufreq governor.
      
      Therefore, before enforcing the presence of FI support for the use of EAS,
      ensure the following: if there is a change in FI support status after
      counter init, use the existing rebuild_sched_domains_energy() function to
      trigger a rebuild of the scheduling and performance domains that in turn
      will determine the enablement of EAS.
      Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lkml.kernel.org/r/20201027180713.7642-3-ionela.voinescu@arm.com
    • sched/topology,schedutil: Wrap sched domains rebuild · 31f6a8c0
      Ionela Voinescu authored
      Add the rebuild_sched_domains_energy() function to wrap the functionality
      that rebuilds the scheduling domains if any of the Energy Aware Scheduling
      (EAS) initialisation conditions change. This functionality is used when
      schedutil is added or removed or when EAS is enabled or disabled
      through the sched_energy_aware sysctl.
      
      Therefore, create a single function that is used in both these cases and
      that can be later reused.
      Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Quentin Perret <qperret@google.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20201027180713.7642-2-ionela.voinescu@arm.com
    • sched/uclamp: Allow to reset a task uclamp constraint value · 480a6ca2
      Dietmar Eggemann authored
      In case the user wants to stop controlling a uclamp constraint value
      for a task, use the magic value -1 in sched_util_{min,max} with the
      appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
      the reset.
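
      A minimal sketch of how a caller might fill in the request described
      above. The struct below is a local mirror of the uapi struct sched_attr
      under a hypothetical name, and the FLAG_* macros repeat the uapi
      SCHED_FLAG_* values so that the example does not depend on kernel
      headers. prepare_uclamp_min_reset() is an illustrative helper, not part
      of the patch, and it only prepares the structure without issuing the
      system call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values of the corresponding SCHED_FLAG_* macros in <linux/sched.h>. */
#define FLAG_KEEP_POLICY	0x08
#define FLAG_KEEP_PARAMS	0x10
#define FLAG_UTIL_CLAMP_MIN	0x20
#define FLAG_UTIL_CLAMP_MAX	0x40

/* Local mirror of the uapi struct sched_attr (hypothetical name). */
struct uclamp_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Prepare a request that resets only the task's minimum clamp. */
static void prepare_uclamp_min_reset(struct uclamp_sched_attr *attr)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	/* Leave policy and parameters untouched; only act on util_min. */
	attr->sched_flags = FLAG_KEEP_POLICY | FLAG_KEEP_PARAMS |
			    FLAG_UTIL_CLAMP_MIN;
	/* The field is unsigned, so the magic -1 is stored as UINT32_MAX. */
	attr->sched_util_min = (uint32_t)-1;
}
```

      The prepared structure would then be handed to the sched_setattr()
      system call for the target task. Because SCHED_FLAG_UTIL_CLAMP_MAX is
      not set here, sched_util_max is expected to be left alone.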
      
      The advantage over the 'additional flag' approach (i.e. introducing
      SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
      exported via uapi. This avoids the need to document how this new flag
      has to be used in conjunction with the existing uclamp related flags.
      
      The following subtle issue is fixed as well. When a uclamp constraint
      value is set on a !user_defined uclamp_se it is currently first reset
      and then set.
      Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
      stands for the 'sched class change' case.
      The related condition 'if (uc_se->user_defined)' moved from
      __setscheduler_uclamp() into uclamp_reset().
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Yun Hsiang <hsiang023167@gmail.com>
      Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggemann@arm.com
    • Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug · 9032dc21
      Barry Song authored
      This document has been out of date for many years, and it has contained
      misspellings since the first day:
      ARCH_HASH_SCHED_TUNE should be ARCH_HAS_SCHED_TUNE
      ARCH_HASH_SCHED_DOMAIN should be ARCH_HAS_SCHED_DOMAIN
      
      Since v2.6.14, the kernel has completely deleted the relevant code,
      including arch_init_sched_domains().
      
      Right now, the kernel asks architectures to call set_sched_topology() to
      override the default sched domains.
      
      On the other hand, to print the scheduler debug information, users need to
      set sched_debug on the cmdline or enable it through a sysfs entry. So this
      patch also adds the description for sched_debug.
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20201113115018.1628-1-song.bao.hua@hisilicon.com
    • sched/topology: Warn when NUMA diameter > 2 · b5b21734
      Valentin Schneider authored
      NUMA topologies where the shortest path between some two nodes requires
      three or more hops (i.e. diameter > 2) end up being misrepresented in the
      scheduler topology structures.
      
      This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y
      + sched_debug on the cmdline, although this will only yield a warning about
      sched_group spans not matching sched_domain spans:
      
        ERROR: groups don't span domain->span
      
      Add an explicit warning for that case, triggered regardless of
      CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.
      
      The topology described in the comment can be booted up on QEMU by appending
      the following to your usual QEMU incantation:
      
          -smp cores=4 \
          -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
          -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
          -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
          -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
          -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
      
      A somewhat more realistic topology (6-node mesh) with the same affliction
      can be conjured with:
      
          -smp cores=6 \
          -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
          -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
          -numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
          -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
          -numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
          -numa dist,src=0,dst=5,val=20, \
          -numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
          -numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
          -numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
          -numa dist,src=2,dst=5,val=40, \
          -numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
          -numa dist,src=4,dst=5,val=20
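
      To see why the first (4-node) topology above has a diameter greater
      than 2, the sketch below computes the hop diameter of a distance table.
      It is an illustration only and not the check added by this patch. It
      assumes that two nodes are directly linked when their distance equals
      the smallest remote distance (20 here) and that the local distance is
      the conventional 10. numa_hop_diameter(), MAX_NODES and line_topology
      are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 8

/*
 * Hop diameter of a NUMA distance table. Assumption (not from the patch):
 * two distinct nodes are directly linked when their distance is link_dist.
 */
static int numa_hop_diameter(int n, const int dist[][MAX_NODES], int link_dist)
{
	int hops[MAX_NODES][MAX_NODES];
	int i, j, k, diameter = 0;

	assert(n > 0 && n <= MAX_NODES);

	/* 0 hops to self, 1 hop over a direct link, "infinite" otherwise. */
	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			hops[i][j] = (i == j) ? 0 :
				     (dist[i][j] == link_dist) ? 1 : INT_MAX / 2;

	/* Floyd-Warshall: shortest hop count between every pair of nodes. */
	for (k = 0; k < n; k++)
		for (i = 0; i < n; i++)
			for (j = 0; j < n; j++)
				if (hops[i][k] + hops[k][j] < hops[i][j])
					hops[i][j] = hops[i][k] + hops[k][j];

	/* The diameter is the largest of those shortest paths. */
	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			if (hops[i][j] > diameter)
				diameter = hops[i][j];

	return diameter;
}

/* Distance table of the 4-node topology (local distance assumed to be 10). */
static const int line_topology[4][MAX_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};
```

      Under these assumptions the nodes form a line 0 - 1 - 2 - 3, so
      numa_hop_diameter(4, line_topology, 20) yields 3: getting from node 0
      to node 3 takes three hops.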
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com
    • cpuset: fix race between hotplug work and later CPU offline · 406100f3
      Daniel Jordan authored
      One of our machines keeled over trying to rebuild the scheduler domains.
      Mainline produces the same splat:
      
        BUG: unable to handle page fault for address: 0000607f820054db
        CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
        Workqueue: events cpuset_hotplug_workfn
        RIP: build_sched_domains
        Call Trace:
         partition_sched_domains_locked
         rebuild_sched_domains_locked
         cpuset_hotplug_workfn
      
      It happens with cgroup2 and exclusive cpusets only.  This reproducer
      triggers it on an 8-cpu vm and works most effectively with no
      preexisting child cgroups:
      
        cd $UNIFIED_ROOT
        mkdir cg1
        echo 4-7 > cg1/cpuset.cpus
        echo root > cg1/cpuset.cpus.partition
      
        # with smt/control reading 'on',
        echo off > /sys/devices/system/cpu/smt/control
      
      RIP maps to
      
        sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
      
      from sd_init().  sd_id is calculated earlier in the same function:
      
        cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
        sd_id = cpumask_first(sched_domain_span(sd));
      
      tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
      and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
      value from per_cpu_ptr() above.
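
      The failure mode can be reproduced in miniature with a 64-bit mask
      standing in for struct cpumask. mask_first() and domain_id() are
      hypothetical stand-ins written for this illustration; the real
      cpumask_first() is only described above as returning a value
      >= nr_cpu_ids for an empty mask, and this simplified version returns
      exactly nr_cpu_ids.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified stand-in for cpumask_first() on a mask of at most 64 CPUs:
 * return the lowest set bit, or nr_cpu_ids when no bit is set.
 */
static unsigned int mask_first(uint64_t mask, unsigned int nr_cpu_ids)
{
	unsigned int cpu;

	assert(nr_cpu_ids <= 64);
	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return nr_cpu_ids;
}

/* Mirrors the two quoted lines: AND the masks, then take the first CPU. */
static unsigned int domain_id(uint64_t cpu_map, uint64_t tl_mask,
			      unsigned int nr_cpu_ids)
{
	return mask_first(cpu_map & tl_mask, nr_cpu_ids);
}
```

      With 8 CPUs, a sibling mask covering CPUs 4-5 gives an id of 4, which
      is a valid per-CPU index. An empty sibling mask gives 8, which is not
      a valid CPU number and must not be used to index per-CPU data.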
      
      The problem is a race between cpuset_hotplug_workfn() and a later
      offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
      when N is still online, the offline clears N from cpu_sibling_map, and
      then the worker uses the stale effective masks that still have N to
      generate the scheduling domains, leading the worker to read
      N's empty cpu_sibling_map in sd_init().
      
      rebuild_sched_domains_locked() prevented the race during the cgroup2
      cpuset series up until the Fixes commit changed its check.  Make the
      check more robust so that it can detect an offline CPU in any exclusive
      cpuset's effective mask, not just the top one.
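
      The idea behind the more robust check can be sketched with plain
      bitmasks. This is an illustration and not the code from the patch:
      rebuild_is_safe() and mask_subset() are hypothetical names, and the
      comparison against a mask of currently active CPUs is an assumption
      about what "detect an offline CPU" means here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when every CPU in 'sub' is also present in 'super'. */
static bool mask_subset(uint64_t sub, uint64_t super)
{
	return (sub & ~super) == 0;
}

/*
 * Hypothetical check: a rebuild is safe only if no exclusive cpuset's
 * effective mask contains a CPU that is missing from the active mask.
 */
static bool rebuild_is_safe(const uint64_t *effective, size_t nr_sets,
			    uint64_t active)
{
	size_t i;

	assert(effective != NULL || nr_sets == 0);
	for (i = 0; i < nr_sets; i++)
		if (!mask_subset(effective[i], active))
			return false;
	return true;
}
```

      With effective masks { 0x0f, 0xf0 } and CPU 7 offline (active mask
      0x7f), looking at the first mask alone reports no problem, while
      checking every mask catches the stale CPU in the second one.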
      
      Fixes: 0ccea8fe ("cpuset: Make generate_sched_domains() work with partition")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
    • sched: Fix migration_cpu_stop() WARN · 1293771e
      Peter Zijlstra authored
      Oleksandr reported hitting the WARN in the 'task_rq(p) != rq' branch
      of migration_cpu_stop(). Valentin noted that using cpu_of(rq) in that
      case is just plain wrong to begin with, since per the earlier branch
      that isn't the actual CPU of the task.
      
      Replace both instances of is_cpu_allowed() by a direct p->cpus_mask
      test using task_cpu().
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Debugged-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • sched/core: Add missing completion for affine_move_task() waiters · d707faa6
      Valentin Schneider authored
      Qian reported that some fuzzer issuing sched_setaffinity() ends up stuck on
      a wait_for_completion(). The problematic pattern seems to be:
      
        affine_move_task()
            // task_running() case
            stop_one_cpu();
            wait_for_completion(&pending->done);
      
      Combined with, on the stopper side:
      
        migration_cpu_stop()
          // Task moved between unlocks and scheduling the stopper
          task_rq(p) != rq &&
          // task_running() case
          dest_cpu >= 0
      
          => no complete_all()
      
      This can happen with both PREEMPT and !PREEMPT, although !PREEMPT should
      be more likely to see this given the targeted task has a much bigger window
      to block and be woken up elsewhere before the stopper runs.
      
      Make migration_cpu_stop() always look at pending affinity requests; signal
      their completion if the stopper hits a rq mismatch but the task is
      still within its allowed mask. When Migrate-Disable isn't involved, this
      matches the previous set_cpus_allowed_ptr() vs migration_cpu_stop()
      behaviour.
      
      Fixes: 6d337eab ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
      Reported-by: Qian Cai <cai@redhat.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/8b62fd1ad1b18def27f18e2ee2df3ff5b36d0762.camel@redhat.com
  3. 10 Nov, 2020 23 commits