- 16 Jun, 2023 8 commits
-
-
Hao Jia authored
After commit 8ad075c2 ("sched: Async unthrottling for cfs bandwidth"), we may update the rq clock multiple times in the loop of __cfsb_csd_unthrottle(). A prior (although less common) instance of this problem exists in unthrottle_offline_cfs_rqs(). Cure both by ensuring update_rq_clock() is called before the loop and setting RQCF_ACT_SKIP during the loop, to supress further updates. The alternative would be pulling update_rq_clock() out of unthrottle_cfs_rq(), but that gives an even bigger mess. Fixes: 8ad075c2 ("sched: Async unthrottling for cfs bandwidth") Reviewed-By:
Ben Segall <bsegall@google.com> Suggested-by:
Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by:
Hao Jia <jiahao.os@bytedance.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20230613082012.49615-4-jiahao.os@bytedance.com
-
Hao Jia authored
There is a double update_rq_clock() invocation: __balance_push_cpu_stop() update_rq_clock() __migrate_task() update_rq_clock() Sadly select_fallback_rq() also needs update_rq_clock() for __do_set_cpus_allowed(), it is not possible to remove the update from __balance_push_cpu_stop(). So remove it from __migrate_task() and ensure all callers of this function call update_rq_clock() prior to calling it. Signed-off-by:
Hao Jia <jiahao.os@bytedance.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20230613082012.49615-3-jiahao.os@bytedance.com
-
Hao Jia authored
When using a cpufreq governor that uses cpufreq_add_update_util_hook(), it is possible to trigger a missing update_rq_clock() warning for the CPU hotplug path: rq_attach_root() set_rq_offline() rq_offline_rt() __disable_runtime() sched_rt_rq_enqueue() enqueue_top_rt_rq() cpufreq_update_util() data->func(data, rq_clock(rq), flags) Move update_rq_clock() from sched_cpu_deactivate() (one of it's callers) into set_rq_offline() such that it covers all set_rq_offline() usage. Additionally change rq_attach_root() to use rq_lock_irqsave() so that it will properly manage the runqueue clock flags. Suggested-by:
Ben Segall <bsegall@google.com> Signed-off-by:
Hao Jia <jiahao.os@bytedance.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com
-
Vineeth Pillai authored
Update the details of GRUB to reflect the updated logic. Signed-off-by:
Vineeth Pillai (Google) <vineeth@bitbyteword.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Daniel Bristot de Oliveira <bristot@kernel.org> Acked-by:
Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20230530135526.2385378-2-vineeth@bitbyteword.org
-
Vineeth Pillai authored
According to the GRUB[1] rule, the runtime is depreciated as: "dq = -max{u, (1 - Uinact - Uextra)} dt" (1) To guarantee that deadline tasks doesn't starve lower class tasks, we do not allocate the full bandwidth of the cpu to deadline tasks. Maximum bandwidth usable by deadline tasks is denoted by "Umax". Considering Umax, equation (1) becomes: "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt" (2) Current implementation has a minor bug in equation (2), which this patch fixes. The reclamation logic is verified by a sample program which creates multiple deadline threads and observing their utilization. The tests were run on an isolated cpu(isolcpus=3) on a 4 cpu system. Tests on 6.3.0 ============== RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95% TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33 TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35 RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95% TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69 TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69 RUN 3: 2 tasks Task 1: runtime=1ms, deadline=period=10ms Task 2: runtime=1ms, deadline=period=100ms TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.67 TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.37 TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.38 TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.23 As seen above, the reclamation doesn't reclaim the maximum allowed bandwidth and as the bandwidth of tasks gets smaller, the reclaimed bandwidth also comes down. Tests with this patch applied ============================= RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95% TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.19 TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.16 RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95% TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.27 TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.21 RUN 3: 2 tasks Task 1: runtime=1ms, deadline=period=10ms Task 2: runtime=1ms, deadline=period=100ms TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.64 TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.66 TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.45 TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.73 Running tasks on all cpus allowing for migration also showed that the utilization is reclaimed to the maximum. Running 10 tasks on 3 cpus SCHED_FLAG_RECLAIM - top shows: %Cpu0 : 94.6 us, 0.0 sy, 0.0 ni, 5.4 id, 0.0 wa %Cpu1 : 95.2 us, 0.0 sy, 0.0 ni, 4.8 id, 0.0 wa %Cpu2 : 95.8 us, 0.0 sy, 0.0 ni, 4.2 id, 0.0 wa [1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng. (2015). Parallel and sequential reclaiming in multicore real-time global scheduling. Signed-off-by:
Vineeth Pillai (Google) <vineeth@bitbyteword.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Daniel Bristot de Oliveira <bristot@kernel.org> Acked-by:
Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20230530135526.2385378-1-vineeth@bitbyteword.org
-
Arve Hjønnevåg authored
kthread_park and wait_woken have a similar race that kthread_stop and wait_woken used to have before it was fixed in commit cb6538e7 ("sched/wait: Fix a kthread race with wait_woken()"). Extend that fix to also cover kthread_park. [jstultz: Made changes suggested by Peter to optimize memory loads] Signed-off-by:
Arve Hjønnevåg <arve@android.com> Signed-off-by:
John Stultz <jstultz@google.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230602212350.535358-1-jstultz@google.com
-
Miaohe Lin authored
All callers of set_sched_topology() are within __init section. Mark it __init too. Signed-off-by:
Miaohe Lin <linmiaohe@huawei.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com
-
Tom Rix authored
cppcheck reports kernel/sched/fair.c:7436:17: style: Local variable 'cpu_util' shadows outer function [shadowFunction] unsigned long cpu_util; ^ Clean this up by renaming the variable to eff_util Signed-off-by:
Tom Rix <trix@redhat.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Valentin Schneider <vschneid@redhat.com> Reviewed-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20230611122535.183654-1-trix@redhat.com
-
- 06 Jun, 2023 1 commit
-
-
Peter Zijlstra authored
The readl_relaxed() to __raw_readl() change meant to loose the instrumentation, but also (inadvertently) lost the byteswap. Fixes: 24ee7607 ("arm64/arch_timer: Provide noinstr sched_clock_read() functions") Reported-by:
Mark Rutland <mark.rutland@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Mark Rutland <mark.rutland@arm.com> Link: https://lkml.kernel.org/r/20230606080614.GB905437@hirez.programming.kicks-ass.net
-
- 05 Jun, 2023 19 commits
-
-
Dietmar Eggemann authored
The responsiveness of the Per Entity Load Tracking (PELT) util_avg in mobile devices is still considered too low for utilization changes during task ramp-up. In Android this manifests in the fact that the first frames of a UI activity are very prone to be jankframes (a frame which doesn't meet the required frame rendering time, e.g. 16ms@60Hz) since the CPU frequency is normally low at this point and has to ramp up quickly. The beginning of an UI activity is also characterized by the occurrence of CPU contention, especially on little CPUs. Current little CPUs can have an original CPU capacity of only ~ 150 which means that the actual CPU capacity at lower frequency can even be much smaller. Schedutil maps CPU util_avg into CPU frequency request via: util = effective_cpu_util(..., cpu_util_cfs(cpu), ...) -> util = map_util_perf(util) -> freq = map_util_freq(util, ...) CPU contention for CFS tasks can be detected by 'CPU runnable > CPU utililization' in cpu_util_cfs_boost() -> cpu_util(..., boost = 1). Schedutil uses 'runnable boosting' by calling cpu_util_cfs_boost(). To be in sync with schedutil's CPU frequency selection, Energy Aware Scheduling (EAS) also calls cpu_util(..., boost = 1) during max util detection. Moreover, 'runnable boosting' is also used in load-balance for busiest CPU selection when the migration type is 'migrate_util', i.e. only at sched domains which don't have the SD_SHARE_PKG_RESOURCES flag set. Suggested-by:
Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230515115735.296329-3-dietmar.eggemann@arm.com
-
Dietmar Eggemann authored
There is a lot of code duplication in cpu_util_next() & cpu_util_cfs(). Remove this by allowing cpu_util_next() to be called with p = NULL. Rename cpu_util_next() to cpu_util() since the '_next' suffix is no longer necessary to distinct cpu utilization related functions. Implement cpu_util_cfs(cpu) as cpu_util(cpu, p = NULL, -1). This will allow to code future related cpu util changes only in one place, namely in cpu_util(). Signed-off-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230515115735.296329-2-dietmar.eggemann@arm.com
-
Peter Zijlstra authored
With the introduction of local_clock_noinstr(), local_clock() itself is no longer marked noinstr, use the correct function. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Rafael J. Wysocki <rafael@kernel.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102716.045980863@infradead.org
-
Peter Zijlstra authored
Now that all ARCH_WANTS_NO_INSTR architectures (arm64, loongarch, s390, x86) provide sched_clock_noinstr(), use this to provide local_clock_noinstr(). This local_clock_noinstr() will be safe to use from noinstr code with the assumption that any such noinstr code is non-preemptible (it had better be, entry code will have IRQs disabled while __cpuidle must have preemption disabled). Specifically, preempt_enable_notrace(), a common part of many a sched_clock() implementation calls out to schedule() -- even though, per the above, it will never trigger -- which frustrates noinstr validation. vmlinux.o: warning: objtool: local_clock+0xb5: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.978624636@infradead.org
-
Peter Zijlstra authored
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_noinstr() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. vmlinux.o: warning: objtool: native_sched_clock+0x96: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section vmlinux.o: warning: objtool: kvm_clock_read+0x22: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.910937674@infradead.org
-
Peter Zijlstra authored
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by making the Hyper-V TSC and MSR sched_clock implementations noinstr. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by:
Michael Kelley <mikelley@microsoft.com> Signed-off-by:
Michael Kelley <mikelley@microsoft.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.843039089@infradead.org
-
Peter Zijlstra authored
Currently hv_read_tsc_page_tsc() (ab)uses the (valid) time value of U64_MAX as an error return. This breaks the clean wrap-around of the clock. Modify the function signature to return a boolean state and provide another u64 pointer to store the actual time on success. This obviates the need to steal one time value and restores the full counter width. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.775630881@infradead.org
-
Peter Zijlstra authored
Because of how the virtual clocks use U64_MAX as an exception value instead of a valid time, the clocks can no longer be assumed to wrap cleanly. This is then compounded by arch_vdso_cycles_ok() rejecting everything with the MSB/Sign-bit set. Therefore, the effective mask becomes S64_MAX, and the comment with vdso_calc_delta() that states the mask is U64_MAX and isn't optimized out is just plain silly. Now, the code has a negative filter -- to deal with TSC wobbles: if (cycles > last) which is just plain wrong, because it should've been written as: if ((s64)(cycles - last) > 0) to take wrapping into account, but per all the above, we don't actually wrap on u64 anymore. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Thomas Gleixner <tglx@linutronix.de> Tested-by:
Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.704767397@infradead.org
-
Peter Zijlstra authored
In order to prevent the following complaint from happening, always inline the u128 variant of mul_u64_u64_shr() -- which is what x86_64 will use. vmlinux.o: warning: objtool: read_hv_sched_clock_tsc+0x5a: call to mul_u64_u64_shr.constprop.0() leaves .noinstr.text section It should compile into something like: asm("mul %[mul];" "shrd %rdx, %rax, %cl" : "+&a" (a) : "c" shift, [mul] "r" (mul) : "d"); Which is silly not to inline, but it happens. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.637420396@infradead.org
-
Peter Zijlstra authored
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_noinstr() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Heiko Carstens <hca@linux.ibm.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.570170436@infradead.org
-
Peter Zijlstra authored
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_read() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.502547082@infradead.org
-
Peter Zijlstra authored
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_read() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.435618812@infradead.org
-
Peter Zijlstra authored
The next patch will want to use __raw_readl() from a noinstr section and as such that needs to be marked __always_inline to avoid the compiler being a silly bugger. Turns out it already is, but its siblings are not. Finish the work started in commit e43f1331 ("arm64: Ask the compiler to __always_inline functions used by KVM at HYP") for consistenies sake. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Valentin Schneider <vschneid@redhat.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.368919762@infradead.org
-
Peter Zijlstra authored
With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_noinstr() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. As such, pull out the preempt_{dis,en}able_notrace() requirements from the sched_clock_read() implementations by explicitly providing it in the sched_clock() function. This further requires said sched_clock_read() functions to be noinstr themselves, for ARCH_WANTS_NO_INSTR users. See the next few patches. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.302350330@infradead.org
-
Peter Zijlstra authored
The read side of seqcount_latch consists of: do { seq = raw_read_seqcount_latch(&latch->seq); ... } while (read_seqcount_latch_retry(&latch->seq, seq)); which is asymmetric in the raw_ department, and sure enough, read_seqcount_latch_retry() includes (explicit) instrumentation where raw_read_seqcount_latch() does not. This inconsistency becomes a problem when trying to use it from noinstr code. As such, fix it by renaming and re-implementing raw_read_seqcount_latch_retry() without the instrumentation. Specifically the instrumentation in question is kcsan_atomic_next(0) in do___read_seqcount_retry(). Loosing this annotation is not a problem because raw_read_seqcount_latch() does not pass through kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX). Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Thomas Gleixner <tglx@linutronix.de> Reviewed-by:
Petr Mladek <pmladek@suse.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.233598176@infradead.org
-
Peter Zijlstra authored
Instead of having a number of fixed topologies to pick from; build one on the fly. This is both simpler now and simpler to extend in the future. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230601153522.GB559993%40hirez.programming.kicks-ass.net
-
Peter Zijlstra authored
With the introduction of task_struct::saved_state in commit 5f220be2 ("sched/wakeup: Prepare for RT sleeping spin/rwlocks") matching the task state has gotten more complicated. That same commit changed try_to_wake_up() to consider both states, but wait_task_inactive() has been neglected. Sebastian noted that the wait_task_inactive() usage in ptrace_check_attach() can misbehave when ptrace_stop() is blocked on the tasklist_lock after it sets TASK_TRACED. Therefore extract a common helper from ttwu_state_match() and use that to teach wait_task_inactive() about the PREEMPT_RT locks. Originally-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20230601091234.GW83892@hirez.programming.kicks-ass.net
-
Peter Zijlstra authored
While modifying wait_task_inactive() for PREEMPT_RT; the build robot noted that UP got broken. This led to audit and consideration of the UP implementation of wait_task_inactive(). It looks like the UP implementation is also broken for PREEMPT; consider task_current_syscall() getting preempted between the two calls to wait_task_inactive(). Therefore move the wait_task_inactive() implementation out of CONFIG_SMP and unconditionally use it. Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230602103731.GA630648%40hirez.programming.kicks-ass.net
-
Yicong Yang authored
We've run into the case that the balancer tries to balance a migration disabled task and trigger the warning in set_task_cpu() like below: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240 Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip> CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1 Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021 pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : set_task_cpu+0x188/0x240 lr : load_balance+0x5d0/0xc60 sp : ffff80000803bc70 x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040 x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001 x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78 x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000 x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000 x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530 x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001 Call trace: set_task_cpu+0x188/0x240 load_balance+0x5d0/0xc60 rebalance_domains+0x26c/0x380 _nohz_idle_balance.isra.0+0x1e0/0x370 run_rebalance_domains+0x6c/0x80 __do_softirq+0x128/0x3d8 ____do_softirq+0x18/0x24 call_on_irq_stack+0x2c/0x38 do_softirq_own_stack+0x24/0x3c __irq_exit_rcu+0xcc/0xf4 irq_exit_rcu+0x18/0x24 el1_interrupt+0x4c/0xe4 el1h_64_irq_handler+0x18/0x2c el1h_64_irq+0x74/0x78 arch_cpu_idle+0x18/0x4c default_idle_call+0x58/0x194 do_idle+0x244/0x2b0 cpu_startup_entry+0x30/0x3c secondary_start_kernel+0x14c/0x190 __secondary_switched+0xb0/0xb4 ---[ end trace 0000000000000000 ]--- Further investigation shows that the warning is superfluous, the migration disabled task is just going to be migrated to its current running CPU. This is because that on load balance if the dst_cpu is not allowed by the task, we'll re-select a new_dst_cpu as a candidate. If no task can be balanced to dst_cpu we'll try to balance the task to the new_dst_cpu instead. In this case when the migration disabled task is not on CPU it only allows to run on its current CPU, load balance will select its current CPU as new_dst_cpu and later triggers the warning above. The new_dst_cpu is chosen from the env->dst_grpmask. Currently it contains CPUs in sched_group_span() and if we have overlapped groups it's possible to run into this case. This patch makes env->dst_grpmask of group_balance_mask() which exclude any CPUs from the busiest group and solve the issue. For balancing in a domain with no overlapped groups the behaviour keeps same as before. Suggested-by:
Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by:
Yicong Yang <yangyicong@hisilicon.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230530082507.10444-1-yangyicong@huawei.com
-
- 30 May, 2023 6 commits
-
-
Miaohe Lin authored
The default deadline bandwidth control structure has been removed since commit eb77cf1c ("sched/deadline: Remove unused def_dl_bandwidth") leading to unused init_dl_bandwidth() and struct dl_bandwidth. Remove them to clean up the code. Signed-off-by:
Miaohe Lin <linmiaohe@huawei.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20230524102514.407486-1-linmiaohe@huawei.com
-
Arnd Bergmann authored
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED, and empty one that is only referenced when FAIR_GROUP_SCHED is disabled but CGROUP_SCHED is still enabled. If both are turned off, the functions are still defined but the misisng prototype causes a W=1 warning: kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group' kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group' kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group' kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group' Move the alternatives into the header as static inline functions with the correct combination of #ifdef checks to avoid the warning without adding even more complexity. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-6-arnd@kernel.org
-
Arnd Bergmann authored
Having the prototype next to the caller but not visible to the callee causes a W=1 warning: kernel/sched/fair.c:11985:6: error: no previous prototype for 'task_vruntime_update' [-Werror=missing-prototypes] Move this to a header, as we do for all other function declarations. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-5-arnd@kernel.org
-
Arnd Bergmann authored
init_cfs_bandwidth() is only used when CONFIG_FAIR_GROUP_SCHED is enabled, and without this causes a W=1 warning for the missing prototype: kernel/sched/fair.c:6131:6: error: no previous prototype for 'init_cfs_bandwidth' The normal implementation is only defined for CONFIG_CFS_BANDWIDTH, so the stub exists when CFS_BANDWIDTH is disabled but FAIR_GROUP_SCHED is enabled. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-4-arnd@kernel.org
-
Arnd Bergmann authored
The schedule_user() function is used on powerpc and sparc architectures, but only ever called from assembler, so it has no prototype, causing a harmless W=1 warning: kernel/sched/core.c:6730:35: error: no previous prototype for 'schedule_user' [-Werror=missing-prototypes] Add a prototype in sched/sched.h to shut up the warning. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-3-arnd@kernel.org
-
Arnd Bergmann authored
This function is only used when CONFIG_SMP is enabled, without that there is no caller and no prototype: kernel/sched/fair.c:688:5: error: no previous prototype for 'sched_update_scaling' [-Werror=missing-prototypes Hide the definition in the same #ifdef check as the declaration. Fixes: 8a99b683 ("sched: Move SCHED_DEBUG sysctl to debugfs") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230522195021.3456768-2-arnd@kernel.org
-
- 20 May, 2023 1 commit
-
-
Yang Yang authored
Psi_group's poll_min_period is determined by the minimum window size of psi_trigger when creating new triggers. While destroying a psi_trigger, there is no need to reset poll_min_period if the psi_trigger being destroyed did not have the minimum window size, since in this condition poll_min_period will remain the same as before. Signed-off-by:
Yang Yang <yang.yang29@zte.com.cn> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Suren Baghdasaryan <surenb@google.com> Link: https://lkml.kernel.org/r/20230514163338.834345-1-surenb@google.com
-
- 08 May, 2023 5 commits
-
-
晏艳(采苓) authored
Commit e6fe3f42 ("sched: Make multiple runqueue task counters 32-bit") changed the type for rq->nr_uninterruptible from "unsigned long" to "unsigned int", but left wrong cast print to /sys/kernel/debug/sched/debug and to the console. For example, nr_uninterruptible's value is fffffff7 with type "unsigned int", (long)nr_uninterruptible shows 4294967287 while (int)nr_uninterruptible prints -9. So using int cast fixes wrong printing. Signed-off-by:
Yan Yan <yanyan.yan@antgroup.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230506074253.44526-1-yanyan.yan@antgroup.com
-
Tim C Chen authored
When a degenerate cluster domain for core with SMT CPUs is removed, the SD_SHARE_CPUCAPACITY flag in the local child sched group was not propagated to the new parent. We need this flag to properly determine whether the local sched group is SMT. Set the flag in the local child sched group of the new parent sched domain. Signed-off-by:
Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Link: https://lkml.kernel.org/r/73cf0959eafa53c02e7ef6bf805d751d9190e55d.1683156492.git.tim.c.chen@linux.intel.com
-
Suren Baghdasaryan authored
Current 500ms min window size for psi triggers limits polling interval to 50ms to prevent polling threads from using too much cpu bandwidth by polling too frequently. However the number of cgroups with triggers is unlimited, so this protection can be defeated by creating multiple cgroups with psi triggers (triggers in each cgroup are served by a single "psimon" kernel thread). Instead of limiting min polling period, which also limits the latency of psi events, it's better to limit psi trigger creation to authorized users only, like we do for system-wide psi triggers (/proc/pressure/* files can be written only by processes with CAP_SYS_RESOURCE capability). This also makes access rules for cgroup psi files consistent with system-wide ones. Add a CAP_SYS_RESOURCE capability check for cgroup psi file writers and remove the psi window min size limitation. Suggested-by:
Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> Signed-off-by:
Suren Baghdasaryan <surenb@google.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Michal Hocko <mhocko@suse.com> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/cover.1676067791.git.quic_sudaraja@quicinc.com/
-
Chen Yu authored
Intel Meteor Lake hybrid processors have cores in two separate dies. The cores in one of the dies have higher maximum frequency. Use the SD_ASYM_ PACKING flag to give higher priority to the die with CPUs of higher maximum frequency. Suggested-by:
Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by:
Chen Yu <yu.c.chen@intel.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230406203148.19182-13-ricardo.neri-calderon@linux.intel.com
-
Ricardo Neri authored
X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler knows how to handle SMT and non-SMT cores of different priority. There is no reason for SMT siblings of a core to have different priorities. Signed-off-by:
Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Len Brown <len.brown@intel.com> Tested-by:
Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230406203148.19182-12-ricardo.neri-calderon@linux.intel.com
-