- 04 Sep, 2024 13 commits
-
-
Tejun Heo authored
This patch adds scx_flatcg example scheduler which implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer by compounding the active weight share at each level. This flattening of hierarchy can bring a substantial performance gain when the cgroup hierarchy is nested multiple levels. in a simple benchmark using wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two apache instances competing with 2:1 weight ratio nested four level deep. However, the gain comes at the cost of not being able to properly handle thundering herd of cgroups. For example, if many cgroups which are nested behind a low priority parent cgroup wake up around the same time, they may be able to consume more CPU cycles than they are entitled to. In many use cases, this isn't a real concern especially given the performance gain. Also, there are ways to mitigate the problem further by e.g. introducing an extra scheduling layer on cgroup delegation boundaries. v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of SCX_OPS_KNOB_CGROUP_WEIGHT. v4: - Revert reference counted kptr for cgv_node as the change caused easily reproducible stalls. v3: - Updated to reflect the core API changes including ops.init/exit_task() and direct dispatch from ops.select_cpu(). Fixes and improvements including additional statistics. - Use reference counted kptr for cgv_node instead of xchg'ing against stash location. - Dropped '-p' option. v2: - Use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
-
Tejun Heo authored
Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement or implement only subset of cgroup features. The implemented features can be indicated using %SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features that are not implemented, a warning is triggered. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v7: - cgroup interface file visibility toggling is dropped in favor just warning messages. Dynamically changing interface visiblity caused more confusion than helping. v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE. - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED. v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid locking order conflict w/ cpuset. Better documentation around locking. - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations so that its invocations always match ops.cgroup_move() one-to-one. v4: - Example schedulers moved into their own patches. - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: - Make scx_example_pair switch all tasks by default. - Convert to BPF inline iterators. - scx_bpf_task_cgroup() is added to determine the current cgroup from CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. - scx_example_flatcg added. This demonstrates flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: - Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrea Righi <andrea.righi@canonical.com>
-
Tejun Heo authored
sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or SCX may implement the feature, put the interface code behind the new CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED. This allows either sched class to enable the itnerface code without ading more complex CONFIG tests. When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares() is added to support later CONFIG_CGROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED builds. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
Move tg_weight() upward and make cpu_shares_read_u64() use it too. This makes the weight retrieval shared between cgroup v1 and v2 paths and will be used to implement cgroup support for sched_ext. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
A new BPF extensible sched_class will use css_tg() in the init and exit paths to visit all task_groups by walking cgroups. v4: __setscheduler_prio() is already exposed. Dropped from this patch. v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup mechanism. v2: Expose SCHED_CHANGE_BLOCK() too and update the description. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
-
Tejun Heo authored
During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task() on every task. To do this, it does get_task_struct() on each iterated task, drop the lock and then call ops.init_task(). However, a TASK_DEAD task may already have lost all its usage count and be waiting for RCU grace period to be freed. If get_task_struct() is called on such task, use-after-free can happen. To avoid such situations, scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe as they are never going to be scheduled again. Unfortunately, a racing sched_setscheduler(2) can grab the task before the task is unhashed and then continue to e.g. move the task from RT to SCX after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't gone through scx_ops_init_task(), scx_ops_enable_task() called from switching_to_scx() triggers the following warning: sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872] WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0 ... RIP: 0010:scx_ops_enable_task+0x18f/0x1f0 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x84e/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e As in the ops_disable path, it just doesn't seem like a good idea to leave any task in an inconsistent state, even when the task is dead. The root cause is ops_enable not being able to tell reliably whether a task is truly dead (no one else is looking at it and it's about to be freed) and was testing TASK_DEAD instead. Fix it by testing the task's usage count directly. - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all tasks, @include_dead is removed from scx_task_iter_next_locked() along with dead task filtering. - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct() fails. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
Tejun Heo authored
scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while calling scx_ops_exit_task() on all tasks including dead ones. This can leave a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent. If another task was in the process of changing the TASK_DEAD task's scheduling class and grabs the rq lock after scx_ops_disable_workfn() is done with the task, the task ends up calling scx_ops_disable_task() on the dead task which is in an inconsistent state triggering a warning: WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160 ... RIP: 0010:scx_ops_disable_task+0x12c/0x160 ... Call Trace: <TASK> check_class_changed+0x2c/0x70 __sched_setscheduler+0x8a0/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f140d70ea5b There is no reason to leave dead tasks on SCX when unloading the BPF scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including the dead ones from SCX. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
With sched_ext converted to use put_prev_task() for class switch detection, there's no user of switch_class() left. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
-
Tejun Heo authored
Now that put_prev_task_scx() is called with @next on task switches, there's no reason to use sched_class.switch_class(). Rename switch_class_scx() to switch_class() and call it from put_prev_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
Relocate functions to ease the removal of switch_class_scx(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
Because the BPF scheduler's dispatch path is invoked from balance(), sched_ext needs to invoke balance_one() on all sibling rq's before picking the next task for core-sched. Before the recent pick_next_task() updates, sched_ext couldn't share pick task between regular and core-sched paths because pick_next_task() depended on put_prev_task() being called on the current task. Tasks currently running on sibling rq's can't be put when one rq is trying to pick the next task, so pick_task_scx() had to have a separate mechanism to pick between a sibling rq's current task and the first task in its local DSQ. However, with the preceding updates, pick_next_task_scx() no longer depends on the current task being put and can compare the current task and the next in line statelessly, and the pick task logic should be shareable between regular and core-sched paths. Unify regular and core-sched pick task paths: - There's no reason to distinguish local and sibling picks anymore. @local is removed from balance_one(). - pick_next_task_scx() is turned into pick_task_scx() by dropping the put_prev_set_next_task() call. - The old pick_task_scx() is dropped. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to keep running the current task. It's not really a task property. Replace it with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for the usage. Also, the existing clearing rule is unnecessarily strict and makes it difficult to use with core-sched. Just clear it on entry to balance_one(). Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
fd03c5b8 ("sched: Rework pick_next_task()") changed the definition of pick_next_task() from: pick_next_task() := pick_task() + set_next_task(.first = true) to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) making invoking put_prev_task() pick_next_task()'s responsibility. This reordering allows pick_task() to be shared between regular and core-sched paths and put_prev_task() to know the next task. sched_ext depended on put_prev_task_scx() enqueueing the current task before pick_next_task_scx() is called. While pulling sched/core changes, 70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx() before picking the first task as a workaround. Clean it up and adopt the conventions that other sched classes are following. The operation of keeping running the current task was spread and required the task to be put on the local DSQ before picking: - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still runnable, hasn't exhausted its slice, and thus should keep running. - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the only runnable task. do_enqueue_task() in turn decided whether to use the local DSQ depending on SCX_OPS_ENQ_LAST. Consolidate the logic in balance_one() as it always knows whether it is going to keep the current task. balance_one() now considers all conditions where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell pick_next_task_scx() to keep the current task instead of picking one from the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is updated to pick the current task if SCX_TASK_BAL_KEEP is set. The workaround put_prev_task[_scx]() calls are replaced with put_prev_set_next_task(). This causes two behavior changes observable from the BPF scheduler: - When a task keep running, it no longer goes through enqueue/dequeue cycle and thus ops.stopping/running() transitions. The new behavior is better and all the existing schedulers should be able to handle the new behavior. - The BPF scheduler cannot keep executing the current task by enqueueing SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the BPF scheduler is responsible for resuming execution after each SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling decisions are not made on the local CPU - e.g. central or userspace-driven schedulin - and the new behavior is more logical and shouldn't pose any problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it doesn't fit that well anymore and the last task handling is moved to the end of qmap_dispatch(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
-
- 03 Sep, 2024 11 commits
-
-
Tejun Heo authored
- Resolve trivial context conflicts from dl_server clearing being moved around. - Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to match sched/core. - Merge sched_class->switch_class() addition from sched_ext with tip/sched/core changes in __pick_next_task(). - Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous behavior where sched_class->put_prev_task() was called before sched_class->pick_next_task(). While this makes sched_ext build and function, the behavior is not in line with other sched classes. The follow-up patches will address the discrepancies and remove sched_class->switch_class(). Signed-off-by: Tejun Heo <tj@kernel.org>
-
Peter Zijlstra authored
In order to tell the previous sched_class what the next task is, add put_prev_task(.next). Notable SCX will use this to: 1) determine the next task will leave the SCX sched class and push the current task to another CPU if possible. 2) statistics on how often and which other classes preempt it Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
-
Peter Zijlstra authored
When a task is selected through a dl_server, it will have p->dl_server set, such that it can account runtime to the dl_server, see update_curr_task(). Currently p->dl_server is set in pick*task() whenever it goes through the dl_server, clearing it is a bit of a mess though. The trivial solution is clearing it on the final put (now that we have this location). However, this gives a problem when: p = pick_task(rq); if (p) put_prev_set_next_task(rq, prev, next); picks the same task but through a different path, notably when it goes from picking through the dl_server to a direct pick or vice-versa. In that case we cannot readily determine wether we should clear or preserve p->dl_server. An additional complication is pick_*task() setting p->dl_server for a remote pick, it might still need to update runtime before it schedules the core_pick. Close all these holes and remove all the random clearing of p->dl_server by: - having pick_*task() manage rq->dl_server - having the final put_prev_task() clear p->dl_server - having the first set_next_task() set p->dl_server = rq->dl_server - complicate the core_sched code to save/restore rq->dl_server where appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
-
Peter Zijlstra authored
Ensure the last put_prev_task() and the first set_next_task() always go together. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
-
Peter Zijlstra authored
The current rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) And many classes implement it directly as such. Change things around to make pick_next_task() optional while also changing the definition to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) The reason is that sched_ext would like to have a 'final' call that knows the next task. By placing put_prev_task() right next to set_next_task() (as it already is for sched_core) this becomes trivial. As a bonus, this is a nice cleanup on its own. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
-
Peter Zijlstra authored
With the goal of pushing put_prev_task() after pick_task() / into pick_next_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.943143811@infradead.org
-
Peter Zijlstra authored
Abide by the simple rule: pick_next_task() := pick_task() + set_next_task(.first = true) This allows us to trivially get rid of server_pick_next() and things collapse nicely. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.837303391@infradead.org
-
Peter Zijlstra authored
The rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) Turns out, there's still a few things in pick_next_task() that are missing from that combination. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.724111109@infradead.org
-
Peter Zijlstra authored
Turns out the core_sched bits forgot to use the set_next_task(.first=true) variant. Notably: pick_next_task() := pick_task() + set_next_task(.first = true) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org
-
Valentin Schneider authored
__sched_setscheduler() goes through an enqueue/dequeue cycle like so: flags := DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; prev_class->dequeue_task(rq, p, flags); new_class->enqueue_task(rq, p, flags); when prev_class := fair_sched_class, this is followed by: dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP); the idea being that since the task has switched classes, we need to drop the sched_delayed logic and have that task be deactivated per its previous dequeue_task(..., DEQUEUE_SLEEP). Unfortunately, this leaves the task on_rq. This is missing the tail end of dequeue_entities() that issues __block_task(), which __sched_setscheduler() won't have done due to not using DEQUEUE_DELAYED - not that it should, as it is pretty much a fair_sched_class specific thing. Make switched_from_fair() properly deactivate sched_delayed tasks upon class changes via __block_task(), as if a dequeue_task(..., DEQUEUE_DELAYED) had been issued. Fixes: 2e0199df ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Reported-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240829135353.1524260-1-vschneid@redhat.com
-
Huang Shijie authored
In dl_server_start(), when schedstats is enabled, the following happens: dl_server_start() dl_se->dl_server = 1; enqueue_dl_entity() update_stats_enqueue_dl() __schedstats_from_dl_se() dl_task_of() BUG_ON(dl_server(dl_se)); Since only tasks have schedstats and internal entries do not, avoid trying to update stats in this case. Fixes: 63ba8422 ("sched/deadline: Introduce deadline servers") Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20240829031111.12142-1-shijie@os.amperecomputing.com
-
- 31 Aug, 2024 2 commits
-
-
Tejun Heo authored
Since 3cf78c5d ("sched_ext: Unpin and repin rq lock from balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost function. This is simpler and in line with what other balance functions are doing but it loses control over rq->clock_update_flags which makes assert_clock_udpated() trigger if other CPUs pins the rq lock. The only place this matters is touch_core_sched() which uses the timestamp to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may be better to use per-core dispatch sequence number. v2: Use sched_clock_cpu() instead of ktime_get_ns() per David. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 3cf78c5d ("sched_ext: Unpin and repin rq lock from balance_scx()") Acked-by: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
Tejun Heo authored
When deciding whether a task can be migrated to a CPU, dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online() tests instead of using task_can_run_on_remote_rq(). This had two problems. - It was missing is_migration_disabled() check and thus could try to migrate a task which shouldn't leading to assertion and scheduling failures. - It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu() and thus failed to consider ISA compatibility. Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq(): - Move scx_ops_error() triggering into task_can_run_on_remote_rq(). - When migration isn't allowed, fall back to the global DSQ instead of the source DSQ by returning DTL_INVALID. This is both simpler and an overall better behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
-
- 28 Aug, 2024 1 commit
-
-
Tejun Heo authored
The cfi stub for ops.tick was missing which will fail scheduler loading after pending BPF changes. Add it. Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 27 Aug, 2024 1 commit
-
-
Tejun Heo authored
ARRAY_ELEM_PTR() is an access macro used to help the BPF verifier not confused by offseted memory acceeses by yiedling a valid pointer or NULL in a way that's clear to the verifier. As such, the canonical usage involves checking NULL return from the macro. Note that in many cases, the NULL condition can never happen - they're there just to hint the verifier. In a bpf_loop in scx_central.bpf.c::central_dispatch(), the NULL check was incorrect in that there was another dereference of the pointer in addition to the NULL checked access. This worked as the pointer can never be NULL and the verifier could tell it would never be NULL in this case. However, this still looks wrong and trips smatch: ./tools/sched_ext/scx_central.bpf.c:205 ____central_dispatch() error: we previously assumed 'gimme' could be null (see line 201) ./tools/sched_ext/scx_central.bpf.c 195 196 if (!scx_bpf_dispatch_nr_slots()) 197 break; 198 199 /* central's gimme is never set */ 200 gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); 201 if (gimme && !*gimme) ^^^^^ If gimme is NULL 202 continue; 203 204 if (dispatch_to_cpu(cpu)) --> 205 *gimme = false; Fix the NULL check so that there are no derefs if NULL. This doesn't change actual behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: http://lkml.kernel.org/r/<955e1c3c-ace2-4a1d-b246-15b8196038a3@stanley.mountain>
-
- 20 Aug, 2024 2 commits
-
-
Yipeng Zou authored
Since dequeue_task() allowed to fail, there is a compile error: kernel/sched/ext.c:3630:19: error: initialization of ‘bool (*)(struct rq*, struct task_struct *, int)’ {aka ‘_Bool (*)(struct rq *, struct task_struct *, int)’} from incompatible pointer type ‘void (*)(struct rq*, struct task_struct *, int)’ 3630 | .dequeue_task = dequeue_task_scx, | ^~~~~~~~~~~~~~~~ Allow dequeue_task_scx to fail too. Fixes: 863ccdbb ("sched: Allow sched_class::dequeue_task() to fail") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
To receive 863ccdbb ("sched: Allow sched_class::dequeue_task() to fail") which makes sched_class.dequeue_task() return bool instead of void. This leads to compile breakage and will be fixed by a follow-up patch. Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 17 Aug, 2024 10 commits
-
-
Peter Zijlstra authored
In the absence of an explicit cgroup slice configureation, make mixed slice length work with cgroups by propagating the min_slice up the hierarchy. This ensures the cgroup entity gets timely service to service its entities that have this timing constraint set on them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org
-
Peter Zijlstra authored
Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime. The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100. Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency. For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4: {A,B,C,D}(w=1,r=8): ABCD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+*------+-------+--- ---+--*----+-------+--- t=2, V=5.5 t=3, V=7.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+----*--+-------+--- ---+------*+-------+--- Note: 4 identical tasks in FIFO order ~~~ {A,B}(w=1,r=16) C(w=2,r=16) AACCBBCC... +---+---+---+--- t=0, V=1.25 t=2, V=5.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=4, V=8.25 t=6, V=12.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+-------*-------+--- ---+-------+---*---+--- Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q. Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length. Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q ~~~ {A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |--------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.5 t=5, V=11.5 A |---------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.5 A |---------------< B |------< C |--------------< D |--------------< ---+-------+----*--+--- Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because its not the leftmost task, but is eligible with the 0,1,2,3 spread. Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q ~~~ A(w=1,r=16) B(w=1,r=8) C(w=2,r=16) BCCAABCC... +---+---+---+--- t=0, V=1.25 t=1, V=3.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.25 t=5, V=11.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.25 A |--------------< B |------< C |------< ---+-------+----*--+--- Note: 1 heavy and 1 short task -- combine them all. Note: both the short and heavy task end up with a period of 4q ~~~ A(w=1,r=16) B(w=2,r=16) C(w=1,r=8) BBCAABBC... +---+---+---+--- t=0, V=1 t=2, V=5 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=3, V=7 t=5, V=11 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=7, V=15 A |--------------< B |------< C |------< ---+-------+------*+--- Note: as before but permuted ~~~ From all this it can be deduced that, for the steady state: - the total period (P) of a schedule is: W*max(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org
-
Peter Zijlstra authored
Part of the reason to have shorter slices is to improve responsiveness. Allow shorter slices to preempt longer slices on wakeup. Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms | 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms | 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms | 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms | 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms | 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms | 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms | 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms | 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms | 100ms massive_intr 500us cyclictest PREEMPT_SHORT 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms | 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms | 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms | 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms | 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms | 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms | 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms | 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms | 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms | As per the numbers the, this makes cyclictest (short slice) it's max-delay more consistent and consistency drops the sum-delay. The trade-off is that the massive_intr (long slice) gets more context switches and a slight increase in sum-delay. Chunxin contributed did_preempt_short() where a task that lost slice protection from PREEMPT_SHORT gets rescheduled once it becomes in-eligible. [mike: numbers] Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org
-
Peter Zijlstra authored
During OSPM24 Youssef noted that migrations are re-setting the virtual deadline. Notably everything that does a dequeue-enqueue, like setting nice, changing preferred numa-node, and a myriad of other random crap, will cause this to happen. This shouldn't be. Preserve the relative virtual deadline across such dequeue/enqueue cycles. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.625119246@infradead.org
-
Peter Zijlstra authored
Note that tasks that are kept on the runqueue to burn off negative lag, are not in fact runnable anymore, they'll get dequeued the moment they get picked. As such, don't count this time towards runnable. Thanks to Valentin for spotting I had this backwards initially. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.514088302@infradead.org
-
Peter Zijlstra authored
'Extend' DELAY_DEQUEUE by noting that since we wanted to dequeued them at the 0-lag point, truncate lag (eg. don't let them earn positive lag). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.403750550@infradead.org
-
Peter Zijlstra authored
Extend / fix 86bfbb7c ("sched/fair: Add lag based placement") by noting that lag is fundamentally a temporal measure. It should not be carried around indefinitely. OTOH it should also not be instantly discarded, doing so will allow a task to game the system by purposefully (micro) sleeping at the end of its time quantum. Since lag is intimately tied to the virtual time base, a wall-time based decay is also insufficient, notably competition is required for any of this to make sense. Instead, delay the dequeue and keep the 'tasks' on the runqueue, competing until they are eligible. Strictly speaking, we only care about keeping them until the 0-lag point, but that is a difficult proposition, instead carry them around until they get picked again, and dequeue them at that point. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org
-
Peter Zijlstra authored
Since special task states must not suffer spurious wakeups, and the proposed delayed dequeue can cause exactly these (under some boundary conditions), propagate this knowledge into dequeue_task() such that it can do the right thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org
-
Peter Zijlstra authored
The special task states are those that do not suffer spurious wakeups, TASK_FROZEN is very much one of those, mark it as such. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.998329901@infradead.org
-
Peter Zijlstra authored
Doing a wakeup on a delayed dequeue task is about as simple as it sounds -- remove the delayed mark and enjoy the fact it was actually still on the runqueue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.888107381@infradead.org
-