Commit bb896afe authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'for-linus' of...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes:
  sched: default to n for GROUP_SCHED and FAIR_GROUP_SCHED
  sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
  sched, x86: add HAVE_UNSTABLE_SCHED_CLOCK
  sched: fix cpu clock
  sched: fair-group: fix a Div0 error of the fair group scheduler
  sched: fix missing locking in sched_domains code
  sched: make clock sync tunable by architecture code
  sched: fix debugging
  sched: fix sched_info_switch not being called according to documentation
  sched: fix hrtick_start_fair and CPU-Hotplug
  sched: fix SCHED_FAIR wake-idle logic error
  sched: fix RT task-wakeup logic
  sched: add statics, don't return void expressions
  sched: add debug checks to idle functions
  sched: remove old sched doc
  sched: make rt_sched_class, idle_sched_class static
  sched: optimize calc_delta_mine()
  sched: fix normalized sleeper
parents 2e83fc4d aac6abca
Goals, Design and Implementation of the
new ultra-scalable O(1) scheduler
This is an edited version of an email Ingo Molnar sent to
lkml on 4 Jan 2002. It describes the goals, design, and
implementation of Ingo's new ultra-scalable O(1) scheduler.
Last Updated: 18 April 2002.
Goal
====
The main goal of the new scheduler is to keep all the good things we know
and love about the current Linux scheduler:
- good interactive performance even during high load: if the user
types or clicks then the system must react instantly and must execute
the user tasks smoothly, even during considerable background load.
- good scheduling/wakeup performance with 1-2 runnable processes.
- fairness: no process should stay without any timeslice for any
unreasonable amount of time. No process should get an unjustly high
amount of CPU time.
- priorities: less important tasks can be started with lower priority,
more important tasks with higher priority.
- SMP efficiency: no CPU should stay idle if there is work to do.
- SMP affinity: processes which run on one CPU should stay affine to
that CPU. Processes should not bounce between CPUs too frequently.
- plus additional scheduler features: RT scheduling, CPU binding.
and the goal is also to add a few new things:
- fully O(1) scheduling. Are you tired of the recalculation loop
blowing the L1 cache away every now and then? Do you think the goodness
loop is taking a bit too long to finish if there are lots of runnable
processes? This new scheduler takes no prisoners: wakeup(), schedule(),
the timer interrupt are all O(1) algorithms. There is no recalculation
loop. There is no goodness loop either.
- 'perfect' SMP scalability. With the new scheduler there is no 'big'
runqueue_lock anymore - it's all per-CPU runqueues and locks - two
tasks on two separate CPUs can wake up, schedule and context-switch
completely in parallel, without any interlocking. All
scheduling-relevant data is structured for maximum scalability.
- better SMP affinity. The old scheduler has a particular weakness that
causes the random bouncing of tasks between CPUs if/when higher
priority/interactive tasks, this was observed and reported by many
people. The reason is that the timeslice recalculation loop first needs
every currently running task to consume its timeslice. But when this
happens on eg. an 8-way system, then this property starves an
increasing number of CPUs from executing any process. Once the last
task that has a timeslice left has finished using up that timeslice,
the recalculation loop is triggered and other CPUs can start executing
tasks again - after having idled around for a number of timer ticks.
The more CPUs, the worse this effect.
Furthermore, this same effect causes the bouncing effect as well:
whenever there is such a 'timeslice squeeze' of the global runqueue,
idle processors start executing tasks which are not affine to that CPU.
(because the affine tasks have finished off their timeslices already.)
The new scheduler solves this problem by distributing timeslices on a
per-CPU basis, without having any global synchronization or
recalculation.
- batch scheduling. A significant proportion of computing-intensive tasks
benefit from batch-scheduling, where timeslices are long and processes
are roundrobin scheduled. The new scheduler does such batch-scheduling
of the lowest priority tasks - so nice +19 jobs will get
'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
in essence SCHED_IDLE, from an interactiveness point of view.
- handle extreme loads more smoothly, without breakdown and scheduling
storms.
- O(1) RT scheduling. For those RT folks who are paranoid about the
O(nr_running) property of the goodness loop and the recalculation loop.
- run fork()ed children before the parent. Andrea has pointed out the
advantages of this a few months ago, but patches for this feature
do not work with the old scheduler as well as they should,
because idle processes often steal the new child before the fork()ing
CPU gets to execute it.
Design
======
The core of the new scheduler contains the following mechanisms:
- *two* priority-ordered 'priority arrays' per CPU. There is an 'active'
array and an 'expired' array. The active array contains all tasks that
are affine to this CPU and have timeslices left. The expired array
contains all tasks which have used up their timeslices - but this array
is kept sorted as well. The active and expired array is not accessed
directly, it's accessed through two pointers in the per-CPU runqueue
structure. If all active tasks are used up then we 'switch' the two
pointers and from now on the ready-to-go (former-) expired array is the
active array - and the empty active array serves as the new collector
for expired tasks.
- there is a 64-bit bitmap cache for array indices. Finding the highest
priority task is thus a matter of two x86 BSFL bit-search instructions.
the split-array solution enables us to have an arbitrary number of active
and expired tasks, and the recalculation of timeslices can be done
immediately when the timeslice expires. Because the arrays are always
access through the pointers in the runqueue, switching the two arrays can
be done very quickly.
this is a hybride priority-list approach coupled with roundrobin
scheduling and the array-switch method of distributing timeslices.
- there is a per-task 'load estimator'.
one of the toughest things to get right is good interactive feel during
heavy system load. While playing with various scheduler variants i found
that the best interactive feel is achieved not by 'boosting' interactive
tasks, but by 'punishing' tasks that want to use more CPU time than there
is available. This method is also much easier to do in an O(1) fashion.
to establish the actual 'load' the task contributes to the system, a
complex-looking but pretty accurate method is used: there is a 4-entry
'history' ringbuffer of the task's activities during the last 4 seconds.
This ringbuffer is operated without much overhead. The entries tell the
scheduler a pretty accurate load-history of the task: has it used up more
CPU time or less during the past N seconds. [the size '4' and the interval
of 4x 1 seconds was found by lots of experimentation - this part is
flexible and can be changed in both directions.]
the penalty a task gets for generating more load than the CPU can handle
is a priority decrease - there is a maximum amount to this penalty
relative to their static priority, so even fully CPU-bound tasks will
observe each other's priorities, and will share the CPU accordingly.
the SMP load-balancer can be extended/switched with additional parallel
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
can be supported easily by changing the load-balancer. Right now it's
tuned for my SMP systems.
i skipped the prev->mm == next->mm advantage - no workload i know of shows
any sensitivity to this. It can be added back by sacrificing O(1)
schedule() [the current and one-lower priority list can be searched for a
that->mm == current->mm condition], but costs a fair number of cycles
during a number of important workloads, so i wanted to avoid this as much
as possible.
- the SMP idle-task startup code was still racy and the new scheduler
triggered this. So i streamlined the idle-setup code a bit. We do not call
into schedule() before all processors have started up fully and all idle
threads are in place.
- the patch also cleans up a number of aspects of sched.c - moves code
into other areas of the kernel where it's appropriate, and simplifies
certain code paths and data constructs. As a result, the new scheduler's
code is smaller than the old one.
Ingo
......@@ -18,6 +18,7 @@ config X86_64
### Arch settings
config X86
def_bool y
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_KPROBES
......
......@@ -158,6 +158,8 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
}
#endif
extern unsigned long long time_sync_thresh;
/*
* Task state bitmask. NOTE! These bits are also
* encoded in fs/proc/array.c: get_task_state().
......@@ -1551,6 +1553,35 @@ static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
extern unsigned long long sched_clock(void);
#ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
static inline void sched_clock_init(void)
{
}
static inline u64 sched_clock_cpu(int cpu)
{
return sched_clock();
}
static inline void sched_clock_tick(void)
{
}
static inline void sched_clock_idle_sleep_event(void)
{
}
static inline void sched_clock_idle_wakeup_event(u64 delta_ns)
{
}
#else
extern void sched_clock_init(void);
extern u64 sched_clock_cpu(int cpu);
extern void sched_clock_tick(void);
extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
#endif
/*
* For kernel-internal use: high-speed (but slightly incorrect) per-cpu
* clock constructed from sched_clock():
......@@ -1977,6 +2008,11 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}
static inline int test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
static inline int signal_pending(struct task_struct *p)
{
return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
......@@ -1991,7 +2027,7 @@ static inline int fatal_signal_pending(struct task_struct *p)
static inline int need_resched(void)
{
return unlikely(test_thread_flag(TIF_NEED_RESCHED));
return unlikely(test_tsk_need_resched(current));
}
/*
......
......@@ -316,9 +316,16 @@ config CPUSETS
Say N if unsure.
#
# Architectures with an unreliable sched_clock() should select this:
#
config HAVE_UNSTABLE_SCHED_CLOCK
bool
config GROUP_SCHED
bool "Group CPU scheduler"
default y
depends on EXPERIMENTAL
default n
help
This feature lets CPU scheduler recognize task groups and control CPU
bandwidth allocation to such task groups.
......@@ -326,7 +333,7 @@ config GROUP_SCHED
config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
depends on GROUP_SCHED
default y
default GROUP_SCHED
config RT_GROUP_SCHED
bool "Group scheduling for SCHED_RR/FIFO"
......
......@@ -602,6 +602,7 @@ asmlinkage void __init start_kernel(void)
softirq_init();
timekeeping_init();
time_init();
sched_clock_init();
profile_init();
if (!irqs_disabled())
printk("start_kernel(): bug: interrupts were enabled early\n");
......
......@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
notifier.o ksysfs.o pm_qos_params.o
notifier.o ksysfs.o pm_qos_params.o sched_clock.o
obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
......
This diff is collapsed.
/*
* sched_clock for unstable cpu clocks
*
* Copyright (C) 2008 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
*
* Based on code by:
* Ingo Molnar <mingo@redhat.com>
* Guillaume Chazarain <guichaz@gmail.com>
*
* Create a semi stable clock from a mixture of other events, including:
* - gtod
* - jiffies
* - sched_clock()
* - explicit idle events
*
* We use gtod as base and the unstable clock deltas. The deltas are filtered,
* making it monotonic and keeping it within an expected window. This window
* is set up using jiffies.
*
* Furthermore, explicit sleep and wakeup hooks allow us to account for time
* that is otherwise invisible (TSC gets stopped).
*
* The clock: sched_clock_cpu() is monotonic per cpu, and should be somewhat
* consistent between cpus (never more than 1 jiffies difference).
*/
#include <linux/sched.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/ktime.h>
#include <linux/module.h>
#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
struct sched_clock_data {
/*
* Raw spinlock - this is a special case: this might be called
* from within instrumentation code so we dont want to do any
* instrumentation ourselves.
*/
raw_spinlock_t lock;
unsigned long prev_jiffies;
u64 prev_raw;
u64 tick_raw;
u64 tick_gtod;
u64 clock;
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, sched_clock_data);
static inline struct sched_clock_data *this_scd(void)
{
return &__get_cpu_var(sched_clock_data);
}
static inline struct sched_clock_data *cpu_sdc(int cpu)
{
return &per_cpu(sched_clock_data, cpu);
}
void sched_clock_init(void)
{
u64 ktime_now = ktime_to_ns(ktime_get());
u64 now = 0;
int cpu;
for_each_possible_cpu(cpu) {
struct sched_clock_data *scd = cpu_sdc(cpu);
scd->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
scd->prev_jiffies = jiffies;
scd->prev_raw = now;
scd->tick_raw = now;
scd->tick_gtod = ktime_now;
scd->clock = ktime_now;
}
}
/*
* update the percpu scd from the raw @now value
*
* - filter out backward motion
* - use jiffies to generate a min,max window to clip the raw values
*/
static void __update_sched_clock(struct sched_clock_data *scd, u64 now)
{
unsigned long now_jiffies = jiffies;
long delta_jiffies = now_jiffies - scd->prev_jiffies;
u64 clock = scd->clock;
u64 min_clock, max_clock;
s64 delta = now - scd->prev_raw;
WARN_ON_ONCE(!irqs_disabled());
min_clock = scd->tick_gtod + delta_jiffies * TICK_NSEC;
if (unlikely(delta < 0)) {
clock++;
goto out;
}
max_clock = min_clock + TICK_NSEC;
if (unlikely(clock + delta > max_clock)) {
if (clock < max_clock)
clock = max_clock;
else
clock++;
} else {
clock += delta;
}
out:
if (unlikely(clock < min_clock))
clock = min_clock;
scd->prev_raw = now;
scd->prev_jiffies = now_jiffies;
scd->clock = clock;
}
static void lock_double_clock(struct sched_clock_data *data1,
struct sched_clock_data *data2)
{
if (data1 < data2) {
__raw_spin_lock(&data1->lock);
__raw_spin_lock(&data2->lock);
} else {
__raw_spin_lock(&data2->lock);
__raw_spin_lock(&data1->lock);
}
}
u64 sched_clock_cpu(int cpu)
{
struct sched_clock_data *scd = cpu_sdc(cpu);
u64 now, clock;
WARN_ON_ONCE(!irqs_disabled());
now = sched_clock();
if (cpu != raw_smp_processor_id()) {
/*
* in order to update a remote cpu's clock based on our
* unstable raw time rebase it against:
* tick_raw (offset between raw counters)
* tick_gotd (tick offset between cpus)
*/
struct sched_clock_data *my_scd = this_scd();
lock_double_clock(scd, my_scd);
now -= my_scd->tick_raw;
now += scd->tick_raw;
now -= my_scd->tick_gtod;
now += scd->tick_gtod;
__raw_spin_unlock(&my_scd->lock);
} else {
__raw_spin_lock(&scd->lock);
}
__update_sched_clock(scd, now);
clock = scd->clock;
__raw_spin_unlock(&scd->lock);
return clock;
}
void sched_clock_tick(void)
{
struct sched_clock_data *scd = this_scd();
u64 now, now_gtod;
WARN_ON_ONCE(!irqs_disabled());
now = sched_clock();
now_gtod = ktime_to_ns(ktime_get());
__raw_spin_lock(&scd->lock);
__update_sched_clock(scd, now);
/*
* update tick_gtod after __update_sched_clock() because that will
* already observe 1 new jiffy; adding a new tick_gtod to that would
* increase the clock 2 jiffies.
*/
scd->tick_raw = now;
scd->tick_gtod = now_gtod;
__raw_spin_unlock(&scd->lock);
}
/*
* We are going deep-idle (irqs are disabled):
*/
void sched_clock_idle_sleep_event(void)
{
sched_clock_cpu(smp_processor_id());
}
EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
/*
* We just idled delta nanoseconds (called with irqs disabled):
*/
void sched_clock_idle_wakeup_event(u64 delta_ns)
{
struct sched_clock_data *scd = this_scd();
u64 now = sched_clock();
/*
* Override the previous timestamp and ignore all
* sched_clock() deltas that occured while we idled,
* and use the PM-provided delta_ns to advance the
* rq clock:
*/
__raw_spin_lock(&scd->lock);
scd->prev_raw = now;
scd->clock += delta_ns;
__raw_spin_unlock(&scd->lock);
touch_softlockup_watchdog();
}
EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
#endif
/*
* Scheduler clock - returns current time in nanosec units.
* This is default implementation.
* Architectures and sub-architectures can override this.
*/
unsigned long long __attribute__((weak)) sched_clock(void)
{
return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
}
......@@ -204,13 +204,6 @@ static void print_cpu(struct seq_file *m, int cpu)
PN(next_balance);
P(curr->pid);
PN(clock);
PN(idle_clock);
PN(prev_clock_raw);
P(clock_warps);
P(clock_overflows);
P(clock_underflows);
P(clock_deep_idle_events);
PN(clock_max_delta);
P(cpu_load[0]);
P(cpu_load[1]);
P(cpu_load[2]);
......
......@@ -682,6 +682,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
account_entity_enqueue(cfs_rq, se);
if (wakeup) {
place_entity(cfs_rq, se, 0);
......@@ -692,7 +693,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
check_spread(cfs_rq, se);
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
}
static void update_avg(u64 *avg, u64 sample)
......@@ -841,8 +841,10 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
* queued ticks are scheduled to match the slice, so don't bother
* validating it and just reschedule.
*/
if (queued)
return resched_task(rq_of(cfs_rq)->curr);
if (queued) {
resched_task(rq_of(cfs_rq)->curr);
return;
}
/*
* don't let the period tick interfere with the hrtick preemption
*/
......@@ -957,7 +959,7 @@ static void yield_task_fair(struct rq *rq)
return;
if (likely(!sysctl_sched_compat_yield) && curr->policy != SCHED_BATCH) {
__update_rq_clock(rq);
update_rq_clock(rq);
/*
* Update run-time statistics of the 'current'.
*/
......@@ -1007,7 +1009,7 @@ static int wake_idle(int cpu, struct task_struct *p)
* sibling runqueue info. This will avoid the checks and cache miss
* penalities associated with that.
*/
if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1)
if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1)
return cpu;
for_each_domain(cpu, sd) {
......@@ -1611,30 +1613,6 @@ static const struct sched_class fair_sched_class = {
};
#ifdef CONFIG_SCHED_DEBUG
static void
print_cfs_rq_tasks(struct seq_file *m, struct cfs_rq *cfs_rq, int depth)
{
struct sched_entity *se;
if (!cfs_rq)
return;
list_for_each_entry_rcu(se, &cfs_rq->tasks, group_node) {
int i;
for (i = depth; i; i--)
seq_puts(m, " ");
seq_printf(m, "%lu %s %lu\n",
se->load.weight,
entity_is_task(se) ? "T" : "G",
calc_delta_weight(SCHED_LOAD_SCALE, se)
);
if (!entity_is_task(se))
print_cfs_rq_tasks(m, group_cfs_rq(se), depth + 1);
}
}
static void print_cfs_stats(struct seq_file *m, int cpu)
{
struct cfs_rq *cfs_rq;
......@@ -1642,9 +1620,6 @@ static void print_cfs_stats(struct seq_file *m, int cpu)
rcu_read_lock();
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
seq_printf(m, "\nWeight tree:\n");
print_cfs_rq_tasks(m, &cpu_rq(cpu)->cfs, 1);
rcu_read_unlock();
}
#endif
......@@ -99,7 +99,7 @@ static void prio_changed_idle(struct rq *rq, struct task_struct *p,
/*
* Simple, special scheduling class for the per-CPU idle tasks:
*/
const struct sched_class idle_sched_class = {
static const struct sched_class idle_sched_class = {
/* .next is NULL */
/* no enqueue/yield_task for idle tasks */
......
......@@ -1098,11 +1098,14 @@ static void post_schedule_rt(struct rq *rq)
}
}
/*
* If we are not running and we are not going to reschedule soon, we should
* try to push tasks away now
*/
static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
{
if (!task_running(rq, p) &&
(p->prio >= rq->rt.highest_prio) &&
!test_tsk_need_resched(rq->curr) &&
rq->rt.overloaded)
push_rt_tasks(rq);
}
......@@ -1309,7 +1312,7 @@ static void set_curr_task_rt(struct rq *rq)
p->se.exec_start = rq->clock;
}
const struct sched_class rt_sched_class = {
static const struct sched_class rt_sched_class = {
.next = &fair_sched_class,
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment