Commit 7394ebbd authored by Rick Lindsley, committed by Linus Torvalds

[PATCH] scheduler statistics

This adds a number of CPU scheduler statistics to /proc/pid/stat.  They are
described in the new Documentation/sched-stats.txt.

We were carrying this patch offline for some time, but as there's still
considerable ongoing work in this area, and as the new stats are a
configuration option, I think it's best that this capability be in the base
kernel.

Nick removed a fair number of statistics that he wasn't using.  The full patch
gathers more information.  In particular, his patch doesn't include the code
to measure the latency between the time a process is made runnable and the
time it hits a processor, which will be key to measuring interactivity changes.

He passed his changes back to me and I finished merging them with the
current statistics patches just before OLS.  I believe this is largely a
superset of the patch you grabbed and should port relatively easily too.

Versions also exist for

    2.6.8-rc2
    2.6.8-rc2-mm1
    2.6.8-rc2-mm2

at
    http://eaglet.rain.com/rick/linux/schedstat/patches/

and within 24 hours at

    http://oss.software.ibm.com/linux/patches/?patch_id=730&show=all

The version below is for 2.6.8-rc2-mm2 without the staircase code and has
been compiled cleanly but not yet run.

From: Ingo Molnar <mingo@elte.hu>

this code needs a couple of cleanups before it can go into mainline:

fs/proc/array.c, fs/proc/base.c, fs/proc/proc_misc.c:

 - moved the new /proc/<PID>/stat fields to /proc/<PID>/schedstat,
   because the new fields break older procps.  It's cleaner this way
   anyway.  This move necessitated a bump to version 10.

Documentation/sched-stats.txt:

 - updated sched-stats.txt for version 10

 - wake_up_forked_thread() => wake_up_new_task()

 - updated the per-process field description

Kconfig:

 - removed the default y and made the option dependent on DEBUG_KERNEL.
   This is really for scheduler analysis; normal users don't need the
   overhead.

include/linux/sched.h:

 - moved the definitions into kernel/sched.c - this fixes UP compilation
   and is cleaner.

 - also moved the sched-domain definitions to sched.c - now that the 
   sched-domains internals are not exposed to architectures this is
   doable. It's also necessary due to the previous change.

kernel/fork.c:

 - moved the ->sched_info init to sched_fork() where it belongs.

kernel/sched.c:

 - wake_up_forked_thread() -> wake_up_new_task(), wuft_cnt -> wunt_cnt,
   wuft_moved -> wunt_moved.

 - wunt_cnt and wunt_moved were defined but never updated - added the
   missing code to wake_up_new_task().

 - whitespace/style police

 - removed whitespace changes done to code not related to schedstats -
   I'll send a separate patch for these (and more).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
parent 8399dc16

Version 10 of schedstats includes support for sched_domains, which hit the
mainline kernel in 2.6.7.  Some counters make more sense to be per-runqueue;
others to be per-domain.

In version 10 of schedstat, there is at least one level of domain
statistics for each cpu listed, and there may well be more than one
domain. Domains have no particular names in this implementation, but
the highest numbered one typically arbitrates balancing across all the
cpus on the machine, while domain0 is the most tightly focused domain,
sometimes balancing only between pairs of cpus. At this time, there
are no architectures which need more than three domain levels. The first
field in the domain stats is a bit map indicating which cpus are affected
by that domain.

These fields are counters, and only increment. Programs which make use
of these will need to start with a baseline observation and then calculate
the change in the counters at each subsequent observation. A perl script
which does this for many of the fields is available at
http://eaglet.rain.com/rick/linux/schedstat/

Note that any such script will necessarily be version-specific, as the main
reason to change versions is changes in the output format. For those wishing
to write their own scripts, the fields are described here.
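
As a rough illustration of the baseline-and-delta approach, here is a minimal
C sketch.  It is not the perl script mentioned above; it simply assumes the
version 10 cpu<N> layout described below, in which field 4 is the number of
sched_yield() calls, and the choice of cpu0 and a 10 second interval is
arbitrary.

    /*
     * Minimal sketch: sample /proc/schedstat twice and report how many
     * times sched_yield() was called on cpu0 in between.  Field positions
     * are version-specific, so a real tool should check the "version"
     * line first.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static unsigned long read_cpu0_field(int field)
    {
        FILE *fp = fopen("/proc/schedstat", "r");
        char line[1024];
        unsigned long val = 0;

        if (!fp) {
            perror("/proc/schedstat");
            exit(1);
        }
        while (fgets(line, sizeof(line), fp)) {
            if (strncmp(line, "cpu0 ", 5) == 0) {
                char *tok = strtok(line, " ");  /* skip the "cpu0" tag */
                int i;

                for (i = 0; tok && i < field; i++)
                    tok = strtok(NULL, " \n");
                if (tok)
                    val = strtoul(tok, NULL, 10);
                break;
            }
        }
        fclose(fp);
        return val;
    }

    int main(void)
    {
        unsigned long before = read_cpu0_field(4);  /* yld_cnt */

        sleep(10);
        printf("cpu0 sched_yield() calls in the last 10s: %lu\n",
               read_cpu0_field(4) - before);
        return 0;
    }
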
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

NOTE: In the sched_yield() statistics, the active queue is considered empty
if it has only one process in it, since obviously the process calling
sched_yield() is that process.

First four fields are sched_yield() statistics:
1) # of times both the active and the expired queue were empty
2) # of times just the active queue was empty
3) # of times just the expired queue was empty
4) # of times sched_yield() was called
Next four are schedule() statistics:
5) # of times the active queue had at least one other process on it
6) # of times we switched to the expired queue and reused it
7) # of times schedule() was called
8) # of times schedule() left the processor idle
Next four are active_load_balance() statistics:
9) # of times active_load_balance() was called
10) # of times active_load_balance() caused this cpu to gain a task
11) # of times active_load_balance() caused this cpu to lose a task
12) # of times active_load_balance() tried to move a task and failed
Next three are try_to_wake_up() statistics:
13) # of times try_to_wake_up() was called
14) # of times try_to_wake_up() successfully moved the awakening task
15) # of times try_to_wake_up() attempted to move the awakening task
Next two are wake_up_new_task() statistics:
16) # of times wake_up_new_task() was called
17) # of times wake_up_new_task() successfully moved the new task
Next one is a sched_migrate_task() statistic:
18) # of times sched_migrate_task() was called
Next one is a sched_balance_exec() statistic:
19) # of times sched_balance_exec() was called
Next three are statistics describing scheduling latency:
20) sum of all time spent running by tasks on this processor (in ms)
21) sum of all time spent waiting to run by tasks on this processor (in ms)
22) # of tasks (not necessarily unique) given to the processor
The last six are statistics dealing with pull_task():
23) # of times pull_task() moved a task to this cpu when newly idle
24) # of times pull_task() stole a task from this cpu when another cpu
was newly idle
25) # of times pull_task() moved a task to this cpu when idle
26) # of times pull_task() stole a task from this cpu when another cpu
was idle
27) # of times pull_task() moved a task to this cpu when busy
28) # of times pull_task() stole a task from this cpu when another cpu
was busy

Domain statistics
-----------------
One of these is produced per domain for each cpu described.

domain<N> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

The first field is a bit mask indicating what cpus this domain operates over.
The next fifteen are a variety of load_balance() statistics:
1) # of times in this domain load_balance() was called when the cpu
was idle
2) # of times in this domain load_balance() was called when the cpu
was busy
3) # of times in this domain load_balance() was called when the cpu
was just becoming idle
4) # of times in this domain load_balance() tried to move one or more
tasks and failed, when the cpu was idle
5) # of times in this domain load_balance() tried to move one or more
tasks and failed, when the cpu was busy
6) # of times in this domain load_balance() tried to move one or more
tasks and failed, when the cpu was just becoming idle
7) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was idle
8) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was busy
9) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was just becoming idle
10) # of times in this domain load_balance() was called but did not find
a busier queue while the cpu was idle
11) # of times in this domain load_balance() was called but did not find
a busier queue while the cpu was busy
12) # of times in this domain load_balance() was called but did not find
a busier queue while the cpu was just becoming idle
13) # of times in this domain a busier queue was found while the cpu was
idle but no busier group was found
14) # of times in this domain a busier queue was found while the cpu was
busy but no busier group was found
15) # of times in this domain a busier queue was found while the cpu was
just becoming idle but no busier group was found
Next two are sched_balance_exec() statistics:
17) # of times in this domain sched_balance_exec() successfully pushed
a task to a new cpu
18) # of times in this domain sched_balance_exec() tried but failed to
push a task to a new cpu
Next two are try_to_wake_up() statistics:
19) # of times in this domain try_to_wake_up() tried to move a task based
on affinity and cache warmth
20) # of times in this domain try_to_wake_up() tried to move a task based
on load balancing

/proc/<pid>/schedstat
---------------------
schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level.  There are three fields in
this file correlating to fields 20, 21, and 22 in the CPU fields, but
they only apply for that process.

A program could be easily written to make use of these extra fields to
report on how well a particular process or set of processes is faring
under the scheduler's policies. A simple version of such a program is
available at
http://eaglet.rain.com/rick/linux/schedstat/v10/latency.c
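
A stripped-down sketch in the same vein (again, this is not the latency.c
above; the 5 second sampling interval is arbitrary, and the three fields are
read in the order given above) might look like:

    /*
     * Sketch of a per-process latency reporter (not the latency.c above).
     * It samples /proc/<pid>/schedstat twice and reports how long the task
     * ran and how long it waited in between, plus the average wait per
     * timeslice.  The three fields are cpu_time, run_delay and pcnt, as
     * described above.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct sample {
        unsigned long cpu_time, run_delay, pcnt;
    };

    static void read_sample(const char *path, struct sample *s)
    {
        FILE *fp = fopen(path, "r");

        if (!fp || fscanf(fp, "%lu %lu %lu",
                          &s->cpu_time, &s->run_delay, &s->pcnt) != 3) {
            perror(path);
            exit(1);
        }
        fclose(fp);
    }

    int main(int argc, char **argv)
    {
        char path[64];
        struct sample a, b;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);

        read_sample(path, &a);
        sleep(5);
        read_sample(path, &b);

        printf("ran %lu ms, waited %lu ms over %lu timeslices\n",
               b.cpu_time - a.cpu_time, b.run_delay - a.run_delay,
               b.pcnt - a.pcnt);
        if (b.pcnt > a.pcnt)
            printf("average wait per timeslice: %.2f ms\n",
                   (double)(b.run_delay - a.run_delay) / (b.pcnt - a.pcnt));
        return 0;
    }

Dividing the change in the second field by the change in the third gives the
average time each timeslice spent waiting on a runqueue before reaching the
cpu.
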
......@@ -45,6 +45,18 @@ config 4KSTACKS
on the VM subsystem for higher order allocations. This option
will also use IRQ stacks to compensate for the reduced stackspace.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
config X86_FIND_SMP_CONFIG
bool
depends on X86_LOCAL_APIC || X86_VOYAGER
......
......@@ -53,6 +53,18 @@ config BDI_SWITCH
Unless you are intending to debug the kernel with one of these
machines, say N here.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
config BOOTX_TEXT
bool "Support for early boot text console (BootX or OpenFirmware only)"
depends PPC_OF
......
......@@ -334,6 +334,18 @@ config VIOTAPE
If you are running Linux on an iSeries system and you want Linux
to read and/or write a tape drive owned by OS/400, say Y here.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
endmenu
config VIOPATH
......
......@@ -18,6 +18,18 @@ config INIT_DEBUG
Fill __init and __initdata at the end of boot. This helps debugging
illegal uses of __init and __initdata after initialization.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
config FRAME_POINTER
bool "Compile the kernel with frame pointers"
help
......
......@@ -60,6 +60,9 @@ enum pid_directory_inos {
PROC_TGID_MAPS,
PROC_TGID_MOUNTS,
PROC_TGID_WCHAN,
#ifdef CONFIG_SCHEDSTATS
PROC_TGID_SCHEDSTAT,
#endif
#ifdef CONFIG_SECURITY
PROC_TGID_ATTR,
PROC_TGID_ATTR_CURRENT,
......@@ -83,6 +86,9 @@ enum pid_directory_inos {
PROC_TID_MAPS,
PROC_TID_MOUNTS,
PROC_TID_WCHAN,
#ifdef CONFIG_SCHEDSTATS
PROC_TID_SCHEDSTAT,
#endif
#ifdef CONFIG_SECURITY
PROC_TID_ATTR,
PROC_TID_ATTR_CURRENT,
......@@ -122,6 +128,9 @@ static struct pid_entry tgid_base_stuff[] = {
#endif
#ifdef CONFIG_KALLSYMS
E(PROC_TGID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
#ifdef CONFIG_SCHEDSTATS
E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
#endif
{0,0,NULL,0}
};
......@@ -144,6 +153,9 @@ static struct pid_entry tid_base_stuff[] = {
#endif
#ifdef CONFIG_KALLSYMS
E(PROC_TID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
#ifdef CONFIG_SCHEDSTATS
E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
#endif
{0,0,NULL,0}
};
......@@ -398,6 +410,19 @@ static int proc_pid_wchan(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_KALLSYMS */
#ifdef CONFIG_SCHEDSTATS
/*
* Provides /proc/PID/schedstat
*/
static int proc_pid_schedstat(struct task_struct *task, char *buffer)
{
return sprintf(buffer, "%lu %lu %lu\n",
task->sched_info.cpu_time,
task->sched_info.run_delay,
task->sched_info.pcnt);
}
#endif
/************************************************************************/
/* Here the fs part begins */
/************************************************************************/
......@@ -1375,6 +1400,13 @@ static struct dentry *proc_pident_lookup(struct inode *dir,
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_pid_wchan;
break;
#endif
#ifdef CONFIG_SCHEDSTATS
case PROC_TID_SCHEDSTAT:
case PROC_TGID_SCHEDSTAT:
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_pid_schedstat;
break;
#endif
default:
printk("procfs: impossible type (%d)",p->type);
......
......@@ -681,6 +681,9 @@ void __init proc_misc_init(void)
#ifdef CONFIG_MODULES
create_seq_entry("modules", 0, &proc_modules_operations);
#endif
#ifdef CONFIG_SCHEDSTATS
create_seq_entry("schedstat", 0, &proc_schedstat_operations);
#endif
#ifdef CONFIG_PROC_KCORE
proc_root_kcore = create_proc_entry("kcore", S_IRUSR, NULL);
if (proc_root_kcore) {
......
......@@ -352,6 +352,20 @@ struct k_itimer {
struct timespec wall_to_prev; /* wall_to_monotonic used when set */
};
#ifdef CONFIG_SCHEDSTATS
struct sched_info {
/* cumulative counters */
unsigned long cpu_time, /* time spent on the cpu */
run_delay, /* time spent waiting on a runqueue */
pcnt; /* # of timeslices run on this cpu */
/* timestamps */
unsigned long last_arrival, /* when we last ran on a cpu */
last_queued; /* when we were last queued to run */
};
extern struct file_operations proc_schedstat_operations;
#endif
struct io_context; /* See blkdev.h */
void exit_io_context(void);
......@@ -414,6 +428,10 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
#endif
struct list_head tasks;
/*
* ptrace_list/ptrace_children forms the list of my children
......@@ -570,119 +588,6 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#ifdef CONFIG_SMP
#define SCHED_LOAD_SCALE 128UL /* increase resolution of load */
#define SD_BALANCE_NEWIDLE 1 /* Balance when about to become idle */
#define SD_BALANCE_EXEC 2 /* Balance on exec */
#define SD_WAKE_IDLE 4 /* Wake to idle CPU on task wakeup */
#define SD_WAKE_AFFINE 8 /* Wake task to waking CPU */
#define SD_WAKE_BALANCE 16 /* Perform balancing at task wakeup */
#define SD_SHARE_CPUPOWER 32 /* Domain members share cpu power */
struct sched_group {
struct sched_group *next; /* Must be a circular list */
cpumask_t cpumask;
/*
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU. This should be read only (except for setup). Although
* it will need to be written to at cpu hot(un)plug time, perhaps the
* cpucontrol semaphore will provide enough exclusion?
*/
unsigned long cpu_power;
};
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
struct sched_group *groups; /* the balancing groups of the domain */
cpumask_t span; /* span of all CPUs in this domain */
unsigned long min_interval; /* Minimum balance interval ms */
unsigned long max_interval; /* Maximum balance interval ms */
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
unsigned int per_cpu_gain; /* CPU % gained by adding domain cpus */
int flags; /* See SD_* */
/* Runtime fields. */
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
};
#ifndef ARCH_HAS_SCHED_TUNE
#ifdef CONFIG_SCHED_SMT
#define ARCH_HAS_SCHED_WAKE_IDLE
/* Common values for SMT siblings */
#define SD_SIBLING_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 2, \
.busy_factor = 8, \
.imbalance_pct = 110, \
.cache_hot_time = 0, \
.cache_nice_tries = 0, \
.per_cpu_gain = 25, \
.flags = SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_IDLE \
| SD_SHARE_CPUPOWER, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#endif
/* Common values for CPUs */
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
.cache_hot_time = (5*1000000/2), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#ifdef CONFIG_NUMA
/* Common values for NUMA nodes */
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
.busy_factor = 32, \
.imbalance_pct = 125, \
.cache_hot_time = (10*1000000), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_EXEC \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#endif
#endif /* ARCH_HAS_SCHED_TUNE */
extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
#else
static inline int set_cpus_allowed(task_t *p, cpumask_t new_mask)
......
......@@ -40,6 +40,8 @@
#include <linux/cpu.h>
#include <linux/percpu.h>
#include <linux/kthread.h>
#include <linux/seq_file.h>
#include <linux/times.h>
#include <asm/tlb.h>
#include <asm/unistd.h>
......@@ -182,6 +184,16 @@ static unsigned int task_timeslice(task_t *p)
#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
enum idle_type
{
IDLE,
NOT_IDLE,
NEWLY_IDLE,
MAX_IDLE_TYPES
};
struct sched_domain;
/*
* These are the runqueue data structures:
*/
......@@ -233,10 +245,186 @@ struct runqueue {
task_t *migration_thread;
struct list_head migration_queue;
#endif
#ifdef CONFIG_SCHEDSTATS
/* latency stats */
struct sched_info rq_sched_info;
/* sys_sched_yield() stats */
unsigned long yld_exp_empty;
unsigned long yld_act_empty;
unsigned long yld_both_empty;
unsigned long yld_cnt;
/* schedule() stats */
unsigned long sched_noswitch;
unsigned long sched_switch;
unsigned long sched_cnt;
unsigned long sched_goidle;
/* pull_task() stats */
unsigned long pt_gained[MAX_IDLE_TYPES];
unsigned long pt_lost[MAX_IDLE_TYPES];
/* active_load_balance() stats */
unsigned long alb_cnt;
unsigned long alb_lost;
unsigned long alb_gained;
unsigned long alb_failed;
/* try_to_wake_up() stats */
unsigned long ttwu_cnt;
unsigned long ttwu_attempts;
unsigned long ttwu_moved;
/* wake_up_new_task() stats */
unsigned long wunt_cnt;
unsigned long wunt_moved;
/* sched_migrate_task() stats */
unsigned long smt_cnt;
/* sched_balance_exec() stats */
unsigned long sbe_cnt;
#endif
};
static DEFINE_PER_CPU(struct runqueue, runqueues);
/*
* sched-domains (multiprocessor balancing) declarations:
*/
#ifdef CONFIG_SMP
#define SCHED_LOAD_SCALE 128UL /* increase resolution of load */
#define SD_BALANCE_NEWIDLE 1 /* Balance when about to become idle */
#define SD_BALANCE_EXEC 2 /* Balance on exec */
#define SD_WAKE_IDLE 4 /* Wake to idle CPU on task wakeup */
#define SD_WAKE_AFFINE 8 /* Wake task to waking CPU */
#define SD_WAKE_BALANCE 16 /* Perform balancing at task wakeup */
#define SD_SHARE_CPUPOWER 32 /* Domain members share cpu power */
struct sched_group {
struct sched_group *next; /* Must be a circular list */
cpumask_t cpumask;
/*
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU. This should be read only (except for setup). Although
* it will need to be written to at cpu hot(un)plug time, perhaps the
* cpucontrol semaphore will provide enough exclusion?
*/
unsigned long cpu_power;
};
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
struct sched_group *groups; /* the balancing groups of the domain */
cpumask_t span; /* span of all CPUs in this domain */
unsigned long min_interval; /* Minimum balance interval ms */
unsigned long max_interval; /* Maximum balance interval ms */
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
unsigned int per_cpu_gain; /* CPU % gained by adding domain cpus */
int flags; /* See SD_* */
/* Runtime fields. */
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
unsigned long lb_cnt[MAX_IDLE_TYPES];
unsigned long lb_failed[MAX_IDLE_TYPES];
unsigned long lb_imbalance[MAX_IDLE_TYPES];
unsigned long lb_nobusyg[MAX_IDLE_TYPES];
unsigned long lb_nobusyq[MAX_IDLE_TYPES];
/* sched_balance_exec() stats */
unsigned long sbe_attempts;
unsigned long sbe_pushed;
/* try_to_wake_up() stats */
unsigned long ttwu_wake_affine;
unsigned long ttwu_wake_balance;
#endif
};
#ifndef ARCH_HAS_SCHED_TUNE
#ifdef CONFIG_SCHED_SMT
#define ARCH_HAS_SCHED_WAKE_IDLE
/* Common values for SMT siblings */
#define SD_SIBLING_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 2, \
.busy_factor = 8, \
.imbalance_pct = 110, \
.cache_hot_time = 0, \
.cache_nice_tries = 0, \
.per_cpu_gain = 25, \
.flags = SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_IDLE \
| SD_SHARE_CPUPOWER, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#endif
/* Common values for CPUs */
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
.cache_hot_time = (5*1000000/2), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
/* Arch can override this macro in processor.h */
#if defined(CONFIG_NUMA) && !defined(SD_NODE_INIT)
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
.busy_factor = 32, \
.imbalance_pct = 125, \
.cache_hot_time = (10*1000000), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_EXEC \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#endif
#endif /* ARCH_HAS_SCHED_TUNE */
#endif
#define for_each_domain(cpu, domain) \
for (domain = cpu_rq(cpu)->sd; domain; domain = domain->parent)
......@@ -279,6 +467,100 @@ static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
spin_unlock_irqrestore(&rq->lock, *flags);
}
#ifdef CONFIG_SCHEDSTATS
/*
* bump this up when changing the output format or the meaning of an existing
* format, so that tools can adapt (or abort)
*/
#define SCHEDSTAT_VERSION 10
static int show_schedstat(struct seq_file *seq, void *v)
{
int cpu;
enum idle_type itype;
seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION);
seq_printf(seq, "timestamp %lu\n", jiffies);
for_each_online_cpu(cpu) {
runqueue_t *rq = cpu_rq(cpu);
struct sched_domain *sd;
int dcnt = 0;
/* runqueue-specific stats */
seq_printf(seq,
"cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu "
"%lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
cpu, rq->yld_both_empty,
rq->yld_act_empty, rq->yld_exp_empty,
rq->yld_cnt, rq->sched_noswitch,
rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
rq->alb_cnt, rq->alb_gained, rq->alb_lost,
rq->alb_failed,
rq->ttwu_cnt, rq->ttwu_moved, rq->ttwu_attempts,
rq->wunt_cnt, rq->wunt_moved,
rq->smt_cnt, rq->sbe_cnt, rq->rq_sched_info.cpu_time,
rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
for (itype = IDLE; itype < MAX_IDLE_TYPES; itype++)
seq_printf(seq, " %lu %lu", rq->pt_gained[itype],
rq->pt_lost[itype]);
seq_printf(seq, "\n");
/* domain-specific stats */
for_each_domain(cpu, sd) {
char mask_str[NR_CPUS];
cpumask_scnprintf(mask_str, NR_CPUS, sd->span);
seq_printf(seq, "domain%d %s", dcnt++, mask_str);
for (itype = IDLE; itype < MAX_IDLE_TYPES; itype++) {
seq_printf(seq, " %lu %lu %lu %lu %lu",
sd->lb_cnt[itype],
sd->lb_failed[itype],
sd->lb_imbalance[itype],
sd->lb_nobusyq[itype],
sd->lb_nobusyg[itype]);
}
seq_printf(seq, " %lu %lu %lu %lu\n",
sd->sbe_pushed, sd->sbe_attempts,
sd->ttwu_wake_affine, sd->ttwu_wake_balance);
}
}
return 0;
}
static int schedstat_open(struct inode *inode, struct file *file)
{
unsigned int size = PAGE_SIZE * (1 + num_online_cpus() / 32);
char *buf = kmalloc(size, GFP_KERNEL);
struct seq_file *m;
int res;
if (!buf)
return -ENOMEM;
res = single_open(file, show_schedstat, NULL);
if (!res) {
m = file->private_data;
m->buf = buf;
m->size = size;
} else
kfree(buf);
return res;
}
struct file_operations proc_schedstat_operations = {
.open = schedstat_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
# define schedstat_inc(rq, field) rq->field++;
# define schedstat_add(rq, field, amt) rq->field += amt;
#else /* !CONFIG_SCHEDSTATS */
# define schedstat_inc(rq, field) do { } while (0);
# define schedstat_add(rq, field, amt) do { } while (0);
#endif
/*
* rq_lock - lock a given runqueue and disable interrupts.
*/
......@@ -298,6 +580,112 @@ static inline void rq_unlock(runqueue_t *rq)
spin_unlock_irq(&rq->lock);
}
#ifdef CONFIG_SCHEDSTATS
/*
* Called when a process is dequeued from the active array and given
* the cpu. We should note that with the exception of interactive
* tasks, the expired queue will become the active queue after the active
* queue is empty, without explicitly dequeuing and requeuing tasks in the
* expired queue. (Interactive tasks may be requeued directly to the
* active queue, thus delaying tasks in the expired queue from running;
* see scheduler_tick()).
*
* This function is only called from sched_info_arrive(), rather than
* dequeue_task(). Even though a task may be queued and dequeued multiple
* times as it is shuffled about, we're really interested in knowing how
* long it was from the *first* time it was queued to the time that it
* finally hit a cpu.
*/
static inline void sched_info_dequeued(task_t *t)
{
t->sched_info.last_queued = 0;
}
/*
* Called when a task finally hits the cpu. We can now calculate how
* long it was waiting to run. We also note when it began so that we
* can keep stats on how long its timeslice is.
*/
static inline void sched_info_arrive(task_t *t)
{
unsigned long now = jiffies, diff = 0;
struct runqueue *rq = task_rq(t);
if (t->sched_info.last_queued)
diff = now - t->sched_info.last_queued;
sched_info_dequeued(t);
t->sched_info.run_delay += diff;
t->sched_info.last_arrival = now;
t->sched_info.pcnt++;
if (!rq)
return;
rq->rq_sched_info.run_delay += diff;
rq->rq_sched_info.pcnt++;
}
/*
* Called when a process is queued into either the active or expired
* array. The time is noted and later used to determine how long we
* had to wait for us to reach the cpu. Since the expired queue will
* become the active queue after active queue is empty, without dequeuing
* and requeuing any tasks, we are interested in queuing to either. It
* is unusual but not impossible for tasks to be dequeued and immediately
* requeued in the same or another array: this can happen in sched_yield(),
* set_user_nice(), and even load_balance() as it moves tasks from runqueue
* to runqueue.
*
* This function is only called from enqueue_task(), but also only updates
* the timestamp if it is already not set. It's assumed that
* sched_info_dequeued() will clear that stamp when appropriate.
*/
static inline void sched_info_queued(task_t *t)
{
if (!t->sched_info.last_queued)
t->sched_info.last_queued = jiffies;
}
/*
* Called when a process ceases being the active-running process, either
* voluntarily or involuntarily. Now we can calculate how long we ran.
*/
static inline void sched_info_depart(task_t *t)
{
struct runqueue *rq = task_rq(t);
unsigned long diff = jiffies - t->sched_info.last_arrival;
t->sched_info.cpu_time += diff;
if (rq)
rq->rq_sched_info.cpu_time += diff;
}
/*
* Called when tasks are switched involuntarily due, typically, to expiring
* their time slice. (This may also be called when switching to or from
* the idle task.) We are only called when prev != next.
*/
static inline void sched_info_switch(task_t *prev, task_t *next)
{
struct runqueue *rq = task_rq(prev);
/*
* prev now departs the cpu. It's not interesting to record
* stats about how efficient we were at scheduling the idle
* process, however.
*/
if (prev != rq->idle)
sched_info_depart(prev);
if (next != rq->idle)
sched_info_arrive(next);
}
#else
#define sched_info_queued(t) do { } while (0)
#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS */
/*
* Adding/removing a task to/from a priority array:
*/
......@@ -311,6 +699,7 @@ static void dequeue_task(struct task_struct *p, prio_array_t *array)
static void enqueue_task(struct task_struct *p, prio_array_t *array)
{
sched_info_queued(p);
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
array->nr_active++;
......@@ -740,6 +1129,7 @@ static int try_to_wake_up(task_t * p, unsigned int state, int sync)
#endif
rq = task_rq_lock(p, &flags);
schedstat_inc(rq, ttwu_cnt);
old_state = p->state;
if (!(old_state & state))
goto out;
......@@ -787,23 +1177,35 @@ static int try_to_wake_up(task_t * p, unsigned int state, int sync)
*/
imbalance = sd->imbalance_pct + (sd->imbalance_pct - 100) / 2;
if ( ((sd->flags & SD_WAKE_AFFINE) &&
!task_hot(p, rq->timestamp_last_tick, sd))
|| ((sd->flags & SD_WAKE_BALANCE) &&
imbalance*this_load <= 100*load) ) {
if ((sd->flags & SD_WAKE_AFFINE) &&
!task_hot(p, rq->timestamp_last_tick, sd)) {
/*
* This domain has SD_WAKE_AFFINE and p is cache cold
* in this domain.
*/
if (cpu_isset(cpu, sd->span)) {
schedstat_inc(sd, ttwu_wake_affine);
goto out_set_cpu;
}
} else if ((sd->flags & SD_WAKE_BALANCE) &&
imbalance*this_load <= 100*load) {
/*
* Now sd has SD_WAKE_AFFINE and p is cache cold in sd
* or sd has SD_WAKE_BALANCE and there is an imbalance
* This domain has SD_WAKE_BALANCE and there is
* an imbalance.
*/
if (cpu_isset(cpu, sd->span))
if (cpu_isset(cpu, sd->span)) {
schedstat_inc(sd, ttwu_wake_balance);
goto out_set_cpu;
}
}
}
new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
out_set_cpu:
schedstat_inc(rq, ttwu_attempts);
new_cpu = wake_idle(new_cpu, p);
if (new_cpu != cpu && cpu_isset(new_cpu, p->cpus_allowed)) {
schedstat_inc(rq, ttwu_moved);
set_task_cpu(p, new_cpu);
task_rq_unlock(rq, &flags);
/* might preempt at this point */
......@@ -886,6 +1288,9 @@ void fastcall sched_fork(task_t *p)
INIT_LIST_HEAD(&p->run_list);
p->array = NULL;
spin_lock_init(&p->switch_lock);
#ifdef CONFIG_SCHEDSTATS
memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#ifdef CONFIG_PREEMPT
/*
* During context-switch we hold precisely one spinlock, which
......@@ -943,6 +1348,7 @@ void fastcall wake_up_new_task(task_t * p, unsigned long clone_flags)
BUG_ON(p->state != TASK_RUNNING);
schedstat_inc(rq, wunt_cnt);
/*
* We decrease the sleep average of forking parents
* and children as well, to keep max-interactive tasks
......@@ -991,6 +1397,7 @@ void fastcall wake_up_new_task(task_t * p, unsigned long clone_flags)
current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
schedstat_inc(rq, wunt_moved);
}
if (unlikely(cpu != this_cpu)) {
......@@ -1161,13 +1568,6 @@ unsigned long nr_iowait(void)
return sum;
}
enum idle_type
{
IDLE,
NOT_IDLE,
NEWLY_IDLE,
};
#ifdef CONFIG_SMP
/*
......@@ -1282,6 +1682,7 @@ static void sched_migrate_task(task_t *p, int dest_cpu)
|| unlikely(cpu_is_offline(dest_cpu)))
goto out;
schedstat_inc(rq, smt_cnt);
/* force the process onto the specified CPU */
if (migrate_task(p, dest_cpu, &req)) {
/* Need to wait for migration thread (might exit: take ref). */
......@@ -1309,6 +1710,7 @@ void sched_exec(void)
struct sched_domain *tmp, *sd = NULL;
int new_cpu, this_cpu = get_cpu();
schedstat_inc(this_rq(), sbe_cnt);
/* Prefer the current CPU if there's only this task running */
if (this_rq()->nr_running <= 1)
goto out;
......@@ -1317,9 +1719,11 @@ void sched_exec(void)
if (tmp->flags & SD_BALANCE_EXEC)
sd = tmp;
schedstat_inc(sd, sbe_attempts);
if (sd) {
new_cpu = find_idlest_cpu(current, this_cpu, sd);
if (new_cpu != this_cpu) {
schedstat_inc(sd, sbe_pushed);
put_cpu();
sched_migrate_task(current, new_cpu);
return;
......@@ -1443,6 +1847,15 @@ static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest,
idx++;
goto skip_bitmap;
}
/*
* Right now, this is the only place pull_task() is called,
* so we can safely collect pull_task() stats here rather than
* inside pull_task().
*/
schedstat_inc(this_rq, pt_gained[idle]);
schedstat_inc(busiest, pt_lost[idle]);
pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
pulled++;
......@@ -1637,14 +2050,20 @@ static int load_balance(int this_cpu, runqueue_t *this_rq,
int nr_moved;
spin_lock(&this_rq->lock);
schedstat_inc(sd, lb_cnt[idle]);
group = find_busiest_group(sd, this_cpu, &imbalance, idle);
if (!group)
if (!group) {
schedstat_inc(sd, lb_nobusyg[idle]);
goto out_balanced;
}
busiest = find_busiest_queue(group);
if (!busiest)
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
}
/*
* This should be "impossible", but since load
* balancing is inherently racy and statistical,
......@@ -1655,6 +2074,8 @@ static int load_balance(int this_cpu, runqueue_t *this_rq,
goto out_balanced;
}
schedstat_add(sd, lb_imbalance[idle], imbalance);
nr_moved = 0;
if (busiest->nr_running > 1) {
/*
......@@ -1671,6 +2092,7 @@ static int load_balance(int this_cpu, runqueue_t *this_rq,
spin_unlock(&this_rq->lock);
if (!nr_moved) {
schedstat_inc(sd, lb_failed[idle]);
sd->nr_balance_failed++;
if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
......@@ -1725,19 +2147,27 @@ static int load_balance_newidle(int this_cpu, runqueue_t *this_rq,
unsigned long imbalance;
int nr_moved = 0;
schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
if (!group)
if (!group) {
schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
goto out;
}
busiest = find_busiest_queue(group);
if (!busiest || busiest == this_rq)
if (!busiest || busiest == this_rq) {
schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
goto out;
}
/* Attempt to move tasks */
double_lock_balance(this_rq, busiest);
schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance);
nr_moved = move_tasks(this_rq, this_cpu, busiest,
imbalance, sd, NEWLY_IDLE);
if (!nr_moved)
schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
spin_unlock(&busiest->lock);
......@@ -1777,6 +2207,7 @@ static void active_load_balance(runqueue_t *busiest, int busiest_cpu)
struct sched_group *group, *busy_group;
int i;
schedstat_inc(busiest, alb_cnt);
if (busiest->nr_running <= 1)
return;
......@@ -1821,7 +2252,12 @@ static void active_load_balance(runqueue_t *busiest, int busiest_cpu)
if (unlikely(busiest == rq))
goto next_group;
double_lock_balance(busiest, rq);
move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
if (move_tasks(rq, push_cpu, busiest, 1, sd, IDLE)) {
schedstat_inc(busiest, alb_lost);
schedstat_inc(rq, alb_gained);
} else {
schedstat_inc(busiest, alb_failed);
}
spin_unlock(&rq->lock);
next_group:
group = group->next;
......@@ -2174,6 +2610,7 @@ asmlinkage void __sched schedule(void)
}
release_kernel_lock(prev);
schedstat_inc(rq, sched_cnt);
now = sched_clock();
if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
......@@ -2220,18 +2657,21 @@ asmlinkage void __sched schedule(void)
/*
* Switch the active and expired arrays.
*/
schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
}
} else
schedstat_inc(rq, sched_noswitch);
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
if (dependent_sleeper(cpu, rq, next)) {
schedstat_inc(rq, sched_goidle);
next = rq->idle;
goto switch_tasks;
}
......@@ -2261,6 +2701,7 @@ asmlinkage void __sched schedule(void)
}
prev->timestamp = now;
sched_info_switch(prev, next);
if (likely(prev != next)) {
next->timestamp = now;
rq->nr_switches++;
......@@ -2972,6 +3413,7 @@ asmlinkage long sys_sched_yield(void)
prio_array_t *array = current->array;
prio_array_t *target = rq->expired;
schedstat_inc(rq, yld_cnt);
/*
* We implement yielding by moving the task into the expired
* queue.
......@@ -2982,6 +3424,13 @@ asmlinkage long sys_sched_yield(void)
if (rt_task(current))
target = rq->active;
if (current->array->nr_active == 1) {
schedstat_inc(rq, yld_act_empty);
if (!rq->expired->nr_active)
schedstat_inc(rq, yld_both_empty);
} else if (!rq->expired->nr_active)
schedstat_inc(rq, yld_exp_empty);
dequeue_task(current, array);
enqueue_task(current, target);
......@@ -3623,7 +4072,7 @@ EXPORT_SYMBOL(kernel_flag);
#ifdef CONFIG_SMP
/* Attach the domain 'sd' to 'cpu' as its base domain */
void cpu_attach_domain(struct sched_domain *sd, int cpu)
static void cpu_attach_domain(struct sched_domain *sd, int cpu)
{
migration_req_t req;
unsigned long flags;
......