Commit 7394ebbd authored by Rick Lindsley, committed by Linus Torvalds

[PATCH] scheduler statistics

It adds lots of CPU scheduler stats in /proc/pid/stat.  They are described in
the new Documentation/sched-stats.txt

We were carrying this patch offline for some time, but as there's still
considerable ongoing work in this area, and as the new stats are a
configuration option, I think it's best that this capability be in the base
kernel.

Nick removed a fair amount of statistics that he wasn't using.  The full patch
gathers more information.  In particular, his patch doesn't include the code
to measure the latency between the time a process is made runnable and the
time it hits a processor which will be key to measuring interactivity changes.

He passed his changes back to me and I got finished merging his changes with
the current statistics patches just before OLS.  I believe this is largely a
superset of the patch you grabbed and should port relatively easily too.

Versions also exist for

    2.6.8-rc2
    2.6.8-rc2-mm1
    2.6.8-rc2-mm2

at
    http://eaglet.rain.com/rick/linux/schedstat/patches/

and within 24 hours at

    http://oss.software.ibm.com/linux/patches/?patch_id=730&show=all

The version below is for 2.6.8-rc2-mm2 without the staircase code and has
been compiled cleanly but not yet run.

From: Ingo Molnar <mingo@elte.hu>

this code needs a couple of cleanups before it can go into mainline:

fs/proc/array.c, fs/proc/base.c, fs/proc/proc_misc.c:

 - moved the new /proc/<PID>/stat fields to /proc/<PID>/schedstat,
   because the new fields break older procps. It's cleaner this way
   anyway. This moving of fields necessitated a bump to version 10.

Documentation/sched-stats.txt:

 - updated sched-stats.txt for version 10

 - wake_up_forked_thread() => wake_up_new_task()

 - updated the per-process field description

Kconfig:

 - removed the default y and made the option dependent on DEBUG_KERNEL.
   This is really for scheduler analysis; normal users don't need the
   overhead.

include/linux/sched.h:

 - moved the definitions into kernel/sched.c - this fixes UP compilation
   and is cleaner.

 - also moved the sched-domain definitions to sched.c - now that the 
   sched-domains internals are not exposed to architectures this is
   doable. It's also necessary due to the previous change.

kernel/fork.c:

 - moved the ->sched_info init to sched_fork() where it belongs.

kernel/sched.c:

 - wake_up_forked_thread() -> wake_up_new_task(), wuft_cnt -> wunt_cnt,
   wuft_moved -> wunt_moved.

 - wunt_cnt and wunt_moved were defined but never updated - added the
   missing code to wake_up_new_task().

 - whitespace/style police

 - removed whitespace changes done to code not related to schedstats -
   i'll send a separate patch for these (and more).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
parent 8399dc16
Version 10 of schedstats includes support for sched_domains, which
hit the mainline kernel in 2.6.7. Some counters make more sense to be
per-runqueue; others to be per-domain.

In version 10 of schedstat, there is at least one level of domain
statistics for each cpu listed, and there may well be more than one
domain. Domains have no particular names in this implementation, but
the highest numbered one typically arbitrates balancing across all the
cpus on the machine, while domain0 is the most tightly focused domain,
sometimes balancing only between pairs of cpus. At this time, there
are no architectures which need more than three domain levels. The first
field in the domain stats is a bit map indicating which cpus are affected
by that domain.

These fields are counters, and only increment. Programs which make use
of these will need to start with a baseline observation and then calculate
the change in the counters at each subsequent observation. A perl script
which does this for many of the fields is available at

    http://eaglet.rain.com/rick/linux/schedstat/

Note that any such script will necessarily be version-specific, as the main
reason to change versions is changes in the output format. For those wishing
to write their own scripts, the fields are described here.
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

NOTE: In the sched_yield() statistics, the active queue is considered empty
if it has only one process in it, since obviously the process calling
sched_yield() is that process.
First four fields are sched_yield() statistics:
 1) # of times both the active and the expired queue were empty
 2) # of times just the active queue was empty
 3) # of times just the expired queue was empty
 4) # of times sched_yield() was called

Next four are schedule() statistics:
 5) # of times the active queue had at least one other process on it
 6) # of times we switched to the expired queue and reused it
 7) # of times schedule() was called
 8) # of times schedule() left the processor idle

Next four are active_load_balance() statistics:
 9) # of times active_load_balance() was called
10) # of times active_load_balance() caused this cpu to gain a task
11) # of times active_load_balance() caused this cpu to lose a task
12) # of times active_load_balance() tried to move a task and failed

Next three are try_to_wake_up() statistics:
13) # of times try_to_wake_up() was called
14) # of times try_to_wake_up() successfully moved the awakening task
15) # of times try_to_wake_up() attempted to move the awakening task

Next two are wake_up_new_task() statistics:
16) # of times wake_up_new_task() was called
17) # of times wake_up_new_task() successfully moved the new task

Next one is a sched_migrate_task() statistic:
18) # of times sched_migrate_task() was called

Next one is a sched_balance_exec() statistic:
19) # of times sched_balance_exec() was called

Next three are statistics describing scheduling latency:
20) sum of all time spent running by tasks on this processor (in ms)
21) sum of all time spent waiting to run by tasks on this processor (in ms)
22) # of tasks (not necessarily unique) given to the processor

The last six are statistics dealing with pull_task():
23) # of times pull_task() moved a task to this cpu when newly idle
24) # of times pull_task() stole a task from this cpu when another cpu
    was newly idle
25) # of times pull_task() moved a task to this cpu when idle
26) # of times pull_task() stole a task from this cpu when another cpu
    was idle
27) # of times pull_task() moved a task to this cpu when busy
28) # of times pull_task() stole a task from this cpu when another cpu
    was busy
Domain statistics
-----------------
One of these is produced per domain for each cpu described.

domain<N> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

The first field is a bit mask indicating what cpus this domain operates over.

The next fifteen are a variety of load_balance() statistics:
 1) # of times in this domain load_balance() was called when the cpu
    was idle
 2) # of times in this domain load_balance() was called when the cpu
    was busy
 3) # of times in this domain load_balance() was called when the cpu
    was just becoming idle
 4) # of times in this domain load_balance() tried to move one or more
    tasks and failed, when the cpu was idle
 5) # of times in this domain load_balance() tried to move one or more
    tasks and failed, when the cpu was busy
 6) # of times in this domain load_balance() tried to move one or more
    tasks and failed, when the cpu was just becoming idle
 7) sum of imbalances discovered (if any) with each call to
    load_balance() in this domain when the cpu was idle
 8) sum of imbalances discovered (if any) with each call to
    load_balance() in this domain when the cpu was busy
 9) sum of imbalances discovered (if any) with each call to
    load_balance() in this domain when the cpu was just becoming idle
10) # of times in this domain load_balance() was called but did not find
    a busier queue while the cpu was idle
11) # of times in this domain load_balance() was called but did not find
    a busier queue while the cpu was busy
12) # of times in this domain load_balance() was called but did not find
    a busier queue while the cpu was just becoming idle
13) # of times in this domain a busier queue was found while the cpu was
    idle but no busier group was found
14) # of times in this domain a busier queue was found while the cpu was
    busy but no busier group was found
15) # of times in this domain a busier queue was found while the cpu was
    just becoming idle but no busier group was found

Next two are sched_balance_exec() statistics:
17) # of times in this domain sched_balance_exec() successfully pushed
    a task to a new cpu
18) # of times in this domain sched_balance_exec() tried but failed to
    push a task to a new cpu

Next two are try_to_wake_up() statistics:
19) # of times in this domain try_to_wake_up() tried to move a task based
    on affinity and cache warmth
20) # of times in this domain try_to_wake_up() tried to move a task based
    on load balancing
/proc/<pid>/schedstat
---------------------

schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level. There are three fields in
this file corresponding to fields 20, 21, and 22 in the CPU fields, but
they apply only to that process.
A program could easily be written to make use of these extra fields to
report on how well a particular process or set of processes is faring
under the scheduler's policies. A simple version of such a program is
available at

    http://eaglet.rain.com/rick/linux/schedstat/v10/latency.c
@@ -45,6 +45,18 @@ config 4KSTACKS
on the VM subsystem for higher order allocations. This option
will also use IRQ stacks to compensate for the reduced stackspace.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
config X86_FIND_SMP_CONFIG
bool
depends on X86_LOCAL_APIC || X86_VOYAGER
......
@@ -53,6 +53,18 @@ config BDI_SWITCH
Unless you are intending to debug the kernel with one of these
machines, say N here.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
config BOOTX_TEXT
bool "Support for early boot text console (BootX or OpenFirmware only)"
depends PPC_OF
......
@@ -334,6 +334,18 @@ config VIOTAPE
If you are running Linux on an iSeries system and you want Linux
to read and/or write a tape drive owned by OS/400, say Y here.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
endmenu
config VIOPATH
......
@@ -18,6 +18,18 @@ config INIT_DEBUG
Fill __init and __initdata at the end of boot. This helps debugging
illegal uses of __init and __initdata after initialization.
config SCHEDSTATS
bool "Collect scheduler statistics"
depends on DEBUG_KERNEL && PROC_FS
help
If you say Y here, additional code will be inserted into the
scheduler and related routines to collect statistics about
scheduler behavior and provide them in /proc/schedstat. These
stats may be useful for both tuning and debugging the scheduler.
If you aren't debugging the scheduler or trying to tune a specific
application, you can say N to avoid the very slight overhead
this adds.
config FRAME_POINTER
bool "Compile the kernel with frame pointers"
help
......
@@ -60,6 +60,9 @@ enum pid_directory_inos {
PROC_TGID_MAPS,
PROC_TGID_MOUNTS,
PROC_TGID_WCHAN,
#ifdef CONFIG_SCHEDSTATS
PROC_TGID_SCHEDSTAT,
#endif
#ifdef CONFIG_SECURITY
PROC_TGID_ATTR,
PROC_TGID_ATTR_CURRENT,
@@ -83,6 +86,9 @@ enum pid_directory_inos {
PROC_TID_MAPS,
PROC_TID_MOUNTS,
PROC_TID_WCHAN,
#ifdef CONFIG_SCHEDSTATS
PROC_TID_SCHEDSTAT,
#endif
#ifdef CONFIG_SECURITY
PROC_TID_ATTR,
PROC_TID_ATTR_CURRENT,
@@ -122,6 +128,9 @@ static struct pid_entry tgid_base_stuff[] = {
#endif
#ifdef CONFIG_KALLSYMS
E(PROC_TGID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
#ifdef CONFIG_SCHEDSTATS
E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
#endif
{0,0,NULL,0}
};
@@ -144,6 +153,9 @@ static struct pid_entry tid_base_stuff[] = {
#endif
#ifdef CONFIG_KALLSYMS
E(PROC_TID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
#ifdef CONFIG_SCHEDSTATS
E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
#endif
{0,0,NULL,0}
};
@@ -398,6 +410,19 @@ static int proc_pid_wchan(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_KALLSYMS */
#ifdef CONFIG_SCHEDSTATS
/*
* Provides /proc/PID/schedstat
*/
static int proc_pid_schedstat(struct task_struct *task, char *buffer)
{
return sprintf(buffer, "%lu %lu %lu\n",
task->sched_info.cpu_time,
task->sched_info.run_delay,
task->sched_info.pcnt);
}
#endif
/************************************************************************/
/* Here the fs part begins */
/************************************************************************/
@@ -1375,6 +1400,13 @@ static struct dentry *proc_pident_lookup(struct inode *dir,
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_pid_wchan;
break;
#endif
#ifdef CONFIG_SCHEDSTATS
case PROC_TID_SCHEDSTAT:
case PROC_TGID_SCHEDSTAT:
inode->i_fop = &proc_info_file_operations;
ei->op.proc_read = proc_pid_schedstat;
break;
#endif
default:
printk("procfs: impossible type (%d)",p->type);
......
@@ -681,6 +681,9 @@ void __init proc_misc_init(void)
#ifdef CONFIG_MODULES
create_seq_entry("modules", 0, &proc_modules_operations);
#endif
#ifdef CONFIG_SCHEDSTATS
create_seq_entry("schedstat", 0, &proc_schedstat_operations);
#endif
#ifdef CONFIG_PROC_KCORE
proc_root_kcore = create_proc_entry("kcore", S_IRUSR, NULL);
if (proc_root_kcore) {
......
@@ -352,6 +352,20 @@ struct k_itimer {
struct timespec wall_to_prev; /* wall_to_monotonic used when set */
};
#ifdef CONFIG_SCHEDSTATS
struct sched_info {
/* cumulative counters */
unsigned long cpu_time, /* time spent on the cpu */
run_delay, /* time spent waiting on a runqueue */
pcnt; /* # of timeslices run on this cpu */
/* timestamps */
unsigned long last_arrival, /* when we last ran on a cpu */
last_queued; /* when we were last queued to run */
};
extern struct file_operations proc_schedstat_operations;
#endif
struct io_context; /* See blkdev.h */
void exit_io_context(void);
@@ -414,6 +428,10 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;
#ifdef CONFIG_SCHEDSTATS
struct sched_info sched_info;
#endif
struct list_head tasks;
/*
* ptrace_list/ptrace_children forms the list of my children
@@ -570,119 +588,6 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#ifdef CONFIG_SMP
#define SCHED_LOAD_SCALE 128UL /* increase resolution of load */
#define SD_BALANCE_NEWIDLE 1 /* Balance when about to become idle */
#define SD_BALANCE_EXEC 2 /* Balance on exec */
#define SD_WAKE_IDLE 4 /* Wake to idle CPU on task wakeup */
#define SD_WAKE_AFFINE 8 /* Wake task to waking CPU */
#define SD_WAKE_BALANCE 16 /* Perform balancing at task wakeup */
#define SD_SHARE_CPUPOWER 32 /* Domain members share cpu power */
struct sched_group {
struct sched_group *next; /* Must be a circular list */
cpumask_t cpumask;
/*
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU. This should be read only (except for setup). Although
* it will need to be written to at cpu hot(un)plug time, perhaps the
* cpucontrol semaphore will provide enough exclusion?
*/
unsigned long cpu_power;
};
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
struct sched_group *groups; /* the balancing groups of the domain */
cpumask_t span; /* span of all CPUs in this domain */
unsigned long min_interval; /* Minimum balance interval ms */
unsigned long max_interval; /* Maximum balance interval ms */
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
unsigned int per_cpu_gain; /* CPU % gained by adding domain cpus */
int flags; /* See SD_* */
/* Runtime fields. */
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
};
#ifndef ARCH_HAS_SCHED_TUNE
#ifdef CONFIG_SCHED_SMT
#define ARCH_HAS_SCHED_WAKE_IDLE
/* Common values for SMT siblings */
#define SD_SIBLING_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 2, \
.busy_factor = 8, \
.imbalance_pct = 110, \
.cache_hot_time = 0, \
.cache_nice_tries = 0, \
.per_cpu_gain = 25, \
.flags = SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_IDLE \
| SD_SHARE_CPUPOWER, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#endif
/* Common values for CPUs */
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
.cache_hot_time = (5*1000000/2), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#ifdef CONFIG_NUMA
/* Common values for NUMA nodes */
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
.busy_factor = 32, \
.imbalance_pct = 125, \
.cache_hot_time = (10*1000000), \
.cache_nice_tries = 1, \
.per_cpu_gain = 100, \
.flags = SD_BALANCE_EXEC \
| SD_WAKE_BALANCE, \
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#endif
#endif /* ARCH_HAS_SCHED_TUNE */
extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
#else
static inline int set_cpus_allowed(task_t *p, cpumask_t new_mask)
......