Commit 9972f27b authored by Sasha Goldshtein

cpudist: Support off-cpu time reports

Add -O switch, which directs cpudist to collect off-CPU time
statistics. Also restructure the code slightly and add examples
as appropriate.
parent de34c25b
@@ -78,7 +78,7 @@ Examples:
 - tools/[btrfsdist](tools/btrfsdist.py): Summarize btrfs operation latency distribution as a histogram. [Examples](tools/btrfsdist_example.txt).
 - tools/[btrfsslower](tools/btrfsslower.py): Trace slow btrfs operations. [Examples](tools/btrfsslower_example.txt).
 - tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
-- tools/[cpudist](tools/cpudist.py): Summarize on-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt)
+- tools/[cpudist](tools/cpudist.py): Summarize on- and off-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt)
 - tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
 - tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
 - tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
...
 .TH cpudist 8  "2016-06-28" "USER COMMANDS"
 .SH NAME
-cpudist \- On-CPU task time as a histogram.
+cpudist \- On- and off-CPU task time as a histogram.
 .SH SYNOPSIS
-.B cpudist [\-h] [\-T] [\-m] [\-P] [\-L] [\-p PID] [interval] [count]
+.B cpudist [\-h] [\-O] [\-T] [\-m] [\-P] [\-L] [\-p PID] [interval] [count]
 .SH DESCRIPTION
 This measures the time a task spends on the CPU before being descheduled, and
 shows the times as a histogram. Tasks that spend a very short time on the CPU
@@ -10,6 +10,11 @@ can be indicative of excessive context-switches and poor workload distribution,
 and possibly point to a shared source of contention that keeps tasks switching
 in and out as it becomes available (such as a mutex).
 
+Similarly, the tool can also measure the time a task spends off-CPU before it
+is scheduled again. This can be helpful in identifying long blocking and I/O
+operations, or alternatively very short descheduling times due to short-lived
+locks or timers.
+
 This tool uses in-kernel eBPF maps for storing timestamps and the histogram,
 for efficiency. Despite this, the overhead of this tool may become significant
 for some workloads: see the OVERHEAD section.
@@ -27,6 +32,9 @@ CONFIG_BPF and bcc.
 \-h
 Print usage message.
 .TP
+\-O
+Measure off-CPU time instead of on-CPU time.
+.TP
 \-T
 Include timestamps on output.
 .TP
@@ -53,6 +61,10 @@ Summarize task on-CPU time as a histogram:
 #
 .B cpudist
 .TP
+Summarize task off-CPU time as a histogram:
+#
+.B cpudist \-O
+.TP
 Print 1 second summaries, 10 times:
 #
 .B cpudist 1 10
...
 #!/usr/bin/python
 # @lint-avoid-python-3-compatibility-imports
 #
-# cpudist   Summarize on-CPU time per task as a histogram.
+# cpudist   Summarize on- and off-CPU time per task as a histogram.
 #
-# USAGE: cpudist [-h] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
+# USAGE: cpudist [-h] [-O] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
 #
-# This measures the time a task spends on the CPU, and shows this time as a
-# histogram, optionally per-process.
+# This measures the time a task spends on or off the CPU, and shows this time
+# as a histogram, optionally per-process.
 #
 # Copyright 2016 Sasha Goldshtein
 # Licensed under the Apache License, Version 2.0 (the "License")
@@ -18,6 +18,7 @@ import argparse
 
 examples = """examples:
     cpudist              # summarize on-CPU time as a histogram
+    cpudist -O           # summarize off-CPU time as a histogram
    cpudist 1 10         # print 1 second summaries, 10 times
     cpudist -mT 1        # 1s summaries, milliseconds, and timestamps
     cpudist -P           # show each PID separately
@@ -27,6 +28,8 @@ parser = argparse.ArgumentParser(
     description="Summarize on-CPU time per task as a histogram.",
     formatter_class=argparse.RawDescriptionHelpFormatter,
     epilog=examples)
+parser.add_argument("-O", "--offcpu", action="store_true",
+    help="measure off-CPU time")
 parser.add_argument("-T", "--timestamp", action="store_true",
     help="include timestamp on output")
 parser.add_argument("-m", "--milliseconds", action="store_true",
@@ -52,60 +55,90 @@ bpf_text += tp.generate_decl()
 bpf_text += tp.generate_entry_probe()
 bpf_text += tp.generate_struct()
 
+if not args.offcpu:
+    bpf_text += "#define ONCPU\n"
+
 bpf_text += """
 typedef struct pid_key {
     u64 id;
     u64 slot;
 } pid_key_t;
 
-// We need to store the start time, which is when the thread got switched in,
-// and the tgid for the pid because the sched_switch tracepoint doesn't provide
-// that information.
 BPF_HASH(start, u32, u64);
 BPF_HASH(tgid_for_pid, u32, u32);
 STORAGE
 
-int sched_switch(struct pt_regs *ctx)
-{
-    u64 pid_tgid = bpf_get_current_pid_tgid();
-    u64 *di = __trace_di.lookup(&pid_tgid);
-    if (di == 0)
-        return 0;
-
-    struct sched_switch_trace_entry args = {};
-    bpf_probe_read(&args, sizeof(args), (void *)*di);
-
-    u32 tgid, pid;
-    u64 ts = bpf_ktime_get_ns();
-    if (args.prev_state == TASK_RUNNING) {
-        pid = args.prev_pid;
-        u32 *stored_tgid = tgid_for_pid.lookup(&pid);
-        if (stored_tgid == 0)
-            goto BAIL;
-        tgid = *stored_tgid;
-        if (FILTER)
-            goto BAIL;
-        u64 *tsp = start.lookup(&pid);
-        if (tsp == 0)
-            goto BAIL;
-        u64 delta = ts - *tsp;
-        FACTOR
-        STORE
-    }
-
-BAIL:
-    tgid = pid_tgid >> 32;
-    pid = pid_tgid;
-    if (FILTER)
-        return 0;
-    start.update(&pid, &ts);
-    tgid_for_pid.update(&pid, &tgid);
-
-    return 0;
-}
+#define INVALID_TGID 0xffffffff
+
+static inline u32 get_tgid_if_missing(u32 tgid, u32 pid)
+{
+    if (tgid == INVALID_TGID) {
+        u32 *stored_tgid = tgid_for_pid.lookup(&pid);
+        if (stored_tgid != 0)
+            return *stored_tgid;
+    }
+    return tgid;
+}
+
+static inline void store_start(u32 tgid, u32 pid, u64 ts)
+{
+    tgid = get_tgid_if_missing(tgid, pid);
+    if (FILTER)
+        return;
+    start.update(&pid, &ts);
+}
+
+static inline void update_hist(u32 tgid, u32 pid, u64 ts)
+{
+    tgid = get_tgid_if_missing(tgid, pid);
+    if (FILTER)
+        return;
+    u64 *tsp = start.lookup(&pid);
+    if (tsp == 0)
+        return;
+    u64 delta = ts - *tsp;
+    FACTOR
+    STORE
+}
+
+int sched_switch(struct pt_regs *ctx)
+{
+    u64 ts = bpf_ktime_get_ns();
+    u64 pid_tgid = bpf_get_current_pid_tgid();
+    u32 tgid = pid_tgid >> 32, pid = pid_tgid;
+    // Keep a mapping of tgid for pid because when sched_switch hits,
+    // we only have the tgid information for the *current* pid, but not
+    // for the previous one.
+    tgid_for_pid.update(&pid, &tgid);
+
+    u64 *di = __trace_di.lookup(&pid_tgid);
+    if (di == 0)
+        return 0;
+
+    struct sched_switch_trace_entry args = {};
+    bpf_probe_read(&args, sizeof(args), (void *)*di);
+
+    // TODO: Store the comm as well
+    if (args.prev_state == TASK_RUNNING) {
+        u32 prev_pid = args.prev_pid;
+#ifdef ONCPU
+        update_hist(INVALID_TGID, prev_pid, ts);
+#else
+        store_start(INVALID_TGID, prev_pid, ts);
+#endif
+    }
+
+#ifdef ONCPU
+    store_start(tgid, pid, ts);
+#else
+    update_hist(tgid, pid, ts);
+#endif
+
+    return 0;
+}
@@ -144,7 +177,8 @@ b = BPF(text=bpf_text)
 Tracepoint.attach(b)
 b.attach_kprobe(event="perf_trace_sched_switch", fn_name="sched_switch")
 
-print("Tracing on-CPU time... Hit Ctrl-C to end.")
+print("Tracing %s-CPU time... Hit Ctrl-C to end." %
+    ("off" if args.offcpu else "on"))
 
 exiting = 0 if args.interval else 1
 dist = b.get_table("dist")
...
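The tool above selects between its two behaviors at BPF compile time: the Python side prepends `#define ONCPU` to the embedded C source unless `-O` was given, and the C code branches on `#ifdef ONCPU`. A minimal sketch of that pattern (the `build_program` helper and its placeholder body are ours, not cpudist's):

```python
def build_program(offcpu):
    # Mirror cpudist's approach: when measuring on-CPU time (the default),
    # prepend "#define ONCPU" so the #ifdef ONCPU branches in the embedded
    # C program are compiled in; with -O (off-CPU), leave it out.
    text = ""
    if not offcpu:
        text += "#define ONCPU\n"
    text += "/* ... BPF C program with #ifdef ONCPU branches ... */"
    return text

# Default run compiles the on-CPU variant:
print(build_program(offcpu=False).splitlines()[0])  # → #define ONCPU
```

Doing the selection with the C preprocessor keeps a single copy of `sched_switch` for both modes, instead of maintaining two nearly identical BPF programs in Python strings.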
@@ -6,6 +6,10 @@ that can indicate oversubscription (too many tasks for too few processors),
 overhead due to excessive context switching (e.g. a common shared lock for
 multiple threads), uneven workload distribution, too-granular tasks, and more.
 
+Alternatively, the same options are available for summarizing task off-CPU
+time, which helps understand how often threads are being descheduled and how
+long they spend waiting for I/O, locks, timers, and other causes of suspension.
+
 # ./cpudist.py
 Tracing on-CPU time... Hit Ctrl-C to end.
 ^C
@@ -155,6 +159,47 @@ pid = 5068
 This histogram was obtained while executing `dd if=/dev/zero of=/dev/null` with
 fairly large block sizes.
 
+You could also ask for an off-CPU report using the -O switch. Here's a
+histogram of task block times while the system is heavily loaded:
+
+# ./cpudist -O -p $(parprimes)
+Tracing off-CPU time... Hit Ctrl-C to end.
+^C
+     usecs               : count     distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 1        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 0        |                                        |
+        32 -> 63         : 3        |                                        |
+        64 -> 127        : 1        |                                        |
+       128 -> 255        : 1        |                                        |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 2        |                                        |
+      1024 -> 2047       : 4        |                                        |
+      2048 -> 4095       : 3        |                                        |
+      4096 -> 8191       : 70       |***                                     |
+      8192 -> 16383      : 867      |****************************************|
+     16384 -> 32767      : 141      |******                                  |
+     32768 -> 65535      : 8        |                                        |
+     65536 -> 131071     : 0        |                                        |
+    131072 -> 262143     : 1        |                                        |
+    262144 -> 524287     : 2        |                                        |
+    524288 -> 1048575    : 3        |                                        |
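The bucket boundaries in this histogram are powers of two, which is how bcc's log2 histograms slot values. A rough illustration of the slotting (our own sketch, not bcc's actual implementation):

```python
def log2_slot(usecs):
    # Slot 0 holds values 0..1; slot n (n >= 1) holds 2**n .. 2**(n+1)-1,
    # matching rows like "8192 -> 16383" in the output above.
    slot = 0
    while usecs > 1:
        usecs >>= 1
        slot += 1
    return slot

# A 10000-usec block lands in the "8192 -> 16383" row:
print(log2_slot(10000))  # → 13
```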
+
+As you can see, threads are switching out for relatively long intervals, even
+though we know the workload doesn't have any significant blocking. This can be
+a result of over-subscription -- too many threads contending over too few CPUs.
+Indeed, there are four available CPUs and more than four runnable threads:
+
+# nproc
+4
+# cat /proc/loadavg
+0.04 0.11 0.06 9/147 7494
+
+(This shows we have 9 threads runnable out of 147 total. This is more than 4,
+the number of available CPUs.)
+
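The runnable/total pair read off above comes from the fourth field of /proc/loadavg. A small sketch of extracting it (the function name is ours):

```python
def runnable_counts(loadavg_line):
    # The 4th whitespace-separated field of /proc/loadavg is
    # "runnable/total", e.g. "9/147": currently runnable scheduling
    # entities vs. the total number that exist.
    runnable, total = loadavg_line.split()[3].split("/")
    return int(runnable), int(total)

# Using the sample line from the session above:
print(runnable_counts("0.04 0.11 0.06 9/147 7494"))  # → (9, 147)
```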
 Finally, let's ask for a per-thread report and values in milliseconds instead
 of microseconds:
@@ -235,7 +280,7 @@ USAGE message:
 # ./cpudist.py -h
-usage: cpudist.py [-h] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
+usage: cpudist.py [-h] [-O] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
 
 Summarize on-CPU time per task as a histogram.
@@ -245,6 +290,7 @@ positional arguments:
 optional arguments:
   -h, --help          show this help message and exit
+  -O, --offcpu        measure off-CPU time
   -T, --timestamp     include timestamp on output
   -m, --milliseconds  millisecond histogram
   -P, --pids          print a histogram per process ID
@@ -253,6 +299,7 @@ optional arguments:
 examples:
     cpudist              # summarize on-CPU time as a histogram
+    cpudist -O           # summarize off-CPU time as a histogram
     cpudist 1 10         # print 1 second summaries, 10 times
     cpudist -mT 1        # 1s summaries, milliseconds, and timestamps
     cpudist -P           # show each PID separately
...