Commit 97712b05 authored by Brenden Blanco's avatar Brenden Blanco Committed by GitHub

Merge pull request #586 from goldshtn/offcpudist

cpudist: Support off-cpu time reports
parents d1b62087 bee8d360
...@@ -78,7 +78,7 @@ Examples: ...@@ -78,7 +78,7 @@ Examples:
- tools/[btrfsdist](tools/btrfsdist.py): Summarize btrfs operation latency distribution as a histogram. [Examples](tools/btrfsdist_example.txt). - tools/[btrfsdist](tools/btrfsdist.py): Summarize btrfs operation latency distribution as a histogram. [Examples](tools/btrfsdist_example.txt).
- tools/[btrfsslower](tools/btrfsslower.py): Trace slow btrfs operations. [Examples](tools/btrfsslower_example.txt). - tools/[btrfsslower](tools/btrfsslower.py): Trace slow btrfs operations. [Examples](tools/btrfsslower_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt). - tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[cpudist](tools/cpudist.py): Summarize on-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt) - tools/[cpudist](tools/cpudist.py): Summarize on- and off-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt)
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt). - tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt). - tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt). - tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
......
.TH cpudist 8 "2016-06-28" "USER COMMANDS" .TH cpudist 8 "2016-06-28" "USER COMMANDS"
.SH NAME .SH NAME
cpudist \- On-CPU task time as a histogram. cpudist \- On- and off-CPU task time as a histogram.
.SH SYNOPSIS .SH SYNOPSIS
.B cpudist [\-h] [\-T] [\-m] [\-P] [\-L] [\-p PID] [interval] [count] .B cpudist [\-h] [-O] [\-T] [\-m] [\-P] [\-L] [\-p PID] [interval] [count]
.SH DESCRIPTION .SH DESCRIPTION
This measures the time a task spends on the CPU before being descheduled, and This measures the time a task spends on the CPU before being descheduled, and
shows the times as a histogram. Tasks that spend a very short time on the CPU shows the times as a histogram. Tasks that spend a very short time on the CPU
...@@ -10,15 +10,15 @@ can be indicative of excessive context-switches and poor workload distribution, ...@@ -10,15 +10,15 @@ can be indicative of excessive context-switches and poor workload distribution,
and possibly point to a shared source of contention that keeps tasks switching and possibly point to a shared source of contention that keeps tasks switching
in and out as it becomes available (such as a mutex). in and out as it becomes available (such as a mutex).
Similarly, the tool can also measure the time a task spends off-CPU before it
is scheduled again. This can be helpful in identifying long blocking and I/O
operations, or alternatively very short descheduling times due to short-lived
locks or timers.
This tool uses in-kernel eBPF maps for storing timestamps and the histogram, This tool uses in-kernel eBPF maps for storing timestamps and the histogram,
for efficiency. Despite this, the overhead of this tool may become significant for efficiency. Despite this, the overhead of this tool may become significant
for some workloads: see the OVERHEAD section. for some workloads: see the OVERHEAD section.
This tool uses the sched:sched_switch kernel tracepoint to determine when a
task is scheduled and descheduled. If the tracepoint arguments change in the
future, this tool will have to be updated. Still, it is more reliable than
using kprobes on the respective kernel functions directly.
Since this uses BPF, only the root user can use this tool. Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS .SH REQUIREMENTS
CONFIG_BPF and bcc. CONFIG_BPF and bcc.
...@@ -27,6 +27,9 @@ CONFIG_BPF and bcc. ...@@ -27,6 +27,9 @@ CONFIG_BPF and bcc.
\-h \-h
Print usage message. Print usage message.
.TP .TP
\-O
Measure off-CPU time instead of on-CPU time.
.TP
\-T \-T
Include timestamps on output. Include timestamps on output.
.TP .TP
...@@ -53,6 +56,10 @@ Summarize task on-CPU time as a histogram: ...@@ -53,6 +56,10 @@ Summarize task on-CPU time as a histogram:
# #
.B cpudist .B cpudist
.TP .TP
Summarize task off-CPU time as a histogram:
#
.B cpudist -O
.TP
Print 1 second summaries, 10 times: Print 1 second summaries, 10 times:
# #
.B cpudist 1 10 .B cpudist 1 10
......
#!/usr/bin/python #!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports # @lint-avoid-python-3-compatibility-imports
# #
# cpudist Summarize on-CPU time per task as a histogram. # cpudist Summarize on- and off-CPU time per task as a histogram.
# #
# USAGE: cpudist [-h] [-T] [-m] [-P] [-L] [-p PID] [interval] [count] # USAGE: cpudist [-h] [-O] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
# #
# This measures the time a task spends on the CPU, and shows this time as a # This measures the time a task spends on or off the CPU, and shows this time
# histogram, optionally per-process. # as a histogram, optionally per-process.
# #
# Copyright 2016 Sasha Goldshtein # Copyright 2016 Sasha Goldshtein
# Licensed under the Apache License, Version 2.0 (the "License") # Licensed under the Apache License, Version 2.0 (the "License")
...@@ -18,6 +18,7 @@ import argparse ...@@ -18,6 +18,7 @@ import argparse
examples = """examples: examples = """examples:
cpudist # summarize on-CPU time as a histogram cpudist # summarize on-CPU time as a histogram
cpudist -O # summarize off-CPU time as a histogram
cpudist 1 10 # print 1 second summaries, 10 times cpudist 1 10 # print 1 second summaries, 10 times
cpudist -mT 1 # 1s summaries, milliseconds, and timestamps cpudist -mT 1 # 1s summaries, milliseconds, and timestamps
cpudist -P # show each PID separately cpudist -P # show each PID separately
...@@ -27,6 +28,8 @@ parser = argparse.ArgumentParser( ...@@ -27,6 +28,8 @@ parser = argparse.ArgumentParser(
description="Summarize on-CPU time per task as a histogram.", description="Summarize on-CPU time per task as a histogram.",
formatter_class=argparse.RawDescriptionHelpFormatter, formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples) epilog=examples)
parser.add_argument("-O", "--offcpu", action="store_true",
help="measure off-CPU time")
parser.add_argument("-T", "--timestamp", action="store_true", parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output") help="include timestamp on output")
parser.add_argument("-m", "--milliseconds", action="store_true", parser.add_argument("-m", "--milliseconds", action="store_true",
...@@ -45,12 +48,12 @@ args = parser.parse_args() ...@@ -45,12 +48,12 @@ args = parser.parse_args()
countdown = int(args.count) countdown = int(args.count)
debug = 0 debug = 0
tp = Tracepoint.enable_tracepoint("sched", "sched_switch") bpf_text = """#include <uapi/linux/ptrace.h>
bpf_text = "#include <uapi/linux/ptrace.h>\n" #include <linux/sched.h>
bpf_text += "#include <linux/sched.h>\n" """
bpf_text += tp.generate_decl()
bpf_text += tp.generate_entry_probe() if not args.offcpu:
bpf_text += tp.generate_struct() bpf_text += "#define ONCPU\n"
bpf_text += """ bpf_text += """
typedef struct pid_key { typedef struct pid_key {
...@@ -58,54 +61,63 @@ typedef struct pid_key { ...@@ -58,54 +61,63 @@ typedef struct pid_key {
u64 slot; u64 slot;
} pid_key_t; } pid_key_t;
// We need to store the start time, which is when the thread got switched in,
// and the tgid for the pid because the sched_switch tracepoint doesn't provide
// that information.
BPF_HASH(start, u32, u64); BPF_HASH(start, u32, u64);
BPF_HASH(tgid_for_pid, u32, u32);
STORAGE STORAGE
int sched_switch(struct pt_regs *ctx) static inline void store_start(u32 tgid, u32 pid, u64 ts)
{ {
u64 pid_tgid = bpf_get_current_pid_tgid(); if (FILTER)
u64 *di = __trace_di.lookup(&pid_tgid); return;
if (di == 0)
return 0;
struct sched_switch_trace_entry args = {};
bpf_probe_read(&args, sizeof(args), (void *)*di);
u32 tgid, pid;
u64 ts = bpf_ktime_get_ns();
if (args.prev_state == TASK_RUNNING) {
pid = args.prev_pid;
u32 *stored_tgid = tgid_for_pid.lookup(&pid); start.update(&pid, &ts);
if (stored_tgid == 0) }
goto BAIL;
tgid = *stored_tgid;
static inline void update_hist(u32 tgid, u32 pid, u64 ts)
{
if (FILTER) if (FILTER)
goto BAIL; return;
u64 *tsp = start.lookup(&pid); u64 *tsp = start.lookup(&pid);
if (tsp == 0) if (tsp == 0)
goto BAIL; return;
if (ts < *tsp) {
// Probably a clock issue where the recorded on-CPU event had a
// timestamp later than the recorded off-CPU event, or vice versa.
return;
}
u64 delta = ts - *tsp; u64 delta = ts - *tsp;
FACTOR FACTOR
STORE STORE
}
int sched_switch(struct pt_regs *ctx, struct task_struct *prev)
{
u64 ts = bpf_ktime_get_ns();
u64 pid_tgid = bpf_get_current_pid_tgid();
u32 tgid = pid_tgid >> 32, pid = pid_tgid;
#ifdef ONCPU
if (prev->state == TASK_RUNNING) {
#else
if (1) {
#endif
u32 prev_pid = prev->pid;
u32 prev_tgid = prev->tgid;
#ifdef ONCPU
update_hist(prev_tgid, prev_pid, ts);
#else
store_start(prev_tgid, prev_pid, ts);
#endif
} }
BAIL: BAIL:
tgid = pid_tgid >> 32; #ifdef ONCPU
pid = pid_tgid; store_start(tgid, pid, ts);
if (FILTER) #else
return 0; update_hist(tgid, pid, ts);
#endif
start.update(&pid, &ts);
tgid_for_pid.update(&pid, &tgid);
return 0; return 0;
} }
...@@ -141,10 +153,10 @@ if debug: ...@@ -141,10 +153,10 @@ if debug:
print(bpf_text) print(bpf_text)
b = BPF(text=bpf_text) b = BPF(text=bpf_text)
Tracepoint.attach(b) b.attach_kprobe(event="finish_task_switch", fn_name="sched_switch")
b.attach_kprobe(event="perf_trace_sched_switch", fn_name="sched_switch")
print("Tracing on-CPU time... Hit Ctrl-C to end.") print("Tracing %s-CPU time... Hit Ctrl-C to end." %
("off" if args.offcpu else "on"))
exiting = 0 if args.interval else 1 exiting = 0 if args.interval else 1
dist = b.get_table("dist") dist = b.get_table("dist")
...@@ -158,7 +170,14 @@ while (1): ...@@ -158,7 +170,14 @@ while (1):
if args.timestamp: if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="") print("%-8s\n" % strftime("%H:%M:%S"), end="")
dist.print_log2_hist(label, section, section_print_fn=int) def pid_to_comm(pid):
try:
comm = open("/proc/%d/comm" % pid, "r").read()
return "%d %s" % (pid, comm)
except IOError:
return str(pid)
dist.print_log2_hist(label, section, section_print_fn=pid_to_comm)
dist.clear() dist.clear()
countdown -= 1 countdown -= 1
......
...@@ -6,6 +6,10 @@ that can indicate oversubscription (too many tasks for too few processors), ...@@ -6,6 +6,10 @@ that can indicate oversubscription (too many tasks for too few processors),
overhead due to excessive context switching (e.g. a common shared lock for overhead due to excessive context switching (e.g. a common shared lock for
multiple threads), uneven workload distribution, too-granular tasks, and more. multiple threads), uneven workload distribution, too-granular tasks, and more.
Alternatively, the same options are available for summarizing task off-CPU
time, which helps understand how often threads are being descheduled and how
long they spend waiting for I/O, locks, timers, and other causes of suspension.
# ./cpudist.py # ./cpudist.py
Tracing on-CPU time... Hit Ctrl-C to end. Tracing on-CPU time... Hit Ctrl-C to end.
^C ^C
...@@ -155,6 +159,47 @@ pid = 5068 ...@@ -155,6 +159,47 @@ pid = 5068
This histogram was obtained while executing `dd if=/dev/zero of=/dev/null` with This histogram was obtained while executing `dd if=/dev/zero of=/dev/null` with
fairly large block sizes. fairly large block sizes.
You could also ask for an off-CPU report using the -O switch. Here's a
histogram of task block times while the system is heavily loaded:
# ./cpudist -O -p $(parprimes)
Tracing off-CPU time... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 1 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 3 | |
64 -> 127 : 1 | |
128 -> 255 : 1 | |
256 -> 511 : 0 | |
512 -> 1023 : 2 | |
1024 -> 2047 : 4 | |
2048 -> 4095 : 3 | |
4096 -> 8191 : 70 |*** |
8192 -> 16383 : 867 |****************************************|
16384 -> 32767 : 141 |****** |
32768 -> 65535 : 8 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 1 | |
262144 -> 524287 : 2 | |
524288 -> 1048575 : 3 | |
As you can see, threads are switching out for relatively long intervals, even
though we know the workload doesn't have any significant blocking. This can be
a result of over-subscription -- too many threads contending over too few CPUs.
Indeed, there are four available CPUs and more than four runnable threads:
# nproc
4
# cat /proc/loadavg
0.04 0.11 0.06 9/147 7494
(This shows we have 9 threads runnable out of 147 total. This is more than 4,
the number of available CPUs.)
Finally, let's ask for a per-thread report and values in milliseconds instead Finally, let's ask for a per-thread report and values in milliseconds instead
of microseconds: of microseconds:
...@@ -235,7 +280,7 @@ USAGE message: ...@@ -235,7 +280,7 @@ USAGE message:
# ./cpudist.py -h # ./cpudist.py -h
usage: cpudist.py [-h] [-T] [-m] [-P] [-L] [-p PID] [interval] [count] usage: cpudist.py [-h] [-O] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
Summarize on-CPU time per task as a histogram. Summarize on-CPU time per task as a histogram.
...@@ -245,6 +290,7 @@ positional arguments: ...@@ -245,6 +290,7 @@ positional arguments:
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
-O, --offcpu measure off-CPU time
-T, --timestamp include timestamp on output -T, --timestamp include timestamp on output
-m, --milliseconds millisecond histogram -m, --milliseconds millisecond histogram
-P, --pids print a histogram per process ID -P, --pids print a histogram per process ID
...@@ -253,6 +299,7 @@ optional arguments: ...@@ -253,6 +299,7 @@ optional arguments:
examples: examples:
cpudist # summarize on-CPU time as a histogram cpudist # summarize on-CPU time as a histogram
cpudist -O # summarize off-CPU time as a histogram
cpudist 1 10 # print 1 second summaries, 10 times cpudist 1 10 # print 1 second summaries, 10 times
cpudist -mT 1 # 1s summaries, milliseconds, and timestamps cpudist -mT 1 # 1s summaries, milliseconds, and timestamps
cpudist -P # show each PID separately cpudist -P # show each PID separately
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment