Commit 97712b05, authored Jun 30, 2016 by Brenden Blanco; committed by GitHub, Jun 30, 2016
Merge pull request #586 from goldshtn/offcpudist
cpudist: Support off-cpu time reports
Parents: d1b62087, bee8d360

Showing 4 changed files with 132 additions and 59 deletions:

README.md                  +1   -1
man/man8/cpudist.8         +14  -7
tools/cpudist.py           +69  -50
tools/cpudist_example.txt  +48  -1
README.md
...
@@ -78,7 +78,7 @@ Examples:
 - tools/[btrfsdist](tools/btrfsdist.py): Summarize btrfs operation latency distribution as a histogram. [Examples](tools/btrfsdist_example.txt).
 - tools/[btrfsslower](tools/btrfsslower.py): Trace slow btrfs operations. [Examples](tools/btrfsslower_example.txt).
 - tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
-- tools/[cpudist](tools/cpudist.py): Summarize on-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt)
+- tools/[cpudist](tools/cpudist.py): Summarize on- and off-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt)
 - tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
 - tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
 - tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
...
man/man8/cpudist.8
 .TH cpudist 8 "2016-06-28" "USER COMMANDS"
 .SH NAME
-cpudist \- On-CPU task time as a histogram.
+cpudist \- On- and off-CPU task time as a histogram.
 .SH SYNOPSIS
-.B cpudist [\-h] [\-T] [\-m] [\-P] [\-L] [\-p PID] [interval] [count]
+.B cpudist [\-h] [-O] [\-T] [\-m] [\-P] [\-L] [\-p PID] [interval] [count]
 .SH DESCRIPTION
 This measures the time a task spends on the CPU before being descheduled, and
 shows the times as a histogram. Tasks that spend a very short time on the CPU
...
@@ -10,15 +10,15 @@ can be indicative of excessive context-switches and poor workload distribution,
 and possibly point to a shared source of contention that keeps tasks switching
 in and out as it becomes available (such as a mutex).

+Similarly, the tool can also measure the time a task spends off-CPU before it
+is scheduled again. This can be helpful in identifying long blocking and I/O
+operations, or alternatively very short descheduling times due to short-lived
+locks or timers.

 This tool uses in-kernel eBPF maps for storing timestamps and the histogram,
 for efficiency. Despite this, the overhead of this tool may become significant
 for some workloads: see the OVERHEAD section.

-This tool uses the sched:sched_switch kernel tracepoint to determine when a
-task is scheduled and descheduled. If the tracepoint arguments change in the
-future, this tool will have to be updated. Still, it is more reliable than
-using kprobes on the respective kernel functions directly.

 Since this uses BPF, only the root user can use this tool.
 .SH REQUIREMENTS
 CONFIG_BPF and bcc.
...
@@ -27,6 +27,9 @@ CONFIG_BPF and bcc.
 \-h
 Print usage message.
 .TP
+\-O
+Measure off-CPU time instead of on-CPU time.
+.TP
 \-T
 Include timestamps on output.
 .TP
...
@@ -53,6 +56,10 @@ Summarize task on-CPU time as a histogram:
 #
 .B cpudist
 .TP
+Summarize task off-CPU time as a histogram:
+#
+.B cpudist -O
+.TP
 Print 1 second summaries, 10 times:
 #
 .B cpudist 1 10
...
tools/cpudist.py
 #!/usr/bin/python
 # @lint-avoid-python-3-compatibility-imports
 #
-# cpudist   Summarize on-CPU time per task as a histogram.
+# cpudist   Summarize on- and off-CPU time per task as a histogram.
 #
-# USAGE: cpudist [-h] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
+# USAGE: cpudist [-h] [-O] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
 #
-# This measures the time a task spends on the CPU, and shows this time as a
-# histogram, optionally per-process.
+# This measures the time a task spends on or off the CPU, and shows this time
+# as a histogram, optionally per-process.
 #
 # Copyright 2016 Sasha Goldshtein
 # Licensed under the Apache License, Version 2.0 (the "License")
...
@@ -18,6 +18,7 @@ import argparse
 examples = """examples:
     cpudist              # summarize on-CPU time as a histogram
+    cpudist -O           # summarize off-CPU time as a histogram
     cpudist 1 10         # print 1 second summaries, 10 times
     cpudist -mT 1        # 1s summaries, milliseconds, and timestamps
     cpudist -P           # show each PID separately
...
@@ -27,6 +28,8 @@ parser = argparse.ArgumentParser(
     description="Summarize on-CPU time per task as a histogram.",
     formatter_class=argparse.RawDescriptionHelpFormatter,
     epilog=examples)
+parser.add_argument("-O", "--offcpu", action="store_true",
+    help="measure off-CPU time")
 parser.add_argument("-T", "--timestamp", action="store_true",
     help="include timestamp on output")
 parser.add_argument("-m", "--milliseconds", action="store_true",
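
As a rough illustration of how the new flag behaves once parsed, here is a minimal, self-contained sketch. It reproduces only the `-O` and `-T` arguments from the hunk above; `build_parser` is a hypothetical helper name used for this example and does not exist in `cpudist.py`.

```python
import argparse


def build_parser():
    """Hypothetical helper: a cut-down parser with two of the cpudist flags."""
    parser = argparse.ArgumentParser(
        description="Summarize on-CPU time per task as a histogram.")
    # store_true flags default to False and become True when present.
    parser.add_argument("-O", "--offcpu", action="store_true",
                        help="measure off-CPU time")
    parser.add_argument("-T", "--timestamp", action="store_true",
                        help="include timestamp on output")
    return parser


if __name__ == "__main__":
    # Short flags can be combined, as in the tool's "-mT" example.
    args = build_parser().parse_args(["-OT"])
    print(args.offcpu, args.timestamp)
```

Because `-O` is a `store_true` flag, omitting it keeps the previous on-CPU behaviour as the default.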
...
@@ -45,12 +48,12 @@ args = parser.parse_args()
 countdown = int(args.count)
 debug = 0

-tp = Tracepoint.enable_tracepoint("sched", "sched_switch")
-bpf_text = "#include <uapi/linux/ptrace.h>\n"
-bpf_text += "#include <linux/sched.h>\n"
-bpf_text += tp.generate_decl()
-bpf_text += tp.generate_entry_probe()
-bpf_text += tp.generate_struct()
+bpf_text = """#include <uapi/linux/ptrace.h>
+#include <linux/sched.h>
+"""
+
+if not args.offcpu:
+    bpf_text += "#define ONCPU\n"

 bpf_text += """
 typedef struct pid_key {
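
The mode selection in this hunk happens in the generated C source rather than in Python control flow: on-CPU mode prepends `#define ONCPU`, and the `#ifdef ONCPU` blocks in the BPF program pick the matching branch at compile time. A minimal sketch of that string-building step, using a hypothetical `build_bpf_prefix` function that is not part of the tool:

```python
def build_bpf_prefix(offcpu):
    """Return the leading lines of the BPF program text.

    Hypothetical helper for illustration; cpudist.py builds this text inline.
    """
    text = "#include <uapi/linux/ptrace.h>\n#include <linux/sched.h>\n"
    if not offcpu:
        # On-CPU is the default mode, so the macro is defined unless -O is given.
        text += "#define ONCPU\n"
    return text


if __name__ == "__main__":
    print(build_bpf_prefix(offcpu=False))
```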
...
@@ -58,54 +61,63 @@ typedef struct pid_key {
     u64 slot;
 } pid_key_t;

-// We need to store the start time, which is when the thread got switched in,
-// and the tgid for the pid because the sched_switch tracepoint doesn't provide
-// that information.
 BPF_HASH(start, u32, u64);
-BPF_HASH(tgid_for_pid, u32, u32);
 STORAGE

-int sched_switch(struct pt_regs *ctx)
+static inline void store_start(u32 tgid, u32 pid, u64 ts)
 {
-    u64 pid_tgid = bpf_get_current_pid_tgid();
-    u64 *di = __trace_di.lookup(&pid_tgid);
-    if (di == 0)
-        return 0;
-    struct sched_switch_trace_entry args = {};
-    bpf_probe_read(&args, sizeof(args), (void *)*di);
-    u32 tgid, pid;
-    u64 ts = bpf_ktime_get_ns();
+    if (FILTER)
+        return;
-    if (args.prev_state == TASK_RUNNING) {
-        pid = args.prev_pid;
+    start.update(&pid, &ts);
+}
-        u32 *stored_tgid = tgid_for_pid.lookup(&pid);
-        if (stored_tgid == 0)
-            goto BAIL;
-        tgid = *stored_tgid;
+static inline void update_hist(u32 tgid, u32 pid, u64 ts)
+{
+    if (FILTER)
+        return;
-        if (FILTER)
-            goto BAIL;
+    u64 *tsp = start.lookup(&pid);
+    if (tsp == 0)
+        return;
-        u64 *tsp = start.lookup(&pid);
-        if (tsp == 0)
-            goto BAIL;
+    if (ts < *tsp) {
+        // Probably a clock issue where the recorded on-CPU event had a
+        // timestamp later than the recorded off-CPU event, or vice versa.
+        return;
+    }
+    u64 delta = ts - *tsp;
+    FACTOR
+    STORE
+}
-        u64 delta = ts - *tsp;
-        FACTOR
-        STORE
+int sched_switch(struct pt_regs *ctx, struct task_struct *prev)
+{
+    u64 ts = bpf_ktime_get_ns();
+    u64 pid_tgid = bpf_get_current_pid_tgid();
+    u32 tgid = pid_tgid >> 32, pid = pid_tgid;
+#ifdef ONCPU
+    if (prev->state == TASK_RUNNING) {
+#else
+    if (1) {
+#endif
+        u32 prev_pid = prev->pid;
+        u32 prev_tgid = prev->tgid;
+#ifdef ONCPU
+        update_hist(prev_tgid, prev_pid, ts);
+#else
+        store_start(prev_tgid, prev_pid, ts);
+#endif
     }

 BAIL:
-    tgid = pid_tgid >> 32;
-    pid = pid_tgid;
-    if (FILTER)
-        return 0;
-    start.update(&pid, &ts);
-    tgid_for_pid.update(&pid, &tgid);
+#ifdef ONCPU
+    store_start(tgid, pid, ts);
+#else
+    update_hist(tgid, pid, ts);
+#endif

     return 0;
 }
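
The new `sched_switch` body swaps the roles of `store_start` and `update_hist` depending on the mode: on-CPU mode closes the interval of the outgoing task and opens one for the incoming task, while off-CPU mode does the reverse. The following is a pure-Python simulation of that logic, offered as a sketch only. The event tuple layout and the `simulate` function are assumptions made for this example, the PID filter (`FILTER`) and histogram bucketing (`FACTOR`, `STORE`) are left out, and raw deltas are collected instead.

```python
from collections import defaultdict


def simulate(switches, oncpu=True):
    """Replay context switches and collect per-pid time deltas.

    `switches` is a list of (ts, prev_pid, prev_runnable, next_pid) tuples,
    a hypothetical stand-in for what the probe sees on each switch.
    """
    start = {}                  # plays the role of the BPF `start` hash
    deltas = defaultdict(list)  # stands in for the histogram storage

    def store_start(pid, ts):
        start[pid] = ts

    def update_hist(pid, ts):
        tsp = start.get(pid)
        # No recorded start, or timestamps out of order: skip, as in the C code.
        if tsp is None or ts < tsp:
            return
        deltas[pid].append(ts - tsp)

    for ts, prev_pid, prev_runnable, next_pid in switches:
        if oncpu:
            # Mirrors the `prev->state == TASK_RUNNING` check in the hunk.
            if prev_runnable:
                update_hist(prev_pid, ts)
            store_start(next_pid, ts)
        else:
            # Off-CPU: the outgoing task starts waiting, the incoming one stops.
            store_start(prev_pid, ts)
            update_hist(next_pid, ts)
    return dict(deltas)


if __name__ == "__main__":
    events = [(100, 1, True, 2), (250, 2, True, 1), (300, 1, True, 2)]
    print(simulate(events, oncpu=True))
    print(simulate(events, oncpu=False))
```

The same event stream yields complementary results in the two modes, which is why a single compile-time switch is enough to reuse the probe.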
...
@@ -141,10 +153,10 @@ if debug:
     print(bpf_text)

 b = BPF(text=bpf_text)
-Tracepoint.attach(b)
-b.attach_kprobe(event="perf_trace_sched_switch", fn_name="sched_switch")
+b.attach_kprobe(event="finish_task_switch", fn_name="sched_switch")
-print("Tracing on-CPU time... Hit Ctrl-C to end.")
+print("Tracing %s-CPU time... Hit Ctrl-C to end." %
+      ("off" if args.offcpu else "on"))

 exiting = 0 if args.interval else 1
 dist = b.get_table("dist")
...
@@ -158,7 +170,14 @@ while (1):
     if args.timestamp:
         print("%-8s\n" % strftime("%H:%M:%S"), end="")

-    dist.print_log2_hist(label, section, section_print_fn=int)
+    def pid_to_comm(pid):
+        try:
+            comm = open("/proc/%d/comm" % pid, "r").read()
+            return "%d %s" % (pid, comm)
+        except IOError:
+            return str(pid)
+
+    dist.print_log2_hist(label, section, section_print_fn=pid_to_comm)
     dist.clear()

     countdown -= 1
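
The `pid_to_comm` helper above labels each histogram section with the process name read from `/proc/<pid>/comm`, falling back to the bare PID when the file cannot be read (for example, when the process has already exited). The kernel terminates that file's content with a newline, which the committed helper passes through as-is. A standalone variant that strips the newline and closes the file explicitly might look like the sketch below; this is one possible variation, not what the commit does.

```python
import os


def pid_to_comm(pid):
    """Return "<pid> <comm>" if the process name is readable, else "<pid>"."""
    try:
        # /proc/<pid>/comm holds the task's command name followed by a newline.
        with open("/proc/%d/comm" % pid, "r") as f:
            comm = f.read().strip()
        return "%d %s" % (pid, comm)
    except IOError:
        # Process gone, or no /proc on this system: fall back to the number.
        return str(pid)


if __name__ == "__main__":
    print(pid_to_comm(os.getpid()))
```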
...
tools/cpudist_example.txt
...
@@ -6,6 +6,10 @@ that can indicate oversubscription (too many tasks for too few processors),
 overhead due to excessive context switching (e.g. a common shared lock for
 multiple threads), uneven workload distribution, too-granular tasks, and more.

+Alternatively, the same options are available for summarizing task off-CPU
+time, which helps understand how often threads are being descheduled and how
+long they spend waiting for I/O, locks, timers, and other causes of suspension.
+
 # ./cpudist.py
 Tracing on-CPU time... Hit Ctrl-C to end.
 ^C
...
@@ -155,6 +159,47 @@ pid = 5068
 This histogram was obtained while executing `dd if=/dev/zero of=/dev/null` with
 fairly large block sizes.

+You could also ask for an off-CPU report using the -O switch. Here's a
+histogram of task block times while the system is heavily loaded:
+
+# ./cpudist -O -p $(parprimes)
+Tracing off-CPU time... Hit Ctrl-C to end.
+^C
+     usecs               : count     distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 1        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 0        |                                        |
+        32 -> 63         : 3        |                                        |
+        64 -> 127        : 1        |                                        |
+       128 -> 255        : 1        |                                        |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 2        |                                        |
+      1024 -> 2047       : 4        |                                        |
+      2048 -> 4095       : 3        |                                        |
+      4096 -> 8191       : 70       |***                                     |
+      8192 -> 16383      : 867      |****************************************|
+     16384 -> 32767      : 141      |******                                  |
+     32768 -> 65535      : 8        |                                        |
+     65536 -> 131071     : 0        |                                        |
+    131072 -> 262143     : 1        |                                        |
+    262144 -> 524287     : 2        |                                        |
+    524288 -> 1048575    : 3        |                                        |
+
+As you can see, threads are switching out for relatively long intervals, even
+though we know the workload doesn't have any significant blocking. This can be
+a result of over-subscription -- too many threads contending over too few CPUs.
+Indeed, there are four available CPUs and more than four runnable threads:
+
+# nproc
+4
+# cat /proc/loadavg
+0.04 0.11 0.06 9/147 7494
+
+(This shows we have 9 threads runnable out of 147 total. This is more than 4,
+the number of available CPUs.)
+
 Finally, let's ask for a per-thread report and values in milliseconds instead
 of microseconds:
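
The over-subscription check in the example text (9 runnable threads against 4 CPUs) can be expressed as a small script. The sketch below parses a `/proc/loadavg` line, whose fourth field has the form `runnable/total`, and compares the runnable count with a CPU count. The function names are hypothetical and chosen for this example; the input line is the one shown in the example above.

```python
def runnable_threads(loadavg_line):
    """Return (runnable, total) scheduling entities from a /proc/loadavg line."""
    fields = loadavg_line.split()
    # Fourth field looks like "9/147": currently runnable / currently existing.
    runnable, total = fields[3].split("/")
    return int(runnable), int(total)


def oversubscribed(loadavg_line, ncpus):
    """True when more threads are runnable than there are CPUs."""
    runnable, _ = runnable_threads(loadavg_line)
    return runnable > ncpus


if __name__ == "__main__":
    line = "0.04 0.11 0.06 9/147 7494"
    print(runnable_threads(line), oversubscribed(line, ncpus=4))
```

On a live system the line could be read from `/proc/loadavg` and the CPU count taken from `os.cpu_count()`, though the runnable figure is an instantaneous sample and fluctuates between reads.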
...
@@ -235,7 +280,7 @@ USAGE message:
 # ./cpudist.py -h
-usage: cpudist.py [-h] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]
+usage: cpudist.py [-h] [-O] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]

 Summarize on-CPU time per task as a histogram.
...
@@ -245,6 +290,7 @@ positional arguments:
 optional arguments:
   -h, --help          show this help message and exit
+  -O, --offcpu        measure off-CPU time
   -T, --timestamp     include timestamp on output
   -m, --milliseconds  millisecond histogram
   -P, --pids          print a histogram per process ID
...
@@ -253,6 +299,7 @@ optional arguments:
 examples:
     cpudist              # summarize on-CPU time as a histogram
+    cpudist -O           # summarize off-CPU time as a histogram
     cpudist 1 10         # print 1 second summaries, 10 times
     cpudist -mT 1        # 1s summaries, milliseconds, and timestamps
     cpudist -P           # show each PID separately
...