Commit a3bfe09c authored by Brendan Gregg's avatar Brendan Gregg Committed by GitHub

Merge pull request #1444 from sandip4n/update-tools-wakeuptime

Make 'tools/wakeuptime' use stack maps
parents b7604530 ab01e24d
...@@ -2,7 +2,7 @@ ...@@ -2,7 +2,7 @@
.SH NAME .SH NAME
wakeuptime \- Summarize sleep to wakeup time by waker kernel stack. Uses Linux eBPF/bcc. wakeuptime \- Summarize sleep to wakeup time by waker kernel stack. Uses Linux eBPF/bcc.
.SH SYNOPSIS .SH SYNOPSIS
.B wakeuptime [\-h] [\-u] [\-p PID] [\-v] [\-f] [duration] .B wakeuptime [\-h] [\-u] [\-p PID] [\-v] [\-f] [\-\-stack-storage-size STACK_STORAGE_SIZE] [\-m MIN_BLOCK_TIME] [\-M MAX_BLOCK_TIME] [duration]
.SH DESCRIPTION .SH DESCRIPTION
This program shows the kernel stack traces for threads that woke up other This program shows the kernel stack traces for threads that woke up other
blocked threads, along with the process names of the waker and target, along blocked threads, along with the process names of the waker and target, along
...@@ -23,10 +23,8 @@ of the blocked thread. wakeuptime shows the stack trace of the waker thread. ...@@ -23,10 +23,8 @@ of the blocked thread. wakeuptime shows the stack trace of the waker thread.
See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
The stack depth is currently limited to 20, and the stack traces are kernel This tool only works on Linux 4.6+. It uses the new `BPF_STACK_TRACE` table
mode only. Check for newer versions where either may be improved. APIs to generate the in-kernel stack traces.
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS .SH REQUIREMENTS
CONFIG_BPF and bcc. CONFIG_BPF and bcc.
.SH OPTIONS .SH OPTIONS
...@@ -46,8 +44,17 @@ Show raw addresses (for non-folded format). ...@@ -46,8 +44,17 @@ Show raw addresses (for non-folded format).
\-p PID \-p PID
Trace this process ID only (filtered in-kernel). Trace this process ID only (filtered in-kernel).
.TP .TP
\-\-stack-storage-size STACK_STORAGE_SIZE
Change the number of unique stack traces that can be stored and displayed.
.TP
duration duration
Duration to trace, in seconds. Duration to trace, in seconds.
.TP
\-m MIN_BLOCK_TIME
The amount of time in microseconds over which we store traces (default 1)
.TP
\-M MAX_BLOCK_TIME
The amount of time in microseconds under which we store traces (default U64_MAX)
.SH EXAMPLES .SH EXAMPLES
.TP .TP
Trace all thread blocking events, and summarize (in-kernel) by kernel stack trace and total blocked time: Trace all thread blocking events, and summarize (in-kernel) by kernel stack trace and total blocked time:
...@@ -77,8 +84,8 @@ printed. While these techniques greatly lower overhead, scheduler events are ...@@ -77,8 +84,8 @@ printed. While these techniques greatly lower overhead, scheduler events are
still a high frequency event, as they can exceed 1 million events per second, still a high frequency event, as they can exceed 1 million events per second,
and so caution should still be used. Test before production use. and so caution should still be used. Test before production use.
If the overhead is still a problem, take a look at the MINBLOCK_US tunable in If the overhead is still a problem, take a look at the min block option.
the code. If your aim is to chase down longer blocking events, then this could If your aim is to chase down longer blocking events, then this could
be increased to filter shorter blocking events, further lowering overhead. be increased to filter shorter blocking events, further lowering overhead.
.SH SOURCE .SH SOURCE
This is from bcc. This is from bcc.
......
...@@ -366,6 +366,7 @@ class SmokeTests(TestCase): ...@@ -366,6 +366,7 @@ class SmokeTests(TestCase):
def test_vfsstat(self): def test_vfsstat(self):
self.run_with_duration("vfsstat.py 1 1") self.run_with_duration("vfsstat.py 1 1")
@skipUnless(kernel_version_ge(4,6), "requires kernel >= 4.6")
def test_wakeuptime(self): def test_wakeuptime(self):
self.run_with_duration("wakeuptime.py 1") self.run_with_duration("wakeuptime.py 1")
......
#!/usr/bin/python
#
# wakeuptime Summarize sleep to wakeup time by waker kernel stack
# For Linux, uses BCC, eBPF.
#
# USAGE: wakeuptime [-h] [-u] [-p PID] [-v] [-f] [duration]
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 21 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 14-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./wakeuptime # trace blocked time with waker stacks
./wakeuptime 5 # trace for 5 seconds only
./wakeuptime -f 5 # 5 seconds, and output in folded format
./wakeuptime -u # don't include kernel threads (user only)
./wakeuptime -p 185 # trace fo PID 185 only
"""
parser = argparse.ArgumentParser(
description="Summarize sleep to wakeup time by waker kernel stack",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-u", "--useronly", action="store_true",
help="user threads only (no kernel threads)")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("-f", "--folded", action="store_true",
help="output folded format")
parser.add_argument("duration", nargs="?", default=99999999,
help="duration of trace, in seconds")
args = parser.parse_args()
folded = args.folded
duration = int(args.duration)
debug = 0
maxdepth = 20 # and MAXDEPTH
if args.pid and args.useronly:
print("ERROR: use either -p or -u.")
exit()
# signal handler
def signal_ignore(signal, frame):
print()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#define MAXDEPTH 20
#define MINBLOCK_US 1
struct key_t {
char waker[TASK_COMM_LEN];
char target[TASK_COMM_LEN];
// Skip saving the ip
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64/arm64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
return 0;
#ifdef __x86_64__
if (ret < __START_KERNEL_map)
#elif __aarch64__
if (ret < VA_START)
#else
#error "Unsupported architecture for stack walker"
#endif
return 0;
return ret;
}
return 0;
}
int offcpu(struct pt_regs *ctx) {
u32 pid = bpf_get_current_pid_tgid();
u64 ts = bpf_ktime_get_ns();
// XXX: should filter here too, but need task_struct
start.update(&pid, &ts);
return 0;
}
int waker(struct pt_regs *ctx, struct task_struct *p) {
u32 pid = p->pid;
u64 delta, *tsp, ts;
tsp = start.lookup(&pid);
if (tsp == 0)
return 0; // missed start
start.delete(&pid);
if (FILTER)
return 0;
// calculate delta time
delta = bpf_ktime_get_ns() - *tsp;
delta = delta / 1000;
if (delta < MINBLOCK_US)
return 0;
struct key_t key = {};
u64 zero = 0, *val, bp = 0;
int depth = 0;
bpf_probe_read(&key.target, sizeof(key.target), p->comm);
bpf_get_current_comm(&key.waker, sizeof(key.waker));
bp = PT_REGS_FP(ctx);
// unrolled loop (MAXDEPTH):
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val) += delta;
return 0;
}
"""
if args.pid:
filter = 'pid != %s' % args.pid
elif args.useronly:
filter = 'p->flags & PF_KTHREAD'
else:
filter = '0'
bpf_text = bpf_text.replace('FILTER', filter)
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="schedule", fn_name="offcpu")
b.attach_kprobe(event="try_to_wake_up", fn_name="waker")
matched = b.num_open_kprobes()
if matched == 0:
print("0 functions traced. Exiting.")
exit()
# header
if not folded:
print("Tracing blocked time (us) by kernel stack", end="")
if duration < 99999999:
print(" for %d secs." % duration)
else:
print("... Hit Ctrl-C to end.")
# output
while (1):
try:
sleep(duration)
except KeyboardInterrupt:
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, signal_ignore)
if not folded:
print()
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
if folded:
# print folded stack output
line = k.waker.decode() + ";"
for i in reversed(range(0, maxdepth)):
if k.ret[i] == 0:
continue
line = line + b.ksym(k.ret[i])
if i != 0:
line = line + ";"
print("%s;%s %d" % (line, k.target.decode(), v.value))
else:
# print default multi-line stack output
print(" %-16s %s" % ("target:", k.target.decode()))
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print(" %-16x %s" % (k.ret[i],
b.ksym(k.ret[i])))
print(" %-16s %s" % ("waker:", k.waker.decode()))
print(" %d\n" % v.value)
counts.clear()
if not folded:
print("Detaching...")
exit()
...@@ -5,12 +5,6 @@ ...@@ -5,12 +5,6 @@
# #
# USAGE: wakeuptime [-h] [-u] [-p PID] [-v] [-f] [duration] # USAGE: wakeuptime [-h] [-u] [-p PID] [-v] [-f] [duration]
# #
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 21 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc. # Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License") # Licensed under the Apache License, Version 2.0 (the "License")
# #
...@@ -21,6 +15,25 @@ from bcc import BPF ...@@ -21,6 +15,25 @@ from bcc import BPF
from time import sleep, strftime from time import sleep, strftime
import argparse import argparse
import signal import signal
import errno
from sys import stderr
# arg validation
def positive_int(val):
try:
ival = int(val)
except ValueError:
raise argparse.ArgumentTypeError("must be an integer")
if ival < 0:
raise argparse.ArgumentTypeError("must be positive")
return ival
def positive_nonzero_int(val):
ival = positive_int(val)
if ival == 0:
raise argparse.ArgumentTypeError("must be nonzero")
return ival
# arguments # arguments
examples = """examples: examples = """examples:
...@@ -37,21 +50,33 @@ parser = argparse.ArgumentParser( ...@@ -37,21 +50,33 @@ parser = argparse.ArgumentParser(
parser.add_argument("-u", "--useronly", action="store_true", parser.add_argument("-u", "--useronly", action="store_true",
help="user threads only (no kernel threads)") help="user threads only (no kernel threads)")
parser.add_argument("-p", "--pid", parser.add_argument("-p", "--pid",
type=positive_int,
help="trace this PID only") help="trace this PID only")
parser.add_argument("-v", "--verbose", action="store_true", parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses") help="show raw addresses")
parser.add_argument("-f", "--folded", action="store_true", parser.add_argument("-f", "--folded", action="store_true",
help="output folded format") help="output folded format")
parser.add_argument("--stack-storage-size", default=1024,
type=positive_nonzero_int,
help="the number of unique stack traces that can be stored and "
"displayed (default 1024)")
parser.add_argument("duration", nargs="?", default=99999999, parser.add_argument("duration", nargs="?", default=99999999,
type=positive_nonzero_int,
help="duration of trace, in seconds") help="duration of trace, in seconds")
parser.add_argument("-m", "--min-block-time", default=1,
type=positive_nonzero_int,
help="the amount of time in microseconds over which we " +
"store traces (default 1)")
parser.add_argument("-M", "--max-block-time", default=(1 << 64) - 1,
type=positive_nonzero_int,
help="the amount of time in microseconds under which we " +
"store traces (default U64_MAX)")
args = parser.parse_args() args = parser.parse_args()
folded = args.folded folded = args.folded
duration = int(args.duration) duration = int(args.duration)
debug = 0 debug = 0
maxdepth = 20 # and MAXDEPTH
if args.pid and args.useronly: if args.pid and args.useronly:
print("ERROR: use either -p or -u.") parser.error("use either -p or -u.")
exit()
# signal handler # signal handler
def signal_ignore(signal, frame): def signal_ignore(signal, frame):
...@@ -62,43 +87,27 @@ bpf_text = """ ...@@ -62,43 +87,27 @@ bpf_text = """
#include <uapi/linux/ptrace.h> #include <uapi/linux/ptrace.h>
#include <linux/sched.h> #include <linux/sched.h>
#define MAXDEPTH 20 #define MINBLOCK_US MINBLOCK_US_VALUEULL
#define MINBLOCK_US 1 #define MAXBLOCK_US MAXBLOCK_US_VALUEULL
struct key_t { struct key_t {
int w_k_stack_id;
char waker[TASK_COMM_LEN]; char waker[TASK_COMM_LEN];
char target[TASK_COMM_LEN]; char target[TASK_COMM_LEN];
// Skip saving the ip
u64 ret[MAXDEPTH];
}; };
BPF_HASH(counts, struct key_t); BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32); BPF_HASH(start, u32);
BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64/arm64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
return 0;
#ifdef __x86_64__
if (ret < __START_KERNEL_map)
#elif __aarch64__
if (ret < VA_START)
#else
#error "Unsupported architecture for stack walker"
#endif
return 0;
return ret;
}
return 0;
}
int offcpu(struct pt_regs *ctx) { int offcpu(struct pt_regs *ctx) {
u32 pid = bpf_get_current_pid_tgid(); u32 pid = bpf_get_current_pid_tgid();
u64 ts = bpf_ktime_get_ns(); struct task_struct *p = (struct task_struct *) bpf_get_current_task();
// XXX: should filter here too, but need task_struct u64 ts;
if (FILTER)
return 0;
ts = bpf_ktime_get_ns();
start.update(&pid, &ts); start.update(&pid, &ts);
return 0; return 0;
} }
...@@ -118,41 +127,16 @@ int waker(struct pt_regs *ctx, struct task_struct *p) { ...@@ -118,41 +127,16 @@ int waker(struct pt_regs *ctx, struct task_struct *p) {
// calculate delta time // calculate delta time
delta = bpf_ktime_get_ns() - *tsp; delta = bpf_ktime_get_ns() - *tsp;
delta = delta / 1000; delta = delta / 1000;
if (delta < MINBLOCK_US) if ((delta < MINBLOCK_US) || (delta > MAXBLOCK_US))
return 0; return 0;
struct key_t key = {}; struct key_t key = {};
u64 zero = 0, *val, bp = 0; u64 zero = 0, *val;
int depth = 0;
key.w_k_stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
bpf_probe_read(&key.target, sizeof(key.target), p->comm); bpf_probe_read(&key.target, sizeof(key.target), p->comm);
bpf_get_current_comm(&key.waker, sizeof(key.waker)); bpf_get_current_comm(&key.waker, sizeof(key.waker));
bp = PT_REGS_FP(ctx);
// unrolled loop (MAXDEPTH):
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero); val = counts.lookup_or_init(&key, &zero);
(*val) += delta; (*val) += delta;
return 0; return 0;
...@@ -165,6 +149,12 @@ elif args.useronly: ...@@ -165,6 +149,12 @@ elif args.useronly:
else: else:
filter = '0' filter = '0'
bpf_text = bpf_text.replace('FILTER', filter) bpf_text = bpf_text.replace('FILTER', filter)
# set stack storage size
bpf_text = bpf_text.replace('STACK_STORAGE_SIZE', str(args.stack_storage_size))
bpf_text = bpf_text.replace('MINBLOCK_US_VALUE', str(args.min_block_time))
bpf_text = bpf_text.replace('MAXBLOCK_US_VALUE', str(args.max_block_time))
if debug: if debug:
print(bpf_text) print(bpf_text)
...@@ -195,30 +185,43 @@ while (1): ...@@ -195,30 +185,43 @@ while (1):
if not folded: if not folded:
print() print()
missing_stacks = 0
has_enomem = False
counts = b.get_table("counts") counts = b.get_table("counts")
stack_traces = b.get_table("stack_traces")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value): for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
# handle get_stackid errors
# check for an ENOMEM error
if k.w_k_stack_id == -errno.ENOMEM:
missing_stacks += 1
continue
waker_kernel_stack = [] if k.w_k_stack_id < 1 else \
reversed(list(stack_traces.walk(k.w_k_stack_id))[1:])
if folded: if folded:
# print folded stack output # print folded stack output
line = k.waker.decode() + ";" line = \
for i in reversed(range(0, maxdepth)): [k.waker.decode()] + \
if k.ret[i] == 0: [b.ksym(addr)
continue for addr in reversed(list(waker_kernel_stack))] + \
line = line + b.ksym(k.ret[i]) [k.target.decode()]
if i != 0: print("%s %d" % (";".join(line), v.value))
line = line + ";"
print("%s;%s %d" % (line, k.target.decode(), v.value))
else: else:
# print default multi-line stack output # print default multi-line stack output
print(" %-16s %s" % ("target:", k.target.decode())) print(" %-16s %s" % ("target:", k.target.decode()))
for i in range(0, maxdepth): for addr in waker_kernel_stack:
if k.ret[i] == 0: print(" %-16x %s" % (addr, b.ksym(addr)))
break
print(" %-16x %s" % (k.ret[i],
b.ksym(k.ret[i])))
print(" %-16s %s" % ("waker:", k.waker.decode())) print(" %-16s %s" % ("waker:", k.waker.decode()))
print(" %d\n" % v.value) print(" %d\n" % v.value)
counts.clear() counts.clear()
if missing_stacks > 0:
enomem_str = " Consider increasing --stack-storage-size."
print("WARNING: %d stack traces could not be displayed.%s" %
(missing_stacks, enomem_str),
file=stderr)
if not folded: if not folded:
print("Detaching...") print("Detaching...")
exit() exit()
...@@ -448,20 +448,31 @@ creating your "wakeup time flame graphs". ...@@ -448,20 +448,31 @@ creating your "wakeup time flame graphs".
USAGE message: USAGE message:
# ./wakeuptime -h # ./wakeuptime -h
usage: wakeuptime [-h] [-u] [-p PID] [-v] [-f] [duration] usage: wakeuptime [-h] [-u] [-p PID] [-v] [-f]
[--stack-storage-size STACK_STORAGE_SIZE]
[-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME]
[duration]
Summarize sleep to wakeup time by waker kernel stack Summarize sleep to wakeup time by waker kernel stack
positional arguments: positional arguments:
duration duration of trace, in seconds duration duration of trace, in seconds
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
-u, --useronly user threads only (no kernel threads) -u, --useronly user threads only (no kernel threads)
-p PID, --pid PID trace this PID only -p PID, --pid PID trace this PID only
-v, --verbose show raw addresses -v, --verbose show raw addresses
-f, --folded output folded format -f, --folded output folded format
--stack-storage-size STACK_STORAGE_SIZE
the number of unique stack traces that can be stored
and displayed (default 1024)
-m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
the amount of time in microseconds over which we store
traces (default 1)
-M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
the amount of time in microseconds under which we
store traces (default U64_MAX)
examples: examples:
./wakeuptime # trace blocked time with waker stacks ./wakeuptime # trace blocked time with waker stacks
./wakeuptime 5 # trace for 5 seconds only ./wakeuptime 5 # trace for 5 seconds only
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment