Commit c256e94d authored by Brendan Gregg

Merge pull request #450 from vmg/vmg/stacks-next

Stack Traces: TNG
parents 478636be b4ebed0d
......@@ -3,7 +3,7 @@
memleak \- Print a summary of outstanding allocations and their call stacks to detect memory leaks. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND] [-s SAMPLE_RATE]
[-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
[-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
.SH DESCRIPTION
memleak traces and matches memory allocation and deallocation requests, and
collects call stacks for each allocation. memleak can then print a summary
......@@ -15,10 +15,7 @@ When tracing all processes, memleak instruments kmalloc and kfree.
memleak may introduce significant overhead when tracing processes that allocate
and free many blocks very quickly. See the OVERHEAD section below.
The stack depth is limited to 10 by default (+1 for the current instruction pointer),
but it can be controlled using the \-d switch if deeper stacks are required.
This currently only works on x86_64. Check for future versions.
This tool only works on Linux 4.6+. Stack traces are obtained using the new `BPF_STACK_TRACE` APIs.
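As a reviewer's aside, here is a minimal sketch of the BPF_STACK_TRACE flow that memleak now relies on (bcc Python API assumed; the probe and table names are illustrative, not the tool's actual source):
.nf
from bcc import BPF

# Count events per unique kernel stack; get_stackid() deduplicates
# stacks in-kernel and returns a small integer id for each one.
b = BPF(text="""
BPF_HASH(counts, int);
BPF_STACK_TRACE(stack_traces, 1024);

int trace_alloc(struct pt_regs *ctx, size_t size) {
    int stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
    u64 zero = 0;
    u64 *count = counts.lookup_or_init(&stack_id, &zero);
    (*count)++;
    return 0;
}
""")
b.attach_kprobe(event="__kmalloc", fn_name="trace_alloc")
.fi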
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......@@ -45,10 +42,6 @@ Run the specified command and trace its allocations only. This traces malloc and
\-s SAMPLE_RATE
Record roughly every SAMPLE_RATE-th allocation to reduce overhead.
.TP
\-d STACK_DEPTH
Capture STACK_DEPTH frames (or less) when obtaining allocation call stacks.
The default value is 10.
.TP
\-T TOP
Print only the top TOP stacks (sorted by size).
The default value is 10.
......@@ -106,9 +99,6 @@ placed in a typical period of 10 seconds:
#
.B perf stat -a -e 'probe:__kmalloc' -- sleep 10
Another setting that may help reduce overhead is lowering the number of stack
frames captured and parsed by memleak for each allocation, using the \-d switch.
.SH SOURCE
This is from bcc.
.IP
......
......@@ -22,10 +22,8 @@ especially the -f format, can be used to generate an "off-CPU time flame graph".
See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
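The folded output is designed to feed flame graph scripts directly; a typical pipeline (flamegraph.pl from the FlameGraph repository assumed to be installed; a sketch):
.B offcputime -f 5 > out.stacks
.B flamegraph.pl < out.stacks > out.svg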
The stack depth is currently limited to 20, and the stack traces are kernel
mode only. Check for newer versions where either may be improved.
This currently only works on x86_64. Check for future versions.
This tool only works on Linux 4.6+. It uses the new `BPF_STACK_TRACE` table
APIs to generate the in-kernel stack traces.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......
......@@ -12,10 +12,8 @@ and their occurrence counts.
The pattern is a string with optional '*' wildcards, similar to file globbing.
If you'd prefer to use regular expressions, use the \-r option.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This tool only works on Linux 4.6+. Stack traces are obtained using the new `BPF_STACK_TRACE` APIs.
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......@@ -82,13 +80,12 @@ Only count stacks when PID 185 is on-CPU:
This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of function calls than methods that post-process in user
space. The stack trace data is only copied to user space when the output is
printed, which usually only happens once. Given these techniques, I'd suspect
that call rates of < 10,000/sec would incur negligible overhead (for this
current version; future versions may improve this). Beyond that,
there will be a point where the overhead is measurable, as this does add
a number of instructions to each function call to walk and save stacks.
Test before production use. You can also use funccount to get a handle on
function call rates first.
printed, which usually only happens once. The stack walking also happens in an
optimized code path in the kernel, thanks to the new BPF_STACK_TRACE table APIs,
which should be more efficient than the manual eBPF stack walker that older
versions of this script used. With this in mind, call rates of <
10,000/sec should incur negligible overhead. Test before production use. You can
also use funccount to get a handle on function call rates first.
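As the last sentence suggests, funccount (a sibling bcc tool) can gauge the call rate before tracing stacks; an illustrative invocation:
.B funccount 'ip_output'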
.SH SOURCE
This is from bcc.
.IP
......
......@@ -9,10 +9,7 @@ kernel stack back trace for that call. This shows the ancestry of function
calls, and is a quick way to investigate low frequency kernel functions and
their cause. For high frequency kernel functions, see stackcount.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This currently only works on x86_64. Check for future versions.
This tool only works on Linux 4.6+. Stack traces are obtained using the new `BPF_STACK_TRACE` APIs.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......
......@@ -732,7 +732,7 @@ class BPF(object):
if idx == -1:
return "[unknown]"
offset = int(addr - ksyms[idx][1])
return ksyms[idx][0] + hex(offset)
return "%s+0x%x" % (ksyms[idx][0], offset)
@staticmethod
def ksymname(name):
......
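# Reviewer note, a quick illustration of the formatting fix above
# (hypothetical symbol and offset; hex() used to glue the offset on
# with no '+' separator):
sym, offset = "tcp_sendmsg", 26
print(sym + hex(offset))            # before: tcp_sendmsg0x1a
print("%s+0x%x" % (sym, offset))    # after:  tcp_sendmsg+0x1a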
......@@ -44,29 +44,33 @@ class Time(object):
raise OSError(errno_, os.strerror(errno_))
return t.tv_sec * 1e9 + t.tv_nsec
class StackDecoder(object):
class KStackDecoder(object):
def refresh(self):
pass
def __call__(self, addr):
return "%s [kernel] (%x)" % (BPF.ksym(addr), addr)
class UStackDecoder(object):
def __init__(self, pid):
self.pid = pid
if pid != -1:
self.proc_sym = ProcessSymbols(pid)
self.proc_sym = ProcessSymbols(pid)
def refresh(self):
if self.pid != -1:
self.proc_sym.refresh_code_ranges()
def decode_stack(self, info, is_kernel_trace):
stack = ""
if info.num_frames <= 0:
return "???"
for i in range(0, info.num_frames):
addr = info.callstack[i]
if is_kernel_trace:
stack += " %s [kernel] (%x) ;" % \
(BPF.ksym(addr), addr)
else:
stack += " %s (%x) ;" % \
(self.proc_sym.decode_addr(addr), addr)
return stack
self.proc_sym.refresh_code_ranges()
def __call__(self, addr):
return "%s (%x)" % (self.proc_sym.decode_addr(addr), addr)
class Allocation(object):
def __init__(self, stack, size):
self.stack = stack
self.count = 1
self.size = size
def update(self, size):
self.count += 1
self.size += size
def run_command_get_output(command):
p = subprocess.Popen(command.split(),
......@@ -125,8 +129,6 @@ parser.add_argument("-c", "--command",
help="execute and trace the specified command")
parser.add_argument("-s", "--sample-rate", default=1, type=int,
help="sample every N-th allocation to decrease the overhead")
parser.add_argument("-d", "--stack-depth", default=10, type=int,
help="maximum stack depth to capture")
parser.add_argument("-T", "--top", type=int, default=10,
help="display only this many top allocating stacks (by size)")
parser.add_argument("-z", "--min-size", type=int,
......@@ -144,7 +146,6 @@ interval = args.interval
min_age_ns = 1e6 * args.older
sample_every_n = args.sample_rate
num_prints = args.count
max_stack_size = args.stack_depth + 2
top_stacks = args.top
min_size = args.min_size
max_size = args.max_size
......@@ -163,33 +164,12 @@ bpf_source = """
struct alloc_info_t {
u64 size;
u64 timestamp_ns;
int num_frames;
u64 callstack[MAX_STACK_SIZE];
int stack_id;
};
BPF_HASH(sizes, u64);
BPF_HASH(allocs, u64, struct alloc_info_t);
// Adapted from https://github.com/iovisor/bcc/tools/offcputime.py
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
return ret;
}
return 0;
}
static int grab_stack(struct pt_regs *ctx, struct alloc_info_t *info)
{
int depth = 0;
u64 bp = ctx->bp;
GRAB_ONE_FRAME
return depth;
}
BPF_STACK_TRACE(stack_traces, 1024)
int alloc_enter(struct pt_regs *ctx, size_t size)
{
......@@ -223,12 +203,12 @@ int alloc_exit(struct pt_regs *ctx)
sizes.delete(&pid);
info.timestamp_ns = bpf_ktime_get_ns();
info.num_frames = grab_stack(ctx, &info) - 2;
info.stack_id = stack_traces.get_stackid(ctx, STACK_FLAGS);
allocs.update(&address, &info);
if (SHOULD_PRINT) {
bpf_trace_printk("alloc exited, size = %lu, result = %lx, frames = %d\\n",
info.size, address, info.num_frames);
bpf_trace_printk("alloc exited, size = %lu, result = %lx\\n",
info.size, address);
}
return 0;
}
......@@ -251,9 +231,6 @@ int free_enter(struct pt_regs *ctx, void *address)
"""
bpf_source = bpf_source.replace("SHOULD_PRINT", "1" if trace_all else "0")
bpf_source = bpf_source.replace("SAMPLE_EVERY_N", str(sample_every_n))
bpf_source = bpf_source.replace("GRAB_ONE_FRAME", max_stack_size *
"\tif (!(info->callstack[depth++] = get_frame(&bp))) return depth;\n")
bpf_source = bpf_source.replace("MAX_STACK_SIZE", str(max_stack_size))
size_filter = ""
if min_size is not None and max_size is not None:
......@@ -265,6 +242,11 @@ elif max_size is not None:
size_filter = "if (size > %d) return 0;" % max_size
bpf_source = bpf_source.replace("SIZE_FILTER", size_filter)
stack_flags = "BPF_F_REUSE_STACKID"
if not kernel_trace:
stack_flags += "|BPF_F_USER_STACK"
bpf_source = bpf_source.replace("STACK_FLAGS", stack_flags)
bpf_program = BPF(text=bpf_source)
if not kernel_trace:
......@@ -281,29 +263,31 @@ else:
bpf_program.attach_kretprobe(event="__kmalloc", fn_name="alloc_exit")
bpf_program.attach_kprobe(event="kfree", fn_name="free_enter")
decoder = StackDecoder(pid)
decoder = KStackDecoder() if kernel_trace else UStackDecoder(pid)
def print_outstanding():
stacks = {}
print("[%s] Top %d stacks with outstanding allocations:" %
(datetime.now().strftime("%H:%M:%S"), top_stacks))
allocs = bpf_program.get_table("allocs")
alloc_info = {}
allocs = bpf_program["allocs"]
stack_traces = bpf_program["stack_traces"]
for address, info in sorted(allocs.items(), key=lambda a: a[1].size):
if Time.monotonic_time() - min_age_ns < info.timestamp_ns:
continue
stack = decoder.decode_stack(info, kernel_trace)
if stack in stacks:
stacks[stack] = (stacks[stack][0] + 1,
stacks[stack][1] + info.size)
if info.stack_id < 0:
continue
if info.stack_id in alloc_info:
alloc_info[info.stack_id].update(info.size)
else:
stacks[stack] = (1, info.size)
stack = list(stack_traces.walk(info.stack_id, decoder))
alloc_info[info.stack_id] = Allocation(stack, info.size)
if args.show_allocs:
print("\taddr = %x size = %s" %
(address.value, info.size))
to_show = sorted(stacks.items(), key=lambda s: s[1][1])[-top_stacks:]
for stack, (count, size) in to_show:
to_show = sorted(alloc_info.values(), key=lambda a: a.size)[-top_stacks:]
for alloc in to_show:
print("\t%d bytes in %d allocations from stack\n\t\t%s" %
(size, count, stack.replace(";", "\n\t\t")))
(alloc.size, alloc.count, "\n\t\t".join(alloc.stack)))
count_so_far = 0
while True:
......
......@@ -9,8 +9,6 @@
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 21 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
......@@ -48,7 +46,6 @@ args = parser.parse_args()
folded = args.folded
duration = int(args.duration)
debug = 0
maxdepth = 20 # and MAXDEPTH
if args.pid and args.useronly:
print("ERROR: use either -p or -u.")
exit()
......@@ -62,31 +59,15 @@ bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#define MAXDEPTH 20
#define MINBLOCK_US 1
struct key_t {
char name[TASK_COMM_LEN];
// Skip saving the ip
u64 ret[MAXDEPTH];
int stack_id;
};
BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
BPF_STACK_TRACE(stack_traces, 1024)
int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
u32 pid;
......@@ -111,36 +92,12 @@ int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
return 0;
// create map key
u64 zero = 0, *val, bp = 0;
int depth = 0;
u64 zero = 0, *val;
struct key_t key = {};
bpf_get_current_comm(&key.name, sizeof(key.name));
bp = ctx->bp;
// unrolled loop (MAXDEPTH):
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
key.stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
val = counts.lookup_or_init(&key, &zero);
(*val) += delta;
return 0;
......@@ -183,24 +140,17 @@ while (1):
if not folded:
print()
counts = b.get_table("counts")
stack_traces = b.get_table("stack_traces")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
if folded:
# print folded stack output
line = k.name.decode() + ";"
for i in reversed(range(0, maxdepth)):
if k.ret[i] == 0:
continue
line = line + b.ksym(k.ret[i])
if i != 0:
line = line + ";"
print("%s %d" % (line, v.value))
stack = list(stack_traces.walk(k.stack_id))[1:]
line = [k.name.decode()] + [b.ksym(addr) for addr in reversed(stack)]
print("%s %d" % (";".join(line), v.value))
else:
# print default multi-line stack output
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print(" %-16x %s" % (k.ret[i],
b.ksym(k.ret[i])))
for addr in stack_traces.walk(k.stack_id):
print(" %-16x %s" % (addr, b.ksym(addr)))
print(" %-16s %s" % ("-", k.name))
print(" %d\n" % v.value)
counts.clear()
......
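# Reviewer note: the folded format emitted above is one line per unique
# stack, comm first, frames root-to-leaf, semicolon-separated, with the
# count last. A tiny sketch with hypothetical frames:
line = ["swapper/0"] + ["start_kernel", "cpu_idle"]
print("%s %d" % (";".join(line), 42))   # -> swapper/0;start_kernel;cpu_idle 42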
#!/usr/bin/python
#
# offcputime Summarize off-CPU time by kernel stack trace
# For Linux, uses BCC, eBPF.
#
# USAGE: offcputime [-h] [-u] [-p PID] [-v] [-f] [duration]
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 21 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 13-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./offcputime # trace off-CPU stack time until Ctrl-C
./offcputime 5 # trace for 5 seconds only
./offcputime -f 5 # 5 seconds, and output in folded format
./offcputime -u # don't include kernel threads (user only)
./offcputime -p 185 # trace for PID 185 only
"""
parser = argparse.ArgumentParser(
description="Summarize off-CPU time by kernel stack trace",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-u", "--useronly", action="store_true",
help="user threads only (no kernel threads)")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("-f", "--folded", action="store_true",
help="output folded format")
parser.add_argument("duration", nargs="?", default=99999999,
help="duration of trace, in seconds")
args = parser.parse_args()
folded = args.folded
duration = int(args.duration)
debug = 0
maxdepth = 20 # and MAXDEPTH
if args.pid and args.useronly:
print("ERROR: use either -p or -u.")
exit()
# signal handler
def signal_ignore(signal, frame):
print()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#define MAXDEPTH 20
#define MINBLOCK_US 1
struct key_t {
char name[TASK_COMM_LEN];
// Skip saving the ip
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
u32 pid;
u64 ts, *tsp;
// record previous thread sleep time
if (FILTER) {
pid = prev->pid;
ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
}
// calculate current thread's delta time
pid = bpf_get_current_pid_tgid();
tsp = start.lookup(&pid);
if (tsp == 0)
return 0; // missed start or filtered
u64 delta = bpf_ktime_get_ns() - *tsp;
start.delete(&pid);
delta = delta / 1000;
if (delta < MINBLOCK_US)
return 0;
// create map key
u64 zero = 0, *val, bp = 0;
int depth = 0;
struct key_t key = {};
bpf_get_current_comm(&key.name, sizeof(key.name));
bp = ctx->bp;
// unrolled loop (MAXDEPTH):
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val) += delta;
return 0;
}
"""
if args.pid:
filter = 'pid == %s' % args.pid
elif args.useronly:
filter = '!(prev->flags & PF_KTHREAD)'
else:
filter = '1'
bpf_text = bpf_text.replace('FILTER', filter)
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")
matched = b.num_open_kprobes()
if matched == 0:
print("0 functions traced. Exiting.")
exit()
# header
if not folded:
print("Tracing off-CPU time (us) by kernel stack", end="")
if duration < 99999999:
print(" for %d secs." % duration)
else:
print("... Hit Ctrl-C to end.")
# output
while (1):
try:
sleep(duration)
except KeyboardInterrupt:
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, signal_ignore)
if not folded:
print()
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
if folded:
# print folded stack output
line = k.name.decode() + ";"
for i in reversed(range(0, maxdepth)):
if k.ret[i] == 0:
continue
line = line + b.ksym(k.ret[i])
if i != 0:
line = line + ";"
print("%s %d" % (line, v.value))
else:
# print default multi-line stack output
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print(" %-16x %s" % (k.ret[i],
b.ksym(k.ret[i])))
print(" %-16s %s" % ("-", k.name))
print(" %d\n" % v.value)
counts.clear()
if not folded:
print("Detaching...")
exit()
#!/usr/bin/python
#
# stackcount Count kernel function calls and their stack traces.
# For Linux, uses BCC, eBPF.
#
# USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] pattern
#
# The pattern is a string with optional '*' wildcards, similar to file
# globbing. If you'd prefer to use regular expressions, use the -r option.
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 11 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
"""
parser = argparse.ArgumentParser(
description="Count kernel function calls and their stack traces",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-i", "--interval", default=99999999,
help="summary interval, seconds")
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-r", "--regexp", action="store_true",
help="use regular expressions. Default is \"*\" wildcards only.")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("pattern",
help="search expression for kernel functions")
args = parser.parse_args()
pattern = args.pattern
if not args.regexp:
pattern = pattern.replace('*', '.*')
pattern = '^' + pattern + '$'
offset = args.offset
verbose = args.verbose
debug = 0
maxdepth = 10 # and MAXDEPTH
# signal handler
def signal_ignore(signal, frame):
print()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#define MAXDEPTH 10
struct key_t {
u64 ip;
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
int trace_count(struct pt_regs *ctx) {
FILTER
struct key_t key = {};
u64 zero = 0, *val, bp = 0;
int depth = 0;
key.ip = ctx->ip;
bp = ctx->bp;
// unrolled loop, 10 (MAXDEPTH) frames deep:
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
'if (pid != %s) { return 0; }') % (args.pid))
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
b = BPF(text=bpf_text)
b.attach_kprobe(event_re=pattern, fn_name="trace_count")
matched = b.num_open_kprobes()
if matched == 0:
print("0 functions matched by \"%s\". Exiting." % args.pattern)
exit()
# header
print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." %
(matched, args.pattern))
def print_frame(addr):
print(" ", end="")
if verbose:
print("%-16x " % addr, end="")
if offset:
print("%s" % b.ksymaddr(addr))
else:
print("%s" % b.ksym(addr))
# output
exiting = 0 if args.interval else 1
while (1):
try:
sleep(int(args.interval))
except KeyboardInterrupt:
exiting = 1
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, signal_ignore)
print()
if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
print_frame(k.ip)
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print_frame(k.ret[i])
print(" %d\n" % v.value)
counts.clear()
if exiting:
print("Detaching...")
exit()
#!/usr/bin/python
#
# stacksnoop Trace a kernel function and print all kernel stack traces.
# For Linux, uses BCC, eBPF, and currently x86_64 only. Inline C.
#
# USAGE: stacksnoop [-h] [-p PID] [-s] [-v] function
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# The stack depth is limited to 10 (+1 for the current instruction pointer).
# This could be tunable in a future version.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
# arguments
examples = """examples:
./stacksnoop ext4_sync_fs # print kernel stack traces for ext4_sync_fs
./stacksnoop -s ext4_sync_fs # ... also show symbol offsets
./stacksnoop -v ext4_sync_fs # ... show extra columns
./stacksnoop -p 185 ext4_sync_fs # ... only when PID 185 is on-CPU
"""
parser = argparse.ArgumentParser(
description="Trace and print kernel stack traces for a kernel function",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="print more fields")
parser.add_argument("function",
help="kernel function name")
args = parser.parse_args()
function = args.function
offset = args.offset
verbose = args.verbose
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
static int print_frame(u64 *bp, int *depth) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (ret < __START_KERNEL_map)
return 0;
bpf_trace_printk("r%d: %llx\\n", *depth, ret);
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
return 0;
*depth += 1;
return 1;
}
return 0;
}
void trace_stack(struct pt_regs *ctx) {
FILTER
u64 bp = 0;
int depth = 0;
bpf_trace_printk("\\n");
if (ctx->ip)
bpf_trace_printk("ip: %llx\\n", ctx->ip);
bp = ctx->bp;
// unrolled loop, 10 frames deep:
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
};
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
'if (pid != %s) { return; }') % (args.pid))
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event=function, fn_name="trace_stack")
matched = b.num_open_kprobes()
if matched == 0:
print("Function \"%s\" not found. Exiting." % function)
exit()
# header
if verbose:
print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
"STACK"))
else:
print("%-18s %s" % ("TIME(s)", "STACK"))
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
if msg != "":
(reg, addr) = msg.split(" ")
if offset:
ip = b.ksymaddr(int(addr, 16))
else:
ip = b.ksym(int(addr, 16))
msg = msg + " " + ip
if verbose:
print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
else:
print("%-18.9f %s" % (ts, msg))
......@@ -72,52 +72,14 @@ def signal_ignore(signal, frame):
bpf_text = """
#include <uapi/linux/ptrace.h>
#define MAXDEPTH 10
struct key_t {
u64 ip;
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
BPF_HASH(counts, int);
BPF_STACK_TRACE(stack_traces, 1024);
int trace_count(struct pt_regs *ctx) {
FILTER
struct key_t key = {};
u64 zero = 0, *val, bp = 0;
int depth = 0;
key.ip = ctx->ip;
bp = ctx->bp;
// unrolled loop, 10 (MAXDEPTH) frames deep:
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
int key = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
u64 zero = 0;
u64 *val = counts.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
......@@ -164,13 +126,11 @@ while (1):
if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
counts = b.get_table("counts")
counts = b["counts"]
stack_traces = b["stack_traces"]
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
print_frame(k.ip)
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print_frame(k.ret[i])
for addr in stack_traces.walk(k.value):
print_frame(addr)
print(" %d\n" % v.value)
counts.clear()
......
......@@ -82,9 +82,6 @@ tracing.
It can be useful to trace the path to submit_bio to explain unusual rates of
disk IOPS. These could have in-kernel origins (eg, background scrub).
This version of stackcount truncates stacks to 10 levels deep (plus 1 for
the traced function, so 11).
As another example, here are the code paths that led to ip_output(), which
sends a packet at the IP level:
......
......@@ -20,6 +20,7 @@
from __future__ import print_function
from bcc import BPF
import argparse
import re
# arguments
examples = """examples:
......@@ -50,45 +51,14 @@ debug = 0
bpf_text = """
#include <uapi/linux/ptrace.h>
static int print_frame(u64 *bp, int *depth) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (ret < __START_KERNEL_map)
return 0;
bpf_trace_printk("r%d: %llx\\n", *depth, ret);
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
return 0;
*depth += 1;
return 1;
}
return 0;
}
BPF_STACK_TRACE(stack_traces, 128)
void trace_stack(struct pt_regs *ctx) {
FILTER
u64 bp = 0;
int depth = 0;
bpf_trace_printk("\\n");
if (ctx->ip)
bpf_trace_printk("ip: %llx\\n", ctx->ip);
bp = ctx->bp;
// unrolled loop, 10 frames deep:
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
};
int stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
if (stack_id >= 0)
bpf_trace_printk("stack_id=%d\\n", stack_id);
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
......@@ -107,24 +77,28 @@ if matched == 0:
print("Function \"%s\" not found. Exiting." % function)
exit()
stack_traces = b.get_table("stack_traces")
msg_regexp = re.compile(r"stack_id=(\d+)")
# header
if verbose:
print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
"STACK"))
print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU", "SYSCALL"))
else:
print("%-18s %s" % ("TIME(s)", "STACK"))
print("%-18s %s" % ("TIME(s)", "SYSCALL"))
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
if msg != "":
(reg, addr) = msg.split(" ")
if offset:
ip = b.ksymaddr(int(addr, 16))
m = msg_regexp.match(msg)
if m:
if verbose:
print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, function))
else:
ip = b.ksym(int(addr, 16))
msg = msg + " " + ip
if verbose:
print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
else:
print("%-18.9f %s" % (ts, msg))
print("%-18.9f %s" % (ts, function))
stack_id = int(m.group(1))
for addr in stack_traces.walk(stack_id):
sym = b.ksymaddr(addr) if offset else b.ksym(addr)
print("\t%016x %s" % (addr, sym))
print()
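# Reviewer note (sketch): the new flow pushes only the stack id through
# the trace pipe and resolves frames in user space, using names from
# this patch:
#     m = msg_regexp.match(msg)          # msg came from b.trace_fields()
#     if m:
#         for addr in stack_traces.walk(int(m.group(1))):
#             print("\t%016x %s" % (addr, b.ksym(addr)))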
......@@ -3,27 +3,20 @@ Demonstrations of stacksnoop, the Linux eBPF/bcc version.
This program traces the given kernel function and prints the kernel stack trace
for every call. This tool is useful for studying low frequency kernel functions,
to see how they were invoked. For example, tracing the ext4_sync_fs() call:
# ./stacksnoop ext4_sync_fs
TIME(s) STACK
42005194.132250004
42005194.132253997 ip: ffffffff81280461 ext4_sync_fs
42005194.132256001 r0: ffffffff811ed7f9 iterate_supers
42005194.132257000 r1: ffffffff8121ba25 sys_sync
42005194.132257000 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005194.132275000
42005194.132275999 ip: ffffffff81280461 ext4_sync_fs
42005194.132275999 r0: ffffffff811ed7f9 iterate_supers
42005194.132276997 r1: ffffffff8121ba35 sys_sync
42005194.132276997 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This shows that ext4_sync_fs() was called by iterate_supers(), which was called
by sys_sync(), and so on. (It tells me that this was a syscall invoked sync,
so an application has requested it.)
The "ip" refers to the instruction pointer, and the "r#" refers to the return
address for each stack frame.
to see how they were invoked. For example, tracing the submit_bio() call:
# ./stacksnoop submit_bio
TIME(s) SYSCALL
3592.838736000 submit_bio
ffffffff813bd961 submit_bio
ffffffff81257c12 submit_bh
ffffffff81301948 jbd2_journal_commit_transaction
ffffffff8130653a kjournald2
ffffffff810a2df8 kthread
ffffffff8183a122 ret_from_fork
This shows that submit_bio() was called by submit_bh(), which was called
by jbd2_journal_commit_transaction(), and so on.
For high frequency functions, see stackcount, which summarizes in-kernel for
efficiency. If you don't know if your function is low or high frequency, try
......@@ -32,20 +25,17 @@ funccount.
The -v option includes more fields, including the on-CPU process (COMM and PID):
# ./stacksnoop -v ext4_sync_fs
TIME(s) COMM PID CPU STACK
42005557.056332998 sync 22352 1
42005557.056336999 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056339003 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056340002 sync 22352 1 r1: ffffffff8121ba25 sys_sync
42005557.056340002 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005557.056358002 sync 22352 1
42005557.056358002 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056359001 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056359999 sync 22352 1 r1: ffffffff8121ba35 sys_sync
42005557.056359999 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This identifies the application issuing the sync syscall: the sync(1) command
# ./stacksnoop -v submit_bio
TIME(s) COMM PID CPU SYSCALL
3734.855027000 jbd2/dm-0-8 313 0 submit_bio
ffffffff813bd961 submit_bio
ffffffff81257c12 submit_bh
ffffffff81301948 jbd2_journal_commit_transaction
ffffffff8130653a kjournald2
ffffffff810a2df8 kthread
ffffffff8183a122 ret_from_fork
This identifies the process that was on-CPU when submit_bio() was called: the
jbd2 journaling kernel thread (COMM column).
......@@ -53,29 +43,32 @@ Here's another example, showing the path to second_overflow() and on-CPU
process:
# ./stacksnoop -v second_overflow
TIME(s) COMM PID CPU STACK
42005696.529449999 <idle> 0 0
42005696.529457003 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005696.529459000 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005696.529459998 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005696.529459998 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005696.529460996 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005696.529460996 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005697.616295002 <idle> 0 0
42005697.616301000 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005697.616302997 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005697.616304003 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005697.616304003 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005697.616305001 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005697.616305001 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005698.556240998 <idle> 0 1
42005698.556247003 <idle> 0 1 ip: ffffffff810e5701 second_overflow
42005698.556249000 <idle> 0 1 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005698.556249000 <idle> 0 1 r1: ffffffff810ed6e0 tick_irq_enter
42005698.556249999 <idle> 0 1 r2: ffffffff8107a195 irq_enter
42005698.556249999 <idle> 0 1 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005698.556250997 <idle> 0 1 r4: ffffffff81777a2e xen_do_hypervisor_callback
[...]
TIME(s) COMM PID CPU SYSCALL
3837.526433000 <idle> 0 1 second_overflow
ffffffff810fac41 second_overflow
ffffffff81102320 tick_do_update_jiffies64
ffffffff81102bf0 tick_irq_enter
ffffffff810882ac irq_enter
ffffffff8183c7df smp_apic_timer_interrupt
ffffffff8183aae2 apic_timer_interrupt
ffffffff81038f9e default_idle
ffffffff8103979f arch_cpu_idle
ffffffff810c69da default_idle_call
ffffffff810c6cd7 cpu_startup_entry
ffffffff81051cbe start_secondary
3838.526953000 <idle> 0 1 second_overflow
ffffffff810fac41 second_overflow
ffffffff81102320 tick_do_update_jiffies64
ffffffff81102bf0 tick_irq_enter
ffffffff810882ac irq_enter
ffffffff8183c7df smp_apic_timer_interrupt
ffffffff8183aae2 apic_timer_interrupt
ffffffff81038f9e default_idle
ffffffff8103979f arch_cpu_idle
ffffffff810c69da default_idle_call
ffffffff810c6cd7 cpu_startup_entry
ffffffff81051cbe start_secondary
This fires every second (see TIME(s)), and is from tick_do_update_jiffies64().
......