Commit 120bfad7 authored by 4ast's avatar 4ast

Merge pull request #313 from brendangregg/master

stack walking scripts
parents 3083c85c 38cef486
......@@ -71,9 +71,12 @@ Tools:
- tools/[funclatency](tools/funclatency): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
- tools/[hardirqs](tools/hardirqs): Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt).
- tools/[killsnoop](tools/killsnoop): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
- tools/[offcputime](tools/offcputime): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[opensnoop](tools/opensnoop): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[softirqs](tools/softirqs): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[stackcount](tools/stackcount): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
- tools/[stacksnoop](tools/stacksnoop): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
- tools/[syncsnoop](tools/syncsnoop): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
- tools/[tcpaccept](tools/tcpaccept): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
- tools/[tcpconnect](tools/tcpconnect): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
......
.TH offcputime 8 "2016-01-14" "USER COMMANDS"
.SH NAME
offcputime \- Summarize off-CPU time by kernel stack trace. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B offcputime [\-h] [\-p PID] [\-i INTERVAL] [\-T] [duration]
.SH DESCRIPTION
This program shows kernel stack traces and task names that were blocked and
"off-CPU", and the total duration they were blocked: their "off-CPU time".
It works by tracing when threads block and when they return to CPU, measuring
both the time they were blocked and the blocked kernel stack trace and the
task name. This data is summarized in the kernel using an eBPF map, and by
summing the blocked time by unique stack trace and task name.
The output summary will help you identify reasons why threads
were blocking, and quantify the time they were blocked. This spans all types
of blocking activity: disk I/O, network I/O, locks, page faults, involuntary
context switches, etc.
This is complementary to CPU profiling (e.g., CPU flame graphs) which shows
the time spent on-CPU. This shows the time spent off-CPU, and the output,
especially the -f format, can be used to generate an "off-CPU time flame graph".
See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
The stack depth is currently limited to 20, and the stack traces are kernel
mode only. Check for newer versions where either may be improved.
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-v
Show raw addresses.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
duration
Duration to trace, in seconds.
.SH EXAMPLES
.TP
Trace all thread blocking events, and summarize (in-kernel) by kernel stack
trace and total off-CPU time:
#
.B offcputime
.TP
Trace for 5 seconds only:
#
.B offcputime 5
.TP
Trace for 5 seconds, and emit output in folded stack format (suitable for
flame graphs):
#
.B offcputime -f 5
.TP
Trace PID 185 only:
#
.B offcputime -p 185
.SH OVERHEAD
This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of events than methods that post-process in user space. The
stack trace and time data is only copied to user space once, when the output is
printed. While these techniques greatly lower overhead, scheduler events are
still a high frequency event, as they can exceed 1 million events per second,
and so caution should still be used. Test before production use.
If the overhead is still a problem, take a look at the MINBLOCK_US tunable in
the code. If your aim is to chase down longer blocking events, then this could
be increased to filter shorter blocking events, further lowering overhead.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
stackcount(8)
.TH stackcount 8 "2016-01-14" "USER COMMANDS"
.SH NAME
stackcount \- Count kernel function calls and their stack traces. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B stackcount [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-r] pattern
.SH DESCRIPTION
stackcount traces kernel functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. This allows higher
frequency events to be studied. The output consists of unique stack traces,
and their occurance counts.
The pattern is a string with optional '*' wildcards, similar to file globbing.
If you'd prefer to use regular expressions, use the \-r option.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-r
Allow regular expressions for the search pattern. The default allows "*"
wildcards only.
.TP
\-s
Show address offsets.
.TP
\-T
Include a timestamp with interval output.
.TP
\-v
Show raw addresses.
.TP
\-i interval
Summary interval, in seconds.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
pattern
A kernel function name, or a search pattern. Can include wildcards ("*"). If the
\-r option is used, can include regular expressions.
.SH EXAMPLES
.TP
Count kernel stack traces for submit_bio():
#
.B stackcount submit_bio
.TP
Count kernel stack traces for ip_output():
#
.B stackcount ip_output
.TP
Show symbol offsets:
#
.B stackcount -s ip_output
.TP
Show offsets and raw addresses (verbose):
#
.B stackcount -sv ip_output
.TP
Count kernel stacks for kernel functions matching tcp_send*:
#
.B stackcount 'tcp_send*'
.TP
Same as previous, but using regular expressions:
#
.B stackcount -r '^tcp_send.*'
.TP
Output every 5 seconds, with timestamps:
#
.B stackcount -Ti 5 ip_output
.TP
Only count stacks when PID 185 is on-CPU:
#
.B stackcount -p 185 ip_output
.SH OVERHEAD
This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of function calls than methods that post-process in user
space. The stack trace data is only copied to user space when the output is
printed, which usually only happens once. Given these techniques, I'd suspect
that call rates of < 10,000/sec would incur negligible overhead (for this
current version; future versions may improve this). Beyond that,
there will be a point where the overhead is measurable, as this does add
a number of instructions to each function call to walk and save stacks.
Test before production use. You can also use funccount to get a handle on
function call rates first.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
stacksnoop(8), funccount(8)
.TH stacksnoop 8 "2016-01-14" "USER COMMANDS"
.SH NAME
stacksnoop \- Print kernel stack traces for kernel functions. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B stacksnoop [\-h] [\-p PID] [\-s] [\-v] function
.SH DESCRIPTION
stacksnoop traces a given kernel function and for each call, prints the
kernel stack back trace for that call. This shows the ancestry of function
calls, and is a quick way to investigate low frequency kernel functions and
their cause. For high frequency kernel functions, see stackcount.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-s
Show address offsets.
.TP
\-v
Print more fields.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
function
Kernel function name.
.SH EXAMPLES
.TP
Print kernel stack traces for each call to ext4_sync_fs:
#
.B stacksnoop ext4_sync_fs
.TP
Also show the symbol offsets:
#
.B stacksnoop -s ext4_sync_fs
.TP
Show extra columns:
#
.B stacksnoop -v ext4_sync_fs
.TP
Only trace when PID 185 is on-CPU:
#
.B stacksnoop -p 185 ext4_sync_fs
.SH FIELDS
.TP
TIME(s)
Time of the call, in seconds.
.TP
STACK
Kernel stack trace. The first column shows "ip" for instruction pointer, and
"r#" for each return pointer in the stack. The second column is the stack trace
as hexidecimal. The third column is the translated kernel symbol names.
.SH OVERHEAD
This can have significant overhead if frequently called functions (> 1000/s) are
traced, and is only intended for low frequency function calls. This is because
details including the stack trace for every call is passed to user space and
processed. See stackcount for higher frequency calls, which performs in-kernel
summaries.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
stackcount(8)
#!/usr/bin/python
#
# offcputime Summarize off-CPU time by kernel stack trace
# For Linux, uses BCC, eBPF.
#
# USAGE: offcputime [-h] [-p PID] [-i INTERVAL] [-T] [duration]
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 21 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 13-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./offcputime # trace off-CPU stack time until Ctrl-C
./offcputime 5 # trace for 5 seconds only
./offcputime -f 5 # 5 seconds, and output in folded format
./offcputime -p 185 # trace fo PID 185 only
"""
parser = argparse.ArgumentParser(
description="Summarize off-CPU time by kernel stack trace",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("-f", "--folded", action="store_true",
help="output folded format")
parser.add_argument("duration", nargs="?", default=99999999,
help="duration of trace, in seconds")
args = parser.parse_args()
folded = args.folded
duration = int(args.duration)
debug = 0
maxdepth = 20 # and MAXDEPTH
# signal handler
def signal_ignore(signal, frame):
print()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#define MAXDEPTH 20
#define MINBLOCK_US 1
struct key_t {
char name[TASK_COMM_LEN];
// Skip saving the ip
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
int offcpu(struct pt_regs *ctx) {
u32 pid = bpf_get_current_pid_tgid();
u64 ts = bpf_ktime_get_ns();
FILTER
start.update(&pid, &ts);
return 0;
}
int oncpu(struct pt_regs *ctx) {
u32 pid = bpf_get_current_pid_tgid();
FILTER
u64 ts = bpf_ktime_get_ns();
struct key_t key = {};
u64 zero = 0, *val, bp = 0, *tsp, delta;
int depth = 0;
// calculate delta time
tsp = start.lookup(&pid);
if (tsp == 0)
return 0; // missed start
delta = bpf_ktime_get_ns() - *tsp;
start.delete(&pid);
delta = delta / 1000;
if (delta < MINBLOCK_US)
return 0;
bpf_get_current_comm(&key.name, sizeof(key.name));
bp = ctx->bp;
// unrolled loop (MAXDEPTH):
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val) += delta;
return 0;
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
'if (pid != %s) { return 0; }' % (args.pid))
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
b = BPF(text=bpf_text)
b.attach_kprobe(event="schedule", fn_name="offcpu")
b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")
matched = b.num_open_kprobes()
if matched == 0:
print("0 functions traced. Exiting.")
exit()
# header
if not folded:
print("Tracing off-CPU time (us) by kernel stack", end="")
if duration < 99999999:
print(" for %d secs." % duration)
else:
print("... Hit Ctrl-C to end.")
# output
while (1):
try:
sleep(duration)
except KeyboardInterrupt:
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, signal_ignore)
if not folded:
print()
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
if folded:
# print folded stack output
line = k.name + ";"
for i in reversed(range(0, maxdepth)):
if k.ret[i] == 0:
continue
line = line + b.ksym(k.ret[i])
if i != 0:
line = line + ";"
print("%s %d" % (line, v.value))
else:
# print default multi-line stack output
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print(" %-16x %s" % (k.ret[i],
b.ksym(k.ret[i])))
print(" %-16s %s" % ("-", k.name))
print(" %d\n" % v.value)
counts.clear()
if not folded:
print("Detaching...")
exit()
This diff is collapsed.
#!/usr/bin/python
#
# stackcount Count kernel function calls and their stack traces.
# For Linux, uses BCC, eBPF.
#
# USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] pattern
#
# The pattern is a string with optional '*' wildcards, similar to file
# globbing. If you'd prefer to use regular expressions, use the -r option.
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 11 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
"""
parser = argparse.ArgumentParser(
description="Count kernel function calls and their stack traces",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-i", "--interval", default=99999999,
help="summary interval, seconds")
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-r", "--regexp", action="store_true",
help="use regular expressions. Default is \"*\" wildcards only.")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("pattern",
help="search expression for kernel functions")
args = parser.parse_args()
pattern = args.pattern
if not args.regexp:
pattern = pattern.replace('*', '.*')
pattern = '^' + pattern + '$'
offset = args.offset
verbose = args.verbose
debug = 0
maxdepth = 10 # and MAXDEPTH
# signal handler
def signal_ignore(signal, frame):
print()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#define MAXDEPTH 10
struct key_t {
u64 ip;
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (!ret || ret < __START_KERNEL_map)
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
bp = 0;
return ret;
}
return 0;
}
int trace_count(struct pt_regs *ctx) {
FILTER
struct key_t key = {};
u64 zero = 0, *val, bp = 0;
int depth = 0;
key.ip = ctx->ip;
bp = ctx->bp;
// unrolled loop, 10 (MAXDEPTH) frames deep:
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
'if (pid != %s) { return 0; }') % (args.pid))
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
b = BPF(text=bpf_text)
b.attach_kprobe(event_re=pattern, fn_name="trace_count")
matched = b.num_open_kprobes()
if matched == 0:
print("0 functions matched by \"%s\". Exiting." % args.pattern)
exit()
# header
print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." %
(matched, args.pattern))
def print_frame(addr):
print(" ", end="")
if verbose:
print("%-16x " % addr, end="")
if offset:
print("%s" % b.ksymaddr(addr))
else:
print("%s" % b.ksym(addr))
# output
exiting = 0 if args.interval else 1
while (1):
try:
sleep(int(args.interval))
except KeyboardInterrupt:
exiting = 1
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, signal_ignore)
print()
if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
print_frame(k.ip)
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print_frame(k.ret[i])
print(" %d\n" % v.value)
counts.clear()
if exiting:
print("Detaching...")
exit()
Demonstrations of stackcount, the Linux eBPF/bcc version.
This program traces kernel functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. For example, counting
stack traces that led to submit_bio(), which creates block device I/O:
# ./stackcount submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
^C
submit_bio
submit_bh
journal_submit_commit_record.isra.13
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
1
submit_bio
__block_write_full_page.constprop.39
block_write_full_page
blkdev_writepage
__writepage
write_cache_pages
generic_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
2
submit_bio
__block_write_full_page.constprop.39
block_write_full_page
blkdev_writepage
__writepage
write_cache_pages
generic_writepages
do_writepages
__filemap_fdatawrite_range
filemap_fdatawrite
fdatawrite_one_bdev
36
submit_bio
submit_bh
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
38
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
79
Detaching...
The output shows unique stack traces, in order from leaf (on-CPU) to root,
followed by their occurrence count. The last stack trace in the above output
shows syscall handling, ext4_rename(), and filemap_flush(): looks like an
application issued file rename has caused back end disk I/O due to ext4
block allocation and a filemap_flush(). I'd have to browse the code to those
functions to confirm!
The order of printed stack traces is from least to most frequent. The most
frequent in this case, the ext4_rename() stack, was taken 79 times during
tracing.
It can be useful to trace the path to submit_bio to explain unusual rates of
disk IOPS. These could have in-kernel origins (eg, background scrub).
This version of stackcount truncates stacks to 10 levels deep (plus 1 for
the traced function, so 11).
As another example, here are the code paths that led to ip_output(), which
sends a packet at the IP level:
# ./stackcount ip_output
Tracing 1 functions for "ip_output"... Hit Ctrl-C to end.
^C
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
tcp_tsq_handler.part.32
tcp_tasklet_func
tasklet_action
__do_softirq
do_softirq_own_stack
do_softirq
__local_bh_enable_ip
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_send_delayed_ack
__tcp_ack_snd_check
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_send_delayed_ack
__tcp_ack_snd_check
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_delack_timer_handler
tcp_delack_timer
call_timer_fn
run_timer_softirq
__do_softirq
irq_exit
xen_evtchn_do_upcall
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
release_sock
tcp_sendmsg
inet_sendmsg
sock_sendmsg
3
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
tcp_tsq_handler.part.32
tcp_tasklet_func
tasklet_action
__do_softirq
run_ksoftirqd
smpboot_thread_fn
kthread
3
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
4
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_cleanup_rbuf
tcp_recvmsg
inet_recvmsg
sock_recvmsg
sock_read_iter
__vfs_read
vfs_read
5
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
release_sock
tcp_sendmsg
inet_sendmsg
sock_sendmsg
9
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_push
tcp_sendmsg
inet_sendmsg
sock_sendmsg
sock_write_iter
__vfs_write
51
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
171
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
994
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
tcp_tsq_handler.part.32
tcp_tasklet_func
tasklet_action
__do_softirq
irq_exit
xen_evtchn_do_upcall
xen_do_hypervisor_callback
1002
Detaching...
The last two most frequent stack traces are via tcp_transmit_skb(). Note the
send last begins with ip_rcv_finish(), for a local receive, which then becomes
a transmit: likely pushing more frames when processing the newly received ones.
As may be obvious, this is a great tool for quickly understanding kernel code
flow.
A -i option can be used to set an output interval, and -T to include a
timestamp. For example:
# ./stackcount -Ti 1 submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
01:11:37
submit_bio
submit_bh
journal_submit_commit_record.isra.13
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
1
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
submit_bio
submit_bh
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
39
01:11:38
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
01:11:39
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
^C
01:11:39
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
Detaching...
The -s output prints the return instruction offset for each function (aka
symbol offset). Eg:
# ./stackcount -s tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
^C
tcp_sendmsg0x1
sock_sendmsg0x38
sock_write_iter0x78
__vfs_write0xaa
vfs_write0xa9
sys_write0x46
entry_SYSCALL_64_fastpath0x16
29
Detaching...
If it wasn't clear how one function called another, knowing the instruction
offset can help you locate the lines of code from a dissassembly dump.
A wildcard can also be used. Eg, all functions beginning with "tcp_send":
# ./stackcount -s 'tcp_send*'
Tracing 14 functions for "tcp_send*"... Hit Ctrl-C to end.
^C
tcp_send_delayed_ack0x1
tcp_rcv_established0x3b1
tcp_v4_do_rcv0x130
tcp_v4_rcv0x8e0
ip_local_deliver_finish0x9f
ip_local_deliver0x51
ip_rcv_finish0x8a
ip_rcv0x29d
__netif_receive_skb_core0x637
__netif_receive_skb0x18
netif_receive_skb_internal0x23
1
tcp_send_delayed_ack0x1
tcp_rcv_established0x222
tcp_v4_do_rcv0x130
tcp_v4_rcv0x8e0
ip_local_deliver_finish0x9f
ip_local_deliver0x51
ip_rcv_finish0x8a
ip_rcv0x29d
__netif_receive_skb_core0x637
__netif_receive_skb0x18
netif_receive_skb_internal0x23
4
tcp_send_mss0x1
inet_sendmsg0x67
sock_sendmsg0x38
sock_write_iter0x78
__vfs_write0xaa
vfs_write0xa9
sys_write0x46
entry_SYSCALL_64_fastpath0x16
7
tcp_sendmsg0x1
sock_sendmsg0x38
sock_write_iter0x78
__vfs_write0xaa
vfs_write0xa9
sys_write0x46
entry_SYSCALL_64_fastpath0x16
7
Detaching...
Use -r to allow regular expressions.
USAGE message:
# ./stackcount -h
usage: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s] [-v] pattern
Count kernel function calls and their stack traces
positional arguments:
pattern search expression for kernel functions
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
-i INTERVAL, --interval INTERVAL
summary interval, seconds
-T, --timestamp include timestamp on output
-r, --regexp use regular expressions. Default is "*" wildcards
only.
-s, --offset show address offsets
-v, --verbose show raw addresses
examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
#!/usr/bin/python
#
# stacksnoop Trace a kernel function and print all kernel stack traces.
# For Linux, uses BCC, eBPF, and currently x86_64 only. Inline C.
#
# USAGE: stacksnoop [-h] [-p PID] [-s] [-v] function
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# The stack depth is limited to 10 (+1 for the current instruction pointer).
# This could be tunable in a future version.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
# arguments
examples = """examples:
./stacksnoop ext4_sync_fs # print kernel stack traces for ext4_sync_fs
./stacksnoop -s ext4_sync_fs # ... also show symbol offsets
./stacksnoop -v ext4_sync_fs # ... show extra columns
./stacksnoop -p 185 ext4_sync_fs # ... only when PID 185 is on-CPU
"""
parser = argparse.ArgumentParser(
description="Trace and print kernel stack traces for a kernel function",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="print more fields")
parser.add_argument("function",
help="kernel function name")
args = parser.parse_args()
function = args.function
offset = args.offset
verbose = args.verbose
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
static int print_frame(u64 *bp, int *depth) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (!ret || ret < __START_KERNEL_map)
return 0;
bpf_trace_printk("r%d: %llx\\n", *depth, ret);
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
return 0;
*depth += 1;
return 1;
}
return 0;
}
void trace_stack(struct pt_regs *ctx) {
FILTER
u64 bp = 0;
int depth = 0;
bpf_trace_printk("\\n");
if (ctx->ip)
bpf_trace_printk("ip: %llx\\n", ctx->ip);
bp = ctx->bp;
// unrolled loop, 10 frames deep:
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
};
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
'if (pid != %s) { return 0; }') % (args.pid))
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event=function, fn_name="trace_stack")
matched = b.num_open_kprobes()
if matched == 0:
print("Function \"%s\" not found. Exiting." % function)
exit()
# header
if verbose:
print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
"STACK"))
else:
print("%-18s %s" % ("TIME(s)", "STACK"))
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
if msg != "":
(reg, addr) = msg.split(" ")
if offset:
ip = b.ksymaddr(int(addr, 16))
else:
ip = b.ksym(int(addr, 16))
msg = msg + " " + ip
if verbose:
print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
else:
print("%-18.9f %s" % (ts, msg))
Demonstrations of stacksnoop, the Linux eBPF/bcc version.
This program traces the given kernel function and prints the kernel stack trace
for every call. This tool is useful for studying low frequency kernel functions,
to see how they were invoked. For exmaple, tracing the ext4_sync_fs() call:
# ./stacksnoop ext4_sync_fs
TIME(s) STACK
42005194.132250004
42005194.132253997 ip: ffffffff81280461 ext4_sync_fs
42005194.132256001 r0: ffffffff811ed7f9 iterate_supers
42005194.132257000 r1: ffffffff8121ba25 sys_sync
42005194.132257000 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005194.132275000
42005194.132275999 ip: ffffffff81280461 ext4_sync_fs
42005194.132275999 r0: ffffffff811ed7f9 iterate_supers
42005194.132276997 r1: ffffffff8121ba35 sys_sync
42005194.132276997 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This shows that ext4_sync_fs() was called by iterate_supers(), which was called
by sys_sync(), and so on. (It tells me that this was a syscall invoked sync,
so an application has requested it.)
The "ip" refers to the instruction pointer, and the "r#" refers to the return
address for each stack frame.
For high frequency functions, see stackcount, which summarizes in-kernel for
efficiency. If you don't know if your function is low or high frequency, try
funccount.
The -v option includes more fields, including the on-CPU process (COMM and PID):
# ./stacksnoop -v ext4_sync_fs
TIME(s) COMM PID CPU STACK
42005557.056332998 sync 22352 1
42005557.056336999 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056339003 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056340002 sync 22352 1 r1: ffffffff8121ba25 sys_sync
42005557.056340002 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005557.056358002 sync 22352 1
42005557.056358002 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056359001 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056359999 sync 22352 1 r1: ffffffff8121ba35 sys_sync
42005557.056359999 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This identifies the application issuing the sync syscall: the sync(1) command
(COMM column).
Here's another example, showing the path to second_overflow() and on-CPU
process:
# ./stacksnoop -v second_overflow
TIME(s) COMM PID CPU STACK
42005696.529449999 <idle> 0 0
42005696.529457003 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005696.529459000 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005696.529459998 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005696.529459998 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005696.529460996 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005696.529460996 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005697.616295002 <idle> 0 0
42005697.616301000 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005697.616302997 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005697.616304003 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005697.616304003 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005697.616305001 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005697.616305001 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005698.556240998 <idle> 0 1
42005698.556247003 <idle> 0 1 ip: ffffffff810e5701 second_overflow
42005698.556249000 <idle> 0 1 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005698.556249000 <idle> 0 1 r1: ffffffff810ed6e0 tick_irq_enter
42005698.556249999 <idle> 0 1 r2: ffffffff8107a195 irq_enter
42005698.556249999 <idle> 0 1 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005698.556250997 <idle> 0 1 r4: ffffffff81777a2e xen_do_hypervisor_callback
[...]
This fires every second (see TIME(s)), and is from tick_do_update_jiffies64().
USAGE message:
# ./stacksnoop -h
usage: stacksnoop [-h] [-p PID] [-s] [-v] function
Trace and print kernel stack traces for a kernel function
positional arguments:
function kernel function name
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
-s, --offset show address offsets
-v, --verbose print more fields
examples:
./stacksnoop ext4_sync_fs # print kernel stack traces for ext4_sync_fs
./stacksnoop -s ext4_sync_fs # ... also show symbol offsets
./stacksnoop -v ext4_sync_fs # ... show extra columns
./stacksnoop -p 185 ext4_sync_fs # ... only when PID 185 is on-CPU
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment