Commit 38cef486 authored by Brendan Gregg

stack walking scripts

parent 3083c85c
@@ -71,9 +71,12 @@ Tools:
- tools/[funclatency](tools/funclatency): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
- tools/[hardirqs](tools/hardirqs): Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt).
- tools/[killsnoop](tools/killsnoop): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
- tools/[offcputime](tools/offcputime): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[opensnoop](tools/opensnoop): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[softirqs](tools/softirqs): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[stackcount](tools/stackcount): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
- tools/[stacksnoop](tools/stacksnoop): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
- tools/[syncsnoop](tools/syncsnoop): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
- tools/[tcpaccept](tools/tcpaccept): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
- tools/[tcpconnect](tools/tcpconnect): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
.TH offcputime 8 "2016-01-14" "USER COMMANDS"
.SH NAME
offcputime \- Summarize off-CPU time by kernel stack trace. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B offcputime [\-h] [\-p PID] [\-v] [\-f] [duration]
.SH DESCRIPTION
This program shows kernel stack traces and task names that were blocked and
"off-CPU", and the total duration they were blocked: their "off-CPU time".
It works by tracing when threads block and when they return to CPU, recording
the blocked duration along with the blocked kernel stack trace and the task
name. This data is summarized in the kernel using an eBPF map, by summing the
blocked time for each unique stack trace and task name.
The output summary will help you identify reasons why threads
were blocking, and quantify the time they were blocked. This spans all types
of blocking activity: disk I/O, network I/O, locks, page faults, involuntary
context switches, etc.
This is complementary to CPU profiling (e.g., CPU flame graphs) which shows
the time spent on-CPU. This shows the time spent off-CPU, and the output,
especially the -f format, can be used to generate an "off-CPU time flame graph".
See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
The stack depth is currently limited to 20, and the stack traces are kernel
mode only. Check for newer versions where either may be improved.
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-v
Show raw addresses.
.TP
\-f
Print output in folded stack format, suitable for flame graph generation.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
duration
Duration to trace, in seconds.
.SH EXAMPLES
.TP
Trace all thread blocking events, and summarize (in-kernel) by kernel stack
trace and total off-CPU time:
#
.B offcputime
.TP
Trace for 5 seconds only:
#
.B offcputime 5
.TP
Trace for 5 seconds, and emit output in folded stack format (suitable for
flame graphs):
#
.B offcputime -f 5
.TP
Trace PID 185 only:
#
.B offcputime -p 185
.SH OVERHEAD
This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of events than methods that post-process in user space. The
stack trace and time data is only copied to user space once, when the output is
printed. While these techniques greatly lower overhead, scheduler events can
still be very frequent, exceeding 1 million events per second, so caution
should still be used. Test before production use.
If the overhead is still a problem, take a look at the MINBLOCK_US tunable in
the code. If your aim is to chase down longer blocking events, then this could
be increased to filter shorter blocking events, further lowering overhead.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
stackcount(8)
.TH stackcount 8 "2016-01-14" "USER COMMANDS"
.SH NAME
stackcount \- Count kernel function calls and their stack traces. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B stackcount [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-r] [\-s] [\-v] pattern
.SH DESCRIPTION
stackcount traces kernel functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. This allows higher
frequency events to be studied. The output consists of unique stack traces,
and their occurrence counts.
The pattern is a string with optional '*' wildcards, similar to file globbing.
If you'd prefer to use regular expressions, use the \-r option.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-r
Allow regular expressions for the search pattern. The default allows "*"
wildcards only.
.TP
\-s
Show address offsets.
.TP
\-T
Include a timestamp with interval output.
.TP
\-v
Show raw addresses.
.TP
\-i interval
Summary interval, in seconds.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
pattern
A kernel function name, or a search pattern. Can include wildcards ("*"). If the
\-r option is used, can include regular expressions.
.SH EXAMPLES
.TP
Count kernel stack traces for submit_bio():
#
.B stackcount submit_bio
.TP
Count kernel stack traces for ip_output():
#
.B stackcount ip_output
.TP
Show symbol offsets:
#
.B stackcount -s ip_output
.TP
Show offsets and raw addresses (verbose):
#
.B stackcount -sv ip_output
.TP
Count kernel stacks for kernel functions matching tcp_send*:
#
.B stackcount 'tcp_send*'
.TP
Same as previous, but using regular expressions:
#
.B stackcount -r '^tcp_send.*'
.TP
Output every 5 seconds, with timestamps:
#
.B stackcount -Ti 5 ip_output
.TP
Only count stacks when PID 185 is on-CPU:
#
.B stackcount -p 185 ip_output
.SH OVERHEAD
This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of function calls than methods that post-process in user
space. The stack trace data is only copied to user space when the output is
printed, which usually only happens once. Given these techniques, I'd suspect
that call rates of < 10,000/sec would incur negligible overhead (for this
current version; future versions may improve this). Beyond that,
there will be a point where the overhead is measurable, as this does add
a number of instructions to each function call to walk and save stacks.
Test before production use. You can also use funccount to get a handle on
function call rates first.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
stacksnoop(8), funccount(8)
.TH stacksnoop 8 "2016-01-14" "USER COMMANDS"
.SH NAME
stacksnoop \- Print kernel stack traces for kernel functions. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B stacksnoop [\-h] [\-p PID] [\-s] [\-v] function
.SH DESCRIPTION
stacksnoop traces a given kernel function and for each call, prints the
kernel stack back trace for that call. This shows the ancestry of function
calls, and is a quick way to investigate low frequency kernel functions and
their cause. For high frequency kernel functions, see stackcount.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-s
Show address offsets.
.TP
\-v
Print more fields.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
function
Kernel function name.
.SH EXAMPLES
.TP
Print kernel stack traces for each call to ext4_sync_fs:
#
.B stacksnoop ext4_sync_fs
.TP
Also show the symbol offsets:
#
.B stacksnoop -s ext4_sync_fs
.TP
Show extra columns:
#
.B stacksnoop -v ext4_sync_fs
.TP
Only trace when PID 185 is on-CPU:
#
.B stacksnoop -p 185 ext4_sync_fs
.SH FIELDS
.TP
TIME(s)
Time of the call, in seconds.
.TP
STACK
Kernel stack trace. The first column shows "ip" for the instruction pointer and
"r#" for each return pointer in the stack. The second column is the stack trace
in hexadecimal. The third column is the translated kernel symbol names.
.SH OVERHEAD
This can have significant overhead if frequently called functions (> 1000/s) are
traced, and is only intended for low frequency function calls. This is because
details including the stack trace for every call are passed to user space and
processed. See stackcount for higher frequency calls, which performs in-kernel
summaries.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
stackcount(8)
#!/usr/bin/python
#
# offcputime Summarize off-CPU time by kernel stack trace
# For Linux, uses BCC, eBPF.
#
# USAGE: offcputime [-h] [-p PID] [-v] [-f] [duration]
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 20 (MAXDEPTH).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 13-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./offcputime # trace off-CPU stack time until Ctrl-C
./offcputime 5 # trace for 5 seconds only
./offcputime -f 5 # 5 seconds, and output in folded format
./offcputime -p 185 # trace for PID 185 only
"""
parser = argparse.ArgumentParser(
description="Summarize off-CPU time by kernel stack trace",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("-f", "--folded", action="store_true",
help="output folded format")
parser.add_argument("duration", nargs="?", default=99999999,
help="duration of trace, in seconds")
args = parser.parse_args()
folded = args.folded
duration = int(args.duration)
debug = 0
maxdepth = 20 # and MAXDEPTH
# signal handler
def signal_ignore(signal, frame):
    print()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#define MAXDEPTH 20
#define MINBLOCK_US 1
struct key_t {
    char name[TASK_COMM_LEN];
    // Skip saving the ip
    u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32);
static u64 get_frame(u64 *bp) {
    if (*bp) {
        // The following stack walker is x86_64 specific
        u64 ret = 0;
        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
            return 0;
        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
            *bp = 0;    // stop the walk on a failed read
        if (ret < __START_KERNEL_map)
            return 0;
        return ret;
    }
    return 0;
}
int offcpu(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    FILTER
    start.update(&pid, &ts);
    return 0;
}
int oncpu(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    FILTER
    u64 ts = bpf_ktime_get_ns();
    struct key_t key = {};
    u64 zero = 0, *val, bp = 0, *tsp, delta;
    int depth = 0;

    // calculate delta time
    tsp = start.lookup(&pid);
    if (tsp == 0)
        return 0;        // missed start
    delta = bpf_ktime_get_ns() - *tsp;
    start.delete(&pid);
    delta = delta / 1000;
    if (delta < MINBLOCK_US)
        return 0;

    bpf_get_current_comm(&key.name, sizeof(key.name));
    bp = ctx->bp;

    // unrolled loop (MAXDEPTH):
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;

out:
    val = counts.lookup_or_init(&key, &zero);
    (*val) += delta;
    return 0;
}
"""
if args.pid:
    bpf_text = bpf_text.replace('FILTER',
        'if (pid != %s) { return 0; }' % (args.pid))
else:
    bpf_text = bpf_text.replace('FILTER', '')
if debug:
    print(bpf_text)
b = BPF(text=bpf_text)
b.attach_kprobe(event="schedule", fn_name="offcpu")
b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")
matched = b.num_open_kprobes()
if matched == 0:
    print("0 functions traced. Exiting.")
    exit()
# header
if not folded:
    print("Tracing off-CPU time (us) by kernel stack", end="")
    if duration < 99999999:
        print(" for %d secs." % duration)
    else:
        print("... Hit Ctrl-C to end.")
# output
while (1):
    try:
        sleep(duration)
    except KeyboardInterrupt:
        # as cleanup can take many seconds, trap Ctrl-C:
        signal.signal(signal.SIGINT, signal_ignore)

    if not folded:
        print()
    counts = b.get_table("counts")
    for k, v in sorted(counts.items(), key=lambda kv: kv[1].value):
        if folded:
            # print folded stack output
            line = k.name + ";"
            for i in reversed(range(0, maxdepth)):
                if k.ret[i] == 0:
                    continue
                line = line + b.ksym(k.ret[i])
                if i != 0:
                    line = line + ";"
            print("%s %d" % (line, v.value))
        else:
            # print default multi-line stack output
            for i in range(0, maxdepth):
                if k.ret[i] == 0:
                    break
                print("    %-16x %s" % (k.ret[i],
                    b.ksym(k.ret[i])))
            print("    %-16s %s" % ("-", k.name))
            print("        %d\n" % v.value)
    counts.clear()

    if not folded:
        print("Detaching...")
    exit()
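For reference, a minimal standalone sketch of the folded format that the -f option emits: the task name, then frames root to leaf, semicolon-separated, ending with the summed time. The frame names and time below are made up for illustration; they are not real trace data:

```python
# Sketch of the folded-stack line format produced by -f.
# Frame names and the time value here are illustrative only.

def fold(name, ret, total_us):
    """Build one folded line from a leaf-first frame list, mirroring
    how the script walks k.ret[] in reverse to print root-first."""
    frames = [f for f in reversed(ret) if f != 0]  # skip unused slots
    return "%s %d" % (";".join([name] + frames), total_us)

print(fold("bash", ["schedule", "do_wait", "sys_wait4", 0, 0], 12345))
# bash;sys_wait4;do_wait;schedule 12345
```

Lines in this format are what flame graph tooling consumes to render an off-CPU time flame graph.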
#!/usr/bin/python
#
# stackcount Count kernel function calls and their stack traces.
# For Linux, uses BCC, eBPF.
#
# USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s] [-v] pattern
#
# The pattern is a string with optional '*' wildcards, similar to file
# globbing. If you'd prefer to use regular expressions, use the -r option.
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 11 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
"""
parser = argparse.ArgumentParser(
description="Count kernel function calls and their stack traces",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-i", "--interval", default=99999999,
help="summary interval, seconds")
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-r", "--regexp", action="store_true",
help="use regular expressions. Default is \"*\" wildcards only.")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("pattern",
help="search expression for kernel functions")
args = parser.parse_args()
pattern = args.pattern
if not args.regexp:
    pattern = pattern.replace('*', '.*')
    pattern = '^' + pattern + '$'
offset = args.offset
verbose = args.verbose
debug = 0
maxdepth = 10 # and MAXDEPTH
# signal handler
def signal_ignore(signal, frame):
    print()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#define MAXDEPTH 10
struct key_t {
    u64 ip;
    u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
static u64 get_frame(u64 *bp) {
    if (*bp) {
        // The following stack walker is x86_64 specific
        u64 ret = 0;
        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
            return 0;
        if (!ret || ret < __START_KERNEL_map)
            return 0;
        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
            *bp = 0;    // stop the walk on a failed read
        return ret;
    }
    return 0;
}
int trace_count(struct pt_regs *ctx) {
    FILTER
    struct key_t key = {};
    u64 zero = 0, *val, bp = 0;
    int depth = 0;

    key.ip = ctx->ip;
    bp = ctx->bp;

    // unrolled loop, 10 (MAXDEPTH) frames deep:
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
    if (!(key.ret[depth++] = get_frame(&bp))) goto out;

out:
    val = counts.lookup_or_init(&key, &zero);
    (*val)++;
    return 0;
}
"""
if args.pid:
    bpf_text = bpf_text.replace('FILTER',
        ('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
        'if (pid != %s) { return 0; }') % (args.pid))
else:
    bpf_text = bpf_text.replace('FILTER', '')
if debug:
    print(bpf_text)
b = BPF(text=bpf_text)
b.attach_kprobe(event_re=pattern, fn_name="trace_count")
matched = b.num_open_kprobes()
if matched == 0:
    print("0 functions matched by \"%s\". Exiting." % args.pattern)
    exit()
# header
print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." %
    (matched, args.pattern))
def print_frame(addr):
    print("  ", end="")
    if verbose:
        print("%-16x " % addr, end="")
    if offset:
        print("%s" % b.ksymaddr(addr))
    else:
        print("%s" % b.ksym(addr))
# output
exiting = 0 if args.interval else 1
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1
        # as cleanup can take many seconds, trap Ctrl-C:
        signal.signal(signal.SIGINT, signal_ignore)

    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

    counts = b.get_table("counts")
    for k, v in sorted(counts.items(), key=lambda kv: kv[1].value):
        print_frame(k.ip)
        for i in range(0, maxdepth):
            if k.ret[i] == 0:
                break
            print_frame(k.ret[i])
        print("    %d\n" % v.value)
    counts.clear()

    if exiting:
        print("Detaching...")
        exit()
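The default (non -r) pattern handling above is a small glob-to-regex translation. A standalone sketch of just that step, with test inputs chosen for illustration:

```python
import re

def glob_to_regex(pattern):
    """Mirror the script's default pattern handling: '*' becomes '.*',
    and the result is anchored so 'tcp_send*' cannot match mid-name."""
    return '^' + pattern.replace('*', '.*') + '$'

pat = glob_to_regex('tcp_send*')
print(pat)                                  # ^tcp_send.*$
print(bool(re.match(pat, 'tcp_sendmsg')))   # True
print(bool(re.match(pat, 'xtcp_sendmsg')))  # False
```

Note that other regex metacharacters are not escaped, matching the script's behavior.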
Demonstrations of stackcount, the Linux eBPF/bcc version.
This program traces kernel functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. For example, counting
stack traces that led to submit_bio(), which creates block device I/O:
# ./stackcount submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
^C
submit_bio
submit_bh
journal_submit_commit_record.isra.13
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
1
submit_bio
__block_write_full_page.constprop.39
block_write_full_page
blkdev_writepage
__writepage
write_cache_pages
generic_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
2
submit_bio
__block_write_full_page.constprop.39
block_write_full_page
blkdev_writepage
__writepage
write_cache_pages
generic_writepages
do_writepages
__filemap_fdatawrite_range
filemap_fdatawrite
fdatawrite_one_bdev
36
submit_bio
submit_bh
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
38
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
79
Detaching...
The output shows unique stack traces, in order from leaf (on-CPU) to root,
followed by their occurrence count. The last stack trace in the above output
shows syscall handling, ext4_rename(), and filemap_flush(): it looks like an
application-issued file rename caused back-end disk I/O due to ext4 block
allocation and a filemap_flush(). I'd have to browse the code of those
functions to confirm!
The order of printed stack traces is from least to most frequent. The most
frequent in this case, the ext4_rename() stack, was taken 79 times during
tracing.
It can be useful to trace the path to submit_bio to explain unusual rates of
disk IOPS. These could have in-kernel origins (eg, background scrub).
This version of stackcount truncates stacks to 10 levels deep (plus 1 for
the traced function, so 11).
As another example, here are the code paths that led to ip_output(), which
sends a packet at the IP level:
# ./stackcount ip_output
Tracing 1 functions for "ip_output"... Hit Ctrl-C to end.
^C
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
tcp_tsq_handler.part.32
tcp_tasklet_func
tasklet_action
__do_softirq
do_softirq_own_stack
do_softirq
__local_bh_enable_ip
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_send_delayed_ack
__tcp_ack_snd_check
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_send_delayed_ack
__tcp_ack_snd_check
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_delack_timer_handler
tcp_delack_timer
call_timer_fn
run_timer_softirq
__do_softirq
irq_exit
xen_evtchn_do_upcall
1
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
release_sock
tcp_sendmsg
inet_sendmsg
sock_sendmsg
3
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
tcp_tsq_handler.part.32
tcp_tasklet_func
tasklet_action
__do_softirq
run_ksoftirqd
smpboot_thread_fn
kthread
3
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
4
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_send_ack
tcp_cleanup_rbuf
tcp_recvmsg
inet_recvmsg
sock_recvmsg
sock_read_iter
__vfs_read
vfs_read
5
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
release_sock
tcp_sendmsg
inet_sendmsg
sock_sendmsg
9
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_push
tcp_sendmsg
inet_sendmsg
sock_sendmsg
sock_write_iter
__vfs_write
51
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
171
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_rcv_established
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
994
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
tcp_tsq_handler.part.32
tcp_tasklet_func
tasklet_action
__do_softirq
irq_exit
xen_evtchn_do_upcall
xen_do_hypervisor_callback
1002
Detaching...
The last two stack traces shown are the most frequent, both via
tcp_transmit_skb(). Note that one of them originates from ip_rcv_finish(), a
local receive which then becomes a transmit: likely pushing more frames while
processing the newly received ones.
As may be obvious, this is a great tool for quickly understanding kernel code
flow.
A -i option can be used to set an output interval, and -T to include a
timestamp. For example:
# ./stackcount -Ti 1 submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
01:11:37
submit_bio
submit_bh
journal_submit_commit_record.isra.13
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
1
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
submit_bio
submit_bh
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
mb_cache_list
39
01:11:38
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
01:11:39
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
^C
01:11:39
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_rename
ext4_rename2
vfs_rename
sys_rename
entry_SYSCALL_64_fastpath
20
Detaching...
The -s output prints the return instruction offset for each function (aka
symbol offset). Eg:
# ./stackcount -s tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
^C
tcp_sendmsg+0x1
sock_sendmsg+0x38
sock_write_iter+0x78
__vfs_write+0xaa
vfs_write+0xa9
sys_write+0x46
entry_SYSCALL_64_fastpath+0x16
29
Detaching...
If it wasn't clear how one function called another, knowing the instruction
offset can help you locate the lines of code in a disassembly dump.
A wildcard can also be used. Eg, all functions beginning with "tcp_send":
# ./stackcount -s 'tcp_send*'
Tracing 14 functions for "tcp_send*"... Hit Ctrl-C to end.
^C
tcp_send_delayed_ack+0x1
tcp_rcv_established+0x3b1
tcp_v4_do_rcv+0x130
tcp_v4_rcv+0x8e0
ip_local_deliver_finish+0x9f
ip_local_deliver+0x51
ip_rcv_finish+0x8a
ip_rcv+0x29d
__netif_receive_skb_core+0x637
__netif_receive_skb+0x18
netif_receive_skb_internal+0x23
1
tcp_send_delayed_ack+0x1
tcp_rcv_established+0x222
tcp_v4_do_rcv+0x130
tcp_v4_rcv+0x8e0
ip_local_deliver_finish+0x9f
ip_local_deliver+0x51
ip_rcv_finish+0x8a
ip_rcv+0x29d
__netif_receive_skb_core+0x637
__netif_receive_skb+0x18
netif_receive_skb_internal+0x23
4
tcp_send_mss+0x1
inet_sendmsg+0x67
sock_sendmsg+0x38
sock_write_iter+0x78
__vfs_write+0xaa
vfs_write+0xa9
sys_write+0x46
entry_SYSCALL_64_fastpath+0x16
7
tcp_sendmsg+0x1
sock_sendmsg+0x38
sock_write_iter+0x78
__vfs_write+0xaa
vfs_write+0xa9
sys_write+0x46
entry_SYSCALL_64_fastpath+0x16
7
Detaching...
Use -r to allow regular expressions.
USAGE message:
# ./stackcount -h
usage: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s] [-v] pattern
Count kernel function calls and their stack traces
positional arguments:
pattern search expression for kernel functions
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
-i INTERVAL, --interval INTERVAL
summary interval, seconds
-T, --timestamp include timestamp on output
-r, --regexp use regular expressions. Default is "*" wildcards
only.
-s, --offset show address offsets
-v, --verbose show raw addresses
examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
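The frame-pointer walk that get_frame() performs in these tools can be illustrated in user space with a synthetic memory image. Everything below (addresses and contents) is invented for illustration; in the real tools the reads are bpf_probe_read() calls against live kernel memory:

```python
# At each x86_64 frame, the saved return address sits at *bp + 8 and the
# caller's saved frame pointer at *bp. MEM is a fake memory image.
MEM = {
    0x1000: 0x2000, 0x1008: 0xffffffff81280461,  # leaf frame
    0x2000: 0x3000, 0x2008: 0xffffffff811ed7f9,  # caller
    0x3000: 0x0,    0x3008: 0xffffffff8121ba25,  # chain ends (bp == 0)
}

def walk(bp, maxdepth=10):
    """Collect return addresses by following saved frame pointers."""
    stack = []
    for _ in range(maxdepth):
        if not bp or bp not in MEM:
            break
        stack.append(MEM[bp + 8])  # saved return address
        bp = MEM[bp]               # hop to the caller's frame
    return stack

print([hex(a) for a in walk(0x1000)])
```

The maxdepth bound plays the same role as the unrolled loop in the BPF programs: it caps how far a (possibly corrupt) frame chain is followed.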
#!/usr/bin/python
#
# stacksnoop Trace a kernel function and print all kernel stack traces.
# For Linux, uses BCC, eBPF, and currently x86_64 only. Inline C.
#
# USAGE: stacksnoop [-h] [-p PID] [-s] [-v] function
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# The stack depth is limited to 10 (+1 for the current instruction pointer).
# This could be tunable in a future version.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
# arguments
examples = """examples:
./stacksnoop ext4_sync_fs # print kernel stack traces for ext4_sync_fs
./stacksnoop -s ext4_sync_fs # ... also show symbol offsets
./stacksnoop -v ext4_sync_fs # ... show extra columns
./stacksnoop -p 185 ext4_sync_fs # ... only when PID 185 is on-CPU
"""
parser = argparse.ArgumentParser(
description="Trace and print kernel stack traces for a kernel function",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="print more fields")
parser.add_argument("function",
help="kernel function name")
args = parser.parse_args()
function = args.function
offset = args.offset
verbose = args.verbose
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
static int print_frame(u64 *bp, int *depth) {
    if (*bp) {
        // The following stack walker is x86_64 specific
        u64 ret = 0;
        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
            return 0;
        if (!ret || ret < __START_KERNEL_map)
            return 0;
        bpf_trace_printk("r%d: %llx\\n", *depth, ret);
        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
            return 0;
        *depth += 1;
        return 1;
    }
    return 0;
}
void trace_stack(struct pt_regs *ctx) {
    FILTER
    u64 bp = 0;
    int depth = 0;

    bpf_trace_printk("\\n");
    if (ctx->ip)
        bpf_trace_printk("ip: %llx\\n", ctx->ip);
    bp = ctx->bp;

    // unrolled loop, 10 frames deep:
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
    if (!print_frame(&bp, &depth)) return;
};
"""
if args.pid:
    # trace_stack() is void, so the filter returns without a value:
    bpf_text = bpf_text.replace('FILTER',
        ('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
        'if (pid != %s) { return; }') % (args.pid))
else:
    bpf_text = bpf_text.replace('FILTER', '')
if debug:
    print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event=function, fn_name="trace_stack")
matched = b.num_open_kprobes()
if matched == 0:
    print("Function \"%s\" not found. Exiting." % function)
    exit()
# header
if verbose:
    print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
        "STACK"))
else:
    print("%-18s %s" % ("TIME(s)", "STACK"))
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    if msg != "":
        (reg, addr) = msg.split(" ")
        if offset:
            ip = b.ksymaddr(int(addr, 16))
        else:
            ip = b.ksym(int(addr, 16))
        msg = msg + " " + ip
    if verbose:
        print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
    else:
        print("%-18.9f %s" % (ts, msg))
Demonstrations of stacksnoop, the Linux eBPF/bcc version.
This program traces the given kernel function and prints the kernel stack trace
for every call. This tool is useful for studying low frequency kernel functions,
to see how they were invoked. For exmaple, tracing the ext4_sync_fs() call:
# ./stacksnoop ext4_sync_fs
TIME(s) STACK
42005194.132250004
42005194.132253997 ip: ffffffff81280461 ext4_sync_fs
42005194.132256001 r0: ffffffff811ed7f9 iterate_supers
42005194.132257000 r1: ffffffff8121ba25 sys_sync
42005194.132257000 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005194.132275000
42005194.132275999 ip: ffffffff81280461 ext4_sync_fs
42005194.132275999 r0: ffffffff811ed7f9 iterate_supers
42005194.132276997 r1: ffffffff8121ba35 sys_sync
42005194.132276997 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This shows that ext4_sync_fs() was called by iterate_supers(), which was called
by sys_sync(), and so on. (It tells me that this was a syscall-invoked sync,
so an application requested it.)
The "ip" refers to the instruction pointer, and the "r#" refers to the return
address for each stack frame.
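Each output line pairs a raw trace message with a resolved symbol. As a minimal
standalone sketch of that formatting step (the symbol table below is
hypothetical; the real tool resolves addresses with bcc's BPF.ksym()):

```python
# Sketch: turn a raw trace message like "r0: ffffffff811ed7f9" into an
# annotated stack frame line. SYMS is a hypothetical stand-in for the
# kernel symbol lookup (bcc's BPF.ksym()) used by the actual tool.
SYMS = {0xffffffff811ed7f9: "iterate_supers"}

def annotate(msg):
    # Split into register label ("ip:" or "r#:") and hex address,
    # then append the symbol name for that address.
    reg, addr = msg.split(" ")
    sym = SYMS.get(int(addr, 16), "[unknown]")
    return "%s %s %s" % (reg, addr, sym)

print(annotate("r0: ffffffff811ed7f9"))
```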
For high frequency functions, see stackcount, which summarizes stack traces
in-kernel for efficiency. If you don't know whether your function is low or
high frequency, try funccount.
The -v option includes more fields, including the on-CPU process (COMM and PID):
# ./stacksnoop -v ext4_sync_fs
TIME(s) COMM PID CPU STACK
42005557.056332998 sync 22352 1
42005557.056336999 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056339003 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056340002 sync 22352 1 r1: ffffffff8121ba25 sys_sync
42005557.056340002 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005557.056358002 sync 22352 1
42005557.056358002 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056359001 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056359999 sync 22352 1 r1: ffffffff8121ba35 sys_sync
42005557.056359999 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This identifies the application issuing the sync syscall: the sync(1) command
(COMM column).
Here's another example, showing the path to second_overflow() and on-CPU
process:
# ./stacksnoop -v second_overflow
TIME(s) COMM PID CPU STACK
42005696.529449999 <idle> 0 0
42005696.529457003 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005696.529459000 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005696.529459998 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005696.529459998 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005696.529460996 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005696.529460996 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005697.616295002 <idle> 0 0
42005697.616301000 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005697.616302997 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005697.616304003 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005697.616304003 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005697.616305001 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005697.616305001 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005698.556240998 <idle> 0 1
42005698.556247003 <idle> 0 1 ip: ffffffff810e5701 second_overflow
42005698.556249000 <idle> 0 1 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005698.556249000 <idle> 0 1 r1: ffffffff810ed6e0 tick_irq_enter
42005698.556249999 <idle> 0 1 r2: ffffffff8107a195 irq_enter
42005698.556249999 <idle> 0 1 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005698.556250997 <idle> 0 1 r4: ffffffff81777a2e xen_do_hypervisor_callback
[...]
This fires every second (see TIME(s)), and is from tick_do_update_jiffies64().
USAGE message:
# ./stacksnoop -h
usage: stacksnoop [-h] [-p PID] [-s] [-v] function
Trace and print kernel stack traces for a kernel function
positional arguments:
function kernel function name
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
-s, --offset show address offsets
-v, --verbose print more fields
examples:
./stacksnoop ext4_sync_fs # print kernel stack traces for ext4_sync_fs
./stacksnoop -s ext4_sync_fs # ... also show symbol offsets
./stacksnoop -v ext4_sync_fs # ... show extra columns
./stacksnoop -p 185 ext4_sync_fs # ... only when PID 185 is on-CPU