updating branch

Merge remote-tracking branch 'origin/master' into percpu

59d525fa · Zaafar Ahmed · 64258256 · 6c08dcbb · 59d525fa · 59d525fa
Commit 59d525fa authored Mar 29, 2016 by Zaafar Ahmed
38 changed files
--- a/README.md
+++ b/README.md
@@ -100,6 +100,7 @@ Examples:
 - tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
 - tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
 - tools/[softirqs](tools/softirqs.py):  Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
+- tools/[solisten](tools/solisten.py): Trace TCP socket listen. [Examples](tools/solisten_example.txt).
 - tools/[stackcount](tools/stackcount.py): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
 - tools/[stacksnoop](tools/stacksnoop.py): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
 - tools/[statsnoop](tools/statsnoop.py): Trace stat() syscalls. [Examples](tools/statsnoop_example.txt).
@@ -108,6 +109,7 @@ Examples:
 - tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
 - tools/[tcpconnlat](tools/tcpconnlat.py): Trace TCP active connection latency (connect()). [Examples](tools/tcpconnlat_example.txt).
 - tools/[tcpretrans](tools/tcpretrans.py): Trace TCP retransmits and TLPs. [Examples](tools/tcpretrans_example.txt).
+- tools/[tplist](tools/tplist.py): Display kernel tracepoints or USDT probes and their formats. [Examples](tools/tplist_example.txt).
 - tools/[trace](tools/trace.py): Trace arbitrary functions, with filters. [Examples](tools/trace_example.txt)
 - tools/[vfscount](tools/vfscount.py) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
 - tools/[vfsstat](tools/vfsstat.py) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).

--- a/examples/tracing/mallocstacks.py
+++ b/examples/tracing/mallocstacks.py
+#!/usr/bin/python
+#
+# mallocstacks  Trace malloc() calls in a process and print the full
+#               stack trace for all callsites.
+#               For Linux, uses BCC, eBPF. Embedded C.
+#
+# This script is a basic example of the new Linux 4.6+ BPF_STACK_TRACE
+# table API.
+#
+# Copyright 2016 GitHub, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+
+from __future__ import print_function
+from bcc import BPF, ProcessSymbols
+from time import sleep
+import sys
+
+if len(sys.argv) < 2:
+    print("USAGE: mallocstacks PID")
+    exit()
+pid = int(sys.argv[1])
+
+# load BPF program
+b = BPF(text="""
+#include <uapi/linux/ptrace.h>
+
+BPF_HASH(calls, int);
+BPF_STACK_TRACE(stack_traces, 1024);
+
+int alloc_enter(struct pt_regs *ctx, size_t size) {
+    int key = stack_traces.get_stackid(ctx,
+        BPF_F_USER_STACK|BPF_F_REUSE_STACKID);
+    if (key < 0)
+        return 0;
+
+    u64 zero = 0, *val;
+    val = calls.lookup_or_init(&key, &zero);
+    (*val) += size;
+    return 0;
+};
+""")
+
+b.attach_uprobe(name="c", sym="malloc", fn_name="alloc_enter", pid=pid)
+print("Attaching to malloc in pid %d, Ctrl+C to quit." % pid)
+
+decoder = ProcessSymbols(pid)
+
+# sleep until Ctrl-C
+try:
+    sleep(99999999)
+except KeyboardInterrupt:
+    pass
+
+calls = b.get_table("calls")
+stack_traces = b.get_table("stack_traces")
+
+for k, v in reversed(sorted(calls.items(), key=lambda c: c[1].value)):
+    print("%d bytes allocated at:" % v.value)
+    for addr in stack_traces.walk(k.value):
+        print("\t%s (%x)" % (decoder.decode_addr(addr), addr))
--- a/man/man8/argdist.8
+++ b/man/man8/argdist.8
@@ -50,11 +50,11 @@ many cases, argdist will deduce the necessary header files automatically.
 .SH SPECIFIER SYNTAX
 The general specifier syntax is as follows:

-.B {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+.B {p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 .TP
-.B {p,r}
-Probe type \- "p" for function entry, "r" for function return;
-\-H for histogram collection, \-C for frequency count.
+.B {p,r,t,u}
+Probe type \- "p" for function entry, "r" for function return, "t" for kernel
+tracepoint, "u" for USDT probe; \-H for histogram collection, \-C for frequency count.
 Indicates where to place the probe and whether the probe should collect frequency
 count information, or aggregate the collected values into a histogram. Counting 
 probes will collect the number of times every parameter value was observed,
@@ -68,12 +68,19 @@ Specify the full path to the .so or executable file where the function to probe
 resides. Alternatively, you can specify just the lib name: for example, "c"
 refers to libc. If no library name is specified, the kernel is assumed.
 .TP
+.B category
+The category of the kernel tracepoint. For example: net, sched, block.
+.TP
 .B function(signature)
 The function to probe, and its signature.
 The function name must match exactly for the probe to be placed. The signature,
 on the other hand, is only required if you plan to collect parameter values 
 based on that signature. For example, if you only want to collect the first
 parameter, you don't have to specify the rest of the parameters in the signature.
+When capturing kernel tracepoints, this should be the name of the event, e.g.
+net_dev_start_xmit. The signature for kernel tracepoints should be empty. When
+capturing USDT probes, this should be the name of the probe, e.g. reloc_complete.
+The signature for USDT probes should be empty.
 .TP
 .B [type[,type...]]
 The type(s) of the expression(s) to capture.
@@ -85,6 +92,12 @@ The expression(s) to capture.
 These are the values that are assigned to the histogram or raw event collection.
 You may use the parameters directly, or valid C expressions that involve the
 parameters, such as "size % 10".
+Tracepoints may access a special structure called "tp" that is formatted according
+to the tracepoint format (which you can obtain using tplist). For example, the
+block:block_rq_complete tracepoint can access tp.nr_sector. You may also use the
+members of the "tp" struct directly, e.g. "nr_sector" instead of "tp.nr_sector".
+USDT probes may access the arguments defined by the tracing program in the 
+special arg1, arg2, ... variables. To obtain their types, use the tplist tool.
 Return probes can use the argument values received by the
 function when it was entered, through the $entry(paramname) special variable.
 Return probes can also access the function's return value in $retval, and the
@@ -137,6 +150,18 @@ Count fork() calls in libc across all processes, grouped by pid:
 #
 .B argdist -C 'p:c:fork():int:$PID;fork per process'
 .TP
+Print histogram of number of sectors in completing block I/O requests:
+#
+.B argdist -H 't:block:block_rq_complete():u32:nr_sector'
+.TP
+Aggregate interrupts by interrupt request (IRQ):
+#
+.B argdist -C 't:irq:irq_handler_entry():int:irq'
+.TP
+Print the functions used as thread entry points and how common they are:
+#
+.B argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
+.TP
 Print histograms of sleep() and nanosleep() parameter values:
 #
 .B argdist -H 'p:c:sleep(u32 seconds):u32:seconds' 'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'

--- a/man/man8/memleak.8
+++ b/man/man8/memleak.8
@@ -3,7 +3,7 @@
 memleak \- Print a summary of outstanding allocations and their call stacks to detect memory leaks. Uses Linux eBPF/bcc.
 .SH SYNOPSIS
 .B memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND] [-s SAMPLE_RATE]
-[-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
+[-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
 .SH DESCRIPTION
 memleak traces and matches memory allocation and deallocation requests, and
 collects call stacks for each allocation. memleak can then print a summary
@@ -15,10 +15,7 @@ When tracing all processes, memleak instruments kmalloc and kfree.
 memleak may introduce significant overhead when tracing processes that allocate
 and free many blocks very quickly. See the OVERHEAD section below.

-The stack depth is limited to 10 by default (+1 for the current instruction pointer),
-but it can be controlled using the \-d switch if deeper stacks are required.
-
-This currently only works on x86_64. Check for future versions.
+This tool only works on Linux 4.6+. Stack traces are obtained using the new `BPF_STACK_TRACE` APIs.
 .SH REQUIREMENTS
 CONFIG_BPF and bcc.
 .SH OPTIONS
@@ -45,10 +42,6 @@ Run the specified command and trace its allocations only. This traces malloc and
 \-s SAMPLE_RATE
 Record roughly every SAMPLE_RATE-th allocation to reduce overhead.
 .TP
-\-d STACK_DEPTH
-Capture STACK_DEPTH frames (or less) when obtaining allocation call stacks.
-The default value is 10.
-.TP
 \-t TOP
 Print only the top TOP stacks (sorted by size).
 The default value is 10.
@@ -106,9 +99,6 @@ placed in a typical period of 10 seconds:
 #
 .B perf stat -a -e 'probe:__kmalloc' -- sleep 10

-Another setting that may help reduce overhead is lowering the number of stack
-frames captured and parsed by memleak for each allocation, using the \-d switch.
-
 .SH SOURCE
 This is from bcc.
 .IP

--- a/man/man8/offcputime.8
+++ b/man/man8/offcputime.8
@@ -22,10 +22,8 @@ especially the -f format, can be used to generate an "off-CPU time flame graph".

 See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html

-The stack depth is currently limited to 20, and the stack traces are kernel
-mode only. Check for newer versions where either may be improved.
-
-This currently only works on x86_64. Check for future versions.
+This tool only works on Linux 4.6+. It uses the new `BPF_STACK_TRACE` table
+APIs to generate the in-kernel stack traces.
 .SH REQUIREMENTS
 CONFIG_BPF and bcc.
 .SH OPTIONS

--- a/man/man8/stackcount.8
+++ b/man/man8/stackcount.8
@@ -12,10 +12,8 @@ and their occurrence counts.
 The pattern is a string with optional '*' wildcards, similar to file globbing.
 If you'd prefer to use regular expressions, use the \-r option.

-The stack depth is currently limited to 10 (+1 for the current instruction
-pointer).
+This tool only works on Linux 4.6+. Stack traces are obtained using the new `BPF_STACK_TRACE` APIs.

-This currently only works on x86_64. Check for future versions.
 .SH REQUIREMENTS
 CONFIG_BPF and bcc.
 .SH OPTIONS
@@ -82,13 +80,12 @@ Only count stacks when PID 185 is on-CPU:
 This summarizes unique stack traces in-kernel for efficiency, allowing it to
 trace a higher rate of function calls than methods that post-process in user
 space. The stack trace data is only copied to user space when the output is
-printed, which usually only happens once. Given these techniques, I'd suspect
-that call rates of < 10,000/sec would incur negligible overhead (for this
-current version; future versions may improve this). Beyond that,
-there will be a point where the overhead is measurable, as this does add
-a number of instructions to each function call to walk and save stacks.
-Test before production use. You can also use funccount to get a handle on
-function call rates first.
+printed, which usually only happens once. The stack walking also happens in an
+optimized code path in the kernel thanks to the new BPF_STACK_TRACE table APIs,
+which should be more efficient than the manual walker in the eBPF tracer which
+older versions of this script used. With this in mind, call rates of <
+10,000/sec would incur negligible overhead. Test before production use. You can
+also use funccount to get a handle on function call rates first.
 .SH SOURCE
 This is from bcc.
 .IP

--- a/man/man8/stacksnoop.8
+++ b/man/man8/stacksnoop.8
@@ -9,10 +9,7 @@ kernel stack back trace for that call. This shows the ancestry of function
 calls, and is a quick way to investigate low frequency kernel functions and
 their cause. For high frequency kernel functions, see stackcount.

-The stack depth is currently limited to 10 (+1 for the current instruction
-pointer).
-
-This currently only works on x86_64. Check for future versions.
+This tool only works on Linux 4.6+. Stack traces are obtained using the new `BPF_STACK_TRACE` APIs.
 .SH REQUIREMENTS
 CONFIG_BPF and bcc.
 .SH OPTIONS

--- a/man/man8/tplist.8
+++ b/man/man8/tplist.8
+.TH tplist 8  "2016-03-20" "USER COMMANDS"
+.SH NAME
+tplist \- Display kernel tracepoints or USDT probes and their formats.
+.SH SYNOPSIS
+.B tplist [-p PID] [-l LIB] [-v] [filter]
+.SH DESCRIPTION
+tplist lists all kernel tracepoints, and can optionally print out the tracepoint
+format; namely, the variables that you can trace when the tracepoint is hit. 
+tplist can also list USDT probes embedded in a specific library or executable,
+and can list USDT probes for all the libraries loaded by a specific process.
+These features are usually used in conjunction with the argdist and/or trace tools.
+
+On a typical system, accessing the tracepoint list and format requires root.
+However, accessing USDT probes does not require root.
+.SH OPTIONS
+.TP
+\-p PID
+Display the USDT probes from all the libraries loaded by the specified process.
+.TP
+\-l LIB
+Display the USDT probes from the specified library or executable. If the library
+or executable can be found in the standard paths, a full path is not required.
+.TP
+\-v
+Display the variables associated with the tracepoint or USDT probe.
+.TP
+[filter]
+A wildcard expression that specifies which tracepoints or probes to print.
+For example, block:* will print all block tracepoints (block:block_rq_complete,
+etc.). Regular expressions are not supported.
+.SH EXAMPLES
+.TP
+Print all kernel tracepoints:
+#
+.B tplist
+.TP
+Print all net tracepoints with their format:
+#
+.B tplist -v 'net:*'
+.TP
+Print all USDT probes in libpthread:
+$ 
+.B tplist -l pthread
+.TP
+Print all USDT probes in process 4717 from the libc provider:
+$
+.B tplist -p 4717 'libc:*'
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Sasha Goldshtein
--- a/man/man8/trace.8
+++ b/man/man8/trace.8
@@ -45,10 +45,12 @@ information. See PROBE SYNTAX below.
 The general probe syntax is as follows:

 .B [{p,r}]:[library]:function [(predicate)] ["format string"[, arguments]]
+
+.B {t:category:event,u:library:probe} [(predicate)] ["format string"[, arguments]]
 .TP
-.B [{p,r}]
-Probe type \- "p" for function entry, "r" for function return. The default
-probe type is "p".
+.B {[{p,r}],t,u}
+Probe type \- "p" for function entry, "r" for function return, "t" for kernel
+tracepoint, "u" for USDT probe. The default probe type is "p".
 .TP
 .B [library]
 Library containing the probe.
@@ -58,9 +60,18 @@ refers to libc. If no library name is specified, the kernel is assumed. Also,
 you can specify an executable name (without a full path) if it is in the PATH.
 For example, "bash".
 .TP
+.B category
+The tracepoint category. For example, "sched" or "irq".
+.TP
 .B function
 The function to probe.
 .TP
+.B event
+The tracepoint event. For example, "block_rq_complete".
+.TP
+.B probe
+The USDT probe name. For example, "pthread_create".
+.TP
 .B [(predicate)]
 The filter applied to the captured data. Only if the filter evaluates as true,
 the trace message will be printed. The filter can use any valid C expression
@@ -81,6 +92,16 @@ number of arguments as there are placeholders in the format string. The
 format specifier replacements may be any C expressions, and may refer to the
 same special keywords as in the predicate (arg1, arg2, etc.).

+In tracepoints, both the predicate and the arguments may refer to the tracepoint
+format structure, which is stored in the special "tp" variable. For example, the
+block:block_rq_complete tracepoint can print or filter by tp.nr_sector. To 
+discover the format of your tracepoint, use the tplist tool. Note that you can
+also use the members of the "tp" struct directly, e.g. "nr_sector" instead of
+"tp.nr_sector".
+
+In USDT probes, the arg1, ..., argN variables refer to the probe's arguments.
+To determine which arguments your probe has, use the tplist tool.
+
 The predicate expression and the format specifier replacements for printing
 may also use the following special keywords: $pid, $tgid to refer to the 
 current process' pid and tgid; $uid, $gid to refer to the current user's
@@ -102,6 +123,14 @@ Trace all malloc calls and print the size of the requested allocation:
 Trace returns from the readline function in bash and print the return value as a string:
 #
 .B trace 'r:bash:readline """%s"", retval' 
+.TP
+Trace the block:block_rq_complete tracepoint and print the number of sectors completed:
+#
+.B trace 't:block:block_rq_complete """%d sectors"", nr_sector'
+.TP
+Trace the pthread_create USDT probe from the pthread library and print the address of the thread's start function:
+#
+.B trace 'u:pthread:pthread_create """start addr = %llx"", arg3'
 .SH SOURCE
 This is from bcc.
 .IP

--- a/src/cc/CMakeLists.txt
+++ b/src/cc/CMakeLists.txt
@@ -42,16 +42,18 @@ set_target_properties(bcc-static PROPERTIES OUTPUT_NAME bcc)

 # BPF is still experimental otherwise it should be available
 #llvm_map_components_to_libnames(llvm_libs bpf mcjit irreader passes)
-llvm_map_components_to_libnames(llvm_libs mcjit irreader passes linker
-  instrumentation objcarcopts bitwriter option x86codegen)
+llvm_map_components_to_libnames(llvm_libs bitwriter bpfcodegen irreader linker
+  mcjit objcarcopts option passes x86codegen)
+llvm_expand_dependencies(expanded_libs ${llvm_libs})
+
 # order is important
 set(clang_libs ${libclangFrontend} ${libclangSerialization} ${libclangDriver} ${libclangParse}
  ${libclangSema} ${libclangCodeGen} ${libclangAnalysis} ${libclangRewrite} ${libclangEdit}
  ${libclangAST} ${libclangLex} ${libclangBasic})

 # Link against LLVM libraries
-target_link_libraries(bcc-shared b_frontend clang_frontend ${clang_libs} ${llvm_libs} LLVMBPFCodeGen)
-target_link_libraries(bcc-static b_frontend clang_frontend bcc-loader-static ${clang_libs} ${llvm_libs} LLVMBPFCodeGen)
+target_link_libraries(bcc-shared b_frontend clang_frontend ${clang_libs} ${expanded_libs})
+target_link_libraries(bcc-static b_frontend clang_frontend bcc-loader-static ${clang_libs} ${expanded_libs})

 install(TARGETS bcc-shared LIBRARY COMPONENT libbcc
  DESTINATION ${CMAKE_INSTALL_LIBDIR})

--- a/src/cc/frontends/clang/b_frontend_action.cc
+++ b/src/cc/frontends/clang/b_frontend_action.cc
@@ -692,7 +692,7 @@ BFrontendAction::BFrontendAction(llvm::raw_ostream &os, unsigned flags)
 }

 void BFrontendAction::EndSourceFileAction() {
-  if (flags_ & 0x4)
+  if (flags_ & DEBUG_PREPROCESSOR)
    rewriter_->getEditBuffer(rewriter_->getSourceMgr().getMainFileID()).write(llvm::errs());
  rewriter_->getEditBuffer(rewriter_->getSourceMgr().getMainFileID()).write(os_);
  os_.flush();

--- a/src/cc/frontends/clang/b_frontend_action.h
+++ b/src/cc/frontends/clang/b_frontend_action.h
@@ -26,6 +26,8 @@

 #include "table_desc.h"

+#define DEBUG_PREPROCESSOR 0x4
+
 namespace clang {
 class ASTConsumer;
 class ASTContext;

--- a/src/cc/frontends/clang/kbuild_helper.cc
+++ b/src/cc/frontends/clang/kbuild_helper.cc
@@ -82,6 +82,9 @@ int KBuildHelper::get_flags(const char *uname_machine, vector<string> *cflags) {
  cflags->push_back("-include");
  cflags->push_back("./include/linux/kconfig.h");
  cflags->push_back("-D__KERNEL__");
+  cflags->push_back("-D__HAVE_BUILTIN_BSWAP16__");
+  cflags->push_back("-D__HAVE_BUILTIN_BSWAP32__");
+  cflags->push_back("-D__HAVE_BUILTIN_BSWAP64__");
  cflags->push_back("-Wno-unused-value");
  cflags->push_back("-Wno-pointer-sign");


--- a/src/cc/frontends/clang/loader.cc
+++ b/src/cc/frontends/clang/loader.cc
@@ -155,7 +155,7 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
  // Initialize a compiler invocation object from the clang (-cc1) arguments.
  const driver::ArgStringList &ccargs = cmd.getArguments();

-  if (flags_ & 0x4) {
+  if (flags_ & DEBUG_PREPROCESSOR) {
    llvm::errs() << "clang";
    for (auto arg : ccargs)
      llvm::errs() << " " << arg;

--- a/src/python/bcc/__init__.py
+++ b/src/python/bcc/__init__.py
@@ -20,13 +20,16 @@ import json
 import multiprocessing
 import os
 import re
-from subprocess import Popen, PIPE
+from subprocess import Popen, PIPE, STDOUT
 import struct
 import sys
 basestring = (unicode if sys.version_info[0] < 3 else str)

 from .libbcc import lib, _CB_TYPE
+from .procstat import ProcStat
 from .table import Table
+from .tracepoint import Perf, Tracepoint
+from .usyms import ProcessSymbols

 open_kprobes = {}
 open_uprobes = {}
@@ -38,6 +41,10 @@ ksym_names = {}
 ksym_loaded = 0
 _kprobe_limit = 1000

+DEBUG_LLVM_IR = 0x1
+DEBUG_BPF = 0x2
+DEBUG_PREPROCESSOR = 0x4
+
 @atexit.register
 def cleanup_kprobes():
    for k, v in open_kprobes.items():
@@ -72,6 +79,30 @@ class BPF(object):
    _lib_load_address_cache = {}
    _lib_symbol_cache = {}

+    _auto_includes = {
+        "linux/time.h"      : ["time"],
+        "linux/fs.h"        : ["fs", "file"],
+        "linux/blkdev.h"    : ["bio", "request"],
+        "linux/slab.h"      : ["alloc"],
+        "linux/netdevice.h" : ["sk_buff", "net_device"]
+    }
+
+    @classmethod
+    def generate_auto_includes(cls, program_words):
+        """
+        Generates #include statements automatically based on a set of
+        recognized types such as sk_buff and bio. The input is all the words
+        that appear in the BPF program, and the output is a (possibly empty)
+        string of #include statements, such as "#include <linux/fs.h>".
+        """
+        headers = ""
+        for header, keywords in cls._auto_includes.items():
+            for keyword in keywords:
+                for word in program_words:
+                    if keyword in word and header not in headers:
+                        headers += "#include <%s>\n" % header
+        return headers
+
    # defined for compatibility reasons, to be removed
    Table = Table

@@ -105,8 +136,9 @@ class BPF(object):
            hdr_file (Optional[str]): Path to a helper header file for the `src_file`
            text (Optional[str]): Contents of a source file for the module
            debug (Optional[int]): Flags used for debug prints, can be |'d together
-                0x1: print LLVM IR to stderr
-                0x2: print BPF bytecode to stderr
+                DEBUG_LLVM_IR: print LLVM IR to stderr
+                DEBUG_BPF: print BPF bytecode to stderr
+                DEBUG_PREPROCESSOR: print preprocessed C file to stderr
        """

        self._reader_cb_impl = _CB_TYPE(BPF._reader_cb)
@@ -165,7 +197,7 @@ class BPF(object):
                lib.bpf_module_kern_version(self.module),
                log_buf, ct.sizeof(log_buf) if log_buf else 0)

-        if self.debug & 0x2:
+        if self.debug & DEBUG_BPF:
            print(log_buf.value.decode(), file=sys.stderr)

        if fd < 0:
@@ -325,6 +357,11 @@ class BPF(object):
        global open_kprobes
        return open_kprobes

+    @staticmethod
+    def open_uprobes():
+        global open_uprobes
+        return open_uprobes
+
    @staticmethod
    def detach_kprobe(event):
        ev_name = "p_" + event.replace("+", "_").replace(".", "_")
@@ -701,7 +738,7 @@ class BPF(object):
        if idx == -1:
            return "[unknown]"
        offset = int(addr - ksyms[idx][1])
-        return ksyms[idx][0] + hex(offset)
+        return "%s+0x%x" % (ksyms[idx][0], offset)

    @staticmethod
    def ksymname(name):
@@ -717,6 +754,28 @@ class BPF(object):
            return 0
        return ksyms[idx][1]

+    @classmethod
+    def usymaddr(cls, pid, addr, refresh_symbols=False):
+        """usymaddr(pid, addr, refresh_symbols=False)
+
+        Decode the specified address in the specified process to a symbolic
+        representation that includes the symbol name, offset within the symbol,
+        and the module name. See the ProcessSymbols class for more details.
+
+        Specify refresh_symbols=True if you suspect the set of loaded modules
+        or their load addresses has changed since the last time you called
+        usymaddr() on this pid.
+        """
+        proc_sym = None
+        if pid in cls._process_symbols:
+            proc_sym = cls._process_symbols[pid]
+            if refresh_symbols:
+                proc_sym.refresh_code_ranges()
+        else:
+            proc_sym = ProcessSymbols(pid)
+            cls._process_symbols[pid] = proc_sym
+        return proc_sym.decode_addr(addr)
+
    @staticmethod
    def num_open_kprobes():
        """num_open_kprobes()
@@ -740,3 +799,5 @@ class BPF(object):
        except KeyboardInterrupt:
            exit()

+from .usdt import USDTReader
+
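The keyword matching in `generate_auto_includes` above can be tried without bcc installed. The following is a minimal standalone sketch: the function body mirrors the one in the diff, but the module-level `AUTO_INCLUDES` name, the reduced table, and the sample words are assumptions made for illustration.

```python
# Reduced copy of the header table from the diff; illustration only.
AUTO_INCLUDES = {
    "linux/fs.h": ["fs", "file"],
    "linux/netdevice.h": ["sk_buff", "net_device"],
}


def generate_auto_includes(program_words, auto_includes=AUTO_INCLUDES):
    """Return '#include <...>' lines for every header that has a keyword
    occurring as a substring of any word in the program."""
    headers = ""
    for header, keywords in auto_includes.items():
        for keyword in keywords:
            for word in program_words:
                # Substring test, and each header is emitted at most once.
                if keyword in word and header not in headers:
                    headers += "#include <%s>\n" % header
    return headers


# Hypothetical probe text, split into words.
words = "struct sk_buff *skb".split()
result = generate_auto_includes(words)
print(result)
```

Because the test is a substring match, a word such as `offset` also matches the `fs` keyword in this sketch and pulls in `linux/fs.h`. The extra include is probably harmless, but it is worth knowing when reading the generated program.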
--- a/src/python/bcc/procstat.py
+++ b/src/python/bcc/procstat.py
+# Copyright 2016 Sasha Goldshtein
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+class ProcStat(object):
+        def __init__(self, pid):
+                self.pid = pid
+                self.exe = self._get_exe()
+                self.start_time = self._get_start_time()
+
+        def is_stale(self):
+                return self.exe != self._get_exe() or \
+                       self.start_time != self._get_start_time()
+
+        def _get_exe(self):
+                return os.popen("readlink -f /proc/%d/exe" % self.pid).read()
+
+        def _get_start_time(self):
+                return os.popen("cut -d' ' -f 22 /proc/%d/stat" %
+                                self.pid).read()
+
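`ProcStat` above takes field 22 of `/proc/PID/stat` (the process start time) by splitting the line on spaces with `cut`. As a hedged sketch of the same lookup in pure Python, the hypothetical helper `parse_start_time` below splits after the last `)` so that a command name containing spaces does not shift the field positions. The sample line is fabricated for illustration.

```python
def parse_start_time(stat_line):
    """Extract field 22 (starttime) from the text of /proc/PID/stat.

    Field 2 is the command name in parentheses and may contain spaces,
    so the line is split only after the last ')'.
    """
    rest = stat_line[stat_line.rindex(")") + 2:]
    fields = rest.split()
    # 'rest' begins at field 3 (state), so field 22 sits at index 19.
    return int(fields[19])


# Fabricated stat line: pid, "(comm)", state, fields 4..21, then field 22.
tail = ["S"] + [str(i) for i in range(4, 22)] + ["987654"]
sample = "1234 (my prog) " + " ".join(tail)

print(parse_start_time(sample))
```

On the same sample, a plain space split returns the neighbouring field instead, because the space inside `(my prog)` counts as a delimiter.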
--- a/src/python/bcc/table.py
+++ b/src/python/bcc/table.py
@@ -523,9 +523,37 @@ class PerCpuArray(ArrayBase):
        return result

 class StackTrace(TableBase):
+    MAX_DEPTH = 127
+
    def __init__(self, *args, **kwargs):
        super(StackTrace, self).__init__(*args, **kwargs)

+    class StackWalker(object):
+        def __init__(self, stack, resolve=None):
+            self.stack = stack
+            self.n = -1
+            self.resolve = resolve
+
+        def __iter__(self):
+            return self
+
+        def __next__(self):
+            return self.next()
+
+        def next(self):
+            self.n += 1
+            if self.n == StackTrace.MAX_DEPTH:
+                raise StopIteration()
+
+            addr = self.stack.ip[self.n]
+            if addr == 0:
+                raise StopIteration()
+
+            return self.resolve(addr) if self.resolve else addr
+
+    def walk(self, stack_id, resolve=None):
+        return StackTrace.StackWalker(self[self.Key(stack_id)], resolve)
+
    def __len__(self):
        i = 0
        for k in self: i += 1

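The iteration contract of `StackWalker` above (stop at the first zero address or after `MAX_DEPTH` frames, optionally resolving each address) can be exercised without a BPF table. This is a minimal sketch: `FakeStack` is a hypothetical stand-in for the table value, and the addresses are made up.

```python
MAX_DEPTH = 127


class FakeStack(object):
    """Hypothetical stand-in for a stack table value: a fixed-size
    'ip' array padded with zeros after the last real frame."""
    def __init__(self, addrs):
        self.ip = list(addrs) + [0] * (MAX_DEPTH - len(addrs))


class StackWalker(object):
    """Same logic as the walker in the diff, lifted out of StackTrace."""
    def __init__(self, stack, resolve=None):
        self.stack = stack
        self.n = -1
        self.resolve = resolve

    def __iter__(self):
        return self

    def __next__(self):
        self.n += 1
        if self.n == MAX_DEPTH:
            raise StopIteration()
        addr = self.stack.ip[self.n]
        if addr == 0:
            raise StopIteration()
        return self.resolve(addr) if self.resolve else addr

    next = __next__  # Python 2 iterator protocol


frames = list(StackWalker(FakeStack([0x400100, 0x400200])))
named = list(StackWalker(FakeStack([0x400100, 0x400200]), resolve=hex))
print(frames, named)
```

In the real API, the `resolve` callback is where a symbolizer such as `decoder.decode_addr` from `mallocstacks.py` would be passed.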
--- a/src/python/bcc/tracepoint.py
+++ b/src/python/bcc/tracepoint.py
+# Copyright 2016 Sasha Goldshtein
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import ctypes as ct
+import multiprocessing
+import os
+import re
+
+class Perf(object):
+        class perf_event_attr(ct.Structure):
+                _fields_ = [
+                        ('type', ct.c_uint),
+                        ('size', ct.c_uint),
+                        ('config', ct.c_ulong),
+                        ('sample_period', ct.c_ulong),
+                        ('sample_type', ct.c_ulong),
+                        ('IGNORE1', ct.c_ulong),
+                        ('IGNORE2', ct.c_ulong),
+                        ('wakeup_events', ct.c_uint),
+                        ('IGNORE3', ct.c_uint),
+                        ('IGNORE4', ct.c_ulong),
+                        ('IGNORE5', ct.c_ulong),
+                        ('IGNORE6', ct.c_ulong),
+                        ('IGNORE7', ct.c_uint),
+                        ('IGNORE8', ct.c_int),
+                        ('IGNORE9', ct.c_ulong),
+                        ('IGNORE10', ct.c_uint),
+                        ('IGNORE11', ct.c_uint)
+                ]
+
+        NR_PERF_EVENT_OPEN = 298
+        PERF_TYPE_TRACEPOINT = 2
+        PERF_SAMPLE_RAW = 1024
+        PERF_FLAG_FD_CLOEXEC = 8
+        PERF_EVENT_IOC_SET_FILTER = 1074275334
+        PERF_EVENT_IOC_ENABLE = 9216
+
+        libc = ct.CDLL('libc.so.6', use_errno=True)
+        syscall = libc.syscall          # not declaring vararg types
+        ioctl = libc.ioctl              # not declaring vararg types
+
+        @staticmethod
+        def _open_for_cpu(cpu, attr):
+                pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
+                                   -1, cpu, -1, Perf.PERF_FLAG_FD_CLOEXEC)
+                if pfd < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
+                              "common_pid == -17") < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+
+        @staticmethod
+        def perf_event_open(tpoint_id):
+                attr = Perf.perf_event_attr()
+                attr.config = tpoint_id
+                attr.type = Perf.PERF_TYPE_TRACEPOINT
+                attr.sample_type = Perf.PERF_SAMPLE_RAW
+                attr.sample_period = 1
+                attr.wakeup_events = 1
+                for cpu in range(0, multiprocessing.cpu_count()):
+                        Perf._open_for_cpu(cpu, attr)
+
+class Tracepoint(object):
+        enabled_tracepoints = []
+        trace_root = "/sys/kernel/debug/tracing"
+        event_root = os.path.join(trace_root, "events")
+
+        @classmethod
+        def _any_tracepoints_enabled(cls):
+                return len(cls.enabled_tracepoints) > 0
+
+        @classmethod
+        def generate_decl(cls):
+                if not cls._any_tracepoints_enabled():
+                        return ""
+                return "\nBPF_HASH(__trace_di, u64, u64);\n"
+
+        @classmethod
+        def generate_entry_probe(cls):
+                if not cls._any_tracepoints_enabled():
+                        return ""
+                return """
+int __trace_entry_update(struct pt_regs *ctx)
+{
+        u64 tid = bpf_get_current_pid_tgid();
+        u64 val = ctx->di;
+        __trace_di.update(&tid, &val);
+        return 0;
+}
+"""
+
+        def __init__(self, category, event, tp_id):
+                self.category = category
+                self.event = event
+                self.tp_id = tp_id
+                self._retrieve_struct_fields()
+
+        def _retrieve_struct_fields(self):
+                self.struct_fields = []
+                format_lines = Tracepoint.get_tpoint_format(self.category,
+                                                            self.event)
+                for line in format_lines:
+                        match = re.search(r'field:([^;]*);.*size:(\d+);', line)
+                        if match is None:
+                                continue
+                        parts = match.group(1).split()
+                        field_name = parts[-1]
+                        field_type = " ".join(parts[:-1])
+                        field_size = int(match.group(2))
+                        if "__data_loc" in field_type:
+                                continue
+                        if field_name.startswith("common_"):
+                                continue
+                        self.struct_fields.append((field_type, field_name))
+
+        def _generate_struct_fields(self):
+                text = ""
+                for field_type, field_name in self.struct_fields:
+                        text += "        %s %s;\n" % (field_type, field_name)
+                return text
+
+        def generate_struct(self):
+                self.struct_name = self.event + "_trace_entry"
+                return """
+struct %s {
+        u64 __do_not_use__;
+%s
+};
+                """ % (self.struct_name, self._generate_struct_fields())
+
+        def _generate_struct_locals(self):
+                text = ""
+                for field_type, field_name in self.struct_fields:
+                        if field_type == "char" and field_name.endswith(']'):
+                                # Special case for 'char whatever[N]', should
+                                # be assigned to a 'char *'
+                                field_type = "char *"
+                                field_name = re.sub(r'\[\d+\]$', '', field_name)
+                        text += "        %s %s = tp.%s;\n" % (
+                                        field_type, field_name, field_name)
+                return text
+
+        def generate_get_struct(self):
+                return """
+        u64 tid = bpf_get_current_pid_tgid();
+        u64 *di = __trace_di.lookup(&tid);
+        if (di == 0) { return 0; }
+        struct %s tp = {};
+        bpf_probe_read(&tp, sizeof(tp), (void *)*di);
+%s
+                """ % (self.struct_name, self._generate_struct_locals())
+
+        @classmethod
+        def enable_tracepoint(cls, category, event):
+                tp_id = cls.get_tpoint_id(category, event)
+                if tp_id == -1:
+                        raise ValueError("no such tracepoint found: %s:%s" %
+                                         (category, event))
+                Perf.perf_event_open(tp_id)
+                new_tp = Tracepoint(category, event, tp_id)
+                cls.enabled_tracepoints.append(new_tp)
+                return new_tp
+
+        @staticmethod
+        def get_tpoint_id(category, event):
+                evt_dir = os.path.join(Tracepoint.event_root, category, event)
+                try:
+                        with open(os.path.join(evt_dir, "id")) as id_file:
+                                return int(id_file.read().strip())
+                except (IOError, OSError, ValueError):
+                        return -1
+
+        @staticmethod
+        def get_tpoint_format(category, event):
+                evt_dir = os.path.join(Tracepoint.event_root, category, event)
+                try:
+                        with open(os.path.join(evt_dir, "format")) as fmt_file:
+                                return fmt_file.readlines()
+                except (IOError, OSError):
+                        return []
+
+        @classmethod
+        def attach(cls, bpf):
+                if cls._any_tracepoints_enabled():
+                        bpf.attach_kprobe(event="tracing_generic_entry_update",
+                                          fn_name="__trace_entry_update")
+
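A minimal sketch of the field extraction performed by `_retrieve_struct_fields` above, run against an in-memory sample instead of debugfs. The sample text and the helper name `parse_struct_fields` are assumptions made for illustration; real `format` files vary by kernel version.

```python
import re

# Illustrative lines in the style of a tracepoint "format" file.
# The field list is made up for this sketch, not copied from a real kernel.
SAMPLE_FORMAT = """\
name: irq_handler_entry
ID: 123
format:
\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;
\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;

\tfield:int irq;\toffset:8;\tsize:4;\tsigned:1;
\tfield:__data_loc char[] name;\toffset:12;\tsize:4;\tsigned:1;
"""

def parse_struct_fields(format_lines):
    """Return (type, name) pairs, skipping common_* and __data_loc fields."""
    fields = []
    for line in format_lines:
        match = re.search(r'field:([^;]*);.*size:(\d+);', line)
        if match is None:
            continue  # header lines and blank lines carry no field
        parts = match.group(1).split()
        field_name = parts[-1]
        field_type = " ".join(parts[:-1])
        if "__data_loc" in field_type or field_name.startswith("common_"):
            continue
        fields.append((field_type, field_name))
    return fields

print(parse_struct_fields(SAMPLE_FORMAT.splitlines()))
```

Only the `irq` field survives the filtering here, which matches the intent of the class: the generated struct holds just the event-specific fixed-size fields.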
--- a/src/python/bcc/usdt.py
+++ b/src/python/bcc/usdt.py
--- a/src/python/bcc/usyms.py
+++ b/src/python/bcc/usyms.py
+# Copyright 2016 Sasha Goldshtein
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from subprocess import Popen, PIPE, STDOUT
+
+class ProcessSymbols(object):
+    def __init__(self, pid):
+        """
+        Initializes the process symbols store for the specified pid.
+        Call refresh_code_ranges() periodically if you anticipate changes
+        in the set of loaded libraries or their addresses.
+        """
+        self.pid = pid
+        self.refresh_code_ranges()
+
+    def refresh_code_ranges(self):
+        self.code_ranges = self._get_code_ranges()
+        self.ranges_cache = {}
+        self.procstat = ProcStat(self.pid)
+
+    @staticmethod
+    def _is_binary_segment(parts):
+        return len(parts) == 6 and parts[5][0] != '[' and 'x' in parts[1]
+
+    def _get_code_ranges(self):
+        ranges = {}
+        raw_ranges = open("/proc/%d/maps" % self.pid).readlines()
+        # A typical line from /proc/PID/maps looks like this:
+        # 7f21b6635000-7f21b67eb000 r-xp ... /usr/lib64/libc-2.21.so
+        # We are looking for executable segments that have a .so file
+        # or the main executable. The first field holds the start and end
+        # addresses of that memory segment, which we index by binary name.
+        for raw_range in raw_ranges:
+            parts = raw_range.split()
+            if not ProcessSymbols._is_binary_segment(parts):
+                continue
+            binary = parts[5]
+            range_parts = parts[0].split('-')
+            addr_range = (int(range_parts[0], 16), int(range_parts[1], 16))
+            ranges[binary] = addr_range
+        return ranges
+
+    @staticmethod
+    def _is_function_symbol(parts):
+        return len(parts) == 6 and parts[3] == ".text" and parts[2] == "F"
+
+    @staticmethod
+    def _run_command_get_output(command):
+        p = Popen(command.split(), stdout=PIPE, stderr=STDOUT)
+        return iter(p.stdout.readline, b'')
+
+    def _get_sym_ranges(self, binary):
+        if binary in self.ranges_cache:
+            return self.ranges_cache[binary]
+        sym_ranges = {}
+        raw_symbols = ProcessSymbols._run_command_get_output(
+                "objdump -t %s" % binary)
+        for raw_symbol in raw_symbols:
+            # A typical line from objdump -t looks like this:
+            # 00000000004007f5 g F .text 000000000000010e main
+            # We only care about functions in the .text segment.
+            # The first number is the start address, and the second
+            # number is the length.
+            parts = raw_symbol.split()
+            if not ProcessSymbols._is_function_symbol(parts):
+                continue
+            sym_start = int(parts[0], 16)
+            sym_len = int(parts[4], 16)
+            sym_name = parts[5]
+            sym_ranges[sym_name] = (sym_start, sym_len)
+        self.ranges_cache[binary] = sym_ranges
+        return sym_ranges
+
+    def _decode_sym(self, binary, offset):
+        sym_ranges = self._get_sym_ranges(binary)
+        # Find the symbol that contains the specified offset.
+        # There might not be one.
+        for name, (start, length) in sym_ranges.items():
+            if offset >= start and offset <= (start + length):
+                return "%s+0x%x" % (name, offset - start)
+        return "%x" % offset
+
+    def _check_pid_wrap(self):
+        if self.procstat.is_stale():
+            self.refresh_code_ranges()
+
+    def decode_addr(self, addr):
+        """
+        Given an address, return the best symbolic representation of it.
+        If it doesn't fall in any module, return its hex string. If it
+        falls within a module but we don't have a symbol for it, return
+        the hex string and the module. If we do have a symbol for it,
+        return the symbol and the module, e.g. "readline+0x10 [bash]".
+        """
+        self._check_pid_wrap()
+        # Find the binary that contains the specified address.
+        # For .so files, look at the relative address; for the main
+        # executable, look at the absolute address.
+        for binary, (start, end) in self.code_ranges.items():
+            if addr >= start and addr <= end:
+                offset = addr - start \
+                         if binary.endswith(".so") else addr
+                return "%s [%s]" % (self._decode_sym(binary, offset),
+                                    binary)
+        return "%x" % addr
+
+from . import ProcStat
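The two parsing steps in `ProcessSymbols` can be sketched without a live process or `objdump`, assuming the line layouts quoted in the comments above. The helper names below are hypothetical, and the lookup uses a half-open range (`start <= offset < start + length`), a small deviation from the `<=` comparison in the diff.

```python
def parse_maps_line(line):
    """Return (binary, (start, end)) for an executable, file-backed mapping."""
    parts = line.split()
    if len(parts) != 6 or parts[5][0] == '[' or 'x' not in parts[1]:
        return None
    start, end = (int(x, 16) for x in parts[0].split('-'))
    return parts[5], (start, end)

def parse_objdump_line(line):
    """Return (name, (start, length)) for a function symbol in .text."""
    parts = line.split()
    if len(parts) != 6 or parts[3] != ".text" or parts[2] != "F":
        return None
    return parts[5], (int(parts[0], 16), int(parts[4], 16))

def decode(sym_ranges, offset):
    """Map an offset to 'symbol+0xN', or to a bare hex string if unknown."""
    for name, (start, length) in sym_ranges.items():
        if start <= offset < start + length:
            return "%s+0x%x" % (name, offset - start)
    return "%x" % offset

# Sample inputs modelled on the comments in the diff (values are illustrative).
maps_line = ("7f21b6635000-7f21b67eb000 r-xp 00000000 fd:00 1234 "
             "/usr/lib64/libc-2.21.so")
vdso_line = "7ffd3b5f2000-7ffd3b5f4000 r-xp 00000000 00:00 0 [vdso]"
sym_line = "00000000004007f5 g     F .text  000000000000010e              main"

name, sym_range = parse_objdump_line(sym_line)
print(parse_maps_line(maps_line))
print(decode({name: sym_range}, 0x400805))   # main+0x10
```

Pseudo-mappings such as `[vdso]` are rejected by the bracket check, so only real binaries end up in the range table.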
--- a/tools/argdist.py
+++ b/tools/argdist.py
--- a/tools/argdist_example.txt
+++ b/tools/argdist_example.txt
@@ -262,6 +262,27 @@ p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size
 The flags value must be expanded by hand, but it's still helpful to eliminate
 certain kinds of allocations or visually group them together.

+argdist also has basic support for kernel tracepoints. It is sometimes more
+convenient to use tracepoints because they are documented and don't vary a lot
+between kernel versions like function signatures tend to. For example, let's
+trace the net:net_dev_start_xmit tracepoint and print the interface name that
+is transmitting:
+
+# argdist -C 't:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name' -n 2
+[05:01:10]
+t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
+        COUNT      EVENT
+        4          c->name = eth0
+[05:01:11]
+t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
+        COUNT      EVENT
+        6          c->name = lo
+        92         c->name = eth0
+
+Note that to determine the necessary function signature you need to look at the
+TP_PROTO declaration in the kernel headers. For example, the net_dev_start_xmit
+tracepoint is defined in the include/trace/events/net.h header file.
+
 Here's a final example that finds how many write() system calls are performed
 by each process on the system:

@@ -311,13 +332,14 @@ optional arguments:
                        additional header files to include in the BPF program

 Probe specifier syntax:
-        {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+        {p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 Where:
-        p,r        -- probe at function entry or at function exit
+        p,r,t,u    -- probe at function entry, function exit, kernel tracepoint,
+                      or USDT probe
                      in exit probes: can use $retval, $entry(param), $latency
        library    -- the library that contains the function
                      (leave empty for kernel functions)
-        function   -- the function name to trace
+        category   -- the category of the kernel tracepoint (e.g. net, sched)
        signature  -- the function's parameters, as in the C header
        type       -- the type of the expression to collect (supports multiple)
        expr       -- the expression to collect (supports multiple)
@@ -365,7 +387,17 @@ argdist -C 'p:c:fork()#fork calls'
        Count fork() calls in libc across all processes
        Can also use funccount.py, which is easier and more flexible 

-argdist  -H \
+argdist -H 't:block:block_rq_complete():u32:tp.nr_sector'
+        Print histogram of number of sectors in completing block I/O requests
+
+argdist -C 't:irq:irq_handler_entry():int:tp.irq'
+        Aggregate interrupts by interrupt request (IRQ)
+
+argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
+        Print frequency of function addresses used as a pthread start function,
+        relying on the USDT pthread_start probe in process 1337
+
+argdist -H \
        'p:c:sleep(u32 seconds):u32:seconds' \
        'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
        Print histograms of sleep() and nanosleep() parameter values
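To make the probe specifier syntax above concrete, here is a rough sketch that splits a specifier into its parts. It is not argdist's own parser: `split_specifier` is a hypothetical helper, and it assumes the signature contains no nested parentheses and that only the filter may contain further colons.

```python
def split_specifier(spec):
    """Split '{p,r,t,u}:target:function(signature)[:types:exprs[:filter]][#label]'."""
    spec, _, label = spec.partition('#')
    kind, target, rest = spec.split(':', 2)
    close = rest.index(')')                       # end of the signature
    function, _, signature = rest[:close].partition('(')
    tail = rest[close + 1:].lstrip(':')
    types, exprs, filt = [], [], None
    if tail:
        fields = tail.split(':', 2)
        types = fields[0].split(',')
        exprs = fields[1].split(',') if len(fields) > 1 else []
        filt = fields[2] if len(fields) > 2 else None
    return {
        "kind": kind,            # p, r, t or u
        "target": target,        # library, or tracepoint category for 't'
        "function": function,
        "signature": signature,
        "types": types,
        "exprs": exprs,
        "filter": filt,
        "label": label or None,
    }

tp = split_specifier('t:irq:irq_handler_entry():int:tp.irq')
kp = split_specifier('p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size')
lb = split_specifier('p:c:fork()#fork calls')
print(tp["target"], tp["function"], tp["exprs"])
```

For a `t` probe the second component is the tracepoint category rather than a library, which is why the sketch stores it under the neutral name `target`.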

--- a/tools/memleak.py
+++ b/tools/memleak.py
--- a/tools/offcputime.py
+++ b/tools/offcputime.py
@@ -9,8 +9,6 @@
 # as a proof of concept. This implementation should be replaced in the future
 # with an appropriate bpf_ call, when available.
 #
-# Currently limited to a stack trace depth of 21 (maxdepth + 1).
-#
 # Copyright 2016 Netflix, Inc.
 # Licensed under the Apache License, Version 2.0 (the "License")
 #
@@ -48,7 +46,6 @@ args = parser.parse_args()
 folded = args.folded
 duration = int(args.duration)
 debug = 0
-maxdepth = 20    # and MAXDEPTH
 if args.pid and args.useronly:
    print("ERROR: use either -p or -u.")
    exit()
@@ -62,31 +59,15 @@ bpf_text = """
 #include <uapi/linux/ptrace.h>
 #include <linux/sched.h>

-#define MAXDEPTH	20
 #define MINBLOCK_US	1

 struct key_t {
    char name[TASK_COMM_LEN];
-    // Skip saving the ip
-    u64 ret[MAXDEPTH];
+    int stack_id;
 };
 BPF_HASH(counts, struct key_t);
 BPF_HASH(start, u32);
-
-static u64 get_frame(u64 *bp) {
-    if (*bp) {
-        // The following stack walker is x86_64 specific
-        u64 ret = 0;
-        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
-            return 0;
-        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
-            *bp = 0;
-        if (ret < __START_KERNEL_map)
-            return 0;
-        return ret;
-    }
-    return 0;
-}
+BPF_STACK_TRACE(stack_traces, 1024)

 int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
    u32 pid;
@@ -111,36 +92,12 @@ int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
        return 0;

    // create map key
-    u64 zero = 0, *val, bp = 0;
-    int depth = 0;
+    u64 zero = 0, *val;
    struct key_t key = {};
+
    bpf_get_current_comm(&key.name, sizeof(key.name));
-    bp = ctx->bp;
-
-    // unrolled loop (MAXDEPTH):
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-
-out:
+    key.stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
+
    val = counts.lookup_or_init(&key, &zero);
    (*val) += delta;
    return 0;
@@ -183,24 +140,17 @@ while (1):
    if not folded:
        print()
    counts = b.get_table("counts")
+    stack_traces = b.get_table("stack_traces")
    for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
        if folded:
            # print folded stack output
-            line = k.name.decode() + ";"
-            for i in reversed(range(0, maxdepth)):
-                if k.ret[i] == 0:
-                    continue
-                line = line + b.ksym(k.ret[i])
-                if i != 0:
-                    line = line + ";"
-            print("%s %d" % (line, v.value))
+            stack = list(stack_traces.walk(k.stack_id))[1:]
+            line = [k.name.decode()] + [b.ksym(addr) for addr in reversed(stack)]
+            print("%s %d" % (";".join(line), v.value))
        else:
            # print default multi-line stack output
-            for i in range(0, maxdepth):
-                if k.ret[i] == 0:
-                    break
-                print("    %-16x %s" % (k.ret[i],
-                    b.ksym(k.ret[i])))
+            for addr in stack_traces.walk(k.stack_id):
+                print("    %-16x %s" % (addr, b.ksym(addr)))
            print("    %-16s %s" % ("-", k.name))
            print("        %d\n" % v.value)
    counts.clear()
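The new folded-output path can be illustrated without loading a BPF program. This sketch assumes the frames come back leaf-first, as the `reversed()` call in the diff suggests, and that the dropped first frame is the probed function itself. The helper `folded_line`, the addresses, and the symbol table are invented for the example; in the tool, `b.ksym` does the resolution.

```python
def folded_line(comm, frames, value, resolve):
    """Build one 'comm;root;...;leaf value' line from leaf-first frames."""
    stack = list(frames)[1:]          # drop the first (innermost) frame
    parts = [comm] + [resolve(addr) for addr in reversed(stack)]
    return "%s %d" % (";".join(parts), value)

# Made-up addresses and names standing in for kernel symbol resolution.
symbols = {
    0x10: "finish_task_switch",
    0x20: "schedule",
    0x30: "do_nanosleep",
    0x40: "sys_nanosleep",
}

def resolve(addr):
    return symbols.get(addr, "%x" % addr)

line = folded_line("sleep", [0x10, 0x20, 0x30, 0x40], 1000123, resolve)
print(line)   # sleep;sys_nanosleep;do_nanosleep;schedule 1000123
```

Because the key now stores a single stack id rather than a fixed-size array of return addresses, the output code no longer needs to know a maximum depth.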

--- a/tools/old/memleak.py
+++ b/tools/old/memleak.py
--- a/tools/old/offcputime.py
+++ b/tools/old/offcputime.py
+#!/usr/bin/python
+#
+# offcputime    Summarize off-CPU time by kernel stack trace
+#               For Linux, uses BCC, eBPF.
+#
+# USAGE: offcputime [-h] [-u] [-p PID] [-v] [-f] [duration]
+#
+# The current implementation uses an unrolled loop for x86_64, and was written
+# as a proof of concept. This implementation should be replaced in the future
+# with an appropriate bpf_ call, when available.
+#
+# Currently limited to a stack trace depth of 21 (maxdepth + 1).
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 13-Jan-2016	Brendan Gregg	Created this.
+
+from __future__ import print_function
+from bcc import BPF
+from time import sleep, strftime
+import argparse
+import signal
+
+# arguments
+examples = """examples:
+    ./offcputime             # trace off-CPU stack time until Ctrl-C
+    ./offcputime 5           # trace for 5 seconds only
+    ./offcputime -f 5        # 5 seconds, and output in folded format
+    ./offcputime -u          # don't include kernel threads (user only)
+    ./offcputime -p 185      # trace for PID 185 only
+"""
+parser = argparse.ArgumentParser(
+    description="Summarize off-CPU time by kernel stack trace",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-u", "--useronly", action="store_true",
+    help="user threads only (no kernel threads)")
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("-v", "--verbose", action="store_true",
+    help="show raw addresses")
+parser.add_argument("-f", "--folded", action="store_true",
+    help="output folded format")
+parser.add_argument("duration", nargs="?", default=99999999,
+    help="duration of trace, in seconds")
+args = parser.parse_args()
+folded = args.folded
+duration = int(args.duration)
+debug = 0
+maxdepth = 20    # and MAXDEPTH
+if args.pid and args.useronly:
+    print("ERROR: use either -p or -u.")
+    exit()
+
+# signal handler
+def signal_ignore(signal, frame):
+    print()
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/sched.h>
+
+#define MAXDEPTH	20
+#define MINBLOCK_US	1
+
+struct key_t {
+    char name[TASK_COMM_LEN];
+    // Skip saving the ip
+    u64 ret[MAXDEPTH];
+};
+BPF_HASH(counts, struct key_t);
+BPF_HASH(start, u32);
+
+static u64 get_frame(u64 *bp) {
+    if (*bp) {
+        // The following stack walker is x86_64 specific
+        u64 ret = 0;
+        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
+            return 0;
+        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
+            *bp = 0;
+        if (ret < __START_KERNEL_map)
+            return 0;
+        return ret;
+    }
+    return 0;
+}
+
+int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
+    u32 pid;
+    u64 ts, *tsp;
+
+    // record previous thread sleep time
+    if (FILTER) {
+        pid = prev->pid;
+        ts = bpf_ktime_get_ns();
+        start.update(&pid, &ts);
+    }
+
+    // calculate current thread's delta time
+    pid = bpf_get_current_pid_tgid();
+    tsp = start.lookup(&pid);
+    if (tsp == 0)
+        return 0;        // missed start or filtered
+    u64 delta = bpf_ktime_get_ns() - *tsp;
+    start.delete(&pid);
+    delta = delta / 1000;
+    if (delta < MINBLOCK_US)
+        return 0;
+
+    // create map key
+    u64 zero = 0, *val, bp = 0;
+    int depth = 0;
+    struct key_t key = {};
+    bpf_get_current_comm(&key.name, sizeof(key.name));
+    bp = ctx->bp;
+
+    // unrolled loop (MAXDEPTH):
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+
+out:
+    val = counts.lookup_or_init(&key, &zero);
+    (*val) += delta;
+    return 0;
+}
+"""
+if args.pid:
+    filter = 'pid == %s' % args.pid
+elif args.useronly:
+    filter = '!(prev->flags & PF_KTHREAD)'
+else:
+    filter = '1'
+bpf_text = bpf_text.replace('FILTER', filter)
+if debug:
+    print(bpf_text)
+
+# initialize BPF
+b = BPF(text=bpf_text)
+b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")
+matched = b.num_open_kprobes()
+if matched == 0:
+    print("0 functions traced. Exiting.")
+    exit()
+
+# header
+if not folded:
+    print("Tracing off-CPU time (us) by kernel stack", end="")
+    if duration < 99999999:
+        print(" for %d secs." % duration)
+    else:
+        print("... Hit Ctrl-C to end.")
+
+# output
+while (1):
+    try:
+        sleep(duration)
+    except KeyboardInterrupt:
+        # as cleanup can take many seconds, trap Ctrl-C:
+        signal.signal(signal.SIGINT, signal_ignore)
+
+    if not folded:
+        print()
+    counts = b.get_table("counts")
+    for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
+        if folded:
+            # print folded stack output
+            line = k.name.decode() + ";"
+            for i in reversed(range(0, maxdepth)):
+                if k.ret[i] == 0:
+                    continue
+                line = line + b.ksym(k.ret[i])
+                if i != 0:
+                    line = line + ";"
+            print("%s %d" % (line, v.value))
+        else:
+            # print default multi-line stack output
+            for i in range(0, maxdepth):
+                if k.ret[i] == 0:
+                    break
+                print("    %-16x %s" % (k.ret[i],
+                    b.ksym(k.ret[i])))
+            print("    %-16s %s" % ("-", k.name))
+            print("        %d\n" % v.value)
+    counts.clear()
+
+    if not folded:
+        print("Detaching...")
+    exit()
--- a/tools/old/stackcount.py
+++ b/tools/old/stackcount.py
+#!/usr/bin/python
+#
+# stackcount    Count kernel function calls and their stack traces.
+#               For Linux, uses BCC, eBPF.
+#
+# USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] pattern
+#
+# The pattern is a string with optional '*' wildcards, similar to file
+# globbing. If you'd prefer to use regular expressions, use the -r option.
+#
+# The current implementation uses an unrolled loop for x86_64, and was written
+# as a proof of concept. This implementation should be replaced in the future
+# with an appropriate bpf_ call, when available.
+#
+# Currently limited to a stack trace depth of 11 (maxdepth + 1).
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 12-Jan-2016	Brendan Gregg	Created this.
+
+from __future__ import print_function
+from bcc import BPF
+from time import sleep, strftime
+import argparse
+import signal
+
+# arguments
+examples = """examples:
+    ./stackcount submit_bio       # count kernel stack traces for submit_bio
+    ./stackcount ip_output        # count kernel stack traces for ip_output
+    ./stackcount -s ip_output     # show symbol offsets
+    ./stackcount -sv ip_output    # show offsets and raw addresses (verbose)
+    ./stackcount 'tcp_send*'      # count stacks for funcs matching tcp_send*
+    ./stackcount -r '^tcp_send.*' # same as above, using regular expressions
+    ./stackcount -Ti 5 ip_output  # output every 5 seconds, with timestamps
+    ./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
+"""
+parser = argparse.ArgumentParser(
+    description="Count kernel function calls and their stack traces",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("-i", "--interval", default=99999999,
+    help="summary interval, seconds")
+parser.add_argument("-T", "--timestamp", action="store_true",
+    help="include timestamp on output")
+parser.add_argument("-r", "--regexp", action="store_true",
+    help="use regular expressions. Default is \"*\" wildcards only.")
+parser.add_argument("-s", "--offset", action="store_true",
+    help="show address offsets")
+parser.add_argument("-v", "--verbose", action="store_true",
+    help="show raw addresses")
+parser.add_argument("pattern",
+    help="search expression for kernel functions")
+args = parser.parse_args()
+pattern = args.pattern
+if not args.regexp:
+    pattern = pattern.replace('*', '.*')
+    pattern = '^' + pattern + '$'
+offset = args.offset
+verbose = args.verbose
+debug = 0
+maxdepth = 10    # and MAXDEPTH
+
+# signal handler
+def signal_ignore(signal, frame):
+    print()
+
+# load BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+
+#define MAXDEPTH	10
+
+struct key_t {
+    u64 ip;
+    u64 ret[MAXDEPTH];
+};
+BPF_HASH(counts, struct key_t);
+
+static u64 get_frame(u64 *bp) {
+    if (*bp) {
+        // The following stack walker is x86_64 specific
+        u64 ret = 0;
+        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
+            return 0;
+        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
+            *bp = 0;
+        if (ret < __START_KERNEL_map)
+            return 0;
+        return ret;
+    }
+    return 0;
+}
+
+int trace_count(struct pt_regs *ctx) {
+    FILTER
+    struct key_t key = {};
+    u64 zero = 0, *val, bp = 0;
+    int depth = 0;
+
+    key.ip = ctx->ip;
+    bp = ctx->bp;
+
+    // unrolled loop, 10 (MAXDEPTH) frames deep:
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
+
+out:
+    val = counts.lookup_or_init(&key, &zero);
+    (*val)++;
+    return 0;
+}
+"""
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER',
+        ('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
+        'if (pid != %s) { return 0; }') % (args.pid))
+else:
+    bpf_text = bpf_text.replace('FILTER', '')
+if debug:
+    print(bpf_text)
+b = BPF(text=bpf_text)
+b.attach_kprobe(event_re=pattern, fn_name="trace_count")
+matched = b.num_open_kprobes()
+if matched == 0:
+    print("0 functions matched by \"%s\". Exiting." % args.pattern)
+    exit()
+
+# header
+print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." %
+    (matched, args.pattern))
+
+def print_frame(addr):
+    print("  ", end="")
+    if verbose:
+        print("%-16x " % addr, end="")
+    if offset:
+        print("%s" % b.ksymaddr(addr))
+    else:
+        print("%s" % b.ksym(addr))
+
+# output
+exiting = 0 if args.interval else 1
+while (1):
+    try:
+        sleep(int(args.interval))
+    except KeyboardInterrupt:
+        exiting = 1
+        # as cleanup can take many seconds, trap Ctrl-C:
+        signal.signal(signal.SIGINT, signal_ignore)
+
+    print()
+    if args.timestamp:
+        print("%-8s\n" % strftime("%H:%M:%S"), end="")
+
+    counts = b.get_table("counts")
+    for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
+        print_frame(k.ip)
+        for i in range(0, maxdepth):
+            if k.ret[i] == 0:
+                break
+            print_frame(k.ret[i])
+        print("    %d\n" % v.value)
+    counts.clear()
+
+    if exiting:
+        print("Detaching...")
+        exit()
--- a/tools/old/stacksnoop.py
+++ b/tools/old/stacksnoop.py
+#!/usr/bin/python
+#
+# stacksnoop    Trace a kernel function and print all kernel stack traces.
+#               For Linux, uses BCC, eBPF, and currently x86_64 only. Inline C.
+#
+# USAGE: stacksnoop [-h] [-p PID] [-s] [-v] function
+#
+# The current implementation uses an unrolled loop for x86_64, and was written
+# as a proof of concept. This implementation should be replaced in the future
+# with an appropriate bpf_ call, when available.
+#
+# The stack depth is limited to 10 (+1 for the current instruction pointer).
+# This could be tunable in a future version.
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 12-Jan-2016   Brendan Gregg   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+import argparse
+
+# arguments
+examples = """examples:
+    ./stacksnoop ext4_sync_fs    # print kernel stack traces for ext4_sync_fs
+    ./stacksnoop -s ext4_sync_fs    # ... also show symbol offsets
+    ./stacksnoop -v ext4_sync_fs    # ... show extra columns
+    ./stacksnoop -p 185 ext4_sync_fs    # ... only when PID 185 is on-CPU
+"""
+parser = argparse.ArgumentParser(
+    description="Trace and print kernel stack traces for a kernel function",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("-s", "--offset", action="store_true",
+    help="show address offsets")
+parser.add_argument("-v", "--verbose", action="store_true",
+    help="print more fields")
+parser.add_argument("function",
+    help="kernel function name")
+args = parser.parse_args()
+function = args.function
+offset = args.offset
+verbose = args.verbose
+debug = 0
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+
+static int print_frame(u64 *bp, int *depth) {
+    if (*bp) {
+        // The following stack walker is x86_64 specific
+        u64 ret = 0;
+        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
+            return 0;
+        if (ret < __START_KERNEL_map)
+            return 0;
+        bpf_trace_printk("r%d: %llx\\n", *depth, ret);
+        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
+            return 0;
+        *depth += 1;
+        return 1;
+    }
+    return 0;
+}
+
+void trace_stack(struct pt_regs *ctx) {
+    FILTER
+    u64 bp = 0;
+    int depth = 0;
+
+    bpf_trace_printk("\\n");
+    if (ctx->ip)
+        bpf_trace_printk("ip: %llx\\n", ctx->ip);
+    bp = ctx->bp;
+
+    // unrolled loop, 10 frames deep:
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+    if (!print_frame(&bp, &depth)) return;
+};
+"""
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER',
+        ('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
+        'if (pid != %s) { return; }') % (args.pid))
+else:
+    bpf_text = bpf_text.replace('FILTER', '')
+if debug:
+    print(bpf_text)
+
+# initialize BPF
+b = BPF(text=bpf_text)
+b.attach_kprobe(event=function, fn_name="trace_stack")
+matched = b.num_open_kprobes()
+if matched == 0:
+    print("Function \"%s\" not found. Exiting." % function)
+    exit()
+
+# header
+if verbose:
+    print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
+        "STACK"))
+else:
+    print("%-18s %s" % ("TIME(s)", "STACK"))
+
+# format output
+while 1:
+    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
+    if msg != "":
+        (reg, addr) = msg.split(" ")
+        if offset:
+            ip = b.ksymaddr(int(addr, 16))
+        else:
+            ip = b.ksym(int(addr, 16))
+        msg = msg + " " + ip
+    if verbose:
+        print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
+    else:
+        print("%-18.9f %s" % (ts, msg))
--- a/tools/solisten.py
+++ b/tools/solisten.py
+#!/usr/bin/env python
+#
+# solisten	Trace TCP listen events
+#		For Linux, uses BCC, eBPF. Embedded C.
+#
+# USAGE: solisten.py [-h] [-p PID] [-n NETNS] [--show-netns]
+#
+# This is provided as a basic example of TCP connection & socket tracing.
+# It could be useful in scenarios where load balancers need to be updated
+# dynamically once an application is fully initialized.
+#
+# All TCP listen attempts (IPv4 and IPv6) are traced, even if they ultimately
+# fail or the listening program is not willing to accept().
+#
+# Copyright (c) 2016 Jean-Tiare Le Bigot.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 04-Mar-2016	Jean-Tiare Le Bigot	Created this.
+
+import os
+import socket
+import netaddr
+import argparse
+from bcc import BPF
+import ctypes as ct
+
+# Arguments
+examples = """Examples:
+    ./solisten.py              # Stream socket listen
+    ./solisten.py -p 1234      # Stream socket listen for specified PID only
+    ./solisten.py --netns 4242 # Stream socket listen for specified network namespace ID only
+    ./solisten.py --show-netns # Show network namespace ID. Probably useful if you run containers
+"""
+
+parser = argparse.ArgumentParser(
+    description="Stream sockets listen",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("--show-netns", action="store_true",
+    help="show network namespace")
+parser.add_argument("-p", "--pid", default=0, type=int,
+    help="trace this PID only")
+parser.add_argument("-n", "--netns", default=0, type=int,
+    help="trace this Network Namespace only")
+
+
+# BPF Program
+bpf_text = """ 
+#include <net/sock.h>
+#include <net/inet_sock.h>
+#include <net/net_namespace.h>
+#include <bcc/proto.h>
+
+// Endian conversion. We can't use the kernel version here as it uses inline
+// assembly, nor the libc version as we can't include it here. Adapted from both.
+#if defined(__LITTLE_ENDIAN)
+#define bcc_be32_to_cpu(x) ((u32)(__builtin_bswap32)((x)))
+#define bcc_be64_to_cpu(x) ((u64)(__builtin_bswap64)((x)))
+#elif defined(__BIG_ENDIAN)
+#define bcc_be32_to_cpu(x) (x)
+#define bcc_be64_to_cpu(x) (x)
+#else
+#error Host endianness not defined
+#endif
+
+// Common structure for UDP/TCP IPv4/IPv6
+struct listen_evt_t {
+    u64 ts_us;
+    u64 pid_tgid;
+    u64 backlog;
+    u64 netns;
+    u64 proto;    // family << 16 | type
+    u64 lport;    // use only 16 bits
+    u64 laddr[2]; // IPv4: store in laddr[0]
+    char task[TASK_COMM_LEN];
+};
+BPF_PERF_OUTPUT(listen_evt);
+
+// Send an event for each IPv4 listen with PID, bound address and port
+int kprobe__inet_listen(struct pt_regs *ctx, struct socket *sock, int backlog)
+{
+        // cast types. Intermediate cast not needed, kept for readability
+        struct sock *sk = sock->sk;
+        struct inet_sock *inet = inet_sk(sk);
+
+        // Build event for userland
+        struct listen_evt_t evt = {
+            .ts_us = bpf_ktime_get_ns() / 1000,
+            .backlog = backlog,
+        };
+
+        // Get process comm. Needs LLVM >= 3.7.1 see https://github.com/iovisor/bcc/issues/393
+        bpf_get_current_comm(evt.task, TASK_COMM_LEN);
+
+        // Get socket IP family
+        u16 family = sk->__sk_common.skc_family;
+        evt.proto = family << 16 | SOCK_STREAM;
+
+        // Get PID
+        evt.pid_tgid = bpf_get_current_pid_tgid();
+
+        ##FILTER_PID##
+
+        // Get port
+        bpf_probe_read(&evt.lport, sizeof(u16), &(inet->inet_sport));
+        evt.lport = ntohs(evt.lport);
+
+        // Get network namespace id, if kernel supports it
+#ifdef CONFIG_NET_NS
+        evt.netns = sk->__sk_common.skc_net.net->ns.inum;
+#else
+        evt.netns = 0;
+#endif
+
+        ##FILTER_NETNS##
+
+        // Get IP
+        if (family == AF_INET) {
+            bpf_probe_read(evt.laddr, sizeof(u32), &(inet->inet_rcv_saddr));
+            evt.laddr[0] = bcc_be32_to_cpu(evt.laddr[0]);
+        } else if (family == AF_INET6) {
+            bpf_probe_read(evt.laddr, sizeof(evt.laddr), sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
+            evt.laddr[0] = bcc_be64_to_cpu(evt.laddr[0]);
+            evt.laddr[1] = bcc_be64_to_cpu(evt.laddr[1]);
+        }
+
+        // Send event to userland
+        listen_evt.perf_submit(ctx, &evt, sizeof(evt));
+
+        return 0;
+};
+"""
+
+# event data
+TASK_COMM_LEN = 16      # linux/sched.h
+class ListenEvt(ct.Structure):
+    _fields_ = [
+        ("ts_us", ct.c_ulonglong),
+        ("pid_tgid", ct.c_ulonglong),
+        ("backlog", ct.c_ulonglong),
+        ("netns", ct.c_ulonglong),
+        ("proto", ct.c_ulonglong),
+        ("lport", ct.c_ulonglong),
+        ("laddr", ct.c_ulonglong * 2),
+        ("task", ct.c_char * TASK_COMM_LEN)
+    ]
+
+    # TODO: properties to unpack protocol / ip / pid / tgid ...
+
+# Format output
+def event_printer(show_netns):
+    def print_event(cpu, data, size):
+        # Decode event
+        event = ct.cast(data, ct.POINTER(ListenEvt)).contents
+
+        pid = event.pid_tgid & 0xffffffff
+        # proto is packed on the BPF side as family << 16 | type
+        proto_type = event.proto & 0xff
+        proto_family = event.proto >> 16 & 0xff
+
+        if proto_type == socket.SOCK_STREAM:
+            protocol = "TCP"
+        elif proto_type == socket.SOCK_DGRAM:
+            protocol = "UDP"
+        else:
+            protocol = "UNK"
+
+        address = ""
+        if proto_family == socket.AF_INET:
+            protocol += "v4"
+            address = netaddr.IPAddress(event.laddr[0])
+        elif proto_family == socket.AF_INET6:
+            address = netaddr.IPAddress(event.laddr[0]<<64 | event.laddr[1], version=6)
+            protocol += "v6"
+
+        # Display
+        if show_netns:
+            print("%-6d %-12.12s %-12s %-6s %-8s %-5s %-39s" % (
+                pid, event.task, event.netns, protocol, event.backlog,
+                event.lport, address,
+            ))
+        else:
+            print("%-6d %-12.12s %-6s %-8s %-5s %-39s" % (
+                pid, event.task, protocol, event.backlog,
+                event.lport, address,
+            ))
+
+    return print_event
+
+if __name__ == "__main__":
+    # Parse arguments
+    args = parser.parse_args()
+
+    pid_filter = ""
+    netns_filter = ""
+
+    if args.pid:
+        # compare the lower 32 bits only, matching the PID column printed below
+        pid_filter = "if ((u32)evt.pid_tgid != %d) return 0;" % args.pid
+    if args.netns:
+        netns_filter = "if (evt.netns != %d) return 0;" % args.netns
+
+    bpf_text = bpf_text.replace("##FILTER_PID##", pid_filter)
+    bpf_text = bpf_text.replace("##FILTER_NETNS##", netns_filter)
+
+    # Initialize BPF
+    b = BPF(text=bpf_text)
+    b["listen_evt"].open_perf_buffer(event_printer(args.show_netns))
+
+    # Print headers
+    if args.show_netns:
+        print("%-6s %-12s %-12s %-6s %-8s %-5s %-39s" % ("PID", "COMM", "NETNS", "PROTO", "BACKLOG", "PORT", "ADDR"))
+    else:
+        print("%-6s %-12s %-6s %-8s %-5s %-39s" % ("PID", "COMM", "PROTO", "BACKLOG", "PORT", "ADDR"))
+
+    # Read events
+    while 1:
+        b.kprobe_poll()
+
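
Note on the event decoding in print_event(): the BPF side packs the socket family and type into one field (family << 16 | type) and stores an IPv6 address as two 64-bit words. The sketch below shows that decoding in isolation. It is only an illustration, not part of the tool: the function name decode_listen_event is hypothetical, and it uses the standard-library ipaddress module in place of netaddr so that it runs without extra dependencies.

```python
import ipaddress
import socket


def decode_listen_event(proto, laddr):
    """Return (protocol label, address string) for a packed listen event.

    proto is assumed to be packed as family << 16 | type, and laddr is a
    pair of integers already converted to host byte order.
    """
    sock_type = proto & 0xffff
    family = (proto >> 16) & 0xffff

    if sock_type == socket.SOCK_STREAM:
        label = "TCP"
    elif sock_type == socket.SOCK_DGRAM:
        label = "UDP"
    else:
        label = "UNK"

    if family == socket.AF_INET:
        # IPv4: only the low 32 bits of laddr[0] are used
        return label + "v4", str(ipaddress.IPv4Address(laddr[0]))
    if family == socket.AF_INET6:
        # IPv6: the high 64 bits are in laddr[0], the low 64 bits in laddr[1]
        value = (laddr[0] << 64) | laddr[1]
        return label + "v6", str(ipaddress.IPv6Address(value))
    return label, ""


if __name__ == "__main__":
    packed = (socket.AF_INET << 16) | socket.SOCK_STREAM
    print(decode_listen_event(packed, [0x7f000001, 0]))
```

Keeping the mask and shift in one helper makes it harder to mix up which half of the field carries the family and which carries the type.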
--- a/tools/solisten_example.txt
+++ b/tools/solisten_example.txt
+Demonstrations of solisten.py, the Linux eBPF/bcc version.
+
+
+This tool traces the kernel function called when a program wants to listen
+for TCP connections. It does not see UDP or UNIX domain sockets.
+
+It can be used to dynamically update a load balancer once a program is actually
+ready to accept connections, hence avoiding the "downtime" while it is initializing.
+
+# ./solisten.py --show-netns
+PID    COMM         NETNS        PROTO  BACKLOG  ADDR                                    PORT
+3643   nc           4026531957   TCPv4  1        0.0.0.0                                 4242
+3659   nc           4026531957   TCPv6  1        2001:f0d0:1002:51::4                    4242
+4221   redis-server 4026532165   TCPv6  128      ::                                      6379
+4221   redis-server 4026532165   TCPv4  128      0.0.0.0                                 6379
+6067   nginx        4026531957   TCPv4  128      0.0.0.0                                 80
+6067   nginx        4026531957   TCPv6  128      ::                                      80
+6069   nginx        4026531957   TCPv4  128      0.0.0.0                                 80
+6069   nginx        4026531957   TCPv6  128      ::                                      80
+6069   nginx        4026531957   TCPv4  128      0.0.0.0                                 80
+6069   nginx        4026531957   TCPv6  128      ::                                      80
+
+This output shows the listen events from 3 programs. Netcat was started twice, as
+shown by the 2 different PIDs: the first time on the wildcard IPv4 address, the
+second time on an IPv6 address. Netcat is a "one shot" program that accepts a
+single connection, hence the backlog of "1".
+
+The next program is redis-server. As the netns column shows, it is in a
+different network namespace than netcat and nginx. In this specific case
+it was launched in a docker container. It listens on both IPv4 and IPv6
+with up to 128 pending connections.
+
+Determining the actual container is out of the scope of this tool. It could
+be derived by scraping /proc/<PID>/cgroup. Note that this is racy.
+
+The overhead of this tool is negligible as it traces listen() calls, which are
+invoked in the initialization path of a program. The operational path remains
+unaffected. In particular, accept() calls are not affected, nor are
+individual read() and write() calls.
+
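
Note on the container lookup mentioned above: a minimal sketch of scraping the contents of /proc/<PID>/cgroup could look as follows. Each line of that file has the form hierarchy-ID:controller-list:cgroup-path. The function name cgroup_paths and the sample "/docker/..." path are assumptions made for illustration. How a path maps to a container depends on the container runtime, and the process may have exited by the time the file is read, which is the race noted in the text.

```python
def cgroup_paths(cgroup_text):
    """Return the set of non-root cgroup paths found in /proc/<PID>/cgroup text."""
    paths = set()
    for line in cgroup_text.splitlines():
        # Split into at most 3 parts so that a path containing ':' stays intact
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        path = parts[2]
        if path not in ("", "/"):
            paths.add(path)
    return paths


if __name__ == "__main__":
    # Hypothetical sample content, not captured from a real system
    sample = (
        "11:devices:/docker/0123abcd\n"
        "10:cpu,cpuacct:/docker/0123abcd\n"
        "1:name=systemd:/\n"
    )
    print(cgroup_paths(sample))
```

A real caller would read the file for the PID reported by solisten and handle the case where the file no longer exists.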
--- a/tools/stackcount.py
+++ b/tools/stackcount.py
@@ -72,52 +72,14 @@ def signal_ignore(signal, frame):
 bpf_text = """
 #include <uapi/linux/ptrace.h>

-#define MAXDEPTH	10
-
-struct key_t {
-    u64 ip;
-    u64 ret[MAXDEPTH];
-};
-BPF_HASH(counts, struct key_t);
-
-static u64 get_frame(u64 *bp) {
-    if (*bp) {
-        // The following stack walker is x86_64 specific
-        u64 ret = 0;
-        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
-            return 0;
-        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
-            *bp = 0;
-        if (ret < __START_KERNEL_map)
-            return 0;
-        return ret;
-    }
-    return 0;
-}
+BPF_HASH(counts, int);
+BPF_STACK_TRACE(stack_traces, 1024);

 int trace_count(struct pt_regs *ctx) {
    FILTER
-    struct key_t key = {};
-    u64 zero = 0, *val, bp = 0;
-    int depth = 0;
-
-    key.ip = ctx->ip;
-    bp = ctx->bp;
-
-    // unrolled loop, 10 (MAXDEPTH) frames deep:
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-    if (!(key.ret[depth++] = get_frame(&bp))) goto out;
-
-out:
-    val = counts.lookup_or_init(&key, &zero);
+    int key = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
+    u64 zero = 0;
+    u64 *val = counts.lookup_or_init(&key, &zero);
    (*val)++;
    return 0;
 }
@@ -164,13 +126,11 @@ while (1):
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

-    counts = b.get_table("counts")
+    counts = b["counts"]
+    stack_traces = b["stack_traces"]
    for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
-        print_frame(k.ip)
-        for i in range(0, maxdepth):
-            if k.ret[i] == 0:
-                break
-            print_frame(k.ret[i])
+        for addr in stack_traces.walk(k.value):
+            print_frame(addr)
        print("    %d\n" % v.value)
    counts.clear()
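
Note on the change above: instead of keying the hash on a fixed array of return addresses, each distinct stack is now stored once and counted by its integer id. The following pure-Python analogy sketches that idea. It is not the bcc API or the kernel implementation; the class StackTable and the function count_stacks are hypothetical names used only for illustration.

```python
from collections import Counter


class StackTable:
    """Userspace stand-in for a stack trace table: stores each distinct
    stack once and hands out a small integer id for it."""

    def __init__(self):
        self._ids = {}
        self._stacks = []

    def get_stackid(self, frames):
        key = tuple(frames)
        if key not in self._ids:
            self._ids[key] = len(self._stacks)
            self._stacks.append(key)
        return self._ids[key]

    def walk(self, stack_id):
        # Yield the frames of a previously stored stack, innermost first
        return iter(self._stacks[stack_id])


def count_stacks(events):
    """Count how often each distinct stack occurs, keyed by stack id."""
    table = StackTable()
    counts = Counter()
    for frames in events:
        counts[table.get_stackid(frames)] += 1
    return table, counts


if __name__ == "__main__":
    events = [
        ["submit_bio", "submit_bh"],
        ["submit_bio", "ext4_writepages"],
        ["submit_bio", "submit_bh"],
    ]
    table, counts = count_stacks(events)
    for stack_id, n in sorted(counts.items(), key=lambda kv: kv[1]):
        print(list(table.walk(stack_id)), n)
```

Counting by id keeps the hash key small and removes the fixed depth limit that the unrolled loop imposed.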


--- a/tools/stackcount_example.txt
+++ b/tools/stackcount_example.txt
@@ -82,9 +82,6 @@ tracing.
 It can be useful to trace the path to submit_bio to explain unusual rates of
 disk IOPS. These could have in-kernel origins (eg, background scrub).

-This version of stackcount truncates stacks to 10 levels deep (plus 1 for
-the traced function, so 11).
-

 As another example, here are the code paths that led to ip_output(), which
 sends a packet at the IP level:

--- a/tools/stacksnoop.py
+++ b/tools/stacksnoop.py
@@ -20,6 +20,7 @@
 from __future__ import print_function
 from bcc import BPF
 import argparse
+import re

 # arguments
 examples = """examples:
@@ -50,45 +51,14 @@ debug = 0
 bpf_text = """
 #include <uapi/linux/ptrace.h>

-static int print_frame(u64 *bp, int *depth) {
-    if (*bp) {
-        // The following stack walker is x86_64 specific
-        u64 ret = 0;
-        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
-            return 0;
-        if (ret < __START_KERNEL_map)
-            return 0;
-        bpf_trace_printk("r%d: %llx\\n", *depth, ret);
-        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
-            return 0;
-        *depth += 1;
-        return 1;
-    }
-    return 0;
-}
+BPF_STACK_TRACE(stack_traces, 128);

 void trace_stack(struct pt_regs *ctx) {
    FILTER
-    u64 bp = 0;
-    int depth = 0;
-
-    bpf_trace_printk("\\n");
-    if (ctx->ip)
-        bpf_trace_printk("ip: %llx\\n", ctx->ip);
-    bp = ctx->bp;
-
-    // unrolled loop, 10 frames deep:
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-    if (!print_frame(&bp, &depth)) return;
-};
+    int stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
+    if (stack_id >= 0)
+        bpf_trace_printk("stack_id=%d\\n", stack_id);
+}
 """
 if args.pid:
    bpf_text = bpf_text.replace('FILTER',
@@ -107,24 +77,28 @@ if matched == 0:
    print("Function \"%s\" not found. Exiting." % function)
    exit()

+stack_traces = b.get_table("stack_traces")
+msg_regexp = re.compile(r"stack_id=(\d+)")
+
 # header
 if verbose:
-    print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
-        "STACK"))
+    print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU", "SYSCALL"))
 else:
-    print("%-18s %s" % ("TIME(s)", "STACK"))
+    print("%-18s %s" % ("TIME(s)", "SYSCALL"))

 # format output
 while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
-    if msg != "":
-        (reg, addr) = msg.split(" ")
-        if offset:
-            ip = b.ksymaddr(int(addr, 16))
+    m = msg_regexp.match(msg)
+    if m:
+        if verbose:
+            print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, function))
        else:
-            ip = b.ksym(int(addr, 16))
-        msg = msg + " " + ip
-    if verbose:
-        print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
-    else:
-        print("%-18.9f %s" % (ts, msg))
+            print("%-18.9f %s" % (ts, function))
+
+        stack_id = int(m.group(1))
+        for addr in stack_traces.walk(stack_id):
+            sym = b.ksymaddr(addr) if offset else b.ksym(addr)
+            print("\t%016x %s" % (addr, sym))
+
+        print()
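
Note on the message parsing above: the BPF program only emits a "stack_id=N" line when the id is non-negative, and the Python side extracts N with a regular expression before walking the stack. A minimal sketch of that extraction step, runnable without bcc, is shown below. The helper name parse_stack_id is hypothetical.

```python
import re

# Raw string so that \d is passed to the regex engine unchanged
MSG_RE = re.compile(r"stack_id=(\d+)")


def parse_stack_id(msg):
    """Return the stack id carried by a trace message, or None if absent."""
    m = MSG_RE.match(msg)
    if m is None:
        return None
    return int(m.group(1))


if __name__ == "__main__":
    for msg in ("stack_id=17", "", "unrelated message"):
        print(repr(msg), parse_stack_id(msg))
```

Messages that do not match are simply skipped, so unrelated trace_pipe output does not break the loop.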
--- a/tools/stacksnoop_example.txt
+++ b/tools/stacksnoop_example.txt
@@ -3,27 +3,20 @@ Demonstrations of stacksnoop, the Linux eBPF/bcc version.

 This program traces the given kernel function and prints the kernel stack trace
 for every call. This tool is useful for studying low frequency kernel functions,
-to see how they were invoked. For example, tracing the ext4_sync_fs() call:
-
-# ./stacksnoop ext4_sync_fs
-TIME(s)            STACK
-42005194.132250004 
-42005194.132253997 ip: ffffffff81280461 ext4_sync_fs
-42005194.132256001 r0: ffffffff811ed7f9 iterate_supers
-42005194.132257000 r1: ffffffff8121ba25 sys_sync
-42005194.132257000 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
-42005194.132275000 
-42005194.132275999 ip: ffffffff81280461 ext4_sync_fs
-42005194.132275999 r0: ffffffff811ed7f9 iterate_supers
-42005194.132276997 r1: ffffffff8121ba35 sys_sync
-42005194.132276997 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
-
-This shows that ext4_sync_fs() was called by iterate_supers(), which was called
-by sys_sync(), and so on. (It tells me that this was a syscall invoked sync,
-so an application has requested it.)
-
-The "ip" refers to the instruction pointer, and the "r#" refers to the return
-address for each stack frame.
+to see how they were invoked. For example, tracing the submit_bio() call:
+
+# ./stacksnoop submit_bio
+TIME(s)            SYSCALL
+3592.838736000     submit_bio
+        ffffffff813bd961 submit_bio
+        ffffffff81257c12 submit_bh
+        ffffffff81301948 jbd2_journal_commit_transaction
+        ffffffff8130653a kjournald2
+        ffffffff810a2df8 kthread
+        ffffffff8183a122 ret_from_fork
+
+This shows that submit_bio() was called by submit_bh(), which was called
+by jbd2_journal_commit_transaction(), and so on. 

 For high frequency functions, see stackcount, which summarizes in-kernel for
 efficiency. If you don't know if your function is low or high frequency, try
@@ -32,20 +25,17 @@ funccount.

 The -v option includes more fields, including the on-CPU process (COMM and PID):

-# ./stacksnoop -v ext4_sync_fs
-TIME(s)            COMM         PID    CPU STACK
-42005557.056332998 sync         22352  1   
-42005557.056336999 sync         22352  1   ip: ffffffff81280461 ext4_sync_fs
-42005557.056339003 sync         22352  1   r0: ffffffff811ed7f9 iterate_supers
-42005557.056340002 sync         22352  1   r1: ffffffff8121ba25 sys_sync
-42005557.056340002 sync         22352  1   r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
-42005557.056358002 sync         22352  1   
-42005557.056358002 sync         22352  1   ip: ffffffff81280461 ext4_sync_fs
-42005557.056359001 sync         22352  1   r0: ffffffff811ed7f9 iterate_supers
-42005557.056359999 sync         22352  1   r1: ffffffff8121ba35 sys_sync
-42005557.056359999 sync         22352  1   r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
-
-This identifies the application issuing the sync syscall: the sync(1) command
+# ./stacksnoop -v submit_bio
+TIME(s)            COMM         PID    CPU SYSCALL
+3734.855027000     jbd2/dm-0-8  313    0   submit_bio
+        ffffffff813bd961 submit_bio
+        ffffffff81257c12 submit_bh
+        ffffffff81301948 jbd2_journal_commit_transaction
+        ffffffff8130653a kjournald2
+        ffffffff810a2df8 kthread
+        ffffffff8183a122 ret_from_fork
+
+This identifies the on-CPU process that called submit_bio(): the jbd2 process
 (COMM column).


@@ -53,29 +43,32 @@ Here's another example, showing the path to second_overflow() and on-CPU
 process:

 # ./stacksnoop -v second_overflow
-TIME(s)            COMM         PID    CPU STACK
-42005696.529449999 <idle>       0      0   
-42005696.529457003 <idle>       0      0   ip: ffffffff810e5701 second_overflow
-42005696.529459000 <idle>       0      0   r0: ffffffff810ecb1b tick_do_update_jiffies64
-42005696.529459998 <idle>       0      0   r1: ffffffff810ed6e0 tick_irq_enter
-42005696.529459998 <idle>       0      0   r2: ffffffff8107a195 irq_enter
-42005696.529460996 <idle>       0      0   r3: ffffffff8146bb6f xen_evtchn_do_upcall
-42005696.529460996 <idle>       0      0   r4: ffffffff81777a2e xen_do_hypervisor_callback
-42005697.616295002 <idle>       0      0   
-42005697.616301000 <idle>       0      0   ip: ffffffff810e5701 second_overflow
-42005697.616302997 <idle>       0      0   r0: ffffffff810ecb1b tick_do_update_jiffies64
-42005697.616304003 <idle>       0      0   r1: ffffffff810ed6e0 tick_irq_enter
-42005697.616304003 <idle>       0      0   r2: ffffffff8107a195 irq_enter
-42005697.616305001 <idle>       0      0   r3: ffffffff8146bb6f xen_evtchn_do_upcall
-42005697.616305001 <idle>       0      0   r4: ffffffff81777a2e xen_do_hypervisor_callback
-42005698.556240998 <idle>       0      1   
-42005698.556247003 <idle>       0      1   ip: ffffffff810e5701 second_overflow
-42005698.556249000 <idle>       0      1   r0: ffffffff810ecb1b tick_do_update_jiffies64
-42005698.556249000 <idle>       0      1   r1: ffffffff810ed6e0 tick_irq_enter
-42005698.556249999 <idle>       0      1   r2: ffffffff8107a195 irq_enter
-42005698.556249999 <idle>       0      1   r3: ffffffff8146bb6f xen_evtchn_do_upcall
-42005698.556250997 <idle>       0      1   r4: ffffffff81777a2e xen_do_hypervisor_callback
-[...]
+TIME(s)            COMM         PID    CPU SYSCALL
+3837.526433000     <idle>       0      1   second_overflow
+        ffffffff810fac41 second_overflow
+        ffffffff81102320 tick_do_update_jiffies64
+        ffffffff81102bf0 tick_irq_enter
+        ffffffff810882ac irq_enter
+        ffffffff8183c7df smp_apic_timer_interrupt
+        ffffffff8183aae2 apic_timer_interrupt
+        ffffffff81038f9e default_idle
+        ffffffff8103979f arch_cpu_idle
+        ffffffff810c69da default_idle_call
+        ffffffff810c6cd7 cpu_startup_entry
+        ffffffff81051cbe start_secondary
+
+3838.526953000     <idle>       0      1   second_overflow
+        ffffffff810fac41 second_overflow
+        ffffffff81102320 tick_do_update_jiffies64
+        ffffffff81102bf0 tick_irq_enter
+        ffffffff810882ac irq_enter
+        ffffffff8183c7df smp_apic_timer_interrupt
+        ffffffff8183aae2 apic_timer_interrupt
+        ffffffff81038f9e default_idle
+        ffffffff8103979f arch_cpu_idle
+        ffffffff810c69da default_idle_call
+        ffffffff810c6cd7 cpu_startup_entry
+        ffffffff81051cbe start_secondary

 This fires every second (see TIME(s)), and is from tick_do_update_jiffies64().


--- a/tools/tplist.py
+++ b/tools/tplist.py
+#!/usr/bin/env python
+#
+# tplist    Display kernel tracepoints or USDT probes and their formats.
+#
+# USAGE:    tplist [-p PID] [-l LIB] [-v] [filter]
+#
+# Licensed under the Apache License, Version 2.0 (the "License")
+# Copyright (C) 2016 Sasha Goldshtein.
+
+import argparse
+import fnmatch
+import os
+import re
+import sys
+
+from bcc import USDTReader
+
+trace_root = "/sys/kernel/debug/tracing"
+event_root = os.path.join(trace_root, "events")
+
+parser = argparse.ArgumentParser(description=
+                "Display kernel tracepoints or USDT probes and their formats.",
+                formatter_class=argparse.RawDescriptionHelpFormatter)
+parser.add_argument("-p", "--pid", type=int, default=-1, help=
+                "List USDT probes in the specified process")
+parser.add_argument("-l", "--lib", default="", help=
+                "List USDT probes in the specified library or executable")
+parser.add_argument("-v", dest="variables", action="store_true", help=
+                "Print the format (available variables)")
+parser.add_argument(dest="filter", nargs="?", help=
+                "A filter that specifies which probes/tracepoints to print")
+args = parser.parse_args()
+
+def print_tpoint_format(category, event):
+        fmt = open(os.path.join(event_root, category, event, "format")
+                  ).readlines()
+        for line in fmt:
+                match = re.search(r'field:([^;]*);', line)
+                if match is None:
+                        continue
+                parts = match.group(1).split()
+                field_name = parts[-1]
+                field_type = " ".join(parts[:-1])
+                if "__data_loc" in field_type:
+                        continue
+                if field_name.startswith("common_"):
+                        continue
+                print("    %s %s;" % (field_type, field_name))
+
+def print_tpoint(category, event):
+        tpoint = "%s:%s" % (category, event)
+        if not args.filter or fnmatch.fnmatch(tpoint, args.filter):
+                print(tpoint)
+                if args.variables:
+                        print_tpoint_format(category, event)
+
+def print_tracepoints():
+        for category in os.listdir(event_root):
+                cat_dir = os.path.join(event_root, category)
+                if not os.path.isdir(cat_dir):
+                        continue
+                for event in os.listdir(cat_dir):
+                        evt_dir = os.path.join(cat_dir, event)
+                        if os.path.isdir(evt_dir):
+                                print_tpoint(category, event)
+
+def print_usdt(pid, lib):
+        reader = USDTReader(bin_path=lib, pid=pid)
+        probes_seen = []
+        for probe in reader.probes:
+                probe_name = "%s:%s" % (probe.provider, probe.name)
+                if not args.filter or fnmatch.fnmatch(probe_name, args.filter):
+                        if probe_name in probes_seen:
+                                continue
+                        probes_seen.append(probe_name)
+                        if args.variables:
+                                print(probe.display_verbose())
+                        else:
+                                print("%s %s:%s" % (probe.bin_path,
+                                        probe.provider, probe.name))
+
+if __name__ == "__main__":
+        try:
+                if args.pid != -1 or args.lib != "":
+                        print_usdt(args.pid, args.lib)
+                else:
+                        print_tracepoints()
+        except Exception as e:
+                print(e)
+
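
Note on print_tpoint_format(): the field extraction can be exercised without debugfs by feeding it sample lines. The sketch below mirrors the same regex and filtering rules as a pure function. The name parse_format is hypothetical, and the sample lines (including their offsets and sizes) are illustrative, not copied from a real format file.

```python
import re

FIELD_RE = re.compile(r"field:([^;]*);")


def parse_format(lines):
    """Return (type, name) pairs for the user-visible fields of a tracepoint."""
    fields = []
    for line in lines:
        match = FIELD_RE.search(line)
        if match is None:
            continue
        parts = match.group(1).split()
        field_name = parts[-1]
        field_type = " ".join(parts[:-1])
        # Skip dynamic-location fields and the common_* header fields
        if "__data_loc" in field_type or field_name.startswith("common_"):
            continue
        fields.append((field_type, field_name))
    return fields


if __name__ == "__main__":
    sample = [
        "name: block_rq_complete",
        "\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;",
        "\tfield:dev_t dev;\toffset:8;\tsize:4;\tsigned:0;",
        "\tfield:unsigned int nr_sector;\toffset:24;\tsize:4;\tsigned:0;",
        "\tfield:char rwbs[8];\toffset:32;\tsize:8;\tsigned:1;",
        "\tfield:__data_loc char[] cmd;\toffset:40;\tsize:4;\tsigned:1;",
    ]
    for field_type, field_name in parse_format(sample):
        print("    %s %s;" % (field_type, field_name))
```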
--- a/tools/tplist_example.txt
+++ b/tools/tplist_example.txt
+Demonstrations of tplist.
+
+
+tplist displays kernel tracepoints and USDT probes, including their
+format. It can be used to discover probe points for use with the trace
+and argdist tools. Kernel tracepoints are scattered around the kernel
+and provide valuable static tracing on block and network I/O, scheduling,
+power events, and many other subjects. USDT probes are placed in libraries
+(such as libc) and executables (such as node) and provide static tracing
+information that can (optionally) be turned on and off at runtime.
+
+For example, suppose you want to discover which USDT probes a particular
+executable contains. Just run tplist on that executable (or library):
+
+$ tplist -l basic_usdt
+/home/vagrant/basic_usdt basic_usdt:start_main
+/home/vagrant/basic_usdt basic_usdt:loop_iter
+/home/vagrant/basic_usdt basic_usdt:end_main
+
+The loop_iter probe sounds interesting. What are the locations of that
+probe, and which variables are available?
+
+$ tplist '*loop_iter' -l basic_usdt -v
+/home/vagrant/basic_usdt basic_usdt:loop_iter [sema 0x601036]
+  location 0x400550 raw args: -4@$42 8@%rax
+    4   signed bytes @ constant 42
+    8 unsigned bytes @ register %rax
+  location 0x40056f raw args: 8@-8(%rbp) 8@%rax
+    8 unsigned bytes @ -8(%rbp)
+    8 unsigned bytes @ register %rax
+
+This output indicates that the loop_iter probe is used in two locations
+in the basic_usdt executable. The first location passes a constant value,
+42, to the probe. The second location passes a variable value located at
+an offset from the %rbp register. Don't worry -- you don't have to trace
+the register values yourself. The argdist and trace tools understand the
+probe format and can print out the arguments automatically -- you can
+refer to them as arg1, arg2, and so on.
+
+Try to explore with some common libraries on your system and see if they
+contain USDT probes. Here are two examples you might find interesting:
+
+$ tplist -l pthread     # list probes in libpthread
+/lib64/libpthread.so.0 libpthread:pthread_start
+/lib64/libpthread.so.0 libpthread:pthread_create
+/lib64/libpthread.so.0 libpthread:pthread_join
+/lib64/libpthread.so.0 libpthread:pthread_join_ret
+/lib64/libpthread.so.0 libpthread:mutex_init
+... more output truncated
+
+$ tplist -l c           # list probes in libc
+/lib64/libc.so.6 libc:setjmp
+/lib64/libc.so.6 libc:longjmp
+/lib64/libc.so.6 libc:longjmp_target
+/lib64/libc.so.6 libc:memory_arena_reuse_free_list
+/lib64/libc.so.6 libc:memory_heap_new
+... more output truncated
+
+tplist also understands kernel tracepoints, and can list their format
+as well. For example, let's look for all block I/O-related tracepoints:
+
+# tplist 'block*'
+block:block_touch_buffer
+block:block_dirty_buffer
+block:block_rq_abort
+block:block_rq_requeue
+block:block_rq_complete
+block:block_rq_insert
+block:block_rq_issue
+block:block_bio_bounce
+block:block_bio_complete
+block:block_bio_backmerge
+block:block_bio_frontmerge
+block:block_bio_queue
+block:block_getrq
+block:block_sleeprq
+block:block_plug
+block:block_unplug
+block:block_split
+block:block_bio_remap
+block:block_rq_remap
+
+The block:block_rq_complete tracepoint sounds interesting. Let's print
+its format to see what we can trace with argdist and trace:
+
+$ tplist -v block:block_rq_complete
+block:block_rq_complete
+    dev_t dev;
+    sector_t sector;
+    unsigned int nr_sector;
+    int errors;
+    char rwbs[8];
+
+The dev, sector, nr_sector, etc. variables can now all be used in probes
+you specify with argdist or trace.
+
+
+USAGE message:
+
+$ tplist -h
+usage: tplist.py [-h] [-p PID] [-l LIB] [-v] [filter]
+
+Display kernel tracepoints or USDT probes and their formats.
+
+positional arguments:
+  filter             A filter that specifies which probes/tracepoints to print
+
+optional arguments:
+  -h, --help         show this help message and exit
+  -p PID, --pid PID  List USDT probes in the specified process
+  -l LIB, --lib LIB  List USDT probes in the specified library or executable
+  -v                 Print the format (available variables)
+
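
Note on the "raw args" strings shown in the verbose USDT output above: each argument has the form SIZE@OPERAND, where a negative size marks a signed value (so -4@$42 is 4 signed bytes at the constant 42). A small sketch of splitting such a string follows. The function name parse_usdt_arg is hypothetical and this is not how tplist or USDTReader implements it.

```python
def parse_usdt_arg(spec):
    """Split a raw USDT argument such as '-4@$42' into (size, signed, operand)."""
    # Split on the first '@' only, so the operand is kept exactly as written
    size_text, _, operand = spec.partition("@")
    size = int(size_text)
    return abs(size), size < 0, operand


if __name__ == "__main__":
    for spec in "-4@$42 8@%rax".split():
        print(parse_usdt_arg(spec))
```

The operand is returned uninterpreted. Resolving registers or memory offsets is left to tools such as trace and argdist, as described above.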
--- a/tools/trace.py
+++ b/tools/trace.py
--- a/tools/trace_example.txt
+++ b/tools/trace_example.txt
@@ -80,6 +80,31 @@ Note that the retval variable must be cast to int before comparing to zero.
 The reason is that the default type for argN and retval is an unsigned 64-bit
 integer, which can never be smaller than 0.

+trace also has some basic support for kernel tracepoints. For example, let's
+trace the block:block_rq_complete tracepoint and print out the number of sectors
+transferred:
+
+# trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+TIME     PID    COMM         FUNC             -
+01:23:51 0      swapper/0    block_rq_complete sectors=8
+01:23:55 10017  kworker/u64: block_rq_complete sectors=1
+01:23:55 0      swapper/0    block_rq_complete sectors=8
+^C
+
+To discover the tracepoint structure format (which you can refer to as the "tp"
+variable), use the tplist tool. For example:
+
+# tplist -v block:block_rq_complete
+block:block_rq_complete
+    dev_t dev;
+    sector_t sector;
+    unsigned int nr_sector;
+    int errors;
+    char rwbs[8];
+
+This output tells you that you can use "tp.dev", "tp.sector", etc. in your
+predicate and trace arguments.
+
 As a final example, let's trace open syscalls for a specific process. By 
 default, tracing is system-wide, but the -p switch overrides this:

@@ -144,4 +169,8 @@ trace 'r::__kmalloc (retval == 0) "kmalloc failed!"
        Trace returns from __kmalloc which returned a null pointer
 trace 'r:c:malloc (retval) "allocated = %p", retval
        Trace returns from malloc and print non-NULL allocated buffers
+trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+        Trace the block_rq_complete kernel tracepoint and print # of tx sectors
+trace 'u:pthread:pthread_create (arg4 != 0)'
+        Trace the USDT probe pthread_create when its 4th argument is non-zero