Commit c063fc4f authored by Brenden Blanco, committed by GitHub

Merge branch 'master' into patch

parents 4bb64cfb afb19da0
......@@ -2,6 +2,10 @@
*.swp
*.swo
*.pyc
.idea
# Build artefacts
# Build artifacts
/build/
cmake-build-debug
debian/**/*.log
obj-x86_64-linux-gnu
......@@ -7,6 +7,7 @@
- [Arch](#arch---aur)
- [Gentoo](#gentoo---portage)
* [Source](#source)
- [Debian](#debian---source)
- [Ubuntu](#ubuntu---source)
- [Fedora](#fedora---source)
* [Older Instructions](#older-instructions)
......@@ -163,6 +164,90 @@ The appropriate dependencies (e.g., ```clang```, ```llvm``` with BPF backend) wi
# Source
## Debian - Source
### Jessie
#### Repositories
The automated tests that run as part of the build process require `netperf`. Since netperf's license is not "certified"
as an open-source license, it is in Debian's `non-free` repository.
`/etc/apt/sources.list` should include the `non-free` repository and look something like this:
```
deb http://httpredir.debian.org/debian/ jessie main non-free
deb-src http://httpredir.debian.org/debian/ jessie main non-free
deb http://security.debian.org/ jessie/updates main non-free
deb-src http://security.debian.org/ jessie/updates main non-free
# jessie-updates, previously known as 'volatile'
deb http://ftp.us.debian.org/debian/ jessie-updates main non-free
deb-src http://ftp.us.debian.org/debian/ jessie-updates main non-free
```
BCC also requires kernel version 4.1 or above. Those kernels are available in the `jessie-backports` repository. To
add the `jessie-backports` repository to your system create the file `/etc/apt/sources.list.d/jessie-backports.list`
with the following contents:
```
deb http://httpredir.debian.org/debian jessie-backports main
deb-src http://httpredir.debian.org/debian jessie-backports main
```
#### Install Build Dependencies
Note: check for the latest `linux-image-4.x` version in `jessie-backports` before proceeding. Also, have a look at
the `Build-Depends:` section in the `debian/control` file.
```
# Before you begin
apt-get update
# Update kernel and linux-base package
apt-get -t jessie-backports install linux-base linux-image-4.8.0-0.bpo.2-amd64
# BCC build dependencies:
apt-get install debhelper cmake libllvm3.8 llvm-3.8-dev libclang-3.8-dev \
libelf-dev bison flex libedit-dev clang-format-3.8 python python-netaddr \
python-pyroute2 luajit libluajit-5.1-dev arping iperf netperf ethtool \
devscripts
```
#### Sudo
Adding eBPF probes to the kernel and removing probes from it requires root privileges. For the build to complete
successfully, you must build from an account with `sudo` access. (You may also build as root, but it is bad style.)
`/etc/sudoers` or `/etc/sudoers.d/build-user` should contain
```
build-user ALL = (ALL) NOPASSWD: ALL
```
or
```
build-user ALL = (ALL) ALL
```
If using the latter sudoers configuration, please keep an eye out for sudo's password prompt while the build is running.
#### Build
```
cd <preferred development directory>
git clone https://github.com/iovisor/bcc.git
cd bcc
debuild -b -uc -us
```
#### Install
```
cd ..
sudo dpkg -i *bcc*.deb
```
## Ubuntu - Source
To build the toolchain from source, one needs:
......@@ -219,9 +304,12 @@ sudo pip install pyroute2
wget http://llvm.org/releases/3.7.1/clang+llvm-3.7.1-x86_64-fedora22.tar.xz
sudo tar xf clang+llvm-3.7.1-x86_64-fedora22.tar.xz -C /usr/local --strip 1
# FC23 and FC24
# FC23
wget http://llvm.org/releases/3.9.0/clang+llvm-3.9.0-x86_64-fedora23.tar.xz
sudo tar xf clang+llvm-3.9.0-x86_64-fedora23.tar.xz -C /usr/local --strip 1
# FC24 and FC25
sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
```
### Install and compile BCC
......
......@@ -73,7 +73,7 @@ Examples:
- examples/tracing/[vfsreadlat.py](examples/tracing/vfsreadlat.py) examples/tracing/[vfsreadlat.c](examples/tracing/vfsreadlat.c): VFS read latency distribution. [Examples](examples/tracing/vfsreadlat_example.txt).
#### Tools:
<center><a href="images/bcc_tracing_tools_2016.png"><img src="images/bcc_tracing_tools_2016.png" border=0 width=700></a></center>
<center><a href="images/bcc_tracing_tools_2017.png"><img src="images/bcc_tracing_tools_2017.png" border=0 width=700></a></center>
- tools/[argdist](tools/argdist.py): Display function parameter values as a histogram or frequency count. [Examples](tools/argdist_example.txt).
- tools/[bashreadline](tools/bashreadline.py): Print entered bash commands system wide. [Examples](tools/bashreadline_example.txt).
- tools/[biolatency](tools/biolatency.py): Summarize block device I/O latency as a histogram. [Examples](tools/biolatency_example.txt).
......@@ -89,6 +89,7 @@ Examples:
- tools/[cpuunclaimed](tools/cpuunclaimed.py): Sample CPU run queues and calculate unclaimed idle CPU. [Examples](tools/cpuunclaimed_example.txt)
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[deadlock_detector](tools/deadlock_detector.py): Detect potential deadlocks on a running process. [Examples](tools/deadlock_detector_example.txt)
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[ext4dist](tools/ext4dist.py): Summarize ext4 operation latency distribution as a histogram. [Examples](tools/ext4dist_example.txt).
- tools/[ext4slower](tools/ext4slower.py): Trace slow ext4 operations. [Examples](tools/ext4slower_example.txt).
......
%bcond_with local_clang_static
#lua jit not available for some architectures
%ifarch ppc64 aarch64 ppc64le
%{!?with_lua: %global with_lua 0}
%else
%{!?with_lua: %global with_lua 1}
%endif
%define debug_package %{nil}
Name: bcc
......@@ -11,10 +17,12 @@ License: ASL 2.0
URL: https://github.com/iovisor/bcc
Source0: bcc.tar.gz
ExclusiveArch: x86_64
ExclusiveArch: x86_64 ppc64 aarch64 ppc64le
BuildRequires: bison cmake >= 2.8.7 flex make
BuildRequires: gcc gcc-c++ python2-devel elfutils-libelf-devel-static
%if %{with_lua}
BuildRequires: luajit luajit-devel
%endif
%if %{without local_clang_static}
BuildRequires: llvm-devel llvm-static
BuildRequires: clang-devel
......@@ -25,6 +33,11 @@ BuildRequires: pkgconfig ncurses-devel
Python bindings for BPF Compiler Collection (BCC). Control a BPF program from
userspace.
%if %{with_lua}
%global lua_include `pkg-config --variable=includedir luajit`
%global lua_libs `pkg-config --variable=libdir luajit`/lib`pkg-config --variable=libname luajit`.so
%global lua_config -DLUAJIT_INCLUDE_DIR=%{lua_include} -DLUAJIT_LIBRARIES=%{lua_libs}
%endif
%prep
%setup -q -n bcc
......@@ -35,8 +48,7 @@ mkdir build
pushd build
cmake .. -DREVISION_LAST=%{version} -DREVISION=%{version} \
-DCMAKE_INSTALL_PREFIX=/usr \
-DLUAJIT_INCLUDE_DIR=`pkg-config --variable=includedir luajit` \
-DLUAJIT_LIBRARIES=`pkg-config --variable=libdir luajit`/lib`pkg-config --variable=libname luajit`.so
%{?lua_config}
make %{?_smp_mflags}
popd
......@@ -56,16 +68,20 @@ Requires: libbcc = %{version}-%{release}
%description -n python-bcc
Python bindings for BPF Compiler Collection (BCC)
%if %{with_lua}
%package -n bcc-lua
Summary: Standalone tool to run BCC tracers written in Lua
Requires: libbcc = %{version}-%{release}
%description -n bcc-lua
Standalone tool to run BCC tracers written in Lua
%endif
%package -n libbcc-examples
Summary: Examples for BPF Compiler Collection (BCC)
Requires: python-bcc = %{version}-%{release}
%if %{with_lua}
Requires: bcc-lua = %{version}-%{release}
%endif
%description -n libbcc-examples
Examples for BPF Compiler Collection (BCC)
......@@ -82,8 +98,10 @@ Command line tools for BPF Compiler Collection (BCC)
%files -n python-bcc
%{python_sitelib}/bcc*
%if %{with_lua}
%files -n bcc-lua
/usr/bin/bcc-lua
%endif
%files -n libbcc-examples
/usr/share/bcc/examples/*
......
......@@ -3,7 +3,12 @@ Maintainer: Brenden Blanco <bblanco@plumgrid.com>
Section: misc
Priority: optional
Standards-Version: 3.9.5
Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8, llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev, libelf-dev, bison, flex, libedit-dev, clang-format | clang-format-3.7, python-netaddr, python-pyroute2, luajit, libluajit-5.1-dev
Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8,
llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev,
libelf-dev, bison, flex, libedit-dev,
clang-format | clang-format-3.7 | clang-format-3.8, python (>= 2.7),
python-netaddr, python-pyroute2, luajit, libluajit-5.1-dev, arping,
inetutils-ping | iputils-ping, iperf, netperf, ethtool, devscripts
Homepage: https://github.com/iovisor/bcc
Package: libbcc
......
......@@ -15,6 +15,7 @@ ARM64 | 3.18 | [e54bcde3d69d](https://git.kernel.org/cgit/linux/kernel/git/torva
s390 | 4.1 | [054623105728](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=054623105728b06852f077299e2bf1bf3d5f2b0b)
Constant blinding for JIT machines | 4.7 | [4f3446bb809f](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4f3446bb809f20ad56cadf712e6006815ae7a8f9)
PowerPC64 | 4.8 | [156d0e290e96](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=156d0e290e969caba25f1851c52417c14d141b24)
Constant blinding - PowerPC64 | 4.9 | [b7b7013cac55](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b7b7013cac55d794940bd9cb7b7c55c9dececac4)
## Main features
......@@ -24,7 +25,7 @@ Feature | Kernel version | Commit
Kernel helpers | 3.15 | [bd4cf0ed331a](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8)
`bpf()` syscall | 3.18 | [99c55f7d47c0](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=99c55f7d47c0dc6fc64729f37bf435abf43f4c60)
Tables (_a.k.a._ Maps; details below) | 3.18 | [99c55f7d47c0](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=99c55f7d47c0dc6fc64729f37bf435abf43f4c60)
BPF attached to sockets | 3.19 | [89aa075832b0](89aa075832b0da4402acebd698d0411dcc82d03e)
BPF attached to sockets | 3.19 | [89aa075832b0](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=89aa075832b0da4402acebd698d0411dcc82d03e)
BPF attached to `kprobes` | 4.1 | [2541517c32be](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2541517c32be2531e0da59dfd7efc1ce844644f5)
`cls_bpf` / `act_bpf` for `tc` | 4.1 | [e2e9b6541dd4](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e2e9b6541dd4b31848079da80fe2253daaafb549)
Tail calls | 4.2 | [04fd61ab36ec](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=04fd61ab36ec065e194ab5e74ae34a5240d992bb)
......@@ -53,9 +54,11 @@ Perf events | 4.3 | [ea317b267e9d](https://git.kernel.org/cgit/linux/kernel/git/
Per-CPU hash | 4.6 | [824bd0ce6c7c](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=824bd0ce6c7c43a9e1e210abf124958e54d88342)
Per-CPU array | 4.6 | [a10423b87a7e](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a10423b87a7eae75da79ce80a8d9475047a674ee)
Stack trace | 4.6 | [d5a3b1f69186](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d5a3b1f691865be576c2bffa708549b8cdccda19)
Pre-alloc maps memory | 4.6 | [6c9059817432](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6c90598174322b8888029e40dd84a4eb01f56afe)
cgroup array | 4.8 | [4ed8ec521ed5](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4ed8ec521ed57c4e207ad464ca0388776de74d4b)
LRU hash | [4.10](https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=29ba732acbeece1e34c68483d1ec1f3720fa1bb3) | [](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=29ba732acbeece1e34c68483d1ec1f3720fa1bb3)
LRU per-CPU hash | [4.10](https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=8f8449384ec364ba2a654f11f94e754e4ff719e0) | [](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8f8449384ec364ba2a654f11f94e754e4ff719e0)
LRU hash | 4.10 | [29ba732acbee](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=29ba732acbeece1e34c68483d1ec1f3720fa1bb3)
LRU per-CPU hash | 4.10 | [8f8449384ec3](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8f8449384ec364ba2a654f11f94e754e4ff719e0)
LPM trie | 4.11 | [b95a5c4db09b](https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=b95a5c4db09bc7c253636cb84dc9b12c577fd5a0)
Text string | _To be done?_ |
Variable-length maps | _To be done?_ |
......
.TH deadlock_detector 8 "2017-02-01" "USER COMMANDS"
.SH NAME
deadlock_detector \- Find potential deadlocks (lock order inversions)
in a running program.
.SH SYNOPSIS
.B deadlock_detector [\-h] [\--binary BINARY] [\--dump-graph DUMP_GRAPH]
.B [\--verbose] [\--lock-symbols LOCK_SYMBOLS]
.B [\--unlock-symbols UNLOCK_SYMBOLS]
.B pid
.SH DESCRIPTION
deadlock_detector finds potential deadlocks in a running process. The program
attaches uprobes on `pthread_mutex_lock` and `pthread_mutex_unlock` by default
to build a mutex wait directed graph, and then looks for a cycle in this graph.
This graph has the following properties:
- Nodes in the graph represent mutexes.
- Edge (A, B) exists if there exists some thread T where lock(A) was called
and lock(B) was called before unlock(A) was called.
If there is a cycle in this graph, this indicates that there is a lock order
inversion (potential deadlock). If the program finds a lock order inversion, the
program will dump the cycle of mutexes, dump the stack traces where each mutex
was acquired, and then exit.
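As a rough illustration of the cycle check described above, here is a minimal C++ sketch (assumptions: integer mutex IDs and a prebuilt adjacency list; the tool itself derives the graph from uprobe events and reports the stack traces, not just a boolean):
```cpp
// Minimal sketch of lock-order cycle detection (illustration only).
// Edge A -> B: some thread acquired mutex B while holding mutex A.
#include <iostream>
#include <map>
#include <set>
#include <vector>

using Graph = std::map<int, std::vector<int>>;

// DFS: revisiting a mutex already on the current path is a back edge,
// i.e. a lock order inversion.
bool has_cycle(const Graph& g, int node, std::set<int>& path,
               std::set<int>& done) {
  if (path.count(node)) return true;
  if (done.count(node)) return false;
  path.insert(node);
  auto it = g.find(node);
  if (it != g.end())
    for (int next : it->second)
      if (has_cycle(g, next, path, done)) return true;
  path.erase(node);
  done.insert(node);
  return false;
}

int main() {
  // Thread T: lock(M0) then lock(M1); thread S: lock(M1) then lock(M0).
  Graph g = {{0, {1}}, {1, {0}}};
  std::set<int> path, done;
  std::cout << (has_cycle(g, 0, path, done) ? "potential deadlock\n" : "ok\n");
}
```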
This program can only find potential deadlocks that occur while the program is
tracing the process. It cannot find deadlocks that may have occurred before the
program was attached to the process.
This tool does not work for shared mutexes or recursive mutexes.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc
.SH OPTIONS
.TP
\-h, --help
show this help message and exit
.TP
\--binary BINARY
If set, trace the mutexes from the binary at this path. For
statically-linked binaries, this argument is not required.
For dynamically-linked binaries, this argument is required and should be the
path of the pthread library the binary is using.
Example: /lib/x86_64-linux-gnu/libpthread.so.0
.TP
\--dump-graph DUMP_GRAPH
If set, this will dump the mutex graph to the specified file.
.TP
\--verbose
Print statistics about the mutex wait graph.
.TP
\--lock-symbols LOCK_SYMBOLS
Comma-separated list of lock symbols to trace. Default is pthread_mutex_lock.
These symbols cannot be inlined in the binary.
.TP
\--unlock-symbols UNLOCK_SYMBOLS
Comma-separated list of unlock symbols to trace. Default is
pthread_mutex_unlock. These symbols cannot be inlined in the binary.
.TP
pid
Pid to trace
.SH EXAMPLES
.TP
Find potential deadlocks in PID 181. The --binary argument is not needed for \
statically-linked binaries.
#
.B deadlock_detector 181
.TP
Find potential deadlocks in PID 181. If the process was created from a \
dynamically-linked executable, the --binary argument is required and must be \
the path of the pthread library:
#
.B deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
.TP
Find potential deadlocks in PID 181. If the process was created from a \
statically-linked executable, optionally pass the location of the binary. \
On older kernels without https://lkml.org/lkml/2017/1/13/585, binaries that \
contain `:` in the path cannot be attached with uprobes. As a workaround, we \
can create a symlink to the binary, and provide the symlink name instead with \
the `--binary` option:
#
.B deadlock_detector 181 --binary /usr/local/bin/lockinversion
.TP
Find potential deadlocks in PID 181 and dump the mutex wait graph to a file:
#
.B deadlock_detector 181 --dump-graph graph.json
.TP
Find potential deadlocks in PID 181 and print mutex wait graph statistics:
#
.B deadlock_detector 181 --verbose
.TP
Find potential deadlocks in PID 181 with custom mutexes:
#
.B deadlock_detector 181
.B --lock-symbols custom_mutex1_lock,custom_mutex2_lock
.B --unlock-symbols custom_mutex1_unlock,custom_mutex2_unlock
.SH OUTPUT
This program does not output any fields. Rather, it will keep running until
it finds a potential deadlock, or the user hits Ctrl-C. If the program finds
a potential deadlock, it will output the stack traces and lock order inversion
in the following format and exit:
.TP
Potential Deadlock Detected!
.TP
Cycle in lock order graph: Mutex M0 => Mutex M1 => Mutex M0
.TP
Mutex M1 acquired here while holding Mutex M0 in Thread T:
.B [stack trace]
.TP
Mutex M0 previously acquired by the same Thread T here:
.B [stack trace]
.TP
Mutex M0 acquired here while holding Mutex M1 in Thread S:
.B [stack trace]
.TP
Mutex M1 previously acquired by the same Thread S here:
.B [stack trace]
.TP
Thread T created by Thread R here:
.B [stack trace]
.TP
Thread S created by Thread Q here:
.B [stack trace]
.SH OVERHEAD
This traces all mutex lock and unlock events and all thread creation events
on the traced process. The overhead of this can be high if the process has many
threads and mutexes. You should only run this on a process where the slowdown
is acceptable.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Kenny Yu
......@@ -21,6 +21,10 @@ and may need modifications to match your software and processor architecture.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-p PID
Trace this process ID only.
.SH EXAMPLES
.TP
Trace host lookups (getaddrinfo/gethostbyname[2]) system wide:
......
......@@ -3,7 +3,7 @@
memleak \- Print a summary of outstanding allocations and their call stacks to detect memory leaks. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND] [-s SAMPLE_RATE]
[-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
[-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [-O OBJ] [INTERVAL] [COUNT]
.SH DESCRIPTION
memleak traces and matches memory allocation and deallocation requests, and
collects call stacks for each allocation. memleak can then print a summary
......@@ -53,6 +53,9 @@ Capture only allocations that are larger than or equal to MIN_SIZE bytes.
\-Z MAX_SIZE
Capture only allocations that are smaller than or equal to MAX_SIZE bytes.
.TP
\-O OBJ
Attach to malloc and free in specified object instead of resolving libc. Ignored when kernel allocations are profiled.
.TP
INTERVAL
Print a summary of outstanding allocations and their call stacks every INTERVAL seconds.
The default interval is 5 seconds.
......
......@@ -49,7 +49,7 @@ Show stacks from kernel space only (no user space stacks).
.TP
\-\-stack-storage-size COUNT
The maximum number of unique stack traces that the kernel will count (default
2048). If the sampled count exceeds this, a warning will be printed.
10240). If the sampled count exceeds this, a warning will be printed.
.TP
duration
Duration to trace, in seconds.
......
......@@ -25,7 +25,8 @@ CONFIG_BPF and bcc.
Print usage message.
.TP
\-t
Include a timestamp column.
Include a timestamp column, in seconds since the first event, with decimal
places.
.TP
\-x
Only print failed stats.
......
......@@ -62,7 +62,7 @@ information. See PROBE SYNTAX below.
.SH PROBE SYNTAX
The general probe syntax is as follows:
.B [{p,r}]:[library]:function [(predicate)] ["format string"[, arguments]]
.B [{p,r}]:[library]:function[(signature)] [(predicate)] ["format string"[, arguments]]
.B {t:category:event,u:library:probe} [(predicate)] ["format string"[, arguments]]
.TP
......@@ -84,6 +84,12 @@ The tracepoint category. For example, "sched" or "irq".
.B function
The function to probe.
.TP
.B signature
The optional signature of the function to probe. This can make it easier to
access the function's arguments, instead of using the "arg1", "arg2" etc.
argument specifiers. For example, "(struct timespec *ts)" in the signature
position lets you use "ts" in the filter or print expressions.
.TP
.B event
The tracepoint event. For example, "block_rq_complete".
.TP
......@@ -159,6 +165,10 @@ Trace the block:block_rq_complete tracepoint and print the number of sectors com
Trace the pthread_create USDT probe from the pthread library and print the address of the thread's start function:
#
.B trace 'u:pthread:pthread_create """start addr = %llx"", arg3'
.TP
Trace the nanosleep system call and print the sleep duration in nanoseconds:
#
.B trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
.SH SOURCE
This is from bcc.
.IP
......
......@@ -2,28 +2,29 @@
.SH NAME
ucalls \- Summarize method calls from high-level languages and Linux syscalls.
.SH SYNOPSIS
.B ucalls [-l {java,python,ruby}] [-h] [-T TOP] [-L] [-S] [-v] [-m] pid [interval]
.B ucalls [-l {java,python,ruby,php}] [-h] [-T TOP] [-L] [-S] [-v] [-m] pid [interval]
.SH DESCRIPTION
This tool summarizes method calls from high-level languages such as Python,
Java, and Ruby. It can also trace Linux system calls. Whenever a method is
Java, Ruby, and PHP. It can also trace Linux system calls. Whenever a method is
invoked, ucalls records the call count and optionally the method's execution
time (latency) and displays a summary.
This uses in-kernel eBPF maps to store per process summaries for efficiency.
This tool relies on USDT probes embedded in many high-level languages, such as
Node, Java, Python, and Ruby. It requires a runtime instrumented with these
Java, Python, Ruby, and PHP. It requires a runtime instrumented with these
probes, which in some cases requires building from source with a USDT-specific
flag, such as "--enable-dtrace" or "--with-dtrace". For Java, method probes are
not enabled by default, and can be turned on by running the Java process with
the "-XX:+ExtendedDTraceProbes" flag.
the "-XX:+ExtendedDTraceProbes" flag. For PHP processes, the environment
variable USE_ZEND_DTRACE must be set to 1.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-l {java,python,ruby,node}
\-l {java,python,ruby,php}
The language to trace. If not provided, only syscalls are traced (when the \-S
option is used).
.TP
......
......@@ -2,16 +2,17 @@
.SH NAME
uflow \- Print a flow graph of method calls in high-level languages.
.SH SYNOPSIS
.B uflow [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby} pid
.B uflow [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby,php} pid
.SH DESCRIPTION
uflow traces method calls and prints them in a flow graph that can facilitate
debugging and diagnostics by following the program's execution (method flow).
This tool relies on USDT probes embedded in many high-level languages, such as
Node, Java, Python, and Ruby. It requires a runtime instrumented with these
Java, Python, Ruby, and PHP. It requires a runtime instrumented with these
probes, which in some cases requires building from source with a USDT-specific
flag, such as "--enable-dtrace" or "--with-dtrace". For Java processes, the
startup flag "-XX:+ExtendedDTraceProbes" is required.
startup flag "-XX:+ExtendedDTraceProbes" is required. For PHP processes, the
environment variable USE_ZEND_DTRACE must be set to 1.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
......@@ -29,7 +30,7 @@ name interpretation strongly depends on the language. For example, in Java use
\-v
Print the resulting BPF program, for debugging purposes.
.TP
{java,python,ruby}
{java,python,ruby,php}
The language to trace.
.TP
pid
......
......@@ -2,7 +2,7 @@
.SH NAME
ugc \- Trace garbage collection events in high-level languages.
.SH SYNOPSIS
.B ugc [-h] [-v] [-m] {java,python,ruby,node} pid
.B ugc [-h] [-v] [-m] [-M MINIMUM] [-F FILTER] {java,python,ruby,node} pid
.SH DESCRIPTION
This traces garbage collection events as they occur, including their duration
and any additional information (such as generation collected or type of GC)
......@@ -24,6 +24,18 @@ Print the resulting BPF program, for debugging purposes.
\-m
Print times in milliseconds. The default is microseconds.
.TP
\-M MINIMUM
Display only collections that are longer than this threshold. The value is
given in milliseconds. The default is to display all collections.
.TP
\-F FILTER
Display only collections whose textual description matches (contains) this
string. The default is to display all collections. Note that the filtering here
is performed in user-space, and not as part of the BPF program. This means that
if you have thousands of collection events, specifying this filter will not
reduce the amount of data that has to be transferred from the BPF program to
the user-space script.
.TP
{java,python,ruby,node}
The language to trace.
.TP
......@@ -39,16 +51,22 @@ Trace garbage collections in a specific Java process, and print GC times in
milliseconds:
#
.B ugc -m java 6004
.TP
Trace garbage collections in a specific Java process, and display them only if
they are longer than 10ms and have the string "Tenured" in their detailed
description:
#
.B ugc -M 10 -F Tenured java 6004
.SH FIELDS
.TP
START
The start time of the GC, in seconds from the beginning of the trace.
.TP
DESCRIPTION
The runtime-provided description of this garbage collection event.
.TP
TIME
The duration of the garbage collection event.
.TP
DESCRIPTION
The runtime-provided description of this garbage collection event.
.SH OVERHEAD
Garbage collection events, even if frequent, should not produce a considerable
overhead when traced because they are still not very common. Even hundreds of
......
......@@ -2,7 +2,7 @@
.SH NAME
ustat \- Activity stats from high-level languages.
.SH SYNOPSIS
.B ustat [-l {java,python,ruby,node}] [-C] [-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d] [interval [count]]
.B ustat [-l {java,python,ruby,node,php}] [-C] [-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d] [interval [count]]
.SH DESCRIPTION
This is "top" for high-level language events, such as garbage collections,
exceptions, thread creations, object allocations, method calls, and more. The
......@@ -12,11 +12,12 @@ can be sorted by various fields.
This uses in-kernel eBPF maps to store per process summaries for efficiency.
This tool relies on USDT probes embedded in many high-level languages, such as
Node, Java, Python, and Ruby. It requires a runtime instrumented with these
Node, Java, Python, Ruby, and PHP. It requires a runtime instrumented with these
probes, which in some cases requires building from source with a USDT-specific
flag, such as "--enable-dtrace" or "--with-dtrace". For Java, some probes are
not enabled by default, and can be turned on by running the Java process with
the "-XX:+ExtendedDTraceProbes" flag.
the "-XX:+ExtendedDTraceProbes" flag. For PHP processes, the environment
variable USE_ZEND_DTRACE must be set to 1.
Newly-created processes will only be traced at the next interval. If you run
this tool with a short interval (say, 1-5 seconds), this should be virtually
......@@ -28,7 +29,7 @@ Since this uses BPF, only the root user can use this tool.
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-l {java,python,ruby,node}
\-l {java,python,ruby,node,php}
The language to trace. By default, all languages are traced.
.TP
\-C
......
......@@ -24,14 +24,20 @@ pushd $TMP
tar xf bcc_$revision.orig.tar.gz
cd bcc
debuild=debuild
if [[ "$buildtype" = "test" ]]; then
# when testing, use faster compression options
debuild+=" --preserve-envvar PATH"
echo -e '#!/bin/bash\nexec /usr/bin/dpkg-deb -z1 "$@"' \
| sudo tee /usr/local/bin/dpkg-deb
sudo chmod +x /usr/local/bin/dpkg-deb
dch -b -v $revision-$release "$git_subject"
fi
if [[ "$buildtype" = "nightly" ]]; then
dch -v $revision-$release "$git_subject"
fi
DEB_BUILD_OPTIONS="nocheck parallel=${PARALLEL}" debuild -us -uc
DEB_BUILD_OPTIONS="nocheck parallel=${PARALLEL}" $debuild -us -uc
popd
cp $TMP/*.deb .
......@@ -16,7 +16,7 @@
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
name: bcc
version: 0.2.0-20161215-1402-7151673
version: 0.2.0-20170208-1555-3e77af5
summary: BPF Compiler Collection (BCC)
description: A toolkit for creating efficient kernel tracing and manipulation programs
confinement: strict
......@@ -46,12 +46,18 @@ apps:
command: wrapper cachestat
cachetop:
command: wrapper cachetop
capable:
command: wrapper capable
cpudist:
command: wrapper cpudist
cpuunclaimed:
command: wrapper cpuunclaimed
dcsnoop:
command: wrapper dcsnoop
dcstat:
command: wrapper dcstat
deadlock-detector:
command: wrapper deadlock_detector
execsnoop:
command: wrapper execsnoop
ext4dist:
......@@ -74,10 +80,14 @@ apps:
command: wrapper hardirqs
killsnoop:
command: wrapper killsnoop
llcstat:
command: wrapper llcstat
mdflush:
command: wrapper mdflush
memleak:
command: wrapper memleak
mountsnoop:
command: wrapper mountsnoop
offcputime:
command: wrapper offcputime
offwaketime:
......@@ -88,12 +98,18 @@ apps:
command: wrapper opensnoop
pidpersec:
command: wrapper pidpersec
profile:
command: wrapper profile
runqlat:
command: wrapper runqlat
runqlen:
command: wrapper runqlen
slabratetop:
command: wrapper slabratetop
softirqs:
command: wrapper softirqs
solisten:
command: wrapper solisten
sslsniff:
command: wrapper sslsniff
stackcount:
......@@ -116,10 +132,24 @@ apps:
command: wrapper tcpretrans
tcptop:
command: wrapper tcptop
ttysnoop:
command: wrapper ttysnop
tplist:
command: wrapper tplist
trace:
command: wrapper trace
ttysnoop:
command: wrapper ttysnoop
ucalls:
command: wrapper ucalls
uflow:
command: wrapper uflow
ugc:
command: wrapper ugc
uobjnew:
command: wrapper uobjnew
ustat:
command: wrapper ustat
uthreads:
command: wrapper uthreads
vfscount:
command: wrapper vfscount
vfsstat:
......
......@@ -30,6 +30,7 @@
#include "bpf_module.h"
#include "libbpf.h"
#include "perf_reader.h"
#include "common.h"
#include "usdt.h"
#include "BPF.h"
......@@ -146,7 +147,7 @@ StatusTuple BPF::detach_all() {
StatusTuple BPF::attach_kprobe(const std::string& kernel_func,
const std::string& probe_func,
bpf_attach_type attach_type,
bpf_probe_attach_type attach_type,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void* cb_cookie) {
std::string probe_event = get_kprobe_event(kernel_func, attach_type);
......@@ -156,11 +157,8 @@ StatusTuple BPF::attach_kprobe(const std::string& kernel_func,
int probe_fd;
TRY2(load_func(probe_func, BPF_PROG_TYPE_KPROBE, probe_fd));
std::string probe_event_desc = attach_type_prefix(attach_type);
probe_event_desc += ":kprobes/" + probe_event + " " + kernel_func;
void* res =
bpf_attach_kprobe(probe_fd, probe_event.c_str(), probe_event_desc.c_str(),
bpf_attach_kprobe(probe_fd, attach_type, probe_event.c_str(), kernel_func.c_str(),
pid, cpu, group_fd, cb, cb_cookie);
if (!res) {
......@@ -181,7 +179,7 @@ StatusTuple BPF::attach_uprobe(const std::string& binary_path,
const std::string& symbol,
const std::string& probe_func,
uint64_t symbol_addr,
bpf_attach_type attach_type,
bpf_probe_attach_type attach_type,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void* cb_cookie) {
bcc_symbol sym = bcc_symbol();
......@@ -195,13 +193,9 @@ StatusTuple BPF::attach_uprobe(const std::string& binary_path,
int probe_fd;
TRY2(load_func(probe_func, BPF_PROG_TYPE_KPROBE, probe_fd));
std::string probe_event_desc = attach_type_prefix(attach_type);
probe_event_desc += ":uprobes/" + probe_event + " ";
probe_event_desc += binary_path + ":0x" + uint_to_hex(sym.offset);
void* res =
bpf_attach_uprobe(probe_fd, probe_event.c_str(), probe_event_desc.c_str(),
pid, cpu, group_fd, cb, cb_cookie);
bpf_attach_uprobe(probe_fd, attach_type, probe_event.c_str(), binary_path.c_str(),
sym.offset, pid, cpu, group_fd, cb, cb_cookie);
if (!res) {
TRY2(unload_func(probe_func));
......@@ -297,11 +291,12 @@ StatusTuple BPF::attach_perf_event(uint32_t ev_type, uint32_t ev_config,
TRY2(load_func(probe_func, BPF_PROG_TYPE_PERF_EVENT, probe_fd));
auto fds = new std::map<int, int>();
int cpu_st = 0;
int cpu_en = sysconf(_SC_NPROCESSORS_ONLN) - 1;
std::vector<int> cpus;
if (cpu >= 0)
cpu_st = cpu_en = cpu;
for (int i = cpu_st; i <= cpu_en; i++) {
cpus.push_back(cpu);
else
cpus = get_online_cpus();
for (int i: cpus) {
int fd = bpf_attach_perf_event(probe_fd, ev_type, ev_config, sample_period,
sample_freq, pid, i, group_fd);
if (fd < 0) {
......@@ -323,7 +318,7 @@ StatusTuple BPF::attach_perf_event(uint32_t ev_type, uint32_t ev_config,
}
StatusTuple BPF::detach_kprobe(const std::string& kernel_func,
bpf_attach_type attach_type) {
bpf_probe_attach_type attach_type) {
std::string event = get_kprobe_event(kernel_func, attach_type);
auto it = kprobes_.find(event);
......@@ -339,7 +334,7 @@ StatusTuple BPF::detach_kprobe(const std::string& kernel_func,
StatusTuple BPF::detach_uprobe(const std::string& binary_path,
const std::string& symbol, uint64_t symbol_addr,
bpf_attach_type attach_type) {
bpf_probe_attach_type attach_type) {
bcc_symbol sym = bcc_symbol();
TRY2(check_binary_symbol(binary_path, symbol, symbol_addr, &sym));
......@@ -421,7 +416,7 @@ void BPF::poll_perf_buffer(const std::string& name, int timeout) {
}
StatusTuple BPF::load_func(const std::string& func_name,
enum bpf_prog_type type, int& fd) {
bpf_prog_type type, int& fd) {
if (funcs_.find(func_name) != funcs_.end()) {
fd = funcs_[func_name];
return StatusTuple(0);
......@@ -462,7 +457,7 @@ StatusTuple BPF::check_binary_symbol(const std::string& binary_path,
const std::string& symbol,
uint64_t symbol_addr, bcc_symbol* output) {
int res = bcc_resolve_symname(binary_path.c_str(), symbol.c_str(),
symbol_addr, output);
symbol_addr, 0, output);
if (res < 0)
return StatusTuple(
-1, "Unable to find offset for binary %s symbol %s address %lx",
......@@ -471,14 +466,14 @@ StatusTuple BPF::check_binary_symbol(const std::string& binary_path,
}
std::string BPF::get_kprobe_event(const std::string& kernel_func,
bpf_attach_type type) {
bpf_probe_attach_type type) {
std::string res = attach_type_prefix(type) + "_";
res += sanitize_str(kernel_func, &BPF::kprobe_event_validator);
return res;
}
std::string BPF::get_uprobe_event(const std::string& binary_path,
uint64_t offset, bpf_attach_type type) {
uint64_t offset, bpf_probe_attach_type type) {
std::string res = attach_type_prefix(type) + "_";
res += sanitize_str(binary_path, &BPF::uprobe_path_validator);
res += "_0x" + uint_to_hex(offset);
......@@ -492,8 +487,7 @@ StatusTuple BPF::detach_kprobe_event(const std::string& event,
attr.reader_ptr = nullptr;
}
TRY2(unload_func(attr.func));
std::string detach_event = "-:kprobes/" + event;
if (bpf_detach_kprobe(detach_event.c_str()) < 0)
if (bpf_detach_kprobe(event.c_str()) < 0)
return StatusTuple(-1, "Unable to detach kprobe %s", event.c_str());
return StatusTuple(0);
}
......@@ -505,8 +499,7 @@ StatusTuple BPF::detach_uprobe_event(const std::string& event,
attr.reader_ptr = nullptr;
}
TRY2(unload_func(attr.func));
std::string detach_event = "-:uprobes/" + event;
if (bpf_detach_uprobe(detach_event.c_str()) < 0)
if (bpf_detach_uprobe(event.c_str()) < 0)
return StatusTuple(-1, "Unable to detach uprobe %s", event.c_str());
return StatusTuple(0);
}
......
......@@ -29,11 +29,6 @@
namespace ebpf {
enum class bpf_attach_type {
probe_entry,
probe_return
};
struct open_probe_t {
void* reader_ptr;
std::string func;
......@@ -56,23 +51,23 @@ public:
StatusTuple attach_kprobe(
const std::string& kernel_func, const std::string& probe_func,
bpf_attach_type attach_type = bpf_attach_type::probe_entry,
bpf_probe_attach_type = BPF_PROBE_ENTRY,
pid_t pid = -1, int cpu = 0, int group_fd = -1,
perf_reader_cb cb = nullptr, void* cb_cookie = nullptr);
StatusTuple detach_kprobe(
const std::string& kernel_func,
bpf_attach_type attach_type = bpf_attach_type::probe_entry);
bpf_probe_attach_type attach_type = BPF_PROBE_ENTRY);
StatusTuple attach_uprobe(
const std::string& binary_path, const std::string& symbol,
const std::string& probe_func, uint64_t symbol_addr = 0,
bpf_attach_type attach_type = bpf_attach_type::probe_entry,
bpf_probe_attach_type attach_type = BPF_PROBE_ENTRY,
pid_t pid = -1, int cpu = 0, int group_fd = -1,
perf_reader_cb cb = nullptr, void* cb_cookie = nullptr);
StatusTuple detach_uprobe(
const std::string& binary_path, const std::string& symbol,
uint64_t symbol_addr = 0,
bpf_attach_type attach_type = bpf_attach_type::probe_entry);
bpf_probe_attach_type attach_type = BPF_PROBE_ENTRY);
StatusTuple attach_usdt(const USDT& usdt, pid_t pid = -1, int cpu = 0,
int group_fd = -1);
StatusTuple detach_usdt(const USDT& usdt);
......@@ -111,9 +106,9 @@ private:
StatusTuple unload_func(const std::string& func_name);
std::string get_kprobe_event(const std::string& kernel_func,
bpf_attach_type type);
bpf_probe_attach_type type);
std::string get_uprobe_event(const std::string& binary_path, uint64_t offset,
bpf_attach_type type);
bpf_probe_attach_type type);
StatusTuple detach_kprobe_event(const std::string& event, open_probe_t& attr);
StatusTuple detach_uprobe_event(const std::string& event, open_probe_t& attr);
......@@ -121,21 +116,21 @@ private:
open_probe_t& attr);
StatusTuple detach_perf_event_all_cpu(open_probe_t& attr);
std::string attach_type_debug(bpf_attach_type type) {
std::string attach_type_debug(bpf_probe_attach_type type) {
switch (type) {
case bpf_attach_type::probe_entry:
case BPF_PROBE_ENTRY:
return "";
case bpf_attach_type::probe_return:
case BPF_PROBE_RETURN:
return "return ";
}
return "ERROR";
}
std::string attach_type_prefix(bpf_attach_type type) {
std::string attach_type_prefix(bpf_probe_attach_type type) {
switch (type) {
case bpf_attach_type::probe_entry:
case BPF_PROBE_ENTRY:
return "p";
case bpf_attach_type::probe_return:
case BPF_PROBE_RETURN:
return "r";
}
return "ERROR";
......
......@@ -25,6 +25,7 @@
#include "bcc_syms.h"
#include "libbpf.h"
#include "perf_reader.h"
#include "common.h"
namespace ebpf {
......@@ -89,7 +90,7 @@ StatusTuple BPFPerfBuffer::open_all_cpu(perf_reader_raw_cb cb,
if (cpu_readers_.size() != 0 || readers_.size() != 0)
return StatusTuple(-1, "Previously opened perf buffer not cleaned");
for (int i = 0; i < sysconf(_SC_NPROCESSORS_ONLN); i++) {
for (int i: get_online_cpus()) {
auto res = open_on_cpu(cb, i, cb_cookie);
if (res.code() != 0) {
TRY2(close_all_cpu());
......@@ -113,7 +114,7 @@ StatusTuple BPFPerfBuffer::close_on_cpu(int cpu) {
StatusTuple BPFPerfBuffer::close_all_cpu() {
std::string errors;
bool has_error = false;
for (int i = 0; i < sysconf(_SC_NPROCESSORS_ONLN); i++) {
for (int i: get_online_cpus()) {
auto res = close_on_cpu(i);
if (res.code() != 0) {
errors += "Failed to close CPU" + std::to_string(i) + " perf buffer: ";
......
......@@ -35,12 +35,12 @@ if (CMAKE_COMPILER_IS_GNUCC AND LIBCLANG_ISSTATIC)
endif()
endif()
add_library(bcc-shared SHARED bpf_common.cc bpf_module.cc libbpf.c perf_reader.c shared_table.cc exported_files.cc bcc_elf.c bcc_perf_map.c bcc_proc.c bcc_syms.cc usdt_args.cc usdt.cc BPF.cc BPFTable.cc)
add_library(bcc-shared SHARED bpf_common.cc bpf_module.cc libbpf.c perf_reader.c shared_table.cc exported_files.cc bcc_elf.c bcc_perf_map.c bcc_proc.c bcc_syms.cc usdt_args.cc usdt.cc common.cc BPF.cc BPFTable.cc)
set_target_properties(bcc-shared PROPERTIES VERSION ${REVISION_LAST} SOVERSION 0)
set_target_properties(bcc-shared PROPERTIES OUTPUT_NAME bcc)
add_library(bcc-loader-static libbpf.c perf_reader.c bcc_elf.c bcc_perf_map.c bcc_proc.c)
add_library(bcc-static STATIC bpf_common.cc bpf_module.cc shared_table.cc exported_files.cc bcc_syms.cc usdt_args.cc usdt.cc BPF.cc BPFTable.cc)
add_library(bcc-static STATIC bpf_common.cc bpf_module.cc shared_table.cc exported_files.cc bcc_syms.cc usdt_args.cc usdt.cc common.cc BPF.cc BPFTable.cc)
set_target_properties(bcc-static PROPERTIES OUTPUT_NAME bcc)
set(llvm_raw_libs bitwriter bpfcodegen irreader linker
......
......@@ -24,6 +24,7 @@
#include <stdint.h>
#include <ctype.h>
#include <stdio.h>
#include <math.h>
#include "bcc_perf_map.h"
#include "bcc_proc.h"
......@@ -307,13 +308,57 @@ static bool match_so_flags(int flags) {
return true;
}
const char *bcc_procutils_which_so(const char *libname) {
static bool which_so_in_process(const char* libname, int pid, char* libpath) {
  int ret, found = false;
  char endline[4096], *mapname = NULL, *newline;
  char mappings_file[128];
  const size_t search_len = strlen(libname) + strlen("/lib.");
  char search1[search_len + 1];
  char search2[search_len + 1];
  sprintf(mappings_file, "/proc/%ld/maps", (long)pid);
  FILE *fp = fopen(mappings_file, "r");
  if (!fp)
    return false;
  /* Match both "/lib<name>." and "/lib<name>-" so that versioned
   * sonames such as libpthread-2.24.so are found as well. */
  snprintf(search1, search_len + 1, "/lib%s.", libname);
  snprintf(search2, search_len + 1, "/lib%s-", libname);
  do {
    /* Skip the address, permissions, offset, device and inode columns;
     * the rest of the line (if any) is the mapped pathname. */
    ret = fscanf(fp, "%*x-%*x %*s %*x %*s %*d");
    if (!fgets(endline, sizeof(endline), fp))
      break;
    mapname = endline;
    newline = strchr(endline, '\n');
    if (newline)
      newline[0] = '\0';
    while (isspace(mapname[0])) mapname++;
    if (strstr(mapname, ".so") && (strstr(mapname, search1) ||
                                   strstr(mapname, search2))) {
      found = true;
      memcpy(libpath, mapname, strlen(mapname) + 1);
      break;
    }
  } while (ret != EOF);
  fclose(fp);
  return found;
}
char *bcc_procutils_which_so(const char *libname, int pid) {
const size_t soname_len = strlen(libname) + strlen("lib.so");
char soname[soname_len + 1];
char libpath[4096];
int i;
if (strchr(libname, '/'))
return libname;
return strdup(libname);
if (pid && which_so_in_process(libname, pid, libpath))
return strdup(libpath);
if (lib_cache_count < 0)
return NULL;
......@@ -327,8 +372,13 @@ const char *bcc_procutils_which_so(const char *libname) {
for (i = 0; i < lib_cache_count; ++i) {
if (!strncmp(lib_cache[i].libname, soname, soname_len) &&
match_so_flags(lib_cache[i].flags))
return lib_cache[i].path;
match_so_flags(lib_cache[i].flags)) {
return strdup(lib_cache[i].path);
}
}
return NULL;
}
void bcc_procutils_free(const char *ptr) {
free((void *)ptr);
}
......@@ -27,11 +27,12 @@ extern "C" {
typedef int (*bcc_procutils_modulecb)(const char *, uint64_t, uint64_t, void *);
typedef void (*bcc_procutils_ksymcb)(const char *, uint64_t, void *);
const char *bcc_procutils_which_so(const char *libname);
char *bcc_procutils_which_so(const char *libname, int pid);
char *bcc_procutils_which(const char *binpath);
int bcc_procutils_each_module(int pid, bcc_procutils_modulecb callback,
void *payload);
int bcc_procutils_each_ksym(bcc_procutils_ksymcb callback, void *payload);
void bcc_procutils_free(const char *ptr);
#ifdef __cplusplus
}
......
......@@ -304,7 +304,7 @@ int bcc_foreach_symbol(const char *module, SYM_CB cb) {
}
int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym) {
const uint64_t addr, int pid, struct bcc_symbol *sym) {
uint64_t load_addr;
sym->module = NULL;
......@@ -315,9 +315,9 @@ int bcc_resolve_symname(const char *module, const char *symname,
return -1;
if (strchr(module, '/')) {
sym->module = module;
sym->module = strdup(module);
} else {
sym->module = bcc_procutils_which_so(module);
sym->module = bcc_procutils_which_so(module, pid);
}
if (sym->module == NULL)
......
......@@ -43,7 +43,7 @@ int bcc_resolve_global_addr(int pid, const char *module, const uint64_t address,
int bcc_foreach_symbol(const char *module, SYM_CB cb);
int bcc_find_symbol_addr(struct bcc_symbol *sym);
int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym);
const uint64_t addr, int pid, struct bcc_symbol *sym);
#ifdef __cplusplus
}
#endif
......
......@@ -345,7 +345,8 @@ int BPFModule::load_includes(const string &text) {
int BPFModule::annotate() {
for (auto fn = mod_->getFunctionList().begin(); fn != mod_->getFunctionList().end(); ++fn)
fn->addFnAttr(Attribute::AlwaysInline);
if (!fn->hasFnAttribute(Attribute::NoInline))
fn->addFnAttr(Attribute::AlwaysInline);
// separate module to hold the reader functions
auto m = make_unique<Module>("sscanf", *ctx_);
......
/*
* Copyright (c) 2016 Catalysts GmbH
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <fstream>
#include <sstream>
#include "common.h"
namespace ebpf {
std::vector<int> read_cpu_range(std::string path) {
  std::ifstream cpus_range_stream { path };
  std::vector<int> cpus;
  std::string cpu_range;

  // sysfs CPU lists are comma-separated ranges, e.g. "0-3,5,7-8".
  while (std::getline(cpus_range_stream, cpu_range, ',')) {
    std::size_t rangeop = cpu_range.find('-');
    if (rangeop == std::string::npos) {
      // Single CPU entry, e.g. "5".
      cpus.push_back(std::stoi(cpu_range));
    } else {
      // Inclusive range entry, e.g. "0-3".
      int start = std::stoi(cpu_range.substr(0, rangeop));
      int end = std::stoi(cpu_range.substr(rangeop + 1));
      for (int i = start; i <= end; i++)
        cpus.push_back(i);
    }
  }
  return cpus;
}
std::vector<int> get_online_cpus() {
return read_cpu_range("/sys/devices/system/cpu/online");
}
std::vector<int> get_possible_cpus() {
return read_cpu_range("/sys/devices/system/cpu/possible");
}
} // namespace ebpf
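The files read here list CPUs as comma-separated ranges such as `0-3,5,7-8`, which `read_cpu_range` expands into individual IDs. A hedged usage sketch (assuming the `common.h` declared below is on the include path):
```cpp
// Illustrative caller of the new helpers; not part of this patch.
#include <iostream>
#include "common.h"

int main() {
  // On a CPU-hotplug system the online set can be sparse (e.g. 0,1,4,5),
  // which is why the loops over 0..sysconf(_SC_NPROCESSORS_ONLN)-1 in
  // BPF.cc and BPFTable.cc are replaced by get_online_cpus().
  for (int cpu : ebpf::get_online_cpus())
    std::cout << "online cpu: " << cpu << "\n";
  return 0;
}
```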
......@@ -19,6 +19,7 @@
#include <memory>
#include <string>
#include <tuple>
#include <vector>
namespace ebpf {
......@@ -28,4 +29,8 @@ make_unique(Args &&... args) {
return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}
std::vector<int> get_online_cpus();
std::vector<int> get_possible_cpus();
} // namespace ebpf
......@@ -63,6 +63,12 @@ struct bpf_insn {
__s32 imm; /* signed immediate constant */
};
/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
struct bpf_lpm_trie_key {
__u32 prefixlen; /* up to 32 for AF_INET, 128 for AF_INET6 */
__u8 data[0]; /* Arbitrary size */
};
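/* Illustration (not part of this patch): a key for the IPv4 prefix
 * 192.168.0.0/16 would be allocated with four trailing bytes for the
 * flexible array and filled roughly as:
 *
 *   struct bpf_lpm_trie_key *key = malloc(sizeof(*key) + 4);
 *   key->prefixlen = 16;                       // leading bits to match
 *   memcpy(key->data, "\xc0\xa8\x00\x00", 4);  // 192.168.0.0
 */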
/* BPF syscall commands, see bpf(2) man-page for details. */
enum bpf_cmd {
BPF_MAP_CREATE,
......@@ -73,6 +79,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
BPF_PROG_ATTACH,
BPF_PROG_DETACH,
};
enum bpf_map_type {
......@@ -87,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
BPF_MAP_TYPE_LPM_TRIE,
};
enum bpf_prog_type {
......@@ -98,8 +107,22 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
BPF_PROG_TYPE_CGROUP_SOCK,
BPF_PROG_TYPE_LWT_IN,
BPF_PROG_TYPE_LWT_OUT,
BPF_PROG_TYPE_LWT_XMIT,
};
enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
BPF_CGROUP_INET_SOCK_CREATE,
__MAX_BPF_ATTACH_TYPE
};
#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
#define BPF_PSEUDO_MAP_FD 1
/* flags for BPF_MAP_UPDATE_ELEM command */
......@@ -150,243 +173,327 @@ union bpf_attr {
__aligned_u64 pathname;
__u32 bpf_fd;
};
struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
__u32 target_fd; /* container object to attach to */
__u32 attach_bpf_fd; /* eBPF program to attach */
__u32 attach_type;
};
} __attribute__((aligned(8)));
/* BPF helper function descriptions:
*
* void *bpf_map_lookup_elem(&map, &key)
* Return: Map value or NULL
*
* int bpf_map_update_elem(&map, &key, &value, flags)
* Return: 0 on success or negative error
*
* int bpf_map_delete_elem(&map, &key)
* Return: 0 on success or negative error
*
* int bpf_probe_read(void *dst, int size, void *src)
* Return: 0 on success or negative error
*
* u64 bpf_ktime_get_ns(void)
* Return: current ktime
*
* int bpf_trace_printk(const char *fmt, int fmt_size, ...)
* Return: length of buffer written or negative error
*
* u32 bpf_prandom_u32(void)
* Return: random value
*
* u32 bpf_raw_smp_processor_id(void)
* Return: SMP processor ID
*
* int bpf_skb_store_bytes(skb, offset, from, len, flags)
* store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l3_csum_replace(skb, offset, from, to, flags)
* recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l4_csum_replace(skb, offset, from, to, flags)
* recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_tail_call(ctx, prog_array_map, index)
* jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success or negative error
*
* int bpf_clone_redirect(skb, ifindex, flags)
* redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success or negative error
*
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*
* int bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success or negative error
*
* u32 bpf_get_cgroup_classid(skb)
* retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*
* int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
* Return: 0 on success or negative error
*
* int bpf_skb_vlan_pop(skb)
* Return: 0 on success or negative error
*
* int bpf_skb_get_tunnel_key(skb, key, size, flags)
* int bpf_skb_set_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
* Return: 0 on success or negative error
*
* u64 bpf_perf_event_read(&map, index)
* Return: Number events read or error code
*
* int bpf_redirect(ifindex, flags)
* redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*
* u32 bpf_get_route_realm(skb)
* retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*
* int bpf_perf_event_output(ctx, map, index, data, size)
* output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success or negative error
*
* int bpf_get_stackid(ctx, map, flags)
* walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
* @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*
* s64 bpf_csum_diff(from, from_size, to, to_size, seed)
* calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result or negative error code
*
* int bpf_skb_get_tunnel_opt(skb, opt, size)
* retrieve tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: option size
*
* int bpf_skb_set_tunnel_opt(skb, opt, size)
* populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success or negative error
*
* int bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported are v4 -> v6 and
* v6 -> v4 transitions. The helper will also resize the skb. eBPF
* program is expected to fill the new headers via skb_store_bytes
* and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*
* int bpf_skb_under_cgroup(skb, map, index)
* Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*
* u32 bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*
* int bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*
* int bpf_current_task_under_cgroup(map, index)
* Check cgroup2 membership of current task
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 current failed the cgroup2 descendant test
* == 1 current succeeded the cgroup2 descendant test
* < 0 error
*
* int bpf_skb_change_tail(skb, len, flags)
* The helper will resize the skb to the given new size, to be used e.g.
* with control messages.
* @skb: pointer to skb
* @len: new skb length
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_pull_data(skb, len)
* The helper will pull in non-linear data in case the skb is non-linear
* and not all of len are part of the linear section. Only needed for
* read/write with direct packet access.
* @skb: pointer to skb
* @len: len to make read/writeable
* Return: 0 on success or negative error
*
* s64 bpf_csum_update(skb, csum)
* Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
* @skb: pointer to skb
* @csum: csum to add
* Return: csum on success or negative error
*
* void bpf_set_hash_invalid(skb)
* Invalidate current skb->hash.
* @skb: pointer to skb
*
* int bpf_get_numa_node_id()
* Return: Id of current NUMA node.
*
* int bpf_skb_change_head()
* Grows headroom of skb and adjusts MAC header offset accordingly.
* Will extend/reallocate as required automatically.
* May change skb data pointer and will thus invalidate any check
* performed for direct packet access.
* @skb: pointer to skb
* @len: length of header to be pushed in front
* @flags: Flags (unused for now)
* Return: 0 on success or negative error
*
* int bpf_xdp_adjust_head(xdp_md, delta)
* Adjust the xdp_md.data by delta
* @xdp_md: pointer to xdp_md
* @delta: A positive/negative integer to be added to xdp_md.data
* Return: 0 on success or negative on error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
FN(map_lookup_elem), \
FN(map_update_elem), \
FN(map_delete_elem), \
FN(probe_read), \
FN(ktime_get_ns), \
FN(trace_printk), \
FN(get_prandom_u32), \
FN(get_smp_processor_id), \
FN(skb_store_bytes), \
FN(l3_csum_replace), \
FN(l4_csum_replace), \
FN(tail_call), \
FN(clone_redirect), \
FN(get_current_pid_tgid), \
FN(get_current_uid_gid), \
FN(get_current_comm), \
FN(get_cgroup_classid), \
FN(skb_vlan_push), \
FN(skb_vlan_pop), \
FN(skb_get_tunnel_key), \
FN(skb_set_tunnel_key), \
FN(perf_event_read), \
FN(redirect), \
FN(get_route_realm), \
FN(perf_event_output), \
FN(skb_load_bytes), \
FN(get_stackid), \
FN(csum_diff), \
FN(skb_get_tunnel_opt), \
FN(skb_set_tunnel_opt), \
FN(skb_change_proto), \
FN(skb_change_type), \
FN(skb_under_cgroup), \
FN(get_hash_recalc), \
FN(get_current_task), \
FN(probe_write_user), \
FN(current_task_under_cgroup), \
FN(skb_change_tail), \
FN(skb_pull_data), \
FN(csum_update), \
FN(set_hash_invalid), \
FN(get_numa_node_id), \
FN(skb_change_head), \
FN(xdp_adjust_head),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
*/
#define __BPF_ENUM_FN(x) BPF_FUNC_ ## x
enum bpf_func_id {
BPF_FUNC_unspec,
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
BPF_FUNC_trace_printk, /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
BPF_FUNC_get_prandom_u32, /* u32 prandom_u32(void) */
BPF_FUNC_get_smp_processor_id, /* u32 raw_smp_processor_id(void) */
/**
* skb_store_bytes(skb, offset, from, len, flags) - store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_skb_store_bytes,
/**
* l3_csum_replace(skb, offset, from, to, flags) - recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l3_csum_replace,
/**
* l4_csum_replace(skb, offset, from, to, flags) - recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l4_csum_replace,
/**
* bpf_tail_call(ctx, prog_array_map, index) - jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success
*/
BPF_FUNC_tail_call,
/**
* bpf_clone_redirect(skb, ifindex, flags) - redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_clone_redirect,
/**
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*/
BPF_FUNC_get_current_pid_tgid,
/**
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*/
BPF_FUNC_get_current_uid_gid,
/**
* bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success
*/
BPF_FUNC_get_current_comm,
/**
* bpf_get_cgroup_classid(skb) - retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*/
BPF_FUNC_get_cgroup_classid,
BPF_FUNC_skb_vlan_push, /* bpf_skb_vlan_push(skb, vlan_proto, vlan_tci) */
BPF_FUNC_skb_vlan_pop, /* bpf_skb_vlan_pop(skb) */
/**
* bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
 * Return: 0 on success
*/
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */
/**
* bpf_redirect(ifindex, flags) - redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*/
BPF_FUNC_redirect,
/**
* bpf_get_route_realm(skb) - retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*/
BPF_FUNC_get_route_realm,
/**
* bpf_perf_event_output(ctx, map, index, data, size) - output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success
*/
BPF_FUNC_perf_event_output,
BPF_FUNC_skb_load_bytes,
/**
* bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
 * @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*/
BPF_FUNC_get_stackid,
/**
* bpf_csum_diff(from, from_size, to, to_size, seed) - calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result
*/
BPF_FUNC_csum_diff,
/**
* bpf_skb_[gs]et_tunnel_opt(skb, opt, size)
* retrieve or populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success for set, option size for get
*/
BPF_FUNC_skb_get_tunnel_opt,
BPF_FUNC_skb_set_tunnel_opt,
/**
* bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported is
* v4 -> v6, v6 -> v4 transitions. The helper will also
* resize the skb. eBPF program is expected to fill the
* new headers via skb_store_bytes and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_proto,
/**
* bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_type,
/**
* bpf_skb_in_cgroup(skb, map, index) - Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*/
BPF_FUNC_skb_in_cgroup,
/**
* bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*/
BPF_FUNC_get_hash_recalc,
/**
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*/
BPF_FUNC_get_current_task,
/**
* bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*/
BPF_FUNC_probe_write_user,
__BPF_FUNC_MAPPER(__BPF_ENUM_FN)
__BPF_FUNC_MAX_ID,
};
#undef __BPF_ENUM_FN
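
For orientation, a minimal bcc-style sketch of how a probe would call a few of the helpers enumerated above; the attach point `sys_clone` and the probe name are illustrative, not part of this change:

```
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

// Exercises three helpers from the enum above: get_current_pid_tgid,
// get_current_comm and trace_printk.
int trace_clone(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();  // tgid << 32 | pid
    char comm[TASK_COMM_LEN];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("clone by %s (tgid %d)\n", comm, pid_tgid >> 32);
    return 0;
}
```
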
/* All flags used by eBPF helper functions, placed here. */
......@@ -460,6 +567,31 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
/* Generic BPF return codes which all BPF program types may support.
 * The values are binary compatible with their TC_ACT_* counterparts to
* provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
* programs.
*
 * XDP is handled separately, see XDP_*.
*/
enum bpf_ret_code {
BPF_OK = 0,
/* 1 reserved */
BPF_DROP = 2,
/* 3-6 reserved */
BPF_REDIRECT = 7,
/* >127 are reserved for prog type specific return codes */
};
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
#define XDP_PACKET_HEADROOM 256
/* User return codes for XDP prog type.
* A valid XDP program must return one of these defined values. All other
* return codes are reserved for future use. Unknown return codes will result
......
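
A hedged sketch of the bpf_xdp_adjust_head() pattern documented above, assuming an XDP-capable kernel; the ENCAP_LEN constant and program name are illustrative:

```
#include <uapi/linux/bpf.h>

#define ENCAP_LEN 8  // illustrative size of a custom header

int xdp_push_header(struct xdp_md *ctx) {
    // A negative delta grows headroom (XDP_PACKET_HEADROOM permitting).
    if (bpf_xdp_adjust_head(ctx, -ENCAP_LEN))
        return XDP_ABORTED;
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    // Bounds must be re-validated after any head adjustment.
    if (data + ENCAP_LEN > data_end)
        return XDP_DROP;
    __builtin_memset(data, 0, ENCAP_LEN);  // fill in the new header
    return XDP_PASS;
}
```
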
......@@ -64,6 +64,12 @@ struct bpf_insn {
__s32 imm; /* signed immediate constant */
};
/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
struct bpf_lpm_trie_key {
__u32 prefixlen; /* up to 32 for AF_INET, 128 for AF_INET6 */
__u8 data[0]; /* Arbitrary size */
};
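
A userspace sketch of filling this key for an IPv4 /24 lookup; the fixed-size wrapper struct is an assumption for illustration, since `data[0]` is variable-length:

```
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

struct lpm_key_v4 {
    uint32_t prefixlen;   // up to 32 for AF_INET
    uint8_t  data[4];     // address bytes in network order
};

static void fill_lpm_key_v4(struct lpm_key_v4 *key, const char *addr_str,
                            uint32_t prefixlen) {
    struct in_addr addr;
    inet_pton(AF_INET, addr_str, &addr);   // e.g. "192.168.1.0"
    key->prefixlen = prefixlen;            // e.g. 24
    memcpy(key->data, &addr, sizeof(key->data));
}
```
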
/* BPF syscall commands, see bpf(2) man-page for details. */
enum bpf_cmd {
BPF_MAP_CREATE,
......@@ -74,6 +80,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
BPF_PROG_ATTACH,
BPF_PROG_DETACH,
};
enum bpf_map_type {
......@@ -88,6 +96,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
BPF_MAP_TYPE_LPM_TRIE,
};
enum bpf_prog_type {
......@@ -99,8 +108,22 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
BPF_PROG_TYPE_CGROUP_SOCK,
BPF_PROG_TYPE_LWT_IN,
BPF_PROG_TYPE_LWT_OUT,
BPF_PROG_TYPE_LWT_XMIT,
};
enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
BPF_CGROUP_INET_SOCK_CREATE,
__MAX_BPF_ATTACH_TYPE
};
#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
#define BPF_PSEUDO_MAP_FD 1
/* flags for BPF_MAP_UPDATE_ELEM command */
......@@ -151,243 +174,327 @@ union bpf_attr {
__aligned_u64 pathname;
__u32 bpf_fd;
};
struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
__u32 target_fd; /* container object to attach to */
__u32 attach_bpf_fd; /* eBPF program to attach */
__u32 attach_type;
};
} __attribute__((aligned(8)));
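
A minimal userspace sketch of the new attach command, assuming `prog_fd` is a loaded BPF_PROG_TYPE_CGROUP_SKB program and `cgroup_fd` an open cgroup2 directory fd; it only builds against a linux/bpf.h that carries this uapi:

```
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int attach_cgroup_ingress(int prog_fd, int cgroup_fd) {
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.target_fd = cgroup_fd;                  // container object
    attr.attach_bpf_fd = prog_fd;                // eBPF program to attach
    attr.attach_type = BPF_CGROUP_INET_INGRESS;
    return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```
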
/* BPF helper function descriptions:
*
* void *bpf_map_lookup_elem(&map, &key)
* Return: Map value or NULL
*
* int bpf_map_update_elem(&map, &key, &value, flags)
* Return: 0 on success or negative error
*
* int bpf_map_delete_elem(&map, &key)
* Return: 0 on success or negative error
*
* int bpf_probe_read(void *dst, int size, void *src)
* Return: 0 on success or negative error
*
* u64 bpf_ktime_get_ns(void)
* Return: current ktime
*
* int bpf_trace_printk(const char *fmt, int fmt_size, ...)
* Return: length of buffer written or negative error
*
* u32 bpf_prandom_u32(void)
* Return: random value
*
* u32 bpf_raw_smp_processor_id(void)
* Return: SMP processor ID
*
* int bpf_skb_store_bytes(skb, offset, from, len, flags)
* store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l3_csum_replace(skb, offset, from, to, flags)
* recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l4_csum_replace(skb, offset, from, to, flags)
* recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_tail_call(ctx, prog_array_map, index)
* jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success or negative error
*
* int bpf_clone_redirect(skb, ifindex, flags)
* redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success or negative error
*
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*
* int bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success or negative error
*
* u32 bpf_get_cgroup_classid(skb)
* retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*
* int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
* Return: 0 on success or negative error
*
* int bpf_skb_vlan_pop(skb)
* Return: 0 on success or negative error
*
* int bpf_skb_get_tunnel_key(skb, key, size, flags)
* int bpf_skb_set_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
* Return: 0 on success or negative error
*
* u64 bpf_perf_event_read(&map, index)
 * Return: Number of events read or error code
*
* int bpf_redirect(ifindex, flags)
* redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*
* u32 bpf_get_route_realm(skb)
* retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*
* int bpf_perf_event_output(ctx, map, index, data, size)
* output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success or negative error
*
* int bpf_get_stackid(ctx, map, flags)
* walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
 * @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*
* s64 bpf_csum_diff(from, from_size, to, to_size, seed)
* calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result or negative error code
*
* int bpf_skb_get_tunnel_opt(skb, opt, size)
* retrieve tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: option size
*
* int bpf_skb_set_tunnel_opt(skb, opt, size)
* populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success or negative error
*
* int bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported is v4 -> v6,
* v6 -> v4 transitions. The helper will also resize the skb. eBPF
* program is expected to fill the new headers via skb_store_bytes
* and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*
* int bpf_skb_under_cgroup(skb, map, index)
* Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*
* u32 bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*
* int bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*
* int bpf_current_task_under_cgroup(map, index)
* Check cgroup2 membership of current task
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 current failed the cgroup2 descendant test
* == 1 current succeeded the cgroup2 descendant test
* < 0 error
*
* int bpf_skb_change_tail(skb, len, flags)
 *     The helper will resize the skb to the given new size, to be used e.g.
* with control messages.
* @skb: pointer to skb
* @len: new skb length
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_pull_data(skb, len)
* The helper will pull in non-linear data in case the skb is non-linear
 *     and not all of len is part of the linear section. Only needed for
* read/write with direct packet access.
* @skb: pointer to skb
* @len: len to make read/writeable
* Return: 0 on success or negative error
*
* s64 bpf_csum_update(skb, csum)
* Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
* @skb: pointer to skb
* @csum: csum to add
* Return: csum on success or negative error
*
* void bpf_set_hash_invalid(skb)
* Invalidate current skb->hash.
* @skb: pointer to skb
*
* int bpf_get_numa_node_id()
 *     Return: ID of the current NUMA node.
*
* int bpf_skb_change_head()
* Grows headroom of skb and adjusts MAC header offset accordingly.
 *     Will extend/reallocate as required automatically.
* May change skb data pointer and will thus invalidate any check
* performed for direct packet access.
* @skb: pointer to skb
* @len: length of header to be pushed in front
* @flags: Flags (unused for now)
* Return: 0 on success or negative error
*
* int bpf_xdp_adjust_head(xdp_md, delta)
* Adjust the xdp_md.data by delta
* @xdp_md: pointer to xdp_md
 *     @delta: A positive or negative integer to be added to xdp_md.data
* Return: 0 on success or negative on error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
FN(map_lookup_elem), \
FN(map_update_elem), \
FN(map_delete_elem), \
FN(probe_read), \
FN(ktime_get_ns), \
FN(trace_printk), \
FN(get_prandom_u32), \
FN(get_smp_processor_id), \
FN(skb_store_bytes), \
FN(l3_csum_replace), \
FN(l4_csum_replace), \
FN(tail_call), \
FN(clone_redirect), \
FN(get_current_pid_tgid), \
FN(get_current_uid_gid), \
FN(get_current_comm), \
FN(get_cgroup_classid), \
FN(skb_vlan_push), \
FN(skb_vlan_pop), \
FN(skb_get_tunnel_key), \
FN(skb_set_tunnel_key), \
FN(perf_event_read), \
FN(redirect), \
FN(get_route_realm), \
FN(perf_event_output), \
FN(skb_load_bytes), \
FN(get_stackid), \
FN(csum_diff), \
FN(skb_get_tunnel_opt), \
FN(skb_set_tunnel_opt), \
FN(skb_change_proto), \
FN(skb_change_type), \
FN(skb_under_cgroup), \
FN(get_hash_recalc), \
FN(get_current_task), \
FN(probe_write_user), \
FN(current_task_under_cgroup), \
FN(skb_change_tail), \
FN(skb_pull_data), \
FN(csum_update), \
FN(set_hash_invalid), \
FN(get_numa_node_id), \
FN(skb_change_head), \
FN(xdp_adjust_head),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
*/
#define __BPF_ENUM_FN(x) BPF_FUNC_ ## x
enum bpf_func_id {
BPF_FUNC_unspec,
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
BPF_FUNC_trace_printk, /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
BPF_FUNC_get_prandom_u32, /* u32 prandom_u32(void) */
BPF_FUNC_get_smp_processor_id, /* u32 raw_smp_processor_id(void) */
/**
* skb_store_bytes(skb, offset, from, len, flags) - store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_skb_store_bytes,
/**
* l3_csum_replace(skb, offset, from, to, flags) - recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l3_csum_replace,
/**
* l4_csum_replace(skb, offset, from, to, flags) - recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l4_csum_replace,
/**
* bpf_tail_call(ctx, prog_array_map, index) - jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success
*/
BPF_FUNC_tail_call,
/**
* bpf_clone_redirect(skb, ifindex, flags) - redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_clone_redirect,
/**
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*/
BPF_FUNC_get_current_pid_tgid,
/**
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*/
BPF_FUNC_get_current_uid_gid,
/**
* bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success
*/
BPF_FUNC_get_current_comm,
/**
* bpf_get_cgroup_classid(skb) - retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*/
BPF_FUNC_get_cgroup_classid,
BPF_FUNC_skb_vlan_push, /* bpf_skb_vlan_push(skb, vlan_proto, vlan_tci) */
BPF_FUNC_skb_vlan_pop, /* bpf_skb_vlan_pop(skb) */
/**
* bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
 * Return: 0 on success
*/
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */
/**
* bpf_redirect(ifindex, flags) - redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*/
BPF_FUNC_redirect,
/**
* bpf_get_route_realm(skb) - retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*/
BPF_FUNC_get_route_realm,
/**
* bpf_perf_event_output(ctx, map, index, data, size) - output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success
*/
BPF_FUNC_perf_event_output,
BPF_FUNC_skb_load_bytes,
/**
* bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
 * @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*/
BPF_FUNC_get_stackid,
/**
* bpf_csum_diff(from, from_size, to, to_size, seed) - calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result
*/
BPF_FUNC_csum_diff,
/**
* bpf_skb_[gs]et_tunnel_opt(skb, opt, size)
* retrieve or populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success for set, option size for get
*/
BPF_FUNC_skb_get_tunnel_opt,
BPF_FUNC_skb_set_tunnel_opt,
/**
* bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported is
* v4 -> v6, v6 -> v4 transitions. The helper will also
* resize the skb. eBPF program is expected to fill the
* new headers via skb_store_bytes and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_proto,
/**
* bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_type,
/**
* bpf_skb_in_cgroup(skb, map, index) - Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*/
BPF_FUNC_skb_in_cgroup,
/**
* bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*/
BPF_FUNC_get_hash_recalc,
/**
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*/
BPF_FUNC_get_current_task,
/**
* bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*/
BPF_FUNC_probe_write_user,
__BPF_FUNC_MAPPER(__BPF_ENUM_FN)
__BPF_FUNC_MAX_ID,
};
#undef __BPF_ENUM_FN
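
As a sketch of the bpf_tail_call() entry above, using bcc's prog-array table syntax; the map name and index are illustrative:

```
#include <uapi/linux/ptrace.h>

BPF_TABLE("prog", int, int, jump_table, 8);

int dispatch(struct pt_regs *ctx) {
    // bcc rewrites .call() into bpf_tail_call(); control does not return
    // here when a program is installed at index 1.
    jump_table.call(ctx, 1);
    return 0;  // fall-through: no program at that index
}
```
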
/* All flags used by eBPF helper functions, placed here. */
......@@ -461,6 +568,31 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
/* Generic BPF return codes which all BPF program types may support.
 * The values are binary compatible with their TC_ACT_* counterparts to
* provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
* programs.
*
 * XDP is handled separately, see XDP_*.
*/
enum bpf_ret_code {
BPF_OK = 0,
/* 1 reserved */
BPF_DROP = 2,
/* 3-6 reserved */
BPF_REDIRECT = 7,
/* >127 are reserved for prog type specific return codes */
};
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
#define XDP_PACKET_HEADROOM 256
/* User return codes for XDP prog type.
* A valid XDP program must return one of these defined values. All other
* return codes are reserved for future use. Unknown return codes will result
......
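
A hedged tc-classifier sketch of the bpf_skb_pull_data() rule described above: pull the needed bytes into the linear area, then re-check bounds before direct packet access:

```
#include <uapi/linux/bpf.h>
#include <uapi/linux/pkt_cls.h>

int tc_peek(struct __sk_buff *skb) {
    if (bpf_skb_pull_data(skb, 14))      // make the Ethernet header linear
        return TC_ACT_OK;
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    if (data + 14 > data_end)            // verifier-required bounds check
        return TC_ACT_OK;
    // bytes 0..13 (the Ethernet header) are now safely readable here
    return TC_ACT_OK;
}
```
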
......@@ -147,7 +147,7 @@ static u32 (*bpf_get_prandom_u32)(void) =
static int (*bpf_trace_printk_)(const char *fmt, u64 fmt_size, ...) =
(void *) BPF_FUNC_trace_printk;
int bpf_trace_printk(const char *fmt, ...) asm("llvm.bpf.extra");
static void bpf_tail_call_(u64 map_fd, void *ctx, int index) {
static inline void bpf_tail_call_(u64 map_fd, void *ctx, int index) {
((void (*)(void *, u64, int))BPF_FUNC_tail_call)(ctx, map_fd, index);
}
static int (*bpf_clone_redirect)(void *ctx, int ifindex, u32 flags) =
......@@ -180,8 +180,6 @@ static int (*bpf_perf_event_output)(void *ctx, void *map, u64 index, void *data,
(void *) BPF_FUNC_perf_event_output;
static int (*bpf_skb_load_bytes)(void *ctx, int offset, void *to, u32 len) =
(void *) BPF_FUNC_skb_load_bytes;
static u64 (*bpf_get_current_task)(void) =
(void *) BPF_FUNC_get_current_task;
/* bpf_get_stackid will return a negative value in the case of an error
*
......@@ -203,6 +201,34 @@ int bpf_get_stackid(uintptr_t map, void *ctx, u64 flags) {
static int (*bpf_csum_diff)(void *from, u64 from_size, void *to, u64 to_size, u64 seed) =
(void *) BPF_FUNC_csum_diff;
static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, u32 size) =
(void *) BPF_FUNC_skb_get_tunnel_opt;
static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, u32 size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
static int (*bpf_skb_change_proto)(void *ctx, u16 proto, u64 flags) =
(void *) BPF_FUNC_skb_change_proto;
static int (*bpf_skb_change_type)(void *ctx, u32 type) =
(void *) BPF_FUNC_skb_change_type;
static u32 (*bpf_get_hash_recalc)(void *ctx) =
(void *) BPF_FUNC_get_hash_recalc;
static u64 (*bpf_get_current_task)(void) =
(void *) BPF_FUNC_get_current_task;
static int (*bpf_probe_write_user)(void *dst, void *src, u32 size) =
(void *) BPF_FUNC_probe_write_user;
static int (*bpf_skb_change_tail)(void *ctx, u32 new_len, u64 flags) =
(void *) BPF_FUNC_skb_change_tail;
static int (*bpf_skb_pull_data)(void *ctx, u32 len) =
(void *) BPF_FUNC_skb_pull_data;
static int (*bpf_csum_update)(void *ctx, u16 csum) =
(void *) BPF_FUNC_csum_update;
static int (*bpf_set_hash_invalid)(void *ctx) =
(void *) BPF_FUNC_set_hash_invalid;
static int (*bpf_get_numa_node_id)(void) =
(void *) BPF_FUNC_get_numa_node_id;
static int (*bpf_skb_change_head)(void *ctx, u32 len, u64 flags) =
(void *) BPF_FUNC_skb_change_head;
static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
(void *) BPF_FUNC_xdp_adjust_head;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
......@@ -284,7 +310,7 @@ static inline void bpf_store_dword(void *skb, u64 off, u64 val) {
#define MASK(_n) ((_n) < 64 ? (1ull << (_n)) - 1 : ((u64)-1LL))
#define MASK128(_n) ((_n) < 128 ? ((unsigned __int128)1 << (_n)) - 1 : ((unsigned __int128)-1))
static unsigned int bpf_log2(unsigned int v)
static inline unsigned int bpf_log2(unsigned int v)
{
unsigned int r;
unsigned int shift;
......@@ -297,7 +323,7 @@ static unsigned int bpf_log2(unsigned int v)
return r;
}
static unsigned int bpf_log2l(unsigned long v)
static inline unsigned int bpf_log2l(unsigned long v)
{
unsigned int hi = v >> 32;
if (hi)
......@@ -473,5 +499,18 @@ int bpf_usdt_readarg_p(int argc, struct pt_regs *ctx, void *buf, u64 len) asm("l
#define TRACEPOINT_PROBE(category, event) \
int tracepoint__##category##__##event(struct tracepoint__##category##__##event *args)
#define TP_DATA_LOC_READ_CONST(dst, field, length) \
do { \
short __offset = args->data_loc_##field & 0xFFFF; \
bpf_probe_read((void *)dst, length, (char *)args + __offset); \
} while (0);
#define TP_DATA_LOC_READ(dst, field) \
do { \
short __offset = args->data_loc_##field & 0xFFFF; \
short __length = args->data_loc_##field >> 16; \
bpf_probe_read((void *)dst, __length, (char *)args + __offset); \
} while (0);
#endif
)********"
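
A sketch of the new TP_DATA_LOC_READ_CONST macro, using irq:irq_handler_entry, whose `name` field is a `__data_loc char[]` and therefore lands in the generated args struct as `int data_loc_name`:

```
TRACEPOINT_PROBE(irq, irq_handler_entry) {
    char irq_name[64] = {};
    TP_DATA_LOC_READ_CONST(irq_name, name, sizeof(irq_name));
    bpf_trace_printk("irq handler: %s\n", irq_name);
    return 0;
}
```
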
......@@ -15,6 +15,9 @@ R"********(
* limitations under the License.
*/
#ifndef __BCC_PROTO_H
#define __BCC_PROTO_H
#include <uapi/linux/if_ether.h>
#define BPF_PACKET_HEADER __attribute__((packed)) __attribute__((deprecated("packet")))
......@@ -142,4 +145,6 @@ struct vxlan_gbp_t {
unsigned int key:24;
unsigned int rsv6:8;
} BPF_PACKET_HEADER;
#endif
)********"
......@@ -1163,6 +1163,8 @@ StatusTuple CodegenLLVM::visit_func_decl_stmt_node(FuncDeclStmtNode *n) {
Function *fn = mod_->getFunction(n->id_->name_);
if (fn) return mkstatus_(n, "Function %s already defined", n->id_->c_str());
fn = Function::Create(fn_type, GlobalValue::ExternalLinkage, n->id_->name_, mod_);
fn->setCallingConv(CallingConv::C);
fn->addFnAttr(Attribute::NoUnwind);
fn->setSection(BPF_FN_PREFIX + n->id_->name_);
BasicBlock *label_entry = BasicBlock::Create(ctx(), "entry", fn);
......
......@@ -2,4 +2,4 @@
# Licensed under the Apache License, Version 2.0 (the "License")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DKERNEL_MODULES_DIR='\"${BCC_KERNEL_MODULES_DIR}\"'")
add_library(clang_frontend loader.cc b_frontend_action.cc tp_frontend_action.cc kbuild_helper.cc)
add_library(clang_frontend loader.cc b_frontend_action.cc tp_frontend_action.cc kbuild_helper.cc ../../common.cc)
......@@ -27,6 +27,7 @@
#include "b_frontend_action.h"
#include "shared_table.h"
#include "common.h"
#include "libbpf.h"
......@@ -644,6 +645,8 @@ bool BTypeVisitor::VisitVarDecl(VarDecl *Decl) {
map_type = BPF_MAP_TYPE_LRU_HASH;
} else if (A->getName() == "maps/lru_percpu_hash") {
map_type = BPF_MAP_TYPE_LRU_PERCPU_HASH;
} else if (A->getName() == "maps/lpm_trie") {
map_type = BPF_MAP_TYPE_LPM_TRIE;
} else if (A->getName() == "maps/histogram") {
if (table.key_desc == "\"int\"")
map_type = BPF_MAP_TYPE_ARRAY;
......@@ -656,7 +659,7 @@ bool BTypeVisitor::VisitVarDecl(VarDecl *Decl) {
map_type = BPF_MAP_TYPE_PROG_ARRAY;
} else if (A->getName() == "maps/perf_output") {
map_type = BPF_MAP_TYPE_PERF_EVENT_ARRAY;
int numcpu = sysconf(_SC_NPROCESSORS_ONLN);
int numcpu = get_possible_cpus().size();
if (numcpu <= 0)
numcpu = 1;
table.max_entries = numcpu;
......
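
A hedged sketch of what would select the new `maps/lpm_trie` section from bcc C, assuming BPF_TABLE's type string maps straight onto the section name; the key struct must begin with the prefix length, mirroring struct bpf_lpm_trie_key, and any extra map-creation flags the kernel requires are left out here:

```
struct lpm_key_v4 {
    u32 prefixlen;   // up to 32 for AF_INET
    u8  data[4];
};

// Lands in section("maps/lpm_trie"), matched by the branch added above.
BPF_TABLE("lpm_trie", struct lpm_key_v4, u64, prefix_counts, 1024);
```
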
......@@ -89,6 +89,7 @@ int KBuildHelper::get_flags(const char *uname_machine, vector<string> *cflags) {
cflags->push_back("-D__HAVE_BUILTIN_BSWAP64__");
cflags->push_back("-Wno-unused-value");
cflags->push_back("-Wno-pointer-sign");
cflags->push_back("-fno-stack-protector");
return 0;
}
......
......@@ -137,6 +137,8 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
"-Wno-deprecated-declarations",
"-Wno-gnu-variable-sized-type-not-at-end",
"-fno-color-diagnostics",
"-fno-unwind-tables",
"-fno-asynchronous-unwind-tables",
"-x", "c", "-c", abs_file.c_str()});
KBuildHelper kbuild_helper(kdir, kernel_path_info.first);
......@@ -165,7 +167,11 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
// set up the command line argument wrapper
#if defined(__powerpc64__)
driver::Driver drv("", "ppc64le-unknown-linux-gnu", diags);
#if defined(_CALL_ELF) && _CALL_ELF == 2
driver::Driver drv("", "powerpc64le-unknown-linux-gnu", diags);
#else
driver::Driver drv("", "powerpc64-unknown-linux-gnu", diags);
#endif
#elif defined(__aarch64__)
driver::Driver drv("", "aarch64-unknown-linux-gnu", diags);
#else
......@@ -205,24 +211,25 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
}
// pre-compilation pass for generating tracepoint structures
auto invocation0 = make_unique<CompilerInvocation>();
if (!CompilerInvocation::CreateFromArgs(*invocation0, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
CompilerInstance compiler0;
CompilerInvocation &invocation0 = compiler0.getInvocation();
if (!CompilerInvocation::CreateFromArgs(
invocation0, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
invocation0->getPreprocessorOpts().RetainRemappedFileBuffers = true;
invocation0.getPreprocessorOpts().RetainRemappedFileBuffers = true;
for (const auto &f : remapped_files_)
invocation0->getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation0.getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
if (in_memory) {
invocation0->getPreprocessorOpts().addRemappedFile(main_path, &*main_buf);
invocation0->getFrontendOpts().Inputs.clear();
invocation0->getFrontendOpts().Inputs.push_back(FrontendInputFile(main_path, IK_C));
invocation0.getPreprocessorOpts().addRemappedFile(main_path, &*main_buf);
invocation0.getFrontendOpts().Inputs.clear();
invocation0.getFrontendOpts().Inputs.push_back(
FrontendInputFile(main_path, IK_C));
}
invocation0->getFrontendOpts().DisableFree = false;
invocation0.getFrontendOpts().DisableFree = false;
CompilerInstance compiler0;
compiler0.setInvocation(invocation0.release());
compiler0.createDiagnostics(new IgnoringDiagConsumer());
// capture the rewritten c file
......@@ -233,24 +240,25 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
unique_ptr<llvm::MemoryBuffer> out_buf = llvm::MemoryBuffer::getMemBuffer(out_str);
// first pass
auto invocation1 = make_unique<CompilerInvocation>();
if (!CompilerInvocation::CreateFromArgs(*invocation1, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
CompilerInstance compiler1;
CompilerInvocation &invocation1 = compiler1.getInvocation();
if (!CompilerInvocation::CreateFromArgs(
invocation1, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
// This option instructs clang whether or not to free the file buffers that we
// give to it. Since the embedded header files should be copied fewer times
// and reused if possible, set this flag to true.
invocation1->getPreprocessorOpts().RetainRemappedFileBuffers = true;
invocation1.getPreprocessorOpts().RetainRemappedFileBuffers = true;
for (const auto &f : remapped_files_)
invocation1->getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation1->getPreprocessorOpts().addRemappedFile(main_path, &*out_buf);
invocation1->getFrontendOpts().Inputs.clear();
invocation1->getFrontendOpts().Inputs.push_back(FrontendInputFile(main_path, IK_C));
invocation1->getFrontendOpts().DisableFree = false;
invocation1.getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation1.getPreprocessorOpts().addRemappedFile(main_path, &*out_buf);
invocation1.getFrontendOpts().Inputs.clear();
invocation1.getFrontendOpts().Inputs.push_back(
FrontendInputFile(main_path, IK_C));
invocation1.getFrontendOpts().DisableFree = false;
CompilerInstance compiler1;
compiler1.setInvocation(invocation1.release());
compiler1.createDiagnostics();
// capture the rewritten c file
......@@ -264,21 +272,22 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
*tables = bact.take_tables();
// second pass, clear input and take rewrite buffer
auto invocation2 = make_unique<CompilerInvocation>();
if (!CompilerInvocation::CreateFromArgs(*invocation2, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
CompilerInstance compiler2;
invocation2->getPreprocessorOpts().RetainRemappedFileBuffers = true;
CompilerInvocation &invocation2 = compiler2.getInvocation();
if (!CompilerInvocation::CreateFromArgs(
invocation2, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
invocation2.getPreprocessorOpts().RetainRemappedFileBuffers = true;
for (const auto &f : remapped_files_)
invocation2->getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation2->getPreprocessorOpts().addRemappedFile(main_path, &*out_buf1);
invocation2->getFrontendOpts().Inputs.clear();
invocation2->getFrontendOpts().Inputs.push_back(FrontendInputFile(main_path, IK_C));
invocation2->getFrontendOpts().DisableFree = false;
invocation2.getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation2.getPreprocessorOpts().addRemappedFile(main_path, &*out_buf1);
invocation2.getFrontendOpts().Inputs.clear();
invocation2.getFrontendOpts().Inputs.push_back(
FrontendInputFile(main_path, IK_C));
invocation2.getFrontendOpts().DisableFree = false;
// suppress warnings in the 2nd pass, but bail out on errors (our fault)
invocation2->getDiagnosticOpts().IgnoreWarnings = true;
compiler2.setInvocation(invocation2.release());
invocation2.getDiagnosticOpts().IgnoreWarnings = true;
compiler2.createDiagnostics();
EmitLLVMOnlyAction ir_act(&*ctx_);
......
......@@ -48,35 +48,42 @@ TracepointTypeVisitor::TracepointTypeVisitor(ASTContext &C, Rewriter &rewriter)
: C(C), diag_(C.getDiagnostics()), rewriter_(rewriter), out_(llvm::errs()) {
}
static inline bool _is_valid_field(string const& line,
string& field_type,
string& field_name) {
enum class field_kind_t {
common,
data_loc,
regular,
invalid
};
static inline field_kind_t _get_field_kind(string const& line,
string& field_type,
string& field_name) {
auto field_pos = line.find("field:");
if (field_pos == string::npos)
return false;
return field_kind_t::invalid;
auto semi_pos = line.find(';', field_pos);
if (semi_pos == string::npos)
return false;
return field_kind_t::invalid;
auto size_pos = line.find("size:", semi_pos);
if (size_pos == string::npos)
return false;
return field_kind_t::invalid;
auto field = line.substr(field_pos + 6/*"field:"*/,
semi_pos - field_pos - 6);
auto pos = field.find_last_of("\t ");
if (pos == string::npos)
return false;
return field_kind_t::invalid;
field_type = field.substr(0, pos);
field_name = field.substr(pos + 1);
if (field_type.find("__data_loc") != string::npos)
return false;
return field_kind_t::data_loc;
if (field_name.find("common_") == 0)
return false;
return field_kind_t::common;
return true;
return field_kind_t::regular;
}
string TracepointTypeVisitor::GenerateTracepointStruct(
......@@ -91,9 +98,17 @@ string TracepointTypeVisitor::GenerateTracepointStruct(
tp_struct += "\tu64 __do_not_use__;\n";
for (string line; getline(input, line); ) {
string field_type, field_name;
if (!_is_valid_field(line, field_type, field_name))
continue;
tp_struct += "\t" + field_type + " " + field_name + ";\n";
switch (_get_field_kind(line, field_type, field_name)) {
case field_kind_t::invalid:
case field_kind_t::common:
continue;
case field_kind_t::data_loc:
tp_struct += "\tint data_loc_" + field_name + ";\n";
break;
case field_kind_t::regular:
tp_struct += "\t" + field_type + " " + field_name + ";\n";
break;
}
}
tp_struct += "};\n";
......
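
Illustrative output of the rewritten generator for a format file mixing regular and __data_loc fields (e.g. irq:irq_handler_entry):

```
struct tracepoint__irq__irq_handler_entry {
    u64 __do_not_use__;   // placeholder; common_* fields are skipped
    int irq;              // field_kind_t::regular, emitted verbatim
    int data_loc_name;    // field_kind_t::data_loc, emitted as int data_loc_<name>
};
```
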
......@@ -35,6 +35,8 @@
#include <sys/resource.h>
#include <unistd.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include "libbpf.h"
#include "perf_reader.h"
......@@ -137,6 +139,37 @@ int bpf_get_next_key(int fd, void *key, void *next_key)
return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}
void bpf_print_hints(char *log)
{
if (log == NULL)
return;
// The following error strings will need maintenance to match LLVM.
// stack busting
if (strstr(log, "invalid stack off=-") != NULL) {
fprintf(stderr, "HINT: Looks like you exceeded the BPF stack limit. "
"This can happen if you allocate too much local variable storage. "
"For example, if you allocated a 1 Kbyte struct (maybe for "
"BPF_PERF_OUTPUT), busting a max stack of 512 bytes.\n\n");
}
// didn't check NULL on map lookup
if (strstr(log, "invalid mem access 'map_value_or_null'") != NULL) {
fprintf(stderr, "HINT: The 'map_value_or_null' error can happen if "
"you dereference a pointer value from a map lookup without first "
"checking if that pointer is NULL.\n\n");
}
// lacking a bpf_probe_read
if (strstr(log, "invalid mem access 'inv'") != NULL) {
fprintf(stderr, "HINT: The invalid mem access 'inv' error can happen "
"if you try to dereference memory without first using "
"bpf_probe_read() to copy it to the BPF stack. Sometimes the "
"bpf_probe_read is automatic by the bcc rewriter, other times "
"you'll need to be explicit.\n\n");
}
}
#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
int bpf_prog_load(enum bpf_prog_type prog_type,
......@@ -221,6 +254,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
}
fprintf(stderr, "bpf: %s\n%s\n", strerror(errno), bpf_log_buffer);
bpf_print_hints(bpf_log_buffer);
free(bpf_log_buffer);
}
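
For reference, the 'map_value_or_null' hint corresponds to this pattern; the verifier accepts the dereference only behind the NULL check:

```
#include <uapi/linux/ptrace.h>

BPF_TABLE("hash", u32, u64, counts, 1024);

int count(struct pt_regs *ctx) {
    u32 key = 0;
    u64 *val = counts.lookup(&key);
    if (!val)        // dropping this check reproduces the hinted error
        return 0;
    (*val)++;
    return 0;
}
```
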
......@@ -304,14 +338,19 @@ static int bpf_attach_tracing_event(int progfd, const char *event_path,
return 0;
}
static void * bpf_attach_probe(int progfd, const char *event,
const char *event_desc, const char *event_type,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie) {
void * bpf_attach_kprobe(int progfd, enum bpf_probe_attach_type attach_type, const char *ev_name,
const char *fn_name,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie)
{
int kfd;
char buf[256];
char new_name[128];
struct perf_reader *reader = NULL;
static char *event_type = "kprobe";
int n;
snprintf(new_name, sizeof(new_name), "%s_bcc_%d", ev_name, getpid());
reader = perf_reader_new(cb, NULL, cb_cookie);
if (!reader)
goto error;
......@@ -323,8 +362,9 @@ static void * bpf_attach_probe(int progfd, const char *event,
goto error;
}
if (write(kfd, event_desc, strlen(event_desc)) < 0) {
fprintf(stderr, "write(%s, \"%s\") failed: %s\n", buf, event_desc, strerror(errno));
snprintf(buf, sizeof(buf), "%c:%ss/%s %s", attach_type==BPF_PROBE_ENTRY ? 'p' : 'r',
event_type, new_name, fn_name);
if (write(kfd, buf, strlen(buf)) < 0) {
if (errno == EINVAL)
fprintf(stderr, "check dmesg output for possible cause\n");
close(kfd);
......@@ -332,34 +372,84 @@ static void * bpf_attach_probe(int progfd, const char *event,
}
close(kfd);
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s", event_type, event);
if (access("/sys/kernel/debug/tracing/instances", F_OK) != -1) {
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d", getpid());
if (access(buf, F_OK) == -1) {
if (mkdir(buf, 0755) == -1)
goto retry;
}
n = snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d/events/%ss/%s",
getpid(), event_type, new_name);
if (n < sizeof(buf) && bpf_attach_tracing_event(progfd, buf, reader, pid, cpu, group_fd) == 0)
goto out;
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d", getpid());
rmdir(buf);
}
retry:
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s", event_type, new_name);
if (bpf_attach_tracing_event(progfd, buf, reader, pid, cpu, group_fd) < 0)
goto error;
out:
return reader;
error:
perf_reader_free(reader);
return NULL;
}
void * bpf_attach_kprobe(int progfd, const char *event,
const char *event_desc,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie) {
return bpf_attach_probe(progfd, event, event_desc, "kprobe", pid, cpu, group_fd, cb, cb_cookie);
}
void * bpf_attach_uprobe(int progfd, const char *event,
const char *event_desc,
void * bpf_attach_uprobe(int progfd, enum bpf_probe_attach_type attach_type, const char *ev_name,
const char *binary_path, uint64_t offset,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie) {
return bpf_attach_probe(progfd, event, event_desc, "uprobe", pid, cpu, group_fd, cb, cb_cookie);
perf_reader_cb cb, void *cb_cookie)
{
int kfd;
char buf[PATH_MAX];
char new_name[128];
struct perf_reader *reader = NULL;
static char *event_type = "uprobe";
int n;
snprintf(new_name, sizeof(new_name), "%s_bcc_%d", ev_name, getpid());
reader = perf_reader_new(cb, NULL, cb_cookie);
if (!reader)
goto error;
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", event_type);
kfd = open(buf, O_WRONLY | O_APPEND, 0);
if (kfd < 0) {
fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
goto error;
}
n = snprintf(buf, sizeof(buf), "%c:%ss/%s %s:0x%lx", attach_type==BPF_PROBE_ENTRY ? 'p' : 'r',
event_type, new_name, binary_path, offset);
if (n >= sizeof(buf)) {
close(kfd);
goto error;
}
if (write(kfd, buf, strlen(buf)) < 0) {
if (errno == EINVAL)
fprintf(stderr, "check dmesg output for possible cause\n");
close(kfd);
goto error;
}
close(kfd);
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s", event_type, new_name);
if (bpf_attach_tracing_event(progfd, buf, reader, pid, cpu, group_fd) < 0)
goto error;
return reader;
error:
perf_reader_free(reader);
return NULL;
}
static int bpf_detach_probe(const char *event_desc, const char *event_type) {
static int bpf_detach_probe(const char *ev_name, const char *event_type)
{
int kfd;
char buf[256];
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", event_type);
kfd = open(buf, O_WRONLY | O_APPEND, 0);
......@@ -368,7 +458,8 @@ static int bpf_detach_probe(const char *event_desc, const char *event_type) {
return -1;
}
if (write(kfd, event_desc, strlen(event_desc)) < 0) {
snprintf(buf, sizeof(buf), "-:%ss/%s_bcc_%d", event_type, ev_name, getpid());
if (write(kfd, buf, strlen(buf)) < 0) {
fprintf(stderr, "write(%s): %s\n", buf, strerror(errno));
close(kfd);
return -1;
......@@ -378,14 +469,24 @@ static int bpf_detach_probe(const char *event_desc, const char *event_type) {
return 0;
}
int bpf_detach_kprobe(const char *event_desc) {
return bpf_detach_probe(event_desc, "kprobe");
int bpf_detach_kprobe(const char *ev_name)
{
char buf[256];
int ret = bpf_detach_probe(ev_name, "kprobe");
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d", getpid());
if (access(buf, F_OK) != -1) {
rmdir(buf);
}
return ret;
}
int bpf_detach_uprobe(const char *event_desc) {
return bpf_detach_probe(event_desc, "uprobe");
int bpf_detach_uprobe(const char *ev_name)
{
return bpf_detach_probe(ev_name, "uprobe");
}
void * bpf_attach_tracepoint(int progfd, const char *tp_category,
const char *tp_name, int pid, int cpu,
int group_fd, perf_reader_cb cb, void *cb_cookie) {
......
......@@ -24,6 +24,11 @@
extern "C" {
#endif
enum bpf_probe_attach_type {
BPF_PROBE_ENTRY,
BPF_PROBE_RETURN
};
int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
int max_entries, int map_flags);
int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags);
......@@ -44,15 +49,19 @@ typedef void (*perf_reader_cb)(void *cb_cookie, int pid, uint64_t callchain_num,
void *callchain);
typedef void (*perf_reader_raw_cb)(void *cb_cookie, void *raw, int raw_size);
void * bpf_attach_kprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb,
void *cb_cookie);
int bpf_detach_kprobe(const char *event_desc);
void * bpf_attach_kprobe(int progfd, enum bpf_probe_attach_type attach_type,
const char *ev_name, const char *fn_name,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
int bpf_detach_kprobe(const char *ev_name);
void * bpf_attach_uprobe(int progfd, enum bpf_probe_attach_type attach_type,
const char *ev_name, const char *binary_path, uint64_t offset,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
void * bpf_attach_uprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb,
void *cb_cookie);
int bpf_detach_uprobe(const char *event_desc);
int bpf_detach_uprobe(const char *ev_name);
void * bpf_attach_tracepoint(int progfd, const char *tp_category,
const char *tp_name, int pid, int cpu,
......
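
A caller-side sketch of the reworked signatures, assuming `prog_fd` holds a loaded BPF_PROG_TYPE_KPROBE program; event naming and tracefs cleanup now happen inside the library, so callers pass the probed symbol directly:

```
#include "libbpf.h"
#include <stddef.h>

static void *attach_clone_probe(int prog_fd) {
    // ev_name is a caller-chosen handle; the library appends _bcc_<pid>.
    // Detach later with: bpf_detach_kprobe("p_sys_clone");
    return bpf_attach_kprobe(prog_fd, BPF_PROBE_ENTRY, "p_sys_clone",
                             "sys_clone", -1 /* pid */, 0 /* cpu */,
                             -1 /* group_fd */, NULL, NULL);
}
```
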
......@@ -223,8 +223,9 @@ std::string Context::resolve_bin_path(const std::string &bin_path) {
if (char *which = bcc_procutils_which(bin_path.c_str())) {
result = which;
::free(which);
} else if (const char *which_so = bcc_procutils_which_so(bin_path.c_str())) {
} else if (char *which_so = bcc_procutils_which_so(bin_path.c_str(), 0)) {
result = which_so;
::free(which_so);
}
return result;
......
......@@ -43,13 +43,10 @@ function Bpf.static.cleanup()
libbcc.perf_reader_free(probe)
-- skip bcc-specific kprobes
if not key:starts("bcc:") then
local desc = string.format("-:%s/%s", probe_type, key)
log.info("detaching %s", desc)
if probe_type == "kprobes" then
libbcc.bpf_detach_kprobe(desc)
libbcc.bpf_detach_kprobe(key)
elseif probe_type == "uprobes" then
libbcc.bpf_detach_uprobe(desc)
libbcc.bpf_detach_uprobe(key)
end
end
all_probes[key] = nil
......@@ -183,15 +180,13 @@ end
function Bpf:attach_uprobe(args)
Bpf.check_probe_quota(1)
local path, addr = Sym.check_path_symbol(args.name, args.sym, args.addr)
local path, addr = Sym.check_path_symbol(args.name, args.sym, args.addr, args.pid)
local fn = self:load_func(args.fn_name, 'BPF_PROG_TYPE_KPROBE')
local ptype = args.retprobe and "r" or "p"
local ev_name = string.format("%s_%s_0x%p", ptype, path:gsub("[^%a%d]", "_"), addr)
local desc = string.format("%s:uprobes/%s %s:0x%p", ptype, ev_name, path, addr)
log.info(desc)
local retprobe = args.retprobe and 1 or 0
local res = libbcc.bpf_attach_uprobe(fn.fd, ev_name, desc,
local res = libbcc.bpf_attach_uprobe(fn.fd, retprobe, ev_name, path, addr,
args.pid or -1,
args.cpu or 0,
args.group_fd or -1, nil, nil) -- TODO; reader callback
......@@ -209,11 +204,9 @@ function Bpf:attach_kprobe(args)
local event = args.event or ""
local ptype = args.retprobe and "r" or "p"
local ev_name = string.format("%s_%s", ptype, event:gsub("[%+%.]", "_"))
local desc = string.format("%s:kprobes/%s %s", ptype, ev_name, event)
log.info(desc)
local retprobe = args.retprobe and 1 or 0
local res = libbcc.bpf_attach_kprobe(fn.fd, ev_name, desc,
local res = libbcc.bpf_attach_kprobe(fn.fd, retprobe, ev_name, event,
args.pid or -1,
args.cpu or 0,
args.group_fd or -1, nil, nil) -- TODO; reader callback
......
......@@ -40,13 +40,19 @@ int bpf_open_raw_sock(const char *name);
typedef void (*perf_reader_cb)(void *cb_cookie, int pid, uint64_t callchain_num, void *callchain);
typedef void (*perf_reader_raw_cb)(void *cb_cookie, void *raw, int raw_size);
void * bpf_attach_kprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb, void *cb_cookie);
int bpf_detach_kprobe(const char *event_desc);
void * bpf_attach_kprobe(int progfd, int attach_type, const char *ev_name,
const char *fn_name,
int pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
void * bpf_attach_uprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb, void *cb_cookie);
int bpf_detach_uprobe(const char *event_desc);
int bpf_detach_kprobe(const char *ev_name);
void * bpf_attach_uprobe(int progfd, int attach_type, const char *ev_name,
const char *binary_path, uint64_t offset,
int pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
int bpf_detach_uprobe(const char *ev_name);
void * bpf_open_perf_buffer(perf_reader_raw_cb raw_cb, void *cb_cookie, int pid, int cpu);
]]
......@@ -109,7 +115,8 @@ struct bcc_symbol {
};
int bcc_resolve_symname(const char *module, const char *symname, const uint64_t addr,
struct bcc_symbol *sym);
int pid, struct bcc_symbol *sym);
void bcc_procutils_free(const char *ptr);
void *bcc_symcache_new(int pid);
int bcc_symcache_resolve(void *symcache, uint64_t addr, struct bcc_symbol *sym);
void bcc_symcache_refresh(void *resolver);
......
......@@ -30,17 +30,22 @@ local function create_cache(pid)
}
end
local function check_path_symbol(module, symname, addr)
local function check_path_symbol(module, symname, addr, pid)
local sym = SYM()
if libbcc.bcc_resolve_symname(module, symname, addr or 0x0, sym) < 0 then
local module_path
if libbcc.bcc_resolve_symname(module, symname, addr or 0x0, pid or 0, sym) < 0 then
if sym[0].module == nil then
error("could not find library '%s' in the library path" % module)
else
module_path = ffi.string(sym[0].module)
libbcc.bcc_procutils_free(sym[0].module)
error("failed to resolve symbol '%s' in '%s'" % {
symname, ffi.string(sym[0].module)})
symname, module_path})
end
end
return ffi.string(sym[0].module), sym[0].offset
module_path = ffi.string(sym[0].module)
libbcc.bcc_procutils_free(sym[0].module)
return module_path, sym[0].offset
end
return { create_cache=create_cache, check_path_symbol=check_path_symbol }
......@@ -29,6 +29,7 @@ BaseTable.static.BPF_MAP_TYPE_STACK_TRACE = 7
BaseTable.static.BPF_MAP_TYPE_CGROUP_ARRAY = 8
BaseTable.static.BPF_MAP_TYPE_LRU_HASH = 9
BaseTable.static.BPF_MAP_TYPE_LRU_PERCPU_HASH = 10
BaseTable.static.BPF_MAP_TYPE_LPM_TRIE = 11
function BaseTable:initialize(t_type, bpf, map_id, map_fd, key_type, leaf_type)
assert(t_type == libbcc.bpf_table_type_id(bpf.module, map_id))
......
......@@ -140,6 +140,7 @@ else
S.c.BPF_MAP.CGROUP_ARRAY = 8
S.c.BPF_MAP.LRU_HASH = 9
S.c.BPF_MAP.LRU_PERCPU_HASH = 10
S.c.BPF_MAP.LPM_TRIE = 11
end
if not S.c.BPF_PROG.TRACEPOINT then
S.c.BPF_PROG.TRACEPOINT = 5
......
......@@ -17,7 +17,6 @@ import atexit
import ctypes as ct
import fcntl
import json
import multiprocessing
import os
import re
import struct
......@@ -29,6 +28,7 @@ from .libbcc import lib, _CB_TYPE, bcc_symbol, _SYM_CB_TYPE
from .table import Table
from .perf import Perf
from .usyms import ProcessSymbols
from .utils import get_online_cpus
_kprobe_limit = 1000
_num_open_probes = 0
......@@ -459,9 +459,8 @@ class BPF(object):
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "p_" + event.replace("+", "_").replace(".", "_")
desc = "p:kprobes/%s %s" % (ev_name, event)
res = lib.bpf_attach_kprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_kprobe(fn.fd, 0, ev_name.encode("ascii"),
event.encode("ascii"), pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -475,8 +474,7 @@ class BPF(object):
if ev_name not in self.open_kprobes:
raise Exception("Kprobe %s is not attached" % event)
lib.perf_reader_free(self.open_kprobes[ev_name])
desc = "-:kprobes/%s" % ev_name
res = lib.bpf_detach_kprobe(desc.encode("ascii"))
res = lib.bpf_detach_kprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from kprobe")
self._del_kprobe(ev_name)
......@@ -498,9 +496,8 @@ class BPF(object):
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "r_" + event.replace("+", "_").replace(".", "_")
desc = "r:kprobes/%s %s" % (ev_name, event)
res = lib.bpf_attach_kprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_kprobe(fn.fd, 1, ev_name.encode("ascii"),
event.encode("ascii"), pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -514,8 +511,7 @@ class BPF(object):
if ev_name not in self.open_kprobes:
raise Exception("Kretprobe %s is not attached" % event)
lib.perf_reader_free(self.open_kprobes[ev_name])
desc = "-:kprobes/%s" % ev_name
res = lib.bpf_detach_kprobe(desc.encode("ascii"))
res = lib.bpf_detach_kprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from kprobe")
self._del_kprobe(ev_name)
......@@ -554,20 +550,28 @@ class BPF(object):
@classmethod
def _check_path_symbol(cls, module, symname, addr):
def _check_path_symbol(cls, module, symname, addr, pid):
sym = bcc_symbol()
psym = ct.pointer(sym)
c_pid = 0 if pid == -1 else pid
if lib.bcc_resolve_symname(module.encode("ascii"),
symname.encode("ascii"), addr or 0x0, psym) < 0:
symname.encode("ascii"), addr or 0x0, c_pid, psym) < 0:
if not sym.module:
raise Exception("could not find library %s" % module)
lib.bcc_procutils_free(sym.module)
raise Exception("could not determine address of symbol %s" % symname)
return sym.module.decode(), sym.offset
module_path = ct.cast(sym.module, ct.c_char_p).value.decode()
lib.bcc_procutils_free(sym.module)
return module_path, sym.offset
@staticmethod
def find_library(libname):
res = lib.bcc_procutils_which_so(libname.encode("ascii"))
return res if res is None else res.decode()
res = lib.bcc_procutils_which_so(libname.encode("ascii"), 0)
if not res:
return None
libpath = ct.cast(res, ct.c_char_p).value.decode()
lib.bcc_procutils_free(res)
return libpath
@staticmethod
def get_tracepoints(tp_re):
......@@ -660,7 +664,7 @@ class BPF(object):
res[cpu] = self._attach_perf_event(fn.fd, ev_type, ev_config,
sample_period, sample_freq, pid, cpu, group_fd)
else:
for i in range(0, multiprocessing.cpu_count()):
for i in get_online_cpus():
res[i] = self._attach_perf_event(fn.fd, ev_type, ev_config,
sample_period, sample_freq, pid, i, group_fd)
self.open_perf_events[(ev_type, ev_config)] = res
......@@ -736,7 +740,8 @@ class BPF(object):
Libraries can be given in the name argument without the lib prefix, or
with the full path (/usr/lib/...). Binaries can be given only with the
full path (/bin/sh).
full path (/bin/sh). If a PID is given, the uprobe will attach to the
version of the library used by the process.
Example: BPF(text).attach_uprobe("c", "malloc")
BPF(text).attach_uprobe("/usr/bin/python", "main")
......@@ -753,14 +758,13 @@ class BPF(object):
group_fd=group_fd)
return
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "p_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
desc = "p:uprobes/%s %s:0x%x" % (ev_name, path, addr)
res = lib.bpf_attach_uprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_uprobe(fn.fd, 0, ev_name.encode("ascii"),
path.encode("ascii"), addr, pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -768,21 +772,20 @@ class BPF(object):
self._add_uprobe(ev_name, res)
return self
def detach_uprobe(self, name="", sym="", addr=None):
"""detach_uprobe(name="", sym="", addr=None)
def detach_uprobe(self, name="", sym="", addr=None, pid=-1):
"""detach_uprobe(name="", sym="", addr=None, pid=-1)
Stop running a bpf function that is attached to symbol 'sym' in library
or binary 'name'.
"""
name = str(name)
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
ev_name = "p_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
if ev_name not in self.open_uprobes:
raise Exception("Uprobe %s is not attached" % event)
raise Exception("Uprobe %s is not attached" % ev_name)
lib.perf_reader_free(self.open_uprobes[ev_name])
desc = "-:uprobes/%s" % ev_name
res = lib.bpf_detach_uprobe(desc.encode("ascii"))
res = lib.bpf_detach_uprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from uprobe")
self._del_uprobe(ev_name)
......@@ -805,14 +808,13 @@ class BPF(object):
return
name = str(name)
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "r_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
desc = "r:uprobes/%s %s:0x%x" % (ev_name, path, addr)
res = lib.bpf_attach_uprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_uprobe(fn.fd, 1, ev_name.encode("ascii"),
path.encode("ascii"), addr, pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -820,21 +822,20 @@ class BPF(object):
self._add_uprobe(ev_name, res)
return self
def detach_uretprobe(self, name="", sym="", addr=None):
"""detach_uretprobe(name="", sym="", addr=None)
def detach_uretprobe(self, name="", sym="", addr=None, pid=-1):
"""detach_uretprobe(name="", sym="", addr=None, pid=-1)
Stop running a bpf function that is attached to symbol 'sym' in library
or binary 'name'.
"""
name = str(name)
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
ev_name = "r_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
if ev_name not in self.open_uprobes:
raise Exception("Kretprobe %s is not attached" % event)
raise Exception("Uretprobe %s is not attached" % ev_name)
lib.perf_reader_free(self.open_uprobes[ev_name])
desc = "-:uprobes/%s" % ev_name
res = lib.bpf_detach_uprobe(desc.encode("ascii"))
res = lib.bpf_detach_uprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from uprobe")
self._del_uprobe(ev_name)
......@@ -1036,18 +1037,17 @@ class BPF(object):
lib.perf_reader_free(v)
# non-string keys here include the perf_events reader
if isinstance(k, str):
desc = "-:kprobes/%s" % k
lib.bpf_detach_kprobe(desc.encode("ascii"))
lib.bpf_detach_kprobe(str(k).encode("ascii"))
self._del_kprobe(k)
for k, v in list(self.open_uprobes.items()):
lib.perf_reader_free(v)
desc = "-:uprobes/%s" % k
lib.bpf_detach_uprobe(desc.encode("ascii"))
lib.bpf_detach_uprobe(str(k).encode("ascii"))
self._del_uprobe(k)
for k, v in self.open_tracepoints.items():
lib.perf_reader_free(v)
(tp_category, tp_name) = k.split(':')
lib.bpf_detach_tracepoint(tp_category, tp_name)
lib.bpf_detach_tracepoint(tp_category.encode("ascii"),
tp_name.encode("ascii"))
self.open_tracepoints.clear()
for (ev_type, ev_config) in list(self.open_perf_events.keys()):
self.detach_perf_event(ev_type, ev_config)
......@@ -1065,4 +1065,4 @@ class BPF(object):
self.cleanup()
from .usdt import USDT
from .usdt import USDT, USDTException
......@@ -87,13 +87,13 @@ lib.bpf_attach_kprobe.restype = ct.c_void_p
_CB_TYPE = ct.CFUNCTYPE(None, ct.py_object, ct.c_int,
ct.c_ulonglong, ct.POINTER(ct.c_ulonglong))
_RAW_CB_TYPE = ct.CFUNCTYPE(None, ct.py_object, ct.c_void_p, ct.c_int)
lib.bpf_attach_kprobe.argtypes = [ct.c_int, ct.c_char_p, ct.c_char_p, ct.c_int,
lib.bpf_attach_kprobe.argtypes = [ct.c_int, ct.c_int, ct.c_char_p, ct.c_char_p, ct.c_int,
ct.c_int, ct.c_int, _CB_TYPE, ct.py_object]
lib.bpf_detach_kprobe.restype = ct.c_int
lib.bpf_detach_kprobe.argtypes = [ct.c_char_p]
lib.bpf_attach_uprobe.restype = ct.c_void_p
lib.bpf_attach_uprobe.argtypes = [ct.c_int, ct.c_char_p, ct.c_char_p, ct.c_int,
ct.c_int, ct.c_int, _CB_TYPE, ct.py_object]
lib.bpf_attach_uprobe.argtypes = [ct.c_int, ct.c_int, ct.c_char_p, ct.c_char_p,
ct.c_ulonglong, ct.c_int, ct.c_int, ct.c_int, _CB_TYPE, ct.py_object]
lib.bpf_detach_uprobe.restype = ct.c_int
lib.bpf_detach_uprobe.argtypes = [ct.c_char_p]
lib.bpf_attach_tracepoint.restype = ct.c_void_p
......@@ -126,16 +126,18 @@ class bcc_symbol(ct.Structure):
_fields_ = [
('name', ct.c_char_p),
('demangle_name', ct.c_char_p),
('module', ct.c_char_p),
('module', ct.POINTER(ct.c_char)),
('offset', ct.c_ulonglong),
]
lib.bcc_procutils_which_so.restype = ct.c_char_p
lib.bcc_procutils_which_so.argtypes = [ct.c_char_p]
lib.bcc_procutils_which_so.restype = ct.POINTER(ct.c_char)
lib.bcc_procutils_which_so.argtypes = [ct.c_char_p, ct.c_int]
lib.bcc_procutils_free.restype = None
lib.bcc_procutils_free.argtypes = [ct.c_void_p]
lib.bcc_resolve_symname.restype = ct.c_int
lib.bcc_resolve_symname.argtypes = [
ct.c_char_p, ct.c_char_p, ct.c_ulonglong, ct.POINTER(bcc_symbol)]
ct.c_char_p, ct.c_char_p, ct.c_ulonglong, ct.c_int, ct.POINTER(bcc_symbol)]
_SYM_CB_TYPE = ct.CFUNCTYPE(ct.c_int, ct.c_char_p, ct.c_ulonglong)
lib.bcc_foreach_symbol.restype = ct.c_int
......
......@@ -13,8 +13,8 @@
# limitations under the License.
import ctypes as ct
import multiprocessing
import os
from .utils import get_online_cpus
class Perf(object):
class perf_event_attr(ct.Structure):
......@@ -105,5 +105,5 @@ class Perf(object):
attr.sample_period = 1
attr.wakeup_events = 9999999 # don't wake up
for cpu in range(0, multiprocessing.cpu_count()):
for cpu in get_online_cpus():
Perf._open_for_cpu(cpu, attr)
......@@ -14,11 +14,14 @@
from collections import MutableMapping
import ctypes as ct
from functools import reduce
import multiprocessing
import os
from .libbcc import lib, _RAW_CB_TYPE
from .perf import Perf
from .utils import get_online_cpus
from .utils import get_possible_cpus
from subprocess import check_output
BPF_MAP_TYPE_HASH = 1
......@@ -31,6 +34,7 @@ BPF_MAP_TYPE_STACK_TRACE = 7
BPF_MAP_TYPE_CGROUP_ARRAY = 8
BPF_MAP_TYPE_LRU_HASH = 9
BPF_MAP_TYPE_LRU_PERCPU_HASH = 10
BPF_MAP_TYPE_LPM_TRIE = 11
stars_max = 40
log2_index_max = 65
......@@ -121,6 +125,8 @@ def Table(bpf, map_id, map_fd, keytype, leaftype, **kwargs):
t = PerCpuHash(bpf, map_id, map_fd, keytype, leaftype, **kwargs)
elif ttype == BPF_MAP_TYPE_PERCPU_ARRAY:
t = PerCpuArray(bpf, map_id, map_fd, keytype, leaftype, **kwargs)
elif ttype == BPF_MAP_TYPE_LPM_TRIE:
t = LpmTrie(bpf, map_id, map_fd, keytype, leaftype)
elif ttype == BPF_MAP_TYPE_STACK_TRACE:
t = StackTrace(bpf, map_id, map_fd, keytype, leaftype)
elif ttype == BPF_MAP_TYPE_LRU_HASH:
......@@ -509,7 +515,7 @@ class PerfEventArray(ArrayBase):
event submitted from the kernel, up to millions per second.
"""
for i in range(0, multiprocessing.cpu_count()):
for i in get_online_cpus():
self._open_perf_buffer(i, callback)
def _open_perf_buffer(self, cpu, callback):
......@@ -550,7 +556,7 @@ class PerfEventArray(ArrayBase):
if not isinstance(ev, self.Event):
raise Exception("argument must be an Event, got %s", type(ev))
for i in range(0, multiprocessing.cpu_count()):
for i in get_online_cpus():
self._open_perf_event(i, ev.typ, ev.config)
......@@ -559,7 +565,7 @@ class PerCpuHash(HashTable):
self.reducer = kwargs.pop("reducer", None)
super(PerCpuHash, self).__init__(*args, **kwargs)
self.sLeaf = self.Leaf
self.total_cpu = multiprocessing.cpu_count()
self.total_cpu = len(get_possible_cpus())
# This needs to be 8 as hard coded into the linux kernel.
self.alignment = ct.sizeof(self.sLeaf) % 8
if self.alignment is 0:
......@@ -595,7 +601,7 @@ class PerCpuHash(HashTable):
def sum(self, key):
if isinstance(self.Leaf(), ct.Structure):
raise IndexError("Leaf must be an integer type for default sum functions")
return self.sLeaf(reduce(lambda x,y: x+y, self.getvalue(key)))
return self.sLeaf(sum(self.getvalue(key)))
def max(self, key):
if isinstance(self.Leaf(), ct.Structure):
......@@ -604,8 +610,7 @@ class PerCpuHash(HashTable):
def average(self, key):
result = self.sum(key)
result.value/=self.total_cpu
return result
return result.value / self.total_cpu
class LruPerCpuHash(PerCpuHash):
def __init__(self, *args, **kwargs):
......@@ -616,7 +621,7 @@ class PerCpuArray(ArrayBase):
self.reducer = kwargs.pop("reducer", None)
super(PerCpuArray, self).__init__(*args, **kwargs)
self.sLeaf = self.Leaf
self.total_cpu = multiprocessing.cpu_count()
self.total_cpu = len(get_possible_cpus())
# This needs to be 8 as hard coded into the linux kernel.
self.alignment = ct.sizeof(self.sLeaf) % 8
if self.alignment is 0:
......@@ -652,7 +657,7 @@ class PerCpuArray(ArrayBase):
def sum(self, key):
if isinstance(self.Leaf(), ct.Structure):
raise IndexError("Leaf must be an integer type for default sum functions")
return self.sLeaf(reduce(lambda x,y: x+y, self.getvalue(key)))
return self.sLeaf(sum(self.getvalue(key)))
def max(self, key):
if isinstance(self.Leaf(), ct.Structure):
......@@ -661,8 +666,19 @@ class PerCpuArray(ArrayBase):
def average(self, key):
result = self.sum(key)
result.value/=self.total_cpu
return result
return result.value / self.total_cpu
class LpmTrie(TableBase):
def __init__(self, *args, **kwargs):
super(LpmTrie, self).__init__(*args, **kwargs)
def __len__(self):
raise NotImplementedError
def __delitem__(self, key):
# Not implemented for lpm trie as of kernel commit
# b95a5c4db09bc7c253636cb84dc9b12c577fd5a0
raise NotImplementedError
class StackTrace(TableBase):
MAX_DEPTH = 127
......
......@@ -13,10 +13,14 @@
# limitations under the License.
import ctypes as ct
import sys
from .libbcc import lib, _USDT_CB, _USDT_PROBE_CB, \
bcc_usdt_location, bcc_usdt_argument, \
BCC_USDT_ARGUMENT_FLAGS
class USDTException(Exception):
pass
class USDTProbeArgument(object):
def __init__(self, argument):
self.signed = argument.size < 0
......@@ -77,8 +81,9 @@ class USDTProbeLocation(object):
res = lib.bcc_usdt_get_argument(self.probe.context, self.probe.name,
self.index, index, ct.pointer(arg))
if res != 0:
raise Exception("error retrieving probe argument %d location %d" %
(index, self.index))
raise USDTException(
"error retrieving probe argument %d location %d" %
(index, self.index))
return USDTProbeArgument(arg)
class USDTProbe(object):
......@@ -103,7 +108,7 @@ class USDTProbe(object):
res = lib.bcc_usdt_get_location(self.context, self.name,
index, ct.pointer(loc))
if res != 0:
raise Exception("error retrieving probe location %d" % index)
raise USDTException("error retrieving probe location %d" % index)
return USDTProbeLocation(self, index, loc)
class USDT(object):
......@@ -112,23 +117,36 @@ class USDT(object):
self.pid = pid
self.context = lib.bcc_usdt_new_frompid(pid)
if self.context == None:
raise Exception("USDT failed to instrument PID %d" % pid)
raise USDTException("USDT failed to instrument PID %d" % pid)
elif path:
self.path = path
self.context = lib.bcc_usdt_new_frompath(path)
if self.context == None:
raise Exception("USDT failed to instrument path %s" % path)
raise USDTException("USDT failed to instrument path %s" % path)
else:
raise Exception("either a pid or a binary path must be specified")
raise USDTException(
"either a pid or a binary path must be specified")
def enable_probe(self, probe, fn_name):
if lib.bcc_usdt_enable_probe(self.context, probe, fn_name) != 0:
raise Exception(("failed to enable probe '%s'; a possible cause " +
"can be that the probe requires a pid to enable") %
probe)
raise USDTException(
("failed to enable probe '%s'; a possible cause " +
"can be that the probe requires a pid to enable") %
probe
)
def enable_probe_or_bail(self, probe, fn_name):
if lib.bcc_usdt_enable_probe(self.context, probe, fn_name) != 0:
print(
"""Error attaching USDT probes: the specified pid might not contain the
given language's runtime, or the runtime was not built with the required
USDT probes. Look for a configure flag similar to --with-dtrace or
--enable-dtrace. To check which probes are present in the process, use the
tplist tool.""")
sys.exit(1)
def get_text(self):
return lib.bcc_usdt_genargs(self.context)
return lib.bcc_usdt_genargs(self.context).decode()
def get_probe_arg_ctype(self, probe_name, arg_index):
return lib.bcc_usdt_get_probe_argctype(
......
# Copyright 2016 Catalysts GmbH
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
def _read_cpu_range(path):
cpus = []
with open(path, 'r') as f:
cpus_range_str = f.read()
for cpu_range in cpus_range_str.split(','):
rangeop = cpu_range.find('-')
if rangeop == -1:
cpus.append(int(cpu_range))
else:
start = int(cpu_range[:rangeop])
end = int(cpu_range[rangeop+1:])
cpus.extend(range(start, end+1))
return cpus
def get_online_cpus():
return _read_cpu_range('/sys/devices/system/cpu/online')
def get_possible_cpus():
return _read_cpu_range('/sys/devices/system/cpu/possible')
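For reference, the sysfs files parsed above hold comma-separated CPU ranges such as `0-2,4`. A self-contained sketch of the same parsing logic (file read omitted; input string illustrative):

```python
# Mirrors _read_cpu_range above, but takes the string directly.
def parse_cpu_range(cpus_range_str):
    cpus = []
    for cpu_range in cpus_range_str.split(','):
        rangeop = cpu_range.find('-')
        if rangeop == -1:
            cpus.append(int(cpu_range))        # single CPU, e.g. "4"
        else:
            start = int(cpu_range[:rangeop])   # range, e.g. "0-2"
            end = int(cpu_range[rangeop + 1:])
            cpus.extend(range(start, end + 1))
    return cpus

assert parse_cpu_range("0-2,4") == [0, 1, 2, 4]
```

The distinction matters because `online` can be a strict subset of `possible` (offlined or not-yet-hotplugged CPUs), which is why the per-cpu code above stops assuming `multiprocessing.cpu_count()` covers every slot.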
......@@ -23,6 +23,7 @@
#include "bcc_perf_map.h"
#include "bcc_proc.h"
#include "bcc_syms.h"
#include "common.h"
#include "vendor/tinyformat.hpp"
#include "catch.hpp"
......@@ -30,10 +31,19 @@
using namespace std;
TEST_CASE("shared object resolution", "[c_api]") {
const char *libm = bcc_procutils_which_so("m");
char *libm = bcc_procutils_which_so("m", 0);
REQUIRE(libm);
REQUIRE(libm[0] == '/');
REQUIRE(string(libm).find("libm.so") != string::npos);
free(libm);
}
TEST_CASE("shared object resolution using loaded libraries", "[c_api]") {
char *libelf = bcc_procutils_which_so("elf", getpid());
REQUIRE(libelf);
REQUIRE(libelf[0] == '/');
REQUIRE(string(libelf).find("libelf") != string::npos);
free(libelf);
}
TEST_CASE("binary resolution with `which`", "[c_api]") {
......@@ -57,10 +67,21 @@ TEST_CASE("list all kernel symbols", "[c_api]") {
TEST_CASE("resolve symbol name in external library", "[c_api]") {
struct bcc_symbol sym;
REQUIRE(bcc_resolve_symname("c", "malloc", 0x0, &sym) == 0);
REQUIRE(bcc_resolve_symname("c", "malloc", 0x0, 0, &sym) == 0);
REQUIRE(string(sym.module).find("libc.so") != string::npos);
REQUIRE(sym.module[0] == '/');
REQUIRE(sym.offset != 0);
bcc_procutils_free(sym.module);
}
TEST_CASE("resolve symbol name in external library using loaded libraries", "[c_api]") {
struct bcc_symbol sym;
REQUIRE(bcc_resolve_symname("bcc", "bcc_procutils_which", 0x0, getpid(), &sym) == 0);
REQUIRE(string(sym.module).find("libbcc.so") != string::npos);
REQUIRE(sym.module[0] == '/');
REQUIRE(sym.offset != 0);
bcc_procutils_free(sym.module);
}
extern "C" int _a_test_function(const char *a_string) {
......@@ -196,3 +217,10 @@ TEST_CASE("resolve symbols using /tmp/perf-pid.map", "[c_api]") {
munmap(map_addr, map_sz);
}
TEST_CASE("get online CPUs", "[c_api]") {
std::vector<int> cpus = ebpf::get_online_cpus();
int num_cpus = sysconf(_SC_NPROCESSORS_ONLN);
REQUIRE(cpus.size() == num_cpus);
}
......@@ -19,23 +19,9 @@ if ldd bcc-lua | grep -q luajit; then
fail "bcc-lua depends on libluajit"
fi
rm -f libbcc.so probe.lua
rm -f probe.lua
echo "return function(BPF) print(\"Hello world\") end" > probe.lua
if ./bcc-lua "probe.lua"; then
fail "bcc-lua runs without libbcc.so"
fi
if ! env LIBBCC_SO_PATH=../cc/libbcc.so ./bcc-lua "probe.lua"; then
fail "bcc-lua cannot load libbcc.so through the environment"
fi
ln -s ../cc/libbcc.so
if ! ./bcc-lua "probe.lua"; then
fail "bcc-lua cannot find local libbcc.so"
fi
PROBE="../../../examples/lua/offcputime.lua"
if ! sudo ./bcc-lua "$PROBE" -d 1 >/dev/null 2>/dev/null; then
......
......@@ -27,8 +27,8 @@ int count(struct pt_regs *ctx) {
local text = text:gsub("PID", tostring(pid))
local b = BPF:new{text=text}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count"}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count", retprobe=true}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count", pid=pid}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count", pid=pid, retprobe=true}
assert_equals(BPF.num_open_uprobes(), 2)
......
......@@ -17,7 +17,7 @@ endif()
add_test(NAME py_test_stat1_b WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_stat1_b namespace ${CMAKE_CURRENT_SOURCE_DIR}/test_stat1.py test_stat1.b proto.b)
add_test(NAME py_test_bpf_log WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_bpf_prog namespace ${CMAKE_CURRENT_SOURCE_DIR}/test_bpf_log.py)
COMMAND ${TEST_WRAPPER} py_bpf_prog sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_bpf_log.py)
add_test(NAME py_test_stat1_c WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_stat1_c namespace ${CMAKE_CURRENT_SOURCE_DIR}/test_stat1.py test_stat1.c)
#add_test(NAME py_test_xlate1_b WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
......@@ -56,6 +56,10 @@ add_test(NAME py_test_tracepoint WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_tracepoint sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_tracepoint.py)
add_test(NAME py_test_perf_event WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_perf_event sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_perf_event.py)
add_test(NAME py_test_utils WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_utils sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_utils.py)
add_test(NAME py_test_percpu WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_percpu sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_percpu.py)
add_test(NAME py_test_dump_func WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_dump_func simple ${CMAKE_CURRENT_SOURCE_DIR}/test_dump_func.py)
......@@ -6,6 +6,8 @@ from bcc import BPF
import ctypes as ct
import random
import time
import subprocess
from bcc.utils import get_online_cpus
from unittest import main, TestCase
class TestArray(TestCase):
......@@ -62,6 +64,37 @@ int kprobe__sys_nanosleep(void *ctx) {
time.sleep(0.1)
b.kprobe_poll()
self.assertGreater(self.counter, 0)
b.cleanup()
def test_perf_buffer_for_each_cpu(self):
self.events = []
class Data(ct.Structure):
_fields_ = [("cpu", ct.c_ulonglong)]
def cb(cpu, data, size):
self.assertGreater(size, ct.sizeof(Data))
event = ct.cast(data, ct.POINTER(Data)).contents
self.events.append(event)
text = """
BPF_PERF_OUTPUT(events);
int kprobe__sys_nanosleep(void *ctx) {
struct {
u64 cpu;
} data = {bpf_get_smp_processor_id()};
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
"""
b = BPF(text=text)
b["events"].open_perf_buffer(cb)
online_cpus = get_online_cpus()
for cpu in online_cpus:
subprocess.call(['taskset', '-c', str(cpu), 'sleep', '0.1'])
b.kprobe_poll()
b.cleanup()
self.assertGreaterEqual(len(self.events), len(online_cpus), 'Received only {}/{} events'.format(len(self.events), len(online_cpus)))
if __name__ == "__main__":
main()
......@@ -51,7 +51,7 @@ class TestBPFProgLoad(TestCase):
except Exception:
self.fp.flush()
self.fp.seek(0)
self.assertEqual(error_msg in self.fp.read(), True)
self.assertEqual(error_msg in self.fp.read().decode(), True)
def test_log_no_debug(self):
......@@ -61,7 +61,7 @@ class TestBPFProgLoad(TestCase):
except Exception:
self.fp.flush()
self.fp.seek(0)
self.assertEqual(error_msg in self.fp.read(), True)
self.assertEqual(error_msg in self.fp.read().decode(), True)
if __name__ == "__main__":
......
......@@ -68,6 +68,16 @@ ipr = IPRoute()
ipdb = IPDB(nl=ipr)
sim = Simulation(ipdb)
allocated_interfaces = set(ipdb.interfaces.keys())
def get_next_iface(prefix):
i = 0
while True:
iface = "{0}{1}".format(prefix, i)
if iface not in allocated_interfaces:
allocated_interfaces.add(iface)
return iface
i += 1
class TestBPFSocket(TestCase):
def setup_br(self, br, veth_rt_2_br, veth_pem_2_br, veth_br_2_pem):
......@@ -84,15 +94,15 @@ class TestBPFSocket(TestCase):
br1.add_port(ipdb.interfaces[veth_rt_2_br])
br1.up()
subprocess.call(["sysctl", "-q", "-w", "net.ipv6.conf." + br + ".disable_ipv6=1"])
def set_default_const(self):
self.ns1 = "ns1"
self.ns2 = "ns2"
self.ns_router = "ns_router"
self.br1 = "br1"
self.br1 = get_next_iface("br")
self.veth_pem_2_br1 = "v20"
self.veth_br1_2_pem = "v21"
self.br2 = "br2"
self.br2 = get_next_iface("br")
self.veth_pem_2_br2 = "v22"
self.veth_br2_2_pem = "v23"
......
#!/usr/bin/env python
# Copyright (c) 2017 Facebook, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
import ctypes as ct
import unittest
from bcc import BPF
from netaddr import IPAddress
class KeyV4(ct.Structure):
_fields_ = [("prefixlen", ct.c_uint),
("data", ct.c_ubyte * 4)]
class KeyV6(ct.Structure):
_fields_ = [("prefixlen", ct.c_uint),
("data", ct.c_ushort * 8)]
class TestLpmTrie(unittest.TestCase):
def test_lpm_trie_v4(self):
test_prog1 = """
BPF_F_TABLE("lpm_trie", u64, int, trie, 16, BPF_F_NO_PREALLOC);
"""
b = BPF(text=test_prog1)
t = b["trie"]
k1 = KeyV4(24, (192, 168, 0, 0))
v1 = ct.c_int(24)
t[k1] = v1
k2 = KeyV4(28, (192, 168, 0, 0))
v2 = ct.c_int(28)
t[k2] = v2
k = KeyV4(32, (192, 168, 0, 15))
self.assertEqual(t[k].value, 28)
k = KeyV4(32, (192, 168, 0, 127))
self.assertEqual(t[k].value, 24)
with self.assertRaises(KeyError):
k = KeyV4(32, (172, 16, 1, 127))
v = t[k]
def test_lpm_trie_v6(self):
test_prog1 = """
struct key_v6 {
u32 prefixlen;
u32 data[4];
};
BPF_F_TABLE("lpm_trie", struct key_v6, int, trie, 16, BPF_F_NO_PREALLOC);
"""
b = BPF(text=test_prog1)
t = b["trie"]
k1 = KeyV6(64, IPAddress('2a00:1450:4001:814:200e::').words)
v1 = ct.c_int(64)
t[k1] = v1
k2 = KeyV6(96, IPAddress('2a00:1450:4001:814::200e').words)
v2 = ct.c_int(96)
t[k2] = v2
k = KeyV6(128, IPAddress('2a00:1450:4001:814::1024').words)
self.assertEqual(t[k].value, 96)
k = KeyV6(128, IPAddress('2a00:1450:4001:814:2046::').words)
self.assertEqual(t[k].value, 64)
with self.assertRaises(KeyError):
k = KeyV6(128, IPAddress('2a00:ffff::').words)
v = t[k]
if __name__ == "__main__":
unittest.main()
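The assertions above rely on longest-prefix-match semantics: a lookup key is compared against every stored prefix, and the entry with the longest matching prefix wins. A pure-Python (3.x) sketch of the same semantics using the standard `ipaddress` module, with table contents mirroring the v4 test:

```python
import ipaddress

def lpm_lookup(table, addr):
    # Return the value stored under the longest prefix containing addr.
    best = None
    for net, value in table.items():
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, value)
    return best[1] if best else None

table = {
    ipaddress.ip_network("192.168.0.0/24"): 24,
    ipaddress.ip_network("192.168.0.0/28"): 28,
}
assert lpm_lookup(table, ipaddress.ip_address("192.168.0.15")) == 28
assert lpm_lookup(table, ipaddress.ip_address("192.168.0.127")) == 24
assert lpm_lookup(table, ipaddress.ip_address("172.16.1.127")) is None
```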
......@@ -9,6 +9,12 @@ import multiprocessing
class TestPercpu(unittest.TestCase):
def setUp(self):
try:
b = BPF(text='BPF_TABLE("percpu_array", u32, u32, stub, 1);')
except:
raise unittest.SkipTest("PerCpu unsupported on this kernel")
def test_u64(self):
test_prog1 = """
BPF_TABLE("percpu_hash", u32, u64, stats, 1);
......@@ -34,8 +40,8 @@ class TestPercpu(unittest.TestCase):
sum = stats_map.sum(stats_map.Key(0))
avg = stats_map.average(stats_map.Key(0))
max = stats_map.max(stats_map.Key(0))
self.assertGreater(sum.value, 0L)
self.assertGreater(max.value, 0L)
self.assertGreater(sum.value, int(0))
self.assertGreater(max.value, int(0))
bpf_code.detach_kprobe("sys_clone")
def test_u32(self):
......@@ -63,8 +69,8 @@ class TestPercpu(unittest.TestCase):
sum = stats_map.sum(stats_map.Key(0))
avg = stats_map.average(stats_map.Key(0))
max = stats_map.max(stats_map.Key(0))
self.assertGreater(sum.value, 0L)
self.assertGreater(max.value, 0L)
self.assertGreater(sum.value, int(0))
self.assertGreater(max.value, int(0))
bpf_code.detach_kprobe("sys_clone")
def test_struct_custom_func(self):
......@@ -95,7 +101,7 @@ class TestPercpu(unittest.TestCase):
f.close()
self.assertEqual(len(stats_map),1)
k = stats_map[ stats_map.Key(0) ]
self.assertGreater(k.c1, 0L)
self.assertGreater(k.c1, int(0))
bpf_code.detach_kprobe("sys_clone")
......
......@@ -7,6 +7,7 @@ import unittest
from time import sleep
import distutils.version
import os
import subprocess
def kernel_version_ge(major, minor):
# True if running kernel is >= X.Y
......@@ -39,5 +40,29 @@ class TestTracepoint(unittest.TestCase):
total_switches += v.value
self.assertNotEqual(0, total_switches)
@unittest.skipUnless(kernel_version_ge(4,7), "requires kernel >= 4.7")
class TestTracepointDataLoc(unittest.TestCase):
def test_tracepoint_data_loc(self):
text = """
struct value_t {
char filename[64];
};
BPF_HASH(execs, u32, struct value_t);
TRACEPOINT_PROBE(sched, sched_process_exec) {
struct value_t val = {0};
char fn[64];
u32 pid = args->pid;
struct value_t *existing = execs.lookup_or_init(&pid, &val);
TP_DATA_LOC_READ_CONST(fn, filename, 64);
__builtin_memcpy(existing->filename, fn, 64);
return 0;
}
"""
b = bcc.BPF(text=text)
subprocess.check_output(["/bin/ls"])
sleep(1)
self.assertTrue("/bin/ls" in [v.filename.decode()
for v in b["execs"].values()])
if __name__ == "__main__":
unittest.main()
......@@ -18,22 +18,24 @@ static void incr(int idx) {
++(*ptr);
}
int count(struct pt_regs *ctx) {
bpf_trace_printk("count() uprobe fired");
u32 pid = bpf_get_current_pid_tgid();
if (pid == PID)
incr(0);
return 0;
}"""
text = text.replace("PID", "%d" % os.getpid())
test_pid = os.getpid()
text = text.replace("PID", "%d" % test_pid)
b = bcc.BPF(text=text)
b.attach_uprobe(name="c", sym="malloc_stats", fn_name="count")
b.attach_uretprobe(name="c", sym="malloc_stats", fn_name="count")
b.attach_uprobe(name="c", sym="malloc_stats", fn_name="count", pid=test_pid)
b.attach_uretprobe(name="c", sym="malloc_stats", fn_name="count", pid=test_pid)
libc = ctypes.CDLL("libc.so.6")
libc.malloc_stats.restype = None
libc.malloc_stats.argtypes = []
libc.malloc_stats()
self.assertEqual(b["stats"][ctypes.c_int(0)].value, 2)
b.detach_uretprobe(name="c", sym="malloc_stats")
b.detach_uprobe(name="c", sym="malloc_stats")
b.detach_uretprobe(name="c", sym="malloc_stats", pid=test_pid)
b.detach_uprobe(name="c", sym="malloc_stats", pid=test_pid)
def test_simple_binary(self):
text = """
......
#!/usr/bin/python
# Copyright (c) Catalysts GmbH
# Licensed under the Apache License, Version 2.0 (the "License")
from bcc.utils import get_online_cpus
import multiprocessing
import unittest
class TestUtils(unittest.TestCase):
def test_get_online_cpus(self):
online_cpus = get_online_cpus()
num_cores = multiprocessing.cpu_count()
self.assertEqual(len(online_cpus), num_cores)
if __name__ == "__main__":
unittest.main()
......@@ -159,7 +159,7 @@ u64 __time = bpf_ktime_get_ns();
if parts[0] not in ["r", "p", "t", "u"]:
self._bail("probe type must be 'p', 'r', 't', or 'u'" +
" but got '%s'" % parts[0])
if re.match(r"\w+\(.*\)", parts[2]) is None:
if re.match(r"\S+\(.*\)", parts[2]) is None:
self._bail(("function signature '%s' has an invalid " +
"format") % parts[2])
......@@ -173,6 +173,9 @@ u64 __time = bpf_ktime_get_ns();
self._bail("no exprs specified")
self.exprs = exprs.split(',')
def _make_valid_identifier(self, ident):
return re.sub(r'[^A-Za-z0-9_]', '_', ident)
def __init__(self, tool, type, specifier):
self.usdt_ctx = None
self.streq_functions = ""
......@@ -196,8 +199,9 @@ u64 __time = bpf_ktime_get_ns();
self.tp_event = self.function
elif self.probe_type == "u":
self.library = parts[1]
self.probe_func_name = "%s_probe%d" % \
(self.function, Probe.next_probe_index)
self.probe_func_name = self._make_valid_identifier(
"%s_probe%d" % \
(self.function, Probe.next_probe_index))
self._enable_usdt_probe()
else:
self.library = parts[1]
......@@ -233,10 +237,12 @@ u64 __time = bpf_ktime_get_ns();
self.entry_probe_required = self.probe_type == "r" and \
(any(map(check, self.exprs)) or check(self.filter))
self.probe_func_name = "%s_probe%d" % \
(self.function, Probe.next_probe_index)
self.probe_hash_name = "%s_hash%d" % \
(self.function, Probe.next_probe_index)
self.probe_func_name = self._make_valid_identifier(
"%s_probe%d" % \
(self.function, Probe.next_probe_index))
self.probe_hash_name = self._make_valid_identifier(
"%s_hash%d" % \
(self.function, Probe.next_probe_index))
Probe.next_probe_index += 1
def _enable_usdt_probe(self):
......@@ -252,7 +258,7 @@ static inline bool %s(char const *ignored, char const *str) {
char needle[] = %s;
char haystack[sizeof(needle)];
bpf_probe_read(&haystack, sizeof(haystack), (void *)str);
for (int i = 0; i < sizeof(needle); ++i) {
for (int i = 0; i < sizeof(needle) - 1; ++i) {
if (needle[i] != haystack[i]) {
return false;
}
......@@ -613,7 +619,8 @@ argdist -p 2780 -z 120 \\
"(see examples below)")
parser.add_argument("-I", "--include", action="append",
metavar="header",
help="additional header files to include in the BPF program")
help="additional header files to include in the BPF program "
"as either full path, or relative to '/usr/include'")
self.args = parser.parse_args()
self.usdt_ctx = None
......@@ -634,7 +641,12 @@ struct __string_t { char s[%d]; };
#include <uapi/linux/ptrace.h>
""" % self.args.string_size
for include in (self.args.include or []):
bpf_source += "#include <%s>\n" % include
if include.startswith((".", "/")):
include = os.path.abspath(include)
bpf_source += "#include \"%s\"\n" % include
else:
bpf_source += "#include <%s>\n" % include
bpf_source += BPF.generate_auto_includes(
map(lambda p: p.raw_spec, self.probes))
for probe in self.probes:
......
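The branch added above chooses between the two C include forms. A standalone sketch of the same decision (header names illustrative):

```python
import os

def include_directive(include):
    # Local ("./x.h") or absolute ("/x.h") paths become quoted includes
    # with the path expanded; anything else stays an angle-bracket
    # include resolved relative to the system include path.
    if include.startswith((".", "/")):
        return '#include "%s"\n' % os.path.abspath(include)
    return '#include <%s>\n' % include

assert include_directive("linux/fs.h") == "#include <linux/fs.h>\n"
assert include_directive("/tmp/defs.h") == '#include "/tmp/defs.h"\n'
```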
......@@ -363,6 +363,7 @@ optional arguments:
below)
-I header, --include header
additional header files to include in the BPF program
as either full path, or relative to '/usr/include'
Probe specifier syntax:
{p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
......
......@@ -106,14 +106,12 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
* test, and maintenance burden.
*/
#ifdef REQ_WRITE
if (req->cmd_flags & REQ_WRITE) {
data.rwflag = !!(req->cmd_flags & REQ_WRITE);
#elif defined(REQ_OP_SHIFT)
data.rwflag = !!((req->cmd_flags >> REQ_OP_SHIFT) == REQ_OP_WRITE);
#else
if ((req->cmd_flags >> REQ_OP_SHIFT) == REQ_OP_WRITE) {
data.rwflag = !!((req->cmd_flags & REQ_OP_MASK) == REQ_OP_WRITE);
#endif
data.rwflag = 1;
} else {
data.rwflag = 0;
}
events.perf_submit(ctx, &data, sizeof(data));
start.delete(&req);
......
......@@ -137,8 +137,10 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
*/
#ifdef REQ_WRITE
info.rwflag = !!(req->cmd_flags & REQ_WRITE);
#else
#elif defined(REQ_OP_SHIFT)
info.rwflag = !!((req->cmd_flags >> REQ_OP_SHIFT) == REQ_OP_WRITE);
#else
info.rwflag = !!((req->cmd_flags & REQ_OP_MASK) == REQ_OP_WRITE);
#endif
whop = whobyreq.lookup(&req);
......
/*
* deadlock_detector.c Detects potential deadlocks in a running process.
* For Linux, uses BCC, eBPF. See .py file.
*
* Copyright 2017 Facebook, Inc.
* Licensed under the Apache License, Version 2.0 (the "License")
*
* 1-Feb-2016 Kenny Yu Created this.
*/
#include <linux/sched.h>
#include <uapi/linux/ptrace.h>
// Maximum number of mutexes a single thread can hold at once.
// If the number is too big, the unrolled loops will cause the stack
// to be too big, and the bpf verifier will fail.
#define MAX_HELD_MUTEXES 16
// Info about held mutexes. `mutex` will be 0 if not held.
struct held_mutex_t {
u64 mutex;
u64 stack_id;
};
// List of mutexes that a thread is holding. Whenever we loop over this array,
// we need to force the compiler to unroll the loop, otherwise the bpf verifier
// will fail because the loop will create a backwards edge.
struct thread_to_held_mutex_leaf_t {
struct held_mutex_t held_mutexes[MAX_HELD_MUTEXES];
};
// Map of thread ID -> array of (mutex addresses, stack id)
BPF_TABLE("hash", u32, struct thread_to_held_mutex_leaf_t,
thread_to_held_mutexes, 2097152);
// Key type for edges. Represents an edge from mutex1 => mutex2.
struct edges_key_t {
u64 mutex1;
u64 mutex2;
};
// Leaf type for edges. Holds information about where each mutex was acquired.
struct edges_leaf_t {
u64 mutex1_stack_id;
u64 mutex2_stack_id;
u32 thread_pid;
char comm[TASK_COMM_LEN];
};
// Represents all edges currently in the mutex wait graph.
BPF_TABLE("hash", struct edges_key_t, struct edges_leaf_t, edges, 2097152);
// Info about parent thread when a child thread is created.
struct thread_created_leaf_t {
u64 stack_id;
u32 parent_pid;
char comm[TASK_COMM_LEN];
};
// Map of child thread pid -> info about parent thread.
BPF_TABLE("hash", u32, struct thread_created_leaf_t, thread_to_parent, 10240);
// Stack traces when threads are created and when mutexes are locked/unlocked.
BPF_STACK_TRACE(stack_traces, 655360);
// The first argument to the user space function we are tracing
// is a pointer to the mutex M held by thread T.
//
// For all mutexes N held by mutexes_held[T]
// add edge N => M (held by T)
// mutexes_held[T].add(M)
int trace_mutex_acquire(struct pt_regs *ctx, void *mutex_addr) {
// Higher 32 bits are the process ID, lower 32 bits are the thread ID
u32 pid = bpf_get_current_pid_tgid();
u64 mutex = (u64)mutex_addr;
struct thread_to_held_mutex_leaf_t empty_leaf = {};
struct thread_to_held_mutex_leaf_t *leaf =
thread_to_held_mutexes.lookup_or_init(&pid, &empty_leaf);
if (!leaf) {
bpf_trace_printk(
"could not add thread_to_held_mutex key, thread: %d, mutex: %p\n", pid,
mutex);
return 1; // Could not insert, no more memory
}
// Recursive mutexes lock the same mutex multiple times. We cannot tell if
// the mutex is recursive after the mutex is already created. To avoid noisy
// reports, disallow self edges. Do one pass to check if we are already
// holding the mutex, and if we are, do nothing.
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
if (leaf->held_mutexes[i].mutex == mutex) {
return 1; // Disallow self edges
}
}
u64 stack_id =
stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
int added_mutex = 0;
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
// If this is a free slot, see if we can insert.
if (!leaf->held_mutexes[i].mutex) {
if (!added_mutex) {
leaf->held_mutexes[i].mutex = mutex;
leaf->held_mutexes[i].stack_id = stack_id;
added_mutex = 1;
}
continue; // Nothing to do for a free slot
}
// Add edges from held mutex => current mutex
struct edges_key_t edge_key = {};
edge_key.mutex1 = leaf->held_mutexes[i].mutex;
edge_key.mutex2 = mutex;
struct edges_leaf_t edge_leaf = {};
edge_leaf.mutex1_stack_id = leaf->held_mutexes[i].stack_id;
edge_leaf.mutex2_stack_id = stack_id;
edge_leaf.thread_pid = pid;
bpf_get_current_comm(&edge_leaf.comm, sizeof(edge_leaf.comm));
// Returns non-zero on error
int result = edges.update(&edge_key, &edge_leaf);
if (result) {
bpf_trace_printk("could not add edge key %p, %p, error: %d\n",
edge_key.mutex1, edge_key.mutex2, result);
continue; // Could not insert, no more memory
}
}
// There were no free slots for this mutex.
if (!added_mutex) {
bpf_trace_printk("could not add mutex %p, added_mutex: %d\n", mutex,
added_mutex);
return 1;
}
return 0;
}
// The first argument to the user space function we are tracing
// is a pointer to the mutex M held by thread T.
//
// mutexes_held[T].remove(M)
int trace_mutex_release(struct pt_regs *ctx, void *mutex_addr) {
// Higher 32 bits are the process ID, lower 32 bits are the thread ID
u32 pid = bpf_get_current_pid_tgid();
u64 mutex = (u64)mutex_addr;
struct thread_to_held_mutex_leaf_t *leaf =
thread_to_held_mutexes.lookup(&pid);
if (!leaf) {
// If the leaf does not exist for the pid, then it means we either missed
// the acquire event, or we had no more memory and could not add it.
bpf_trace_printk(
"could not find thread_to_held_mutex, thread: %d, mutex: %p\n", pid,
mutex);
return 1;
}
// For older kernels without "bpf: allow access into map value arrays"
// (https://lkml.org/lkml/2016/8/30/287) the bpf verifier will fail with an
// invalid memory access on `leaf->held_mutexes[i]` below. On newer kernels,
// we can avoid making this extra copy in `value` and use `leaf` directly.
struct thread_to_held_mutex_leaf_t value = {};
bpf_probe_read(&value, sizeof(struct thread_to_held_mutex_leaf_t), leaf);
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
// Find the current mutex (if it exists), and clear it.
// Note: Can't use `leaf->` in this if condition, see comment above.
if (value.held_mutexes[i].mutex == mutex) {
leaf->held_mutexes[i].mutex = 0;
leaf->held_mutexes[i].stack_id = 0;
}
}
return 0;
}
// Trace return from clone() syscall in the child thread (return value > 0).
int trace_clone(struct pt_regs *ctx, unsigned long flags, void *child_stack,
void *ptid, void *ctid, struct pt_regs *regs) {
u32 child_pid = PT_REGS_RC(ctx);
if (child_pid <= 0) {
return 1;
}
struct thread_created_leaf_t thread_created_leaf = {};
thread_created_leaf.parent_pid = bpf_get_current_pid_tgid();
thread_created_leaf.stack_id =
stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
bpf_get_current_comm(&thread_created_leaf.comm,
sizeof(thread_created_leaf.comm));
struct thread_created_leaf_t *insert_result =
thread_to_parent.lookup_or_init(&child_pid, &thread_created_leaf);
if (!insert_result) {
bpf_trace_printk(
"could not add thread_created_key, child: %d, parent: %d\n", child_pid,
thread_created_leaf.parent_pid);
return 1; // Could not insert, no more memory
}
return 0;
}
#!/usr/bin/env python
#
# deadlock_detector Detects potential deadlocks (lock order inversions)
# on a running process. For Linux, uses BCC, eBPF.
#
# USAGE: deadlock_detector.py [-h] [--binary BINARY] [--dump-graph DUMP_GRAPH]
# [--verbose] [--lock-symbols LOCK_SYMBOLS]
# [--unlock-symbols UNLOCK_SYMBOLS]
# pid
#
# This traces pthread mutex lock and unlock calls to build a directed graph
# representing the mutex wait graph:
#
# - Nodes in the graph represent mutexes.
# - Edge (A, B) exists if there exists some thread T where lock(A) was called
# and lock(B) was called before unlock(A) was called.
#
# If the program finds a potential lock order inversion, the program will dump
# the cycle of mutexes and the stack traces where each mutex was acquired, and
# then exit.
#
# This program can only find potential deadlocks that occur while the program
# is tracing the process. It cannot find deadlocks that may have occurred
# before the program was attached to the process.
#
# Since this traces all mutex lock and unlock events and all thread creation
# events on the traced process, the overhead of this bpf program can be very
# high if the process has many threads and mutexes. You should only run this on
# a process where the slowdown is acceptable.
#
# Note: This tool does not work for shared mutexes or recursive mutexes.
#
# For shared (read-write) mutexes, a deadlock requires a cycle in the wait
# graph where at least one of the mutexes in the cycle is acquiring exclusive
# (write) ownership.
#
# For recursive mutexes, lock() is called multiple times on the same mutex.
# However, there is no way to determine if a mutex is a recursive mutex
# after the mutex has been created. As a result, this tool will not find
# potential deadlocks that involve only one mutex.
#
# Copyright 2017 Facebook, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 01-Feb-2017 Kenny Yu Created this.
from __future__ import (
absolute_import, division, unicode_literals, print_function
)
from bcc import BPF
from collections import defaultdict
import argparse
import json
import os
import subprocess
import sys
import time
class DiGraph(object):
'''
Adapted from networkx: http://networkx.github.io/
Represents a directed graph. Edges can store (key, value) attributes.
'''
def __init__(self):
# Map of node -> set of nodes
self.adjacency_map = {}
# Map of (node1, node2) -> map string -> arbitrary attribute
# This will not be copied in subgraph()
self.attributes_map = {}
def neighbors(self, node):
return self.adjacency_map.get(node, set())
def edges(self):
edges = []
for node, neighbors in self.adjacency_map.items():
for neighbor in neighbors:
edges.append((node, neighbor))
return edges
def nodes(self):
return self.adjacency_map.keys()
def attributes(self, node1, node2):
return self.attributes_map[(node1, node2)]
def add_edge(self, node1, node2, **kwargs):
if node1 not in self.adjacency_map:
self.adjacency_map[node1] = set()
if node2 not in self.adjacency_map:
self.adjacency_map[node2] = set()
self.adjacency_map[node1].add(node2)
self.attributes_map[(node1, node2)] = kwargs
def remove_node(self, node):
self.adjacency_map.pop(node, None)
for _, neighbors in self.adjacency_map.items():
neighbors.discard(node)
def subgraph(self, nodes):
graph = DiGraph()
for node in nodes:
for neighbor in self.neighbors(node):
if neighbor in nodes:
graph.add_edge(node, neighbor)
return graph
def node_link_data(self):
'''
Returns the graph as a dictionary in a format that can be
serialized.
'''
data = {
'directed': True,
'multigraph': False,
'graph': {},
'links': [],
'nodes': [],
}
# Do one pass to build a map of node -> position in nodes
node_to_number = {}
for node in self.adjacency_map.keys():
node_to_number[node] = len(data['nodes'])
data['nodes'].append({'id': node})
# Do another pass to build the link information
for node, neighbors in self.adjacency_map.items():
for neighbor in neighbors:
link = self.attributes_map[(node, neighbor)].copy()
link['source'] = node_to_number[node]
link['target'] = node_to_number[neighbor]
data['links'].append(link)
return data
def strongly_connected_components(G):
'''
Adapted from networkx: http://networkx.github.io/
Parameters
----------
G : DiGraph
Returns
-------
comp : generator of sets
A generator of sets of nodes, one for each strongly connected
component of G.
'''
preorder = {}
lowlink = {}
scc_found = {}
scc_queue = []
i = 0 # Preorder counter
for source in G.nodes():
if source not in scc_found:
queue = [source]
while queue:
v = queue[-1]
if v not in preorder:
i = i + 1
preorder[v] = i
done = 1
v_nbrs = G.neighbors(v)
for w in v_nbrs:
if w not in preorder:
queue.append(w)
done = 0
break
if done == 1:
lowlink[v] = preorder[v]
for w in v_nbrs:
if w not in scc_found:
if preorder[w] > preorder[v]:
lowlink[v] = min([lowlink[v], lowlink[w]])
else:
lowlink[v] = min([lowlink[v], preorder[w]])
queue.pop()
if lowlink[v] == preorder[v]:
scc_found[v] = True
scc = {v}
while (
scc_queue and preorder[scc_queue[-1]] > preorder[v]
):
k = scc_queue.pop()
scc_found[k] = True
scc.add(k)
yield scc
else:
scc_queue.append(v)
def simple_cycles(G):
'''
Adapted from networkx: http://networkx.github.io/
Parameters
----------
G : DiGraph
Returns
-------
cycle_generator: generator
A generator that produces elementary cycles of the graph.
Each cycle is represented by a list of nodes along the cycle.
'''
def _unblock(thisnode, blocked, B):
stack = set([thisnode])
while stack:
node = stack.pop()
if node in blocked:
blocked.remove(node)
stack.update(B[node])
B[node].clear()
# Johnson's algorithm requires some ordering of the nodes.
# We assign the arbitrary ordering given by the strongly connected comps
# There is no need to track the ordering as each node is removed as processed.
# save the actual graph so we can mutate it here
# We only take the edges because we do not want to
# copy edge and node attributes here.
subG = G.subgraph(G.nodes())
sccs = list(strongly_connected_components(subG))
while sccs:
scc = sccs.pop()
# order of scc determines ordering of nodes
startnode = scc.pop()
# Processing node runs 'circuit' routine from recursive version
path = [startnode]
blocked = set() # vertex: blocked from search?
closed = set() # nodes involved in a cycle
blocked.add(startnode)
B = defaultdict(set) # graph portions that yield no elementary circuit
stack = [(startnode, list(subG.neighbors(startnode)))]
while stack:
thisnode, nbrs = stack[-1]
if nbrs:
nextnode = nbrs.pop()
if nextnode == startnode:
yield path[:]
closed.update(path)
elif nextnode not in blocked:
path.append(nextnode)
stack.append((nextnode, list(subG.neighbors(nextnode))))
closed.discard(nextnode)
blocked.add(nextnode)
continue
# done with nextnode... look for more neighbors
if not nbrs: # no more nbrs
if thisnode in closed:
_unblock(thisnode, blocked, B)
else:
for nbr in subG.neighbors(thisnode):
if thisnode not in B[nbr]:
B[nbr].add(thisnode)
stack.pop()
path.pop()
# done processing this node
subG.remove_node(startnode)
H = subG.subgraph(scc) # make smaller to avoid work in SCC routine
sccs.extend(list(strongly_connected_components(H)))
def find_cycle(graph):
'''
Looks for a cycle in the graph. If found, returns the first cycle.
If nodes a1, a2, ..., an are in a cycle, then this returns:
[(a1,a2), (a2,a3), ... (an-1,an), (an, a1)]
Otherwise returns an empty list.
'''
cycles = list(simple_cycles(graph))
if cycles:
nodes = cycles[0]
nodes.append(nodes[0])
edges = []
prev = nodes[0]
for node in nodes[1:]:
edges.append((prev, node))
prev = node
return edges
else:
return []
def print_cycle(binary, graph, edges, thread_info, print_stack_trace_fn):
'''
Prints the cycle in the mutex graph in the following format:
Potential Deadlock Detected!
Cycle in lock order graph: M0 => M1 => M2 => M0
for (m, n) in cycle:
Mutex n acquired here while holding Mutex m in thread T:
[ stack trace ]
Mutex m previously acquired by thread T here:
[ stack trace ]
for T in all threads:
Thread T was created here:
[ stack trace ]
'''
# List of mutexes in the cycle, first and last repeated
nodes_in_order = []
# Map mutex address -> readable alias
node_addr_to_name = {}
for counter, (m, n) in enumerate(edges):
nodes_in_order.append(m)
# For global or static variables, try to symbolize the mutex address.
symbol = symbolize_with_objdump(binary, m)
if symbol:
symbol += ' '
node_addr_to_name[m] = 'Mutex M%d (%s0x%016x)' % (counter, symbol, m)
nodes_in_order.append(nodes_in_order[0])
print('----------------\nPotential Deadlock Detected!\n')
print(
'Cycle in lock order graph: %s\n' %
(' => '.join([node_addr_to_name[n] for n in nodes_in_order]))
)
# Set of threads involved in the lock inversion
thread_pids = set()
# For each edge in the cycle, print where the two mutexes were held
for (m, n) in edges:
thread_pid = graph.attributes(m, n)['thread_pid']
thread_comm = graph.attributes(m, n)['thread_comm']
first_mutex_stack_id = graph.attributes(m, n)['first_mutex_stack_id']
second_mutex_stack_id = graph.attributes(m, n)['second_mutex_stack_id']
thread_pids.add(thread_pid)
print(
'%s acquired here while holding %s in Thread %d (%s):' % (
node_addr_to_name[n], node_addr_to_name[m], thread_pid,
thread_comm
)
)
print_stack_trace_fn(second_mutex_stack_id)
print('')
print(
'%s previously acquired by the same Thread %d (%s) here:' %
(node_addr_to_name[m], thread_pid, thread_comm)
)
print_stack_trace_fn(first_mutex_stack_id)
print('')
# Print where the threads were created, if available
for thread_pid in thread_pids:
parent_pid, stack_id, parent_comm = thread_info.get(
thread_pid, (None, None, None)
)
if parent_pid:
print(
'Thread %d created by Thread %d (%s) here: ' %
(thread_pid, parent_pid, parent_comm)
)
print_stack_trace_fn(stack_id)
else:
print(
'Could not find stack trace where Thread %d was created' %
thread_pid
)
print('')
def symbolize_with_objdump(binary, addr):
'''
Searches the binary for the address using objdump. Returns the symbol if
it is found, otherwise returns an empty string.
'''
try:
command = (
'objdump -tT %s | grep %x | awk {\'print $NF\'} | c++filt' %
(binary, addr)
)
output = subprocess.check_output(command, shell=True)
return output.decode('utf-8').strip()
except subprocess.CalledProcessError:
return ''
def strlist(s):
'''Given a comma-separated string, returns a list of substrings'''
return s.strip().split(',')
def main():
examples = '''Examples:
deadlock_detector 181 # Analyze PID 181
deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
# Analyze PID 181 and locks from this binary.
# If tracing a process that is running from
# a dynamically-linked binary, this argument
# is required and should be the path to the
# pthread library.
deadlock_detector 181 --verbose
# Analyze PID 181 and print statistics about
# the mutex wait graph.
deadlock_detector 181 --lock-symbols my_mutex_lock1,my_mutex_lock2 \\
--unlock-symbols my_mutex_unlock1,my_mutex_unlock2
# Analyze PID 181 and trace custom mutex
# symbols instead of pthread mutexes.
deadlock_detector 181 --dump-graph graph.json
# Analyze PID 181 and dump the mutex wait
# graph to graph.json.
'''
parser = argparse.ArgumentParser(
description=(
'Detect potential deadlocks (lock inversions) in a running binary.'
'\nMust be run as root.'
),
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples,
)
parser.add_argument('pid', type=int, help='Pid to trace')
# Binaries with `:` in the path will fail to attach uprobes on kernels
# running without this patch: https://lkml.org/lkml/2017/1/13/585.
# Symlinks to the binary without `:` in the path can get around this issue.
parser.add_argument(
'--binary',
type=str,
default='',
help='If set, trace the mutexes from the binary at this path. '
'For statically-linked binaries, this argument is not required. '
'For dynamically-linked binaries, this argument is required and '
'should be the path of the pthread library the binary is using. '
'Example: /lib/x86_64-linux-gnu/libpthread.so.0',
)
parser.add_argument(
'--dump-graph',
type=str,
default='',
help='If set, this will dump the mutex graph to the specified file.',
)
parser.add_argument(
'--verbose',
action='store_true',
help='Print statistics about the mutex wait graph.',
)
parser.add_argument(
'--lock-symbols',
type=strlist,
default=['pthread_mutex_lock'],
help='Comma-separated list of lock symbols to trace. Default is '
'pthread_mutex_lock. These symbols cannot be inlined in the binary.',
)
parser.add_argument(
'--unlock-symbols',
type=strlist,
default=['pthread_mutex_unlock'],
help='Comma-separated list of unlock symbols to trace. Default is '
'pthread_mutex_unlock. These symbols cannot be inlined in the binary.',
)
args = parser.parse_args()
if not args.binary:
try:
args.binary = os.readlink('/proc/%d/exe' % args.pid)
except OSError as e:
print('%s. Is the process (pid=%d) running?' % (str(e), args.pid))
sys.exit(1)
bpf = BPF(src_file='deadlock_detector.c')
# Trace where threads are created
bpf.attach_kretprobe(
event='sys_clone', fn_name='trace_clone', pid=args.pid
)
# We must trace unlock first, otherwise in the time we attached the probe
# on lock() and have not yet attached the probe on unlock(), a thread can
# acquire mutexes and release them, but the release events will not be
# traced, resulting in noisy reports.
for symbol in args.unlock_symbols:
try:
bpf.attach_uprobe(
name=args.binary,
sym=symbol,
fn_name='trace_mutex_release',
pid=args.pid,
)
except Exception as e:
print('%s. Failed to attach to symbol: %s' % (str(e), symbol))
sys.exit(1)
for symbol in args.lock_symbols:
try:
bpf.attach_uprobe(
name=args.binary,
sym=symbol,
fn_name='trace_mutex_acquire',
pid=args.pid,
)
except Exception as e:
print('%s. Failed to attach to symbol: %s' % (str(e), symbol))
sys.exit(1)
def print_stack_trace(stack_id):
'''Closure that prints the symbolized stack trace.'''
for addr in bpf.get_table('stack_traces').walk(stack_id):
line = bpf.sym(addr, args.pid)
# Try to symbolize with objdump if we cannot with bpf.
if line == '[unknown]':
symbol = symbolize_with_objdump(args.binary, addr)
if symbol:
line = symbol
print('@ %016x %s' % (addr, line))
print('Tracing... Hit Ctrl-C to end.')
while True:
try:
# Map of child thread pid -> parent info
thread_info = {
child.value: (parent.parent_pid, parent.stack_id, parent.comm)
for child, parent in bpf.get_table('thread_to_parent').items()
}
# Mutex wait directed graph. Nodes are mutexes. Edge (A,B) exists
# if there exists some thread T where lock(A) was called and
# lock(B) was called before unlock(A) was called.
graph = DiGraph()
for key, leaf in bpf.get_table('edges').items():
graph.add_edge(
key.mutex1,
key.mutex2,
thread_pid=leaf.thread_pid,
thread_comm=leaf.comm.decode('utf-8'),
first_mutex_stack_id=leaf.mutex1_stack_id,
second_mutex_stack_id=leaf.mutex2_stack_id,
)
if args.verbose:
print(
'Mutexes: %d, Edges: %d' %
(len(graph.nodes()), len(graph.edges()))
)
if args.dump_graph:
with open(args.dump_graph, 'w') as f:
data = graph.node_link_data()
f.write(json.dumps(data, indent=2))
cycle = find_cycle(graph)
if cycle:
print_cycle(
args.binary, graph, cycle, thread_info, print_stack_trace
)
sys.exit(1)
time.sleep(1)
except KeyboardInterrupt:
break
if __name__ == '__main__':
main()
Demonstrations of deadlock_detector.
This program detects potential deadlocks in a running process. The program
attaches uprobes on `pthread_mutex_lock` and `pthread_mutex_unlock` to build
a mutex wait directed graph, and then looks for a cycle in this graph. This
graph has the following properties:
- Nodes in the graph represent mutexes.
- Edge (A, B) exists if there exists some thread T where lock(A) was called
and lock(B) was called before unlock(A) was called.
If there is a cycle in this graph, it indicates a lock order inversion (a
potential deadlock). When the program finds one, it dumps the cycle of mutexes
and the stack traces where each mutex was acquired, and then exits.
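As a toy illustration (mutex names and the cycle check are illustrative, not the tool's actual code): if thread 1 takes mutex A and then B while thread 2 takes B and then A, the recorded edges (A, B) and (B, A) form a cycle:

```python
# Edges recorded while tracing: (held_mutex, acquired_mutex) pairs.
edges = [("A", "B"), ("B", "A")]

adjacency = {}
for held, acquired in edges:
    adjacency.setdefault(held, set()).add(acquired)

def has_cycle(adj):
    # Recursive DFS with colors: 1 = on the current path, 2 = done.
    color = {}
    def visit(node):
        color[node] = 1
        for nxt in adj.get(node, ()):
            if color.get(nxt) == 1:
                return True  # back edge: a cycle exists
            if nxt not in color and visit(nxt):
                return True
        color[node] = 2
        return False
    return any(visit(n) for n in list(adj) if n not in color)

assert has_cycle(adjacency)  # lock order inversion detected
```

The tool itself uses Johnson's algorithm (`simple_cycles` above) so that it can report the full cycle, not just detect that one exists.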
This program can only find potential deadlocks that occur while the program
is tracing the process. It cannot find deadlocks that may have occurred
before the program was attached to the process.
Since this traces all mutex lock and unlock events and all thread creation
events on the traced process, the overhead of this bpf program can be very
high if the process has many threads and mutexes. You should only run this on
a process where the slowdown is acceptable.
Note: This tool does not work for shared mutexes or recursive mutexes.
For shared (read-write) mutexes, a deadlock requires a cycle in the wait
graph where at least one of the mutexes in the cycle is acquiring exclusive
(write) ownership.
For recursive mutexes, lock() is called multiple times on the same mutex.
However, there is no way to determine if a mutex is a recursive mutex
after the mutex has been created. As a result, this tool will not find
potential deadlocks that involve only one mutex.
# ./deadlock_detector.py 181
Tracing... Hit Ctrl-C to end.
----------------
Potential Deadlock Detected!
Cycle in lock order graph: Mutex M0 (main::static_mutex3 0x0000000000473c60) => Mutex M1 (0x00007fff6d738400) => Mutex M2 (global_mutex1 0x0000000000473be0) => Mutex M3 (global_mutex2 0x0000000000473c20) => Mutex M0 (main::static_mutex3 0x0000000000473c60)
Mutex M1 (0x00007fff6d738400) acquired here while holding Mutex M0 (main::static_mutex3 0x0000000000473c60) in Thread 357250 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402e38 main::{lambda()#3}::operator()() const
@ 0000000000406ba8 void std::_Bind_simple<main::{lambda()#3} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406951 std::_Bind_simple<main::{lambda()#3} ()>::operator()()
@ 000000000040673a std::thread::_Impl<std::_Bind_simple<main::{lambda()#3} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M0 (main::static_mutex3 0x0000000000473c60) previously acquired by the same Thread 357250 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402e22 main::{lambda()#3}::operator()() const
@ 0000000000406ba8 void std::_Bind_simple<main::{lambda()#3} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406951 std::_Bind_simple<main::{lambda()#3} ()>::operator()()
@ 000000000040673a std::thread::_Impl<std::_Bind_simple<main::{lambda()#3} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M2 (global_mutex1 0x0000000000473be0) acquired here while holding Mutex M1 (0x00007fff6d738400) in Thread 357251 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402ea8 main::{lambda()#4}::operator()() const
@ 0000000000406b46 void std::_Bind_simple<main::{lambda()#4} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 000000000040692d std::_Bind_simple<main::{lambda()#4} ()>::operator()()
@ 000000000040671c std::thread::_Impl<std::_Bind_simple<main::{lambda()#4} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M1 (0x00007fff6d738400) previously acquired by the same Thread 357251 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402e97 main::{lambda()#4}::operator()() const
@ 0000000000406b46 void std::_Bind_simple<main::{lambda()#4} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 000000000040692d std::_Bind_simple<main::{lambda()#4} ()>::operator()()
@ 000000000040671c std::thread::_Impl<std::_Bind_simple<main::{lambda()#4} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M3 (global_mutex2 0x0000000000473c20) acquired here while holding Mutex M2 (global_mutex1 0x0000000000473be0) in Thread 357247 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402d5f main::{lambda()#1}::operator()() const
@ 0000000000406c6c void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406999 std::_Bind_simple<main::{lambda()#1} ()>::operator()()
@ 0000000000406776 std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M2 (global_mutex1 0x0000000000473be0) previously acquired by the same Thread 357247 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402d4e main::{lambda()#1}::operator()() const
@ 0000000000406c6c void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406999 std::_Bind_simple<main::{lambda()#1} ()>::operator()()
@ 0000000000406776 std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M0 (main::static_mutex3 0x0000000000473c60) acquired here while holding Mutex M3 (global_mutex2 0x0000000000473c20) in Thread 357248 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402dc9 main::{lambda()#2}::operator()() const
@ 0000000000406c0a void std::_Bind_simple<main::{lambda()#2} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406975 std::_Bind_simple<main::{lambda()#2} ()>::operator()()
@ 0000000000406758 std::thread::_Impl<std::_Bind_simple<main::{lambda()#2} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M3 (global_mutex2 0x0000000000473c20) previously acquired by the same Thread 357248 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402db8 main::{lambda()#2}::operator()() const
@ 0000000000406c0a void std::_Bind_simple<main::{lambda()#2} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406975 std::_Bind_simple<main::{lambda()#2} ()>::operator()()
@ 0000000000406758 std::thread::_Impl<std::_Bind_simple<main::{lambda()#2} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Thread 357248 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004033ac std::thread::thread<main::{lambda()#2}>(main::{lambda()#2}&&)
@ 000000000040308f main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
Thread 357250 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004034b2 std::thread::thread<main::{lambda()#3}>(main::{lambda()#3}&&)
@ 00000000004030b9 main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
Thread 357251 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004035b8 std::thread::thread<main::{lambda()#4}>(main::{lambda()#4}&&)
@ 00000000004030e6 main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
Thread 357247 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004032a6 std::thread::thread<main::{lambda()#1}>(main::{lambda()#1}&&)
@ 0000000000403070 main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
This is output from a process that has a potential deadlock involving 4 mutexes
and 4 threads:
- Thread 357250 acquired M1 while holding M0 (edge M0 -> M1)
- Thread 357251 acquired M2 while holding M1 (edge M1 -> M2)
- Thread 357247 acquired M3 while holding M2 (edge M2 -> M3)
- Thread 357248 acquired M0 while holding M3 (edge M3 -> M0)
This is the C++ program that generated the output above:
```c++
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
std::mutex global_mutex1;
std::mutex global_mutex2;
int main(void) {
static std::mutex static_mutex3;
std::mutex local_mutex4;
std::cout << "sleeping for a bit to allow trace to attach..." << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(10));
std::cout << "starting program..." << std::endl;
auto t1 = std::thread([] {
std::lock_guard<std::mutex> g1(global_mutex1);
std::lock_guard<std::mutex> g2(global_mutex2);
});
t1.join();
auto t2 = std::thread([] {
std::lock_guard<std::mutex> g2(global_mutex2);
std::lock_guard<std::mutex> g3(static_mutex3);
});
t2.join();
auto t3 = std::thread([&local_mutex4] {
std::lock_guard<std::mutex> g3(static_mutex3);
std::lock_guard<std::mutex> g4(local_mutex4);
});
t3.join();
auto t4 = std::thread([&local_mutex4] {
std::lock_guard<std::mutex> g4(local_mutex4);
std::lock_guard<std::mutex> g1(global_mutex1);
});
t4.join();
std::cout << "sleeping to allow trace to collect data..." << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(5));
std::cout << "done!" << std::endl;
}
```
Note that an actual deadlock did not occur, although this mutex lock ordering
creates the possibility of a deadlock; this is a hint to the programmer to
reconsider the lock ordering. If the mutexes are global or static and debug
symbols are enabled, the output will contain the mutex symbol name. The output
uses a format similar to ThreadSanitizer's
(https://github.com/google/sanitizers/wiki/ThreadSanitizerDeadlockDetector).
# ./deadlock_detector.py 181 --binary /usr/local/bin/lockinversion
Tracing... Hit Ctrl-C to end.
^C
If the traced process was started from a statically-linked executable, this
argument is optional, and the program will determine the path of the executable
from the pid. However, on older kernels without the patch
"uprobe: Find last occurrence of ':' when parsing uprobe PATH:OFFSET"
(https://lkml.org/lkml/2017/1/13/585), binaries that contain `:` in the path
cannot be attached with uprobes. As a workaround, create a symlink to the
binary and pass the symlink name to the `--binary` option instead.
# ./deadlock_detector.py 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
Tracing... Hit Ctrl-C to end.
^C
If the traced process was started from a dynamically-linked executable, this
argument is required and must be the path to the pthread shared library used by
the executable.
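One way to find that path for a running process is to scan its memory map. A
minimal sketch, assuming a Linux /proc filesystem (this helper is illustrative
and not part of the tool):

```python
# Sketch: locate the pthread library mapped into a process.
def find_pthread_lib(pid):
    with open('/proc/%d/maps' % pid) as maps:
        for line in maps:
            path = line.split()[-1]
            if 'libpthread' in path:
                return path
    return None

print(find_pthread_lib(181))  # e.g. /lib/x86_64-linux-gnu/libpthread.so.0
```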
# ./deadlock_detector.py 181 --dump-graph graph.json --verbose
Tracing... Hit Ctrl-C to end.
Mutexes: 0, Edges: 0
Mutexes: 532, Edges: 411
Mutexes: 735, Edges: 675
Mutexes: 1118, Edges: 1278
Mutexes: 1666, Edges: 2185
Mutexes: 2056, Edges: 2694
Mutexes: 2245, Edges: 2906
Mutexes: 2656, Edges: 3479
Mutexes: 2813, Edges: 3785
^C
If the program does not find a deadlock, it will keep running until you hit
Ctrl-C. If you pass the `--verbose` flag, the program will also print statistics
about the number of mutexes and edges in the mutex wait graph. If you want to
serialize the graph to analyze it later, you can pass the `--dump-graph FILE`
flag, and the program will serialize the graph as JSON.
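Assuming the dump follows networkx's node-link JSON layout (the tool's
`node_link_data()` call suggests as much, but treat this as a sketch), the
graph can be reloaded later for offline analysis:

```python
# Sketch: reload a wait graph produced with --dump-graph graph.json.
import json
from networkx.readwrite import json_graph

with open('graph.json') as f:
    g = json_graph.node_link_graph(json.load(f), directed=True)

for a, b, attrs in g.edges(data=True):
    print('mutex %x -> %x held by thread %d (%s)' %
          (a, b, attrs['thread_pid'], attrs['thread_comm']))
```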
# ./deadlock_detector.py 181 --lock-symbols custom_mutex1_lock,custom_mutex2_lock --unlock-symbols custom_mutex1_unlock,custom_mutex2_unlock --verbose
Tracing... Hit Ctrl-C to end.
Mutexes: 0, Edges: 0
Mutexes: 532, Edges: 411
Mutexes: 735, Edges: 675
Mutexes: 1118, Edges: 1278
Mutexes: 1666, Edges: 2185
Mutexes: 2056, Edges: 2694
Mutexes: 2245, Edges: 2906
Mutexes: 2656, Edges: 3479
Mutexes: 2813, Edges: 3785
^C
If your program is using custom mutexes and not pthread mutexes, you can use
the `--lock-symbols` and `--unlock-symbols` flags to specify different mutex
symbols to trace. The flags take a comma-separated string of symbol names.
Note that if the symbols are inlined in the binary, then this program can result
in false positives.
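Under the hood, each listed symbol simply becomes one more uprobe attachment.
A minimal sketch of that pattern with bcc's Python API (the handler body and
binary path here are placeholders, not the tool's actual program):

```python
# Sketch: attach one uprobe per comma-separated lock symbol.
from bcc import BPF

b = BPF(text="""
#include <uapi/linux/ptrace.h>
int trace_lock(struct pt_regs *ctx) { return 0; }
""")
binary = '/usr/local/bin/lockinversion'  # hypothetical path
for sym in 'custom_mutex1_lock,custom_mutex2_lock'.split(','):
    b.attach_uprobe(name=binary, sym=sym, fn_name='trace_lock', pid=181)
```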
USAGE message:
# ./deadlock_detector.py -h
usage: deadlock_detector.py [-h] [--binary BINARY] [--dump-graph DUMP_GRAPH]
[--verbose] [--lock-symbols LOCK_SYMBOLS]
[--unlock-symbols UNLOCK_SYMBOLS]
pid
Detect potential deadlocks (lock inversions) in a running binary.
Must be run as root.
positional arguments:
pid Pid to trace
optional arguments:
-h, --help show this help message and exit
--binary BINARY If set, trace the mutexes from the binary at this
path. For statically-linked binaries, this argument is
not required. For dynamically-linked binaries, this
argument is required and should be the path of the
pthread library the binary is using. Example:
/lib/x86_64-linux-gnu/libpthread.so.0
--dump-graph DUMP_GRAPH
If set, this will dump the mutex graph to the
specified file.
--verbose Print statistics about the mutex wait graph.
--lock-symbols LOCK_SYMBOLS
Comma-separated list of lock symbols to trace. Default
is pthread_mutex_lock. These symbols cannot be inlined
in the binary.
--unlock-symbols UNLOCK_SYMBOLS
Comma-separated list of unlock symbols to trace.
Default is pthread_mutex_unlock. These symbols cannot
be inlined in the binary.
Examples:
deadlock_detector 181 # Analyze PID 181
deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
# Analyze PID 181 and locks from this binary.
# If tracing a process that is running from
# a dynamically-linked binary, this argument
# is required and should be the path to the
# pthread library.
deadlock_detector 181 --verbose
# Analyze PID 181 and print statistics about
# the mutex wait graph.
deadlock_detector 181 --lock-symbols my_mutex_lock1,my_mutex_lock2 \
--unlock-symbols my_mutex_unlock1,my_mutex_unlock2
# Analyze PID 181 and trace custom mutex
# symbols instead of pthread mutexes.
deadlock_detector 181 --dump-graph graph.json
# Analyze PID 181 and dump the mutex wait
# graph to graph.json.
......@@ -197,9 +197,9 @@ def print_event(cpu, data, size):
if args.timestamp:
print("%-8.3f" % (time.time() - start_ts), end="")
ppid = get_ppid(event.pid)
print("%-16s %-6s %-6s %3s %s" % (event.comm, event.pid,
print("%-16s %-6s %-6s %3s %s" % (event.comm.decode(), event.pid,
ppid if ppid > 0 else "?", event.retval,
' '.join(argv[event.pid])))
b' '.join(argv[event.pid]).decode()))
del(argv[event.pid])
......
......@@ -43,6 +43,7 @@ class Probe(object):
func -- probe a kernel function
lib:func -- probe a user-space function in the library 'lib'
/path:func -- probe a user-space function in binary '/path'
p::func -- same thing as 'func'
p:lib:func -- same thing as 'lib:func'
t:cat:event -- probe a kernel tracepoint
......@@ -219,8 +220,11 @@ class Tool(object):
./funccount -Ti 5 'vfs_*' # output every 5 seconds, with timestamps
./funccount -p 185 'vfs_*' # count vfs calls for PID 185 only
./funccount t:sched:sched_fork # count calls to the sched_fork tracepoint
./funccount -p 185 u:node:gc* # count all GC USDT probes in node
./funccount -p 185 u:node:gc* # count all GC USDT probes in node, PID 185
./funccount c:malloc # count all malloc() calls in libc
./funccount go:os.* # count all "os.*" calls in libgo
./funccount -p 185 go:os.* # count all "os.*" calls in libgo, PID 185
./funccount ./test:read* # count "read*" calls in the ./test binary
"""
parser = argparse.ArgumentParser(
description="Count functions, tracepoints, and USDT probes",
......
......@@ -169,7 +169,7 @@ Ctrl-C has been hit.
User functions can be traced in executables or libraries, and per-process
filtering is allowed:
# ./funccount -p 1442 contentions:*
# ./funccount -p 1442 /home/ubuntu/contentions:*
Tracing 15 functions for "/home/ubuntu/contentions:*"... Hit Ctrl-C to end.
^C
FUNC COUNT
......@@ -180,6 +180,10 @@ insert_result 87186
is_prime 1252772
Detaching...
If /home/ubuntu is in the $PATH, then the following command will also work:
# ./funccount -p 1442 contentions:*
Counting libc write and read calls using regular expression syntax (-r):
......@@ -314,6 +318,8 @@ examples:
./funccount -Ti 5 'vfs_*' # output every 5 seconds, with timestamps
./funccount -p 185 'vfs_*' # count vfs calls for PID 185 only
./funccount t:sched:sched_fork # count calls to the sched_fork tracepoint
./funccount -p 185 u:node:gc* # count all GC USDT probes in node
./funccount -p 185 u:node:gc* # count all GC USDT probes in node, PID 185
./funccount c:malloc # count all malloc() calls in libc
./funccount go:os.* # count all "os.*" calls in libgo
./funccount -p 185 go:os.* # count all "os.*" calls in libgo, PID 185
./funccount ./test:read* # count "read*" calls in the ./test binary
......@@ -201,9 +201,10 @@ if not library:
b.attach_kretprobe(event_re=pattern, fn_name="trace_func_return")
matched = b.num_open_kprobes()
else:
b.attach_uprobe(name=library, sym_re=pattern, fn_name="trace_func_entry")
b.attach_uprobe(name=library, sym_re=pattern, fn_name="trace_func_entry",
pid=args.pid or -1)
b.attach_uretprobe(name=library, sym_re=pattern,
fn_name="trace_func_return")
fn_name="trace_func_return", pid=args.pid or -1)
matched = b.num_open_uprobes()
if matched == 0:
......
......@@ -18,8 +18,21 @@
from __future__ import print_function
from bcc import BPF
from time import strftime
import argparse
import ctypes as ct
examples = """examples:
./gethostlatency # trace all getaddrinfo/gethostbyname[2] calls
./gethostlatency -p 181 # only trace PID 181
"""
parser = argparse.ArgumentParser(
description="Show latency for getaddrinfo/gethostbyname[2] calls",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid", help="trace this PID only", type=int,
default=-1)
args = parser.parse_args()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
......@@ -34,7 +47,6 @@ struct val_t {
struct data_t {
u32 pid;
u64 ts;
u64 delta;
char comm[TASK_COMM_LEN];
char host[80];
......@@ -77,56 +89,42 @@ int do_return(struct pt_regs *ctx) {
bpf_probe_read(&data.host, sizeof(data.host), (void *)valp->host);
data.pid = valp->pid;
data.delta = tsp - valp->ts;
data.ts = tsp / 1000;
events.perf_submit(ctx, &data, sizeof(data));
start.delete(&pid);
return 0;
}
"""
b = BPF(text=bpf_text)
b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry")
b.attach_uprobe(name="c", sym="gethostbyname", fn_name="do_entry")
b.attach_uprobe(name="c", sym="gethostbyname2", fn_name="do_entry")
b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return")
b.attach_uretprobe(name="c", sym="gethostbyname", fn_name="do_return")
b.attach_uretprobe(name="c", sym="gethostbyname2", fn_name="do_return")
b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry", pid=args.pid)
b.attach_uprobe(name="c", sym="gethostbyname", fn_name="do_entry",
pid=args.pid)
b.attach_uprobe(name="c", sym="gethostbyname2", fn_name="do_entry",
pid=args.pid)
b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return",
pid=args.pid)
b.attach_uretprobe(name="c", sym="gethostbyname", fn_name="do_return",
pid=args.pid)
b.attach_uretprobe(name="c", sym="gethostbyname2", fn_name="do_return",
pid=args.pid)
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ts", ct.c_ulonglong),
("delta", ct.c_ulonglong),
("comm", ct.c_char * TASK_COMM_LEN),
("host", ct.c_char * 80)
]
start_ts = 0
prev_ts = 0
delta = 0
# header
print("%-9s %-6s %-16s %10s %s" % ("TIME", "PID", "COMM", "LATms", "HOST"))
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
global start_ts
global prev_ts
global delta
if start_ts == 0:
prev_ts = start_ts
if start_ts == 1:
delta = float(delta) + (event.ts - prev_ts)
print("%-9s %-6d %-16s %10.2f %s" % (strftime("%H:%M:%S"), event.pid,
event.comm, (event.delta / 1000000), event.host))
prev_ts = event.ts
start_ts = 1
# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
......
......@@ -19,3 +19,19 @@ TIME PID COMM LATms HOST
In this example, the first call to lookup "www.iovisor.org" took 90 ms, and
the second took 0 ms (cached). The slowest call in this example was to "foo",
which was an unsuccessful lookup.
USAGE message:
# ./gethostlatency -h
usage: gethostlatency [-h] [-p PID]
Show latency for getaddrinfo/gethostbyname[2] calls
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
examples:
./gethostlatency # trace all getaddrinfo/gethostbyname[2] calls
./gethostlatency -p 181 # only trace PID 181
......@@ -135,6 +135,8 @@ parser.add_argument("-z", "--min-size", type=int,
help="capture only allocations larger than this size")
parser.add_argument("-Z", "--max-size", type=int,
help="capture only allocations smaller than this size")
parser.add_argument("-O", "--obj", type=str, default="c",
help="attach to malloc & free in the specified object")
args = parser.parse_args()
......@@ -149,6 +151,7 @@ num_prints = args.count
top_stacks = args.top
min_size = args.min_size
max_size = args.max_size
obj = args.obj
if min_size is not None and max_size is not None and min_size > max_size:
print("min_size (-z) can't be greater than max_size (-Z)")
......@@ -251,11 +254,11 @@ bpf_program = BPF(text=bpf_source)
if not kernel_trace:
print("Attaching to malloc and free in pid %d, Ctrl+C to quit." % pid)
bpf_program.attach_uprobe(name="c", sym="malloc",
bpf_program.attach_uprobe(name=obj, sym="malloc",
fn_name="alloc_enter", pid=pid)
bpf_program.attach_uretprobe(name="c", sym="malloc",
bpf_program.attach_uretprobe(name=obj, sym="malloc",
fn_name="alloc_exit", pid=pid)
bpf_program.attach_uprobe(name="c", sym="free",
bpf_program.attach_uprobe(name=obj, sym="free",
fn_name="free_enter", pid=pid)
else:
print("Attaching to kmalloc and kfree, Ctrl+C to quit.")
......
......@@ -150,14 +150,16 @@ of the sampling rate applied.
USAGE message:
# ./memleak -h
usage: memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
[-s SAMPLE_RATE] [-d STACK_DEPTH] [-T TOP]
usage: memleak.py [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
[-s SAMPLE_RATE] [-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE]
[-O OBJ]
[interval] [count]
Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with malloc/free and kernel-mode
allocations made with kmalloc/kfree.
positional arguments:
interval interval in seconds to print outstanding allocations
count number of times to print the report before exiting
......@@ -175,13 +177,12 @@ optional arguments:
execute and trace the specified command
-s SAMPLE_RATE, --sample-rate SAMPLE_RATE
sample every N-th allocation to decrease the overhead
-d STACK_DEPTH, --stack_depth STACK_DEPTH
maximum stack depth to capture
-T TOP, --top TOP display only this many top allocating stacks (by size)
-z MIN_SIZE, --min-size MIN_SIZE
capture only allocations larger than this size
-Z MAX_SIZE, --max-size MAX_SIZE
capture only allocations smaller than this size
-O OBJ, --obj OBJ attach to malloc & free in the specified object
EXAMPLES:
......
......@@ -90,7 +90,7 @@ parser.add_argument("-a", "--annotations", action="store_true",
help="add _[k] annotations to kernel frames")
parser.add_argument("-f", "--folded", action="store_true",
help="output folded format, one line per stack (for flame graphs)")
parser.add_argument("--stack-storage-size", default=2048,
parser.add_argument("--stack-storage-size", default=10240,
type=positive_nonzero_int,
help="the number of unique stack traces that can be stored and "
"displayed (default 2048)")
......
......@@ -130,18 +130,20 @@ b = BPF(text=prog)
# on its exit (Mark Drayton)
#
if args.openssl:
b.attach_uprobe(name="ssl", sym="SSL_write", fn_name="probe_SSL_write")
b.attach_uprobe(name="ssl", sym="SSL_read", fn_name="probe_SSL_read_enter")
b.attach_uprobe(name="ssl", sym="SSL_write", fn_name="probe_SSL_write",
pid=args.pid or -1)
b.attach_uprobe(name="ssl", sym="SSL_read", fn_name="probe_SSL_read_enter",
pid=args.pid or -1)
b.attach_uretprobe(name="ssl", sym="SSL_read",
fn_name="probe_SSL_read_exit")
fn_name="probe_SSL_read_exit", pid=args.pid or -1)
if args.gnutls:
b.attach_uprobe(name="gnutls", sym="gnutls_record_send",
fn_name="probe_SSL_write")
fn_name="probe_SSL_write", pid=args.pid or -1)
b.attach_uprobe(name="gnutls", sym="gnutls_record_recv",
fn_name="probe_SSL_read_enter")
fn_name="probe_SSL_read_enter", pid=args.pid or -1)
b.attach_uretprobe(name="gnutls", sym="gnutls_record_recv",
fn_name="probe_SSL_read_exit")
fn_name="probe_SSL_read_exit", pid=args.pid or -1)
# define output data structure in Python
TASK_COMM_LEN = 16 # linux/sched.h
......
......@@ -44,16 +44,12 @@ bpf_text = """
#include <linux/sched.h>
struct val_t {
u32 pid;
u64 ts;
char comm[TASK_COMM_LEN];
const char *fname;
};
struct data_t {
u32 pid;
u64 ts;
u64 delta;
u64 ts_ns;
int ret;
char comm[TASK_COMM_LEN];
char fname[NAME_MAX];
......@@ -69,12 +65,8 @@ int trace_entry(struct pt_regs *ctx, const char __user *filename)
u32 pid = bpf_get_current_pid_tgid();
FILTER
if (bpf_get_current_comm(&val.comm, sizeof(val.comm)) == 0) {
val.pid = bpf_get_current_pid_tgid();
val.ts = bpf_ktime_get_ns();
val.fname = filename;
infotmp.update(&pid, &val);
}
val.fname = filename;
infotmp.update(&pid, &val);
return 0;
};
......@@ -83,20 +75,17 @@ int trace_return(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
struct val_t *valp;
struct data_t data = {};
u64 tsp = bpf_ktime_get_ns();
valp = infotmp.lookup(&pid);
if (valp == 0) {
// missed entry
return 0;
}
bpf_probe_read(&data.comm, sizeof(data.comm), valp->comm);
struct data_t data = {.pid = pid};
bpf_probe_read(&data.fname, sizeof(data.fname), (void *)valp->fname);
data.pid = valp->pid;
data.delta = tsp - valp->ts;
data.ts = tsp / 1000;
bpf_get_current_comm(&data.comm, sizeof(data.comm));
data.ts_ns = bpf_ktime_get_ns();
data.ret = PT_REGS_RC(ctx);
events.perf_submit(ctx, &data, sizeof(data));
......@@ -129,8 +118,7 @@ NAME_MAX = 255 # linux/limits.h
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ts", ct.c_ulonglong),
("delta", ct.c_ulonglong),
("ts_ns", ct.c_ulonglong),
("ret", ct.c_int),
("comm", ct.c_char * TASK_COMM_LEN),
("fname", ct.c_char * NAME_MAX)
......@@ -162,25 +150,14 @@ def print_event(cpu, data, size):
err = - event.ret
if start_ts == 0:
prev_ts = start_ts
if start_ts == 1:
delta = float(delta) + (event.ts - prev_ts)
if (args.failed and (event.ret >= 0)):
start_ts = 1
prev_ts = event.ts
return
start_ts = event.ts_ns
if args.timestamp:
print("%-14.9f" % (delta / 1000000), end="")
print("%-14.9f" % (float(event.ts_ns - start_ts) / 1000000000), end="")
print("%-6d %-16s %4d %3d %s" % (event.pid, event.comm,
fd_s, err, event.fname))
prev_ts = event.ts
start_ts = 1
# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
......
......@@ -26,7 +26,7 @@ parser.add_argument("-p", "--pid", type=int, default=None,
help="List USDT probes in the specified process")
parser.add_argument("-l", "--lib", default="",
help="List USDT probes in the specified library or executable")
parser.add_argument("-v", dest="verbosity", action="count",
parser.add_argument("-v", dest="verbosity", action="count", default=0,
help="Increase verbosity level (print variables, arguments, etc.)")
parser.add_argument(dest="filter", nargs="?",
help="A filter that specifies which probes/tracepoints to print")
......@@ -42,8 +42,6 @@ def print_tpoint_format(category, event):
parts = match.group(1).split()
field_name = parts[-1:][0]
field_type = " ".join(parts[:-1])
if "__data_loc" in field_type:
continue
if field_name.startswith("common_"):
continue
print(" %s %s;" % (field_type, field_name))
......@@ -68,7 +66,7 @@ def print_tracepoints():
def print_usdt_argument_details(location):
for idx in xrange(0, location.num_arguments):
arg = location.get_argument(idx)
print(" argument #%d %s" % (idx, arg))
print(" argument #%d %s" % (idx+1, arg))
def print_usdt_details(probe):
if args.verbosity > 0:
......@@ -76,7 +74,7 @@ def print_usdt_details(probe):
if args.verbosity > 1:
for idx in xrange(0, probe.num_locations):
loc = probe.get_location(idx)
print(" location #%d %s" % (idx, loc))
print(" location #%d %s" % (idx+1, loc))
print_usdt_argument_details(loc)
else:
print(" %d location(s)" % probe.num_locations)
......
......@@ -76,6 +76,7 @@ class Probe(object):
self.probe_num = Probe.probe_count
self.probe_name = "probe_%s_%d" % \
(self._display_function(), self.probe_num)
self.probe_name = re.sub(r'[^A-Za-z0-9_]', '_', self.probe_name)
def __str__(self):
return "%s:%s:%s FLT=%s ACT=%s/%s" % (self.probe_type,
......@@ -92,15 +93,24 @@ class Probe(object):
def _parse_probe(self):
text = self.raw_probe
# Everything until the first space is the probe specifier
first_space = text.find(' ')
spec = text[:first_space] if first_space >= 0 else text
# There might be a function signature preceding the actual
# filter/print part, or not. Find the probe specifier first --
# it ends with either a space or an open paren ( for the
# function signature part.
# opt. signature
# probespec | rest
# --------- ---------- --
(spec, sig, rest) = re.match(r'([^ \t\(]+)(\([^\(]*\))?(.*)',
text).groups()
self._parse_spec(spec)
if first_space >= 0:
text = text[first_space:].lstrip()
else:
text = ""
self.signature = sig[1:-1] if sig else None # remove the parens
if self.signature and self.probe_type in ['u', 't']:
self._bail("USDT and tracepoint probes can't have " +
"a function signature; use arg1, arg2, " +
"... instead")
text = rest.lstrip()
# If we now have a (, wait for the balanced closing ) and that
# will be the predicate
self.filter = None
......@@ -216,11 +226,11 @@ class Probe(object):
fname = "streq_%d" % Probe.streq_index
Probe.streq_index += 1
self.streq_functions += """
static inline bool %s(char const *ignored, unsigned long str) {
static inline bool %s(char const *ignored, uintptr_t str) {
char needle[] = %s;
char haystack[sizeof(needle)];
bpf_probe_read(&haystack, sizeof(haystack), (void *)str);
for (int i = 0; i < sizeof(needle); ++i) {
for (int i = 0; i < sizeof(needle) - 1; ++i) {
if (needle[i] != haystack[i]) {
return false;
}
......@@ -353,33 +363,35 @@ BPF_PERF_OUTPUT(%s);
def _generate_usdt_filter_read(self):
text = ""
if self.probe_type == "u":
for arg, _ in Probe.aliases.items():
if not (arg.startswith("arg") and
(arg in self.filter)):
continue
arg_index = int(arg.replace("arg", ""))
arg_ctype = self.usdt.get_probe_arg_ctype(
self.usdt_name, arg_index)
if not arg_ctype:
self._bail("Unable to determine type of {} "
"in the filter".format(arg))
text += """
if self.probe_type != "u":
return text
for arg, _ in Probe.aliases.items():
if not (arg.startswith("arg") and
(arg in self.filter)):
continue
arg_index = int(arg.replace("arg", ""))
arg_ctype = self.usdt.get_probe_arg_ctype(
self.usdt_name, arg_index - 1)
if not arg_ctype:
self._bail("Unable to determine type of {} "
"in the filter".format(arg))
text += """
{} {}_filter;
bpf_usdt_readarg({}, ctx, &{}_filter);
""".format(arg_ctype, arg, arg_index, arg)
self.filter = self.filter.replace(
arg, "{}_filter".format(arg))
""".format(arg_ctype, arg, arg_index, arg)
self.filter = self.filter.replace(
arg, "{}_filter".format(arg))
return text
def generate_program(self, include_self):
data_decl = self._generate_data_decl()
# kprobes don't have built-in pid filters, so we have to add
# it to the function body:
if len(self.library) == 0 and Probe.pid != -1:
if Probe.pid != -1:
pid_filter = """
if (__pid != %d) { return 0; }
""" % Probe.pid
# uprobes can have a built-in tgid filter passed to
# attach_uprobe, hence the check here -- for kprobes, we
# need to do the tgid test by hand:
elif len(self.library) == 0 and Probe.tgid != -1:
pid_filter = """
if (__tgid != %d) { return 0; }
......@@ -393,6 +405,8 @@ BPF_PERF_OUTPUT(%s);
prefix = ""
signature = "struct pt_regs *ctx"
if self.signature:
signature += ", " + self.signature
data_fields = ""
for i, expr in enumerate(self.values):
......@@ -469,10 +483,10 @@ BPF_PERF_OUTPUT(%s);
def _format_message(self, bpf, tgid, values):
# Replace each %K with kernel sym and %U with user sym in tgid
kernel_placeholders = [i for i in xrange(0, len(self.types))
if self.types[i] == 'K']
user_placeholders = [i for i in xrange(0, len(self.types))
if self.types[i] == 'U']
kernel_placeholders = [i for i, t in enumerate(self.types)
if t == 'K']
user_placeholders = [i for i, t in enumerate(self.types)
if t == 'U']
for kp in kernel_placeholders:
values[kp] = bpf.ksymaddr(values[kp])
for up in user_placeholders:
......@@ -541,12 +555,12 @@ BPF_PERF_OUTPUT(%s);
bpf.attach_uretprobe(name=libpath,
sym=self.function,
fn_name=self.probe_name,
pid=Probe.pid)
pid=Probe.tgid)
else:
bpf.attach_uprobe(name=libpath,
sym=self.function,
fn_name=self.probe_name,
pid=Probe.pid)
pid=Probe.tgid)
class Tool(object):
examples = """
......@@ -558,7 +572,7 @@ trace 'do_sys_open "%s", arg2'
Trace the open syscall and print the filename being opened
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
Trace the read syscall and print a message for reads >20000 bytes
trace 'r::do_sys_return "%llx", retval'
trace 'r::do_sys_open "%llx", retval'
Trace the return from the open syscall and print the return value
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
Trace the open() call from libc only if the flags (arg2) argument is 42
......@@ -574,6 +588,8 @@ trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
Trace the USDT probe pthread_create when its 4th argument is non-zero
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
Trace the nanosleep syscall and print the sleep duration in ns
"""
def __init__(self):
......@@ -608,7 +624,8 @@ trace 'u:pthread:pthread_create (arg4 != 0)'
help="probe specifier (see examples)")
parser.add_argument("-I", "--include", action="append",
metavar="header",
help="additional header files to include in the BPF program")
help="additional header files to include in the BPF program "
"as either full path, or relative to '/usr/include'")
self.args = parser.parse_args()
if self.args.tgid and self.args.pid:
parser.error("only one of -p and -t may be specified")
......@@ -628,7 +645,11 @@ trace 'u:pthread:pthread_create (arg4 != 0)'
"""
for include in (self.args.include or []):
self.program += "#include <%s>\n" % include
if include.startswith((".", "/")):
include = os.path.abspath(include)
self.program += "#include \"%s\"\n" % include
else:
self.program += "#include <%s>\n" % include
self.program += BPF.generate_auto_includes(
map(lambda p: p.raw_probe, self.probes))
for probe in self.probes:
......
......@@ -2,8 +2,8 @@ Demonstrations of trace.
trace probes functions you specify and displays trace messages if a particular
condition is met. You can control the message format to display function
arguments and return values.
For example, suppose you want to trace all commands being exec'd across the
system:
......@@ -135,6 +135,16 @@ TIME PID COMM FUNC -
In the previous invocation, arg1 and arg2 are the class name and method name
for the Ruby method being invoked.
You can also trace exported functions from shared libraries, or an imported
function on the actual executable:
# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
# tput -V
PID TID COMM FUNC -
21720 21720 tput curses_version Version=ncurses 6.0.20160709
^C
Occasionally, it can be useful to filter specific strings. For example, you
might be interested in open() calls that open a specific file:
......@@ -146,7 +156,30 @@ TIME PID COMM FUNC -
^C
As a final example, let's trace open syscalls for a specific process. By
In the preceding example, as well as in many others, readability may be
improved by providing the function's signature, which names the arguments and
lets you access structure sub-fields, which is hard with the "arg1", "arg2"
convention. For example:
# trace 'p:c:open(char *filename) "opening %s", filename'
PID TID COMM FUNC -
17507 17507 cat open opening FAQ.txt
^C
# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
PID TID COMM FUNC -
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
^C
Remember to use the -I argument to include the appropriate header file. We didn't
need to do that here because `struct timespec` is used internally by the tool,
so it always includes this header file.
As a final example, let's trace open syscalls for a specific process. By
default, tracing is system-wide, but the -p switch overrides this:
# trace -p 2740 'do_sys_open "%s", arg2' -T
......@@ -196,6 +229,7 @@ optional arguments:
-U, --user-stack output user stack trace
-I header, --include header
additional header files to include in the BPF program
as either full path, or relative to '/usr/include'
EXAMPLES:
......@@ -205,7 +239,7 @@ trace 'do_sys_open "%s", arg2'
Trace the open syscall and print the filename being opened
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
Trace the read syscall and print a message for reads >20000 bytes
trace 'r::do_sys_return "%llx", retval'
trace 'r::do_sys_open "%llx", retval'
Trace the return from the open syscall and print the return value
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
Trace the open() call from libc only if the flags (arg2) argument is 42
......@@ -221,3 +255,5 @@ trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
Trace the USDT probe pthread_create when its 4th argument is non-zero
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
Trace the nanosleep syscall and print the sleep duration in ns
......@@ -4,7 +4,7 @@
# ucalls Summarize method calls in high-level languages and/or system calls.
# For Linux, uses BCC, eBPF.
#
# USAGE: ucalls [-l {java,python,ruby}] [-h] [-T TOP] [-L] [-S] [-v] [-m]
# USAGE: ucalls [-l {java,python,ruby,php}] [-h] [-T TOP] [-L] [-S] [-v] [-m]
# pid [interval]
#
# Copyright 2016 Sasha Goldshtein
......@@ -24,7 +24,7 @@ examples = """examples:
./ucalls 6712 -S # trace only syscall counts
./ucalls -l ruby 1344 -T 10 # trace top 10 Ruby method calls
./ucalls -l ruby 1344 -L # trace Ruby calls including latency
./ucalls -l ruby 1344 -LS # trace Ruby calls and syscalls with latency
./ucalls -l php 443 -LS # trace PHP calls and syscalls with latency
./ucalls -l python 2020 -mL # trace Python calls including latency in ms
"""
parser = argparse.ArgumentParser(
......@@ -34,7 +34,8 @@ parser = argparse.ArgumentParser(
parser.add_argument("pid", type=int, help="process id to attach to")
parser.add_argument("interval", type=int, nargs='?',
help="print every specified number of seconds")
parser.add_argument("-l", "--language", choices=["java", "python", "ruby"],
parser.add_argument("-l", "--language",
choices=["java", "python", "ruby", "php"],
help="language to trace (if none, trace syscalls only)")
parser.add_argument("-T", "--top", type=int,
help="number of most frequent/slow calls to print")
......@@ -49,8 +50,8 @@ parser.add_argument("-m", "--milliseconds", action="store_true",
args = parser.parse_args()
# We assume that the entry and return probes have the same arguments. This is
# the case for Java, Python, and Ruby. If there's a language where it's not the
# case, we will need to build a custom correlator from entry to exit.
# the case for Java, Python, Ruby, and PHP. If there's a language where it's
# not the case, we will need to build a custom correlator from entry to exit.
if args.language == "java":
# TODO for JVM entries, we actually have the real length of the class
# and method strings in arg3 and arg5 respectively, so we can insert
......@@ -70,6 +71,11 @@ elif args.language == "ruby":
return_probe = "method__return"
read_class = "bpf_usdt_readarg(1, ctx, &clazz);"
read_method = "bpf_usdt_readarg(2, ctx, &method);"
elif args.language == "php":
entry_probe = "function__entry"
return_probe = "function__return"
read_class = "bpf_usdt_readarg(4, ctx, &clazz);"
read_method = "bpf_usdt_readarg(1, ctx, &method);"
elif not args.language:
if not args.syscalls:
print("Nothing to do; use -S to trace syscalls.")
......@@ -213,9 +219,9 @@ int syscall_return(struct pt_regs *ctx) {
if args.language:
usdt = USDT(pid=args.pid)
usdt.enable_probe(entry_probe, "trace_entry")
usdt.enable_probe_or_bail(entry_probe, "trace_entry")
if args.latency:
usdt.enable_probe(return_probe, "trace_return")
usdt.enable_probe_or_bail(return_probe, "trace_return")
else:
usdt = None
......@@ -236,25 +242,26 @@ if args.syscalls:
def get_data():
# Will be empty when no language was specified for tracing
if args.latency:
data = map(lambda (k, v): (k.clazz + "." + k.method,
(v.num_calls, v.total_ns)),
bpf["times"].items())
data = list(map(lambda kv: (kv[0].clazz + "." + kv[0].method,
(kv[1].num_calls, kv[1].total_ns)),
bpf["times"].items()))
else:
data = map(lambda (k, v): (k.clazz + "." + k.method, (v.value, 0)),
bpf["counts"].items())
data = list(map(lambda kv: (kv[0].clazz + "." + kv[0].method,
(kv[1].value, 0)),
bpf["counts"].items()))
if args.syscalls:
if args.latency:
syscalls = map(lambda (k, v): (bpf.ksym(k.value),
(v.num_calls, v.total_ns)),
syscalls = map(lambda kv: (bpf.ksym(kv[0].value),
(kv[1].num_calls, kv[1].total_ns)),
bpf["systimes"].items())
data.extend(syscalls)
else:
syscalls = map(lambda (k, v): (bpf.ksym(k.value), (v.value, 0)),
syscalls = map(lambda kv: (bpf.ksym(kv[0].value), (kv[1].value, 0)),
bpf["syscounts"].items())
data.extend(syscalls)
return sorted(data, key=lambda (k, v): v[1 if args.latency else 0])
return sorted(data, key=lambda kv: kv[1][1 if args.latency else 0])
def clear_data():
if args.latency:
......
......@@ -2,7 +2,7 @@ Demonstrations of ucalls.
ucalls summarizes method calls in various high-level languages, including Java,
Python, Ruby, and Linux system calls. It displays statistics on the most
Python, Ruby, PHP, and Linux system calls. It displays statistics on the most
frequently called methods, as well as the latency (duration) of these methods.
Through the syscalls support, ucalls can provide basic information on a
......@@ -60,7 +60,7 @@ METHOD # CALLS
USAGE message:
# ./ucalls.py -h
usage: ucalls.py [-h] [-l {java,python,ruby}] [-T TOP] [-L] [-S] [-v] [-m]
usage: ucalls.py [-h] [-l {java,python,ruby,php}] [-T TOP] [-L] [-S] [-v] [-m]
pid [interval]
Summarize method calls in high-level languages.
......@@ -71,7 +71,7 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
-l {java,python,ruby}, --language {java,python,ruby}
-l {java,python,ruby,php}, --language {java,python,ruby,php}
language to trace (if none, trace syscalls only)
-T TOP, --top TOP number of most frequent/slow calls to print
-L, --latency record method latency from enter to exit (except
......@@ -88,5 +88,5 @@ examples:
./ucalls 6712 -S # trace only syscall counts
./ucalls -l ruby 1344 -T 10 # trace top 10 Ruby method calls
./ucalls -l ruby 1344 -L # trace Ruby calls including latency
./ucalls -l ruby 1344 -LS # trace Ruby calls and syscalls with latency
./ucalls -l php 443 -LS # trace PHP calls and syscalls with latency
./ucalls -l python 2020 -mL # trace Python calls including latency in ms
......@@ -4,7 +4,7 @@
# uflow Trace method execution flow in high-level languages.
# For Linux, uses BCC, eBPF.
#
# USAGE: uflow [-C CLASS] [-M METHOD] [-v] {java,python,ruby} pid
# USAGE: uflow [-C CLASS] [-M METHOD] [-v] {java,python,ruby,php} pid
#
# Copyright 2016 Sasha Goldshtein
# Licensed under the Apache License, Version 2.0 (the "License")
......@@ -27,7 +27,7 @@ parser = argparse.ArgumentParser(
description="Trace method execution flow in high-level languages.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("language", choices=["java", "python", "ruby"],
parser.add_argument("language", choices=["java", "python", "ruby", "php"],
help="language to trace")
parser.add_argument("pid", type=int, help="process id to attach to")
parser.add_argument("-M", "--method",
......@@ -109,7 +109,7 @@ def enable_probe(probe_name, func_name, read_class, read_method, is_return):
.replace("FILTER_METHOD", filter_method) \
.replace("DEPTH", depth) \
.replace("UPDATE", update)
usdt.enable_probe(probe_name, func_name)
usdt.enable_probe_or_bail(probe_name, func_name)
usdt = USDT(pid=args.pid)
......@@ -140,6 +140,13 @@ elif args.language == "ruby":
enable_probe("cmethod__return", "ruby_creturn",
"bpf_usdt_readarg(1, ctx, &clazz);",
"bpf_usdt_readarg(2, ctx, &method);", is_return=True)
elif args.language == "php":
enable_probe("function__entry", "php_entry",
"bpf_usdt_readarg(4, ctx, &clazz);",
"bpf_usdt_readarg(1, ctx, &method);", is_return=False)
enable_probe("function__return", "php_return",
"bpf_usdt_readarg(4, ctx, &clazz);",
"bpf_usdt_readarg(1, ctx, &method);", is_return=True)
if args.verbose:
print(usdt.get_text())
......
......@@ -4,8 +4,8 @@ Demonstrations of uflow.
uflow traces method entry and exit events and prints a visual flow graph that
shows how methods are entered and exited, similar to a tracing debugger with
breakpoints. This can be useful for understanding program flow in high-level
languages such as Java, Python, and Ruby, which provide USDT probes for method
invocations.
languages such as Java, Python, Ruby, and PHP, which provide USDT probes for
method invocations.
For example, trace all Ruby method calls in a specific process:
......@@ -88,12 +88,13 @@ thread running on the same CPU.
USAGE message:
# ./uflow -h
usage: uflow.py [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby} pid
usage: uflow.py [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby,php} pid
Trace method execution flow in high-level languages.
positional arguments:
{java,python,ruby} language to trace
{java,python,ruby,php}
language to trace
pid process id to attach to
optional arguments:
......
......@@ -4,7 +4,7 @@
# ugc Summarize garbage collection events in high-level languages.
# For Linux, uses BCC, eBPF.
#
# USAGE: ugc [-v] [-m] {java,python,ruby,node} pid
# USAGE: ugc [-v] [-m] [-M MSEC] [-F FILTER] {java,python,ruby,node} pid
#
# Copyright 2016 Sasha Goldshtein
# Licensed under the Apache License, Version 2.0 (the "License")
......@@ -20,6 +20,7 @@ import time
examples = """examples:
./ugc java 185 # trace Java GCs in process 185
./ugc ruby 1344 -m # trace Ruby GCs reporting in ms
./ugc -M 10 java 185 # trace only Java GCs longer than 10ms
"""
parser = argparse.ArgumentParser(
description="Summarize garbage collection events in high-level languages.",
......@@ -32,6 +33,10 @@ parser.add_argument("-v", "--verbose", action="store_true",
help="verbose mode: print the BPF program (for debugging purposes)")
parser.add_argument("-m", "--milliseconds", action="store_true",
help="report times in milliseconds (default is microseconds)")
parser.add_argument("-M", "--minimum", type=int, default=0,
help="display only GCs longer than this many milliseconds")
parser.add_argument("-F", "--filter", type=str,
help="display only GCs whose description contains this text")
args = parser.parse_args()
usdt = USDT(pid=args.pid)
......@@ -85,17 +90,21 @@ int trace_%s(struct pt_regs *ctx) {
return 0; // missed the entry event on this thread
}
elapsed = bpf_ktime_get_ns() - e->start_ns;
if (elapsed < %d) {
return 0;
}
event.elapsed_ns = elapsed;
%s
gcs.perf_submit(ctx, &event, sizeof(event));
return 0;
}
""" % (self.begin, self.begin_save, self.end, self.end_save)
""" % (self.begin, self.begin_save, self.end,
args.minimum * 1000000, self.end_save)
return text
def attach(self):
usdt.enable_probe(self.begin, "trace_%s" % self.begin)
usdt.enable_probe(self.end, "trace_%s" % self.end)
usdt.enable_probe_or_bail(self.begin, "trace_%s" % self.begin)
usdt.enable_probe_or_bail(self.end, "trace_%s" % self.end)
def format(self, data):
return self.formatter(data)
......@@ -187,7 +196,7 @@ bpf = BPF(text=program, usdt_contexts=[usdt])
print("Tracing garbage collections in %s process %d... Ctrl-C to quit." %
(args.language, args.pid))
time_col = "TIME (ms)" if args.milliseconds else "TIME (us)"
print("%-8s %-40s %-8s" % ("START", "DESCRIPTION", time_col))
print("%-8s %-8s %-40s" % ("START", time_col, "DESCRIPTION"))
class GCEvent(ct.Structure):
_fields_ = [
......@@ -207,9 +216,10 @@ def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(GCEvent)).contents
elapsed = event.elapsed_ns/1000000 if args.milliseconds else \
event.elapsed_ns/1000
print("%-8.3f %-40s %-8.2f" % (time.time() - start_ts,
probes[event.probe_index].format(event),
elapsed))
description = probes[event.probe_index].format(event)
if args.filter and not args.filter in description:
return
print("%-8.3f %-8.2f %s" % (time.time() - start_ts, elapsed, description))
bpf["gcs"].open_perf_buffer(print_event)
while 1:
......
......@@ -8,45 +8,68 @@ the GC event is also provided.
For example, to trace all garbage collection events in a specific Node process:
# ./ugc node $(pidof node)
Tracing garbage collections in node process 3018... Ctrl-C to quit.
START DESCRIPTION TIME (us)
3.864 GC mark-sweep-compact 3189.00
4.937 GC scavenge 1254.00
4.940 GC scavenge 1657.00
4.943 GC scavenge 1171.00
4.949 GC scavenge 2216.00
4.954 GC scavenge 2515.00
4.960 GC scavenge 2243.00
4.966 GC scavenge 2410.00
4.976 GC scavenge 3003.00
4.986 GC scavenge 4174.00
4.994 GC scavenge 1508.00
5.003 GC scavenge 1966.00
5.010 GC scavenge 1636.00
5.022 GC scavenge 3564.00
5.035 GC scavenge 3275.00
5.045 GC incremental mark 157.00
5.049 GC mark-sweep-compact 3248.00
5.060 GC scavenge 4785.00
5.081 GC scavenge 6616.00
5.094 GC scavenge 8570.00
5.144 GC scavenge 456.00
7.188 GC scavenge 2345.00
7.227 GC scavenge 12054.00
7.253 GC scavenge 15626.00
7.304 GC scavenge 15329.00
7.384 GC scavenge 7168.00
7.411 GC scavenge 3794.00
7.414 GC incremental mark 123.00
7.430 GC mark-sweep-compact 7110.00
# ugc node $(pidof node)
Tracing garbage collections in node process 30012... Ctrl-C to quit.
START TIME (us) DESCRIPTION
1.500 1181.00 GC scavenge
1.505 1704.00 GC scavenge
1.509 1534.00 GC scavenge
1.515 1953.00 GC scavenge
1.519 2155.00 GC scavenge
1.525 2055.00 GC scavenge
1.530 2164.00 GC scavenge
1.536 2170.00 GC scavenge
1.541 2237.00 GC scavenge
1.547 1982.00 GC scavenge
1.551 2333.00 GC scavenge
1.557 2043.00 GC scavenge
1.561 2028.00 GC scavenge
1.573 3650.00 GC scavenge
1.580 4443.00 GC scavenge
1.604 6236.00 GC scavenge
1.615 8324.00 GC scavenge
1.659 11249.00 GC scavenge
1.678 16084.00 GC scavenge
1.747 15250.00 GC scavenge
1.937 191.00 GC incremental mark
2.001 63120.00 GC mark-sweep-compact
3.185 153.00 GC incremental mark
3.207 20847.00 GC mark-sweep-compact
^C
The above output shows some fairly long GCs, notably around 2 seconds in there
is a collection that takes over 60ms (mark-sweep-compact).
Occasionally, it might be useful to filter out collections that are very short,
or display only collections that have a specific description. The -M and -F
switches can be useful for this:
# ugc -F Tenured java $(pidof java)
Tracing garbage collections in java process 29907... Ctrl-C to quit.
START TIME (us) DESCRIPTION
0.360 4309.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
2.459 4232.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
4.648 4139.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
^C
# ugc -M 1 java $(pidof java)
Tracing garbage collections in java process 29907... Ctrl-C to quit.
START TIME (us) DESCRIPTION
0.160 3715.00 MarkSweepCompact Code Cache used=287528->3209472 max=173408256->251658240
0.160 3975.00 MarkSweepCompact Metaspace used=287528->3092104 max=173408256->18446744073709551615
0.160 4058.00 MarkSweepCompact Compressed Class Space used=287528->266840 max=173408256->1073741824
0.160 4110.00 MarkSweepCompact Eden Space used=287528->0 max=173408256->69337088
0.160 4159.00 MarkSweepCompact Survivor Space used=287528->0 max=173408256->8650752
0.160 4207.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
0.160 4289.00 used=0->0 max=0->0
^C
USAGE message:
# ./ugc -h
usage: ugc.py [-h] [-v] [-m] {java,python,ruby,node} pid
# ugc -h
usage: ugc.py [-h] [-v] [-m] [-M MINIMUM] [-F FILTER]
{java,python,ruby,node} pid
Summarize garbage collection events in high-level languages.
......@@ -60,7 +83,12 @@ optional arguments:
-v, --verbose verbose mode: print the BPF program (for debugging
purposes)
-m, --milliseconds report times in milliseconds (default is microseconds)
-M MINIMUM, --minimum MINIMUM
display only GCs longer than this many milliseconds
-F FILTER, --filter FILTER
display only GCs whose description contains this text
examples:
./ugc java 185 # trace Java GCs in process 185
./ugc ruby 1344 -m # trace Ruby GCs reporting in ms
./ugc -M 10 java 185 # trace only Java GCs longer than 10ms
......@@ -78,7 +78,7 @@ int alloc_entry(struct pt_regs *ctx) {
return 0;
}
"""
usdt.enable_probe("object__alloc", "alloc_entry")
usdt.enable_probe_or_bail("object__alloc", "alloc_entry")
#
# Ruby
#
......@@ -107,10 +107,10 @@ int object_alloc_entry(struct pt_regs *ctx) {
return 0;
}
"""
usdt.enable_probe("object__create", "object_alloc_entry")
usdt.enable_probe_or_bail("object__create", "object_alloc_entry")
for thing in ["string", "hash", "array"]:
program += create_template.replace("THETHING", thing)
usdt.enable_probe("%s__create" % thing, "%s_alloc_entry" % thing)
usdt.enable_probe_or_bail("%s__create" % thing, "%s_alloc_entry" % thing)
#
# C
#
......@@ -147,13 +147,13 @@ while True:
print()
data = bpf["allocs"]
if args.top_count:
data = sorted(data.items(), key=lambda (k, v): v.num_allocs)
data = sorted(data.items(), key=lambda kv: kv[1].num_allocs)
data = data[-args.top_count:]
elif args.top_size:
data = sorted(data.items(), key=lambda (k, v): v.total_size)
data = sorted(data.items(), key=lambda kv: kv[1].total_size)
data = data[-args.top_size:]
else:
data = sorted(data.items(), key=lambda (k, v): v.total_size)
data = sorted(data.items(), key=lambda kv: kv[1].total_size)
print("%-30s %8s %12s" % ("TYPE", "# ALLOCS", "# BYTES"))
for key, value in data:
if args.language == "c":
......
......@@ -5,7 +5,7 @@
# method calls, class loads, garbage collections, and more.
# For Linux, uses BCC, eBPF.
#
# USAGE: ustat [-l {java,python,ruby,node}] [-C]
# USAGE: ustat [-l {java,python,ruby,node,php}] [-C]
# [-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d]
# [interval [count]]
#
......@@ -132,7 +132,7 @@ class Tool(object):
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-l", "--language",
choices=["java", "python", "ruby", "node"],
choices=["java", "python", "ruby", "node", "php"],
help="language to trace (default: all languages)")
parser.add_argument("-C", "--noclear", action="store_true",
help="don't clear the screen")
......@@ -158,6 +158,11 @@ class Tool(object):
"function__entry": Category.METHOD,
"gc__start": Category.GC
}),
"php": Probe("php", ["php"], {
"function__entry": Category.METHOD,
"compile__file__entry": Category.CLOAD,
"exception__thrown": Category.EXCP
}),
"ruby": Probe("ruby", ["ruby", "irb"], {
"method__entry": Category.METHOD,
"cmethod__entry": Category.METHOD,
......@@ -239,10 +244,10 @@ class Tool(object):
counts.update(probe.get_counts(self.bpf))
targets.update(probe.targets)
if self.args.sort:
counts = sorted(counts.items(), key=lambda (_, v):
-v.get(self.args.sort.upper(), 0))
counts = sorted(counts.items(), key=lambda kv:
-kv[1].get(self.args.sort.upper(), 0))
else:
counts = sorted(counts.items(), key=lambda (k, _): k)
counts = sorted(counts.items(), key=lambda kv: kv[0])
for pid, stats in counts:
print("%-6d %-20s %-10d %-6d %-10d %-8d %-6d %-6d" % (
pid, targets[pid][:20],
......
......@@ -4,7 +4,7 @@ Demonstrations of ustat.
ustat is a "top"-like tool for monitoring events in high-level languages. It
prints statistics about garbage collections, method calls, object allocations,
and various other events for every process that it recognizes with a Java,
Python, Ruby, or Node runtime.
Python, Ruby, Node, or PHP runtime.
For example:
......@@ -48,7 +48,7 @@ PID CMDLINE METHOD/s GC/s OBJNEW/s CLOAD/s EXC/s THR/s
USAGE message:
# ./ustat.py -h
usage: ustat.py [-h] [-l {java,python,ruby,node}] [-C]
usage: ustat.py [-h] [-l {java,python,ruby,node,php}] [-C]
[-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d]
[interval] [count]
......@@ -60,7 +60,7 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
-l {java,python,ruby,node}, --language {java,python,ruby,node}
-l {java,python,ruby,node,php}, --language {java,python,ruby,node,php}
language to trace (default: all languages)
-C, --noclear don't clear the screen
-S {cload,excp,gc,method,objnew,thread}, --sort {cload,excp,gc,method,objnew,thread}
......
......@@ -57,7 +57,7 @@ int trace_pthread(struct pt_regs *ctx) {
return 0;
}
"""
usdt.enable_probe("pthread_start", "trace_pthread")
usdt.enable_probe_or_bail("pthread_start", "trace_pthread")
if args.language == "java":
template = """
......@@ -78,8 +78,8 @@ int %s(struct pt_regs *ctx) {
"""
program += template % ("trace_start", "start")
program += template % ("trace_stop", "stop")
usdt.enable_probe("thread__start", "trace_start")
usdt.enable_probe("thread__stop", "trace_stop")
usdt.enable_probe_or_bail("thread__start", "trace_start")
usdt.enable_probe_or_bail("thread__stop", "trace_stop")
if args.verbose:
print(usdt.get_text())
......