Commit c063fc4f authored by Brenden Blanco, committed by GitHub

Merge branch 'master' into patch

parents 4bb64cfb afb19da0
......@@ -2,6 +2,10 @@
*.swp
*.swo
*.pyc
.idea
# Build artefacts
# Build artifacts
/build/
cmake-build-debug
debian/**/*.log
obj-x86_64-linux-gnu
......@@ -7,6 +7,7 @@
- [Arch](#arch---aur)
- [Gentoo](#gentoo---portage)
* [Source](#source)
- [Debian](#debian---source)
- [Ubuntu](#ubuntu---source)
- [Fedora](#fedora---source)
* [Older Instructions](#older-instructions)
......@@ -163,6 +164,90 @@ The appropriate dependencies (e.g., ```clang```, ```llvm``` with BPF backend) wi
# Source
## Debian - Source
### Jessie
#### Repositories
The automated tests that run as part of the build process require `netperf`. Since netperf's license is not "certified"
as an open-source license, it is in Debian's `non-free` repository.
`/etc/apt/sources.list` should include the `non-free` repository and look something like this:
```
deb http://httpredir.debian.org/debian/ jessie main non-free
deb-src http://httpredir.debian.org/debian/ jessie main non-free
deb http://security.debian.org/ jessie/updates main non-free
deb-src http://security.debian.org/ jessie/updates main non-free
# jessie-updates, previously known as 'volatile'
deb http://ftp.us.debian.org/debian/ jessie-updates main non-free
deb-src http://ftp.us.debian.org/debian/ jessie-updates main non-free
```
BCC also requires kernel version 4.1 or above. Those kernels are available in the `jessie-backports` repository. To
add the `jessie-backports` repository to your system create the file `/etc/apt/sources.list.d/jessie-backports.list`
with the following contents:
```
deb http://httpredir.debian.org/debian jessie-backports main
deb-src http://httpredir.debian.org/debian jessie-backports main
```
#### Install Build Dependencies
Note: check for the latest `linux-image-4.x` version in `jessie-backports` before proceeding. Also, have a look at
the `Build-Depends:` section in the `debian/control` file.
```
# Before you begin
apt-get update
# Update kernel and linux-base package
apt-get -t jessie-backports install linux-base linux-image-4.8.0-0.bpo.2-amd64
# BCC build dependencies:
apt-get install debhelper cmake libllvm3.8 llvm-3.8-dev libclang-3.8-dev \
libelf-dev bison flex libedit-dev clang-format-3.8 python python-netaddr \
python-pyroute2 luajit libluajit-5.1-dev arping iperf netperf ethtool \
devscripts
```
#### Sudo
Adding eBPF probes to the kernel and removing probes from it requires root privileges. For the build to complete
successfully, you must build from an account with `sudo` access. (You may also build as root, but it is bad style.)
`/etc/sudoers` or `/etc/sudoers.d/build-user` should contain
```
build-user ALL = (ALL) NOPASSWD: ALL
```
or
```
build-user ALL = (ALL) ALL
```
If using the latter sudoers configuration, please keep an eye out for sudo's password prompt while the build is running.
#### Build
```
cd <preferred development directory>
git clone https://github.com/iovisor/bcc.git
cd bcc
debuild -b -uc -us
```
#### Install
```
cd ..
sudo dpkg -i *bcc*.deb
```
## Ubuntu - Source
To build the toolchain from source, one needs:
......@@ -219,9 +304,12 @@ sudo pip install pyroute2
wget http://llvm.org/releases/3.7.1/clang+llvm-3.7.1-x86_64-fedora22.tar.xz
sudo tar xf clang+llvm-3.7.1-x86_64-fedora22.tar.xz -C /usr/local --strip 1
# FC23 and FC24
# FC23
wget http://llvm.org/releases/3.9.0/clang+llvm-3.9.0-x86_64-fedora23.tar.xz
sudo tar xf clang+llvm-3.9.0-x86_64-fedora23.tar.xz -C /usr/local --strip 1
# FC24 and FC25
sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
```
### Install and compile BCC
......
......@@ -73,7 +73,7 @@ Examples:
- examples/tracing/[vfsreadlat.py](examples/tracing/vfsreadlat.py) examples/tracing/[vfsreadlat.c](examples/tracing/vfsreadlat.c): VFS read latency distribution. [Examples](examples/tracing/vfsreadlat_example.txt).
#### Tools:
<center><a href="images/bcc_tracing_tools_2016.png"><img src="images/bcc_tracing_tools_2016.png" border=0 width=700></a></center>
<center><a href="images/bcc_tracing_tools_2017.png"><img src="images/bcc_tracing_tools_2017.png" border=0 width=700></a></center>
- tools/[argdist](tools/argdist.py): Display function parameter values as a histogram or frequency count. [Examples](tools/argdist_example.txt).
- tools/[bashreadline](tools/bashreadline.py): Print entered bash commands system wide. [Examples](tools/bashreadline_example.txt).
- tools/[biolatency](tools/biolatency.py): Summarize block device I/O latency as a histogram. [Examples](tools/biolatency_example.txt).
......@@ -89,6 +89,7 @@ Examples:
- tools/[cpuunclaimed](tools/cpuunclaimed.py): Sample CPU run queues and calculate unclaimed idle CPU. [Examples](tools/cpuunclaimed_example.txt)
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[deadlock_detector](tools/deadlock_detector.py): Detect potential deadlocks on a running process. [Examples](tools/deadlock_detector_example.txt)
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[ext4dist](tools/ext4dist.py): Summarize ext4 operation latency distribution as a histogram. [Examples](tools/ext4dist_example.txt).
- tools/[ext4slower](tools/ext4slower.py): Trace slow ext4 operations. [Examples](tools/ext4slower_example.txt).
......
%bcond_with local_clang_static
#lua jit not available for some architectures
%ifarch ppc64 aarch64 ppc64le
%{!?with_lua: %global with_lua 0}
%else
%{!?with_lua: %global with_lua 1}
%endif
%define debug_package %{nil}
Name: bcc
......@@ -11,10 +17,12 @@ License: ASL 2.0
URL: https://github.com/iovisor/bcc
Source0: bcc.tar.gz
ExclusiveArch: x86_64
ExclusiveArch: x86_64 ppc64 aarch64 ppc64le
BuildRequires: bison cmake >= 2.8.7 flex make
BuildRequires: gcc gcc-c++ python2-devel elfutils-libelf-devel-static
%if %{with_lua}
BuildRequires: luajit luajit-devel
%endif
%if %{without local_clang_static}
BuildRequires: llvm-devel llvm-static
BuildRequires: clang-devel
......@@ -25,6 +33,11 @@ BuildRequires: pkgconfig ncurses-devel
Python bindings for BPF Compiler Collection (BCC). Control a BPF program from
userspace.
%if %{with_lua}
%global lua_include `pkg-config --variable=includedir luajit`
%global lua_libs `pkg-config --variable=libdir luajit`/lib`pkg-config --variable=libname luajit`.so
%global lua_config -DLUAJIT_INCLUDE_DIR=%{lua_include} -DLUAJIT_LIBRARIES=%{lua_libs}
%endif
%prep
%setup -q -n bcc
......@@ -35,8 +48,7 @@ mkdir build
pushd build
cmake .. -DREVISION_LAST=%{version} -DREVISION=%{version} \
-DCMAKE_INSTALL_PREFIX=/usr \
-DLUAJIT_INCLUDE_DIR=`pkg-config --variable=includedir luajit` \
-DLUAJIT_LIBRARIES=`pkg-config --variable=libdir luajit`/lib`pkg-config --variable=libname luajit`.so
%{?lua_config}
make %{?_smp_mflags}
popd
......@@ -56,16 +68,20 @@ Requires: libbcc = %{version}-%{release}
%description -n python-bcc
Python bindings for BPF Compiler Collection (BCC)
%if %{with_lua}
%package -n bcc-lua
Summary: Standalone tool to run BCC tracers written in Lua
Requires: libbcc = %{version}-%{release}
%description -n bcc-lua
Standalone tool to run BCC tracers written in Lua
%endif
%package -n libbcc-examples
Summary: Examples for BPF Compiler Collection (BCC)
Requires: python-bcc = %{version}-%{release}
%if %{with_lua}
Requires: bcc-lua = %{version}-%{release}
%endif
%description -n libbcc-examples
Examples for BPF Compiler Collection (BCC)
......@@ -82,8 +98,10 @@ Command line tools for BPF Compiler Collection (BCC)
%files -n python-bcc
%{python_sitelib}/bcc*
%if %{with_lua}
%files -n bcc-lua
/usr/bin/bcc-lua
%endif
%files -n libbcc-examples
/usr/share/bcc/examples/*
......
......@@ -3,7 +3,12 @@ Maintainer: Brenden Blanco <bblanco@plumgrid.com>
Section: misc
Priority: optional
Standards-Version: 3.9.5
Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8, llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev, libelf-dev, bison, flex, libedit-dev, clang-format | clang-format-3.7, python-netaddr, python-pyroute2, luajit, libluajit-5.1-dev
Build-Depends: debhelper (>= 9), cmake, libllvm3.7 | libllvm3.8,
llvm-3.7-dev | llvm-3.8-dev, libclang-3.7-dev | libclang-3.8-dev,
libelf-dev, bison, flex, libedit-dev,
clang-format | clang-format-3.7 | clang-format-3.8, python (>= 2.7),
python-netaddr, python-pyroute2, luajit, libluajit-5.1-dev, arping,
inetutils-ping | iputils-ping, iperf, netperf, ethtool, devscripts
Homepage: https://github.com/iovisor/bcc
Package: libbcc
......
......@@ -15,6 +15,7 @@ ARM64 | 3.18 | [e54bcde3d69d](https://git.kernel.org/cgit/linux/kernel/git/torva
s390 | 4.1 | [054623105728](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=054623105728b06852f077299e2bf1bf3d5f2b0b)
Constant blinding for JIT machines | 4.7 | [4f3446bb809f](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4f3446bb809f20ad56cadf712e6006815ae7a8f9)
PowerPC64 | 4.8 | [156d0e290e96](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=156d0e290e969caba25f1851c52417c14d141b24)
Constant blinding - PowerPC64 | 4.9 | [b7b7013cac55](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b7b7013cac55d794940bd9cb7b7c55c9dececac4)
## Main features
......@@ -24,7 +25,7 @@ Feature | Kernel version | Commit
Kernel helpers | 3.15 | [bd4cf0ed331a](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8)
`bpf()` syscall | 3.18 | [99c55f7d47c0](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=99c55f7d47c0dc6fc64729f37bf435abf43f4c60)
Tables (_a.k.a._ Maps; details below) | 3.18 | [99c55f7d47c0](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=99c55f7d47c0dc6fc64729f37bf435abf43f4c60)
BPF attached to sockets | 3.19 | [89aa075832b0](89aa075832b0da4402acebd698d0411dcc82d03e)
BPF attached to sockets | 3.19 | [89aa075832b0](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=89aa075832b0da4402acebd698d0411dcc82d03e)
BPF attached to `kprobes` | 4.1 | [2541517c32be](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2541517c32be2531e0da59dfd7efc1ce844644f5)
`cls_bpf` / `act_bpf` for `tc` | 4.1 | [e2e9b6541dd4](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e2e9b6541dd4b31848079da80fe2253daaafb549)
Tail calls | 4.2 | [04fd61ab36ec](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=04fd61ab36ec065e194ab5e74ae34a5240d992bb)
......@@ -53,9 +54,11 @@ Perf events | 4.3 | [ea317b267e9d](https://git.kernel.org/cgit/linux/kernel/git/
Per-CPU hash | 4.6 | [824bd0ce6c7c](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=824bd0ce6c7c43a9e1e210abf124958e54d88342)
Per-CPU array | 4.6 | [a10423b87a7e](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a10423b87a7eae75da79ce80a8d9475047a674ee)
Stack trace | 4.6 | [d5a3b1f69186](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d5a3b1f691865be576c2bffa708549b8cdccda19)
Pre-alloc maps memory | 4.6 | [6c9059817432](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6c90598174322b8888029e40dd84a4eb01f56afe)
cgroup array | 4.8 | [4ed8ec521ed5](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4ed8ec521ed57c4e207ad464ca0388776de74d4b)
LRU hash | [4.10](https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=29ba732acbeece1e34c68483d1ec1f3720fa1bb3) | [](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=29ba732acbeece1e34c68483d1ec1f3720fa1bb3)
LRU per-CPU hash | [4.10](https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=8f8449384ec364ba2a654f11f94e754e4ff719e0) | [](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8f8449384ec364ba2a654f11f94e754e4ff719e0)
LRU hash | 4.10 | [29ba732acbee](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=29ba732acbeece1e34c68483d1ec1f3720fa1bb3)
LRU per-CPU hash | 4.10 | [8f8449384ec3](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8f8449384ec364ba2a654f11f94e754e4ff719e0)
LPM trie | 4.11 | [b95a5c4db09b](https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=b95a5c4db09bc7c253636cb84dc9b12c577fd5a0)
Text string | _To be done?_ |
Variable-length maps | _To be done?_ |
......
.TH deadlock_detector 8 "2017-02-01" "USER COMMANDS"
.SH NAME
deadlock_detector \- Find potential deadlocks (lock order inversions)
in a running program.
.SH SYNOPSIS
.B deadlock_detector [\-h] [\--binary BINARY] [\--dump-graph DUMP_GRAPH]
.B [\--verbose] [\--lock-symbols LOCK_SYMBOLS]
.B [\--unlock-symbols UNLOCK_SYMBOLS]
.B pid
.SH DESCRIPTION
deadlock_detector finds potential deadlocks in a running process. The program
attaches uprobes on `pthread_mutex_lock` and `pthread_mutex_unlock` by default
to build a mutex wait directed graph, and then looks for a cycle in this graph.
This graph has the following properties:
- Nodes in the graph represent mutexes.
- Edge (A, B) exists if there exists some thread T where lock(A) was called
and lock(B) was called before unlock(A) was called.
If there is a cycle in this graph, this indicates that there is a lock order
inversion (potential deadlock). If the program finds a lock order inversion, the
program will dump the cycle of mutexes, dump the stack traces where each mutex
was acquired, and then exit.
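As a rough illustration of the cycle check described above, here is a minimal C++ sketch (assumptions: integer mutex IDs and a prebuilt adjacency list; the tool itself derives the graph from uprobe events and reports the stack traces, not just a boolean):
```cpp
// Minimal sketch of lock-order cycle detection (illustration only).
// Edge A -> B: some thread acquired mutex B while holding mutex A.
#include <iostream>
#include <map>
#include <set>
#include <vector>

using Graph = std::map<int, std::vector<int>>;

// DFS: revisiting a mutex already on the current path is a back edge,
// i.e. a lock order inversion.
bool has_cycle(const Graph& g, int node, std::set<int>& path,
               std::set<int>& done) {
  if (path.count(node)) return true;
  if (done.count(node)) return false;
  path.insert(node);
  auto it = g.find(node);
  if (it != g.end())
    for (int next : it->second)
      if (has_cycle(g, next, path, done)) return true;
  path.erase(node);
  done.insert(node);
  return false;
}

int main() {
  // Thread T: lock(M0) then lock(M1); thread S: lock(M1) then lock(M0).
  Graph g = {{0, {1}}, {1, {0}}};
  std::set<int> path, done;
  std::cout << (has_cycle(g, 0, path, done) ? "potential deadlock\n" : "ok\n");
}
```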
This program can only find potential deadlocks that occur while the program is
tracing the process. It cannot find deadlocks that may have occurred before the
program was attached to the process.
This tool does not work for shared mutexes or recursive mutexes.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc
.SH OPTIONS
.TP
\-h, --help
show this help message and exit
.TP
\--binary BINARY
If set, trace the mutexes from the binary at this path. For
statically-linked binaries, this argument is not required.
For dynamically-linked binaries, this argument is required and should be the
path of the pthread library the binary is using.
Example: /lib/x86_64-linux-gnu/libpthread.so.0
.TP
\--dump-graph DUMP_GRAPH
If set, this will dump the mutex graph to the specified file.
.TP
\--verbose
Print statistics about the mutex wait graph.
.TP
\--lock-symbols LOCK_SYMBOLS
Comma-separated list of lock symbols to trace. Default is pthread_mutex_lock.
These symbols cannot be inlined in the binary.
.TP
\--unlock-symbols UNLOCK_SYMBOLS
Comma-separated list of unlock symbols to trace. Default is
pthread_mutex_unlock. These symbols cannot be inlined in the binary.
.TP
pid
Pid to trace
.SH EXAMPLES
.TP
Find potential deadlocks in PID 181. The --binary argument is not needed for \
statically-linked binaries.
#
.B deadlock_detector 181
.TP
Find potential deadlocks in PID 181. If the process was created from a \
dynamically-linked executable, the --binary argument is required and must be \
the path of the pthread library:
#
.B deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
.TP
Find potential deadlocks in PID 181. If the process was created from a \
statically-linked executable, optionally pass the location of the binary. \
On older kernels without https://lkml.org/lkml/2017/1/13/585, binaries that \
contain `:` in the path cannot be attached with uprobes. As a workaround, we \
can create a symlink to the binary, and provide the symlink name instead with \
the `--binary` option:
#
.B deadlock_detector 181 --binary /usr/local/bin/lockinversion
.TP
Find potential deadlocks in PID 181 and dump the mutex wait graph to a file:
#
.B deadlock_detector 181 --dump-graph graph.json
.TP
Find potential deadlocks in PID 181 and print mutex wait graph statistics:
#
.B deadlock_detector 181 --verbose
.TP
Find potential deadlocks in PID 181 with custom mutexes:
#
.B deadlock_detector 181
.B --lock-symbols custom_mutex1_lock,custom_mutex2_lock
.B --unlock-symbols custom_mutex1_unlock,custom_mutex2_unlock
.SH OUTPUT
This program does not output any fields. Rather, it will keep running until
it finds a potential deadlock, or the user hits Ctrl-C. If the program finds
a potential deadlock, it will output the stack traces and lock order inversion
in the following format and exit:
.TP
Potential Deadlock Detected!
.TP
Cycle in lock order graph: Mutex M0 => Mutex M1 => Mutex M0
.TP
Mutex M1 acquired here while holding Mutex M0 in Thread T:
.B [stack trace]
.TP
Mutex M0 previously acquired by the same Thread T here:
.B [stack trace]
.TP
Mutex M0 acquired here while holding Mutex M1 in Thread S:
.B [stack trace]
.TP
Mutex M1 previously acquired by the same Thread S here:
.B [stack trace]
.TP
Thread T created by Thread R here:
.B [stack trace]
.TP
Thread S created by Thread Q here:
.B [stack trace]
.SH OVERHEAD
This traces all mutex lock and unlock events and all thread creation events
on the traced process. The overhead of this can be high if the process has many
threads and mutexes. You should only run this on a process where the slowdown
is acceptable.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Kenny Yu
......@@ -21,6 +21,10 @@ and may need modifications to match your software and processor architecture.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-p PID
Trace this process ID only.
.SH EXAMPLES
.TP
Trace host lookups (getaddrinfo/gethostbyname[2]) system wide:
......
......@@ -3,7 +3,7 @@
memleak \- Print a summary of outstanding allocations and their call stacks to detect memory leaks. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND] [-s SAMPLE_RATE]
[-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
[-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [-O OBJ] [INTERVAL] [COUNT]
.SH DESCRIPTION
memleak traces and matches memory allocation and deallocation requests, and
collects call stacks for each allocation. memleak can then print a summary
......@@ -53,6 +53,9 @@ Capture only allocations that are larger than or equal to MIN_SIZE bytes.
\-Z MAX_SIZE
Capture only allocations that are smaller than or equal to MAX_SIZE bytes.
.TP
\-O OBJ
Attach to malloc and free in specified object instead of resolving libc. Ignored when kernel allocations are profiled.
.TP
INTERVAL
Print a summary of outstanding allocations and their call stacks every INTERVAL seconds.
The default interval is 5 seconds.
......
......@@ -49,7 +49,7 @@ Show stacks from kernel space only (no user space stacks).
.TP
\-\-stack-storage-size COUNT
The maximum number of unique stack traces that the kernel will count (default
2048). If the sampled count exceeds this, a warning will be printed.
10240). If the sampled count exceeds this, a warning will be printed.
.TP
duration
Duration to trace, in seconds.
......
......@@ -25,7 +25,8 @@ CONFIG_BPF and bcc.
Print usage message.
.TP
\-t
Include a timestamp column.
Include a timestamp column, in seconds since the first event, with decimal
places.
.TP
\-x
Only print failed stats.
......
......@@ -62,7 +62,7 @@ information. See PROBE SYNTAX below.
.SH PROBE SYNTAX
The general probe syntax is as follows:
.B [{p,r}]:[library]:function [(predicate)] ["format string"[, arguments]]
.B [{p,r}]:[library]:function[(signature)] [(predicate)] ["format string"[, arguments]]
.B {t:category:event,u:library:probe} [(predicate)] ["format string"[, arguments]]
.TP
......@@ -84,6 +84,12 @@ The tracepoint category. For example, "sched" or "irq".
.B function
The function to probe.
.TP
.B signature
The optional signature of the function to probe. This can make it easier to
access the function's arguments, instead of using the "arg1", "arg2" etc.
argument specifiers. For example, "(struct timespec *ts)" in the signature
position lets you use "ts" in the filter or print expressions.
.TP
.B event
The tracepoint event. For example, "block_rq_complete".
.TP
......@@ -159,6 +165,10 @@ Trace the block:block_rq_complete tracepoint and print the number of sectors com
Trace the pthread_create USDT probe from the pthread library and print the address of the thread's start function:
#
.B trace 'u:pthread:pthread_create """start addr = %llx"", arg3'
.TP
Trace the nanosleep system call and print the sleep duration in nanoseconds:
#
.B trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
.SH SOURCE
This is from bcc.
.IP
......
......@@ -2,28 +2,29 @@
.SH NAME
ucalls \- Summarize method calls from high-level languages and Linux syscalls.
.SH SYNOPSIS
.B ucalls [-l {java,python,ruby}] [-h] [-T TOP] [-L] [-S] [-v] [-m] pid [interval]
.B ucalls [-l {java,python,ruby,php}] [-h] [-T TOP] [-L] [-S] [-v] [-m] pid [interval]
.SH DESCRIPTION
This tool summarizes method calls from high-level languages such as Python,
Java, and Ruby. It can also trace Linux system calls. Whenever a method is
Java, Ruby, and PHP. It can also trace Linux system calls. Whenever a method is
invoked, ucalls records the call count and optionally the method's execution
time (latency) and displays a summary.
This uses in-kernel eBPF maps to store per process summaries for efficiency.
This tool relies on USDT probes embedded in many high-level languages, such as
Node, Java, Python, and Ruby. It requires a runtime instrumented with these
Java, Python, Ruby, and PHP. It requires a runtime instrumented with these
probes, which in some cases requires building from source with a USDT-specific
flag, such as "--enable-dtrace" or "--with-dtrace". For Java, method probes are
not enabled by default, and can be turned on by running the Java process with
the "-XX:+ExtendedDTraceProbes" flag.
the "-XX:+ExtendedDTraceProbes" flag. For PHP processes, the environment
variable USE_ZEND_DTRACE must be set to 1.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-l {java,python,ruby,node}
\-l {java,python,ruby,php}
The language to trace. If not provided, only syscalls are traced (when the \-S
option is used).
.TP
......
......@@ -2,16 +2,17 @@
.SH NAME
uflow \- Print a flow graph of method calls in high-level languages.
.SH SYNOPSIS
.B uflow [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby} pid
.B uflow [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby,php} pid
.SH DESCRIPTION
uflow traces method calls and prints them in a flow graph that can facilitate
debugging and diagnostics by following the program's execution (method flow).
This tool relies on USDT probes embedded in many high-level languages, such as
Node, Java, Python, and Ruby. It requires a runtime instrumented with these
Java, Python, Ruby, and PHP. It requires a runtime instrumented with these
probes, which in some cases requires building from source with a USDT-specific
flag, such as "--enable-dtrace" or "--with-dtrace". For Java processes, the
startup flag "-XX:+ExtendedDTraceProbes" is required.
startup flag "-XX:+ExtendedDTraceProbes" is required. For PHP processes, the
environment variable USE_ZEND_DTRACE must be set to 1.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
......@@ -29,7 +30,7 @@ name interpretation strongly depends on the language. For example, in Java use
\-v
Print the resulting BPF program, for debugging purposes.
.TP
{java,python,ruby}
{java,python,ruby,php}
The language to trace.
.TP
pid
......
......@@ -2,7 +2,7 @@
.SH NAME
ugc \- Trace garbage collection events in high-level languages.
.SH SYNOPSIS
.B ugc [-h] [-v] [-m] {java,python,ruby,node} pid
.B ugc [-h] [-v] [-m] [-M MINIMUM] [-F FILTER] {java,python,ruby,node} pid
.SH DESCRIPTION
This traces garbage collection events as they occur, including their duration
and any additional information (such as generation collected or type of GC)
......@@ -24,6 +24,18 @@ Print the resulting BPF program, for debugging purposes.
\-m
Print times in milliseconds. The default is microseconds.
.TP
\-M MINIMUM
Display only collections that are longer than this threshold. The value is
given in milliseconds. The default is to display all collections.
.TP
\-F FILTER
Display only collections whose textual description matches (contains) this
string. The default is to display all collections. Note that the filtering here
is performed in user-space, and not as part of the BPF program. This means that
if you have thousands of collection events, specifying this filter will not
reduce the amount of data that has to be transferred from the BPF program to
the user-space script.
.TP
{java,python,ruby,node}
The language to trace.
.TP
......@@ -39,16 +51,22 @@ Trace garbage collections in a specific Java process, and print GC times in
milliseconds:
#
.B ugc -m java 6004
.TP
Trace garbage collections in a specific Java process, and display them only if
they are longer than 10ms and have the string "Tenured" in their detailed
description:
#
.B ugc -M 10 -F Tenured java 6004
.SH FIELDS
.TP
START
The start time of the GC, in seconds from the beginning of the trace.
.TP
DESCRIPTION
The runtime-provided description of this garbage collection event.
.TP
TIME
The duration of the garbage collection event.
.TP
DESCRIPTION
The runtime-provided description of this garbage collection event.
.SH OVERHEAD
Garbage collection events, even if frequent, should not produce a considerable
overhead when traced because they are still not very common. Even hundreds of
......
......@@ -2,7 +2,7 @@
.SH NAME
ustat \- Activity stats from high-level languages.
.SH SYNOPSIS
.B ustat [-l {java,python,ruby,node}] [-C] [-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d] [interval [count]]
.B ustat [-l {java,python,ruby,node,php}] [-C] [-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d] [interval [count]]
.SH DESCRIPTION
This is "top" for high-level language events, such as garbage collections,
exceptions, thread creations, object allocations, method calls, and more. The
......@@ -12,11 +12,12 @@ can be sorted by various fields.
This uses in-kernel eBPF maps to store per process summaries for efficiency.
This tool relies on USDT probes embedded in many high-level languages, such as
Node, Java, Python, and Ruby. It requires a runtime instrumented with these
Node, Java, Python, Ruby, and PHP. It requires a runtime instrumented with these
probes, which in some cases requires building from source with a USDT-specific
flag, such as "--enable-dtrace" or "--with-dtrace". For Java, some probes are
not enabled by default, and can be turned on by running the Java process with
the "-XX:+ExtendedDTraceProbes" flag.
the "-XX:+ExtendedDTraceProbes" flag. For PHP processes, the environment
variable USE_ZEND_DTRACE must be set to 1.
Newly-created processes will only be traced at the next interval. If you run
this tool with a short interval (say, 1-5 seconds), this should be virtually
......@@ -28,7 +29,7 @@ Since this uses BPF, only the root user can use this tool.
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-l {java,python,ruby,node}
\-l {java,python,ruby,node,php}
The language to trace. By default, all languages are traced.
.TP
\-C
......
......@@ -24,14 +24,20 @@ pushd $TMP
tar xf bcc_$revision.orig.tar.gz
cd bcc
debuild=debuild
if [[ "$buildtype" = "test" ]]; then
# when testing, use faster compression options
debuild+=" --preserve-envvar PATH"
echo -e '#!/bin/bash\nexec /usr/bin/dpkg-deb -z1 "$@"' \
| sudo tee /usr/local/bin/dpkg-deb
sudo chmod +x /usr/local/bin/dpkg-deb
dch -b -v $revision-$release "$git_subject"
fi
if [[ "$buildtype" = "nightly" ]]; then
dch -v $revision-$release "$git_subject"
fi
DEB_BUILD_OPTIONS="nocheck parallel=${PARALLEL}" debuild -us -uc
DEB_BUILD_OPTIONS="nocheck parallel=${PARALLEL}" $debuild -us -uc
popd
cp $TMP/*.deb .
......@@ -16,7 +16,7 @@
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
name: bcc
version: 0.2.0-20161215-1402-7151673
version: 0.2.0-20170208-1555-3e77af5
summary: BPF Compiler Collection (BCC)
description: A toolkit for creating efficient kernel tracing and manipulation programs
confinement: strict
......@@ -46,12 +46,18 @@ apps:
command: wrapper cachestat
cachetop:
command: wrapper cachetop
capable:
command: wrapper capable
cpudist:
command: wrapper cpudist
cpuunclaimed:
command: wrapper cpuunclaimed
dcsnoop:
command: wrapper dcsnoop
dcstat:
command: wrapper dcstat
deadlock-detector:
command: wrapper deadlock_detector
execsnoop:
command: wrapper execsnoop
ext4dist:
......@@ -74,10 +80,14 @@ apps:
command: wrapper hardirqs
killsnoop:
command: wrapper killsnoop
llcstat:
command: wrapper llcstat
mdflush:
command: wrapper mdflush
memleak:
command: wrapper memleak
mountsnoop:
command: wrapper mountsnoop
offcputime:
command: wrapper offcputime
offwaketime:
......@@ -88,12 +98,18 @@ apps:
command: wrapper opensnoop
pidpersec:
command: wrapper pidpersec
profile:
command: wrapper profile
runqlat:
command: wrapper runqlat
runqlen:
command: wrapper runqlen
slabratetop:
command: wrapper slabratetop
softirqs:
command: wrapper softirqs
solisten:
command: wrapper solisten
sslsniff:
command: wrapper sslsniff
stackcount:
......@@ -116,10 +132,24 @@ apps:
command: wrapper tcpretrans
tcptop:
command: wrapper tcptop
ttysnoop:
command: wrapper ttysnop
tplist:
command: wrapper tplist
trace:
command: wrapper trace
ttysnoop:
command: wrapper ttysnoop
ucalls:
command: wrapper ucalls
uflow:
command: wrapper uflow
ugc:
command: wrapper ugc
uobjnew:
command: wrapper uobjnew
ustat:
command: wrapper ustat
uthreads:
command: wrapper uthreads
vfscount:
command: wrapper vfscount
vfsstat:
......
......@@ -30,6 +30,7 @@
#include "bpf_module.h"
#include "libbpf.h"
#include "perf_reader.h"
#include "common.h"
#include "usdt.h"
#include "BPF.h"
......@@ -146,7 +147,7 @@ StatusTuple BPF::detach_all() {
StatusTuple BPF::attach_kprobe(const std::string& kernel_func,
const std::string& probe_func,
bpf_attach_type attach_type,
bpf_probe_attach_type attach_type,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void* cb_cookie) {
std::string probe_event = get_kprobe_event(kernel_func, attach_type);
......@@ -156,11 +157,8 @@ StatusTuple BPF::attach_kprobe(const std::string& kernel_func,
int probe_fd;
TRY2(load_func(probe_func, BPF_PROG_TYPE_KPROBE, probe_fd));
std::string probe_event_desc = attach_type_prefix(attach_type);
probe_event_desc += ":kprobes/" + probe_event + " " + kernel_func;
void* res =
bpf_attach_kprobe(probe_fd, probe_event.c_str(), probe_event_desc.c_str(),
bpf_attach_kprobe(probe_fd, attach_type, probe_event.c_str(), kernel_func.c_str(),
pid, cpu, group_fd, cb, cb_cookie);
if (!res) {
......@@ -181,7 +179,7 @@ StatusTuple BPF::attach_uprobe(const std::string& binary_path,
const std::string& symbol,
const std::string& probe_func,
uint64_t symbol_addr,
bpf_attach_type attach_type,
bpf_probe_attach_type attach_type,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void* cb_cookie) {
bcc_symbol sym = bcc_symbol();
......@@ -195,13 +193,9 @@ StatusTuple BPF::attach_uprobe(const std::string& binary_path,
int probe_fd;
TRY2(load_func(probe_func, BPF_PROG_TYPE_KPROBE, probe_fd));
std::string probe_event_desc = attach_type_prefix(attach_type);
probe_event_desc += ":uprobes/" + probe_event + " ";
probe_event_desc += binary_path + ":0x" + uint_to_hex(sym.offset);
void* res =
bpf_attach_uprobe(probe_fd, probe_event.c_str(), probe_event_desc.c_str(),
pid, cpu, group_fd, cb, cb_cookie);
bpf_attach_uprobe(probe_fd, attach_type, probe_event.c_str(), binary_path.c_str(),
sym.offset, pid, cpu, group_fd, cb, cb_cookie);
if (!res) {
TRY2(unload_func(probe_func));
......@@ -297,11 +291,12 @@ StatusTuple BPF::attach_perf_event(uint32_t ev_type, uint32_t ev_config,
TRY2(load_func(probe_func, BPF_PROG_TYPE_PERF_EVENT, probe_fd));
auto fds = new std::map<int, int>();
int cpu_st = 0;
int cpu_en = sysconf(_SC_NPROCESSORS_ONLN) - 1;
std::vector<int> cpus;
if (cpu >= 0)
cpu_st = cpu_en = cpu;
for (int i = cpu_st; i <= cpu_en; i++) {
cpus.push_back(cpu);
else
cpus = get_online_cpus();
for (int i: cpus) {
int fd = bpf_attach_perf_event(probe_fd, ev_type, ev_config, sample_period,
sample_freq, pid, i, group_fd);
if (fd < 0) {
......@@ -323,7 +318,7 @@ StatusTuple BPF::attach_perf_event(uint32_t ev_type, uint32_t ev_config,
}
StatusTuple BPF::detach_kprobe(const std::string& kernel_func,
bpf_attach_type attach_type) {
bpf_probe_attach_type attach_type) {
std::string event = get_kprobe_event(kernel_func, attach_type);
auto it = kprobes_.find(event);
......@@ -339,7 +334,7 @@ StatusTuple BPF::detach_kprobe(const std::string& kernel_func,
StatusTuple BPF::detach_uprobe(const std::string& binary_path,
const std::string& symbol, uint64_t symbol_addr,
bpf_attach_type attach_type) {
bpf_probe_attach_type attach_type) {
bcc_symbol sym = bcc_symbol();
TRY2(check_binary_symbol(binary_path, symbol, symbol_addr, &sym));
......@@ -421,7 +416,7 @@ void BPF::poll_perf_buffer(const std::string& name, int timeout) {
}
StatusTuple BPF::load_func(const std::string& func_name,
enum bpf_prog_type type, int& fd) {
bpf_prog_type type, int& fd) {
if (funcs_.find(func_name) != funcs_.end()) {
fd = funcs_[func_name];
return StatusTuple(0);
......@@ -462,7 +457,7 @@ StatusTuple BPF::check_binary_symbol(const std::string& binary_path,
const std::string& symbol,
uint64_t symbol_addr, bcc_symbol* output) {
int res = bcc_resolve_symname(binary_path.c_str(), symbol.c_str(),
symbol_addr, output);
symbol_addr, 0, output);
if (res < 0)
return StatusTuple(
-1, "Unable to find offset for binary %s symbol %s address %lx",
......@@ -471,14 +466,14 @@ StatusTuple BPF::check_binary_symbol(const std::string& binary_path,
}
std::string BPF::get_kprobe_event(const std::string& kernel_func,
bpf_attach_type type) {
bpf_probe_attach_type type) {
std::string res = attach_type_prefix(type) + "_";
res += sanitize_str(kernel_func, &BPF::kprobe_event_validator);
return res;
}
std::string BPF::get_uprobe_event(const std::string& binary_path,
uint64_t offset, bpf_attach_type type) {
uint64_t offset, bpf_probe_attach_type type) {
std::string res = attach_type_prefix(type) + "_";
res += sanitize_str(binary_path, &BPF::uprobe_path_validator);
res += "_0x" + uint_to_hex(offset);
......@@ -492,8 +487,7 @@ StatusTuple BPF::detach_kprobe_event(const std::string& event,
attr.reader_ptr = nullptr;
}
TRY2(unload_func(attr.func));
std::string detach_event = "-:kprobes/" + event;
if (bpf_detach_kprobe(detach_event.c_str()) < 0)
if (bpf_detach_kprobe(event.c_str()) < 0)
return StatusTuple(-1, "Unable to detach kprobe %s", event.c_str());
return StatusTuple(0);
}
......@@ -505,8 +499,7 @@ StatusTuple BPF::detach_uprobe_event(const std::string& event,
attr.reader_ptr = nullptr;
}
TRY2(unload_func(attr.func));
std::string detach_event = "-:uprobes/" + event;
if (bpf_detach_uprobe(detach_event.c_str()) < 0)
if (bpf_detach_uprobe(event.c_str()) < 0)
return StatusTuple(-1, "Unable to detach uprobe %s", event.c_str());
return StatusTuple(0);
}
......
......@@ -29,11 +29,6 @@
namespace ebpf {
enum class bpf_attach_type {
probe_entry,
probe_return
};
struct open_probe_t {
void* reader_ptr;
std::string func;
......@@ -56,23 +51,23 @@ public:
StatusTuple attach_kprobe(
const std::string& kernel_func, const std::string& probe_func,
bpf_attach_type attach_type = bpf_attach_type::probe_entry,
bpf_probe_attach_type = BPF_PROBE_ENTRY,
pid_t pid = -1, int cpu = 0, int group_fd = -1,
perf_reader_cb cb = nullptr, void* cb_cookie = nullptr);
StatusTuple detach_kprobe(
const std::string& kernel_func,
bpf_attach_type attach_type = bpf_attach_type::probe_entry);
bpf_probe_attach_type attach_type = BPF_PROBE_ENTRY);
StatusTuple attach_uprobe(
const std::string& binary_path, const std::string& symbol,
const std::string& probe_func, uint64_t symbol_addr = 0,
bpf_attach_type attach_type = bpf_attach_type::probe_entry,
bpf_probe_attach_type attach_type = BPF_PROBE_ENTRY,
pid_t pid = -1, int cpu = 0, int group_fd = -1,
perf_reader_cb cb = nullptr, void* cb_cookie = nullptr);
StatusTuple detach_uprobe(
const std::string& binary_path, const std::string& symbol,
uint64_t symbol_addr = 0,
bpf_attach_type attach_type = bpf_attach_type::probe_entry);
bpf_probe_attach_type attach_type = BPF_PROBE_ENTRY);
StatusTuple attach_usdt(const USDT& usdt, pid_t pid = -1, int cpu = 0,
int group_fd = -1);
StatusTuple detach_usdt(const USDT& usdt);
......@@ -111,9 +106,9 @@ private:
StatusTuple unload_func(const std::string& func_name);
std::string get_kprobe_event(const std::string& kernel_func,
bpf_attach_type type);
bpf_probe_attach_type type);
std::string get_uprobe_event(const std::string& binary_path, uint64_t offset,
bpf_attach_type type);
bpf_probe_attach_type type);
StatusTuple detach_kprobe_event(const std::string& event, open_probe_t& attr);
StatusTuple detach_uprobe_event(const std::string& event, open_probe_t& attr);
......@@ -121,21 +116,21 @@ private:
open_probe_t& attr);
StatusTuple detach_perf_event_all_cpu(open_probe_t& attr);
std::string attach_type_debug(bpf_attach_type type) {
std::string attach_type_debug(bpf_probe_attach_type type) {
switch (type) {
case bpf_attach_type::probe_entry:
case BPF_PROBE_ENTRY:
return "";
case bpf_attach_type::probe_return:
case BPF_PROBE_RETURN:
return "return ";
}
return "ERROR";
}
std::string attach_type_prefix(bpf_attach_type type) {
std::string attach_type_prefix(bpf_probe_attach_type type) {
switch (type) {
case bpf_attach_type::probe_entry:
case BPF_PROBE_ENTRY:
return "p";
case bpf_attach_type::probe_return:
case BPF_PROBE_RETURN:
return "r";
}
return "ERROR";
......
......@@ -25,6 +25,7 @@
#include "bcc_syms.h"
#include "libbpf.h"
#include "perf_reader.h"
#include "common.h"
namespace ebpf {
......@@ -89,7 +90,7 @@ StatusTuple BPFPerfBuffer::open_all_cpu(perf_reader_raw_cb cb,
if (cpu_readers_.size() != 0 || readers_.size() != 0)
return StatusTuple(-1, "Previously opened perf buffer not cleaned");
for (int i = 0; i < sysconf(_SC_NPROCESSORS_ONLN); i++) {
for (int i: get_online_cpus()) {
auto res = open_on_cpu(cb, i, cb_cookie);
if (res.code() != 0) {
TRY2(close_all_cpu());
......@@ -113,7 +114,7 @@ StatusTuple BPFPerfBuffer::close_on_cpu(int cpu) {
StatusTuple BPFPerfBuffer::close_all_cpu() {
std::string errors;
bool has_error = false;
for (int i = 0; i < sysconf(_SC_NPROCESSORS_ONLN); i++) {
for (int i: get_online_cpus()) {
auto res = close_on_cpu(i);
if (res.code() != 0) {
errors += "Failed to close CPU" + std::to_string(i) + " perf buffer: ";
......
......@@ -35,12 +35,12 @@ if (CMAKE_COMPILER_IS_GNUCC AND LIBCLANG_ISSTATIC)
endif()
endif()
add_library(bcc-shared SHARED bpf_common.cc bpf_module.cc libbpf.c perf_reader.c shared_table.cc exported_files.cc bcc_elf.c bcc_perf_map.c bcc_proc.c bcc_syms.cc usdt_args.cc usdt.cc BPF.cc BPFTable.cc)
add_library(bcc-shared SHARED bpf_common.cc bpf_module.cc libbpf.c perf_reader.c shared_table.cc exported_files.cc bcc_elf.c bcc_perf_map.c bcc_proc.c bcc_syms.cc usdt_args.cc usdt.cc common.cc BPF.cc BPFTable.cc)
set_target_properties(bcc-shared PROPERTIES VERSION ${REVISION_LAST} SOVERSION 0)
set_target_properties(bcc-shared PROPERTIES OUTPUT_NAME bcc)
add_library(bcc-loader-static libbpf.c perf_reader.c bcc_elf.c bcc_perf_map.c bcc_proc.c)
add_library(bcc-static STATIC bpf_common.cc bpf_module.cc shared_table.cc exported_files.cc bcc_syms.cc usdt_args.cc usdt.cc BPF.cc BPFTable.cc)
add_library(bcc-static STATIC bpf_common.cc bpf_module.cc shared_table.cc exported_files.cc bcc_syms.cc usdt_args.cc usdt.cc common.cc BPF.cc BPFTable.cc)
set_target_properties(bcc-static PROPERTIES OUTPUT_NAME bcc)
set(llvm_raw_libs bitwriter bpfcodegen irreader linker
......
......@@ -24,6 +24,7 @@
#include <stdint.h>
#include <ctype.h>
#include <stdio.h>
#include <math.h>
#include "bcc_perf_map.h"
#include "bcc_proc.h"
......@@ -307,13 +308,57 @@ static bool match_so_flags(int flags) {
return true;
}
const char *bcc_procutils_which_so(const char *libname) {
static bool which_so_in_process(const char* libname, int pid, char* libpath) {
  int ret, found = false;
  char endline[4096], *mapname = NULL, *newline;
  char mappings_file[128];
  const size_t search_len = strlen(libname) + strlen("/lib.");
  char search1[search_len + 1];
  char search2[search_len + 1];
  sprintf(mappings_file, "/proc/%ld/maps", (long)pid);
  FILE *fp = fopen(mappings_file, "r");
  if (!fp)
    return false;
  /* Match both "/lib<name>." and "/lib<name>-" so that versioned
   * sonames such as libpthread-2.24.so are found as well. */
  snprintf(search1, search_len + 1, "/lib%s.", libname);
  snprintf(search2, search_len + 1, "/lib%s-", libname);
  do {
    /* Skip the address, permissions, offset, device and inode columns;
     * the rest of the line (if any) is the mapped pathname. */
    ret = fscanf(fp, "%*x-%*x %*s %*x %*s %*d");
    if (!fgets(endline, sizeof(endline), fp))
      break;
    mapname = endline;
    newline = strchr(endline, '\n');
    if (newline)
      newline[0] = '\0';
    while (isspace(mapname[0])) mapname++;
    if (strstr(mapname, ".so") && (strstr(mapname, search1) ||
                                   strstr(mapname, search2))) {
      found = true;
      memcpy(libpath, mapname, strlen(mapname) + 1);
      break;
    }
  } while (ret != EOF);
  fclose(fp);
  return found;
}
char *bcc_procutils_which_so(const char *libname, int pid) {
const size_t soname_len = strlen(libname) + strlen("lib.so");
char soname[soname_len + 1];
char libpath[4096];
int i;
if (strchr(libname, '/'))
return libname;
return strdup(libname);
if (pid && which_so_in_process(libname, pid, libpath))
return strdup(libpath);
if (lib_cache_count < 0)
return NULL;
......@@ -327,8 +372,13 @@ const char *bcc_procutils_which_so(const char *libname) {
for (i = 0; i < lib_cache_count; ++i) {
if (!strncmp(lib_cache[i].libname, soname, soname_len) &&
match_so_flags(lib_cache[i].flags))
return lib_cache[i].path;
match_so_flags(lib_cache[i].flags)) {
return strdup(lib_cache[i].path);
}
}
return NULL;
}
void bcc_procutils_free(const char *ptr) {
free((void *)ptr);
}
......@@ -27,11 +27,12 @@ extern "C" {
typedef int (*bcc_procutils_modulecb)(const char *, uint64_t, uint64_t, void *);
typedef void (*bcc_procutils_ksymcb)(const char *, uint64_t, void *);
const char *bcc_procutils_which_so(const char *libname);
char *bcc_procutils_which_so(const char *libname, int pid);
char *bcc_procutils_which(const char *binpath);
int bcc_procutils_each_module(int pid, bcc_procutils_modulecb callback,
void *payload);
int bcc_procutils_each_ksym(bcc_procutils_ksymcb callback, void *payload);
void bcc_procutils_free(const char *ptr);
#ifdef __cplusplus
}
......
......@@ -304,7 +304,7 @@ int bcc_foreach_symbol(const char *module, SYM_CB cb) {
}
int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym) {
const uint64_t addr, int pid, struct bcc_symbol *sym) {
uint64_t load_addr;
sym->module = NULL;
......@@ -315,9 +315,9 @@ int bcc_resolve_symname(const char *module, const char *symname,
return -1;
if (strchr(module, '/')) {
sym->module = module;
sym->module = strdup(module);
} else {
sym->module = bcc_procutils_which_so(module);
sym->module = bcc_procutils_which_so(module, pid);
}
if (sym->module == NULL)
......
......@@ -43,7 +43,7 @@ int bcc_resolve_global_addr(int pid, const char *module, const uint64_t address,
int bcc_foreach_symbol(const char *module, SYM_CB cb);
int bcc_find_symbol_addr(struct bcc_symbol *sym);
int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym);
const uint64_t addr, int pid, struct bcc_symbol *sym);
#ifdef __cplusplus
}
#endif
......
......@@ -345,7 +345,8 @@ int BPFModule::load_includes(const string &text) {
int BPFModule::annotate() {
for (auto fn = mod_->getFunctionList().begin(); fn != mod_->getFunctionList().end(); ++fn)
fn->addFnAttr(Attribute::AlwaysInline);
if (!fn->hasFnAttribute(Attribute::NoInline))
fn->addFnAttr(Attribute::AlwaysInline);
// separate module to hold the reader functions
auto m = make_unique<Module>("sscanf", *ctx_);
......
/*
* Copyright (c) 2016 Catalysts GmbH
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <fstream>
#include <sstream>
#include "common.h"
namespace ebpf {
std::vector<int> read_cpu_range(std::string path) {
  std::ifstream cpus_range_stream { path };
  std::vector<int> cpus;
  std::string cpu_range;

  // sysfs CPU lists are comma-separated ranges, e.g. "0-3,5,7-8".
  while (std::getline(cpus_range_stream, cpu_range, ',')) {
    std::size_t rangeop = cpu_range.find('-');
    if (rangeop == std::string::npos) {
      // Single CPU entry, e.g. "5".
      cpus.push_back(std::stoi(cpu_range));
    } else {
      // Inclusive range entry, e.g. "0-3".
      int start = std::stoi(cpu_range.substr(0, rangeop));
      int end = std::stoi(cpu_range.substr(rangeop + 1));
      for (int i = start; i <= end; i++)
        cpus.push_back(i);
    }
  }
  return cpus;
}
std::vector<int> get_online_cpus() {
return read_cpu_range("/sys/devices/system/cpu/online");
}
std::vector<int> get_possible_cpus() {
return read_cpu_range("/sys/devices/system/cpu/possible");
}
} // namespace ebpf
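The files read here list CPUs as comma-separated ranges such as `0-3,5,7-8`, which `read_cpu_range` expands into individual IDs. A hedged usage sketch (assuming the `common.h` declared below is on the include path):
```cpp
// Illustrative caller of the new helpers; not part of this patch.
#include <iostream>
#include "common.h"

int main() {
  // On a CPU-hotplug system the online set can be sparse (e.g. 0,1,4,5),
  // which is why the loops over 0..sysconf(_SC_NPROCESSORS_ONLN)-1 in
  // BPF.cc and BPFTable.cc are replaced by get_online_cpus().
  for (int cpu : ebpf::get_online_cpus())
    std::cout << "online cpu: " << cpu << "\n";
  return 0;
}
```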
......@@ -19,6 +19,7 @@
#include <memory>
#include <string>
#include <tuple>
#include <vector>
namespace ebpf {
......@@ -28,4 +29,8 @@ make_unique(Args &&... args) {
return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}
std::vector<int> get_online_cpus();
std::vector<int> get_possible_cpus();
} // namespace ebpf
......@@ -63,6 +63,12 @@ struct bpf_insn {
__s32 imm; /* signed immediate constant */
};
/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
struct bpf_lpm_trie_key {
__u32 prefixlen; /* up to 32 for AF_INET, 128 for AF_INET6 */
__u8 data[0]; /* Arbitrary size */
};
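/* Illustration (not part of this patch): a key for the IPv4 prefix
 * 192.168.0.0/16 would be allocated with four trailing bytes for the
 * flexible array and filled roughly as:
 *
 *   struct bpf_lpm_trie_key *key = malloc(sizeof(*key) + 4);
 *   key->prefixlen = 16;                       // leading bits to match
 *   memcpy(key->data, "\xc0\xa8\x00\x00", 4);  // 192.168.0.0
 */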
/* BPF syscall commands, see bpf(2) man-page for details. */
enum bpf_cmd {
BPF_MAP_CREATE,
......@@ -73,6 +79,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
BPF_PROG_ATTACH,
BPF_PROG_DETACH,
};
enum bpf_map_type {
......@@ -87,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
BPF_MAP_TYPE_LPM_TRIE,
};
enum bpf_prog_type {
......@@ -98,8 +107,22 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
BPF_PROG_TYPE_CGROUP_SOCK,
BPF_PROG_TYPE_LWT_IN,
BPF_PROG_TYPE_LWT_OUT,
BPF_PROG_TYPE_LWT_XMIT,
};
enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
BPF_CGROUP_INET_SOCK_CREATE,
__MAX_BPF_ATTACH_TYPE
};
#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
#define BPF_PSEUDO_MAP_FD 1
/* flags for BPF_MAP_UPDATE_ELEM command */
......@@ -150,243 +173,327 @@ union bpf_attr {
__aligned_u64 pathname;
__u32 bpf_fd;
};
struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
__u32 target_fd; /* container object to attach to */
__u32 attach_bpf_fd; /* eBPF program to attach */
__u32 attach_type;
};
} __attribute__((aligned(8)));
/* BPF helper function descriptions:
*
* void *bpf_map_lookup_elem(&map, &key)
* Return: Map value or NULL
*
* int bpf_map_update_elem(&map, &key, &value, flags)
* Return: 0 on success or negative error
*
* int bpf_map_delete_elem(&map, &key)
* Return: 0 on success or negative error
*
* int bpf_probe_read(void *dst, int size, void *src)
* Return: 0 on success or negative error
*
* u64 bpf_ktime_get_ns(void)
* Return: current ktime
*
* int bpf_trace_printk(const char *fmt, int fmt_size, ...)
* Return: length of buffer written or negative error
*
* u32 bpf_prandom_u32(void)
* Return: random value
*
* u32 bpf_raw_smp_processor_id(void)
* Return: SMP processor ID
*
* int bpf_skb_store_bytes(skb, offset, from, len, flags)
* store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l3_csum_replace(skb, offset, from, to, flags)
* recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l4_csum_replace(skb, offset, from, to, flags)
* recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_tail_call(ctx, prog_array_map, index)
* jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success or negative error
*
* int bpf_clone_redirect(skb, ifindex, flags)
* redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success or negative error
*
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*
* int bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success or negative error
*
* u32 bpf_get_cgroup_classid(skb)
* retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*
* int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
* Return: 0 on success or negative error
*
* int bpf_skb_vlan_pop(skb)
* Return: 0 on success or negative error
*
* int bpf_skb_get_tunnel_key(skb, key, size, flags)
* int bpf_skb_set_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
* Return: 0 on success or negative error
*
* u64 bpf_perf_event_read(&map, index)
* Return: Number events read or error code
*
* int bpf_redirect(ifindex, flags)
* redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*
* u32 bpf_get_route_realm(skb)
* retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*
* int bpf_perf_event_output(ctx, map, index, data, size)
* output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success or negative error
*
* int bpf_get_stackid(ctx, map, flags)
* walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
* @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*
* s64 bpf_csum_diff(from, from_size, to, to_size, seed)
* calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result or negative error code
*
* int bpf_skb_get_tunnel_opt(skb, opt, size)
* retrieve tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: option size
*
* int bpf_skb_set_tunnel_opt(skb, opt, size)
* populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success or negative error
*
* int bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported are v4 -> v6 and
* v6 -> v4 transitions. The helper will also resize the skb. eBPF
* program is expected to fill the new headers via skb_store_bytes
* and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*
* int bpf_skb_under_cgroup(skb, map, index)
* Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*
* u32 bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*
* int bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*
* int bpf_current_task_under_cgroup(map, index)
* Check cgroup2 membership of current task
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 current failed the cgroup2 descendant test
* == 1 current succeeded the cgroup2 descendant test
* < 0 error
*
* int bpf_skb_change_tail(skb, len, flags)
* The helper will resize the skb to the given new size, to be used e.g.
* with control messages.
* @skb: pointer to skb
* @len: new skb length
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_pull_data(skb, len)
* The helper will pull in non-linear data in case the skb is non-linear
* and not all of len are part of the linear section. Only needed for
* read/write with direct packet access.
* @skb: pointer to skb
* @len: len to make read/writeable
* Return: 0 on success or negative error
*
* s64 bpf_csum_update(skb, csum)
* Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
* @skb: pointer to skb
* @csum: csum to add
* Return: csum on success or negative error
*
* void bpf_set_hash_invalid(skb)
* Invalidate current skb->hash.
* @skb: pointer to skb
*
* int bpf_get_numa_node_id()
* Return: Id of current NUMA node.
*
* int bpf_skb_change_head()
* Grows headroom of skb and adjusts MAC header offset accordingly.
* Will extend/reallocate as required automatically.
* May change skb data pointer and will thus invalidate any check
* performed for direct packet access.
* @skb: pointer to skb
* @len: length of header to be pushed in front
* @flags: Flags (unused for now)
* Return: 0 on success or negative error
*
* int bpf_xdp_adjust_head(xdp_md, delta)
* Adjust the xdp_md.data by delta
* @xdp_md: pointer to xdp_md
* @delta: A positive/negative integer to be added to xdp_md.data
* Return: 0 on success or negative on error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
FN(map_lookup_elem), \
FN(map_update_elem), \
FN(map_delete_elem), \
FN(probe_read), \
FN(ktime_get_ns), \
FN(trace_printk), \
FN(get_prandom_u32), \
FN(get_smp_processor_id), \
FN(skb_store_bytes), \
FN(l3_csum_replace), \
FN(l4_csum_replace), \
FN(tail_call), \
FN(clone_redirect), \
FN(get_current_pid_tgid), \
FN(get_current_uid_gid), \
FN(get_current_comm), \
FN(get_cgroup_classid), \
FN(skb_vlan_push), \
FN(skb_vlan_pop), \
FN(skb_get_tunnel_key), \
FN(skb_set_tunnel_key), \
FN(perf_event_read), \
FN(redirect), \
FN(get_route_realm), \
FN(perf_event_output), \
FN(skb_load_bytes), \
FN(get_stackid), \
FN(csum_diff), \
FN(skb_get_tunnel_opt), \
FN(skb_set_tunnel_opt), \
FN(skb_change_proto), \
FN(skb_change_type), \
FN(skb_under_cgroup), \
FN(get_hash_recalc), \
FN(get_current_task), \
FN(probe_write_user), \
FN(current_task_under_cgroup), \
FN(skb_change_tail), \
FN(skb_pull_data), \
FN(csum_update), \
FN(set_hash_invalid), \
FN(get_numa_node_id), \
FN(skb_change_head), \
FN(xdp_adjust_head),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
*/
#define __BPF_ENUM_FN(x) BPF_FUNC_ ## x
enum bpf_func_id {
BPF_FUNC_unspec,
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
BPF_FUNC_trace_printk, /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
BPF_FUNC_get_prandom_u32, /* u32 prandom_u32(void) */
BPF_FUNC_get_smp_processor_id, /* u32 raw_smp_processor_id(void) */
/**
* skb_store_bytes(skb, offset, from, len, flags) - store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_skb_store_bytes,
/**
* l3_csum_replace(skb, offset, from, to, flags) - recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l3_csum_replace,
/**
* l4_csum_replace(skb, offset, from, to, flags) - recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l4_csum_replace,
/**
* bpf_tail_call(ctx, prog_array_map, index) - jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success
*/
BPF_FUNC_tail_call,
/**
* bpf_clone_redirect(skb, ifindex, flags) - redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_clone_redirect,
/**
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*/
BPF_FUNC_get_current_pid_tgid,
/**
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*/
BPF_FUNC_get_current_uid_gid,
/**
* bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success
*/
BPF_FUNC_get_current_comm,
/**
* bpf_get_cgroup_classid(skb) - retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*/
BPF_FUNC_get_cgroup_classid,
BPF_FUNC_skb_vlan_push, /* bpf_skb_vlan_push(skb, vlan_proto, vlan_tci) */
BPF_FUNC_skb_vlan_pop, /* bpf_skb_vlan_pop(skb) */
/**
* bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
 * Return: 0 on success
*/
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */
/**
* bpf_redirect(ifindex, flags) - redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*/
BPF_FUNC_redirect,
/**
* bpf_get_route_realm(skb) - retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*/
BPF_FUNC_get_route_realm,
/**
* bpf_perf_event_output(ctx, map, index, data, size) - output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success
*/
BPF_FUNC_perf_event_output,
BPF_FUNC_skb_load_bytes,
/**
* bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
 * @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*/
BPF_FUNC_get_stackid,
/**
* bpf_csum_diff(from, from_size, to, to_size, seed) - calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result
*/
BPF_FUNC_csum_diff,
/**
* bpf_skb_[gs]et_tunnel_opt(skb, opt, size)
* retrieve or populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success for set, option size for get
*/
BPF_FUNC_skb_get_tunnel_opt,
BPF_FUNC_skb_set_tunnel_opt,
/**
* bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported is
* v4 -> v6, v6 -> v4 transitions. The helper will also
* resize the skb. eBPF program is expected to fill the
* new headers via skb_store_bytes and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_proto,
/**
* bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_type,
/**
* bpf_skb_in_cgroup(skb, map, index) - Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*/
BPF_FUNC_skb_in_cgroup,
/**
* bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*/
BPF_FUNC_get_hash_recalc,
/**
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*/
BPF_FUNC_get_current_task,
/**
* bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*/
BPF_FUNC_probe_write_user,
__BPF_FUNC_MAPPER(__BPF_ENUM_FN)
__BPF_FUNC_MAX_ID,
};
#undef __BPF_ENUM_FN
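
For orientation, a minimal bcc-style sketch of how a probe would call a few of the helpers enumerated above; the attach point `sys_clone` and the probe name are illustrative, not part of this change:

```
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

// Exercises three helpers from the enum above: get_current_pid_tgid,
// get_current_comm and trace_printk.
int trace_clone(struct pt_regs *ctx) {
    u64 pid_tgid = bpf_get_current_pid_tgid();  // tgid << 32 | pid
    char comm[TASK_COMM_LEN];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_trace_printk("clone by %s (tgid %d)\n", comm, pid_tgid >> 32);
    return 0;
}
```
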
/* All flags used by eBPF helper functions, placed here. */
......@@ -460,6 +567,31 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
/* Generic BPF return codes which all BPF program types may support.
 * The values are binary compatible with their TC_ACT_* counterparts to
* provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
* programs.
*
 * XDP is handled separately, see XDP_*.
*/
enum bpf_ret_code {
BPF_OK = 0,
/* 1 reserved */
BPF_DROP = 2,
/* 3-6 reserved */
BPF_REDIRECT = 7,
/* >127 are reserved for prog type specific return codes */
};
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
#define XDP_PACKET_HEADROOM 256
/* User return codes for XDP prog type.
* A valid XDP program must return one of these defined values. All other
* return codes are reserved for future use. Unknown return codes will result
......
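
A hedged sketch of the bpf_xdp_adjust_head() pattern documented above, assuming an XDP-capable kernel; the ENCAP_LEN constant and program name are illustrative:

```
#include <uapi/linux/bpf.h>

#define ENCAP_LEN 8  // illustrative size of a custom header

int xdp_push_header(struct xdp_md *ctx) {
    // A negative delta grows headroom (XDP_PACKET_HEADROOM permitting).
    if (bpf_xdp_adjust_head(ctx, -ENCAP_LEN))
        return XDP_ABORTED;
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    // Bounds must be re-validated after any head adjustment.
    if (data + ENCAP_LEN > data_end)
        return XDP_DROP;
    __builtin_memset(data, 0, ENCAP_LEN);  // fill in the new header
    return XDP_PASS;
}
```
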
......@@ -64,6 +64,12 @@ struct bpf_insn {
__s32 imm; /* signed immediate constant */
};
/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
struct bpf_lpm_trie_key {
__u32 prefixlen; /* up to 32 for AF_INET, 128 for AF_INET6 */
__u8 data[0]; /* Arbitrary size */
};
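
A userspace sketch of filling this key for an IPv4 /24 lookup; the fixed-size wrapper struct is an assumption for illustration, since `data[0]` is variable-length:

```
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

struct lpm_key_v4 {
    uint32_t prefixlen;   // up to 32 for AF_INET
    uint8_t  data[4];     // address bytes in network order
};

static void fill_lpm_key_v4(struct lpm_key_v4 *key, const char *addr_str,
                            uint32_t prefixlen) {
    struct in_addr addr;
    inet_pton(AF_INET, addr_str, &addr);   // e.g. "192.168.1.0"
    key->prefixlen = prefixlen;            // e.g. 24
    memcpy(key->data, &addr, sizeof(key->data));
}
```
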
/* BPF syscall commands, see bpf(2) man-page for details. */
enum bpf_cmd {
BPF_MAP_CREATE,
......@@ -74,6 +80,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
BPF_PROG_ATTACH,
BPF_PROG_DETACH,
};
enum bpf_map_type {
......@@ -88,6 +96,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
BPF_MAP_TYPE_LPM_TRIE,
};
enum bpf_prog_type {
......@@ -99,8 +108,22 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
BPF_PROG_TYPE_CGROUP_SOCK,
BPF_PROG_TYPE_LWT_IN,
BPF_PROG_TYPE_LWT_OUT,
BPF_PROG_TYPE_LWT_XMIT,
};
enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
BPF_CGROUP_INET_SOCK_CREATE,
__MAX_BPF_ATTACH_TYPE
};
#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
#define BPF_PSEUDO_MAP_FD 1
/* flags for BPF_MAP_UPDATE_ELEM command */
......@@ -151,243 +174,327 @@ union bpf_attr {
__aligned_u64 pathname;
__u32 bpf_fd;
};
struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
__u32 target_fd; /* container object to attach to */
__u32 attach_bpf_fd; /* eBPF program to attach */
__u32 attach_type;
};
} __attribute__((aligned(8)));
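
A minimal userspace sketch of the new attach command, assuming `prog_fd` is a loaded BPF_PROG_TYPE_CGROUP_SKB program and `cgroup_fd` an open cgroup2 directory fd; it only builds against a linux/bpf.h that carries this uapi:

```
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int attach_cgroup_ingress(int prog_fd, int cgroup_fd) {
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.target_fd = cgroup_fd;                  // container object
    attr.attach_bpf_fd = prog_fd;                // eBPF program to attach
    attr.attach_type = BPF_CGROUP_INET_INGRESS;
    return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```
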
/* BPF helper function descriptions:
*
* void *bpf_map_lookup_elem(&map, &key)
* Return: Map value or NULL
*
* int bpf_map_update_elem(&map, &key, &value, flags)
* Return: 0 on success or negative error
*
* int bpf_map_delete_elem(&map, &key)
* Return: 0 on success or negative error
*
* int bpf_probe_read(void *dst, int size, void *src)
* Return: 0 on success or negative error
*
* u64 bpf_ktime_get_ns(void)
* Return: current ktime
*
* int bpf_trace_printk(const char *fmt, int fmt_size, ...)
* Return: length of buffer written or negative error
*
* u32 bpf_prandom_u32(void)
* Return: random value
*
* u32 bpf_raw_smp_processor_id(void)
* Return: SMP processor ID
*
* int bpf_skb_store_bytes(skb, offset, from, len, flags)
* store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l3_csum_replace(skb, offset, from, to, flags)
* recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_l4_csum_replace(skb, offset, from, to, flags)
* recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success or negative error
*
* int bpf_tail_call(ctx, prog_array_map, index)
* jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success or negative error
*
* int bpf_clone_redirect(skb, ifindex, flags)
* redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success or negative error
*
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*
* int bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success or negative error
*
* u32 bpf_get_cgroup_classid(skb)
* retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*
* int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
* Return: 0 on success or negative error
*
* int bpf_skb_vlan_pop(skb)
* Return: 0 on success or negative error
*
* int bpf_skb_get_tunnel_key(skb, key, size, flags)
* int bpf_skb_set_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
* Return: 0 on success or negative error
*
* u64 bpf_perf_event_read(&map, index)
 * Return: Number of events read or error code
*
* int bpf_redirect(ifindex, flags)
* redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*
* u32 bpf_get_route_realm(skb)
* retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*
* int bpf_perf_event_output(ctx, map, index, data, size)
* output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success or negative error
*
* int bpf_get_stackid(ctx, map, flags)
* walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
 * @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*
* s64 bpf_csum_diff(from, from_size, to, to_size, seed)
* calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result or negative error code
*
* int bpf_skb_get_tunnel_opt(skb, opt, size)
* retrieve tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: option size
*
* int bpf_skb_set_tunnel_opt(skb, opt, size)
* populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success or negative error
*
* int bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported is v4 -> v6,
* v6 -> v4 transitions. The helper will also resize the skb. eBPF
* program is expected to fill the new headers via skb_store_bytes
* and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*
* int bpf_skb_under_cgroup(skb, map, index)
* Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*
* u32 bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*
* int bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*
* int bpf_current_task_under_cgroup(map, index)
* Check cgroup2 membership of current task
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 current failed the cgroup2 descendant test
* == 1 current succeeded the cgroup2 descendant test
* < 0 error
*
* int bpf_skb_change_tail(skb, len, flags)
 *     The helper will resize the skb to the given new size, to be used e.g.
* with control messages.
* @skb: pointer to skb
* @len: new skb length
* @flags: reserved
* Return: 0 on success or negative error
*
* int bpf_skb_pull_data(skb, len)
* The helper will pull in non-linear data in case the skb is non-linear
 *     and not all of len is part of the linear section. Only needed for
* read/write with direct packet access.
* @skb: pointer to skb
* @len: len to make read/writeable
* Return: 0 on success or negative error
*
* s64 bpf_csum_update(skb, csum)
* Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
* @skb: pointer to skb
* @csum: csum to add
* Return: csum on success or negative error
*
* void bpf_set_hash_invalid(skb)
* Invalidate current skb->hash.
* @skb: pointer to skb
*
* int bpf_get_numa_node_id()
 *     Return: ID of the current NUMA node.
*
* int bpf_skb_change_head()
* Grows headroom of skb and adjusts MAC header offset accordingly.
 *     Will extend/reallocate as required automatically.
* May change skb data pointer and will thus invalidate any check
* performed for direct packet access.
* @skb: pointer to skb
* @len: length of header to be pushed in front
* @flags: Flags (unused for now)
* Return: 0 on success or negative error
*
* int bpf_xdp_adjust_head(xdp_md, delta)
* Adjust the xdp_md.data by delta
* @xdp_md: pointer to xdp_md
 *     @delta: A positive or negative integer to be added to xdp_md.data
* Return: 0 on success or negative on error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
FN(map_lookup_elem), \
FN(map_update_elem), \
FN(map_delete_elem), \
FN(probe_read), \
FN(ktime_get_ns), \
FN(trace_printk), \
FN(get_prandom_u32), \
FN(get_smp_processor_id), \
FN(skb_store_bytes), \
FN(l3_csum_replace), \
FN(l4_csum_replace), \
FN(tail_call), \
FN(clone_redirect), \
FN(get_current_pid_tgid), \
FN(get_current_uid_gid), \
FN(get_current_comm), \
FN(get_cgroup_classid), \
FN(skb_vlan_push), \
FN(skb_vlan_pop), \
FN(skb_get_tunnel_key), \
FN(skb_set_tunnel_key), \
FN(perf_event_read), \
FN(redirect), \
FN(get_route_realm), \
FN(perf_event_output), \
FN(skb_load_bytes), \
FN(get_stackid), \
FN(csum_diff), \
FN(skb_get_tunnel_opt), \
FN(skb_set_tunnel_opt), \
FN(skb_change_proto), \
FN(skb_change_type), \
FN(skb_under_cgroup), \
FN(get_hash_recalc), \
FN(get_current_task), \
FN(probe_write_user), \
FN(current_task_under_cgroup), \
FN(skb_change_tail), \
FN(skb_pull_data), \
FN(csum_update), \
FN(set_hash_invalid), \
FN(get_numa_node_id), \
FN(skb_change_head), \
FN(xdp_adjust_head),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
*/
#define __BPF_ENUM_FN(x) BPF_FUNC_ ## x
enum bpf_func_id {
BPF_FUNC_unspec,
BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
BPF_FUNC_probe_read, /* int bpf_probe_read(void *dst, int size, void *src) */
BPF_FUNC_ktime_get_ns, /* u64 bpf_ktime_get_ns(void) */
BPF_FUNC_trace_printk, /* int bpf_trace_printk(const char *fmt, int fmt_size, ...) */
BPF_FUNC_get_prandom_u32, /* u32 prandom_u32(void) */
BPF_FUNC_get_smp_processor_id, /* u32 raw_smp_processor_id(void) */
/**
* skb_store_bytes(skb, offset, from, len, flags) - store bytes into packet
* @skb: pointer to skb
* @offset: offset within packet from skb->mac_header
* @from: pointer where to copy bytes from
* @len: number of bytes to store into packet
* @flags: bit 0 - if true, recompute skb->csum
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_skb_store_bytes,
/**
* l3_csum_replace(skb, offset, from, to, flags) - recompute IP checksum
* @skb: pointer to skb
* @offset: offset within packet where IP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l3_csum_replace,
/**
* l4_csum_replace(skb, offset, from, to, flags) - recompute TCP/UDP checksum
* @skb: pointer to skb
* @offset: offset within packet where TCP/UDP checksum is located
* @from: old value of header field
* @to: new value of header field
* @flags: bits 0-3 - size of header field
* bit 4 - is pseudo header
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_l4_csum_replace,
/**
* bpf_tail_call(ctx, prog_array_map, index) - jump into another BPF program
* @ctx: context pointer passed to next program
* @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
* @index: index inside array that selects specific program to run
* Return: 0 on success
*/
BPF_FUNC_tail_call,
/**
* bpf_clone_redirect(skb, ifindex, flags) - redirect to another netdev
* @skb: pointer to skb
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: 0 on success
*/
BPF_FUNC_clone_redirect,
/**
* u64 bpf_get_current_pid_tgid(void)
* Return: current->tgid << 32 | current->pid
*/
BPF_FUNC_get_current_pid_tgid,
/**
* u64 bpf_get_current_uid_gid(void)
* Return: current_gid << 32 | current_uid
*/
BPF_FUNC_get_current_uid_gid,
/**
* bpf_get_current_comm(char *buf, int size_of_buf)
* stores current->comm into buf
* Return: 0 on success
*/
BPF_FUNC_get_current_comm,
/**
* bpf_get_cgroup_classid(skb) - retrieve a proc's classid
* @skb: pointer to skb
* Return: classid if != 0
*/
BPF_FUNC_get_cgroup_classid,
BPF_FUNC_skb_vlan_push, /* bpf_skb_vlan_push(skb, vlan_proto, vlan_tci) */
BPF_FUNC_skb_vlan_pop, /* bpf_skb_vlan_pop(skb) */
/**
* bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
* retrieve or populate tunnel metadata
* @skb: pointer to skb
* @key: pointer to 'struct bpf_tunnel_key'
* @size: size of 'struct bpf_tunnel_key'
* @flags: room for future extensions
 * Return: 0 on success
*/
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read, /* u64 bpf_perf_event_read(&map, index) */
/**
* bpf_redirect(ifindex, flags) - redirect to another netdev
* @ifindex: ifindex of the net device
* @flags: bit 0 - if set, redirect to ingress instead of egress
* other bits - reserved
* Return: TC_ACT_REDIRECT
*/
BPF_FUNC_redirect,
/**
* bpf_get_route_realm(skb) - retrieve a dst's tclassid
* @skb: pointer to skb
* Return: realm if != 0
*/
BPF_FUNC_get_route_realm,
/**
* bpf_perf_event_output(ctx, map, index, data, size) - output perf raw sample
* @ctx: struct pt_regs*
* @map: pointer to perf_event_array map
* @index: index of event in the map
* @data: data on stack to be output as raw data
* @size: size of data
* Return: 0 on success
*/
BPF_FUNC_perf_event_output,
BPF_FUNC_skb_load_bytes,
/**
* bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
* @ctx: struct pt_regs*
* @map: pointer to stack_trace map
 * @flags: bits 0-7 - number of stack frames to skip
* bit 8 - collect user stack instead of kernel
* bit 9 - compare stacks by hash only
* bit 10 - if two different stacks hash into the same stackid
* discard old
* other bits - reserved
* Return: >= 0 stackid on success or negative error
*/
BPF_FUNC_get_stackid,
/**
* bpf_csum_diff(from, from_size, to, to_size, seed) - calculate csum diff
* @from: raw from buffer
* @from_size: length of from buffer
* @to: raw to buffer
* @to_size: length of to buffer
* @seed: optional seed
* Return: csum result
*/
BPF_FUNC_csum_diff,
/**
* bpf_skb_[gs]et_tunnel_opt(skb, opt, size)
* retrieve or populate tunnel options metadata
* @skb: pointer to skb
* @opt: pointer to raw tunnel option data
* @size: size of @opt
* Return: 0 on success for set, option size for get
*/
BPF_FUNC_skb_get_tunnel_opt,
BPF_FUNC_skb_set_tunnel_opt,
/**
* bpf_skb_change_proto(skb, proto, flags)
* Change protocol of the skb. Currently supported is
* v4 -> v6, v6 -> v4 transitions. The helper will also
* resize the skb. eBPF program is expected to fill the
* new headers via skb_store_bytes and lX_csum_replace.
* @skb: pointer to skb
* @proto: new skb->protocol type
* @flags: reserved
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_proto,
/**
* bpf_skb_change_type(skb, type)
* Change packet type of skb.
* @skb: pointer to skb
* @type: new skb->pkt_type type
* Return: 0 on success or negative error
*/
BPF_FUNC_skb_change_type,
/**
* bpf_skb_in_cgroup(skb, map, index) - Check cgroup2 membership of skb
* @skb: pointer to skb
* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
* @index: index of the cgroup in the bpf_map
* Return:
* == 0 skb failed the cgroup2 descendant test
* == 1 skb succeeded the cgroup2 descendant test
* < 0 error
*/
BPF_FUNC_skb_in_cgroup,
/**
* bpf_get_hash_recalc(skb)
* Retrieve and possibly recalculate skb->hash.
* @skb: pointer to skb
* Return: hash
*/
BPF_FUNC_get_hash_recalc,
/**
* u64 bpf_get_current_task(void)
* Returns current task_struct
* Return: current
*/
BPF_FUNC_get_current_task,
/**
* bpf_probe_write_user(void *dst, void *src, int len)
* safely attempt to write to a location
* @dst: destination address in userspace
* @src: source address on stack
* @len: number of bytes to copy
* Return: 0 on success or negative error
*/
BPF_FUNC_probe_write_user,
__BPF_FUNC_MAPPER(__BPF_ENUM_FN)
__BPF_FUNC_MAX_ID,
};
#undef __BPF_ENUM_FN
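
As a sketch of the bpf_tail_call() entry above, using bcc's prog-array table syntax; the map name and index are illustrative:

```
#include <uapi/linux/ptrace.h>

BPF_TABLE("prog", int, int, jump_table, 8);

int dispatch(struct pt_regs *ctx) {
    // bcc rewrites .call() into bpf_tail_call(); control does not return
    // here when a program is installed at index 1.
    jump_table.call(ctx, 1);
    return 0;  // fall-through: no program at that index
}
```
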
/* All flags used by eBPF helper functions, placed here. */
......@@ -461,6 +568,31 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
/* Generic BPF return codes which all BPF program types may support.
 * The values are binary compatible with their TC_ACT_* counterparts to
* provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
* programs.
*
 * XDP is handled separately, see XDP_*.
*/
enum bpf_ret_code {
BPF_OK = 0,
/* 1 reserved */
BPF_DROP = 2,
/* 3-6 reserved */
BPF_REDIRECT = 7,
/* >127 are reserved for prog type specific return codes */
};
struct bpf_sock {
__u32 bound_dev_if;
__u32 family;
__u32 type;
__u32 protocol;
};
#define XDP_PACKET_HEADROOM 256
/* User return codes for XDP prog type.
* A valid XDP program must return one of these defined values. All other
* return codes are reserved for future use. Unknown return codes will result
......
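
A hedged tc-classifier sketch of the bpf_skb_pull_data() rule described above: pull the needed bytes into the linear area, then re-check bounds before direct packet access:

```
#include <uapi/linux/bpf.h>
#include <uapi/linux/pkt_cls.h>

int tc_peek(struct __sk_buff *skb) {
    if (bpf_skb_pull_data(skb, 14))      // make the Ethernet header linear
        return TC_ACT_OK;
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    if (data + 14 > data_end)            // verifier-required bounds check
        return TC_ACT_OK;
    // bytes 0..13 (the Ethernet header) are now safely readable here
    return TC_ACT_OK;
}
```
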
......@@ -147,7 +147,7 @@ static u32 (*bpf_get_prandom_u32)(void) =
static int (*bpf_trace_printk_)(const char *fmt, u64 fmt_size, ...) =
(void *) BPF_FUNC_trace_printk;
int bpf_trace_printk(const char *fmt, ...) asm("llvm.bpf.extra");
static void bpf_tail_call_(u64 map_fd, void *ctx, int index) {
static inline void bpf_tail_call_(u64 map_fd, void *ctx, int index) {
((void (*)(void *, u64, int))BPF_FUNC_tail_call)(ctx, map_fd, index);
}
static int (*bpf_clone_redirect)(void *ctx, int ifindex, u32 flags) =
......@@ -180,8 +180,6 @@ static int (*bpf_perf_event_output)(void *ctx, void *map, u64 index, void *data,
(void *) BPF_FUNC_perf_event_output;
static int (*bpf_skb_load_bytes)(void *ctx, int offset, void *to, u32 len) =
(void *) BPF_FUNC_skb_load_bytes;
static u64 (*bpf_get_current_task)(void) =
(void *) BPF_FUNC_get_current_task;
/* bpf_get_stackid will return a negative value in the case of an error
*
......@@ -203,6 +201,34 @@ int bpf_get_stackid(uintptr_t map, void *ctx, u64 flags) {
static int (*bpf_csum_diff)(void *from, u64 from_size, void *to, u64 to_size, u64 seed) =
(void *) BPF_FUNC_csum_diff;
static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, u32 size) =
(void *) BPF_FUNC_skb_get_tunnel_opt;
static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, u32 size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
static int (*bpf_skb_change_proto)(void *ctx, u16 proto, u64 flags) =
(void *) BPF_FUNC_skb_change_proto;
static int (*bpf_skb_change_type)(void *ctx, u32 type) =
(void *) BPF_FUNC_skb_change_type;
static u32 (*bpf_get_hash_recalc)(void *ctx) =
(void *) BPF_FUNC_get_hash_recalc;
static u64 (*bpf_get_current_task)(void) =
(void *) BPF_FUNC_get_current_task;
static int (*bpf_probe_write_user)(void *dst, void *src, u32 size) =
(void *) BPF_FUNC_probe_write_user;
static int (*bpf_skb_change_tail)(void *ctx, u32 new_len, u64 flags) =
(void *) BPF_FUNC_skb_change_tail;
static int (*bpf_skb_pull_data)(void *ctx, u32 len) =
(void *) BPF_FUNC_skb_pull_data;
static int (*bpf_csum_update)(void *ctx, u16 csum) =
(void *) BPF_FUNC_csum_update;
static int (*bpf_set_hash_invalid)(void *ctx) =
(void *) BPF_FUNC_set_hash_invalid;
static int (*bpf_get_numa_node_id)(void) =
(void *) BPF_FUNC_get_numa_node_id;
static int (*bpf_skb_change_head)(void *ctx, u32 len, u64 flags) =
(void *) BPF_FUNC_skb_change_head;
static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
(void *) BPF_FUNC_xdp_adjust_head;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
......@@ -284,7 +310,7 @@ static inline void bpf_store_dword(void *skb, u64 off, u64 val) {
#define MASK(_n) ((_n) < 64 ? (1ull << (_n)) - 1 : ((u64)-1LL))
#define MASK128(_n) ((_n) < 128 ? ((unsigned __int128)1 << (_n)) - 1 : ((unsigned __int128)-1))
static unsigned int bpf_log2(unsigned int v)
static inline unsigned int bpf_log2(unsigned int v)
{
unsigned int r;
unsigned int shift;
......@@ -297,7 +323,7 @@ static unsigned int bpf_log2(unsigned int v)
return r;
}
static unsigned int bpf_log2l(unsigned long v)
static inline unsigned int bpf_log2l(unsigned long v)
{
unsigned int hi = v >> 32;
if (hi)
......@@ -473,5 +499,18 @@ int bpf_usdt_readarg_p(int argc, struct pt_regs *ctx, void *buf, u64 len) asm("l
#define TRACEPOINT_PROBE(category, event) \
int tracepoint__##category##__##event(struct tracepoint__##category##__##event *args)
#define TP_DATA_LOC_READ_CONST(dst, field, length) \
do { \
short __offset = args->data_loc_##field & 0xFFFF; \
bpf_probe_read((void *)dst, length, (char *)args + __offset); \
} while (0);
#define TP_DATA_LOC_READ(dst, field) \
do { \
short __offset = args->data_loc_##field & 0xFFFF; \
short __length = args->data_loc_##field >> 16; \
bpf_probe_read((void *)dst, __length, (char *)args + __offset); \
} while (0);
#endif
)********"
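
A sketch of the new TP_DATA_LOC_READ_CONST macro, using irq:irq_handler_entry, whose `name` field is a `__data_loc char[]` and therefore lands in the generated args struct as `int data_loc_name`:

```
TRACEPOINT_PROBE(irq, irq_handler_entry) {
    char irq_name[64] = {};
    TP_DATA_LOC_READ_CONST(irq_name, name, sizeof(irq_name));
    bpf_trace_printk("irq handler: %s\n", irq_name);
    return 0;
}
```
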
......@@ -15,6 +15,9 @@ R"********(
* limitations under the License.
*/
#ifndef __BCC_PROTO_H
#define __BCC_PROTO_H
#include <uapi/linux/if_ether.h>
#define BPF_PACKET_HEADER __attribute__((packed)) __attribute__((deprecated("packet")))
......@@ -142,4 +145,6 @@ struct vxlan_gbp_t {
unsigned int key:24;
unsigned int rsv6:8;
} BPF_PACKET_HEADER;
#endif
)********"
......@@ -1163,6 +1163,8 @@ StatusTuple CodegenLLVM::visit_func_decl_stmt_node(FuncDeclStmtNode *n) {
Function *fn = mod_->getFunction(n->id_->name_);
if (fn) return mkstatus_(n, "Function %s already defined", n->id_->c_str());
fn = Function::Create(fn_type, GlobalValue::ExternalLinkage, n->id_->name_, mod_);
fn->setCallingConv(CallingConv::C);
fn->addFnAttr(Attribute::NoUnwind);
fn->setSection(BPF_FN_PREFIX + n->id_->name_);
BasicBlock *label_entry = BasicBlock::Create(ctx(), "entry", fn);
......
......@@ -2,4 +2,4 @@
# Licensed under the Apache License, Version 2.0 (the "License")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DKERNEL_MODULES_DIR='\"${BCC_KERNEL_MODULES_DIR}\"'")
add_library(clang_frontend loader.cc b_frontend_action.cc tp_frontend_action.cc kbuild_helper.cc)
add_library(clang_frontend loader.cc b_frontend_action.cc tp_frontend_action.cc kbuild_helper.cc ../../common.cc)
......@@ -27,6 +27,7 @@
#include "b_frontend_action.h"
#include "shared_table.h"
#include "common.h"
#include "libbpf.h"
......@@ -644,6 +645,8 @@ bool BTypeVisitor::VisitVarDecl(VarDecl *Decl) {
map_type = BPF_MAP_TYPE_LRU_HASH;
} else if (A->getName() == "maps/lru_percpu_hash") {
map_type = BPF_MAP_TYPE_LRU_PERCPU_HASH;
} else if (A->getName() == "maps/lpm_trie") {
map_type = BPF_MAP_TYPE_LPM_TRIE;
} else if (A->getName() == "maps/histogram") {
if (table.key_desc == "\"int\"")
map_type = BPF_MAP_TYPE_ARRAY;
......@@ -656,7 +659,7 @@ bool BTypeVisitor::VisitVarDecl(VarDecl *Decl) {
map_type = BPF_MAP_TYPE_PROG_ARRAY;
} else if (A->getName() == "maps/perf_output") {
map_type = BPF_MAP_TYPE_PERF_EVENT_ARRAY;
int numcpu = sysconf(_SC_NPROCESSORS_ONLN);
int numcpu = get_possible_cpus().size();
if (numcpu <= 0)
numcpu = 1;
table.max_entries = numcpu;
......
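
A hedged sketch of what would select the new `maps/lpm_trie` section from bcc C, assuming BPF_TABLE's type string maps straight onto the section name; the key struct must begin with the prefix length, mirroring struct bpf_lpm_trie_key, and any extra map-creation flags the kernel requires are left out here:

```
struct lpm_key_v4 {
    u32 prefixlen;   // up to 32 for AF_INET
    u8  data[4];
};

// Lands in section("maps/lpm_trie"), matched by the branch added above.
BPF_TABLE("lpm_trie", struct lpm_key_v4, u64, prefix_counts, 1024);
```
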
......@@ -89,6 +89,7 @@ int KBuildHelper::get_flags(const char *uname_machine, vector<string> *cflags) {
cflags->push_back("-D__HAVE_BUILTIN_BSWAP64__");
cflags->push_back("-Wno-unused-value");
cflags->push_back("-Wno-pointer-sign");
cflags->push_back("-fno-stack-protector");
return 0;
}
......
......@@ -137,6 +137,8 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
"-Wno-deprecated-declarations",
"-Wno-gnu-variable-sized-type-not-at-end",
"-fno-color-diagnostics",
"-fno-unwind-tables",
"-fno-asynchronous-unwind-tables",
"-x", "c", "-c", abs_file.c_str()});
KBuildHelper kbuild_helper(kdir, kernel_path_info.first);
......@@ -165,7 +167,11 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
// set up the command line argument wrapper
#if defined(__powerpc64__)
driver::Driver drv("", "ppc64le-unknown-linux-gnu", diags);
#if defined(_CALL_ELF) && _CALL_ELF == 2
driver::Driver drv("", "powerpc64le-unknown-linux-gnu", diags);
#else
driver::Driver drv("", "powerpc64-unknown-linux-gnu", diags);
#endif
#elif defined(__aarch64__)
driver::Driver drv("", "aarch64-unknown-linux-gnu", diags);
#else
......@@ -205,24 +211,25 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
}
// pre-compilation pass for generating tracepoint structures
auto invocation0 = make_unique<CompilerInvocation>();
if (!CompilerInvocation::CreateFromArgs(*invocation0, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
CompilerInstance compiler0;
CompilerInvocation &invocation0 = compiler0.getInvocation();
if (!CompilerInvocation::CreateFromArgs(
invocation0, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
invocation0->getPreprocessorOpts().RetainRemappedFileBuffers = true;
invocation0.getPreprocessorOpts().RetainRemappedFileBuffers = true;
for (const auto &f : remapped_files_)
invocation0->getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation0.getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
if (in_memory) {
invocation0->getPreprocessorOpts().addRemappedFile(main_path, &*main_buf);
invocation0->getFrontendOpts().Inputs.clear();
invocation0->getFrontendOpts().Inputs.push_back(FrontendInputFile(main_path, IK_C));
invocation0.getPreprocessorOpts().addRemappedFile(main_path, &*main_buf);
invocation0.getFrontendOpts().Inputs.clear();
invocation0.getFrontendOpts().Inputs.push_back(
FrontendInputFile(main_path, IK_C));
}
invocation0->getFrontendOpts().DisableFree = false;
invocation0.getFrontendOpts().DisableFree = false;
CompilerInstance compiler0;
compiler0.setInvocation(invocation0.release());
compiler0.createDiagnostics(new IgnoringDiagConsumer());
// capture the rewritten c file
......@@ -233,24 +240,25 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
unique_ptr<llvm::MemoryBuffer> out_buf = llvm::MemoryBuffer::getMemBuffer(out_str);
// first pass
auto invocation1 = make_unique<CompilerInvocation>();
if (!CompilerInvocation::CreateFromArgs(*invocation1, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
CompilerInstance compiler1;
CompilerInvocation &invocation1 = compiler1.getInvocation();
if (!CompilerInvocation::CreateFromArgs(
invocation1, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
// This option instructs clang whether or not to free the file buffers that we
// give to it. Since the embedded header files should be copied fewer times
// and reused if possible, set this flag to true.
invocation1->getPreprocessorOpts().RetainRemappedFileBuffers = true;
invocation1.getPreprocessorOpts().RetainRemappedFileBuffers = true;
for (const auto &f : remapped_files_)
invocation1->getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation1->getPreprocessorOpts().addRemappedFile(main_path, &*out_buf);
invocation1->getFrontendOpts().Inputs.clear();
invocation1->getFrontendOpts().Inputs.push_back(FrontendInputFile(main_path, IK_C));
invocation1->getFrontendOpts().DisableFree = false;
invocation1.getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation1.getPreprocessorOpts().addRemappedFile(main_path, &*out_buf);
invocation1.getFrontendOpts().Inputs.clear();
invocation1.getFrontendOpts().Inputs.push_back(
FrontendInputFile(main_path, IK_C));
invocation1.getFrontendOpts().DisableFree = false;
CompilerInstance compiler1;
compiler1.setInvocation(invocation1.release());
compiler1.createDiagnostics();
// capture the rewritten c file
......@@ -264,21 +272,22 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
*tables = bact.take_tables();
// second pass, clear input and take rewrite buffer
auto invocation2 = make_unique<CompilerInvocation>();
if (!CompilerInvocation::CreateFromArgs(*invocation2, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
CompilerInstance compiler2;
invocation2->getPreprocessorOpts().RetainRemappedFileBuffers = true;
CompilerInvocation &invocation2 = compiler2.getInvocation();
if (!CompilerInvocation::CreateFromArgs(
invocation2, const_cast<const char **>(ccargs.data()),
const_cast<const char **>(ccargs.data()) + ccargs.size(), diags))
return -1;
invocation2.getPreprocessorOpts().RetainRemappedFileBuffers = true;
for (const auto &f : remapped_files_)
invocation2->getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation2->getPreprocessorOpts().addRemappedFile(main_path, &*out_buf1);
invocation2->getFrontendOpts().Inputs.clear();
invocation2->getFrontendOpts().Inputs.push_back(FrontendInputFile(main_path, IK_C));
invocation2->getFrontendOpts().DisableFree = false;
invocation2.getPreprocessorOpts().addRemappedFile(f.first, &*f.second);
invocation2.getPreprocessorOpts().addRemappedFile(main_path, &*out_buf1);
invocation2.getFrontendOpts().Inputs.clear();
invocation2.getFrontendOpts().Inputs.push_back(
FrontendInputFile(main_path, IK_C));
invocation2.getFrontendOpts().DisableFree = false;
// suppress warnings in the 2nd pass, but bail out on errors (our fault)
invocation2->getDiagnosticOpts().IgnoreWarnings = true;
compiler2.setInvocation(invocation2.release());
invocation2.getDiagnosticOpts().IgnoreWarnings = true;
compiler2.createDiagnostics();
EmitLLVMOnlyAction ir_act(&*ctx_);
......
......@@ -48,35 +48,42 @@ TracepointTypeVisitor::TracepointTypeVisitor(ASTContext &C, Rewriter &rewriter)
: C(C), diag_(C.getDiagnostics()), rewriter_(rewriter), out_(llvm::errs()) {
}
static inline bool _is_valid_field(string const& line,
string& field_type,
string& field_name) {
enum class field_kind_t {
common,
data_loc,
regular,
invalid
};
static inline field_kind_t _get_field_kind(string const& line,
string& field_type,
string& field_name) {
auto field_pos = line.find("field:");
if (field_pos == string::npos)
return false;
return field_kind_t::invalid;
auto semi_pos = line.find(';', field_pos);
if (semi_pos == string::npos)
return false;
return field_kind_t::invalid;
auto size_pos = line.find("size:", semi_pos);
if (size_pos == string::npos)
return false;
return field_kind_t::invalid;
auto field = line.substr(field_pos + 6/*"field:"*/,
semi_pos - field_pos - 6);
auto pos = field.find_last_of("\t ");
if (pos == string::npos)
return false;
return field_kind_t::invalid;
field_type = field.substr(0, pos);
field_name = field.substr(pos + 1);
if (field_type.find("__data_loc") != string::npos)
return false;
return field_kind_t::data_loc;
if (field_name.find("common_") == 0)
return false;
return field_kind_t::common;
return true;
return field_kind_t::regular;
}
string TracepointTypeVisitor::GenerateTracepointStruct(
......@@ -91,9 +98,17 @@ string TracepointTypeVisitor::GenerateTracepointStruct(
tp_struct += "\tu64 __do_not_use__;\n";
for (string line; getline(input, line); ) {
string field_type, field_name;
if (!_is_valid_field(line, field_type, field_name))
continue;
tp_struct += "\t" + field_type + " " + field_name + ";\n";
switch (_get_field_kind(line, field_type, field_name)) {
case field_kind_t::invalid:
case field_kind_t::common:
continue;
case field_kind_t::data_loc:
tp_struct += "\tint data_loc_" + field_name + ";\n";
break;
case field_kind_t::regular:
tp_struct += "\t" + field_type + " " + field_name + ";\n";
break;
}
}
tp_struct += "};\n";
......
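
Illustrative output of the rewritten generator for a format file mixing regular and __data_loc fields (e.g. irq:irq_handler_entry):

```
struct tracepoint__irq__irq_handler_entry {
    u64 __do_not_use__;   // placeholder; common_* fields are skipped
    int irq;              // field_kind_t::regular, emitted verbatim
    int data_loc_name;    // field_kind_t::data_loc, emitted as int data_loc_<name>
};
```
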
......@@ -35,6 +35,8 @@
#include <sys/resource.h>
#include <unistd.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include "libbpf.h"
#include "perf_reader.h"
......@@ -137,6 +139,37 @@ int bpf_get_next_key(int fd, void *key, void *next_key)
return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}
void bpf_print_hints(char *log)
{
if (log == NULL)
return;
// The following error strings will need maintenance to match LLVM.
// stack busting
if (strstr(log, "invalid stack off=-") != NULL) {
fprintf(stderr, "HINT: Looks like you exceeded the BPF stack limit. "
"This can happen if you allocate too much local variable storage. "
"For example, if you allocated a 1 Kbyte struct (maybe for "
"BPF_PERF_OUTPUT), busting a max stack of 512 bytes.\n\n");
}
// didn't check NULL on map lookup
if (strstr(log, "invalid mem access 'map_value_or_null'") != NULL) {
fprintf(stderr, "HINT: The 'map_value_or_null' error can happen if "
"you dereference a pointer value from a map lookup without first "
"checking if that pointer is NULL.\n\n");
}
// lacking a bpf_probe_read
if (strstr(log, "invalid mem access 'inv'") != NULL) {
fprintf(stderr, "HINT: The invalid mem access 'inv' error can happen "
"if you try to dereference memory without first using "
"bpf_probe_read() to copy it to the BPF stack. Sometimes the "
"bpf_probe_read is automatic by the bcc rewriter, other times "
"you'll need to be explicit.\n\n");
}
}
#define ROUND_UP(x, n) (((x) + (n) - 1u) & ~((n) - 1u))
int bpf_prog_load(enum bpf_prog_type prog_type,
......@@ -221,6 +254,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
}
fprintf(stderr, "bpf: %s\n%s\n", strerror(errno), bpf_log_buffer);
bpf_print_hints(bpf_log_buffer);
free(bpf_log_buffer);
}
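
For reference, the 'map_value_or_null' hint corresponds to this pattern; the verifier accepts the dereference only behind the NULL check:

```
#include <uapi/linux/ptrace.h>

BPF_TABLE("hash", u32, u64, counts, 1024);

int count(struct pt_regs *ctx) {
    u32 key = 0;
    u64 *val = counts.lookup(&key);
    if (!val)        // dropping this check reproduces the hinted error
        return 0;
    (*val)++;
    return 0;
}
```
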
......@@ -304,14 +338,19 @@ static int bpf_attach_tracing_event(int progfd, const char *event_path,
return 0;
}
static void * bpf_attach_probe(int progfd, const char *event,
const char *event_desc, const char *event_type,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie) {
void * bpf_attach_kprobe(int progfd, enum bpf_probe_attach_type attach_type, const char *ev_name,
const char *fn_name,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie)
{
int kfd;
char buf[256];
char new_name[128];
struct perf_reader *reader = NULL;
static char *event_type = "kprobe";
int n;
snprintf(new_name, sizeof(new_name), "%s_bcc_%d", ev_name, getpid());
reader = perf_reader_new(cb, NULL, cb_cookie);
if (!reader)
goto error;
......@@ -323,8 +362,9 @@ static void * bpf_attach_probe(int progfd, const char *event,
goto error;
}
if (write(kfd, event_desc, strlen(event_desc)) < 0) {
fprintf(stderr, "write(%s, \"%s\") failed: %s\n", buf, event_desc, strerror(errno));
snprintf(buf, sizeof(buf), "%c:%ss/%s %s", attach_type==BPF_PROBE_ENTRY ? 'p' : 'r',
event_type, new_name, fn_name);
if (write(kfd, buf, strlen(buf)) < 0) {
if (errno == EINVAL)
fprintf(stderr, "check dmesg output for possible cause\n");
close(kfd);
......@@ -332,34 +372,84 @@ static void * bpf_attach_probe(int progfd, const char *event,
}
close(kfd);
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s", event_type, event);
if (access("/sys/kernel/debug/tracing/instances", F_OK) != -1) {
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d", getpid());
if (access(buf, F_OK) == -1) {
if (mkdir(buf, 0755) == -1)
goto retry;
}
n = snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d/events/%ss/%s",
getpid(), event_type, new_name);
if (n < sizeof(buf) && bpf_attach_tracing_event(progfd, buf, reader, pid, cpu, group_fd) == 0)
goto out;
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d", getpid());
rmdir(buf);
}
retry:
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s", event_type, new_name);
if (bpf_attach_tracing_event(progfd, buf, reader, pid, cpu, group_fd) < 0)
goto error;
out:
return reader;
error:
perf_reader_free(reader);
return NULL;
}
void * bpf_attach_kprobe(int progfd, const char *event,
const char *event_desc,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie) {
return bpf_attach_probe(progfd, event, event_desc, "kprobe", pid, cpu, group_fd, cb, cb_cookie);
}
void * bpf_attach_uprobe(int progfd, const char *event,
const char *event_desc,
void * bpf_attach_uprobe(int progfd, enum bpf_probe_attach_type attach_type, const char *ev_name,
const char *binary_path, uint64_t offset,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie) {
return bpf_attach_probe(progfd, event, event_desc, "uprobe", pid, cpu, group_fd, cb, cb_cookie);
perf_reader_cb cb, void *cb_cookie)
{
int kfd;
char buf[PATH_MAX];
char new_name[128];
struct perf_reader *reader = NULL;
static char *event_type = "uprobe";
int n;
snprintf(new_name, sizeof(new_name), "%s_bcc_%d", ev_name, getpid());
reader = perf_reader_new(cb, NULL, cb_cookie);
if (!reader)
goto error;
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", event_type);
kfd = open(buf, O_WRONLY | O_APPEND, 0);
if (kfd < 0) {
fprintf(stderr, "open(%s): %s\n", buf, strerror(errno));
goto error;
}
n = snprintf(buf, sizeof(buf), "%c:%ss/%s %s:0x%lx", attach_type==BPF_PROBE_ENTRY ? 'p' : 'r',
event_type, new_name, binary_path, offset);
if (n >= sizeof(buf)) {
close(kfd);
goto error;
}
if (write(kfd, buf, strlen(buf)) < 0) {
if (errno == EINVAL)
fprintf(stderr, "check dmesg output for possible cause\n");
close(kfd);
goto error;
}
close(kfd);
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s", event_type, new_name);
if (bpf_attach_tracing_event(progfd, buf, reader, pid, cpu, group_fd) < 0)
goto error;
return reader;
error:
perf_reader_free(reader);
return NULL;
}
static int bpf_detach_probe(const char *event_desc, const char *event_type) {
static int bpf_detach_probe(const char *ev_name, const char *event_type)
{
int kfd;
char buf[256];
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/%s_events", event_type);
kfd = open(buf, O_WRONLY | O_APPEND, 0);
......@@ -368,7 +458,8 @@ static int bpf_detach_probe(const char *event_desc, const char *event_type) {
return -1;
}
if (write(kfd, event_desc, strlen(event_desc)) < 0) {
snprintf(buf, sizeof(buf), "-:%ss/%s_bcc_%d", event_type, ev_name, getpid());
if (write(kfd, buf, strlen(buf)) < 0) {
fprintf(stderr, "write(%s): %s\n", buf, strerror(errno));
close(kfd);
return -1;
......@@ -378,14 +469,24 @@ static int bpf_detach_probe(const char *event_desc, const char *event_type) {
return 0;
}
int bpf_detach_kprobe(const char *event_desc) {
return bpf_detach_probe(event_desc, "kprobe");
int bpf_detach_kprobe(const char *ev_name)
{
char buf[256];
int ret = bpf_detach_probe(ev_name, "kprobe");
snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/instances/bcc_%d", getpid());
if (access(buf, F_OK) != -1) {
rmdir(buf);
}
return ret;
}
int bpf_detach_uprobe(const char *event_desc) {
return bpf_detach_probe(event_desc, "uprobe");
int bpf_detach_uprobe(const char *ev_name)
{
return bpf_detach_probe(ev_name, "uprobe");
}
void * bpf_attach_tracepoint(int progfd, const char *tp_category,
const char *tp_name, int pid, int cpu,
int group_fd, perf_reader_cb cb, void *cb_cookie) {
......
......@@ -24,6 +24,11 @@
extern "C" {
#endif
enum bpf_probe_attach_type {
BPF_PROBE_ENTRY,
BPF_PROBE_RETURN
};
int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
int max_entries, int map_flags);
int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags);
......@@ -44,15 +49,19 @@ typedef void (*perf_reader_cb)(void *cb_cookie, int pid, uint64_t callchain_num,
void *callchain);
typedef void (*perf_reader_raw_cb)(void *cb_cookie, void *raw, int raw_size);
void * bpf_attach_kprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb,
void *cb_cookie);
int bpf_detach_kprobe(const char *event_desc);
void * bpf_attach_kprobe(int progfd, enum bpf_probe_attach_type attach_type,
const char *ev_name, const char *fn_name,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
int bpf_detach_kprobe(const char *ev_name);
void * bpf_attach_uprobe(int progfd, enum bpf_probe_attach_type attach_type,
const char *ev_name, const char *binary_path, uint64_t offset,
pid_t pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
void * bpf_attach_uprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb,
void *cb_cookie);
int bpf_detach_uprobe(const char *event_desc);
int bpf_detach_uprobe(const char *ev_name);
void * bpf_attach_tracepoint(int progfd, const char *tp_category,
const char *tp_name, int pid, int cpu,
......
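
A caller-side sketch of the reworked signatures, assuming `prog_fd` holds a loaded BPF_PROG_TYPE_KPROBE program; event naming and tracefs cleanup now happen inside the library, so callers pass the probed symbol directly:

```
#include "libbpf.h"
#include <stddef.h>

static void *attach_clone_probe(int prog_fd) {
    // ev_name is a caller-chosen handle; the library appends _bcc_<pid>.
    // Detach later with: bpf_detach_kprobe("p_sys_clone");
    return bpf_attach_kprobe(prog_fd, BPF_PROBE_ENTRY, "p_sys_clone",
                             "sys_clone", -1 /* pid */, 0 /* cpu */,
                             -1 /* group_fd */, NULL, NULL);
}
```
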
......@@ -223,8 +223,9 @@ std::string Context::resolve_bin_path(const std::string &bin_path) {
if (char *which = bcc_procutils_which(bin_path.c_str())) {
result = which;
::free(which);
} else if (const char *which_so = bcc_procutils_which_so(bin_path.c_str())) {
} else if (char *which_so = bcc_procutils_which_so(bin_path.c_str(), 0)) {
result = which_so;
::free(which_so);
}
return result;
......
......@@ -43,13 +43,10 @@ function Bpf.static.cleanup()
libbcc.perf_reader_free(probe)
-- skip bcc-specific kprobes
if not key:starts("bcc:") then
local desc = string.format("-:%s/%s", probe_type, key)
log.info("detaching %s", desc)
if probe_type == "kprobes" then
libbcc.bpf_detach_kprobe(desc)
libbcc.bpf_detach_kprobe(key)
elseif probe_type == "uprobes" then
libbcc.bpf_detach_uprobe(desc)
libbcc.bpf_detach_uprobe(key)
end
end
all_probes[key] = nil
......@@ -183,15 +180,13 @@ end
function Bpf:attach_uprobe(args)
Bpf.check_probe_quota(1)
local path, addr = Sym.check_path_symbol(args.name, args.sym, args.addr)
local path, addr = Sym.check_path_symbol(args.name, args.sym, args.addr, args.pid)
local fn = self:load_func(args.fn_name, 'BPF_PROG_TYPE_KPROBE')
local ptype = args.retprobe and "r" or "p"
local ev_name = string.format("%s_%s_0x%p", ptype, path:gsub("[^%a%d]", "_"), addr)
local desc = string.format("%s:uprobes/%s %s:0x%p", ptype, ev_name, path, addr)
log.info(desc)
local retprobe = args.retprobe and 1 or 0
local res = libbcc.bpf_attach_uprobe(fn.fd, ev_name, desc,
local res = libbcc.bpf_attach_uprobe(fn.fd, retprobe, ev_name, path, addr,
args.pid or -1,
args.cpu or 0,
args.group_fd or -1, nil, nil) -- TODO; reader callback
......@@ -209,11 +204,9 @@ function Bpf:attach_kprobe(args)
local event = args.event or ""
local ptype = args.retprobe and "r" or "p"
local ev_name = string.format("%s_%s", ptype, event:gsub("[%+%.]", "_"))
local desc = string.format("%s:kprobes/%s %s", ptype, ev_name, event)
log.info(desc)
local retprobe = args.retprobe and 1 or 0
local res = libbcc.bpf_attach_kprobe(fn.fd, ev_name, desc,
local res = libbcc.bpf_attach_kprobe(fn.fd, retprobe, ev_name, event,
args.pid or -1,
args.cpu or 0,
args.group_fd or -1, nil, nil) -- TODO; reader callback
......
......@@ -40,13 +40,19 @@ int bpf_open_raw_sock(const char *name);
typedef void (*perf_reader_cb)(void *cb_cookie, int pid, uint64_t callchain_num, void *callchain);
typedef void (*perf_reader_raw_cb)(void *cb_cookie, void *raw, int raw_size);
void * bpf_attach_kprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb, void *cb_cookie);
int bpf_detach_kprobe(const char *event_desc);
void * bpf_attach_kprobe(int progfd, int attach_type, const char *ev_name,
const char *fn_name,
int pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
void * bpf_attach_uprobe(int progfd, const char *event, const char *event_desc,
int pid, int cpu, int group_fd, perf_reader_cb cb, void *cb_cookie);
int bpf_detach_uprobe(const char *event_desc);
int bpf_detach_kprobe(const char *ev_name);
void * bpf_attach_uprobe(int progfd, int attach_type, const char *ev_name,
const char *binary_path, uint64_t offset,
int pid, int cpu, int group_fd,
perf_reader_cb cb, void *cb_cookie);
int bpf_detach_uprobe(const char *ev_name);
void * bpf_open_perf_buffer(perf_reader_raw_cb raw_cb, void *cb_cookie, int pid, int cpu);
]]
......@@ -109,7 +115,8 @@ struct bcc_symbol {
};
int bcc_resolve_symname(const char *module, const char *symname, const uint64_t addr,
struct bcc_symbol *sym);
int pid, struct bcc_symbol *sym);
void bcc_procutils_free(const char *ptr);
void *bcc_symcache_new(int pid);
int bcc_symcache_resolve(void *symcache, uint64_t addr, struct bcc_symbol *sym);
void bcc_symcache_refresh(void *resolver);
......
......@@ -30,17 +30,22 @@ local function create_cache(pid)
}
end
local function check_path_symbol(module, symname, addr)
local function check_path_symbol(module, symname, addr, pid)
local sym = SYM()
if libbcc.bcc_resolve_symname(module, symname, addr or 0x0, sym) < 0 then
local module_path
if libbcc.bcc_resolve_symname(module, symname, addr or 0x0, pid or 0, sym) < 0 then
if sym[0].module == nil then
error("could not find library '%s' in the library path" % module)
else
module_path = ffi.string(sym[0].module)
libbcc.bcc_procutils_free(sym[0].module)
error("failed to resolve symbol '%s' in '%s'" % {
symname, ffi.string(sym[0].module)})
symname, module_path})
end
end
return ffi.string(sym[0].module), sym[0].offset
module_path = ffi.string(sym[0].module)
libbcc.bcc_procutils_free(sym[0].module)
return module_path, sym[0].offset
end
return { create_cache=create_cache, check_path_symbol=check_path_symbol }
......@@ -29,6 +29,7 @@ BaseTable.static.BPF_MAP_TYPE_STACK_TRACE = 7
BaseTable.static.BPF_MAP_TYPE_CGROUP_ARRAY = 8
BaseTable.static.BPF_MAP_TYPE_LRU_HASH = 9
BaseTable.static.BPF_MAP_TYPE_LRU_PERCPU_HASH = 10
BaseTable.static.BPF_MAP_TYPE_LPM_TRIE = 11
function BaseTable:initialize(t_type, bpf, map_id, map_fd, key_type, leaf_type)
assert(t_type == libbcc.bpf_table_type_id(bpf.module, map_id))
......
......@@ -140,6 +140,7 @@ else
S.c.BPF_MAP.CGROUP_ARRAY = 8
S.c.BPF_MAP.LRU_HASH = 9
S.c.BPF_MAP.LRU_PERCPU_HASH = 10
S.c.BPF_MAP.LPM_TRIE = 11
end
if not S.c.BPF_PROG.TRACEPOINT then
S.c.BPF_PROG.TRACEPOINT = 5
......
......@@ -17,7 +17,6 @@ import atexit
import ctypes as ct
import fcntl
import json
import multiprocessing
import os
import re
import struct
......@@ -29,6 +28,7 @@ from .libbcc import lib, _CB_TYPE, bcc_symbol, _SYM_CB_TYPE
from .table import Table
from .perf import Perf
from .usyms import ProcessSymbols
from .utils import get_online_cpus
_kprobe_limit = 1000
_num_open_probes = 0
......@@ -459,9 +459,8 @@ class BPF(object):
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "p_" + event.replace("+", "_").replace(".", "_")
desc = "p:kprobes/%s %s" % (ev_name, event)
res = lib.bpf_attach_kprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_kprobe(fn.fd, 0, ev_name.encode("ascii"),
event.encode("ascii"), pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -475,8 +474,7 @@ class BPF(object):
if ev_name not in self.open_kprobes:
raise Exception("Kprobe %s is not attached" % event)
lib.perf_reader_free(self.open_kprobes[ev_name])
desc = "-:kprobes/%s" % ev_name
res = lib.bpf_detach_kprobe(desc.encode("ascii"))
res = lib.bpf_detach_kprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from kprobe")
self._del_kprobe(ev_name)
......@@ -498,9 +496,8 @@ class BPF(object):
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "r_" + event.replace("+", "_").replace(".", "_")
desc = "r:kprobes/%s %s" % (ev_name, event)
res = lib.bpf_attach_kprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_kprobe(fn.fd, 1, ev_name.encode("ascii"),
event.encode("ascii"), pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -514,8 +511,7 @@ class BPF(object):
if ev_name not in self.open_kprobes:
raise Exception("Kretprobe %s is not attached" % event)
lib.perf_reader_free(self.open_kprobes[ev_name])
desc = "-:kprobes/%s" % ev_name
res = lib.bpf_detach_kprobe(desc.encode("ascii"))
res = lib.bpf_detach_kprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from kprobe")
self._del_kprobe(ev_name)
......@@ -554,20 +550,28 @@ class BPF(object):
@classmethod
def _check_path_symbol(cls, module, symname, addr):
def _check_path_symbol(cls, module, symname, addr, pid):
sym = bcc_symbol()
psym = ct.pointer(sym)
c_pid = 0 if pid == -1 else pid
if lib.bcc_resolve_symname(module.encode("ascii"),
symname.encode("ascii"), addr or 0x0, psym) < 0:
symname.encode("ascii"), addr or 0x0, c_pid, psym) < 0:
if not sym.module:
raise Exception("could not find library %s" % module)
lib.bcc_procutils_free(sym.module)
raise Exception("could not determine address of symbol %s" % symname)
return sym.module.decode(), sym.offset
module_path = ct.cast(sym.module, ct.c_char_p).value.decode()
lib.bcc_procutils_free(sym.module)
return module_path, sym.offset
@staticmethod
def find_library(libname):
res = lib.bcc_procutils_which_so(libname.encode("ascii"))
return res if res is None else res.decode()
res = lib.bcc_procutils_which_so(libname.encode("ascii"), 0)
if not res:
return None
libpath = ct.cast(res, ct.c_char_p).value.decode()
lib.bcc_procutils_free(res)
return libpath
@staticmethod
def get_tracepoints(tp_re):
......@@ -660,7 +664,7 @@ class BPF(object):
res[cpu] = self._attach_perf_event(fn.fd, ev_type, ev_config,
sample_period, sample_freq, pid, cpu, group_fd)
else:
for i in range(0, multiprocessing.cpu_count()):
for i in get_online_cpus():
res[i] = self._attach_perf_event(fn.fd, ev_type, ev_config,
sample_period, sample_freq, pid, i, group_fd)
self.open_perf_events[(ev_type, ev_config)] = res
......@@ -736,7 +740,8 @@ class BPF(object):
Libraries can be given in the name argument without the lib prefix, or
with the full path (/usr/lib/...). Binaries can be given only with the
full path (/bin/sh).
full path (/bin/sh). If a PID is given, the uprobe will attach to the
version of the library used by the process.
Example: BPF(text).attach_uprobe("c", "malloc")
BPF(text).attach_uprobe("/usr/bin/python", "main")
......@@ -753,14 +758,13 @@ class BPF(object):
group_fd=group_fd)
return
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "p_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
desc = "p:uprobes/%s %s:0x%x" % (ev_name, path, addr)
res = lib.bpf_attach_uprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_uprobe(fn.fd, 0, ev_name.encode("ascii"),
path.encode("ascii"), addr, pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -768,21 +772,20 @@ class BPF(object):
self._add_uprobe(ev_name, res)
return self
def detach_uprobe(self, name="", sym="", addr=None):
"""detach_uprobe(name="", sym="", addr=None)
def detach_uprobe(self, name="", sym="", addr=None, pid=-1):
"""detach_uprobe(name="", sym="", addr=None, pid=-1)
Stop running a bpf function that is attached to symbol 'sym' in library
or binary 'name'.
"""
name = str(name)
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
ev_name = "p_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
if ev_name not in self.open_uprobes:
raise Exception("Uprobe %s is not attached" % event)
raise Exception("Uprobe %s is not attached" % ev_name)
lib.perf_reader_free(self.open_uprobes[ev_name])
desc = "-:uprobes/%s" % ev_name
res = lib.bpf_detach_uprobe(desc.encode("ascii"))
res = lib.bpf_detach_uprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from uprobe")
self._del_uprobe(ev_name)
......@@ -805,14 +808,13 @@ class BPF(object):
return
name = str(name)
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
self._check_probe_quota(1)
fn = self.load_func(fn_name, BPF.KPROBE)
ev_name = "r_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
desc = "r:uprobes/%s %s:0x%x" % (ev_name, path, addr)
res = lib.bpf_attach_uprobe(fn.fd, ev_name.encode("ascii"),
desc.encode("ascii"), pid, cpu, group_fd,
res = lib.bpf_attach_uprobe(fn.fd, 1, ev_name.encode("ascii"),
path.encode("ascii"), addr, pid, cpu, group_fd,
self._reader_cb_impl, ct.cast(id(self), ct.py_object))
res = ct.cast(res, ct.c_void_p)
if not res:
......@@ -820,21 +822,20 @@ class BPF(object):
self._add_uprobe(ev_name, res)
return self
def detach_uretprobe(self, name="", sym="", addr=None):
"""detach_uretprobe(name="", sym="", addr=None)
def detach_uretprobe(self, name="", sym="", addr=None, pid=-1):
"""detach_uretprobe(name="", sym="", addr=None, pid=-1)
Stop running a bpf function that is attached to symbol 'sym' in library
or binary 'name'.
"""
name = str(name)
(path, addr) = BPF._check_path_symbol(name, sym, addr)
(path, addr) = BPF._check_path_symbol(name, sym, addr, pid)
ev_name = "r_%s_0x%x" % (self._probe_repl.sub("_", path), addr)
if ev_name not in self.open_uprobes:
raise Exception("Kretprobe %s is not attached" % event)
raise Exception("Uretprobe %s is not attached" % ev_name)
lib.perf_reader_free(self.open_uprobes[ev_name])
desc = "-:uprobes/%s" % ev_name
res = lib.bpf_detach_uprobe(desc.encode("ascii"))
res = lib.bpf_detach_uprobe(ev_name.encode("ascii"))
if res < 0:
raise Exception("Failed to detach BPF from uprobe")
self._del_uprobe(ev_name)
......@@ -1036,18 +1037,17 @@ class BPF(object):
lib.perf_reader_free(v)
# non-string keys here include the perf_events reader
if isinstance(k, str):
desc = "-:kprobes/%s" % k
lib.bpf_detach_kprobe(desc.encode("ascii"))
lib.bpf_detach_kprobe(str(k).encode("ascii"))
self._del_kprobe(k)
for k, v in list(self.open_uprobes.items()):
lib.perf_reader_free(v)
desc = "-:uprobes/%s" % k
lib.bpf_detach_uprobe(desc.encode("ascii"))
lib.bpf_detach_uprobe(str(k).encode("ascii"))
self._del_uprobe(k)
for k, v in self.open_tracepoints.items():
lib.perf_reader_free(v)
(tp_category, tp_name) = k.split(':')
lib.bpf_detach_tracepoint(tp_category, tp_name)
lib.bpf_detach_tracepoint(tp_category.encode("ascii"),
tp_name.encode("ascii"))
self.open_tracepoints.clear()
for (ev_type, ev_config) in list(self.open_perf_events.keys()):
self.detach_perf_event(ev_type, ev_config)
......@@ -1065,4 +1065,4 @@ class BPF(object):
self.cleanup()
from .usdt import USDT
from .usdt import USDT, USDTException
......@@ -87,13 +87,13 @@ lib.bpf_attach_kprobe.restype = ct.c_void_p
_CB_TYPE = ct.CFUNCTYPE(None, ct.py_object, ct.c_int,
ct.c_ulonglong, ct.POINTER(ct.c_ulonglong))
_RAW_CB_TYPE = ct.CFUNCTYPE(None, ct.py_object, ct.c_void_p, ct.c_int)
lib.bpf_attach_kprobe.argtypes = [ct.c_int, ct.c_char_p, ct.c_char_p, ct.c_int,
lib.bpf_attach_kprobe.argtypes = [ct.c_int, ct.c_int, ct.c_char_p, ct.c_char_p, ct.c_int,
ct.c_int, ct.c_int, _CB_TYPE, ct.py_object]
lib.bpf_detach_kprobe.restype = ct.c_int
lib.bpf_detach_kprobe.argtypes = [ct.c_char_p]
lib.bpf_attach_uprobe.restype = ct.c_void_p
lib.bpf_attach_uprobe.argtypes = [ct.c_int, ct.c_char_p, ct.c_char_p, ct.c_int,
ct.c_int, ct.c_int, _CB_TYPE, ct.py_object]
lib.bpf_attach_uprobe.argtypes = [ct.c_int, ct.c_int, ct.c_char_p, ct.c_char_p,
ct.c_ulonglong, ct.c_int, ct.c_int, ct.c_int, _CB_TYPE, ct.py_object]
lib.bpf_detach_uprobe.restype = ct.c_int
lib.bpf_detach_uprobe.argtypes = [ct.c_char_p]
lib.bpf_attach_tracepoint.restype = ct.c_void_p
......@@ -126,16 +126,18 @@ class bcc_symbol(ct.Structure):
_fields_ = [
('name', ct.c_char_p),
('demangle_name', ct.c_char_p),
('module', ct.c_char_p),
('module', ct.POINTER(ct.c_char)),
('offset', ct.c_ulonglong),
]
lib.bcc_procutils_which_so.restype = ct.c_char_p
lib.bcc_procutils_which_so.argtypes = [ct.c_char_p]
lib.bcc_procutils_which_so.restype = ct.POINTER(ct.c_char)
lib.bcc_procutils_which_so.argtypes = [ct.c_char_p, ct.c_int]
lib.bcc_procutils_free.restype = None
lib.bcc_procutils_free.argtypes = [ct.c_void_p]
lib.bcc_resolve_symname.restype = ct.c_int
lib.bcc_resolve_symname.argtypes = [
ct.c_char_p, ct.c_char_p, ct.c_ulonglong, ct.POINTER(bcc_symbol)]
ct.c_char_p, ct.c_char_p, ct.c_ulonglong, ct.c_int, ct.POINTER(bcc_symbol)]
_SYM_CB_TYPE = ct.CFUNCTYPE(ct.c_int, ct.c_char_p, ct.c_ulonglong)
lib.bcc_foreach_symbol.restype = ct.c_int
......
......@@ -13,8 +13,8 @@
# limitations under the License.
import ctypes as ct
import multiprocessing
import os
from .utils import get_online_cpus
class Perf(object):
class perf_event_attr(ct.Structure):
......@@ -105,5 +105,5 @@ class Perf(object):
attr.sample_period = 1
attr.wakeup_events = 9999999 # don't wake up
for cpu in range(0, multiprocessing.cpu_count()):
for cpu in get_online_cpus():
Perf._open_for_cpu(cpu, attr)
......@@ -14,11 +14,14 @@
from collections import MutableMapping
import ctypes as ct
from functools import reduce
import multiprocessing
import os
from .libbcc import lib, _RAW_CB_TYPE
from .perf import Perf
from .utils import get_online_cpus
from .utils import get_possible_cpus
from subprocess import check_output
BPF_MAP_TYPE_HASH = 1
......@@ -31,6 +34,7 @@ BPF_MAP_TYPE_STACK_TRACE = 7
BPF_MAP_TYPE_CGROUP_ARRAY = 8
BPF_MAP_TYPE_LRU_HASH = 9
BPF_MAP_TYPE_LRU_PERCPU_HASH = 10
BPF_MAP_TYPE_LPM_TRIE = 11
stars_max = 40
log2_index_max = 65
......@@ -121,6 +125,8 @@ def Table(bpf, map_id, map_fd, keytype, leaftype, **kwargs):
t = PerCpuHash(bpf, map_id, map_fd, keytype, leaftype, **kwargs)
elif ttype == BPF_MAP_TYPE_PERCPU_ARRAY:
t = PerCpuArray(bpf, map_id, map_fd, keytype, leaftype, **kwargs)
elif ttype == BPF_MAP_TYPE_LPM_TRIE:
t = LpmTrie(bpf, map_id, map_fd, keytype, leaftype)
elif ttype == BPF_MAP_TYPE_STACK_TRACE:
t = StackTrace(bpf, map_id, map_fd, keytype, leaftype)
elif ttype == BPF_MAP_TYPE_LRU_HASH:
......@@ -509,7 +515,7 @@ class PerfEventArray(ArrayBase):
event submitted from the kernel, up to millions per second.
"""
for i in range(0, multiprocessing.cpu_count()):
for i in get_online_cpus():
self._open_perf_buffer(i, callback)
def _open_perf_buffer(self, cpu, callback):
......@@ -550,7 +556,7 @@ class PerfEventArray(ArrayBase):
if not isinstance(ev, self.Event):
raise Exception("argument must be an Event, got %s", type(ev))
for i in range(0, multiprocessing.cpu_count()):
for i in get_online_cpus():
self._open_perf_event(i, ev.typ, ev.config)
......@@ -559,7 +565,7 @@ class PerCpuHash(HashTable):
self.reducer = kwargs.pop("reducer", None)
super(PerCpuHash, self).__init__(*args, **kwargs)
self.sLeaf = self.Leaf
self.total_cpu = multiprocessing.cpu_count()
self.total_cpu = len(get_possible_cpus())
# This needs to be 8 as hard coded into the linux kernel.
self.alignment = ct.sizeof(self.sLeaf) % 8
if self.alignment is 0:
......@@ -595,7 +601,7 @@ class PerCpuHash(HashTable):
def sum(self, key):
if isinstance(self.Leaf(), ct.Structure):
raise IndexError("Leaf must be an integer type for default sum functions")
return self.sLeaf(reduce(lambda x,y: x+y, self.getvalue(key)))
return self.sLeaf(sum(self.getvalue(key)))
def max(self, key):
if isinstance(self.Leaf(), ct.Structure):
......@@ -604,8 +610,7 @@ class PerCpuHash(HashTable):
def average(self, key):
result = self.sum(key)
result.value/=self.total_cpu
return result
return result.value / self.total_cpu
class LruPerCpuHash(PerCpuHash):
def __init__(self, *args, **kwargs):
......@@ -616,7 +621,7 @@ class PerCpuArray(ArrayBase):
self.reducer = kwargs.pop("reducer", None)
super(PerCpuArray, self).__init__(*args, **kwargs)
self.sLeaf = self.Leaf
self.total_cpu = multiprocessing.cpu_count()
self.total_cpu = len(get_possible_cpus())
# This needs to be 8 as hard coded into the linux kernel.
self.alignment = ct.sizeof(self.sLeaf) % 8
if self.alignment is 0:
......@@ -652,7 +657,7 @@ class PerCpuArray(ArrayBase):
def sum(self, key):
if isinstance(self.Leaf(), ct.Structure):
raise IndexError("Leaf must be an integer type for default sum functions")
return self.sLeaf(reduce(lambda x,y: x+y, self.getvalue(key)))
return self.sLeaf(sum(self.getvalue(key)))
def max(self, key):
if isinstance(self.Leaf(), ct.Structure):
......@@ -661,8 +666,19 @@ class PerCpuArray(ArrayBase):
def average(self, key):
result = self.sum(key)
result.value/=self.total_cpu
return result
return result.value / self.total_cpu
class LpmTrie(TableBase):
def __init__(self, *args, **kwargs):
super(LpmTrie, self).__init__(*args, **kwargs)
def __len__(self):
raise NotImplementedError
def __delitem__(self, key):
# Not implemented for lpm trie as of kernel commit
# b95a5c4db09bc7c253636cb84dc9b12c577fd5a0
raise NotImplementedError
class StackTrace(TableBase):
MAX_DEPTH = 127
......
......@@ -13,10 +13,14 @@
# limitations under the License.
import ctypes as ct
import sys
from .libbcc import lib, _USDT_CB, _USDT_PROBE_CB, \
bcc_usdt_location, bcc_usdt_argument, \
BCC_USDT_ARGUMENT_FLAGS
class USDTException(Exception):
pass
class USDTProbeArgument(object):
def __init__(self, argument):
self.signed = argument.size < 0
......@@ -77,8 +81,9 @@ class USDTProbeLocation(object):
res = lib.bcc_usdt_get_argument(self.probe.context, self.probe.name,
self.index, index, ct.pointer(arg))
if res != 0:
raise Exception("error retrieving probe argument %d location %d" %
(index, self.index))
raise USDTException(
"error retrieving probe argument %d location %d" %
(index, self.index))
return USDTProbeArgument(arg)
class USDTProbe(object):
......@@ -103,7 +108,7 @@ class USDTProbe(object):
res = lib.bcc_usdt_get_location(self.context, self.name,
index, ct.pointer(loc))
if res != 0:
raise Exception("error retrieving probe location %d" % index)
raise USDTException("error retrieving probe location %d" % index)
return USDTProbeLocation(self, index, loc)
class USDT(object):
......@@ -112,23 +117,36 @@ class USDT(object):
self.pid = pid
self.context = lib.bcc_usdt_new_frompid(pid)
if self.context == None:
raise Exception("USDT failed to instrument PID %d" % pid)
raise USDTException("USDT failed to instrument PID %d" % pid)
elif path:
self.path = path
self.context = lib.bcc_usdt_new_frompath(path)
if self.context == None:
raise Exception("USDT failed to instrument path %s" % path)
raise USDTException("USDT failed to instrument path %s" % path)
else:
raise Exception("either a pid or a binary path must be specified")
raise USDTException(
"either a pid or a binary path must be specified")
def enable_probe(self, probe, fn_name):
if lib.bcc_usdt_enable_probe(self.context, probe, fn_name) != 0:
raise Exception(("failed to enable probe '%s'; a possible cause " +
"can be that the probe requires a pid to enable") %
probe)
raise USDTException(
("failed to enable probe '%s'; a possible cause " +
"can be that the probe requires a pid to enable") %
probe
)
def enable_probe_or_bail(self, probe, fn_name):
if lib.bcc_usdt_enable_probe(self.context, probe, fn_name) != 0:
print(
"""Error attaching USDT probes: the specified pid might not contain the
given language's runtime, or the runtime was not built with the required
USDT probes. Look for a configure flag similar to --with-dtrace or
--enable-dtrace. To check which probes are present in the process, use the
tplist tool.""")
sys.exit(1)
def get_text(self):
return lib.bcc_usdt_genargs(self.context)
return lib.bcc_usdt_genargs(self.context).decode()
def get_probe_arg_ctype(self, probe_name, arg_index):
return lib.bcc_usdt_get_probe_argctype(
......
# Copyright 2016 Catalysts GmbH
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
def _read_cpu_range(path):
cpus = []
with open(path, 'r') as f:
cpus_range_str = f.read()
for cpu_range in cpus_range_str.split(','):
rangeop = cpu_range.find('-')
if rangeop == -1:
cpus.append(int(cpu_range))
else:
start = int(cpu_range[:rangeop])
end = int(cpu_range[rangeop+1:])
cpus.extend(range(start, end+1))
return cpus
def get_online_cpus():
return _read_cpu_range('/sys/devices/system/cpu/online')
def get_possible_cpus():
return _read_cpu_range('/sys/devices/system/cpu/possible')
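For reference, the sysfs files parsed above hold comma-separated CPU ranges such as `0-2,4`. A self-contained sketch of the same parsing logic (file read omitted; input string illustrative):

```python
# Mirrors _read_cpu_range above, but takes the string directly.
def parse_cpu_range(cpus_range_str):
    cpus = []
    for cpu_range in cpus_range_str.split(','):
        rangeop = cpu_range.find('-')
        if rangeop == -1:
            cpus.append(int(cpu_range))        # single CPU, e.g. "4"
        else:
            start = int(cpu_range[:rangeop])   # range, e.g. "0-2"
            end = int(cpu_range[rangeop + 1:])
            cpus.extend(range(start, end + 1))
    return cpus

assert parse_cpu_range("0-2,4") == [0, 1, 2, 4]
```

The distinction matters because `online` can be a strict subset of `possible` (offlined or not-yet-hotplugged CPUs), which is why the per-cpu code above stops assuming `multiprocessing.cpu_count()` covers every slot.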
......@@ -23,6 +23,7 @@
#include "bcc_perf_map.h"
#include "bcc_proc.h"
#include "bcc_syms.h"
#include "common.h"
#include "vendor/tinyformat.hpp"
#include "catch.hpp"
......@@ -30,10 +31,19 @@
using namespace std;
TEST_CASE("shared object resolution", "[c_api]") {
const char *libm = bcc_procutils_which_so("m");
char *libm = bcc_procutils_which_so("m", 0);
REQUIRE(libm);
REQUIRE(libm[0] == '/');
REQUIRE(string(libm).find("libm.so") != string::npos);
free(libm);
}
TEST_CASE("shared object resolution using loaded libraries", "[c_api]") {
char *libelf = bcc_procutils_which_so("elf", getpid());
REQUIRE(libelf);
REQUIRE(libelf[0] == '/');
REQUIRE(string(libelf).find("libelf") != string::npos);
free(libelf);
}
TEST_CASE("binary resolution with `which`", "[c_api]") {
......@@ -57,10 +67,21 @@ TEST_CASE("list all kernel symbols", "[c_api]") {
TEST_CASE("resolve symbol name in external library", "[c_api]") {
struct bcc_symbol sym;
REQUIRE(bcc_resolve_symname("c", "malloc", 0x0, &sym) == 0);
REQUIRE(bcc_resolve_symname("c", "malloc", 0x0, 0, &sym) == 0);
REQUIRE(string(sym.module).find("libc.so") != string::npos);
REQUIRE(sym.module[0] == '/');
REQUIRE(sym.offset != 0);
bcc_procutils_free(sym.module);
}
TEST_CASE("resolve symbol name in external library using loaded libraries", "[c_api]") {
struct bcc_symbol sym;
REQUIRE(bcc_resolve_symname("bcc", "bcc_procutils_which", 0x0, getpid(), &sym) == 0);
REQUIRE(string(sym.module).find("libbcc.so") != string::npos);
REQUIRE(sym.module[0] == '/');
REQUIRE(sym.offset != 0);
bcc_procutils_free(sym.module);
}
extern "C" int _a_test_function(const char *a_string) {
......@@ -196,3 +217,10 @@ TEST_CASE("resolve symbols using /tmp/perf-pid.map", "[c_api]") {
munmap(map_addr, map_sz);
}
TEST_CASE("get online CPUs", "[c_api]") {
std::vector<int> cpus = ebpf::get_online_cpus();
int num_cpus = sysconf(_SC_NPROCESSORS_ONLN);
REQUIRE(cpus.size() == num_cpus);
}
......@@ -19,23 +19,9 @@ if ldd bcc-lua | grep -q luajit; then
fail "bcc-lua depends on libluajit"
fi
rm -f libbcc.so probe.lua
rm -f probe.lua
echo "return function(BPF) print(\"Hello world\") end" > probe.lua
if ./bcc-lua "probe.lua"; then
fail "bcc-lua runs without libbcc.so"
fi
if ! env LIBBCC_SO_PATH=../cc/libbcc.so ./bcc-lua "probe.lua"; then
fail "bcc-lua cannot load libbcc.so through the environment"
fi
ln -s ../cc/libbcc.so
if ! ./bcc-lua "probe.lua"; then
fail "bcc-lua cannot find local libbcc.so"
fi
PROBE="../../../examples/lua/offcputime.lua"
if ! sudo ./bcc-lua "$PROBE" -d 1 >/dev/null 2>/dev/null; then
......
......@@ -27,8 +27,8 @@ int count(struct pt_regs *ctx) {
local text = text:gsub("PID", tostring(pid))
local b = BPF:new{text=text}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count"}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count", retprobe=true}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count", pid=pid}
b:attach_uprobe{name="c", sym="malloc_stats", fn_name="count", pid=pid, retprobe=true}
assert_equals(BPF.num_open_uprobes(), 2)
......
......@@ -17,7 +17,7 @@ endif()
add_test(NAME py_test_stat1_b WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_stat1_b namespace ${CMAKE_CURRENT_SOURCE_DIR}/test_stat1.py test_stat1.b proto.b)
add_test(NAME py_test_bpf_log WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_bpf_prog namespace ${CMAKE_CURRENT_SOURCE_DIR}/test_bpf_log.py)
COMMAND ${TEST_WRAPPER} py_bpf_prog sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_bpf_log.py)
add_test(NAME py_test_stat1_c WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_stat1_c namespace ${CMAKE_CURRENT_SOURCE_DIR}/test_stat1.py test_stat1.c)
#add_test(NAME py_test_xlate1_b WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
......@@ -56,6 +56,10 @@ add_test(NAME py_test_tracepoint WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_tracepoint sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_tracepoint.py)
add_test(NAME py_test_perf_event WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_perf_event sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_perf_event.py)
add_test(NAME py_test_utils WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_utils sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_utils.py)
add_test(NAME py_test_percpu WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_test_percpu sudo ${CMAKE_CURRENT_SOURCE_DIR}/test_percpu.py)
add_test(NAME py_test_dump_func WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMAND ${TEST_WRAPPER} py_dump_func simple ${CMAKE_CURRENT_SOURCE_DIR}/test_dump_func.py)
......@@ -6,6 +6,8 @@ from bcc import BPF
import ctypes as ct
import random
import time
import subprocess
from bcc.utils import get_online_cpus
from unittest import main, TestCase
class TestArray(TestCase):
......@@ -62,6 +64,37 @@ int kprobe__sys_nanosleep(void *ctx) {
time.sleep(0.1)
b.kprobe_poll()
self.assertGreater(self.counter, 0)
b.cleanup()
def test_perf_buffer_for_each_cpu(self):
self.events = []
class Data(ct.Structure):
_fields_ = [("cpu", ct.c_ulonglong)]
def cb(cpu, data, size):
self.assertGreater(size, ct.sizeof(Data))
event = ct.cast(data, ct.POINTER(Data)).contents
self.events.append(event)
text = """
BPF_PERF_OUTPUT(events);
int kprobe__sys_nanosleep(void *ctx) {
struct {
u64 cpu;
} data = {bpf_get_smp_processor_id()};
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
"""
b = BPF(text=text)
b["events"].open_perf_buffer(cb)
online_cpus = get_online_cpus()
for cpu in online_cpus:
subprocess.call(['taskset', '-c', str(cpu), 'sleep', '0.1'])
b.kprobe_poll()
b.cleanup()
self.assertGreaterEqual(len(self.events), len(online_cpus), 'Received only {}/{} events'.format(len(self.events), len(online_cpus)))
if __name__ == "__main__":
main()
......@@ -51,7 +51,7 @@ class TestBPFProgLoad(TestCase):
except Exception:
self.fp.flush()
self.fp.seek(0)
self.assertEqual(error_msg in self.fp.read(), True)
self.assertEqual(error_msg in self.fp.read().decode(), True)
def test_log_no_debug(self):
......@@ -61,7 +61,7 @@ class TestBPFProgLoad(TestCase):
except Exception:
self.fp.flush()
self.fp.seek(0)
self.assertEqual(error_msg in self.fp.read(), True)
self.assertEqual(error_msg in self.fp.read().decode(), True)
if __name__ == "__main__":
......
......@@ -68,6 +68,16 @@ ipr = IPRoute()
ipdb = IPDB(nl=ipr)
sim = Simulation(ipdb)
allocated_interfaces = set(ipdb.interfaces.keys())
def get_next_iface(prefix):
i = 0
while True:
iface = "{0}{1}".format(prefix, i)
if iface not in allocated_interfaces:
allocated_interfaces.add(iface)
return iface
i += 1
class TestBPFSocket(TestCase):
def setup_br(self, br, veth_rt_2_br, veth_pem_2_br, veth_br_2_pem):
......@@ -84,15 +94,15 @@ class TestBPFSocket(TestCase):
br1.add_port(ipdb.interfaces[veth_rt_2_br])
br1.up()
subprocess.call(["sysctl", "-q", "-w", "net.ipv6.conf." + br + ".disable_ipv6=1"])
def set_default_const(self):
self.ns1 = "ns1"
self.ns2 = "ns2"
self.ns_router = "ns_router"
self.br1 = "br1"
self.br1 = get_next_iface("br")
self.veth_pem_2_br1 = "v20"
self.veth_br1_2_pem = "v21"
self.br2 = "br2"
self.br2 = get_next_iface("br")
self.veth_pem_2_br2 = "v22"
self.veth_br2_2_pem = "v23"
......
#!/usr/bin/env python
# Copyright (c) 2017 Facebook, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
import ctypes as ct
import unittest
from bcc import BPF
from netaddr import IPAddress
class KeyV4(ct.Structure):
_fields_ = [("prefixlen", ct.c_uint),
("data", ct.c_ubyte * 4)]
class KeyV6(ct.Structure):
_fields_ = [("prefixlen", ct.c_uint),
("data", ct.c_ushort * 8)]
class TestLpmTrie(unittest.TestCase):
def test_lpm_trie_v4(self):
test_prog1 = """
BPF_F_TABLE("lpm_trie", u64, int, trie, 16, BPF_F_NO_PREALLOC);
"""
b = BPF(text=test_prog1)
t = b["trie"]
k1 = KeyV4(24, (192, 168, 0, 0))
v1 = ct.c_int(24)
t[k1] = v1
k2 = KeyV4(28, (192, 168, 0, 0))
v2 = ct.c_int(28)
t[k2] = v2
k = KeyV4(32, (192, 168, 0, 15))
self.assertEqual(t[k].value, 28)
k = KeyV4(32, (192, 168, 0, 127))
self.assertEqual(t[k].value, 24)
with self.assertRaises(KeyError):
k = KeyV4(32, (172, 16, 1, 127))
v = t[k]
def test_lpm_trie_v6(self):
test_prog1 = """
struct key_v6 {
u32 prefixlen;
u32 data[4];
};
BPF_F_TABLE("lpm_trie", struct key_v6, int, trie, 16, BPF_F_NO_PREALLOC);
"""
b = BPF(text=test_prog1)
t = b["trie"]
k1 = KeyV6(64, IPAddress('2a00:1450:4001:814:200e::').words)
v1 = ct.c_int(64)
t[k1] = v1
k2 = KeyV6(96, IPAddress('2a00:1450:4001:814::200e').words)
v2 = ct.c_int(96)
t[k2] = v2
k = KeyV6(128, IPAddress('2a00:1450:4001:814::1024').words)
self.assertEqual(t[k].value, 96)
k = KeyV6(128, IPAddress('2a00:1450:4001:814:2046::').words)
self.assertEqual(t[k].value, 64)
with self.assertRaises(KeyError):
k = KeyV6(128, IPAddress('2a00:ffff::').words)
v = t[k]
if __name__ == "__main__":
unittest.main()
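The assertions above rely on longest-prefix-match semantics: a lookup key is compared against every stored prefix, and the entry with the longest matching prefix wins. A pure-Python (3.x) sketch of the same semantics using the standard `ipaddress` module, with table contents mirroring the v4 test:

```python
import ipaddress

def lpm_lookup(table, addr):
    # Return the value stored under the longest prefix containing addr.
    best = None
    for net, value in table.items():
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, value)
    return best[1] if best else None

table = {
    ipaddress.ip_network("192.168.0.0/24"): 24,
    ipaddress.ip_network("192.168.0.0/28"): 28,
}
assert lpm_lookup(table, ipaddress.ip_address("192.168.0.15")) == 28
assert lpm_lookup(table, ipaddress.ip_address("192.168.0.127")) == 24
assert lpm_lookup(table, ipaddress.ip_address("172.16.1.127")) is None
```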
......@@ -9,6 +9,12 @@ import multiprocessing
class TestPercpu(unittest.TestCase):
def setUp(self):
try:
b = BPF(text='BPF_TABLE("percpu_array", u32, u32, stub, 1);')
except:
raise unittest.SkipTest("PerCpu unsupported on this kernel")
def test_u64(self):
test_prog1 = """
BPF_TABLE("percpu_hash", u32, u64, stats, 1);
......@@ -34,8 +40,8 @@ class TestPercpu(unittest.TestCase):
sum = stats_map.sum(stats_map.Key(0))
avg = stats_map.average(stats_map.Key(0))
max = stats_map.max(stats_map.Key(0))
self.assertGreater(sum.value, 0L)
self.assertGreater(max.value, 0L)
self.assertGreater(sum.value, int(0))
self.assertGreater(max.value, int(0))
bpf_code.detach_kprobe("sys_clone")
def test_u32(self):
......@@ -63,8 +69,8 @@ class TestPercpu(unittest.TestCase):
sum = stats_map.sum(stats_map.Key(0))
avg = stats_map.average(stats_map.Key(0))
max = stats_map.max(stats_map.Key(0))
self.assertGreater(sum.value, 0L)
self.assertGreater(max.value, 0L)
self.assertGreater(sum.value, int(0))
self.assertGreater(max.value, int(0))
bpf_code.detach_kprobe("sys_clone")
def test_struct_custom_func(self):
......@@ -95,7 +101,7 @@ class TestPercpu(unittest.TestCase):
f.close()
self.assertEqual(len(stats_map),1)
k = stats_map[ stats_map.Key(0) ]
self.assertGreater(k.c1, 0L)
self.assertGreater(k.c1, int(0))
bpf_code.detach_kprobe("sys_clone")
......
......@@ -7,6 +7,7 @@ import unittest
from time import sleep
import distutils.version
import os
import subprocess
def kernel_version_ge(major, minor):
# True if running kernel is >= X.Y
......@@ -39,5 +40,29 @@ class TestTracepoint(unittest.TestCase):
total_switches += v.value
self.assertNotEqual(0, total_switches)
@unittest.skipUnless(kernel_version_ge(4,7), "requires kernel >= 4.7")
class TestTracepointDataLoc(unittest.TestCase):
def test_tracepoint_data_loc(self):
text = """
struct value_t {
char filename[64];
};
BPF_HASH(execs, u32, struct value_t);
TRACEPOINT_PROBE(sched, sched_process_exec) {
struct value_t val = {0};
char fn[64];
u32 pid = args->pid;
struct value_t *existing = execs.lookup_or_init(&pid, &val);
TP_DATA_LOC_READ_CONST(fn, filename, 64);
__builtin_memcpy(existing->filename, fn, 64);
return 0;
}
"""
b = bcc.BPF(text=text)
subprocess.check_output(["/bin/ls"])
sleep(1)
self.assertTrue("/bin/ls" in [v.filename.decode()
for v in b["execs"].values()])
if __name__ == "__main__":
unittest.main()
......@@ -18,22 +18,24 @@ static void incr(int idx) {
++(*ptr);
}
int count(struct pt_regs *ctx) {
bpf_trace_printk("count() uprobe fired");
u32 pid = bpf_get_current_pid_tgid();
if (pid == PID)
incr(0);
return 0;
}"""
text = text.replace("PID", "%d" % os.getpid())
test_pid = os.getpid()
text = text.replace("PID", "%d" % test_pid)
b = bcc.BPF(text=text)
b.attach_uprobe(name="c", sym="malloc_stats", fn_name="count")
b.attach_uretprobe(name="c", sym="malloc_stats", fn_name="count")
b.attach_uprobe(name="c", sym="malloc_stats", fn_name="count", pid=test_pid)
b.attach_uretprobe(name="c", sym="malloc_stats", fn_name="count", pid=test_pid)
libc = ctypes.CDLL("libc.so.6")
libc.malloc_stats.restype = None
libc.malloc_stats.argtypes = []
libc.malloc_stats()
self.assertEqual(b["stats"][ctypes.c_int(0)].value, 2)
b.detach_uretprobe(name="c", sym="malloc_stats")
b.detach_uprobe(name="c", sym="malloc_stats")
b.detach_uretprobe(name="c", sym="malloc_stats", pid=test_pid)
b.detach_uprobe(name="c", sym="malloc_stats", pid=test_pid)
def test_simple_binary(self):
text = """
......
#!/usr/bin/python
# Copyright (c) Catalysts GmbH
# Licensed under the Apache License, Version 2.0 (the "License")
from bcc.utils import get_online_cpus
import multiprocessing
import unittest
class TestUtils(unittest.TestCase):
def test_get_online_cpus(self):
online_cpus = get_online_cpus()
num_cores = multiprocessing.cpu_count()
self.assertEqual(len(online_cpus), num_cores)
if __name__ == "__main__":
unittest.main()
......@@ -159,7 +159,7 @@ u64 __time = bpf_ktime_get_ns();
if parts[0] not in ["r", "p", "t", "u"]:
self._bail("probe type must be 'p', 'r', 't', or 'u'" +
" but got '%s'" % parts[0])
if re.match(r"\w+\(.*\)", parts[2]) is None:
if re.match(r"\S+\(.*\)", parts[2]) is None:
self._bail(("function signature '%s' has an invalid " +
"format") % parts[2])
......@@ -173,6 +173,9 @@ u64 __time = bpf_ktime_get_ns();
self._bail("no exprs specified")
self.exprs = exprs.split(',')
def _make_valid_identifier(self, ident):
return re.sub(r'[^A-Za-z0-9_]', '_', ident)
def __init__(self, tool, type, specifier):
self.usdt_ctx = None
self.streq_functions = ""
......@@ -196,8 +199,9 @@ u64 __time = bpf_ktime_get_ns();
self.tp_event = self.function
elif self.probe_type == "u":
self.library = parts[1]
self.probe_func_name = "%s_probe%d" % \
(self.function, Probe.next_probe_index)
self.probe_func_name = self._make_valid_identifier(
"%s_probe%d" % \
(self.function, Probe.next_probe_index))
self._enable_usdt_probe()
else:
self.library = parts[1]
......@@ -233,10 +237,12 @@ u64 __time = bpf_ktime_get_ns();
self.entry_probe_required = self.probe_type == "r" and \
(any(map(check, self.exprs)) or check(self.filter))
self.probe_func_name = "%s_probe%d" % \
(self.function, Probe.next_probe_index)
self.probe_hash_name = "%s_hash%d" % \
(self.function, Probe.next_probe_index)
self.probe_func_name = self._make_valid_identifier(
"%s_probe%d" % \
(self.function, Probe.next_probe_index))
self.probe_hash_name = self._make_valid_identifier(
"%s_hash%d" % \
(self.function, Probe.next_probe_index))
Probe.next_probe_index += 1
def _enable_usdt_probe(self):
......@@ -252,7 +258,7 @@ static inline bool %s(char const *ignored, char const *str) {
char needle[] = %s;
char haystack[sizeof(needle)];
bpf_probe_read(&haystack, sizeof(haystack), (void *)str);
for (int i = 0; i < sizeof(needle); ++i) {
for (int i = 0; i < sizeof(needle) - 1; ++i) {
if (needle[i] != haystack[i]) {
return false;
}
......@@ -613,7 +619,8 @@ argdist -p 2780 -z 120 \\
"(see examples below)")
parser.add_argument("-I", "--include", action="append",
metavar="header",
help="additional header files to include in the BPF program")
help="additional header files to include in the BPF program "
"as either full path, or relative to '/usr/include'")
self.args = parser.parse_args()
self.usdt_ctx = None
......@@ -634,7 +641,12 @@ struct __string_t { char s[%d]; };
#include <uapi/linux/ptrace.h>
""" % self.args.string_size
for include in (self.args.include or []):
bpf_source += "#include <%s>\n" % include
if include.startswith((".", "/")):
include = os.path.abspath(include)
bpf_source += "#include \"%s\"\n" % include
else:
bpf_source += "#include <%s>\n" % include
bpf_source += BPF.generate_auto_includes(
map(lambda p: p.raw_spec, self.probes))
for probe in self.probes:
......
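The branch added above chooses between the two C include forms. A standalone sketch of the same decision (header names illustrative):

```python
import os

def include_directive(include):
    # Local ("./x.h") or absolute ("/x.h") paths become quoted includes
    # with the path expanded; anything else stays an angle-bracket
    # include resolved relative to the system include path.
    if include.startswith((".", "/")):
        return '#include "%s"\n' % os.path.abspath(include)
    return '#include <%s>\n' % include

assert include_directive("linux/fs.h") == "#include <linux/fs.h>\n"
assert include_directive("/tmp/defs.h") == '#include "/tmp/defs.h"\n'
```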
......@@ -363,6 +363,7 @@ optional arguments:
below)
-I header, --include header
additional header files to include in the BPF program
as either full path, or relative to '/usr/include'
Probe specifier syntax:
{p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
......
......@@ -106,14 +106,12 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
* test, and maintenance burden.
*/
#ifdef REQ_WRITE
if (req->cmd_flags & REQ_WRITE) {
data.rwflag = !!(req->cmd_flags & REQ_WRITE);
#elif defined(REQ_OP_SHIFT)
data.rwflag = !!((req->cmd_flags >> REQ_OP_SHIFT) == REQ_OP_WRITE);
#else
if ((req->cmd_flags >> REQ_OP_SHIFT) == REQ_OP_WRITE) {
data.rwflag = !!((req->cmd_flags & REQ_OP_MASK) == REQ_OP_WRITE);
#endif
data.rwflag = 1;
} else {
data.rwflag = 0;
}
events.perf_submit(ctx, &data, sizeof(data));
start.delete(&req);
......
......@@ -137,8 +137,10 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
*/
#ifdef REQ_WRITE
info.rwflag = !!(req->cmd_flags & REQ_WRITE);
#else
#elif defined(REQ_OP_SHIFT)
info.rwflag = !!((req->cmd_flags >> REQ_OP_SHIFT) == REQ_OP_WRITE);
#else
info.rwflag = !!((req->cmd_flags & REQ_OP_MASK) == REQ_OP_WRITE);
#endif
whop = whobyreq.lookup(&req);
......
/*
* deadlock_detector.c Detects potential deadlocks in a running process.
* For Linux, uses BCC, eBPF. See .py file.
*
* Copyright 2017 Facebook, Inc.
* Licensed under the Apache License, Version 2.0 (the "License")
*
* 1-Feb-2016 Kenny Yu Created this.
*/
#include <linux/sched.h>
#include <uapi/linux/ptrace.h>
// Maximum number of mutexes a single thread can hold at once.
// If the number is too big, the unrolled loops will cause the stack
// to be too big, and the bpf verifier will fail.
#define MAX_HELD_MUTEXES 16
// Info about held mutexes. `mutex` will be 0 if not held.
struct held_mutex_t {
u64 mutex;
u64 stack_id;
};
// List of mutexes that a thread is holding. Whenever we loop over this array,
// we need to force the compiler to unroll the loop, otherwise the bpf verifier
// will fail because the loop will create a backwards edge.
struct thread_to_held_mutex_leaf_t {
struct held_mutex_t held_mutexes[MAX_HELD_MUTEXES];
};
// Map of thread ID -> array of (mutex addresses, stack id)
BPF_TABLE("hash", u32, struct thread_to_held_mutex_leaf_t,
thread_to_held_mutexes, 2097152);
// Key type for edges. Represents an edge from mutex1 => mutex2.
struct edges_key_t {
u64 mutex1;
u64 mutex2;
};
// Leaf type for edges. Holds information about where each mutex was acquired.
struct edges_leaf_t {
u64 mutex1_stack_id;
u64 mutex2_stack_id;
u32 thread_pid;
char comm[TASK_COMM_LEN];
};
// Represents all edges currently in the mutex wait graph.
BPF_TABLE("hash", struct edges_key_t, struct edges_leaf_t, edges, 2097152);
// Info about parent thread when a child thread is created.
struct thread_created_leaf_t {
u64 stack_id;
u32 parent_pid;
char comm[TASK_COMM_LEN];
};
// Map of child thread pid -> info about parent thread.
BPF_TABLE("hash", u32, struct thread_created_leaf_t, thread_to_parent, 10240);
// Stack traces when threads are created and when mutexes are locked/unlocked.
BPF_STACK_TRACE(stack_traces, 655360);
// The first argument to the user space function we are tracing
// is a pointer to the mutex M held by thread T.
//
// For all mutexes N held by mutexes_held[T]
// add edge N => M (held by T)
// mutexes_held[T].add(M)
int trace_mutex_acquire(struct pt_regs *ctx, void *mutex_addr) {
// Higher 32 bits are the process ID, lower 32 bits are the thread ID
u32 pid = bpf_get_current_pid_tgid();
u64 mutex = (u64)mutex_addr;
struct thread_to_held_mutex_leaf_t empty_leaf = {};
struct thread_to_held_mutex_leaf_t *leaf =
thread_to_held_mutexes.lookup_or_init(&pid, &empty_leaf);
if (!leaf) {
bpf_trace_printk(
"could not add thread_to_held_mutex key, thread: %d, mutex: %p\n", pid,
mutex);
return 1; // Could not insert, no more memory
}
// Recursive mutexes lock the same mutex multiple times. We cannot tell if
// the mutex is recursive after the mutex is already created. To avoid noisy
// reports, disallow self edges. Do one pass to check if we are already
// holding the mutex, and if we are, do nothing.
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
if (leaf->held_mutexes[i].mutex == mutex) {
return 1; // Disallow self edges
}
}
u64 stack_id =
stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
int added_mutex = 0;
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
// If this is a free slot, see if we can insert.
if (!leaf->held_mutexes[i].mutex) {
if (!added_mutex) {
leaf->held_mutexes[i].mutex = mutex;
leaf->held_mutexes[i].stack_id = stack_id;
added_mutex = 1;
}
continue; // Nothing to do for a free slot
}
// Add edges from held mutex => current mutex
struct edges_key_t edge_key = {};
edge_key.mutex1 = leaf->held_mutexes[i].mutex;
edge_key.mutex2 = mutex;
struct edges_leaf_t edge_leaf = {};
edge_leaf.mutex1_stack_id = leaf->held_mutexes[i].stack_id;
edge_leaf.mutex2_stack_id = stack_id;
edge_leaf.thread_pid = pid;
bpf_get_current_comm(&edge_leaf.comm, sizeof(edge_leaf.comm));
// Returns non-zero on error
int result = edges.update(&edge_key, &edge_leaf);
if (result) {
bpf_trace_printk("could not add edge key %p, %p, error: %d\n",
edge_key.mutex1, edge_key.mutex2, result);
continue; // Could not insert, no more memory
}
}
// There were no free slots for this mutex.
if (!added_mutex) {
bpf_trace_printk("could not add mutex %p, added_mutex: %d\n", mutex,
added_mutex);
return 1;
}
return 0;
}
// The first argument to the user space function we are tracing
// is a pointer to the mutex M held by thread T.
//
// mutexes_held[T].remove(M)
int trace_mutex_release(struct pt_regs *ctx, void *mutex_addr) {
// Higher 32 bits are the process ID, lower 32 bits are the thread ID
u32 pid = bpf_get_current_pid_tgid();
u64 mutex = (u64)mutex_addr;
struct thread_to_held_mutex_leaf_t *leaf =
thread_to_held_mutexes.lookup(&pid);
if (!leaf) {
// If the leaf does not exist for the pid, then it means we either missed
// the acquire event, or we had no more memory and could not add it.
bpf_trace_printk(
"could not find thread_to_held_mutex, thread: %d, mutex: %p\n", pid,
mutex);
return 1;
}
// For older kernels without "bpf: allow access into map value arrays"
// (https://lkml.org/lkml/2016/8/30/287) the bpf verifier will fail with an
// invalid memory access on `leaf->held_mutexes[i]` below. On newer kernels,
// we can avoid making this extra copy in `value` and use `leaf` directly.
struct thread_to_held_mutex_leaf_t value = {};
bpf_probe_read(&value, sizeof(struct thread_to_held_mutex_leaf_t), leaf);
#pragma unroll
for (int i = 0; i < MAX_HELD_MUTEXES; ++i) {
// Find the current mutex (if it exists), and clear it.
// Note: Can't use `leaf->` in this if condition, see comment above.
if (value.held_mutexes[i].mutex == mutex) {
leaf->held_mutexes[i].mutex = 0;
leaf->held_mutexes[i].stack_id = 0;
}
}
return 0;
}
// Trace return from clone() syscall in the child thread (return value > 0).
int trace_clone(struct pt_regs *ctx, unsigned long flags, void *child_stack,
void *ptid, void *ctid, struct pt_regs *regs) {
u32 child_pid = PT_REGS_RC(ctx);
if (child_pid <= 0) {
return 1;
}
struct thread_created_leaf_t thread_created_leaf = {};
thread_created_leaf.parent_pid = bpf_get_current_pid_tgid();
thread_created_leaf.stack_id =
stack_traces.get_stackid(ctx, BPF_F_USER_STACK | BPF_F_REUSE_STACKID);
bpf_get_current_comm(&thread_created_leaf.comm,
sizeof(thread_created_leaf.comm));
struct thread_created_leaf_t *insert_result =
thread_to_parent.lookup_or_init(&child_pid, &thread_created_leaf);
if (!insert_result) {
bpf_trace_printk(
"could not add thread_created_key, child: %d, parent: %d\n", child_pid,
thread_created_leaf.parent_pid);
return 1; // Could not insert, no more memory
}
return 0;
}
#!/usr/bin/env python
#
# deadlock_detector Detects potential deadlocks (lock order inversions)
# on a running process. For Linux, uses BCC, eBPF.
#
# USAGE: deadlock_detector.py [-h] [--binary BINARY] [--dump-graph DUMP_GRAPH]
# [--verbose] [--lock-symbols LOCK_SYMBOLS]
# [--unlock-symbols UNLOCK_SYMBOLS]
# pid
#
# This traces pthread mutex lock and unlock calls to build a directed graph
# representing the mutex wait graph:
#
# - Nodes in the graph represent mutexes.
# - Edge (A, B) exists if there exists some thread T where lock(A) was called
# and lock(B) was called before unlock(A) was called.
#
# If the program finds a potential lock order inversion, the program will dump
# the cycle of mutexes and the stack traces where each mutex was acquired, and
# then exit.
#
# This program can only find potential deadlocks that occur while the program
# is tracing the process. It cannot find deadlocks that may have occurred
# before the program was attached to the process.
#
# Since this traces all mutex lock and unlock events and all thread creation
# events on the traced process, the overhead of this bpf program can be very
# high if the process has many threads and mutexes. You should only run this on
# a process where the slowdown is acceptable.
#
# Note: This tool does not work for shared mutexes or recursive mutexes.
#
# For shared (read-write) mutexes, a deadlock requires a cycle in the wait
# graph where at least one of the mutexes in the cycle is acquiring exclusive
# (write) ownership.
#
# For recursive mutexes, lock() is called multiple times on the same mutex.
# However, there is no way to determine if a mutex is a recursive mutex
# after the mutex has been created. As a result, this tool will not find
# potential deadlocks that involve only one mutex.
#
# Copyright 2017 Facebook, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 01-Feb-2017 Kenny Yu Created this.
from __future__ import (
absolute_import, division, unicode_literals, print_function
)
from bcc import BPF
from collections import defaultdict
import argparse
import json
import os
import subprocess
import sys
import time
class DiGraph(object):
'''
Adapted from networkx: http://networkx.github.io/
Represents a directed graph. Edges can store (key, value) attributes.
'''
def __init__(self):
# Map of node -> set of nodes
self.adjacency_map = {}
# Map of (node1, node2) -> map string -> arbitrary attribute
# This will not be copied in subgraph()
self.attributes_map = {}
def neighbors(self, node):
return self.adjacency_map.get(node, set())
def edges(self):
edges = []
for node, neighbors in self.adjacency_map.items():
for neighbor in neighbors:
edges.append((node, neighbor))
return edges
def nodes(self):
return self.adjacency_map.keys()
def attributes(self, node1, node2):
return self.attributes_map[(node1, node2)]
def add_edge(self, node1, node2, **kwargs):
if node1 not in self.adjacency_map:
self.adjacency_map[node1] = set()
if node2 not in self.adjacency_map:
self.adjacency_map[node2] = set()
self.adjacency_map[node1].add(node2)
self.attributes_map[(node1, node2)] = kwargs
def remove_node(self, node):
self.adjacency_map.pop(node, None)
for _, neighbors in self.adjacency_map.items():
neighbors.discard(node)
def subgraph(self, nodes):
graph = DiGraph()
for node in nodes:
for neighbor in self.neighbors(node):
if neighbor in nodes:
graph.add_edge(node, neighbor)
return graph
def node_link_data(self):
'''
Returns the graph as a dictionary in a format that can be
serialized.
'''
data = {
'directed': True,
'multigraph': False,
'graph': {},
'links': [],
'nodes': [],
}
# Do one pass to build a map of node -> position in nodes
node_to_number = {}
for node in self.adjacency_map.keys():
node_to_number[node] = len(data['nodes'])
data['nodes'].append({'id': node})
# Do another pass to build the link information
for node, neighbors in self.adjacency_map.items():
for neighbor in neighbors:
link = self.attributes_map[(node, neighbor)].copy()
link['source'] = node_to_number[node]
link['target'] = node_to_number[neighbor]
data['links'].append(link)
return data
def strongly_connected_components(G):
'''
Adapted from networkx: http://networkx.github.io/
Parameters
----------
G : DiGraph
Returns
-------
comp : generator of sets
A generator of sets of nodes, one for each strongly connected
component of G.
'''
preorder = {}
lowlink = {}
scc_found = {}
scc_queue = []
i = 0 # Preorder counter
for source in G.nodes():
if source not in scc_found:
queue = [source]
while queue:
v = queue[-1]
if v not in preorder:
i = i + 1
preorder[v] = i
done = 1
v_nbrs = G.neighbors(v)
for w in v_nbrs:
if w not in preorder:
queue.append(w)
done = 0
break
if done == 1:
lowlink[v] = preorder[v]
for w in v_nbrs:
if w not in scc_found:
if preorder[w] > preorder[v]:
lowlink[v] = min([lowlink[v], lowlink[w]])
else:
lowlink[v] = min([lowlink[v], preorder[w]])
queue.pop()
if lowlink[v] == preorder[v]:
scc_found[v] = True
scc = {v}
while (
scc_queue and preorder[scc_queue[-1]] > preorder[v]
):
k = scc_queue.pop()
scc_found[k] = True
scc.add(k)
yield scc
else:
scc_queue.append(v)
def simple_cycles(G):
'''
Adapted from networkx: http://networkx.github.io/
Parameters
----------
G : DiGraph
Returns
-------
cycle_generator: generator
A generator that produces elementary cycles of the graph.
Each cycle is represented by a list of nodes along the cycle.
'''
def _unblock(thisnode, blocked, B):
stack = set([thisnode])
while stack:
node = stack.pop()
if node in blocked:
blocked.remove(node)
stack.update(B[node])
B[node].clear()
# Johnson's algorithm requires some ordering of the nodes.
# We assign the arbitrary ordering given by the strongly connected comps
# There is no need to track the ordering as each node is removed as processed.
# save the actual graph so we can mutate it here
# We only take the edges because we do not want to
# copy edge and node attributes here.
subG = G.subgraph(G.nodes())
sccs = list(strongly_connected_components(subG))
while sccs:
scc = sccs.pop()
# order of scc determines ordering of nodes
startnode = scc.pop()
# Processing node runs 'circuit' routine from recursive version
path = [startnode]
blocked = set() # vertex: blocked from search?
closed = set() # nodes involved in a cycle
blocked.add(startnode)
B = defaultdict(set) # graph portions that yield no elementary circuit
stack = [(startnode, list(subG.neighbors(startnode)))]
while stack:
thisnode, nbrs = stack[-1]
if nbrs:
nextnode = nbrs.pop()
if nextnode == startnode:
yield path[:]
closed.update(path)
elif nextnode not in blocked:
path.append(nextnode)
stack.append((nextnode, list(subG.neighbors(nextnode))))
closed.discard(nextnode)
blocked.add(nextnode)
continue
# done with nextnode... look for more neighbors
if not nbrs: # no more nbrs
if thisnode in closed:
_unblock(thisnode, blocked, B)
else:
for nbr in subG.neighbors(thisnode):
if thisnode not in B[nbr]:
B[nbr].add(thisnode)
stack.pop()
path.pop()
# done processing this node
subG.remove_node(startnode)
H = subG.subgraph(scc) # make smaller to avoid work in SCC routine
sccs.extend(list(strongly_connected_components(H)))
def find_cycle(graph):
'''
Looks for a cycle in the graph. If found, returns the first cycle.
If nodes a1, a2, ..., an are in a cycle, then this returns:
[(a1,a2), (a2,a3), ... (an-1,an), (an, a1)]
Otherwise returns an empty list.
'''
cycles = list(simple_cycles(graph))
if cycles:
nodes = cycles[0]
nodes.append(nodes[0])
edges = []
prev = nodes[0]
for node in nodes[1:]:
edges.append((prev, node))
prev = node
return edges
else:
return []
def print_cycle(binary, graph, edges, thread_info, print_stack_trace_fn):
'''
Prints the cycle in the mutex graph in the following format:
Potential Deadlock Detected!
Cycle in lock order graph: M0 => M1 => M2 => M0
for (m, n) in cycle:
Mutex n acquired here while holding Mutex m in thread T:
[ stack trace ]
Mutex m previously acquired by thread T here:
[ stack trace ]
for T in all threads:
Thread T was created here:
[ stack trace ]
'''
# List of mutexes in the cycle, first and last repeated
nodes_in_order = []
# Map mutex address -> readable alias
node_addr_to_name = {}
for counter, (m, n) in enumerate(edges):
nodes_in_order.append(m)
# For global or static variables, try to symbolize the mutex address.
symbol = symbolize_with_objdump(binary, m)
if symbol:
symbol += ' '
node_addr_to_name[m] = 'Mutex M%d (%s0x%016x)' % (counter, symbol, m)
nodes_in_order.append(nodes_in_order[0])
print('----------------\nPotential Deadlock Detected!\n')
print(
'Cycle in lock order graph: %s\n' %
(' => '.join([node_addr_to_name[n] for n in nodes_in_order]))
)
# Set of threads involved in the lock inversion
thread_pids = set()
# For each edge in the cycle, print where the two mutexes were held
for (m, n) in edges:
thread_pid = graph.attributes(m, n)['thread_pid']
thread_comm = graph.attributes(m, n)['thread_comm']
first_mutex_stack_id = graph.attributes(m, n)['first_mutex_stack_id']
second_mutex_stack_id = graph.attributes(m, n)['second_mutex_stack_id']
thread_pids.add(thread_pid)
print(
'%s acquired here while holding %s in Thread %d (%s):' % (
node_addr_to_name[n], node_addr_to_name[m], thread_pid,
thread_comm
)
)
print_stack_trace_fn(second_mutex_stack_id)
print('')
print(
'%s previously acquired by the same Thread %d (%s) here:' %
(node_addr_to_name[m], thread_pid, thread_comm)
)
print_stack_trace_fn(first_mutex_stack_id)
print('')
# Print where the threads were created, if available
for thread_pid in thread_pids:
parent_pid, stack_id, parent_comm = thread_info.get(
thread_pid, (None, None, None)
)
if parent_pid:
print(
'Thread %d created by Thread %d (%s) here: ' %
(thread_pid, parent_pid, parent_comm)
)
print_stack_trace_fn(stack_id)
else:
print(
'Could not find stack trace where Thread %d was created' %
thread_pid
)
print('')
def symbolize_with_objdump(binary, addr):
'''
Searches the binary for the address using objdump. Returns the symbol if
it is found, otherwise returns an empty string.
'''
try:
command = (
'objdump -tT %s | grep %x | awk {\'print $NF\'} | c++filt' %
(binary, addr)
)
output = subprocess.check_output(command, shell=True)
return output.decode('utf-8').strip()
except subprocess.CalledProcessError:
return ''
def strlist(s):
'''Given a comma-separated string, returns a list of substrings'''
return s.strip().split(',')
def main():
examples = '''Examples:
deadlock_detector 181 # Analyze PID 181
deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
# Analyze PID 181 and locks from this binary.
# If tracing a process that is running from
# a dynamically-linked binary, this argument
# is required and should be the path to the
# pthread library.
deadlock_detector 181 --verbose
# Analyze PID 181 and print statistics about
# the mutex wait graph.
deadlock_detector 181 --lock-symbols my_mutex_lock1,my_mutex_lock2 \\
--unlock-symbols my_mutex_unlock1,my_mutex_unlock2
# Analyze PID 181 and trace custom mutex
# symbols instead of pthread mutexes.
deadlock_detector 181 --dump-graph graph.json
# Analyze PID 181 and dump the mutex wait
# graph to graph.json.
'''
parser = argparse.ArgumentParser(
description=(
'Detect potential deadlocks (lock inversions) in a running binary.'
'\nMust be run as root.'
),
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples,
)
parser.add_argument('pid', type=int, help='Pid to trace')
# Binaries with `:` in the path will fail to attach uprobes on kernels
# running without this patch: https://lkml.org/lkml/2017/1/13/585.
# Symlinks to the binary without `:` in the path can get around this issue.
parser.add_argument(
'--binary',
type=str,
default='',
help='If set, trace the mutexes from the binary at this path. '
'For statically-linked binaries, this argument is not required. '
'For dynamically-linked binaries, this argument is required and '
'should be the path of the pthread library the binary is using. '
'Example: /lib/x86_64-linux-gnu/libpthread.so.0',
)
parser.add_argument(
'--dump-graph',
type=str,
default='',
help='If set, this will dump the mutex graph to the specified file.',
)
parser.add_argument(
'--verbose',
action='store_true',
help='Print statistics about the mutex wait graph.',
)
parser.add_argument(
'--lock-symbols',
type=strlist,
default=['pthread_mutex_lock'],
help='Comma-separated list of lock symbols to trace. Default is '
'pthread_mutex_lock. These symbols cannot be inlined in the binary.',
)
parser.add_argument(
'--unlock-symbols',
type=strlist,
default=['pthread_mutex_unlock'],
help='Comma-separated list of unlock symbols to trace. Default is '
'pthread_mutex_unlock. These symbols cannot be inlined in the binary.',
)
args = parser.parse_args()
if not args.binary:
try:
args.binary = os.readlink('/proc/%d/exe' % args.pid)
except OSError as e:
print('%s. Is the process (pid=%d) running?' % (str(e), args.pid))
sys.exit(1)
bpf = BPF(src_file='deadlock_detector.c')
# Trace where threads are created
bpf.attach_kretprobe(
event='sys_clone', fn_name='trace_clone', pid=args.pid
)
# We must trace unlock first, otherwise in the time we attached the probe
# on lock() and have not yet attached the probe on unlock(), a thread can
# acquire mutexes and release them, but the release events will not be
# traced, resulting in noisy reports.
for symbol in args.unlock_symbols:
try:
bpf.attach_uprobe(
name=args.binary,
sym=symbol,
fn_name='trace_mutex_release',
pid=args.pid,
)
except Exception as e:
print('%s. Failed to attach to symbol: %s' % (str(e), symbol))
sys.exit(1)
for symbol in args.lock_symbols:
try:
bpf.attach_uprobe(
name=args.binary,
sym=symbol,
fn_name='trace_mutex_acquire',
pid=args.pid,
)
except Exception as e:
print('%s. Failed to attach to symbol: %s' % (str(e), symbol))
sys.exit(1)
def print_stack_trace(stack_id):
'''Closure that prints the symbolized stack trace.'''
for addr in bpf.get_table('stack_traces').walk(stack_id):
line = bpf.sym(addr, args.pid)
# Try to symbolize with objdump if we cannot with bpf.
if line == '[unknown]':
symbol = symbolize_with_objdump(args.binary, addr)
if symbol:
line = symbol
print('@ %016x %s' % (addr, line))
print('Tracing... Hit Ctrl-C to end.')
while True:
try:
# Map of child thread pid -> parent info
thread_info = {
child.value: (parent.parent_pid, parent.stack_id, parent.comm)
for child, parent in bpf.get_table('thread_to_parent').items()
}
# Mutex wait directed graph. Nodes are mutexes. Edge (A,B) exists
# if there exists some thread T where lock(A) was called and
# lock(B) was called before unlock(A) was called.
graph = DiGraph()
for key, leaf in bpf.get_table('edges').items():
graph.add_edge(
key.mutex1,
key.mutex2,
thread_pid=leaf.thread_pid,
thread_comm=leaf.comm.decode('utf-8'),
first_mutex_stack_id=leaf.mutex1_stack_id,
second_mutex_stack_id=leaf.mutex2_stack_id,
)
if args.verbose:
print(
'Mutexes: %d, Edges: %d' %
(len(graph.nodes()), len(graph.edges()))
)
if args.dump_graph:
with open(args.dump_graph, 'w') as f:
data = graph.node_link_data()
f.write(json.dumps(data, indent=2))
cycle = find_cycle(graph)
if cycle:
print_cycle(
args.binary, graph, cycle, thread_info, print_stack_trace
)
sys.exit(1)
time.sleep(1)
except KeyboardInterrupt:
break
if __name__ == '__main__':
main()
Demonstrations of deadlock_detector.
This program detects potential deadlocks in a running process. The program
attaches uprobes on `pthread_mutex_lock` and `pthread_mutex_unlock` to build
a mutex wait directed graph, and then looks for a cycle in this graph. This
graph has the following properties:
- Nodes in the graph represent mutexes.
- Edge (A, B) exists if there exists some thread T where lock(A) was called
and lock(B) was called before unlock(A) was called.
If there is a cycle in this graph, it indicates a lock order inversion (a
potential deadlock). When the program finds one, it dumps the cycle of mutexes
and the stack traces where each mutex was acquired, and then exits.
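As a toy illustration (mutex names and the cycle check are illustrative, not the tool's actual code): if thread 1 takes mutex A and then B while thread 2 takes B and then A, the recorded edges (A, B) and (B, A) form a cycle:

```python
# Edges recorded while tracing: (held_mutex, acquired_mutex) pairs.
edges = [("A", "B"), ("B", "A")]

adjacency = {}
for held, acquired in edges:
    adjacency.setdefault(held, set()).add(acquired)

def has_cycle(adj):
    # Recursive DFS with colors: 1 = on the current path, 2 = done.
    color = {}
    def visit(node):
        color[node] = 1
        for nxt in adj.get(node, ()):
            if color.get(nxt) == 1:
                return True  # back edge: a cycle exists
            if nxt not in color and visit(nxt):
                return True
        color[node] = 2
        return False
    return any(visit(n) for n in list(adj) if n not in color)

assert has_cycle(adjacency)  # lock order inversion detected
```

The tool itself uses Johnson's algorithm (`simple_cycles` above) so that it can report the full cycle, not just detect that one exists.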
This program can only find potential deadlocks that occur while the program
is tracing the process. It cannot find deadlocks that may have occurred
before the program was attached to the process.
Since this traces all mutex lock and unlock events and all thread creation
events on the traced process, the overhead of this bpf program can be very
high if the process has many threads and mutexes. You should only run this on
a process where the slowdown is acceptable.
Note: This tool does not work for shared mutexes or recursive mutexes.
For shared (read-write) mutexes, a deadlock requires a cycle in the wait
graph where at least one of the mutexes in the cycle is acquiring exclusive
(write) ownership.
For recursive mutexes, lock() is called multiple times on the same mutex.
However, there is no way to determine if a mutex is a recursive mutex
after the mutex has been created. As a result, this tool will not find
potential deadlocks that involve only one mutex.
# ./deadlock_detector.py 181
Tracing... Hit Ctrl-C to end.
----------------
Potential Deadlock Detected!
Cycle in lock order graph: Mutex M0 (main::static_mutex3 0x0000000000473c60) => Mutex M1 (0x00007fff6d738400) => Mutex M2 (global_mutex1 0x0000000000473be0) => Mutex M3 (global_mutex2 0x0000000000473c20) => Mutex M0 (main::static_mutex3 0x0000000000473c60)
Mutex M1 (0x00007fff6d738400) acquired here while holding Mutex M0 (main::static_mutex3 0x0000000000473c60) in Thread 357250 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402e38 main::{lambda()#3}::operator()() const
@ 0000000000406ba8 void std::_Bind_simple<main::{lambda()#3} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406951 std::_Bind_simple<main::{lambda()#3} ()>::operator()()
@ 000000000040673a std::thread::_Impl<std::_Bind_simple<main::{lambda()#3} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M0 (main::static_mutex3 0x0000000000473c60) previously acquired by the same Thread 357250 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402e22 main::{lambda()#3}::operator()() const
@ 0000000000406ba8 void std::_Bind_simple<main::{lambda()#3} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406951 std::_Bind_simple<main::{lambda()#3} ()>::operator()()
@ 000000000040673a std::thread::_Impl<std::_Bind_simple<main::{lambda()#3} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M2 (global_mutex1 0x0000000000473be0) acquired here while holding Mutex M1 (0x00007fff6d738400) in Thread 357251 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402ea8 main::{lambda()#4}::operator()() const
@ 0000000000406b46 void std::_Bind_simple<main::{lambda()#4} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 000000000040692d std::_Bind_simple<main::{lambda()#4} ()>::operator()()
@ 000000000040671c std::thread::_Impl<std::_Bind_simple<main::{lambda()#4} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M1 (0x00007fff6d738400) previously acquired by the same Thread 357251 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402e97 main::{lambda()#4}::operator()() const
@ 0000000000406b46 void std::_Bind_simple<main::{lambda()#4} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 000000000040692d std::_Bind_simple<main::{lambda()#4} ()>::operator()()
@ 000000000040671c std::thread::_Impl<std::_Bind_simple<main::{lambda()#4} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M3 (global_mutex2 0x0000000000473c20) acquired here while holding Mutex M2 (global_mutex1 0x0000000000473be0) in Thread 357247 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402d5f main::{lambda()#1}::operator()() const
@ 0000000000406c6c void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406999 std::_Bind_simple<main::{lambda()#1} ()>::operator()()
@ 0000000000406776 std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M2 (global_mutex1 0x0000000000473be0) previously acquired by the same Thread 357247 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402d4e main::{lambda()#1}::operator()() const
@ 0000000000406c6c void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406999 std::_Bind_simple<main::{lambda()#1} ()>::operator()()
@ 0000000000406776 std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M0 (main::static_mutex3 0x0000000000473c60) acquired here while holding Mutex M3 (global_mutex2 0x0000000000473c20) in Thread 357248 (lockinversion):
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402dc9 main::{lambda()#2}::operator()() const
@ 0000000000406c0a void std::_Bind_simple<main::{lambda()#2} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406975 std::_Bind_simple<main::{lambda()#2} ()>::operator()()
@ 0000000000406758 std::thread::_Impl<std::_Bind_simple<main::{lambda()#2} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Mutex M3 (global_mutex2 0x0000000000473c20) previously acquired by the same Thread 357248 (lockinversion) here:
@ 00000000004024d0 pthread_mutex_lock
@ 0000000000406dd0 std::mutex::lock()
@ 00000000004070d2 std::lock_guard<std::mutex>::lock_guard(std::mutex&)
@ 0000000000402db8 main::{lambda()#2}::operator()() const
@ 0000000000406c0a void std::_Bind_simple<main::{lambda()#2} ()>::_M_invoke<>(std::_Index_tuple<>)
@ 0000000000406975 std::_Bind_simple<main::{lambda()#2} ()>::operator()()
@ 0000000000406758 std::thread::_Impl<std::_Bind_simple<main::{lambda()#2} ()> >::_M_run()
@ 00007fd4496564e1 execute_native_thread_routine
@ 00007fd449dd57f1 start_thread
@ 00007fd44909746d __clone
Thread 357248 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004033ac std::thread::thread<main::{lambda()#2}>(main::{lambda()#2}&&)
@ 000000000040308f main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
Thread 357250 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004034b2 std::thread::thread<main::{lambda()#3}>(main::{lambda()#3}&&)
@ 00000000004030b9 main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
Thread 357251 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004035b8 std::thread::thread<main::{lambda()#4}>(main::{lambda()#4}&&)
@ 00000000004030e6 main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
Thread 357247 created by Thread 350692 (lockinversion) here:
@ 00007fd449097431 __clone
@ 00007fd449dd5ef5 pthread_create
@ 00007fd449658440 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>)
@ 00000000004032a6 std::thread::thread<main::{lambda()#1}>(main::{lambda()#1}&&)
@ 0000000000403070 main
@ 00007fd448faa0f6 __libc_start_main
@ 0000000000402ad8 [unknown]
This is output from a process that has a potential deadlock involving 4 mutexes
and 4 threads:
- Thread 357250 acquired M1 while holding M0 (edge M0 -> M1)
- Thread 357251 acquired M2 while holding M1 (edge M1 -> M2)
- Thread 357247 acquired M3 while holding M2 (edge M2 -> M3)
- Thread 357248 acquired M0 while holding M3 (edge M3 -> M0)
This is the C++ program that generated the output above:
```c++
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
std::mutex global_mutex1;
std::mutex global_mutex2;
int main(void) {
static std::mutex static_mutex3;
std::mutex local_mutex4;
std::cout << "sleeping for a bit to allow trace to attach..." << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(10));
std::cout << "starting program..." << std::endl;
auto t1 = std::thread([] {
std::lock_guard<std::mutex> g1(global_mutex1);
std::lock_guard<std::mutex> g2(global_mutex2);
});
t1.join();
auto t2 = std::thread([] {
std::lock_guard<std::mutex> g2(global_mutex2);
std::lock_guard<std::mutex> g3(static_mutex3);
});
t2.join();
auto t3 = std::thread([&local_mutex4] {
std::lock_guard<std::mutex> g3(static_mutex3);
std::lock_guard<std::mutex> g4(local_mutex4);
});
t3.join();
auto t4 = std::thread([&local_mutex4] {
std::lock_guard<std::mutex> g4(local_mutex4);
std::lock_guard<std::mutex> g1(global_mutex1);
});
t4.join();
std::cout << "sleeping to allow trace to collect data..." << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(5));
std::cout << "done!" << std::endl;
}
```
Note that an actual deadlock did not occur, although this mutex lock ordering
creates the possibility of a deadlock; this is a hint to the programmer to
reconsider the lock ordering. If the mutexes are global or static and debug
symbols are enabled, the output will contain the mutex symbol name. The output
uses a format similar to ThreadSanitizer's
(https://github.com/google/sanitizers/wiki/ThreadSanitizerDeadlockDetector).
# ./deadlock_detector.py 181 --binary /usr/local/bin/lockinversion
Tracing... Hit Ctrl-C to end.
^C
If the traced process was started from a statically-linked executable, this
argument is optional, and the program will determine the path of the executable
from the pid. However, on older kernels without the patch
"uprobe: Find last occurrence of ':' when parsing uprobe PATH:OFFSET"
(https://lkml.org/lkml/2017/1/13/585), binaries that contain `:` in the path
cannot be attached with uprobes. As a workaround, create a symlink to the
binary and pass the symlink name to the `--binary` option instead.
# ./deadlock_detector.py 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
Tracing... Hit Ctrl-C to end.
^C
If the traced process was started from a dynamically-linked executable, this
argument is required and must be the path to the pthread shared library used by
the executable.
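One way to find that path for a running process is to scan its memory map. A
minimal sketch, assuming a Linux /proc filesystem (this helper is illustrative
and not part of the tool):

```python
# Sketch: locate the pthread library mapped into a process.
def find_pthread_lib(pid):
    with open('/proc/%d/maps' % pid) as maps:
        for line in maps:
            path = line.split()[-1]
            if 'libpthread' in path:
                return path
    return None

print(find_pthread_lib(181))  # e.g. /lib/x86_64-linux-gnu/libpthread.so.0
```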
# ./deadlock_detector.py 181 --dump-graph graph.json --verbose
Tracing... Hit Ctrl-C to end.
Mutexes: 0, Edges: 0
Mutexes: 532, Edges: 411
Mutexes: 735, Edges: 675
Mutexes: 1118, Edges: 1278
Mutexes: 1666, Edges: 2185
Mutexes: 2056, Edges: 2694
Mutexes: 2245, Edges: 2906
Mutexes: 2656, Edges: 3479
Mutexes: 2813, Edges: 3785
^C
If the program does not find a deadlock, it will keep running until you hit
Ctrl-C. If you pass the `--verbose` flag, the program will also print statistics
about the number of mutexes and edges in the mutex wait graph. If you want to
serialize the graph to analyze it later, you can pass the `--dump-graph FILE`
flag, and the program will serialize the graph as JSON.
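Assuming the dump follows networkx's node-link JSON layout (the tool's
`node_link_data()` call suggests as much, but treat this as a sketch), the
graph can be reloaded later for offline analysis:

```python
# Sketch: reload a wait graph produced with --dump-graph graph.json.
import json
from networkx.readwrite import json_graph

with open('graph.json') as f:
    g = json_graph.node_link_graph(json.load(f), directed=True)

for a, b, attrs in g.edges(data=True):
    print('mutex %x -> %x held by thread %d (%s)' %
          (a, b, attrs['thread_pid'], attrs['thread_comm']))
```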
# ./deadlock_detector.py 181 --lock-symbols custom_mutex1_lock,custom_mutex2_lock --unlock-symbols custom_mutex1_unlock,custom_mutex2_unlock --verbose
Tracing... Hit Ctrl-C to end.
Mutexes: 0, Edges: 0
Mutexes: 532, Edges: 411
Mutexes: 735, Edges: 675
Mutexes: 1118, Edges: 1278
Mutexes: 1666, Edges: 2185
Mutexes: 2056, Edges: 2694
Mutexes: 2245, Edges: 2906
Mutexes: 2656, Edges: 3479
Mutexes: 2813, Edges: 3785
^C
If your program is using custom mutexes and not pthread mutexes, you can use
the `--lock-symbols` and `--unlock-symbols` flags to specify different mutex
symbols to trace. The flags take a comma-separated string of symbol names.
Note that if the symbols are inlined in the binary, then this program can result
in false positives.
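Under the hood, each listed symbol simply becomes one more uprobe attachment.
A minimal sketch of that pattern with bcc's Python API (the handler body and
binary path here are placeholders, not the tool's actual program):

```python
# Sketch: attach one uprobe per comma-separated lock symbol.
from bcc import BPF

b = BPF(text="""
#include <uapi/linux/ptrace.h>
int trace_lock(struct pt_regs *ctx) { return 0; }
""")
binary = '/usr/local/bin/lockinversion'  # hypothetical path
for sym in 'custom_mutex1_lock,custom_mutex2_lock'.split(','):
    b.attach_uprobe(name=binary, sym=sym, fn_name='trace_lock', pid=181)
```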
USAGE message:
# ./deadlock_detector.py -h
usage: deadlock_detector.py [-h] [--binary BINARY] [--dump-graph DUMP_GRAPH]
[--verbose] [--lock-symbols LOCK_SYMBOLS]
[--unlock-symbols UNLOCK_SYMBOLS]
pid
Detect potential deadlocks (lock inversions) in a running binary.
Must be run as root.
positional arguments:
pid Pid to trace
optional arguments:
-h, --help show this help message and exit
--binary BINARY If set, trace the mutexes from the binary at this
path. For statically-linked binaries, this argument is
not required. For dynamically-linked binaries, this
argument is required and should be the path of the
pthread library the binary is using. Example:
/lib/x86_64-linux-gnu/libpthread.so.0
--dump-graph DUMP_GRAPH
If set, this will dump the mutex graph to the
specified file.
--verbose Print statistics about the mutex wait graph.
--lock-symbols LOCK_SYMBOLS
Comma-separated list of lock symbols to trace. Default
is pthread_mutex_lock. These symbols cannot be inlined
in the binary.
--unlock-symbols UNLOCK_SYMBOLS
Comma-separated list of unlock symbols to trace.
Default is pthread_mutex_unlock. These symbols cannot
be inlined in the binary.
Examples:
deadlock_detector 181 # Analyze PID 181
deadlock_detector 181 --binary /lib/x86_64-linux-gnu/libpthread.so.0
# Analyze PID 181 and locks from this binary.
# If tracing a process that is running from
# a dynamically-linked binary, this argument
# is required and should be the path to the
# pthread library.
deadlock_detector 181 --verbose
# Analyze PID 181 and print statistics about
# the mutex wait graph.
deadlock_detector 181 --lock-symbols my_mutex_lock1,my_mutex_lock2 \
--unlock-symbols my_mutex_unlock1,my_mutex_unlock2
# Analyze PID 181 and trace custom mutex
# symbols instead of pthread mutexes.
deadlock_detector 181 --dump-graph graph.json
# Analyze PID 181 and dump the mutex wait
# graph to graph.json.
......@@ -197,9 +197,9 @@ def print_event(cpu, data, size):
if args.timestamp:
print("%-8.3f" % (time.time() - start_ts), end="")
ppid = get_ppid(event.pid)
print("%-16s %-6s %-6s %3s %s" % (event.comm, event.pid,
print("%-16s %-6s %-6s %3s %s" % (event.comm.decode(), event.pid,
ppid if ppid > 0 else "?", event.retval,
' '.join(argv[event.pid])))
b' '.join(argv[event.pid]).decode()))
del(argv[event.pid])
......
......@@ -43,6 +43,7 @@ class Probe(object):
func -- probe a kernel function
lib:func -- probe a user-space function in the library 'lib'
/path:func -- probe a user-space function in binary '/path'
p::func -- same thing as 'func'
p:lib:func -- same thing as 'lib:func'
t:cat:event -- probe a kernel tracepoint
......@@ -219,8 +220,11 @@ class Tool(object):
./funccount -Ti 5 'vfs_*' # output every 5 seconds, with timestamps
./funccount -p 185 'vfs_*' # count vfs calls for PID 185 only
./funccount t:sched:sched_fork # count calls to the sched_fork tracepoint
./funccount -p 185 u:node:gc* # count all GC USDT probes in node
./funccount -p 185 u:node:gc* # count all GC USDT probes in node, PID 185
./funccount c:malloc # count all malloc() calls in libc
./funccount go:os.* # count all "os.*" calls in libgo
./funccount -p 185 go:os.* # count all "os.*" calls in libgo, PID 185
./funccount ./test:read* # count "read*" calls in the ./test binary
"""
parser = argparse.ArgumentParser(
description="Count functions, tracepoints, and USDT probes",
......
......@@ -169,7 +169,7 @@ Ctrl-C has been hit.
User functions can be traced in executables or libraries, and per-process
filtering is allowed:
# ./funccount -p 1442 contentions:*
# ./funccount -p 1442 /home/ubuntu/contentions:*
Tracing 15 functions for "/home/ubuntu/contentions:*"... Hit Ctrl-C to end.
^C
FUNC COUNT
......@@ -180,6 +180,10 @@ insert_result 87186
is_prime 1252772
Detaching...
If /home/ubuntu is in the $PATH, then the following command will also work:
# ./funccount -p 1442 contentions:*
Counting libc write and read calls using regular expression syntax (-r):
......@@ -314,6 +318,8 @@ examples:
./funccount -Ti 5 'vfs_*' # output every 5 seconds, with timestamps
./funccount -p 185 'vfs_*' # count vfs calls for PID 185 only
./funccount t:sched:sched_fork # count calls to the sched_fork tracepoint
./funccount -p 185 u:node:gc* # count all GC USDT probes in node
./funccount -p 185 u:node:gc* # count all GC USDT probes in node, PID 185
./funccount c:malloc # count all malloc() calls in libc
./funccount go:os.* # count all "os.*" calls in libgo
./funccount -p 185 go:os.* # count all "os.*" calls in libgo, PID 185
./funccount ./test:read* # count "read*" calls in the ./test binary
......@@ -201,9 +201,10 @@ if not library:
b.attach_kretprobe(event_re=pattern, fn_name="trace_func_return")
matched = b.num_open_kprobes()
else:
b.attach_uprobe(name=library, sym_re=pattern, fn_name="trace_func_entry")
b.attach_uprobe(name=library, sym_re=pattern, fn_name="trace_func_entry",
pid=args.pid or -1)
b.attach_uretprobe(name=library, sym_re=pattern,
fn_name="trace_func_return")
fn_name="trace_func_return", pid=args.pid or -1)
matched = b.num_open_uprobes()
if matched == 0:
......
......@@ -18,8 +18,21 @@
from __future__ import print_function
from bcc import BPF
from time import strftime
import argparse
import ctypes as ct
examples = """examples:
./gethostlatency # trace all getaddrinfo/gethostbyname[2] calls
./gethostlatency -p 181 # only trace PID 181
"""
parser = argparse.ArgumentParser(
description="Show latency for getaddrinfo/gethostbyname[2] calls",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid", help="trace this PID only", type=int,
default=-1)
args = parser.parse_args()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
......@@ -34,7 +47,6 @@ struct val_t {
struct data_t {
u32 pid;
u64 ts;
u64 delta;
char comm[TASK_COMM_LEN];
char host[80];
......@@ -77,56 +89,42 @@ int do_return(struct pt_regs *ctx) {
bpf_probe_read(&data.host, sizeof(data.host), (void *)valp->host);
data.pid = valp->pid;
data.delta = tsp - valp->ts;
data.ts = tsp / 1000;
events.perf_submit(ctx, &data, sizeof(data));
start.delete(&pid);
return 0;
}
"""
b = BPF(text=bpf_text)
b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry")
b.attach_uprobe(name="c", sym="gethostbyname", fn_name="do_entry")
b.attach_uprobe(name="c", sym="gethostbyname2", fn_name="do_entry")
b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return")
b.attach_uretprobe(name="c", sym="gethostbyname", fn_name="do_return")
b.attach_uretprobe(name="c", sym="gethostbyname2", fn_name="do_return")
b.attach_uprobe(name="c", sym="getaddrinfo", fn_name="do_entry", pid=args.pid)
b.attach_uprobe(name="c", sym="gethostbyname", fn_name="do_entry",
pid=args.pid)
b.attach_uprobe(name="c", sym="gethostbyname2", fn_name="do_entry",
pid=args.pid)
b.attach_uretprobe(name="c", sym="getaddrinfo", fn_name="do_return",
pid=args.pid)
b.attach_uretprobe(name="c", sym="gethostbyname", fn_name="do_return",
pid=args.pid)
b.attach_uretprobe(name="c", sym="gethostbyname2", fn_name="do_return",
pid=args.pid)
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ts", ct.c_ulonglong),
("delta", ct.c_ulonglong),
("comm", ct.c_char * TASK_COMM_LEN),
("host", ct.c_char * 80)
]
start_ts = 0
prev_ts = 0
delta = 0
# header
print("%-9s %-6s %-16s %10s %s" % ("TIME", "PID", "COMM", "LATms", "HOST"))
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
global start_ts
global prev_ts
global delta
if start_ts == 0:
prev_ts = start_ts
if start_ts == 1:
delta = float(delta) + (event.ts - prev_ts)
print("%-9s %-6d %-16s %10.2f %s" % (strftime("%H:%M:%S"), event.pid,
event.comm, (event.delta / 1000000), event.host))
prev_ts = event.ts
start_ts = 1
# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
......
......@@ -19,3 +19,19 @@ TIME PID COMM LATms HOST
In this example, the first call to lookup "www.iovisor.org" took 90 ms, and
the second took 0 ms (cached). The slowest call in this example was to "foo",
which was an unsuccessful lookup.
USAGE message:
# ./gethostlatency -h
usage: gethostlatency [-h] [-p PID]
Show latency for getaddrinfo/gethostbyname[2] calls
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
examples:
./gethostlatency # trace all getaddrinfo/gethostbyname[2] calls
./gethostlatency -p 181 # only trace PID 181
......@@ -135,6 +135,8 @@ parser.add_argument("-z", "--min-size", type=int,
help="capture only allocations larger than this size")
parser.add_argument("-Z", "--max-size", type=int,
help="capture only allocations smaller than this size")
parser.add_argument("-O", "--obj", type=str, default="c",
help="attach to malloc & free in the specified object")
args = parser.parse_args()
......@@ -149,6 +151,7 @@ num_prints = args.count
top_stacks = args.top
min_size = args.min_size
max_size = args.max_size
obj = args.obj
if min_size is not None and max_size is not None and min_size > max_size:
print("min_size (-z) can't be greater than max_size (-Z)")
......@@ -251,11 +254,11 @@ bpf_program = BPF(text=bpf_source)
if not kernel_trace:
print("Attaching to malloc and free in pid %d, Ctrl+C to quit." % pid)
bpf_program.attach_uprobe(name="c", sym="malloc",
bpf_program.attach_uprobe(name=obj, sym="malloc",
fn_name="alloc_enter", pid=pid)
bpf_program.attach_uretprobe(name="c", sym="malloc",
bpf_program.attach_uretprobe(name=obj, sym="malloc",
fn_name="alloc_exit", pid=pid)
bpf_program.attach_uprobe(name="c", sym="free",
bpf_program.attach_uprobe(name=obj, sym="free",
fn_name="free_enter", pid=pid)
else:
print("Attaching to kmalloc and kfree, Ctrl+C to quit.")
......
......@@ -150,14 +150,16 @@ of the sampling rate applied.
USAGE message:
# ./memleak -h
usage: memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
[-s SAMPLE_RATE] [-d STACK_DEPTH] [-T TOP]
usage: memleak.py [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
[-s SAMPLE_RATE] [-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE]
[-O OBJ]
[interval] [count]
Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with malloc/free and kernel-mode
allocations made with kmalloc/kfree.
positional arguments:
interval interval in seconds to print outstanding allocations
count number of times to print the report before exiting
......@@ -175,13 +177,12 @@ optional arguments:
execute and trace the specified command
-s SAMPLE_RATE, --sample-rate SAMPLE_RATE
sample every N-th allocation to decrease the overhead
-d STACK_DEPTH, --stack_depth STACK_DEPTH
maximum stack depth to capture
-T TOP, --top TOP display only this many top allocating stacks (by size)
-z MIN_SIZE, --min-size MIN_SIZE
capture only allocations larger than this size
-Z MAX_SIZE, --max-size MAX_SIZE
capture only allocations smaller than this size
-O OBJ, --obj OBJ attach to malloc & free in the specified object
EXAMPLES:
......
......@@ -90,7 +90,7 @@ parser.add_argument("-a", "--annotations", action="store_true",
help="add _[k] annotations to kernel frames")
parser.add_argument("-f", "--folded", action="store_true",
help="output folded format, one line per stack (for flame graphs)")
parser.add_argument("--stack-storage-size", default=2048,
parser.add_argument("--stack-storage-size", default=10240,
type=positive_nonzero_int,
help="the number of unique stack traces that can be stored and "
"displayed (default 2048)")
......
......@@ -130,18 +130,20 @@ b = BPF(text=prog)
# on its exit (Mark Drayton)
#
if args.openssl:
b.attach_uprobe(name="ssl", sym="SSL_write", fn_name="probe_SSL_write")
b.attach_uprobe(name="ssl", sym="SSL_read", fn_name="probe_SSL_read_enter")
b.attach_uprobe(name="ssl", sym="SSL_write", fn_name="probe_SSL_write",
pid=args.pid or -1)
b.attach_uprobe(name="ssl", sym="SSL_read", fn_name="probe_SSL_read_enter",
pid=args.pid or -1)
b.attach_uretprobe(name="ssl", sym="SSL_read",
fn_name="probe_SSL_read_exit")
fn_name="probe_SSL_read_exit", pid=args.pid or -1)
if args.gnutls:
b.attach_uprobe(name="gnutls", sym="gnutls_record_send",
fn_name="probe_SSL_write")
fn_name="probe_SSL_write", pid=args.pid or -1)
b.attach_uprobe(name="gnutls", sym="gnutls_record_recv",
fn_name="probe_SSL_read_enter")
fn_name="probe_SSL_read_enter", pid=args.pid or -1)
b.attach_uretprobe(name="gnutls", sym="gnutls_record_recv",
fn_name="probe_SSL_read_exit")
fn_name="probe_SSL_read_exit", pid=args.pid or -1)
# define output data structure in Python
TASK_COMM_LEN = 16 # linux/sched.h
......
......@@ -44,16 +44,12 @@ bpf_text = """
#include <linux/sched.h>
struct val_t {
u32 pid;
u64 ts;
char comm[TASK_COMM_LEN];
const char *fname;
};
struct data_t {
u32 pid;
u64 ts;
u64 delta;
u64 ts_ns;
int ret;
char comm[TASK_COMM_LEN];
char fname[NAME_MAX];
......@@ -69,12 +65,8 @@ int trace_entry(struct pt_regs *ctx, const char __user *filename)
u32 pid = bpf_get_current_pid_tgid();
FILTER
if (bpf_get_current_comm(&val.comm, sizeof(val.comm)) == 0) {
val.pid = bpf_get_current_pid_tgid();
val.ts = bpf_ktime_get_ns();
val.fname = filename;
infotmp.update(&pid, &val);
}
val.fname = filename;
infotmp.update(&pid, &val);
return 0;
};
......@@ -83,20 +75,17 @@ int trace_return(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
struct val_t *valp;
struct data_t data = {};
u64 tsp = bpf_ktime_get_ns();
valp = infotmp.lookup(&pid);
if (valp == 0) {
// missed entry
return 0;
}
bpf_probe_read(&data.comm, sizeof(data.comm), valp->comm);
struct data_t data = {.pid = pid};
bpf_probe_read(&data.fname, sizeof(data.fname), (void *)valp->fname);
data.pid = valp->pid;
data.delta = tsp - valp->ts;
data.ts = tsp / 1000;
bpf_get_current_comm(&data.comm, sizeof(data.comm));
data.ts_ns = bpf_ktime_get_ns();
data.ret = PT_REGS_RC(ctx);
events.perf_submit(ctx, &data, sizeof(data));
......@@ -129,8 +118,7 @@ NAME_MAX = 255 # linux/limits.h
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ts", ct.c_ulonglong),
("delta", ct.c_ulonglong),
("ts_ns", ct.c_ulonglong),
("ret", ct.c_int),
("comm", ct.c_char * TASK_COMM_LEN),
("fname", ct.c_char * NAME_MAX)
......@@ -162,25 +150,14 @@ def print_event(cpu, data, size):
err = - event.ret
if start_ts == 0:
prev_ts = start_ts
if start_ts == 1:
delta = float(delta) + (event.ts - prev_ts)
if (args.failed and (event.ret >= 0)):
start_ts = 1
prev_ts = event.ts
return
start_ts = event.ts_ns
if args.timestamp:
print("%-14.9f" % (delta / 1000000), end="")
print("%-14.9f" % (float(event.ts_ns - start_ts) / 1000000000), end="")
print("%-6d %-16s %4d %3d %s" % (event.pid, event.comm,
fd_s, err, event.fname))
prev_ts = event.ts
start_ts = 1
# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
......
......@@ -26,7 +26,7 @@ parser.add_argument("-p", "--pid", type=int, default=None,
help="List USDT probes in the specified process")
parser.add_argument("-l", "--lib", default="",
help="List USDT probes in the specified library or executable")
parser.add_argument("-v", dest="verbosity", action="count",
parser.add_argument("-v", dest="verbosity", action="count", default=0,
help="Increase verbosity level (print variables, arguments, etc.)")
parser.add_argument(dest="filter", nargs="?",
help="A filter that specifies which probes/tracepoints to print")
......@@ -42,8 +42,6 @@ def print_tpoint_format(category, event):
parts = match.group(1).split()
field_name = parts[-1:][0]
field_type = " ".join(parts[:-1])
if "__data_loc" in field_type:
continue
if field_name.startswith("common_"):
continue
print(" %s %s;" % (field_type, field_name))
......@@ -68,7 +66,7 @@ def print_tracepoints():
def print_usdt_argument_details(location):
for idx in xrange(0, location.num_arguments):
arg = location.get_argument(idx)
print(" argument #%d %s" % (idx, arg))
print(" argument #%d %s" % (idx+1, arg))
def print_usdt_details(probe):
if args.verbosity > 0:
......@@ -76,7 +74,7 @@ def print_usdt_details(probe):
if args.verbosity > 1:
for idx in xrange(0, probe.num_locations):
loc = probe.get_location(idx)
print(" location #%d %s" % (idx, loc))
print(" location #%d %s" % (idx+1, loc))
print_usdt_argument_details(loc)
else:
print(" %d location(s)" % probe.num_locations)
......
......@@ -76,6 +76,7 @@ class Probe(object):
self.probe_num = Probe.probe_count
self.probe_name = "probe_%s_%d" % \
(self._display_function(), self.probe_num)
self.probe_name = re.sub(r'[^A-Za-z0-9_]', '_', self.probe_name)
def __str__(self):
return "%s:%s:%s FLT=%s ACT=%s/%s" % (self.probe_type,
......@@ -92,15 +93,24 @@ class Probe(object):
def _parse_probe(self):
text = self.raw_probe
# Everything until the first space is the probe specifier
first_space = text.find(' ')
spec = text[:first_space] if first_space >= 0 else text
# There might be a function signature preceding the actual
# filter/print part, or not. Find the probe specifier first --
# it ends with either a space or an open paren ( for the
# function signature part.
# opt. signature
# probespec | rest
# --------- ---------- --
(spec, sig, rest) = re.match(r'([^ \t\(]+)(\([^\(]*\))?(.*)',
text).groups()
self._parse_spec(spec)
if first_space >= 0:
text = text[first_space:].lstrip()
else:
text = ""
self.signature = sig[1:-1] if sig else None # remove the parens
if self.signature and self.probe_type in ['u', 't']:
self._bail("USDT and tracepoint probes can't have " +
"a function signature; use arg1, arg2, " +
"... instead")
text = rest.lstrip()
# If we now have a (, wait for the balanced closing ) and that
# will be the predicate
self.filter = None
......@@ -216,11 +226,11 @@ class Probe(object):
fname = "streq_%d" % Probe.streq_index
Probe.streq_index += 1
self.streq_functions += """
static inline bool %s(char const *ignored, unsigned long str) {
static inline bool %s(char const *ignored, uintptr_t str) {
char needle[] = %s;
char haystack[sizeof(needle)];
bpf_probe_read(&haystack, sizeof(haystack), (void *)str);
for (int i = 0; i < sizeof(needle); ++i) {
for (int i = 0; i < sizeof(needle) - 1; ++i) {
if (needle[i] != haystack[i]) {
return false;
}
......@@ -353,33 +363,35 @@ BPF_PERF_OUTPUT(%s);
def _generate_usdt_filter_read(self):
text = ""
if self.probe_type == "u":
for arg, _ in Probe.aliases.items():
if not (arg.startswith("arg") and
(arg in self.filter)):
continue
arg_index = int(arg.replace("arg", ""))
arg_ctype = self.usdt.get_probe_arg_ctype(
self.usdt_name, arg_index)
if not arg_ctype:
self._bail("Unable to determine type of {} "
"in the filter".format(arg))
text += """
if self.probe_type != "u":
return text
for arg, _ in Probe.aliases.items():
if not (arg.startswith("arg") and
(arg in self.filter)):
continue
arg_index = int(arg.replace("arg", ""))
arg_ctype = self.usdt.get_probe_arg_ctype(
self.usdt_name, arg_index - 1)
if not arg_ctype:
self._bail("Unable to determine type of {} "
"in the filter".format(arg))
text += """
{} {}_filter;
bpf_usdt_readarg({}, ctx, &{}_filter);
""".format(arg_ctype, arg, arg_index, arg)
self.filter = self.filter.replace(
arg, "{}_filter".format(arg))
""".format(arg_ctype, arg, arg_index, arg)
self.filter = self.filter.replace(
arg, "{}_filter".format(arg))
return text
def generate_program(self, include_self):
data_decl = self._generate_data_decl()
# kprobes don't have built-in pid filters, so we have to add
# it to the function body:
if len(self.library) == 0 and Probe.pid != -1:
if Probe.pid != -1:
pid_filter = """
if (__pid != %d) { return 0; }
""" % Probe.pid
# uprobes can have a built-in tgid filter passed to
# attach_uprobe, hence the check here -- for kprobes, we
# need to do the tgid test by hand:
elif len(self.library) == 0 and Probe.tgid != -1:
pid_filter = """
if (__tgid != %d) { return 0; }
......@@ -393,6 +405,8 @@ BPF_PERF_OUTPUT(%s);
prefix = ""
signature = "struct pt_regs *ctx"
if self.signature:
signature += ", " + self.signature
data_fields = ""
for i, expr in enumerate(self.values):
......@@ -469,10 +483,10 @@ BPF_PERF_OUTPUT(%s);
def _format_message(self, bpf, tgid, values):
# Replace each %K with kernel sym and %U with user sym in tgid
kernel_placeholders = [i for i in xrange(0, len(self.types))
if self.types[i] == 'K']
user_placeholders = [i for i in xrange(0, len(self.types))
if self.types[i] == 'U']
kernel_placeholders = [i for i, t in enumerate(self.types)
if t == 'K']
user_placeholders = [i for i, t in enumerate(self.types)
if t == 'U']
for kp in kernel_placeholders:
values[kp] = bpf.ksymaddr(values[kp])
for up in user_placeholders:
......@@ -541,12 +555,12 @@ BPF_PERF_OUTPUT(%s);
bpf.attach_uretprobe(name=libpath,
sym=self.function,
fn_name=self.probe_name,
pid=Probe.pid)
pid=Probe.tgid)
else:
bpf.attach_uprobe(name=libpath,
sym=self.function,
fn_name=self.probe_name,
pid=Probe.pid)
pid=Probe.tgid)
class Tool(object):
examples = """
......@@ -558,7 +572,7 @@ trace 'do_sys_open "%s", arg2'
Trace the open syscall and print the filename being opened
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
Trace the read syscall and print a message for reads >20000 bytes
trace 'r::do_sys_return "%llx", retval'
trace 'r::do_sys_open "%llx", retval'
Trace the return from the open syscall and print the return value
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
Trace the open() call from libc only if the flags (arg2) argument is 42
......@@ -574,6 +588,8 @@ trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
Trace the USDT probe pthread_create when its 4th argument is non-zero
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
Trace the nanosleep syscall and print the sleep duration in ns
"""
def __init__(self):
......@@ -608,7 +624,8 @@ trace 'u:pthread:pthread_create (arg4 != 0)'
help="probe specifier (see examples)")
parser.add_argument("-I", "--include", action="append",
metavar="header",
help="additional header files to include in the BPF program")
help="additional header files to include in the BPF program "
"as either full path, or relative to '/usr/include'")
self.args = parser.parse_args()
if self.args.tgid and self.args.pid:
parser.error("only one of -p and -t may be specified")
......@@ -628,7 +645,11 @@ trace 'u:pthread:pthread_create (arg4 != 0)'
"""
for include in (self.args.include or []):
self.program += "#include <%s>\n" % include
if include.startswith((".", "/")):
include = os.path.abspath(include)
self.program += "#include \"%s\"\n" % include
else:
self.program += "#include <%s>\n" % include
self.program += BPF.generate_auto_includes(
map(lambda p: p.raw_probe, self.probes))
for probe in self.probes:
......
......@@ -2,8 +2,8 @@ Demonstrations of trace.
trace probes functions you specify and displays trace messages if a particular
condition is met. You can control the message format to display function
arguments and return values.
For example, suppose you want to trace all commands being exec'd across the
system:
......@@ -135,6 +135,16 @@ TIME PID COMM FUNC -
In the previous invocation, arg1 and arg2 are the class name and method name
for the Ruby method being invoked.
You can also trace exported functions from shared libraries, or an imported
function on the actual executable:
# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
# tput -V
PID TID COMM FUNC -
21720 21720 tput curses_version Version=ncurses 6.0.20160709
^C
Occasionally, it can be useful to filter specific strings. For example, you
might be interested in open() calls that open a specific file:
......@@ -146,7 +156,30 @@ TIME PID COMM FUNC -
^C
As a final example, let's trace open syscalls for a specific process. By
In the preceding example, as well as in many others, readability may be
improved by providing the function's signature, which names the arguments and
lets you access structure sub-fields, which is hard with the "arg1", "arg2"
convention. For example:
# trace 'p:c:open(char *filename) "opening %s", filename'
PID TID COMM FUNC -
17507 17507 cat open opening FAQ.txt
^C
# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
PID TID COMM FUNC -
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
777 785 automount SyS_nanosleep sleep for 500000000 ns
^C
Remember to use the -I argument to include the appropriate header file. We didn't
need to do that here because `struct timespec` is used internally by the tool,
so it always includes this header file.
As a final example, let's trace open syscalls for a specific process. By
default, tracing is system-wide, but the -p switch overrides this:
# trace -p 2740 'do_sys_open "%s", arg2' -T
......@@ -196,6 +229,7 @@ optional arguments:
-U, --user-stack output user stack trace
-I header, --include header
additional header files to include in the BPF program
as either full path, or relative to '/usr/include'
EXAMPLES:
......@@ -205,7 +239,7 @@ trace 'do_sys_open "%s", arg2'
Trace the open syscall and print the filename being opened
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
Trace the read syscall and print a message for reads >20000 bytes
trace 'r::do_sys_return "%llx", retval'
trace 'r::do_sys_open "%llx", retval'
Trace the return from the open syscall and print the return value
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
Trace the open() call from libc only if the flags (arg2) argument is 42
......@@ -221,3 +255,5 @@ trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
Trace the USDT probe pthread_create when its 4th argument is non-zero
trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
Trace the nanosleep syscall and print the sleep duration in ns
......@@ -4,7 +4,7 @@
# ucalls Summarize method calls in high-level languages and/or system calls.
# For Linux, uses BCC, eBPF.
#
# USAGE: ucalls [-l {java,python,ruby}] [-h] [-T TOP] [-L] [-S] [-v] [-m]
# USAGE: ucalls [-l {java,python,ruby,php}] [-h] [-T TOP] [-L] [-S] [-v] [-m]
# pid [interval]
#
# Copyright 2016 Sasha Goldshtein
......@@ -24,7 +24,7 @@ examples = """examples:
./ucalls 6712 -S # trace only syscall counts
./ucalls -l ruby 1344 -T 10 # trace top 10 Ruby method calls
./ucalls -l ruby 1344 -L # trace Ruby calls including latency
./ucalls -l ruby 1344 -LS # trace Ruby calls and syscalls with latency
./ucalls -l php 443 -LS # trace PHP calls and syscalls with latency
./ucalls -l python 2020 -mL # trace Python calls including latency in ms
"""
parser = argparse.ArgumentParser(
......@@ -34,7 +34,8 @@ parser = argparse.ArgumentParser(
parser.add_argument("pid", type=int, help="process id to attach to")
parser.add_argument("interval", type=int, nargs='?',
help="print every specified number of seconds")
parser.add_argument("-l", "--language", choices=["java", "python", "ruby"],
parser.add_argument("-l", "--language",
choices=["java", "python", "ruby", "php"],
help="language to trace (if none, trace syscalls only)")
parser.add_argument("-T", "--top", type=int,
help="number of most frequent/slow calls to print")
......@@ -49,8 +50,8 @@ parser.add_argument("-m", "--milliseconds", action="store_true",
args = parser.parse_args()
# We assume that the entry and return probes have the same arguments. This is
# the case for Java, Python, and Ruby. If there's a language where it's not the
# case, we will need to build a custom correlator from entry to exit.
# the case for Java, Python, Ruby, and PHP. If there's a language where it's
# not the case, we will need to build a custom correlator from entry to exit.
if args.language == "java":
# TODO for JVM entries, we actually have the real length of the class
# and method strings in arg3 and arg5 respectively, so we can insert
......@@ -70,6 +71,11 @@ elif args.language == "ruby":
return_probe = "method__return"
read_class = "bpf_usdt_readarg(1, ctx, &clazz);"
read_method = "bpf_usdt_readarg(2, ctx, &method);"
elif args.language == "php":
entry_probe = "function__entry"
return_probe = "function__return"
read_class = "bpf_usdt_readarg(4, ctx, &clazz);"
read_method = "bpf_usdt_readarg(1, ctx, &method);"
elif not args.language:
if not args.syscalls:
print("Nothing to do; use -S to trace syscalls.")
......@@ -213,9 +219,9 @@ int syscall_return(struct pt_regs *ctx) {
if args.language:
usdt = USDT(pid=args.pid)
usdt.enable_probe(entry_probe, "trace_entry")
usdt.enable_probe_or_bail(entry_probe, "trace_entry")
if args.latency:
usdt.enable_probe(return_probe, "trace_return")
usdt.enable_probe_or_bail(return_probe, "trace_return")
else:
usdt = None
......@@ -236,25 +242,26 @@ if args.syscalls:
def get_data():
# Will be empty when no language was specified for tracing
if args.latency:
data = map(lambda (k, v): (k.clazz + "." + k.method,
(v.num_calls, v.total_ns)),
bpf["times"].items())
data = list(map(lambda kv: (kv[0].clazz + "." + kv[0].method,
(kv[1].num_calls, kv[1].total_ns)),
bpf["times"].items()))
else:
data = map(lambda (k, v): (k.clazz + "." + k.method, (v.value, 0)),
bpf["counts"].items())
data = list(map(lambda kv: (kv[0].clazz + "." + kv[0].method,
(kv[1].value, 0)),
bpf["counts"].items()))
if args.syscalls:
if args.latency:
syscalls = map(lambda (k, v): (bpf.ksym(k.value),
(v.num_calls, v.total_ns)),
syscalls = map(lambda kv: (bpf.ksym(kv[0].value),
(kv[1].num_calls, kv[1].total_ns)),
bpf["systimes"].items())
data.extend(syscalls)
else:
syscalls = map(lambda (k, v): (bpf.ksym(k.value), (v.value, 0)),
syscalls = map(lambda kv: (bpf.ksym(kv[0].value), (kv[1].value, 0)),
bpf["syscounts"].items())
data.extend(syscalls)
return sorted(data, key=lambda (k, v): v[1 if args.latency else 0])
return sorted(data, key=lambda kv: kv[1][1 if args.latency else 0])
def clear_data():
if args.latency:
......
......@@ -2,7 +2,7 @@ Demonstrations of ucalls.
ucalls summarizes method calls in various high-level languages, including Java,
Python, Ruby, and Linux system calls. It displays statistics on the most
Python, Ruby, PHP, and Linux system calls. It displays statistics on the most
frequently called methods, as well as the latency (duration) of these methods.
Through the syscalls support, ucalls can provide basic information on a
......@@ -60,7 +60,7 @@ METHOD # CALLS
USAGE message:
# ./ucalls.py -h
usage: ucalls.py [-h] [-l {java,python,ruby}] [-T TOP] [-L] [-S] [-v] [-m]
usage: ucalls.py [-h] [-l {java,python,ruby,php}] [-T TOP] [-L] [-S] [-v] [-m]
pid [interval]
Summarize method calls in high-level languages.
......@@ -71,7 +71,7 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
-l {java,python,ruby}, --language {java,python,ruby}
-l {java,python,ruby,php}, --language {java,python,ruby,php}
language to trace (if none, trace syscalls only)
-T TOP, --top TOP number of most frequent/slow calls to print
-L, --latency record method latency from enter to exit (except
......@@ -88,5 +88,5 @@ examples:
./ucalls 6712 -S # trace only syscall counts
./ucalls -l ruby 1344 -T 10 # trace top 10 Ruby method calls
./ucalls -l ruby 1344 -L # trace Ruby calls including latency
./ucalls -l ruby 1344 -LS # trace Ruby calls and syscalls with latency
./ucalls -l php 443 -LS # trace PHP calls and syscalls with latency
./ucalls -l python 2020 -mL # trace Python calls including latency in ms
......@@ -4,7 +4,7 @@
# uflow Trace method execution flow in high-level languages.
# For Linux, uses BCC, eBPF.
#
# USAGE: uflow [-C CLASS] [-M METHOD] [-v] {java,python,ruby} pid
# USAGE: uflow [-C CLASS] [-M METHOD] [-v] {java,python,ruby,php} pid
#
# Copyright 2016 Sasha Goldshtein
# Licensed under the Apache License, Version 2.0 (the "License")
......@@ -27,7 +27,7 @@ parser = argparse.ArgumentParser(
description="Trace method execution flow in high-level languages.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("language", choices=["java", "python", "ruby"],
parser.add_argument("language", choices=["java", "python", "ruby", "php"],
help="language to trace")
parser.add_argument("pid", type=int, help="process id to attach to")
parser.add_argument("-M", "--method",
......@@ -109,7 +109,7 @@ def enable_probe(probe_name, func_name, read_class, read_method, is_return):
.replace("FILTER_METHOD", filter_method) \
.replace("DEPTH", depth) \
.replace("UPDATE", update)
usdt.enable_probe(probe_name, func_name)
usdt.enable_probe_or_bail(probe_name, func_name)
usdt = USDT(pid=args.pid)
......@@ -140,6 +140,13 @@ elif args.language == "ruby":
enable_probe("cmethod__return", "ruby_creturn",
"bpf_usdt_readarg(1, ctx, &clazz);",
"bpf_usdt_readarg(2, ctx, &method);", is_return=True)
elif args.language == "php":
enable_probe("function__entry", "php_entry",
"bpf_usdt_readarg(4, ctx, &clazz);",
"bpf_usdt_readarg(1, ctx, &method);", is_return=False)
enable_probe("function__return", "php_return",
"bpf_usdt_readarg(4, ctx, &clazz);",
"bpf_usdt_readarg(1, ctx, &method);", is_return=True)
if args.verbose:
print(usdt.get_text())
......
......@@ -4,8 +4,8 @@ Demonstrations of uflow.
uflow traces method entry and exit events and prints a visual flow graph that
shows how methods are entered and exited, similar to a tracing debugger with
breakpoints. This can be useful for understanding program flow in high-level
languages such as Java, Python, and Ruby, which provide USDT probes for method
invocations.
languages such as Java, Python, Ruby, and PHP, which provide USDT probes for
method invocations.
For example, trace all Ruby method calls in a specific process:
......@@ -88,12 +88,13 @@ thread running on the same CPU.
USAGE message:
# ./uflow -h
usage: uflow.py [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby} pid
usage: uflow.py [-h] [-M METHOD] [-C CLAZZ] [-v] {java,python,ruby,php} pid
Trace method execution flow in high-level languages.
positional arguments:
{java,python,ruby} language to trace
{java,python,ruby,php}
language to trace
pid process id to attach to
optional arguments:
......
......@@ -4,7 +4,7 @@
# ugc Summarize garbage collection events in high-level languages.
# For Linux, uses BCC, eBPF.
#
# USAGE: ugc [-v] [-m] {java,python,ruby,node} pid
# USAGE: ugc [-v] [-m] [-M MSEC] [-F FILTER] {java,python,ruby,node} pid
#
# Copyright 2016 Sasha Goldshtein
# Licensed under the Apache License, Version 2.0 (the "License")
......@@ -20,6 +20,7 @@ import time
examples = """examples:
./ugc java 185 # trace Java GCs in process 185
./ugc ruby 1344 -m # trace Ruby GCs reporting in ms
./ugc -M 10 java 185 # trace only Java GCs longer than 10ms
"""
parser = argparse.ArgumentParser(
description="Summarize garbage collection events in high-level languages.",
......@@ -32,6 +33,10 @@ parser.add_argument("-v", "--verbose", action="store_true",
help="verbose mode: print the BPF program (for debugging purposes)")
parser.add_argument("-m", "--milliseconds", action="store_true",
help="report times in milliseconds (default is microseconds)")
parser.add_argument("-M", "--minimum", type=int, default=0,
help="display only GCs longer than this many milliseconds")
parser.add_argument("-F", "--filter", type=str,
help="display only GCs whose description contains this text")
args = parser.parse_args()
usdt = USDT(pid=args.pid)
......@@ -85,17 +90,21 @@ int trace_%s(struct pt_regs *ctx) {
return 0; // missed the entry event on this thread
}
elapsed = bpf_ktime_get_ns() - e->start_ns;
if (elapsed < %d) {
return 0;
}
event.elapsed_ns = elapsed;
%s
gcs.perf_submit(ctx, &event, sizeof(event));
return 0;
}
""" % (self.begin, self.begin_save, self.end, self.end_save)
""" % (self.begin, self.begin_save, self.end,
args.minimum * 1000000, self.end_save)
return text
def attach(self):
usdt.enable_probe(self.begin, "trace_%s" % self.begin)
usdt.enable_probe(self.end, "trace_%s" % self.end)
usdt.enable_probe_or_bail(self.begin, "trace_%s" % self.begin)
usdt.enable_probe_or_bail(self.end, "trace_%s" % self.end)
def format(self, data):
return self.formatter(data)
......@@ -187,7 +196,7 @@ bpf = BPF(text=program, usdt_contexts=[usdt])
print("Tracing garbage collections in %s process %d... Ctrl-C to quit." %
(args.language, args.pid))
time_col = "TIME (ms)" if args.milliseconds else "TIME (us)"
print("%-8s %-40s %-8s" % ("START", "DESCRIPTION", time_col))
print("%-8s %-8s %-40s" % ("START", time_col, "DESCRIPTION"))
class GCEvent(ct.Structure):
_fields_ = [
......@@ -207,9 +216,10 @@ def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(GCEvent)).contents
elapsed = event.elapsed_ns/1000000 if args.milliseconds else \
event.elapsed_ns/1000
print("%-8.3f %-40s %-8.2f" % (time.time() - start_ts,
probes[event.probe_index].format(event),
elapsed))
description = probes[event.probe_index].format(event)
if args.filter and not args.filter in description:
return
print("%-8.3f %-8.2f %s" % (time.time() - start_ts, elapsed, description))
bpf["gcs"].open_perf_buffer(print_event)
while 1:
......
......@@ -8,45 +8,68 @@ the GC event is also provided.
For example, to trace all garbage collection events in a specific Node process:
# ./ugc node $(pidof node)
Tracing garbage collections in node process 3018... Ctrl-C to quit.
START DESCRIPTION TIME (us)
3.864 GC mark-sweep-compact 3189.00
4.937 GC scavenge 1254.00
4.940 GC scavenge 1657.00
4.943 GC scavenge 1171.00
4.949 GC scavenge 2216.00
4.954 GC scavenge 2515.00
4.960 GC scavenge 2243.00
4.966 GC scavenge 2410.00
4.976 GC scavenge 3003.00
4.986 GC scavenge 4174.00
4.994 GC scavenge 1508.00
5.003 GC scavenge 1966.00
5.010 GC scavenge 1636.00
5.022 GC scavenge 3564.00
5.035 GC scavenge 3275.00
5.045 GC incremental mark 157.00
5.049 GC mark-sweep-compact 3248.00
5.060 GC scavenge 4785.00
5.081 GC scavenge 6616.00
5.094 GC scavenge 8570.00
5.144 GC scavenge 456.00
7.188 GC scavenge 2345.00
7.227 GC scavenge 12054.00
7.253 GC scavenge 15626.00
7.304 GC scavenge 15329.00
7.384 GC scavenge 7168.00
7.411 GC scavenge 3794.00
7.414 GC incremental mark 123.00
7.430 GC mark-sweep-compact 7110.00
# ugc node $(pidof node)
Tracing garbage collections in node process 30012... Ctrl-C to quit.
START TIME (us) DESCRIPTION
1.500 1181.00 GC scavenge
1.505 1704.00 GC scavenge
1.509 1534.00 GC scavenge
1.515 1953.00 GC scavenge
1.519 2155.00 GC scavenge
1.525 2055.00 GC scavenge
1.530 2164.00 GC scavenge
1.536 2170.00 GC scavenge
1.541 2237.00 GC scavenge
1.547 1982.00 GC scavenge
1.551 2333.00 GC scavenge
1.557 2043.00 GC scavenge
1.561 2028.00 GC scavenge
1.573 3650.00 GC scavenge
1.580 4443.00 GC scavenge
1.604 6236.00 GC scavenge
1.615 8324.00 GC scavenge
1.659 11249.00 GC scavenge
1.678 16084.00 GC scavenge
1.747 15250.00 GC scavenge
1.937 191.00 GC incremental mark
2.001 63120.00 GC mark-sweep-compact
3.185 153.00 GC incremental mark
3.207 20847.00 GC mark-sweep-compact
^C
The above output shows some fairly long GCs, notably around 2 seconds in there
is a collection that takes over 60ms (mark-sweep-compact).
Occasionally, it might be useful to filter out collections that are very short,
or display only collections that have a specific description. The -M and -F
switches can be useful for this:
# ugc -F Tenured java $(pidof java)
Tracing garbage collections in java process 29907... Ctrl-C to quit.
START TIME (us) DESCRIPTION
0.360 4309.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
2.459 4232.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
4.648 4139.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
^C
# ugc -M 1 java $(pidof java)
Tracing garbage collections in java process 29907... Ctrl-C to quit.
START TIME (us) DESCRIPTION
0.160 3715.00 MarkSweepCompact Code Cache used=287528->3209472 max=173408256->251658240
0.160 3975.00 MarkSweepCompact Metaspace used=287528->3092104 max=173408256->18446744073709551615
0.160 4058.00 MarkSweepCompact Compressed Class Space used=287528->266840 max=173408256->1073741824
0.160 4110.00 MarkSweepCompact Eden Space used=287528->0 max=173408256->69337088
0.160 4159.00 MarkSweepCompact Survivor Space used=287528->0 max=173408256->8650752
0.160 4207.00 MarkSweepCompact Tenured Gen used=287528->287528 max=173408256->173408256
0.160 4289.00 used=0->0 max=0->0
^C
USAGE message:
# ./ugc -h
usage: ugc.py [-h] [-v] [-m] {java,python,ruby,node} pid
# ugc -h
usage: ugc.py [-h] [-v] [-m] [-M MINIMUM] [-F FILTER]
{java,python,ruby,node} pid
Summarize garbage collection events in high-level languages.
......@@ -60,7 +83,12 @@ optional arguments:
-v, --verbose verbose mode: print the BPF program (for debugging
purposes)
-m, --milliseconds report times in milliseconds (default is microseconds)
-M MINIMUM, --minimum MINIMUM
display only GCs longer than this many milliseconds
-F FILTER, --filter FILTER
display only GCs whose description contains this text
examples:
./ugc java 185 # trace Java GCs in process 185
./ugc ruby 1344 -m # trace Ruby GCs reporting in ms
./ugc -M 10 java 185 # trace only Java GCs longer than 10ms
......@@ -78,7 +78,7 @@ int alloc_entry(struct pt_regs *ctx) {
return 0;
}
"""
usdt.enable_probe("object__alloc", "alloc_entry")
usdt.enable_probe_or_bail("object__alloc", "alloc_entry")
#
# Ruby
#
......@@ -107,10 +107,10 @@ int object_alloc_entry(struct pt_regs *ctx) {
return 0;
}
"""
usdt.enable_probe("object__create", "object_alloc_entry")
usdt.enable_probe_or_bail("object__create", "object_alloc_entry")
for thing in ["string", "hash", "array"]:
program += create_template.replace("THETHING", thing)
usdt.enable_probe("%s__create" % thing, "%s_alloc_entry" % thing)
usdt.enable_probe_or_bail("%s__create" % thing, "%s_alloc_entry" % thing)
#
# C
#
......@@ -147,13 +147,13 @@ while True:
print()
data = bpf["allocs"]
if args.top_count:
data = sorted(data.items(), key=lambda (k, v): v.num_allocs)
data = sorted(data.items(), key=lambda kv: kv[1].num_allocs)
data = data[-args.top_count:]
elif args.top_size:
data = sorted(data.items(), key=lambda (k, v): v.total_size)
data = sorted(data.items(), key=lambda kv: kv[1].total_size)
data = data[-args.top_size:]
else:
data = sorted(data.items(), key=lambda (k, v): v.total_size)
data = sorted(data.items(), key=lambda kv: kv[1].total_size)
print("%-30s %8s %12s" % ("TYPE", "# ALLOCS", "# BYTES"))
for key, value in data:
if args.language == "c":
......
......@@ -5,7 +5,7 @@
# method calls, class loads, garbage collections, and more.
# For Linux, uses BCC, eBPF.
#
# USAGE: ustat [-l {java,python,ruby,node}] [-C]
# USAGE: ustat [-l {java,python,ruby,node,php}] [-C]
# [-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d]
# [interval [count]]
#
......@@ -132,7 +132,7 @@ class Tool(object):
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-l", "--language",
choices=["java", "python", "ruby", "node"],
choices=["java", "python", "ruby", "node", "php"],
help="language to trace (default: all languages)")
parser.add_argument("-C", "--noclear", action="store_true",
help="don't clear the screen")
......@@ -158,6 +158,11 @@ class Tool(object):
"function__entry": Category.METHOD,
"gc__start": Category.GC
}),
"php": Probe("php", ["php"], {
"function__entry": Category.METHOD,
"compile__file__entry": Category.CLOAD,
"exception__thrown": Category.EXCP
}),
"ruby": Probe("ruby", ["ruby", "irb"], {
"method__entry": Category.METHOD,
"cmethod__entry": Category.METHOD,
......@@ -239,10 +244,10 @@ class Tool(object):
counts.update(probe.get_counts(self.bpf))
targets.update(probe.targets)
if self.args.sort:
counts = sorted(counts.items(), key=lambda (_, v):
-v.get(self.args.sort.upper(), 0))
counts = sorted(counts.items(), key=lambda kv:
-kv[1].get(self.args.sort.upper(), 0))
else:
counts = sorted(counts.items(), key=lambda (k, _): k)
counts = sorted(counts.items(), key=lambda kv: kv[0])
for pid, stats in counts:
print("%-6d %-20s %-10d %-6d %-10d %-8d %-6d %-6d" % (
pid, targets[pid][:20],
......
......@@ -4,7 +4,7 @@ Demonstrations of ustat.
ustat is a "top"-like tool for monitoring events in high-level languages. It
prints statistics about garbage collections, method calls, object allocations,
and various other events for every process that it recognizes with a Java,
Python, Ruby, or Node runtime.
Python, Ruby, Node, or PHP runtime.
For example:
......@@ -48,7 +48,7 @@ PID CMDLINE METHOD/s GC/s OBJNEW/s CLOAD/s EXC/s THR/s
USAGE message:
# ./ustat.py -h
usage: ustat.py [-h] [-l {java,python,ruby,node}] [-C]
usage: ustat.py [-h] [-l {java,python,ruby,node,php}] [-C]
[-S {cload,excp,gc,method,objnew,thread}] [-r MAXROWS] [-d]
[interval] [count]
......@@ -60,7 +60,7 @@ positional arguments:
optional arguments:
-h, --help show this help message and exit
-l {java,python,ruby,node}, --language {java,python,ruby,node}
-l {java,python,ruby,node,php}, --language {java,python,ruby,node,php}
language to trace (default: all languages)
-C, --noclear don't clear the screen
-S {cload,excp,gc,method,objnew,thread}, --sort {cload,excp,gc,method,objnew,thread}
......
......@@ -57,7 +57,7 @@ int trace_pthread(struct pt_regs *ctx) {
return 0;
}
"""
usdt.enable_probe("pthread_start", "trace_pthread")
usdt.enable_probe_or_bail("pthread_start", "trace_pthread")
if args.language == "java":
template = """
......@@ -78,8 +78,8 @@ int %s(struct pt_regs *ctx) {
"""
program += template % ("trace_start", "start")
program += template % ("trace_stop", "stop")
usdt.enable_probe("thread__start", "trace_start")
usdt.enable_probe("thread__stop", "trace_stop")
usdt.enable_probe_or_bail("thread__start", "trace_start")
usdt.enable_probe_or_bail("thread__stop", "trace_stop")
if args.verbose:
print(usdt.get_text())
......