Commit 239e8637 authored by Brendan Gregg's avatar Brendan Gregg

tutorials: end-user, and python developer

parent 170c40f9
......@@ -2,17 +2,19 @@
# BPF Compiler Collection (BCC)
BCC is a toolkit for creating efficient kernel tracing and manipulation
programs, and includes several useful tools and examples. It makes use of eBPF
(Extended Berkeley Packet Filters), a new feature that was first added to
Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.
programs, and includes several useful tools and examples. It makes use of
extended BPF (Berkeley Packet Filters), formally known as eBPF, a new feature
that was first added to Linux 3.15. Much of what BCC uses requires Linux 4.1
and above.
eBPF was [described by](https://lkml.org/lkml/2015/4/14/232) Ingo Molnár as:
> One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively.
BCC makes eBPF programs easier to write, with kernel instrumentation in C
and a front-end in Python. It is suited for many tasks, including performance
analysis and network traffic control.
BCC makes BPF programs easier to write, with kernel instrumentation in C
(and includes a C wrapper around LLVM), and front-ends in Python and lua.
It is suited for many tasks, including performance analysis and network
traffic control.
## Screenshot
......@@ -170,46 +172,10 @@ The features of this toolkit include:
In the future, more bindings besides python will likely be supported. Feel free
to add support for the language of your choice and send a pull request!
## Tutorial
## Tutorials
The BCC toolchain is currently composed of two parts: a C wrapper around LLVM,
and a Python API to interact with the running program. Later, we will go into
more detail of how this all works.
### Hello, World
First, we should include the BPF class from the bpf module:
```python
from bcc import BPF
```
Since the C code is so short, we will embed it inside the python script.
The BPF program always takes at least one argument, which is a pointer to the
context for this type of program. Different program types have different calling
conventions, but for this one we don't care so `void *` is fine.
```python
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()
```
For this example, we will call the program every time `fork()` is called by a
userspace process. Underneath the hood, fork translates to the `clone` syscall.
BCC recognizes prefix `kprobe__`, and will auto attach our program to the kernel symbol `sys_clone`.
The python process will then print the trace printk circular buffer until ctrl-c
is pressed. The BPF program is removed from the kernel when the userspace
process that loaded it closes the fd (or exits).
Output:
```
bcc/examples$ sudo python hello_world.py
python-7282 [002] d... 3757.488508: : Hello, World!
```
For an explanation of the meaning of the printed fields, see the trace_pipe
section of the [kernel ftrace doc](https://www.kernel.org/doc/Documentation/trace/ftrace.txt).
[Source code listing](examples/hello_world.py)
- [docs/tutorial.md](docs/tutorial.md): Using bcc tools to solve performance, troubleshooting, and networking issues.
- [docs/tutorial_bcc_python_developer.md](docs/tutorial_bcc_python_developer.md): Developing new bcc programs using the Python interface.
### Networking
......@@ -222,75 +188,6 @@ multiple granularities. See the code [here](examples/networking/tunnel_monitor).
[![Screenshot](http://img.youtube.com/vi/yYy3Cwce02k/0.jpg)](https://youtu.be/yYy3Cwce02k)
### Tracing
Here is a slightly more complex tracing example than Hello World. This program
will be invoked for every task change in the kernel, and record in a BPF map
the new and old pids.
The C program below introduces two new concepts.
The first is the macro `BPF_TABLE`. This defines a table (type="hash"), with key
type `key_t` and leaf type `u64` (a single counter). The table name is `stats`,
containing 1024 entries maximum. One can `lookup`, `lookup_or_init`, `update`,
and `delete` entries from the table.
The second concept is the prev argument. This argument is treated specially by
the BCC frontend, such that accesses to this variable are read from the saved
context that is passed by the kprobe infrastructure. The prototype of the args
starting from position 1 should match the prototype of the kernel function being
kprobed. If done so, the program will have seamless access to the function
parameters.
```c
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
struct key_t {
u32 prev_pid;
u32 curr_pid;
};
// map_type, key_type, leaf_type, table_name, num_entry
BPF_TABLE("hash", struct key_t, u64, stats, 1024);
// attach to finish_task_switch in kernel/sched/core.c, which has the following
// prototype:
// struct rq *finish_task_switch(struct task_struct *prev)
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
struct key_t key = {};
u64 zero = 0, *val;
key.curr_pid = bpf_get_current_pid_tgid();
key.prev_pid = prev->pid;
val = stats.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
```
[Source code listing](examples/tracing/task_switch.c)
The userspace component loads the file shown above, and attaches it to the
`finish_task_switch` kernel function.
The [] operator of the BPF object gives access to each BPF_TABLE in the
program, allowing pass-through access to the values residing in the kernel. Use
the object as you would any other python dict object: read, update, and deletes
are all allowed.
```python
from bcc import BPF
from time import sleep
b = BPF(src_file="task_switch.c")
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")
# generate many schedule events
for i in range(0, 100): sleep(0.01)
for k, v in b["stats"].items():
print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))
```
[Source code listing](examples/tracing/task_switch.py)
## Getting started
See [INSTALL.md](INSTALL.md) for installation steps on your platform.
## Contributing
Already pumped up to commit some code? Here are some resources to join the
......
# bcc Tutorial
This tutorial covers how to use [bcc](https://github.com/iovisor/bcc) tools to quickly solve performance, troubleshooting, and networking issues. If you want to develop new bcc tools, see [tutorial_bcc_python_developer.md](tutorial_bcc_python_developer.md) for that tutorial.
It is assumed for this tutorial that bcc is already installed, and you can run tools like execsnoop successfully. See [INSTALL.md](../INSTALL.md). This uses enhancements added to the Linux 4.x series.
## Observability
Some quick wins.
### 0. Before bcc
Before using bcc, you should start with the Linux basics. One reference is the [Linux Performance Analysis in 60s](http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html) post, which covers these commands:
1. uptime
1. dmesg | tail
1. vmstat 1
1. mpstat -P ALL 1
1. pidstat 1
1. iostat -xz 1
1. free -m
1. sar -n DEV 1
1. sar -n TCP,ETCP 1
1. top
### 1. General Performance
Here is a generic checklist for performance investigations with bcc, first as a list, then in detail:
1. execsnoop
1. opensnoop
1. ext4slower (or btrfs\*, xfs\*, zfs\*)
1. biolatency
1. biosnoop
1. cachestat
1. tcpconnect
1. tcpaccept
1. tcpretrans
1. runqlat
1. profile
These tools may be installed on your system under /usr/share/bcc/tools, or you can run them from the bcc github repo under /tools where they have a .py extension. Browse the 50+ tools available for more analysis options.
#### 1. execsnoop
```
# ./execsnoop
PCOMM PID RET ARGS
supervise 9660 0 ./run
supervise 9661 0 ./run
mkdir 9662 0 /bin/mkdir -p ./main
run 9663 0 ./run
[...]
```
execsnoop prints one line of output for each new process. Check for short-lived processes. These can consume CPU resources, but not show up in most monitoring tools that periodically take snapshots of which processes are running.
It works by tracing exec(), not the fork(), so it will catch many types of new processes but not all (eg, it won't see an application launching working processes, that doesn't exec() anything else).
More [examples](../tools/execsnoop_example.txt).
#### 2. opensnoop
```
# ./opensnoop
PID COMM FD ERR PATH
1565 redis-server 5 0 /proc/1565/stat
1565 redis-server 5 0 /proc/1565/stat
1565 redis-server 5 0 /proc/1565/stat
1603 snmpd 9 0 /proc/net/dev
1603 snmpd 11 0 /proc/net/if_inet6
1603 snmpd -1 2 /sys/class/net/eth0/device/vendor
1603 snmpd 11 0 /proc/sys/net/ipv4/neigh/eth0/retrans_time_ms
1603 snmpd 11 0 /proc/sys/net/ipv6/neigh/eth0/retrans_time_ms
1603 snmpd 11 0 /proc/sys/net/ipv6/conf/eth0/forwarding
[...]
```
opensnoop prints one line of output for each open() syscall, including details.
Files that are opened can tell you a lot about how applications work: identifying their data files, config files, and log files. Sometimes applications can misbehave, and perform poorly, when they are constantly attempting to read files that do not exist. opensnoop gives you a quick look.
More [examples](../tools/opensnoop_example.txt).
#### 3. ext4slower (or btrfs\*, xfs\*, zfs\*)
```
# ./ext4slower
Tracing ext4 operations slower than 10 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:35:01 cron 16464 R 1249 0 16.05 common-auth
06:35:01 cron 16463 R 1249 0 16.04 common-auth
06:35:01 cron 16465 R 1249 0 16.03 common-auth
06:35:01 cron 16465 R 4096 0 10.62 login.defs
06:35:01 cron 16464 R 4096 0 10.61 login.defs
```
ext4slower traces the ext4 file system and times common operations, and then only prints those that exceed a threshold.
This is great for identifying or exonerating one type of performance issue: slow individually slow disk i/O via the file system. Disks process I/O asynchronously, and it can be difficult to associate latency at that layer with the latency applications experience. Tracing higher up in the kernel stack, at the VFS -> file system interface, will more closely match what an application suffers. Use this tool to identify if file system latency exceeds a given threshold.
Similar tools exist in bcc for other file systems: btrfsslower, xfsslower, and zfsslower. There is also fileslower, which works at the VFS layer and traces everything (although at some higher overhead).
More [examples](../tools/ext4slower_example.txt).
#### 4. biolatency
```
# ./biolatency
Tracing block device I/O... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 12 |******** |
256 -> 511 : 15 |********** |
512 -> 1023 : 43 |******************************* |
1024 -> 2047 : 52 |**************************************|
2048 -> 4095 : 47 |********************************** |
4096 -> 8191 : 52 |**************************************|
8192 -> 16383 : 36 |************************** |
16384 -> 32767 : 15 |********** |
32768 -> 65535 : 2 |* |
65536 -> 131071 : 2 |* |
```
biolatency traces disk I/O latency (time from device issue to completion), and when the tool ends (Ctrl-C, or a given interval), it prints a histogram summary of the latency.
This is great for understanding disk I/O latency beyond the average times given by tools like iostat. I/O latency outliers will be visible at the end of the distribution, as well as multi-mode distributions.
More [examples](../tools/biolatency_example.txt).
#### 5. biosnoop
```
# ./biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000004001 supervise 1950 xvda1 W 13092560 4096 0.74
0.000178002 supervise 1950 xvda1 W 13092432 4096 0.61
0.001469001 supervise 1956 xvda1 W 13092440 4096 1.24
0.001588002 supervise 1956 xvda1 W 13115128 4096 1.09
1.022346001 supervise 1950 xvda1 W 13115272 4096 0.98
1.022568002 supervise 1950 xvda1 W 13188496 4096 0.93
[...]
```
biosnoop prints a line of output for each disk I/O, with details including latency (time from device issue to completion).
This allows you to examine disk I/O in more detail, and look for time-ordered patterns (eg, reads queueing behind writes). Note that the output will be verbose if your system performance a high rate of disk I/O.
More [examples](../tools/biosnoop_example.txt).
#### 6. cachestat
```
# ./cachestat
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
1074 44 13 94.9% 2.9% 1 223
2195 170 8 92.5% 6.8% 1 143
182 53 56 53.6% 1.3% 1 143
62480 40960 20480 40.6% 19.8% 1 223
7 2 5 22.2% 22.2% 1 223
348 0 0 100.0% 0.0% 1 223
[...]
```
cachestat prints a one line summary every second (or every custom interval) showing statistics from the file system cache.
Use this to identify a low cache hit ratio, and a high rate of misses: which gives one lead for performance tuning.
More [examples](../tools/cachestat_example.txt).
#### 7. tcpconnect
```
# ./tcpconnect
PID COMM IP SADDR DADDR DPORT
1479 telnet 4 127.0.0.1 127.0.0.1 23
1469 curl 4 10.201.219.236 54.245.105.25 80
1469 curl 4 10.201.219.236 54.67.101.145 80
1991 telnet 6 ::1 ::1 23
2015 ssh 6 fe80::2000:bff:fe82:3ac fe80::2000:bff:fe82:3ac 22
[...]
```
tcpconnect prints one line of output for every active TCP connection (eg, via connect()), with details including source and destination addresses.
Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.
More [examples](../tools/tcpconnect_example.txt).
#### 8. tcpaccept
```
# ./tcpaccept
PID COMM IP RADDR LADDR LPORT
907 sshd 4 192.168.56.1 192.168.56.102 22
907 sshd 4 127.0.0.1 127.0.0.1 22
5389 perl 6 1234:ab12:2040:5020:2299:0:5:0 1234:ab12:2040:5020:2299:0:5:0 7001
[...]
```
tcpaccept prints one line of output for every passive TCP connection (eg, via accept()), with details including source and destination addresses.
Look for unexpected connections that may point to inefficiencies in application configuration, or an intruder.
More [examples](../tools/tcpaccept_example.txt).
#### 9. tcpretrans
```
# ./tcpretrans
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:17 0 4 10.153.223.157:22 R> 69.53.245.40:22957 ESTABLISHED
[...]
```
tcprerans prints one line of output for every TCP retransmit packet, with details including source and destination addresses, and kernel state of the TCP connection.
TCP retransmissions cause latency and throughput issues. For ESTABLESHID retransmits, look for patterns with networks. For SYN_SENT, this may point to target kernel CPU saturation and kernel packet drops.
More [examples](../tools/tcpretrans_example.txt).
#### 10. runqlat
```
# ./runqlat
Tracing run queue latency... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 233 |*********** |
2 -> 3 : 742 |************************************ |
4 -> 7 : 203 |********** |
8 -> 15 : 173 |******** |
16 -> 31 : 24 |* |
32 -> 63 : 0 | |
64 -> 127 : 30 |* |
128 -> 255 : 6 | |
256 -> 511 : 3 | |
512 -> 1023 : 5 | |
1024 -> 2047 : 27 |* |
2048 -> 4095 : 30 |* |
4096 -> 8191 : 20 | |
8192 -> 16383 : 29 |* |
16384 -> 32767 : 809 |****************************************|
32768 -> 65535 : 64 |*** |
```
runqlat times how long threads were waiting on the CPU run queues, and prints this as a histogram.
This can help quantify time lost waiting for a turn on CPU, during periods of CPU saturation.
More [examples](../tools/runqlat_example.txt).
#### 11. profile
```
# ./profile
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
^C
00007f31d76c3251 [unknown]
47a2c1e752bf47f7 [unknown]
- sign-file (8877)
1
ffffffff813d0af8 __clear_user
ffffffff813d5277 iov_iter_zero
ffffffff814ec5f2 read_iter_zero
ffffffff8120be9d __vfs_read
ffffffff8120c385 vfs_read
ffffffff8120d786 sys_read
ffffffff817cc076 entry_SYSCALL_64_fastpath
00007fc5652ad9b0 read
- dd (25036)
4
0000000000400542 func_a
0000000000400598 main
00007f12a133e830 __libc_start_main
083e258d4c544155 [unknown]
- func_ab (13549)
5
[...]
ffffffff8105eb66 native_safe_halt
ffffffff8103659e default_idle
ffffffff81036d1f arch_cpu_idle
ffffffff810bba5a default_idle_call
ffffffff810bbd07 cpu_startup_entry
ffffffff8104df55 start_secondary
- swapper/1 (0)
75
```
profile is a CPU profiler, which takes samples of stack traces at timed intervals, and prints a summary of unique stack traces and a count of their occurrence.
Use this tool to understand the code paths that are consuming CPU resources.
More [examples](../tools/profile_example.txt).
## Networking
To do.
# bcc Python Developer Tutorial
This tutorial is about developing [bcc](https://github.com/iovisor/bcc) tools and programs using the Python interface. There are two parts: observability then networking. Snippits are taken from various programs in bcc: see their files for licences.
There is also a lua interface for bcc, and a tutorial for end-users of tools: [tutorial.md](tutorial.md).
## Observability
This observability tutorial contains 16 lessons, and 44 enumerated things to learn.
### Lesson 1. Hello World
Start by running [examples/hello_world.py](../examples/hello_world.py), while running some commands (eg, "ls") in another session. It should print "Hello, World!" for new processes. If not, start by fixing bcc: see [INSTALL.md](../INSTALL.md).
```
# ./examples/hello_world.py
bash-13364 [002] d... 24573433.052937: : Hello, World!
bash-13364 [003] d... 24573436.642808: : Hello, World!
[...]
```
Here's the code for hello_world.py:
```Python
from bcc import BPF
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); }').trace_print()
```
There are six things to learn from this:
1. ```text='...'```: This defines a BPF program inline. The program is written in C.
1. ```kprobe__sys_clone()```: This is a short-cut for kernel dynamic tracing via kprobes. If the C function begins with ``kprobe__``, the rest is treated as a kernel function name to instrument, in this case, ```sys_clone()```.
1. ```void *ctx```: ctx has arguments, but since we aren't using them here, we'll just cast it to ```void *```.
1. ```bpf_trace_printk()```: A simple kernel facility for printf() to the common trace_pipe (/sys/kernel/debug/tracing/trace_pipe). This is ok for some quick examples, but has limitations: 3 args max, 1 %s only, and trace_pipe is globally shared, so concurrent programs will have clashing output. A better interface is via BPF_PERF_OUTPUT(), covered later.
1. ```return 0;```: Necessary formality (if you want to know why, see #139).
1. ```.trace_print()```: A bcc routine that reads trace_pipe and prints the output.
### Lesson 2. sys_sync()
Write a program that traces the sys_sync() kernel function. Print "sys_sync() called" when it runs. Test by running ```sync``` in another session while tracing. The hello_world.py program has everything you need for this.
Improve it by printing "Tracing sys_sync()... Ctrl-C to end." when the program first starts. Hint: it's just Python.
### Lesson 3. hello_fields.py
This program is in [examples/tracing/hello_fields.py](../examples/tracing/trace_fields.py). Sample output (run commands in another session):
```
# ./examples/tracing/hello_fields.py
TIME(s) COMM PID MESSAGE
24585001.174885999 sshd 1432 Hello, World!
24585001.195710000 sshd 15780 Hello, World!
24585001.991976000 systemd-udevd 484 Hello, World!
24585002.276147000 bash 15787 Hello, World!
```
Code:
```Python
from bcc import BPF
# define BPF program
prog = """
int hello(void *ctx) {
bpf_trace_printk("Hello, World!\\n");
return 0;
}
"""
# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event="sys_clone", fn_name="hello")
# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))
# format output
while 1:
try:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
except ValueError:
continue
print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
```
This is simalar to hello_world.py, and traces new processes via sys_clone() again, but has a few more things to learn:
1. ```prog =```: This time we declare the C program as a variable, and later refer to it. This is useful if you want to add some string substitutions based on command line arguments.
1. ```hello()```: Now we're just declaring a C function, instead of the ```kprobe__``` shortcut. We'll refer to this later.
1. ```b.attach_kprobe(event="sys_clone", fn_name="hello")```: Creates a kprobe for the sys_clone() kernel function, which will execute our defined hello() function. You can call attach_kprobe() more than once, and attach your C function to multiple kernel functions.
1. ```b.trace_fields()```: Returns a fixed set of fields from trace_pipe. Simalar to trace_print(), this is handy for hacking, but for real tooling we should switch to BPF_PERF_OUTPUT().
### Lesson 4. sync_timing.py
Remember the days of sysadmins typing ```sync``` three times on a slow console before ```reboot```, to give the first asynchrosous sync time to complete? Then someone thought ```sync;sync;sync``` was clever, to run them all on one line, which became industry practice despite defeating the original purpose! And then sync became synchronous, so more reasons it was silly. Anyway.
The following example times how quickly the ```do_sync``` function is called, and prints output if it has been called more recently than one second ago. A ```sync;sync;sync``` will print output for the 2nd and 3rd sync's:
```
# ./examples/tracing/sync_timing.py
Tracing for quick sync's... Ctrl-C to end
At time 0.00 s: multiple syncs detected, last 95 ms ago
At time 0.10 s: multiple syncs detected, last 96 ms ago
```
This program is [examples/tracing/sync_timing.py](../examples/tracing/sync_timing.py):
```Python
from bcc import BPF
# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
BPF_HASH(last);
void do_trace(struct pt_regs *ctx) {
u64 ts, *tsp, delta, key = 0;
// attempt to read stored timestamp
tsp = last.lookup(&key);
if (tsp != 0) {
delta = bpf_ktime_get_ns() - *tsp;
if (delta < 1000000000) {
// output if time is less than 1 second
bpf_trace_printk("%d\\n", delta / 1000000);
}
last.delete(&key);
}
// update stored timestamp
ts = bpf_ktime_get_ns();
last.update(&key, &ts);
}
""")
b.attach_kprobe(event="sys_sync", fn_name="do_trace")
print("Tracing for quick sync's... Ctrl-C to end")
# format output
start = 0
while 1:
(task, pid, cpu, flags, ts, ms) = b.trace_fields()
if start == 0:
start = ts
ts = ts - start
print("At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, ms))
```
Things to learn:
1. ```bpf_ktime_get_ns()```: Returns the time as nanoseconds.
1. ```BPF_HASH(last)```: Creates a BPF map object that is a hash (associative array), called "last". We didn't specify any further arguments, so it defaults to key and value types of u64.
1. ```key = 0```: We'll only store one key/value pair in this hash, where the key is hardwired to zero.
1. ```last.lookup(&key)```: Lookup the key in the hash, and return a pointer to its value if it exists, else NULL. We pass the key in as an address to a pointer.
1. ```last.delete(&key)```: Delete the key from the hash.
1. ```last.update(&key)```: Set the key to equal the value in the 2nd argument. This records the timestamp.
### Lesson 5. sync_count.py
Modify the sync_timing.py program (prior lession) to store the count of all sys_sync() calls (both fast and slow), and print it with the output. This count can be recorded in the BPF program by adding a new key index to the existing hash.
### Lesson 6. disksnoop.py
Browse the [examples/tracing/disksnoop.py](../examples/tracing/disksnoop.py) program to see what is new. Here is some sample output:
```
# ./disksnoop.py
TIME(s) T BYTES LAT(ms)
16458043.436012 W 4096 3.13
16458043.437326 W 4096 4.44
16458044.126545 R 4096 42.82
16458044.129872 R 4096 3.24
[...]
```
And a code snippit:
```Python
[...]
REQ_WRITE = 1 # from include/linux/blk_types.h
# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
BPF_HASH(start, struct request *);
void trace_start(struct pt_regs *ctx, struct request *req) {
// stash start timestamp by request ptr
u64 ts = bpf_ktime_get_ns();
start.update(&req, &ts);
}
void trace_completion(struct pt_regs *ctx, struct request *req) {
u64 *tsp, delta;
tsp = start.lookup(&req);
if (tsp != 0) {
delta = bpf_ktime_get_ns() - *tsp;
bpf_trace_printk("%d %x %d\\n", req->__data_len,
req->cmd_flags, delta / 1000);
start.delete(&req);
}
}
""")
b.attach_kprobe(event="blk_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_completion", fn_name="trace_completion")
[...]
```
Things to learn:
1. ```REQ_WRITE```: We're defining a kernel constant in the Python program because we'll use it there later. If we were using REQ_WRITE in the BPF program, it should just work (without needing to be defined) with the appropriate #includes.
1. ```trace_start(struct pt_regs *ctx, struct request *req)```: This function will later be attached to kprobes. The arguments to kprobe functions are ```struct pt_regs *ctx```, for registers and BPF context, and then the actual arguments to the function. We'll attach this to blk_start_request(), where the first argument is ```struct request *```.
1. ```start.update(&req, &ts)```: We're using the pointer to the request struct as a key in our hash. What? This is commonplace in tracing. Pointers to structs turn out to be great keys, as they are unique: two structs can't have the same pointer address. (Just be careful about when it gets free'd and reused.) So what we're really doing is tagging the request struct, which describes the disk I/O, with our own timestamp, so that we can time it. There's two common keys used for storing timestamps: pointers to structs, and, thread IDs (for timing function entry to return).
1. ```req->__data_len```: We're dereferencing members of ```struct request```. See its definition in the kernel source for what members are there. bcc actually rewrites these expressions to be a series of ```bpf_probe_read()``` calls. Sometimes bcc can't handle a complex dereference, and you need to call ```bpf_probe_read()``` directly.
This is a pretty interesting program, and if you can understand all the code, you'll understand many important basics. We're still using the bpf_trace_printk() hack, so let's fix that next.
### Lesson 7. hello_perf_output.py
Let's finally stop using bpf_trace_printk() and use the proper BPF_PERF_OUTPUT() interface. This will also mean we stop getting the free trace_field() members like PID and timestamp, and will need to fetch them directly. Sample output while commands are run in another session:
```
# ./hello_perf_output.py
TIME(s) COMM PID MESSAGE
0.000000000 bash 22986 Hello, perf_output!
0.021080275 systemd-udevd 484 Hello, perf_output!
0.021359520 systemd-udevd 484 Hello, perf_output!
0.021590610 systemd-udevd 484 Hello, perf_output!
[...]
```
Code is [examples/tracing/hello_perf_output.py](../examples/tracing/hello_perf_output.py):
```Python
from bcc import BPF
import ctypes as ct
# define BPF program
prog = """
#include <linux/sched.h>
// define output data structure in C
#define MSG_MAX 32
struct data_t {
u32 pid;
u64 ts;
char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);
int hello(struct pt_regs *ctx) {
struct data_t data = {};
data.pid = bpf_get_current_pid_tgid();
data.ts = bpf_ktime_get_ns();
bpf_get_current_comm(&data.comm, sizeof(data.comm));
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
"""
# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event="sys_clone", fn_name="hello")
# define output data structure in Python
TASK_COMM_LEN = 16 # linux/sched.h
MSG_MAX = 32
class Data(ct.Structure):
_fields_ = [("pid", ct.c_ulonglong),
("ts", ct.c_ulonglong),
("comm", ct.c_char * TASK_COMM_LEN)]
# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))
# process event
start = 0
def print_event(cpu, data, size):
global start
event = ct.cast(data, ct.POINTER(Data)).contents
if start == 0:
start = event.ts
time_s = (float(event.ts - start)) / 1000000000
print("%-18.9f %-16s %-6d %s" % (time_s, event.comm, event.pid,
"Hello, perf_output!"))
# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
```
Things to learn:
1. ```struct data_t```: This defines the C struct we'll use to pass data from kernel to user space.
1. ```BPF_PERF_OUTPUT(events)```: This names our output channel "events".
1. ```struct data_t data = {};```: Create an empty data_t struct that we'll then populate.
1. ```bpf_get_current_pid_tgid()```: Returns the process ID in the lower 32 bits (kernel's view of the PID, which in user space is usually presented as the thread ID), and the thread group ID in the upper 32 bits (what user space often thinks of as the PID). By directly setting this to a u32, we discard the upper 32 bits. Should you be presenting the PID or the TGID? For a multi-threaded app, the TGID will be the same, so you need the PID to differentiate them, if that's what you want. It's also a question of expectations for the end user.
1. ```bpf_get_current_comm()```: Populates the first argument address with the current process name.
1. ```events.perf_submit()```: Submit the event for user space to read via a perf ring buffer.
1. ```class Data(ct.Structure)```: Now define the Python version of the C data structure.
1. ```def print_event()```: Define a Python function that will handle reading events from the ```events``` stream.
1. ```b["events"].open_perf_buffer(print_event)```: Associate the Python ```print_event``` function with the ```events``` stream.
1. ```while 1: b.kprobe_poll()```: Block waiting for events.
This may be improved in future bcc versions. Eg, the Python data struct could be auto-generated from the C code.
### Lesson 8. sync_perf_output.py
Rewrite sync_timing.py, from a prior lesson, to use ```BPF_PERF_OUTPUT```.
### Lesson 9. bitesize.py
The following tool records a histogram of disk I/O sizes. Sample output:
```
# ./bitehist.py
Tracing... Hit Ctrl-C to end.
^C
kbytes : count distribution
0 -> 1 : 3 | |
2 -> 3 : 0 | |
4 -> 7 : 211 |********** |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 800 |**************************************|
```
Code is [examples/tracing/bitesize.py](../examples/tracing/bitesize.py):
```Python
from bcc import BPF
from time import sleep
# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
BPF_HISTOGRAM(dist);
int kprobe__blk_account_io_completion(struct pt_regs *ctx, struct request *req)
{
dist.increment(bpf_log2l(req->__data_len / 1024));
return 0;
}
""")
# header
print("Tracing... Hit Ctrl-C to end.")
# trace until Ctrl-C
try:
sleep(99999999)
except KeyboardInterrupt:
print
# output
b["dist"].print_log2_hist("kbytes")
```
A recap from earlier lessions:
- ```kprobe__```: This prefix means the rest will be treated as a kernel function name that will be instrumented using kprobe.
- ```struct pt_regs *ctx, struct request *req```: Arguments to kprobe. The ```ctx``` is registers and BPF context, the ```req``` is the first argument to the instrumented function: ```blk_account_io_completion()```.
- ```req->__data_len```: Dereferencing that member.
New things to learn:
1. ```BPF_HISTOGRAM(dist)```: Defines a BPF map object that is a histogram, and names it "dist".
1. ```dist.increment()```: Increments the histogram bucket index provided as an argument by one.
1. ```bpf_log2l()```: Returns the log-2 of the provided value. This becomes the index of our histogram, so that we're constructing a power-of-2 histogram.
1. ```b["dist"].print_log2_hist("kbytes")```: Prints the "dist" histogram as power-of-2, with a column header of "kbytes". The only data transferred from kernel to user space is the bucket counts, making this efficient.
### Lesson 10. disklatency.py
Write a program that times disk I/O, and prints a histogram of their latency. Disk I/O instrumentation and timing can be found in the disksnoop.py program from a prior lesson, and histogram code can be found in bitesize.py from a prior lesson.
### Lesson 11. vfsreadlat.py
This example is split into separate Python and C files. Example output:
```
# ./vfsreadlat.py 1
Tracing... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 2 |*********** |
4 -> 7 : 7 |****************************************|
8 -> 15 : 4 |********************** |
usecs : count distribution
0 -> 1 : 29 |****************************************|
2 -> 3 : 28 |************************************** |
4 -> 7 : 4 |***** |
8 -> 15 : 8 |*********** |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 2 |** |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 4 |***** |
8192 -> 16383 : 6 |******** |
16384 -> 32767 : 9 |************ |
32768 -> 65535 : 6 |******** |
65536 -> 131071 : 2 |** |
usecs : count distribution
0 -> 1 : 11 |****************************************|
2 -> 3 : 2 |******* |
4 -> 7 : 10 |************************************ |
8 -> 15 : 8 |***************************** |
16 -> 31 : 1 |*** |
32 -> 63 : 2 |******* |
[...]
```
Browse the code in [examples/tracing/vfsreadlat.py](../examples/tracing/vfsreadlat.py) and [examples/tracing/vfsreadlat.c](../examples/tracing/vfsreadlat.c). Things to learn:
1. ```b = BPF(src_file = "vfsreadlat.c")```: Read the BPF C program from a separate source file.
1. ```b.attach_kretprobe(event="vfs_read", fn_name="do_return")```: Attaches the BPF C function ```do_return()``` to the return of the kernel function ```vfs_read()```. This is a kretprobe: instrumenting the return from a function, rather than its entry.
1. ```b["dist"].clear()```: Clears the histogram.
### Lesson 12. urandomread.py
Tracing while a ```dd if=/dev/urandom of=/dev/null bs=8k count=5``` is run:
```
# ./urandomread.py
TIME(s) COMM PID GOTBITS
24652832.956994001 smtp 24690 384
24652837.726500999 dd 24692 65536
24652837.727111001 dd 24692 65536
24652837.727703001 dd 24692 65536
24652837.728294998 dd 24692 65536
24652837.728888001 dd 24692 65536
```
Hah! I caught smtp by accident. Code is [examples/tracing/urandomread.py](../examples/tracing/urandomread.py):
```Python
from bcc import BPF
# load BPF program
b = BPF(text="""
TRACEPOINT_PROBE(random, urandom_read) {
// args is from /sys/kernel/debug/tracing/events/random/urandom_read/format
bpf_trace_printk("%d\\n", args->got_bits);
return 0;
};
""")
# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "GOTBITS"))
# format output
while 1:
try:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
except ValueError:
continue
print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
```
Things to learn:
1. ```TRACEPOINT_PROBE(random, urandom_read)```: Instrument the kernel tracepoint ```random:urandom_read```. These have a stable API, and thus are recommend to use instead of kprobes, wherever possible. You can run ```perf list``` for a list of tracepoints.
1. ```args->got_bits```: ```args``` is auto-populated to be a structure of the tracepoint arguments. The comment above says where you can see that structure. Eg:
```
# cat /sys/kernel/debug/tracing/events/random/urandom_read/format
name: urandom_read
ID: 972
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int got_bits; offset:8; size:4; signed:1;
field:int pool_left; offset:12; size:4; signed:1;
field:int input_left; offset:16; size:4; signed:1;
print fmt: "got_bits %d nonblocking_pool_entropy_left %d input_entropy_left %d", REC->got_bits, REC->pool_left, REC->input_left
```
In this case, we were printing the ```got_bits``` member.
### Lesson 13. disksnoop.py fixed
Convert disksnoop.py from a previous lesson to use the ```block:block_rq_issue``` and ```block:block_rq_complete``` tracepoints.
### Lesson 14. nodejs_http_server.py
This program instruments a user-defined static tracing (USDT) probe, which is the user-level version of a kernel tracepoint. Sample output:
```
# ./nodejs_http_server.py 24728
TIME(s) COMM PID ARGS
24653324.561322998 node 24728 path:/index.html
24653335.343401998 node 24728 path:/images/welcome.png
24653340.510164998 node 24728 path:/images/favicon.png
```
Relevent code from [examples/tracing/nodejs_http_server.py](../examples/tracing/nodejs_http_server.py):
```Python
if len(sys.argv) < 2:
print("USAGE: nodejs_http_server PID")
exit()
pid = sys.argv[1]
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
uint64_t addr;
char path[128];
bpf_usdt_readarg(6, ctx, &addr);
bpf_probe_read(&path, sizeof(path), (void *)addr);
bpf_trace_printk("path:%s\\n", path);
return 0;
};
"""
# enable USDT probe from given PID
u = USDT(pid=int(pid))
u.enable_probe(probe="http__server__request", fn_name="do_trace")
# initialize BPF
b = BPF(text=bpf_text, usdt=u)
```
Things to learn:
1. ```bpf_usdt_readarg(6, ctx, &addr)```: Read the address of argument 6 from the USDT probe into ```addr```.
1. ```bpf_probe_read(&path, sizeof(path), (void *)addr)```: Now the string ```addr``` points to into our ```path``` variable.
1. ```u = USDT(pid=int(pid))```: Initialize USDT tracing for the given PID.
1. ```u.enable_probe(probe="http__server__request", fn_name="do_trace")```: Attach our ```do_trace()``` BPF C function to the Node.js ```http__server__request``` USDT probe.
1. ```b = BPF(text=bpf_text, usdt=u)```: Need to pass in our USDT object, ```u```, to BPF object creation.
### Lesson 15. task_switch.c
This is an older tutorial included as a bonus lesson. Use this for recap and to reinforce what you've already learned.
This is a slightly more complex tracing example than Hello World. This program
will be invoked for every task change in the kernel, and record in a BPF map
the new and old pids.
The C program below introduces two new concepts.
The first is the macro `BPF_TABLE`. This defines a table (type="hash"), with key
type `key_t` and leaf type `u64` (a single counter). The table name is `stats`,
containing 1024 entries maximum. One can `lookup`, `lookup_or_init`, `update`,
and `delete` entries from the table.
The second concept is the prev argument. This argument is treated specially by
the BCC frontend, such that accesses to this variable are read from the saved
context that is passed by the kprobe infrastructure. The prototype of the args
starting from position 1 should match the prototype of the kernel function being
kprobed. If done so, the program will have seamless access to the function
parameters.
```c
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
struct key_t {
u32 prev_pid;
u32 curr_pid;
};
// map_type, key_type, leaf_type, table_name, num_entry
BPF_TABLE("hash", struct key_t, u64, stats, 1024);
// attach to finish_task_switch in kernel/sched/core.c, which has the following
// prototype:
// struct rq *finish_task_switch(struct task_struct *prev)
int count_sched(struct pt_regs *ctx, struct task_struct *prev) {
struct key_t key = {};
u64 zero = 0, *val;
key.curr_pid = bpf_get_current_pid_tgid();
key.prev_pid = prev->pid;
val = stats.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
```
The userspace component loads the file shown above, and attaches it to the
`finish_task_switch` kernel function.
The [] operator of the BPF object gives access to each BPF_TABLE in the
program, allowing pass-through access to the values residing in the kernel. Use
the object as you would any other python dict object: read, update, and deletes
are all allowed.
```python
from bcc import BPF
from time import sleep
b = BPF(src_file="task_switch.c")
b.attach_kprobe(event="finish_task_switch", fn_name="count_sched")
# generate many schedule events
for i in range(0, 100): sleep(0.01)
for k, v in b["stats"].items():
print("task_switch[%5d->%5d]=%u" % (k.prev_pid, k.curr_pid, v.value))
```
These programs have now been merged, and are both in [examples/tracing/task_switch.py](examples/tracing/task_switch.py).
### Lesson 16. Further Study
For further study, see Sasha Goldshtein's [linux-tracing-workshop](https://github.com/goldshtn/linux-tracing-workshop), which contains additional labs. There are also many tools in bcc /tools to study.
Please read [CONTRIBUTING-SCRIPTS.md](../CONTRIBUTING-SCRIPTS.md) if you wish to contrubite tools to bcc. At the bottom of the main [README.md](../README.md), you'll also find methods for contacting us. Good luck, and happy tracing!
## Networking
To do.
#!/usr/bin/env python
#
# This is a Hello World example that formats output as fields.
from bcc import BPF
# define BPF program
prog = """
int hello(void *ctx) {
bpf_trace_printk("Hello, World!\\n");
return 0;
}
"""
# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event="sys_clone", fn_name="hello")
# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))
# format output
while 1:
try:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
except ValueError:
continue
print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))
#!/usr/bin/env python
#
# This is a Hello World example that uses BPF_PERF_OUTPUT.
from bcc import BPF
import ctypes as ct
# define BPF program
prog = """
#include <linux/sched.h>
// define output data structure in C
#define MSG_MAX 32
struct data_t {
u32 pid;
u64 ts;
char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);
int hello(struct pt_regs *ctx) {
struct data_t data = {};
data.pid = bpf_get_current_pid_tgid();
data.ts = bpf_ktime_get_ns();
bpf_get_current_comm(&data.comm, sizeof(data.comm));
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
"""
# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event="sys_clone", fn_name="hello")
# define output data structure in Python
TASK_COMM_LEN = 16 # linux/sched.h
MSG_MAX = 32
class Data(ct.Structure):
_fields_ = [("pid", ct.c_ulonglong),
("ts", ct.c_ulonglong),
("comm", ct.c_char * TASK_COMM_LEN)]
# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "MESSAGE"))
# process event
start = 0
def print_event(cpu, data, size):
global start
event = ct.cast(data, ct.POINTER(Data)).contents
if start == 0:
start = event.ts
time_s = (float(event.ts - start)) / 1000000000
print("%-18.9f %-16s %-6d %s" % (time_s, event.comm, event.pid,
"Hello, perf_output!"))
# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
#!/usr/bin/python
#
# sync_timing.py Trace time between syncs.
# For Linux, uses BCC, eBPF. Embedded C.
#
# Written as a basic example of tracing time between events.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
from __future__ import print_function
from bcc import BPF
# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
BPF_HASH(last);
void do_trace(struct pt_regs *ctx) {
u64 ts, *tsp, delta, key = 0;
// attempt to read stored timestamp
tsp = last.lookup(&key);
if (tsp != 0) {
delta = bpf_ktime_get_ns() - *tsp;
if (delta < 1000000000) {
// output if time is less than 1 second
bpf_trace_printk("%d\\n", delta / 1000000);
}
last.delete(&key);
}
// update stored timestamp
ts = bpf_ktime_get_ns();
last.update(&key, &ts);
}
""")
b.attach_kprobe(event="sys_sync", fn_name="do_trace")
print("Tracing for quick sync's... Ctrl-C to end")
# format output
start = 0
while 1:
(task, pid, cpu, flags, ts, ms) = b.trace_fields()
if start == 0:
start = ts
ts = ts - start
print("At time %.2f s: multiple syncs detected, last %s ms ago" % (ts, ms))
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment