Commit fdfacb71 authored by Allan McAleavy

Merge pull request #2 from iovisor/master

update local repo
parents eb3c9605 dd221c4e
...@@ -24,11 +24,11 @@ As said earlier: keep it short, neat, and documented (code comments).
A checklist for bcc tool development:
1. **Research the topic landscape**. Learn the existing tools and metrics (incl. from /proc). Determine what real world problems exist and need solving. We have too many tools and metrics as it is, we don't need more "I guess that's useful" tools, we need more "ah-hah! I couldn't do this before!" tools. Consider asking other developers about your idea. Many of us can be found in IRC, in the #iovisor channel on irc.oftc.net. There's also the mailing list (see the README.md), and github for issues.
1. **Create a known workload for testing**. This might involve writing a 10 line C program, using a micro-benchmark, or just improvising at the shell (see the sketch after this list). If you don't know how to create a workload, learn! Figuring this out will provide invaluable context and details that you may have otherwise overlooked. Sometimes it's easy, and I'm able to just use dd(1) from /dev/urandom or a disk device to /dev/null. It lets me set the I/O size, count, and provides throughput statistics for cross-checking my tool output. But other times I need a micro-benchmark, or some C.
1. **Write the tool to solve the problem and no more**. Unix philosophy: do one thing and do it well. netstat doesn't have an option to dump packets, tcpdump-style. They are two different tools.
1. **Check your tool correctly measures your known workload**. If possible, run a prime number of events (eg, 23) and check that the numbers match. Try other workload variations.
1. **Use other observability tools to perform a cross-check or sanity check**. Eg, imagine you write a PCI bus tool that shows current throughput is 28 Gbytes/sec. How could you sanity test that? Well, what PCI devices are there? Disks and network cards? Measure their throughput (iostat, nicstat, sar), and check if it is in the ballpark of 28 Gbytes/sec (which would include PCI frame overheads). Ideally, your numbers match.
1. **Measure the overhead of the tool**. If you are running a micro-benchmark, how much slower is it with the tool running? Is more CPU consumed? Try to determine the worst case: run the micro-benchmark so that CPU headroom is exhausted, and then run the bcc tool. Can overhead be lowered?
1. **Test again, and stress test**. You want to discover and fix all the bad things before others hit them.
1. **Consider command line options**. Should it have -p for filtering on a PID? -T for timestamps? -i for interval? See other tools for examples, and copy the style: the usage message should list example usage at the end. Remember to keep the tool doing one thing and doing it well. Also, if there's one option that seems to be the common case, perhaps it should just be the first argument and not need a switch (no -X). A special case of this is *stat tools, like iostat/vmstat/etc, where the convention is [interval [count]].
1. **Use pep8 to check Python style**: pep8 --show-source --ignore=E123,E125,E126,E127,E128,E302 filename . Note that it misses some things, like consistent usage, so you'll still need to double check your script.
...@@ -36,5 +36,6 @@ A checklist for bcc tool development:
1. **Read your example.txt file**. Does this sound too niche or convoluted? Are you spending too much time explaining caveats? These can be hints that perhaps you should fix your tool, or abandon it! Perhaps it better belongs as an /example, and not a tool. I've abandoned many tools at this stage.
1. **Write a man page**. Either ROFF (.8), markdown (.md), or plain text (.txt): so long as it documents the important sections, particularly columns (fields) and caveats. These go under man/man8. See the other examples. Include a section on overhead, and pull no punches. It's better for end users to know about high overhead beforehand, than to discover it the hard way. Also explain caveats. Don't assume those will be obvious to tool users.
1. **Read your man page**. For ROFF: nroff -man filename. Like before, this exercise is like saying something out loud. Does it sound too niche or convoluted? Again, hints that you might need to go back and fix things, or abandon it.
1. **Spell check your documentation**. Use a spell checker like aspell to check your document quality before committing.
1. **Add an entry to README.md**.
1. If you made it this far, pull request!
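
As a sketch of checklist steps 2 and 4 (an illustration only, not part of bcc), here is a tiny known workload in Python: it performs a prime number (23) of open/read/close cycles, so a tool such as opensnoop or funccount should count exactly 23 events:

```Python
#!/usr/bin/python
# hypothetical known workload for testing a tool (illustration only)
# performs a prime number (23) of open/read/close cycles, so a correct
# tracing tool should report exactly 23 events
import os

N = 23  # prime, so the count stands out from background activity

for i in range(N):
    fd = os.open("/etc/hostname", os.O_RDONLY)  # visible to, eg, opensnoop
    os.read(fd, 64)
    os.close(fd)

print("done: %d open/read/close cycles" % N)
```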
...@@ -67,21 +67,32 @@ Tools:
- tools/[bashreadline](tools/bashreadline.py): Print entered bash commands system wide. [Examples](tools/bashreadline_example.txt).
- tools/[biolatency](tools/biolatency.py): Summarize block device I/O latency as a histogram. [Examples](tools/biolatency_example.txt).
- tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
- tools/[biotop](tools/biotop.py): Top for disks: Summarize block device I/O by process. [Examples](tools/biotop_example.txt).
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[filelife](tools/filelife.py): Trace the lifespan of short-lived files. [Examples](tools/filelife_example.txt).
- tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
- tools/[filetop](tools/filetop.py): File reads and writes by filename and process. Top for files. [Examples](tools/filetop_example.txt).
- tools/[funccount](tools/funccount.py): Count kernel function calls. [Examples](tools/funccount_example.txt).
- tools/[funclatency](tools/funclatency.py): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
- tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
- tools/[hardirqs](tools/hardirqs.py): Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt).
- tools/[killsnoop](tools/killsnoop.py): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
- tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_examples.txt).
- tools/[offcputime](tools/offcputime.py): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
- tools/[oomkill](tools/oomkill.py): Trace the out-of-memory (OOM) killer. [Examples](tools/oomkill_example.txt).
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[stackcount](tools/stackcount.py): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
- tools/[stacksnoop](tools/stacksnoop.py): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
- tools/[statsnoop](tools/statsnoop.py): Trace stat() syscalls. [Examples](tools/statsnoop_example.txt).
- tools/[syncsnoop](tools/syncsnoop.py): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
- tools/[tcpaccept](tools/tcpaccept.py): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
- tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
...@@ -177,7 +188,7 @@ section of the [kernel ftrace doc](https://www.kernel.org/doc/Documentation/trac
### Networking
At Red Hat Summit 2015, BCC was presented as part of a [session on BPF](http://www.devnation.org/#7784f1f7513e8542e4db519e79ff5eec).
A multi-host vxlan environment is simulated and a BPF program used to monitor
one of the physical interfaces. The BPF program keeps statistics on the inner
and outer IP addresses traversing the interface, and the userspace component
......
# Simple HTTP Filter: Project purpose
Write an eBPF application that parses HTTP packets and extracts (and prints on screen) the URL contained in the GET/POST request.
[eBPF HTTP Filter - Short Presentation](https://github.com/iovisor/bpf-docs/blob/master/ebpf_http_filter.pdf)
# Usage Example
```Shell
$ sudo python http-parse-v2.py
GET /pipermail/iovisor-dev/ HTTP/1.1
HTTP/1.1 200 OK
GET /favicon.ico HTTP/1.1
HTTP/1.1 404 Not Found
GET /pipermail/iovisor-dev/2016-January/thread.html HTTP/1.1
HTTP/1.1 200 OK
GET /pipermail/iovisor-dev/2016-January/000046.html HTTP/1.1
HTTP/1.1 200 OK
```
# Implementation using BCC
eBPF socket filter.<br />
Filters IP and TCP packets whose payload starts with "HTTP", "GET", or "POST", plus all subsequent packets belonging to the same session, i.e. having the same (ip_src, ip_dst, port_src, port_dst).<br />
The program is loaded as PROG_TYPE_SOCKET_FILTER and attached to a raw socket bound to eth0.<br />
Matching packets are forwarded to user space; the others are dropped by the filter.<br />
<br />
The Python script reads the filtered raw packets from the socket, reassembles packets belonging to the same session if necessary, and prints the first line of the HTTP GET/POST request on stdout.<br />
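
For orientation, here is a minimal sketch of the load-and-attach flow that both scripts below implement (names are taken from the actual scripts; packet parsing and error handling are omitted):

```Python
#!/usr/bin/python
# minimal sketch of the socket-filter flow used by http-parse.py and
# http-parse-v2.py; see the full scripts for the packet parsing
from bcc import BPF
import os

# compile the eBPF C source and load http_filter as a SOCKET_FILTER program
bpf = BPF(src_file="http-parse.c", debug=0)
function_http_filter = bpf.load_func("http_filter", BPF.SOCKET_FILTER)

# create a raw socket bound to eth0 and attach the eBPF program to it
BPF.attach_raw_socket(function_http_filter, "eth0")
socket_fd = function_http_filter.sock

while True:
    # only packets the filter kept (returned -1 for) reach user space
    packet_str = os.read(socket_fd, 2048)
    # ... parse the Ethernet/IP/TCP headers and print the HTTP request line
```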
# v1 vs v2
The first version is the simple one: if the URL is too long (split across more than one packet), it is truncated.<br />
The second version is more complex: if necessary, it reassembles packets belonging to the same session and prints the complete URL.
# To run:
```Shell
$ sudo python http-parse.py
$ sudo python http-parse-v2.py
```
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#define IP_TCP 6
#define ETH_HLEN 14
struct Key {
u32 src_ip; //source ip
u32 dst_ip; //destination ip
unsigned short src_port; //source port
unsigned short dst_port; //destination port
};
struct Leaf {
	int timestamp; //timestamp in seconds (set by the userspace script)
};
//BPF_TABLE(map_type, key_type, leaf_type, table_name, num_entry)
//map <Key, Leaf>
//tracing sessions having same Key(dst_ip, src_ip, dst_port,src_port)
BPF_TABLE("hash", struct Key, struct Leaf, sessions, 1024);
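//the userspace script accesses this map via bpf.get_table("sessions")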
/*eBPF program.
Filter IP and TCP packets, having payload not empty
and containing "HTTP", "GET", "POST" as first bytes of payload,
AND ALL the other packets having the same (src_ip,dst_ip,src_port,dst_port),
which means belonging to the same "session".
This additional check avoids url truncation if the url is too long:
the userspace script, if necessary, reassembles urls split across 2 or more packets.
If the program is loaded as PROG_TYPE_SOCKET_FILTER
and attached to a socket:
return 0 -> DROP the packet
return -1 -> KEEP the packet and return it to user space (userspace can read it from the socket_fd)
*/
int http_filter(struct __sk_buff *skb) {
u8 *cursor = 0;
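	//cursor_advance() is a bcc helper: it returns the current cursor and advances
	//it by the given size, giving typed access to each protocol header in turn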
struct ethernet_t *ethernet = cursor_advance(cursor, sizeof(*ethernet));
//filter IP packets (ethernet type = 0x0800)
if (!(ethernet->type == 0x0800)) {
goto DROP;
}
struct ip_t *ip = cursor_advance(cursor, sizeof(*ip));
//filter TCP packets (ip next protocol = 0x06)
if (ip->nextp != IP_TCP) {
goto DROP;
}
u32 tcp_header_length = 0;
u32 ip_header_length = 0;
u32 payload_offset = 0;
u32 payload_length = 0;
struct Key key;
struct Leaf leaf;
struct tcp_t *tcp = cursor_advance(cursor, sizeof(*tcp));
//retrieve ip src/dest and port src/dest of current packet
//and save it into struct Key
key.dst_ip = ip->dst;
key.src_ip = ip->src;
key.dst_port = tcp->dst_port;
key.src_port = tcp->src_port;
//calculate ip header length
//value to multiply * 4
//e.g. ip->hlen = 5 ; IP Header Length = 5 x 4 byte = 20 byte
ip_header_length = ip->hlen << 2; //SHL 2 -> *4 multiply
//calculate tcp header length
//value to multiply *4
//e.g. tcp->offset = 5 ; TCP Header Length = 5 x 4 byte = 20 byte
tcp_header_length = tcp->offset << 2; //SHL 2 -> *4 multiply
	//calculate payload offset and length
payload_offset = ETH_HLEN + ip_header_length + tcp_header_length;
payload_length = ip->tlen - ip_header_length - tcp_header_length;
//http://stackoverflow.com/questions/25047905/http-request-minimum-size-in-bytes
	//minimum length of a valid http request is always greater than 7 bytes
	//avoid invalid memory access
	//this check also drops packets with an empty payload
if(payload_length < 7) {
goto DROP;
}
	//load first 7 bytes of payload into p (payload_array)
	//direct access to the skb is not allowed
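	//load_byte() is bcc's helper for the BPF LD_ABS byte load, the permitted
	//way to read packet bytes from a socket filter program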
unsigned long p[7];
int i = 0;
int j = 0;
for (i = payload_offset ; i < (payload_offset + 7) ; i++) {
p[j] = load_byte(skb , i);
j++;
}
//find a match with an HTTP message
//HTTP
if ((p[0] == 'H') && (p[1] == 'T') && (p[2] == 'T') && (p[3] == 'P')) {
goto HTTP_MATCH;
}
//GET
if ((p[0] == 'G') && (p[1] == 'E') && (p[2] == 'T')) {
goto HTTP_MATCH;
}
//POST
if ((p[0] == 'P') && (p[1] == 'O') && (p[2] == 'S') && (p[3] == 'T')) {
goto HTTP_MATCH;
}
//PUT
if ((p[0] == 'P') && (p[1] == 'U') && (p[2] == 'T')) {
goto HTTP_MATCH;
}
//DELETE
if ((p[0] == 'D') && (p[1] == 'E') && (p[2] == 'L') && (p[3] == 'E') && (p[4] == 'T') && (p[5] == 'E')) {
goto HTTP_MATCH;
}
//HEAD
if ((p[0] == 'H') && (p[1] == 'E') && (p[2] == 'A') && (p[3] == 'D')) {
goto HTTP_MATCH;
}
//no HTTP match
	//check if the packet belongs to an HTTP session
struct Leaf * lookup_leaf = sessions.lookup(&key);
if(lookup_leaf) {
//send packet to userspace
goto KEEP;
}
goto DROP;
	//keep the packet and send it to userspace returning -1
HTTP_MATCH:
//if not already present, insert into map <Key, Leaf>
leaf.timestamp = 0;
sessions.lookup_or_init(&key, &leaf);
sessions.update(&key,&leaf);
//send packet to userspace returning -1
KEEP:
return -1;
//drop the packet returning 0
DROP:
return 0;
}
#!/usr/bin/python
#
#Bertrone Matteo - Polytechnic of Turin
#November 2015
#
#eBPF application that parses HTTP packets
#and extracts (and prints on screen) the URL contained in the GET/POST request.
#
#eBPF program http_filter is used as SOCKET_FILTER attached to eth0 interface.
#only packets of type IP and TCP containing HTTP GET/POST are returned to userspace, the others are dropped
#
#python script uses bcc BPF Compiler Collection by iovisor (https://github.com/iovisor/bcc)
#and prints on stdout the first line of the HTTP GET/POST request containing the url
from __future__ import print_function
from bcc import BPF
from ctypes import *
from struct import *
import sys
import socket
import os
import struct
import binascii
import time
CLEANUP_N_PACKETS = 50 #run cleanup every CLEANUP_N_PACKETS packets received
MAX_URL_STRING_LEN = 8192 #max url string len (usually 8K)
MAX_AGE_SECONDS = 30 #max age entry in bpf_sessions map
#-----FUNCTIONS-BEGIN----------------------#
#convert a bin string into a string of hex char
#helper function to print raw packet in hex
def toHex(s):
	lst = []
	for ch in s:
		hv = hex(ord(ch)).replace('0x', '')
		if len(hv) == 1:
			hv = '0' + hv
		lst.append(hv)
	return ''.join(lst)
#print the string until the first CR+LF (the end of the HTTP request line)
def printUntilCRLF(s):
	for k in range(0, len(s)):
		if (s[k] == '\n'):
			if (s[k-1] == '\r'):
				print("")
				return
		print("%c" % (s[k]), end="")
	print("")
	return
#cleanup function
def cleanup():
#get current time in seconds
current_time = int(time.time())
	#looking for leaves having:
	#timestamp == 0 --> update with current timestamp
	#AGE > MAX_AGE_SECONDS --> delete item
for key,leaf in bpf_sessions.items():
try:
current_leaf = bpf_sessions[key]
#set timestamp if timestamp == 0
if (current_leaf.timestamp == 0):
bpf_sessions[key] = bpf_sessions.Leaf(current_time)
else:
#delete older entries
if (current_time - current_leaf.timestamp > MAX_AGE_SECONDS):
del bpf_sessions[key]
except:
print("cleanup exception.")
return
#-----FUNCTIONS-END-------------------------#
# initialize BPF - load source code from http-parse-v2.c
bpf = BPF(src_file="http-parse-v2.c", debug=0)
#load eBPF program http_filter of type SOCKET_FILTER into the kernel eBPF vm
#more info about eBPF program types
#http://man7.org/linux/man-pages/man2/bpf.2.html
function_http_filter = bpf.load_func("http_filter", BPF.SOCKET_FILTER)
#create raw socket, bind it to eth0
#attach bpf program to socket created
BPF.attach_raw_socket(function_http_filter, "eth0")
#get file descriptor of the socket previously created inside BPF.attach_raw_socket
socket_fd = function_http_filter.sock
#create python socket object, from the file descriptor
sock = socket.fromfd(socket_fd,socket.PF_PACKET,socket.SOCK_RAW,socket.IPPROTO_IP)
#set it as blocking socket
sock.setblocking(True)
#get pointer to bpf map of type hash
bpf_sessions = bpf.get_table("sessions")
#packets counter
packet_count = 0
#dictionary containing association <key(ipsrc,ipdst,portsrc,portdst),payload_string>
#if a url is not entirely contained in one packet, save the first part of it in this local dict
#when \r\n is found in a subsequent packet, append it and print the whole url
local_dictionary = {}
while 1:
#retrieve raw packet from socket
packet_str = os.read(socket_fd,4096) #set packet length to max packet length on the interface
packet_count += 1
#DEBUG - print raw packet in hex format
#packet_hex = toHex(packet_str)
#print ("%s" % packet_hex)
#convert packet into bytearray
packet_bytearray = bytearray(packet_str)
#ethernet header length
ETH_HLEN = 14
#IP HEADER
#https://tools.ietf.org/html/rfc791
# 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |Version| IHL |Type of Service| Total Length |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#
#IHL : Internet Header Length is the length of the internet header
#value to multiply * 4 byte
#e.g. IHL = 5 ; IP Header Length = 5 * 4 byte = 20 byte
#
#Total length: This 16-bit field defines the entire packet size,
#including header and data, in bytes.
#calculate packet total length
total_length = packet_bytearray[ETH_HLEN + 2] #load MSB
total_length = total_length << 8 #shift MSB
total_length = total_length + packet_bytearray[ETH_HLEN+3] #add LSB
#calculate ip header length
ip_header_length = packet_bytearray[ETH_HLEN] #load Byte
ip_header_length = ip_header_length & 0x0F #mask bits 0..3
ip_header_length = ip_header_length << 2 #shift to obtain length
#retrieve ip source/dest
ip_src_str = packet_str[ETH_HLEN+12:ETH_HLEN+16] #ip source offset 12..15
ip_dst_str = packet_str[ETH_HLEN+16:ETH_HLEN+20] #ip dest offset 16..19
ip_src = int(toHex(ip_src_str),16)
ip_dst = int(toHex(ip_dst_str),16)
#TCP HEADER
#https://www.rfc-editor.org/rfc/rfc793.txt
# 12 13 14 15
# 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# | Data | |U|A|P|R|S|F| |
# | Offset| Reserved |R|C|S|S|Y|I| Window |
# | | |G|K|H|T|N|N| |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#
#Data Offset: This indicates where the data begins.
#The TCP header is an integral number of 32 bits long.
#value to multiply * 4 byte
#e.g. DataOffset = 5 ; TCP Header Length = 5 * 4 byte = 20 byte
#calculate tcp header length
tcp_header_length = packet_bytearray[ETH_HLEN + ip_header_length + 12] #load Byte
tcp_header_length = tcp_header_length & 0xF0 #mask bit 4..7
tcp_header_length = tcp_header_length >> 2 #SHR 4 ; SHL 2 -> SHR 2
#retrieve port source/dest
port_src_str = packet_str[ETH_HLEN+ip_header_length:ETH_HLEN+ip_header_length+2]
port_dst_str = packet_str[ETH_HLEN+ip_header_length+2:ETH_HLEN+ip_header_length+4]
port_src = int(toHex(port_src_str),16)
port_dst = int(toHex(port_dst_str),16)
#calculate payload offset
payload_offset = ETH_HLEN + ip_header_length + tcp_header_length
#payload_string contains only packet payload
payload_string = packet_str[(payload_offset):(len(packet_bytearray))]
#CR + LF (substring to find)
crlf = "\r\n"
	#current_Key contains ip source/dest and port source/dest
#useful for direct bpf_sessions map access
current_Key = bpf_sessions.Key(ip_src,ip_dst,port_src,port_dst)
#looking for HTTP GET/POST request
if ((payload_string[:3] == "GET") or (payload_string[:4] == "POST") or (payload_string[:4] == "HTTP") \
or ( payload_string[:3] == "PUT") or (payload_string[:6] == "DELETE") or (payload_string[:4] == "HEAD") ):
#match: HTTP GET/POST packet found
if (crlf in payload_string):
#url entirely contained in first packet -> print it all
printUntilCRLF(payload_string)
#delete current_Key from bpf_sessions, url already printed. current session not useful anymore
try:
del bpf_sessions[current_Key]
except:
print ("error during delete from bpf map ")
else:
#url NOT entirely contained in first packet
#not found \r\n in payload.
#save current part of the payload_string in dictionary <key(ips,ipd,ports,portd),payload_string>
local_dictionary[binascii.hexlify(current_Key)] = payload_string
else:
#NO match: HTTP GET/POST NOT found
		#check if the packet belongs to a session saved in bpf_sessions
if (current_Key in bpf_sessions):
			#check if the packet belongs to a session saved in local_dictionary
			#(local_dictionary maintains HTTP GET/POST urls not printed yet because split across N packets)
if (binascii.hexlify(current_Key) in local_dictionary):
#first part of the HTTP GET/POST url is already present in local dictionary (prev_payload_string)
prev_payload_string = local_dictionary[binascii.hexlify(current_Key)]
#looking for CR+LF in current packet.
if (crlf in payload_string):
					#last packet, containing the last part of an HTTP GET/POST url split across N packets.
#append current payload
prev_payload_string += payload_string
#print HTTP GET/POST url
printUntilCRLF(prev_payload_string)
#clean bpf_sessions & local_dictionary
try:
del bpf_sessions[current_Key]
del local_dictionary[binascii.hexlify(current_Key)]
except:
print ("error deleting from map or dictionary")
else:
					#NOT the last packet: contains part of an HTTP GET/POST url split across N packets.
#append current payload
prev_payload_string += payload_string
					#check that the size limit is not exceeded (usually an HTTP GET/POST url is < 8K)
if (len(prev_payload_string) > MAX_URL_STRING_LEN):
print("url too long")
try:
del bpf_sessions[current_Key]
del local_dictionary[binascii.hexlify(current_Key)]
except:
print ("error deleting from map or dict")
#update dictionary
local_dictionary[binascii.hexlify(current_Key)] = prev_payload_string
else:
#first part of the HTTP GET/POST url is NOT present in local dictionary
#bpf_sessions contains invalid entry -> delete it
try:
del bpf_sessions[current_Key]
except:
print ("error del bpf_session")
	#check if dirty entries are present in bpf_sessions
if (((packet_count) % CLEANUP_N_PACKETS) == 0):
cleanup()
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#define IP_TCP 6
#define ETH_HLEN 14
/*eBPF program.
Filter IP and TCP packets, having payload not empty
and containing "HTTP", "GET", "POST" ... as first bytes of payload.
If the program is loaded as PROG_TYPE_SOCKET_FILTER
and attached to a socket:
return 0 -> DROP the packet
return -1 -> KEEP the packet and return it to user space (userspace can read it from the socket_fd)
*/
int http_filter(struct __sk_buff *skb) {
u8 *cursor = 0;
struct ethernet_t *ethernet = cursor_advance(cursor, sizeof(*ethernet));
//filter IP packets (ethernet type = 0x0800)
if (!(ethernet->type == 0x0800)) {
goto DROP;
}
struct ip_t *ip = cursor_advance(cursor, sizeof(*ip));
//filter TCP packets (ip next protocol = 0x06)
if (ip->nextp != IP_TCP) {
goto DROP;
}
u32 tcp_header_length = 0;
u32 ip_header_length = 0;
u32 payload_offset = 0;
u32 payload_length = 0;
struct tcp_t *tcp = cursor_advance(cursor, sizeof(*tcp));
//calculate ip header length
//value to multiply * 4
//e.g. ip->hlen = 5 ; IP Header Length = 5 x 4 byte = 20 byte
ip_header_length = ip->hlen << 2; //SHL 2 -> *4 multiply
//calculate tcp header length
//value to multiply *4
//e.g. tcp->offset = 5 ; TCP Header Length = 5 x 4 byte = 20 byte
tcp_header_length = tcp->offset << 2; //SHL 2 -> *4 multiply
	//calculate payload offset and length
payload_offset = ETH_HLEN + ip_header_length + tcp_header_length;
payload_length = ip->tlen - ip_header_length - tcp_header_length;
//http://stackoverflow.com/questions/25047905/http-request-minimum-size-in-bytes
	//minimum length of a valid http request is always greater than 7 bytes
	//avoid invalid memory access
	//this check also drops packets with an empty payload
if(payload_length < 7) {
goto DROP;
}
	//load first 7 bytes of payload into p (payload_array)
	//direct access to the skb is not allowed
unsigned long p[7];
int i = 0;
int j = 0;
for (i = payload_offset ; i < (payload_offset + 7) ; i++) {
p[j] = load_byte(skb , i);
j++;
}
//find a match with an HTTP message
//HTTP
if ((p[0] == 'H') && (p[1] == 'T') && (p[2] == 'T') && (p[3] == 'P')) {
goto KEEP;
}
//GET
if ((p[0] == 'G') && (p[1] == 'E') && (p[2] == 'T')) {
goto KEEP;
}
//POST
if ((p[0] == 'P') && (p[1] == 'O') && (p[2] == 'S') && (p[3] == 'T')) {
goto KEEP;
}
//PUT
if ((p[0] == 'P') && (p[1] == 'U') && (p[2] == 'T')) {
goto KEEP;
}
//DELETE
if ((p[0] == 'D') && (p[1] == 'E') && (p[2] == 'L') && (p[3] == 'E') && (p[4] == 'T') && (p[5] == 'E')) {
goto KEEP;
}
//HEAD
if ((p[0] == 'H') && (p[1] == 'E') && (p[2] == 'A') && (p[3] == 'D')) {
goto KEEP;
}
//no HTTP match
goto DROP;
	//keep the packet and send it to userspace returning -1
KEEP:
return -1;
//drop the packet returning 0
DROP:
return 0;
}
#!/usr/bin/python
#
#Bertrone Matteo - Polytechnic of Turin
#November 2015
#
#eBPF application that parses HTTP packets
#and extracts (and prints on screen) the URL contained in the GET/POST request.
#
#eBPF program http_filter is used as SOCKET_FILTER attached to eth0 interface.
#only packets of type IP and TCP containing HTTP GET/POST are returned to userspace, the others are dropped
#
#python script uses bcc BPF Compiler Collection by iovisor (https://github.com/iovisor/bcc)
#and prints on stdout the first line of the HTTP GET/POST request containing the url
from __future__ import print_function
from bcc import BPF
import sys
import socket
import os
# initialize BPF - load source code from http-parse.c
bpf = BPF(src_file = "http-parse.c",debug = 0)
#load eBPF program http_filter of type SOCKET_FILTER into the kernel eBPF vm
#more info about eBPF program types
#http://man7.org/linux/man-pages/man2/bpf.2.html
function_http_filter = bpf.load_func("http_filter", BPF.SOCKET_FILTER)
#create raw socket, bind it to eth0
#attach bpf program to socket created
BPF.attach_raw_socket(function_http_filter, "eth0")
#get file descriptor of the socket previously created inside BPF.attach_raw_socket
socket_fd = function_http_filter.sock
#create python socket object, from the file descriptor
sock = socket.fromfd(socket_fd,socket.PF_PACKET,socket.SOCK_RAW,socket.IPPROTO_IP)
#set it as blocking socket
sock.setblocking(True)
while 1:
#retrieve raw packet from socket
packet_str = os.read(socket_fd,2048)
#DEBUG - print raw packet in hex format
#packet_hex = toHex(packet_str)
#print ("%s" % packet_hex)
#convert packet into bytearray
packet_bytearray = bytearray(packet_str)
#ethernet header length
ETH_HLEN = 14
#IP HEADER
#https://tools.ietf.org/html/rfc791
# 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# |Version| IHL |Type of Service| Total Length |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#
#IHL : Internet Header Length is the length of the internet header
#value to multiply * 4 byte
#e.g. IHL = 5 ; IP Header Length = 5 * 4 byte = 20 byte
#
#Total length: This 16-bit field defines the entire packet size,
#including header and data, in bytes.
#calculate packet total length
total_length = packet_bytearray[ETH_HLEN + 2] #load MSB
total_length = total_length << 8 #shift MSB
total_length = total_length + packet_bytearray[ETH_HLEN+3] #add LSB
#calculate ip header length
ip_header_length = packet_bytearray[ETH_HLEN] #load Byte
ip_header_length = ip_header_length & 0x0F #mask bits 0..3
ip_header_length = ip_header_length << 2 #shift to obtain length
#TCP HEADER
#https://www.rfc-editor.org/rfc/rfc793.txt
# 12 13 14 15
# 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
# | Data | |U|A|P|R|S|F| |
# | Offset| Reserved |R|C|S|S|Y|I| Window |
# | | |G|K|H|T|N|N| |
# +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
#
#Data Offset: This indicates where the data begins.
#The TCP header is an integral number of 32 bits long.
#value to multiply * 4 byte
#e.g. DataOffset = 5 ; TCP Header Length = 5 * 4 byte = 20 byte
#calculate tcp header length
tcp_header_length = packet_bytearray[ETH_HLEN + ip_header_length + 12] #load Byte
tcp_header_length = tcp_header_length & 0xF0 #mask bit 4..7
tcp_header_length = tcp_header_length >> 2 #SHR 4 ; SHL 2 -> SHR 2
#calculate payload offset
payload_offset = ETH_HLEN + ip_header_length + tcp_header_length
#print first line of the HTTP GET/POST request
	#line ends with 0x0D 0x0A (\r\n)
#(if we want to print all the header print until \r\n\r\n)
	for i in range(payload_offset, len(packet_bytearray)):
if (packet_bytearray[i]== 0x0A):
if (packet_bytearray[i-1] == 0x0D):
break
print ("%c" % chr(packet_bytearray[i]), end = "")
print("")
...@@ -21,5 +21,5 @@ between 128 and 255 Kbytes in size, and another mode of 211 I/O were between
4 and 7 Kbytes in size.
Understanding this distribution is useful for characterizing workloads and
understanding performance. The existence of this distribution is not visible
from averages alone.
...@@ -65,7 +65,7 @@ Show a latency histogram for each disk device separately:
usecs
Microsecond range
.TP
msecs
Millisecond range
.TP
count
...@@ -74,7 +74,7 @@ How many I/O fell into this range
distribution
An ASCII bar chart to visualize the distribution (count column)
.SH OVERHEAD
This traces kernel functions and maintains in-kernel timestamps and a histogram,
which are asynchronously copied to user-space. This method is very efficient,
and the overhead for most storage I/O rates (< 10k IOPS) should be negligible.
If you have a higher IOPS storage environment, test and quantify the overhead
......
.TH biotop 8 "2016-02-06" "USER COMMANDS"
.SH NAME
biotop \- Block device (disk) I/O by process top.
.SH SYNOPSIS
.B biotop [\-h] [\-C] [\-r MAXROWS] [interval] [count]
.SH DESCRIPTION
This is top for disks.
This traces block device I/O (disk I/O), and prints a per-process summary every
interval (by default, 1 second). The summary is sorted on the top disk
consumers by throughput (Kbytes). The PID and process name shown are measured
from when the I/O was first created, which usually identifies the responsible
process.
For efficiency, this uses in-kernel eBPF maps to cache process details (PID and
comm) by I/O request, as well as a starting timestamp for calculating I/O
latency, and the final summary.
This works by tracing various kernel blk_*() functions using dynamic tracing,
and will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-C
Don't clear the screen.
.TP
\-r MAXROWS
Maximum number of rows to print. Default is 20.
.TP
\-p PID
Trace this PID only.
.TP
interval
Interval between updates, seconds.
.TP
count
Number of interval summaries.
.SH EXAMPLES
.TP
Summarize block device I/O by process, 1 second screen refresh:
#
.B biotop
.TP
Don't clear the screen:
#
.B biotop -C
.TP
5 second summaries, 10 times only:
#
.B biotop 5 10
.SH FIELDS
.TP
loadavg:
The contents of /proc/loadavg
.TP
PID
Cached process ID, if present. This usually identifies the responsible process
for the I/O, but isn't guaranteed to.
.TP
COMM
Cached process name, if present. This usually identifies the responsible
process for the I/O, but isn't guaranteed to.
.TP
D
Direction: R == read, W == write.
.TP
MAJ
Major device number.
.TP
MIN
Minor device number.
.TP
DISK
Disk device name.
.TP
I/O
Number of I/O during the interval.
.TP
Kbytes
Total Kbytes for these I/O, during the interval.
.TP
AVGms
Average time for the I/O (latency) from the issue to the device, to its
completion, in milliseconds.
.SH OVERHEAD
Since block device I/O usually has a relatively low frequency (< 10,000/s),
the overhead for this tool is expected to be low or negligible. For high IOPS
storage systems, test and quantify before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH INSPIRATION
top(1) by William LeFebvre
.SH SEE ALSO
biosnoop(8), biolatency(8), iostat(1)
...@@ -34,7 +34,7 @@ Print output every five seconds, three times:
#
.B cachestat 5 3
.TP
Print output with timestamp every five seconds, three times:
#
.B cachestat -T 5 3
.SH FIELDS
......
.TH dcsnoop 8 "2016-02-10" "USER COMMANDS"
.SH NAME
dcsnoop \- Trace directory entry cache (dcache) lookups. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B dcsnoop [\-h] [\-a]
.SH DESCRIPTION
By default, this traces every failed dcache lookup (cache miss), and shows the
process performing the lookup and the filename requested. A \-a option can be
used to show all lookups, not just failed ones.
The output of this tool can be verbose, and is intended for further
investigations of dcache performance beyond dcstat(8), which prints
per-second summaries.
This uses kernel dynamic tracing of the d_lookup() function, and will need
updating to match any changes to this function.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-a
Trace references, not just failed lookups.
.SH EXAMPLES
.TP
Trace failed dcache lookups:
#
.B dcsnoop
.TP
Trace all dcache lookups:
#
.B dcsnoop \-a
.SH FIELDS
.TP
TIME(s)
Time of lookup, in seconds.
.TP
PID
Process ID.
.TP
COMM
Process name.
.TP
T
Type: R == reference (only visible with \-a), M == miss. A miss will print two
lines, one for the reference, and one for the miss.
.TP
FILE
The file name component that was being looked up. This contains trailing
pathname components (after '/'), which will be the subject of subsequent
lookups.
.SH OVERHEAD
File name lookups can be frequent (depending on the workload), and this tool
prints a line for each failed lookup, and with \-a, each reference as well. The
output may be verbose, and the incurred overhead, while optimized to some
extent, may still range from noticeable to significant. This is only really
intended for deeper investigations beyond dcstat(8), when absolutely necessary.
Measure and quantify the overhead in a test environment before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
dcstat(8)
.TH dcstat 8 "2016-02-09" "USER COMMANDS"
.SH NAME
dcstat \- Directory entry cache (dcache) stats. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B dcstat
[interval [count]]
.SH DESCRIPTION
The Linux directory entry cache (dcache) improves the performance of file and
directory name lookups. This tool provides per-second summary statistics of
dcache performance.
This uses kernel dynamic tracing of kernel functions, lookup_fast() and
d_lookup(), which will need to be modified to match kernel changes.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Print summaries each second:
#
.B dcstat
.TP
Print output every five seconds, three times:
#
.B dcstat 5 3
.SH FIELDS
.TP
REFS/s
Number of dcache lookups (references) per second.
.TP
SLOW/s
Number of dcache lookups that failed the lookup_fast() path and executed the
lookup_slow() path instead.
.TP
MISS/s
Number of dcache misses (failed both fast and slow lookups).
.TP
HIT%
Percentage of dcache hits over total references.
.SH OVERHEAD
The overhead depends on the frequency of file and directory name lookups.
While the per-event overhead is low, some applications may make over 100k
lookups per second, and the low per-event overhead will begin to add up, and
could begin to be measurable (over 10% CPU usage). Measure in a test
environment.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
dcsnoop(8)
.TH execsnoop 8 "2016-02-07" "USER COMMANDS"
.SH NAME
execsnoop \- Trace new processes via exec() syscalls. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B execsnoop [\-h] [\-t] [\-x] [\-n NAME]
.SH DESCRIPTION
execsnoop traces new processes, showing the filename executed and argument
list.
It works by tracing the execve() system call (the commonly used exec() variant).
This catches new processes that follow the fork->exec sequence, as well as
processes that re-exec() themselves. Some applications fork() but do not
exec(), eg, for worker processes, which won't be included in the execsnoop
output.
This works by tracing the kernel sys_execve() function using dynamic tracing,
and will need updating to match any changes to this function.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-t
Include a timestamp column.
.TP
\-x
Include failed exec()s.
.TP
\-n NAME
Only print command lines matching this name (regex), matched anywhere.
.SH EXAMPLES
.TP
Trace all exec() syscalls:
#
.B execsnoop
.TP
Trace all exec() syscalls, and include timestamps:
#
.B execsnoop \-t
.TP
Include failed exec()s:
#
.B execsnoop \-x
.TP
Only trace exec()s where the filename or arguments contain "mount":
#
.B execsnoop \-n mount
.SH FIELDS
.TP
TIME(s)
Time of exec() return, in seconds.
.TP
PCOMM
Parent process/command name.
.TP
PID
Process ID
.TP
RET
Return value of exec(). 0 == success. Failures are only shown when using the
\-x option.
.TP
ARGS
Filename for the exec(), followed by up to 19 arguments. An ellipsis "..." is
shown if the argument list is known to be truncated.
.SH OVERHEAD
This traces the kernel execve function and prints output for each event. As the
rate of this is generally expected to be low (< 1000/s), the overhead is also
expected to be negligible. If you have an application that is calling a high
rate of exec()s, then test and understand overhead before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
opensnoop(8)
.TH filelife 8 "2016-02-08" "USER COMMANDS"
.SH NAME
filelife \- Trace the lifespan of short-lived files. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B filelife [\-h] [\-p PID]
.SH DESCRIPTION
This traces the creation and deletion of files, providing information
on who deleted the file, the file age, and the file name. The intent is to
provide information on short-lived files, for debugging or performance
analysis.
This works by tracing the kernel vfs_create() and vfs_unlink() functions using
dynamic tracing, and will need updating to match any changes to these
functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.SH EXAMPLES
.TP
Trace all short-lived files, and print details:
#
.B filelife
.TP
Trace all short-lived files created AND deleted by PID 181:
#
.B filelife \-p 181
.SH FIELDS
.TP
TIME
Time of the deletion.
.TP
PID
Process ID that deleted the file.
.TP
COMM
Process name for the PID.
.TP
AGE(s)
Age of the file, from creation to deletion, in seconds.
.TP
FILE
Filename.
.SH OVERHEAD
This traces the kernel VFS file create and delete functions and prints output
for each delete. As the rate of this is generally expected to be low
(< 1000/s), the overhead is also expected to be negligible.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
opensnoop(8)
.TH fileslower 8 "2016-02-07" "USER COMMANDS"
.SH NAME
fileslower \- Trace slow synchronous file reads and writes.
.SH SYNOPSIS
.B fileslower [\-h] [\-p PID] [min_ms]
.SH DESCRIPTION
This script uses kernel dynamic tracing of synchronous reads and writes
at the VFS interface, to identify slow file reads and writes for any file
system.
This version traces __vfs_read() and __vfs_write(), and only shows
synchronous I/O (the path to new_sync_read() and new_sync_write()), and
I/O with filenames. This approach provides a view of just two file
system request types: file reads and writes. There are typically many others:
asynchronous I/O, directory operations, file handle operations, file open()s,
fflush(), etc.
WARNING: See the OVERHEAD section.
By default, a minimum millisecond threshold of 10 is used.
Since this works by tracing various kernel __vfs_*() functions using dynamic
tracing, it will need updating to match any changes to these functions. A
future version should switch to using FS tracepoints instead.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-p PID
Trace this PID only.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace synchronous file reads and writes slower than 10 ms:
#
.B fileslower
.TP
Trace slower than 1 ms:
#
.B fileslower 1
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B fileslower \-p 181 1
.SH FIELDS
.TP
TIME(s)
Time of I/O completion since the first I/O seen, in seconds.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
D
Direction of I/O. R == read, W == write.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when the application issued it to VFS
to when it completed. This time is inclusive of block device I/O, file system
CPU cycles, file system locks, run queue latency, etc. It's a more accurate
measure of the latency suffered by applications performing file system I/O,
than to measure this down at the block device interface.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.SH OVERHEAD
Depending on the frequency of application reads and writes, overhead can become
severe, in the worst case slowing applications by 2x. In the best case, the
overhead is negligible. Hopefully for real world workloads the overhead is
often at the lower end of the spectrum -- test before use. The reason for
high overhead is that this traces VFS reads and writes, which includes FS
cache reads and writes, and can exceed one million events per second if the
application is I/O heavy. While the instrumentation is extremely lightweight,
and uses in-kernel eBPF maps for efficient timing and filtering, multiply that
cost by one million events per second and that cost becomes a million times
worse. You can get an idea of the possible cost by just counting the
instrumented events using the bcc funccount tool, eg:
.PP
# ./funccount.py -i 1 -r '^__vfs_(read|write)$'
.PP
This also costs overhead, but is somewhat less than fileslower.
.PP
If the overhead is prohibitive for your workload, I'd recommend moving
down-stack a little from VFS into the file system functions (ext4, xfs, etc).
Look for updates to bcc for specific file system tools that do this. The
advantage of a per-file system approach is that we can trace post-cache,
greatly reducing events and overhead. The disadvantage is needing custom
tracing approaches for each different file system (whereas VFS is generic).
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
biosnoop(8), funccount(8)
.TH filetop 8 "2016-02-08" "USER COMMANDS"
.SH NAME
filetop \- File reads and writes by filename and process. Top for files.
.SH SYNOPSIS
.B filetop [\-h] [\-C] [\-r MAXROWS] [\-p PID] [interval] [count]
.SH DESCRIPTION
This is top for files.
This traces file reads and writes, and prints a per-file summary every
interval (by default, 1 second). The summary is sorted on the highest read
throughput (Kbytes).
This uses in-kernel eBPF maps to store per process summaries for efficiency.
This script works by tracing the __vfs_read() and __vfs_write() functions using
kernel dynamic tracing, which instruments explicit read and write calls. If
files are read or written using another means (eg, via mmap()), then they
will not be visible using this tool. Also, this tool will need updating to
match any code changes to those vfs functions.
This should be useful for file system workload characterization when analyzing
the performance of applications.
Note that tracing VFS level reads and writes can be a frequent activity, and
this tool can begin to cost measurable overhead at high I/O rates.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-C
Don't clear the screen.
.TP
\-r MAXROWS
Maximum number of rows to print. Default is 20.
.TP
\-p PID
Trace this PID only.
.TP
interval
Interval between updates, seconds.
.TP
count
Number of interval summaries.
.SH EXAMPLES
.TP
Summarize file I/O by process, 1 second screen refresh:
#
.B filetop
.TP
Don't clear the screen, and top 8 rows only:
#
.B filetop -Cr 8
.TP
5 second summaries, 10 times only:
#
.B filetop 5 10
.SH FIELDS
.TP
loadavg:
The contents of /proc/loadavg
.TP
PID
Process ID.
.TP
COMM
Process name.
.TP
READS
Count of reads during interval.
.TP
WRITES
Count of writes during interval.
.TP
R_Kb
Total read Kbytes during interval.
.TP
W_Kb
Total write Kbytes during interval.
.TP
T
Type of file: R == regular, S == socket, O == other (pipe, etc).
.SH OVERHEAD
Depending on the frequency of application reads and writes, overhead can become
significant, in the worst case slowing applications by over 50%. Hopefully for
real world workloads the overhead is much less -- test before use. The reason
for the high overhead is that VFS reads and writes can be a frequent event, and
despite the eBPF overhead being very small per event, if you multiply this
small overhead by a million events per second, it becomes a million times
worse. Literally. You can gauge the number of reads and writes using the
vfsstat(8) tool, also from bcc.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH INSPIRATION
top(1) by William LeFebvre
.SH SEE ALSO
vfsstat(8), vfscount(8), fileslower(8)
.TH funclatency 8 "2015-08-18" "USER COMMANDS"
.SH NAME
funclatency \- Time kernel functions and print latency as a histogram.
.SH SYNOPSIS
.B funclatency [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-u] [\-m] [\-r] [\-F] pattern
.SH DESCRIPTION
...@@ -88,7 +88,7 @@ Nanosecond range
usecs
Microsecond range
.TP
msecs
Millisecond range
.TP
count
...@@ -97,7 +97,7 @@ How many calls fell into this range
distribution
An ASCII bar chart to visualize the distribution (count column)
.SH OVERHEAD
This traces kernel functions and maintains in-kernel timestamps and a histogram,
which are asynchronously copied to user-space. While this method is very
efficient, the rate of kernel functions can also be very high (>1M/sec), at
which point the overhead is expected to be measurable. Measure in a test
......
...@@ -73,7 +73,7 @@ This traces kernel functions and maintains in-kernel counts, which ...@@ -73,7 +73,7 @@ This traces kernel functions and maintains in-kernel counts, which
are asynchronously copied to user-space. While the rate of interrupts are asynchronously copied to user-space. While the rate of interrupts
can be very high (>1M/sec), this is a relatively efficient way to trace these can be very high (>1M/sec), this is a relatively efficient way to trace these
events, and so the overhead is expected to be small for normal workloads, but events, and so the overhead is expected to be small for normal workloads, but
could become noticable for heavy workloads. Measure in a test environment could become noticeable for heavy workloads. Measure in a test environment
before use. before use.
.SH SOURCE .SH SOURCE
This is from bcc. This is from bcc.
......
.TH memleak 8 "2016-01-14" "USER COMMANDS"
.SH NAME
memleak \- Print a summary of outstanding allocations and their call stacks to detect memory leaks. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND] [-s SAMPLE_RATE]
[-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
.SH DESCRIPTION
memleak traces and matches memory allocation and deallocation requests, and
collects call stacks for each allocation. memleak can then print a summary
of which call stacks performed allocations that weren't subsequently freed.
When tracing a specific process, memleak instruments malloc and free from libc.
When tracing all processes, memleak instruments kmalloc and kfree.
memleak may introduce significant overhead when tracing processes that allocate
and free many blocks very quickly. See the OVERHEAD section below.
The stack depth is limited to 10 by default (+1 for the current instruction pointer),
but it can be controlled using the \-d switch if deeper stacks are required.
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-p PID
Trace this process ID only (filtered in-kernel). This traces malloc and free from libc.
.TP
\-t
Print a trace of all allocation and free requests and results.
.TP
\-a
Print a list of allocations that weren't freed (and their sizes) in addition to their call stacks.
.TP
\-o OLDER
Print only allocations older than OLDER milliseconds. Useful to remove false positives.
The default value is 500 milliseconds.
.TP
\-c COMMAND
Run the specified command and trace its allocations only. This traces malloc and free from libc.
.TP
\-s SAMPLE_RATE
Record roughly every SAMPLE_RATE-th allocation to reduce overhead.
.TP
\-d STACK_DEPTH
Capture STACK_DEPTH frames (or fewer) when obtaining allocation call stacks.
The default value is 10.
.TP
\-T TOP
Print only the top TOP stacks (sorted by size).
The default value is 10.
.TP
\-z MIN_SIZE
Capture only allocations that are larger than or equal to MIN_SIZE bytes.
.TP
\-Z MAX_SIZE
Capture only allocations that are smaller than or equal to MAX_SIZE bytes.
.TP
INTERVAL
Print a summary of outstanding allocations and their call stacks every INTERVAL seconds.
The default interval is 5 seconds.
.TP
COUNT
Print the outstanding allocations summary COUNT times and then exit.
.SH EXAMPLES
.TP
Print outstanding kernel allocation stacks every 3 seconds:
#
.B memleak 3
.TP
Print outstanding user allocation stacks and allocation details for process 1005:
#
.B memleak -p 1005 -a
.TP
Sample roughly every 5th allocation (~20%), and print the top 5 stacks, 10
times before quitting:
#
.B memleak -s 5 --top=5 10
.TP
Run ./allocs and print outstanding allocation stacks for that process:
#
.B memleak -c "./allocs"
.TP
Capture only allocations between 16 and 32 bytes in size:
#
.B memleak -z 16 -Z 32
.SH OVERHEAD
memleak can have significant overhead if the target process or kernel performs
allocations at a very high rate. Pathological cases may exhibit up to 100x
degradation in running time. Most of the time, however, memleak shouldn't cause
a significant slowdown. You can use the \-s switch to reduce the overhead
further by capturing only every N-th allocation. The \-z and \-Z switches can
also reduce overhead by capturing only allocations of specific sizes.
To determine the rate at which your application is calling malloc/free, or the
rate at which your kernel is calling kmalloc/kfree, place a probe with perf and
collect statistics. For example, to determine how many calls to __kmalloc are
placed in a typical period of 10 seconds:
#
.B perf probe '__kmalloc'
#
.B perf stat -a -e 'probe:__kmalloc' -- sleep 10
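.PP
The probe can be removed afterwards (a sketch; see perf-probe(1) for the exact
syntax on your system):
.PP
#
.B perf probe --del __kmalloc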
Another setting that may help reduce overhead is lowering the number of stack
frames captured and parsed by memleak for each allocation, using the \-d switch.
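.PP
Putting these together, a lower-overhead invocation might look like the
following (a sketch only; tune the sample rate and stack depth to your
workload):
.PP
#
.B memleak -p 1005 -s 10 -d 5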
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Sasha Goldshtein
.TH oomkill 8 "2016-02-09" "USER COMMANDS"
.SH NAME
oomkill \- Trace oom_kill_process(). Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B oomkill
.SH DESCRIPTION
This traces the kernel out-of-memory killer, and prints basic details,
including the system load averages at the time of the OOM kill. This can
provide more context on the system state at the time: was it getting busier
or steady, based on the load averages? This tool may also be useful to
customize for investigations; for example, by adding other task_struct
details at the time of OOM.
This program is also a basic example of eBPF/bcc.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Trace OOM kill events:
#
.B oomkill
.SH FIELDS
.TP
Triggered by ...
The process ID and process name of the task that was running when another task was OOM
killed.
.TP
OOM kill of ...
The process ID and name of the target process that was OOM killed.
.TP
loadavg
Contents of /proc/loadavg. The first three numbers are 1, 5, and 15 minute
load averages (where the average is an exponentially damped moving sum, and
those numbers are constants in the equation); then there is the number of
running tasks, a slash, and the total number of tasks; and then the last number
is the last PID to be created.
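For example, in the hypothetical line "0.33 0.23 0.22 2/148 19691": the 1, 5,
and 15 minute load averages are 0.33, 0.23, and 0.22; 2 of 148 tasks are
running; and 19691 was the most recently created PID.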
.SH OVERHEAD
Negligible.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
memleak(8)
.TH runqlat 8 "2016-02-07" "USER COMMANDS"
.SH NAME
runqlat \- Run queue (scheduler) latency as a histogram.
.SH SYNOPSIS
.B runqlat [\-h] [\-T] [\-m] [\-P] [\-p PID] [interval] [count]
.SH DESCRIPTION
This measures the time a task spends waiting on a run queue (or equivalent
scheduler data structure) for a turn on-CPU, and shows this time as a
histogram. This time should be small, but a task may need to wait its turn due
to CPU load. The higher the CPU load, the longer a task will generally need to
wait its turn.
This tool measures two types of run queue latency:
1. The time from a task being enqueued on a run queue to its context switch
and execution. This traces enqueue_task_*() -> finish_task_switch(),
and instruments the run queue latency after a voluntary context switch.
2. The time from when a task was involuntarily context switched and still
in the runnable state, to when it next executed. This is instrumented
from finish_task_switch() alone.
This tool uses in-kernel eBPF maps for storing timestamps and the histogram,
for efficiency. Despite this, the overhead of this tool may become significant
for some workloads: see the OVERHEAD section.
This works by tracing various kernel scheduler functions using dynamic tracing,
and will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-T
Include timestamps on output.
.TP
\-m
Output histogram in milliseconds.
.TP
\-P
Print a histogram for each PID.
.TP
\-p PID
Only show this PID (filtered in kernel for efficiency).
.TP
interval
Output interval, in seconds.
.TP
count
Number of outputs.
.SH EXAMPLES
.TP
Summarize run queue latency as a histogram:
#
.B runqlat
.TP
Print 1 second summaries, 10 times:
#
.B runqlat 1 10
.TP
Print 1 second summaries, using milliseconds as units for the histogram, and include timestamps on output:
#
.B runqlat \-mT 1
.TP
Trace PID 186 only, 1 second summaries:
#
.B runqlat -p 186 1
.SH FIELDS
.TP
usecs
Microsecond range
.TP
msecs
Millisecond range
.TP
count
How many times a task event fell into this range
.TP
distribution
An ASCII bar chart to visualize the distribution (count column)
.SH OVERHEAD
This traces scheduler functions, which can become very frequent. While eBPF
has very low overhead, and this tool uses in-kernel maps for efficiency, the
frequency of scheduler events for some workloads may be high enough that the
overhead of this tool becomes significant. Measure in a lab environment
to quantify the overhead before use.
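.PP
As a rough guide, the scheduler event rate can be estimated beforehand from
the context switch ("cs") column of vmstat(8):
.PP
#
.B vmstat 1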
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
pidstat(1)
...@@ -74,7 +74,7 @@ This traces kernel functions and maintains in-kernel counts, which ...@@ -74,7 +74,7 @@ This traces kernel functions and maintains in-kernel counts, which
are asynchronously copied to user-space. While the rate of interrupts are asynchronously copied to user-space. While the rate of interrupts
can be very high (>1M/sec), this is a relatively efficient way to trace these can be very high (>1M/sec), this is a relatively efficient way to trace these
events, and so the overhead is expected to be small for normal workloads, but events, and so the overhead is expected to be small for normal workloads, but
could become noticable for heavy workloads. Measure in a test environment could become noticeable for heavy workloads. Measure in a test environment
before use. before use.
.SH SOURCE .SH SOURCE
This is from bcc. This is from bcc.
......
...@@ -7,7 +7,7 @@ stackcount \- Count kernel function calls and their stack traces. Uses Linux eBP ...@@ -7,7 +7,7 @@ stackcount \- Count kernel function calls and their stack traces. Uses Linux eBP
stackcount traces kernel functions and frequency counts them with their entire stackcount traces kernel functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. This allows higher kernel stack trace, summarized in-kernel for efficiency. This allows higher
frequency events to be studied. The output consists of unique stack traces, frequency events to be studied. The output consists of unique stack traces,
and their occurance counts. and their occurrence counts.
The pattern is a string with optional '*' wildcards, similar to file globbing. The pattern is a string with optional '*' wildcards, similar to file globbing.
If you'd prefer to use regular expressions, use the \-r option. If you'd prefer to use regular expressions, use the \-r option.
......
...@@ -56,7 +56,7 @@ Time of the call, in seconds. ...@@ -56,7 +56,7 @@ Time of the call, in seconds.
STACK STACK
Kernel stack trace. The first column shows "ip" for instruction pointer, and Kernel stack trace. The first column shows "ip" for instruction pointer, and
"r#" for each return pointer in the stack. The second column is the stack trace "r#" for each return pointer in the stack. The second column is the stack trace
as hexidecimal. The third column is the translated kernel symbol names. as hexadecimal. The third column is the translated kernel symbol names.
.SH OVERHEAD .SH OVERHEAD
This can have significant overhead if frequently called functions (> 1000/s) are This can have significant overhead if frequently called functions (> 1000/s) are
traced, and is only intended for low frequency function calls. This is because traced, and is only intended for low frequency function calls. This is because
......
.TH statsnoop 8 "2016-02-08" "USER COMMANDS"
.SH NAME
statsnoop \- Trace stat() syscalls. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B statsnoop [\-h] [\-t] [\-x] [\-p PID]
.SH DESCRIPTION
statsnoop traces the different stat() syscalls, showing which processes are
attempting to read information about which files. This can be useful for
determining the location of config and log files, or for troubleshooting
applications that are failing, especially on startup.
This works by tracing various kernel sys_stat() functions using dynamic
tracing, and will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-t
Include a timestamp column.
.TP
\-x
Only print failed stats.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.SH EXAMPLES
.TP
Trace all stat() syscalls:
#
.B statsnoop
.TP
Trace all stat() syscalls, and include timestamps:
#
.B statsnoop \-t
.TP
Trace only stat() syscalls that failed:
#
.B statsnoop \-x
.TP
Trace PID 181 only:
#
.B statsnoop \-p 181
.SH FIELDS
.TP
TIME(s)
Time of the call, in seconds.
.TP
PID
Process ID
.TP
COMM
Process name
.TP
FD
Return value of the stat() syscall: 0 (if success), or -1 (if failed)
.TP
ERR
Error number (see the system's errno.h)
.TP
PATH
Path being stat()ed
.SH OVERHEAD
This traces the kernel stat function and prints output for each event. As the
rate of this is generally expected to be low (< 1000/s), the overhead is also
expected to be negligible. If you have an application that calls stat() at a
high rate, then test and understand this overhead before use.
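.PP
One way to gauge the rate beforehand is to count calls to the stat variants
using funccount(8), also from bcc (a sketch; kernel function names vary by
version):
.PP
#
.B funccount 'sys_*stat*'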
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
opensnoop(8)
...@@ -55,7 +55,7 @@ install(TARGETS bcc LIBRARY COMPONENT libbcc ...@@ -55,7 +55,7 @@ install(TARGETS bcc LIBRARY COMPONENT libbcc
install(DIRECTORY export/ COMPONENT libbcc install(DIRECTORY export/ COMPONENT libbcc
DESTINATION share/bcc/include/bcc DESTINATION share/bcc/include/bcc
FILES_MATCHING PATTERN "*.h") FILES_MATCHING PATTERN "*.h")
install(FILES bpf_common.h ../libbpf.h COMPONENT libbcc install(FILES bpf_common.h bpf_module.h ../libbpf.h COMPONENT libbcc
DESTINATION include/bcc) DESTINATION include/bcc)
install(DIRECTORY compat/linux/ COMPONENT libbcc install(DIRECTORY compat/linux/ COMPONENT libbcc
DESTINATION include/bcc/compat/linux DESTINATION include/bcc/compat/linux
......
...@@ -148,6 +148,19 @@ static void parse_type(IRBuilder<> &B, vector<Value *> *args, string *fmt, ...@@ -148,6 +148,19 @@ static void parse_type(IRBuilder<> &B, vector<Value *> *args, string *fmt,
*fmt += " "; *fmt += " ";
} }
*fmt += "}"; *fmt += "}";
} else if (ArrayType *at = dyn_cast<ArrayType>(type)) {
*fmt += "[ ";
for (size_t i = 0; i < at->getNumElements(); ++i) {
parse_type(B, args, fmt, at->getElementType(), B.CreateStructGEP(type, out, i), is_writer);
*fmt += " ";
}
*fmt += "]";
} else if (PointerType *pt = dyn_cast<PointerType>(type)) {
*fmt += "0xl";
if (is_writer)
*fmt += "x";
else
*fmt += "i";
} else if (IntegerType *it = dyn_cast<IntegerType>(type)) { } else if (IntegerType *it = dyn_cast<IntegerType>(type)) {
if (is_writer) if (is_writer)
*fmt += "0x"; *fmt += "0x";
......
...@@ -122,6 +122,8 @@ static int (*bpf_probe_read)(void *dst, u64 size, void *unsafe_ptr) = ...@@ -122,6 +122,8 @@ static int (*bpf_probe_read)(void *dst, u64 size, void *unsafe_ptr) =
(void *) BPF_FUNC_probe_read; (void *) BPF_FUNC_probe_read;
static u64 (*bpf_ktime_get_ns)(void) = static u64 (*bpf_ktime_get_ns)(void) =
(void *) BPF_FUNC_ktime_get_ns; (void *) BPF_FUNC_ktime_get_ns;
static u32 (*bpf_get_prandom_u32)(void) =
(void *) BPF_FUNC_get_prandom_u32;
static int (*bpf_trace_printk_)(const char *fmt, u64 fmt_size, ...) = static int (*bpf_trace_printk_)(const char *fmt, u64 fmt_size, ...) =
(void *) BPF_FUNC_trace_printk; (void *) BPF_FUNC_trace_printk;
int bpf_trace_printk(const char *fmt, ...) asm("llvm.bpf.extra"); int bpf_trace_printk(const char *fmt, ...) asm("llvm.bpf.extra");
......
...@@ -113,7 +113,7 @@ several types of hooks available: ...@@ -113,7 +113,7 @@ several types of hooks available:
through the socket/interface. through the socket/interface.
EBPF programs can be used for many purposes; the main use cases are EBPF programs can be used for many purposes; the main use cases are
dynamic tracing and monitoring, and packet procesisng. We are mostly dynamic tracing and monitoring, and packet processing. We are mostly
interested in the latter use case in this document. interested in the latter use case in this document.
#### EBPF Tables #### EBPF Tables
...@@ -219,7 +219,7 @@ very complex packet filters and simple packet forwarding engines. In ...@@ -219,7 +219,7 @@ very complex packet filters and simple packet forwarding engines. In
the spirit of open-source "release early, release often", we expect the spirit of open-source "release early, release often", we expect
that the compiler's capabilities will improve gradually. that the compiler's capabilities will improve gradually.
* Packet filtering is peformed using the `drop()` action. Packets * Packet filtering is performed using the `drop()` action. Packets
that are not dropped will be forwarded. that are not dropped will be forwarded.
* Packet forwarding is performed by setting the * Packet forwarding is performed by setting the
...@@ -233,7 +233,7 @@ Here are some limitations imposed on the P4 programs: ...@@ -233,7 +233,7 @@ Here are some limitations imposed on the P4 programs:
EBPF program). In the future the compiler should probably generate EBPF program). In the future the compiler should probably generate
two separate EBPF programs. two separate EBPF programs.
* arbirary parsers can be compiled, but the BCC compiler will reject * arbitrary parsers can be compiled, but the BCC compiler will reject
parsers that contain cycles parsers that contain cycles
* arithmetic on data wider than 32 bits is not supported * arithmetic on data wider than 32 bits is not supported
...@@ -311,7 +311,7 @@ p4toEbpf.py file.p4 -o file.c ...@@ -311,7 +311,7 @@ p4toEbpf.py file.p4 -o file.c
The P4 compiler first runs the C preprocessor on the input P4 file. The P4 compiler first runs the C preprocessor on the input P4 file.
Some of the command-line options are passed directly to the Some of the command-line options are passed directly to the
preprocesor. preprocessor.
The following compiler options are available: The following compiler options are available:
......
...@@ -286,6 +286,30 @@ class BPF(object): ...@@ -286,6 +286,30 @@ class BPF(object):
if res < 0: if res < 0:
raise KeyError raise KeyError
# override the MutableMapping's implementation of these since they
# don't handle KeyError nicely
def itervalues(self):
for key in self:
# a map entry may be deleted in between discovering the key and
# fetching the value, suppress such errors
try:
yield self[key]
except KeyError:
pass
def iteritems(self):
for key in self:
try:
yield (key, self[key])
except KeyError:
pass
def items(self):
return [item for item in self.iteritems()]
def values(self):
return [value for value in self.itervalues()]
def clear(self): def clear(self):
# default clear uses popitem, which can race with the bpf prog # default clear uses popitem, which can race with the bpf prog
for k in self.keys(): for k in self.keys():
......
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
# Licensed under the Apache License, Version 2.0 (the "License") # Licensed under the Apache License, Version 2.0 (the "License")
from bcc import BPF from bcc import BPF
import ctypes
from unittest import main, TestCase from unittest import main, TestCase
class TestClang(TestCase): class TestClang(TestCase):
...@@ -89,6 +90,22 @@ int foo(void *ctx) { ...@@ -89,6 +90,22 @@ int foo(void *ctx) {
self.assertEqual(l.s.a, 5) self.assertEqual(l.s.a, 5)
self.assertEqual(l.s.b, 6) self.assertEqual(l.s.b, 6)
def test_sscanf_array(self):
text = """
BPF_TABLE("hash", int, struct { u32 a[3]; u32 b; }, stats, 10);
"""
b = BPF(text=text, debug=0)
t = b.get_table("stats")
s1 = t.key_sprintf(t.Key(2))
self.assertEqual(s1, b"0x2")
s2 = t.leaf_sprintf(t.Leaf((ctypes.c_uint * 3)(1,2,3), 4))
self.assertEqual(s2, b"{ [ 0x1 0x2 0x3 ] 0x4 }")
l = t.leaf_scanf(s2)
self.assertEqual(l.a[0], 1)
self.assertEqual(l.a[1], 2)
self.assertEqual(l.a[2], 3)
self.assertEqual(l.b, 4)
def test_iosnoop(self): def test_iosnoop(self):
text = """ text = """
#include <linux/blkdev.h> #include <linux/blkdev.h>
......
...@@ -41,7 +41,7 @@ the last row printed, for which there were 2 I/O. ...@@ -41,7 +41,7 @@ the last row printed, for which there were 2 I/O.
For efficiency, biolatency uses an in-kernel eBPF map to store timestamps For efficiency, biolatency uses an in-kernel eBPF map to store timestamps
with requests, and another in-kernel map to store the histogram (the "count") with requests, and another in-kernel map to store the histogram (the "count")
column, which is copied to user-space only when output is printed. These column, which is copied to user-space only when output is printed. These
methods lower the perormance overhead when tracing is performed. methods lower the performance overhead when tracing is performed.
In the following example, the -m option is used to print a histogram using In the following example, the -m option is used to print a histogram using
......
...@@ -21,23 +21,21 @@ b = BPF(text=""" ...@@ -21,23 +21,21 @@ b = BPF(text="""
#include <linux/blkdev.h> #include <linux/blkdev.h>
struct val_t { struct val_t {
u32 pid;
char name[TASK_COMM_LEN]; char name[TASK_COMM_LEN];
}; };
BPF_HASH(start, struct request *); BPF_HASH(start, struct request *);
BPF_HASH(pidbyreq, struct request *, u32); BPF_HASH(infobyreq, struct request *, struct val_t);
BPF_HASH(commbyreq, struct request *, struct val_t);
// cache PID and comm by-req // cache PID and comm by-req
int trace_pid_start(struct pt_regs *ctx, struct request *req) int trace_pid_start(struct pt_regs *ctx, struct request *req)
{ {
u32 pid;
struct val_t val = {}; struct val_t val = {};
pid = bpf_get_current_pid_tgid();
pidbyreq.update(&req, &pid);
if (bpf_get_current_comm(&val.name, sizeof(val.name)) == 0) { if (bpf_get_current_comm(&val.name, sizeof(val.name)) == 0) {
commbyreq.update(&req, &val); val.pid = bpf_get_current_pid_tgid();
infobyreq.update(&req, &val);
} }
return 0; return 0;
...@@ -74,12 +72,11 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req) ...@@ -74,12 +72,11 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
// As bpf_trace_prink() is limited to a maximum of 1 string and 2 // As bpf_trace_prink() is limited to a maximum of 1 string and 2
// integers, we'll use more than one to output the data. // integers, we'll use more than one to output the data.
// //
valp = commbyreq.lookup(&req); valp = infobyreq.lookup(&req);
pidp = pidbyreq.lookup(&req); if (valp == 0) {
if (pidp == 0 || valp == 0) {
bpf_trace_printk("0 0 ? %d\\n", req->__data_len); bpf_trace_printk("0 0 ? %d\\n", req->__data_len);
} else { } else {
bpf_trace_printk("0 %d %s %d\\n", *pidp, valp->name, bpf_trace_printk("0 %d %s %d\\n", valp->pid, valp->name,
req->__data_len); req->__data_len);
} }
...@@ -93,8 +90,7 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req) ...@@ -93,8 +90,7 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
} }
start.delete(&req); start.delete(&req);
pidbyreq.delete(&req); infobyreq.delete(&req);
commbyreq.delete(&req);
return 0; return 0;
} }
......
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# biotop block device (disk) I/O by process.
# For Linux, uses BCC, eBPF.
#
# USAGE: biotop.py [-h] [-C] [-r MAXROWS] [interval] [count]
#
# This uses in-kernel eBPF maps to cache process details (PID and comm) by I/O
# request, as well as a starting timestamp for calculating I/O latency.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 06-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
from subprocess import call
# arguments
examples = """examples:
./biotop # block device I/O top, 1 second refresh
./biotop -C # don't clear the screen
./biotop 5 # 5 second summaries
./biotop 5 10 # 5 second summaries, 10 times only
"""
parser = argparse.ArgumentParser(
description="Block device (disk) I/O by process",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-C", "--noclear", action="store_true",
help="don't clear the screen")
parser.add_argument("-r", "--maxrows", default=20,
help="maximum rows to print, default 20")
parser.add_argument("interval", nargs="?", default=1,
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
interval = int(args.interval)
countdown = int(args.count)
maxrows = int(args.maxrows)
clear = not int(args.noclear)
# linux stats
loadavg = "/proc/loadavg"
diskstats = "/proc/diskstats"
# signal handler
def signal_ignore(signal, frame):
print()
# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
// for saving process info by request
struct who_t {
u32 pid;
char name[TASK_COMM_LEN];
};
// the key for the output summary
struct info_t {
u32 pid;
int type;
int major;
int minor;
char name[TASK_COMM_LEN];
};
// the value of the output summary
struct val_t {
u64 bytes;
u64 us;
u32 io;
};
BPF_HASH(start, struct request *);
BPF_HASH(whobyreq, struct request *, struct who_t);
BPF_HASH(counts, struct info_t, struct val_t);
// cache PID and comm by-req
int trace_pid_start(struct pt_regs *ctx, struct request *req)
{
struct who_t who = {};
if (bpf_get_current_comm(&who.name, sizeof(who.name)) == 0) {
who.pid = bpf_get_current_pid_tgid();
whobyreq.update(&req, &who);
}
return 0;
}
// time block I/O
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
u64 ts;
ts = bpf_ktime_get_ns();
start.update(&req, &ts);
return 0;
}
// output
int trace_req_completion(struct pt_regs *ctx, struct request *req)
{
u64 *tsp;
// fetch timestamp and calculate delta
tsp = start.lookup(&req);
if (tsp == 0) {
return 0; // missed tracing issue
}
struct who_t *whop;
struct val_t *valp, zero = {};
u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
// setup info_t key
struct info_t info = {};
info.major = req->rq_disk->major;
info.minor = req->rq_disk->first_minor;
info.type = req->cmd_flags & REQ_WRITE;
whop = whobyreq.lookup(&req);
if (whop == 0) {
// missed pid who, save stats as pid 0
valp = counts.lookup_or_init(&info, &zero);
} else {
info.pid = whop->pid;
__builtin_memcpy(&info.name, whop->name, sizeof(info.name));
valp = counts.lookup_or_init(&info, &zero);
}
// save stats
valp->us += delta_us;
valp->bytes += req->__data_len;
valp->io++;
start.delete(&req);
whobyreq.delete(&req);
return 0;
}
""", debug=0)
b.attach_kprobe(event="blk_account_io_start", fn_name="trace_pid_start")
b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_completion",
fn_name="trace_req_completion")
print('Tracing... Output every %d secs. Hit Ctrl-C to end' % interval)
# cache disk major,minor -> diskname
disklookup = {}
with open(diskstats) as stats:
for line in stats:
a = line.split()
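        # /proc/diskstats fields begin: major, minor, device name, ...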
disklookup[a[0] + "," + a[1]] = a[2]
# output
exiting = 0
while 1:
try:
sleep(interval)
    except KeyboardInterrupt:
        exiting = 1
        # as cleanup can take many seconds, trap Ctrl-C:
        signal.signal(signal.SIGINT, signal_ignore)
# header
if clear:
call("clear")
else:
print()
with open(loadavg) as stats:
print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
print("%-6s %-16s %1s %-3s %-3s %-8s %5s %7s %6s" % ("PID", "COMM",
"D", "MAJ", "MIN", "DISK", "I/O", "Kbytes", "AVGms"))
# by-PID output
counts = b.get_table("counts")
line = 0
for k, v in reversed(sorted(counts.items(),
key=lambda counts: counts[1].bytes)):
# lookup disk
disk = str(k.major) + "," + str(k.minor)
if disk in disklookup:
diskname = disklookup[disk]
else:
diskname = "?"
# print line
avg_ms = (float(v.us) / 1000) / v.io
print("%-6d %-16s %1s %-3d %-3d %-8s %5s %7s %6.2f" % (k.pid, k.name,
"W" if k.type else "R", k.major, k.minor, diskname, v.io,
v.bytes / 1024, avg_ms))
line += 1
if line >= maxrows:
break
counts.clear()
countdown -= 1
if exiting or countdown == 0:
print("Detaching...")
exit()
Demonstrations of biotop, the Linux eBPF/bcc version.
Short for block device I/O top, biotop summarizes which processes are
performing disk I/O. It's top for disks. Sample output:
# ./biotop
Tracing... Output every 1 secs. Hit Ctrl-C to end
08:04:11 loadavg: 1.48 0.87 0.45 1/287 14547
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
14501 cksum R 202 1 xvda1 361 28832 3.39
6961 dd R 202 1 xvda1 1628 13024 0.59
13855 dd R 202 1 xvda1 1627 13016 0.59
326 jbd2/xvda1-8 W 202 1 xvda1 3 168 3.00
1880 supervise W 202 1 xvda1 2 8 6.71
1873 supervise W 202 1 xvda1 2 8 2.51
1871 supervise W 202 1 xvda1 2 8 1.57
1876 supervise W 202 1 xvda1 2 8 1.22
1892 supervise W 202 1 xvda1 2 8 0.62
1878 supervise W 202 1 xvda1 2 8 0.78
1886 supervise W 202 1 xvda1 2 8 1.30
1894 supervise W 202 1 xvda1 2 8 3.46
1869 supervise W 202 1 xvda1 2 8 0.73
1888 supervise W 202 1 xvda1 2 8 1.48
By default the screen refreshes every 1 second, and shows the top 20 disk
consumers, sorted on total Kbytes. The first line printed is the header,
which has the time and then the contents of /proc/loadavg.
For the interval summarized by the output above, the "cksum" command performed
361 disk reads to the "xvda1" device, for a total of 28832 Kbytes, with an
average I/O time of 3.39 ms. Two "dd" processes were also reading from the
same disk, with a higher I/O rate and lower latency. While the average I/O
size is not printed, it can be determined by dividing the Kbytes column by
the I/O column.
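For example, taking the cksum line above: 28832 Kbytes / 361 I/O comes to
roughly 80 Kbytes per I/O, on average.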
The columns through to Kbytes show the workload applied. The final column,
AVGms, shows resulting performance. Other bcc tools can be used to get more
details when needed: biolatency and biosnoop.
Many years ago I created the original "iotop", and later regretted not calling
it diskiotop or blockiotop, as "io" alone is ambiguous. This time it is biotop.
The -C option can be used to prevent the screen from clearing (my preference).
Here's using it with a 5 second interval:
# ./biotop -C 5
Tracing... Output every 5 secs. Hit Ctrl-C to end
08:09:44 loadavg: 0.42 0.44 0.39 2/282 22115
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
22069 dd R 202 1 xvda1 5993 47976 0.33
326 jbd2/xvda1-8 W 202 1 xvda1 3 168 2.67
1866 svscan R 202 1 xvda1 33 132 1.24
1880 supervise W 202 1 xvda1 10 40 0.56
1873 supervise W 202 1 xvda1 10 40 0.79
1871 supervise W 202 1 xvda1 10 40 0.78
1876 supervise W 202 1 xvda1 10 40 0.68
1892 supervise W 202 1 xvda1 10 40 0.71
1878 supervise W 202 1 xvda1 10 40 0.65
1886 supervise W 202 1 xvda1 10 40 0.78
1894 supervise W 202 1 xvda1 10 40 0.80
1869 supervise W 202 1 xvda1 10 40 0.91
1888 supervise W 202 1 xvda1 10 40 0.63
22069 bash R 202 1 xvda1 1 16 19.94
9251 kworker/u16:2 W 202 16 xvdb 2 8 0.13
08:09:49 loadavg: 0.47 0.44 0.39 1/282 22231
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
22069 dd R 202 1 xvda1 13450 107600 0.35
22199 cksum R 202 1 xvda1 941 45548 4.63
326 jbd2/xvda1-8 W 202 1 xvda1 3 168 2.93
24467 kworker/0:2 W 202 16 xvdb 1 64 0.28
1880 supervise W 202 1 xvda1 10 40 0.81
1873 supervise W 202 1 xvda1 10 40 0.81
1871 supervise W 202 1 xvda1 10 40 1.03
1876 supervise W 202 1 xvda1 10 40 0.76
1892 supervise W 202 1 xvda1 10 40 0.74
1878 supervise W 202 1 xvda1 10 40 0.94
1886 supervise W 202 1 xvda1 10 40 0.76
1894 supervise W 202 1 xvda1 10 40 0.69
1869 supervise W 202 1 xvda1 10 40 0.72
1888 supervise W 202 1 xvda1 10 40 1.70
22199 bash R 202 1 xvda1 2 20 0.35
482 xfsaild/md0 W 202 16 xvdb 5 13 0.27
482 xfsaild/md0 W 202 32 xvdc 2 8 0.33
31331 pickup R 202 1 xvda1 1 4 0.31
08:09:54 loadavg: 0.51 0.45 0.39 2/282 22346
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
22069 dd R 202 1 xvda1 14689 117512 0.32
326 jbd2/xvda1-8 W 202 1 xvda1 3 168 2.33
1880 supervise W 202 1 xvda1 10 40 0.65
1873 supervise W 202 1 xvda1 10 40 1.08
1871 supervise W 202 1 xvda1 10 40 0.66
1876 supervise W 202 1 xvda1 10 40 0.79
1892 supervise W 202 1 xvda1 10 40 0.67
1878 supervise W 202 1 xvda1 10 40 0.66
1886 supervise W 202 1 xvda1 10 40 1.02
1894 supervise W 202 1 xvda1 10 40 0.88
1869 supervise W 202 1 xvda1 10 40 0.89
1888 supervise W 202 1 xvda1 10 40 1.25
08:09:59 loadavg: 0.55 0.46 0.40 2/282 22461
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
22069 dd R 202 1 xvda1 14442 115536 0.33
326 jbd2/xvda1-8 W 202 1 xvda1 3 168 3.46
1880 supervise W 202 1 xvda1 10 40 0.87
1873 supervise W 202 1 xvda1 10 40 0.87
1871 supervise W 202 1 xvda1 10 40 0.78
1876 supervise W 202 1 xvda1 10 40 0.86
1892 supervise W 202 1 xvda1 10 40 0.89
1878 supervise W 202 1 xvda1 10 40 0.87
1886 supervise W 202 1 xvda1 10 40 0.86
1894 supervise W 202 1 xvda1 10 40 1.06
1869 supervise W 202 1 xvda1 10 40 1.12
1888 supervise W 202 1 xvda1 10 40 0.98
08:10:04 loadavg: 0.59 0.47 0.40 3/282 22576
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
22069 dd R 202 1 xvda1 14179 113432 0.34
326 jbd2/xvda1-8 W 202 1 xvda1 3 168 2.39
1880 supervise W 202 1 xvda1 10 40 0.81
1873 supervise W 202 1 xvda1 10 40 1.02
1871 supervise W 202 1 xvda1 10 40 1.15
1876 supervise W 202 1 xvda1 10 40 1.10
1892 supervise W 202 1 xvda1 10 40 0.77
1878 supervise W 202 1 xvda1 10 40 0.72
1886 supervise W 202 1 xvda1 10 40 0.81
1894 supervise W 202 1 xvda1 10 40 0.86
1869 supervise W 202 1 xvda1 10 40 0.83
1888 supervise W 202 1 xvda1 10 40 0.79
24467 kworker/0:2 R 202 32 xvdc 3 12 0.26
1056 cron R 202 1 xvda1 2 8 0.30
24467 kworker/0:2 R 202 16 xvdb 1 4 0.23
08:10:09 loadavg: 0.54 0.46 0.40 2/281 22668
PID COMM D MAJ MIN DISK I/O Kbytes AVGms
22069 dd R 202 1 xvda1 250 2000 0.34
326 jbd2/xvda1-8 W 202 1 xvda1 3 168 2.40
1880 supervise W 202 1 xvda1 8 32 0.93
1873 supervise W 202 1 xvda1 8 32 0.76
1871 supervise W 202 1 xvda1 8 32 0.60
1876 supervise W 202 1 xvda1 8 32 0.61
1892 supervise W 202 1 xvda1 8 32 0.68
1878 supervise W 202 1 xvda1 8 32 0.90
1886 supervise W 202 1 xvda1 8 32 0.57
1894 supervise W 202 1 xvda1 8 32 0.97
1869 supervise W 202 1 xvda1 8 32 0.69
1888 supervise W 202 1 xvda1 8 32 0.67
This shows another "dd" command reading from xvda1. On this system, various
"supervise" processes do 8 disk writes per second, every second (they are
creating and updating "status" files).
USAGE message:
# ./biotop.py -h
usage: biotop.py [-h] [-C] [-r MAXROWS] [interval] [count]
Block device (disk) I/O by process
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-C, --noclear don't clear the screen
-r MAXROWS, --maxrows MAXROWS
maximum rows to print, default 20
examples:
./biotop # block device I/O top, 1 second refresh
./biotop -C # don't clear the screen
./biotop 5 # 5 second summaries
./biotop 5 10 # 5 second summaries, 10 times only
...@@ -4,8 +4,8 @@ ...@@ -4,8 +4,8 @@
# For Linux, uses BCC, eBPF. See .c file. # For Linux, uses BCC, eBPF. See .c file.
# #
# USAGE: bitesize # USAGE: bitesize
# Ctrl-C will print the partially gathered histogram then exit.
# #
# Ctrl-C will print the partially gathered histogram then exit.
# #
# Copyright (c) 2016 Allan McAleavy # Copyright (c) 2016 Allan McAleavy
# Licensed under the Apache License, Version 2.0 (the "License") # Licensed under the Apache License, Version 2.0 (the "License")
...@@ -41,12 +41,12 @@ int trace_pid_start(struct pt_regs *ctx, struct request *req) ...@@ -41,12 +41,12 @@ int trace_pid_start(struct pt_regs *ctx, struct request *req)
return 0; return 0;
} }
int do_count (struct pt_regs *ctx, struct request *req) int do_count(struct pt_regs *ctx, struct request *req)
{ {
struct val_t *valp; struct val_t *valp;
valp = commbyreq.lookup(&req); valp = commbyreq.lookup(&req);
if ( valp == 0) { if (valp == 0) {
return 0; return 0;
} }
...@@ -72,4 +72,4 @@ dist = b.get_table("dist") ...@@ -72,4 +72,4 @@ dist = b.get_table("dist")
try: try:
sleep(99999999) sleep(99999999)
except KeyboardInterrupt: except KeyboardInterrupt:
dist.print_log2_hist("Kbytes", "Process Name:") dist.print_log2_hist("Kbytes", "Process Name")
Example of BCC tool bitesize.py Examples of bitesize.py, the Linux bcc/eBPF version.
The aim of this tool is to show I/O distribution for requested block sizes, by process name. The aim of this tool is to show I/O distribution for requested block sizes, by process name.
...@@ -6,13 +7,13 @@ The aim of this tool is to show I/O distribution for requested block sizes, by p ...@@ -6,13 +7,13 @@ The aim of this tool is to show I/O distribution for requested block sizes, by p
Tracing... Hit Ctrl-C to end. Tracing... Hit Ctrl-C to end.
^C ^C
Process Name: = 'kworker/u128:1' Process Name = 'kworker/u128:1'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 1 |******************** | 0 -> 1 : 1 |******************** |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
4 -> 7 : 2 |****************************************| 4 -> 7 : 2 |****************************************|
Process Name: = 'bitesize.py' Process Name = 'bitesize.py'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
...@@ -23,7 +24,7 @@ Process Name: = 'bitesize.py' ...@@ -23,7 +24,7 @@ Process Name: = 'bitesize.py'
64 -> 127 : 0 | | 64 -> 127 : 0 | |
128 -> 255 : 1 |****************************************| 128 -> 255 : 1 |****************************************|
Process Name: = 'dd' Process Name = 'dd'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 3 | | 0 -> 1 : 3 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
...@@ -37,13 +38,13 @@ Process Name: = 'dd' ...@@ -37,13 +38,13 @@ Process Name: = 'dd'
512 -> 1023 : 0 | | 512 -> 1023 : 0 | |
1024 -> 2047 : 488 |****************************************| 1024 -> 2047 : 488 |****************************************|
Process Name: = 'jbd2/dm-1-8' Process Name = 'jbd2/dm-1-8'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************| 4 -> 7 : 1 |****************************************|
Process Name: = 'cat' Process Name = 'cat'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 1 | | 0 -> 1 : 1 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
...@@ -55,19 +56,19 @@ Process Name: = 'cat' ...@@ -55,19 +56,19 @@ Process Name: = 'cat'
128 -> 255 : 0 | | 128 -> 255 : 0 | |
256 -> 511 : 1924 |****************************************| 256 -> 511 : 1924 |****************************************|
Process Name: = 'ntpd' Process Name = 'ntpd'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
4 -> 7 : 104 |****************************************| 4 -> 7 : 104 |****************************************|
Process Name: = 'vmtoolsd' Process Name = 'vmtoolsd'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************| 4 -> 7 : 1 |****************************************|
Process Name: = 'bash' Process Name = 'bash'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
...@@ -75,7 +76,7 @@ Process Name: = 'bash' ...@@ -75,7 +76,7 @@ Process Name: = 'bash'
8 -> 15 : 0 | | 8 -> 15 : 0 | |
16 -> 31 : 2 |****************************************| 16 -> 31 : 2 |****************************************|
Process Name: = 'jbd2/sdb-8' Process Name = 'jbd2/sdb-8'
Kbytes : count distribution Kbytes : count distribution
0 -> 1 : 0 | | 0 -> 1 : 0 | |
2 -> 3 : 0 | | 2 -> 3 : 0 | |
...@@ -86,7 +87,3 @@ Process Name: = 'jbd2/sdb-8' ...@@ -86,7 +87,3 @@ Process Name: = 'jbd2/sdb-8'
We can see from above that there was a dd command being run which generated 488 IOPS between 1MB and 2MB; we can also see the We can see from above that there was a dd command being run which generated 488 IOPS between 1MB and 2MB; we can also see the
cat command generating 1924 IOPS between 256Kb and 512Kb. cat command generating 1924 IOPS between 256Kb and 512Kb.
See also systemtap version:
https://github.com/brendangregg/systemtap-lwtools/blob/master/disk/bitesize-nd.stp
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dcsnoop Trace directory entry cache (dcache) lookups.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: dcsnoop [-h] [-a]
#
# By default, this traces every failed dcache lookup, and shows the process
# performing the lookup and the filename requested. A -a option can be used
# to show all lookups, not just failed ones.
#
# This uses kernel dynamic tracing of the d_lookup() function, and will need
# to be modified to match kernel changes.
#
# Also see dcstat(8), for per-second summaries.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import re
# arguments
examples = """examples:
./dcsnoop # trace failed dcache lookups
./dcsnoop -a # trace all dcache lookups
"""
parser = argparse.ArgumentParser(
description="Trace directory entry cache (dcache) lookups",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-a", "--all", action="store_true",
help="trace all lookups (default is fails only)")
args = parser.parse_args()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#define MAX_FILE_LEN 64
struct entry_t {
char name[MAX_FILE_LEN];
};
BPF_HASH(entrybypid, u32, struct entry_t);
/* from fs/namei.c: */
struct nameidata {
struct path path;
struct qstr last;
// [...]
};
int trace_fast(struct pt_regs *ctx, struct nameidata *nd, struct path *path)
{
bpf_trace_printk("R %s\\n", nd->last.name);
return 1;
}
int kprobe__d_lookup(struct pt_regs *ctx, const struct dentry *parent,
const struct qstr *name)
{
u32 pid = bpf_get_current_pid_tgid();
struct entry_t entry = {};
const char *fname = name->name;
if (fname) {
bpf_probe_read(&entry.name, sizeof(entry.name), (void *)fname);
}
entrybypid.update(&pid, &entry);
return 0;
}
int kretprobe__d_lookup(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
struct entry_t *ep;
ep = entrybypid.lookup(&pid);
if (ep == 0) {
return 0; // missed entry
}
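    // d_lookup() returns NULL (0) on a dcache miss; on x86, ctx->ax holds
    // the return value at the kretprobe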
if (ctx->ax == 0) {
bpf_trace_printk("M %s\\n", ep->name);
}
entrybypid.delete(&pid);
return 0;
}
"""
# initialize BPF
b = BPF(text=bpf_text)
if args.all:
b.attach_kprobe(event="lookup_fast", fn_name="trace_fast")
# header
print("%-11s %-6s %-16s %1s %s" % ("TIME(s)", "PID", "COMM", "T", "FILE"))
start_ts = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
try:
(type, file) = msg.split(" ", 1)
except ValueError:
continue
if start_ts == 0:
start_ts = ts
print("%-11.6f %-6s %-16s %1s %s" % (ts - start_ts, pid, task, type, file))
Demonstrations of dcsnoop, the Linux eBPF/bcc version.
dcsnoop traces directory entry cache (dcache) lookups, and can be used for
further investigation beyond dcstat(8). The output is likely to be verbose, as
dcache lookups are typically frequent. By default, only failed lookups are shown.
For example:
# ./dcsnoop.py
TIME(s) PID COMM T FILE
0.002837 1643 snmpd M net/dev
0.002852 1643 snmpd M 1643
0.002856 1643 snmpd M net
0.002863 1643 snmpd M dev
0.002952 1643 snmpd M net/if_inet6
0.002964 1643 snmpd M if_inet6
0.003180 1643 snmpd M net/ipv4/neigh/eth0/retrans_time_ms
0.003192 1643 snmpd M ipv4/neigh/eth0/retrans_time_ms
0.003197 1643 snmpd M neigh/eth0/retrans_time_ms
0.003203 1643 snmpd M eth0/retrans_time_ms
0.003206 1643 snmpd M retrans_time_ms
0.003245 1643 snmpd M ipv6/neigh/eth0/retrans_time_ms
0.003249 1643 snmpd M neigh/eth0/retrans_time_ms
0.003252 1643 snmpd M eth0/retrans_time_ms
0.003255 1643 snmpd M retrans_time_ms
0.003287 1643 snmpd M conf/eth0/forwarding
0.003292 1643 snmpd M eth0/forwarding
0.003295 1643 snmpd M forwarding
0.003326 1643 snmpd M base_reachable_time_ms
[...]
I ran a cache drop at the same time as executing this tool. The output shows
the processes, the type of event ("T" column: M == miss, R == reference),
and the filename for the dcache lookup.
The way the dcache is currently implemented, each component of a path is
checked in turn. The first line, showing "net/dev" from snmpd, will be a lookup
for "net" in a directory (that isn't shown here). If it finds "net", it will
then lookup "dev" inside net. You can see this sequence a little later,
starting at time 0.003180, where a pathname is being searched
directory by directory.
The -a option will show all lookups, although be warned, the output will be
very verbose. For example:
# ./dcsnoop
TIME(s) PID COMM T FILE
0.000000 20279 dcsnoop.py M p_lookup_fast
0.000010 20279 dcsnoop.py M enable
0.000013 20279 dcsnoop.py M id
0.000015 20279 dcsnoop.py M filter
0.000017 20279 dcsnoop.py M trigger
0.000019 20279 dcsnoop.py M format
0.006148 20279 dcsnoop.py R sys/kernel/debug/tracing/trace_pipe
0.006158 20279 dcsnoop.py R kernel/debug/tracing/trace_pipe
0.006161 20279 dcsnoop.py R debug/tracing/trace_pipe
0.006164 20279 dcsnoop.py R tracing/trace_pipe
0.006166 20279 dcsnoop.py R trace_pipe
0.015900 1643 snmpd R proc/sys/net/ipv6/conf/lo/forwarding
0.015901 1643 snmpd R sys/net/ipv6/conf/lo/forwarding
0.015901 1643 snmpd R net/ipv6/conf/lo/forwarding
0.015902 1643 snmpd R ipv6/conf/lo/forwarding
0.015903 1643 snmpd R conf/lo/forwarding
0.015904 1643 snmpd R lo/forwarding
0.015905 1643 snmpd M lo/forwarding
0.015908 1643 snmpd R forwarding
0.015909 1643 snmpd M forwarding
0.015937 1643 snmpd R proc/sys/net/ipv6/neigh/lo/base_reachable_time_ms
0.015937 1643 snmpd R sys/net/ipv6/neigh/lo/base_reachable_time_ms
0.015938 1643 snmpd R net/ipv6/neigh/lo/base_reachable_time_ms
0.015939 1643 snmpd R ipv6/neigh/lo/base_reachable_time_ms
0.015940 1643 snmpd R neigh/lo/base_reachable_time_ms
0.015941 1643 snmpd R lo/base_reachable_time_ms
0.015941 1643 snmpd R base_reachable_time_ms
0.015943 1643 snmpd M base_reachable_time_ms
0.043569 1876 supervise M 20281
0.043573 1886 supervise M 20280
0.043582 1886 supervise R supervise/status.new
[...]
USAGE message:
# ./dcsnoop.py -h
usage: dcsnoop.py [-h] [-a]
Trace directory entry cache (dcache) lookups
optional arguments:
-h, --help show this help message and exit
-a, --all trace all lookups (default is fails only)
examples:
./dcsnoop # trace failed dcache lookups
./dcsnoop -a # trace all dcache lookups
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dcstat Directory entry cache (dcache) stats.
# For Linux, uses BCC, eBPF.
#
# USAGE: dcstat [interval [count]]
#
# This uses kernel dynamic tracing of kernel functions, lookup_fast() and
# d_lookup(), which will need to be modified to match kernel changes. See
# code comments.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from ctypes import c_int
from time import sleep, strftime
from sys import argv
def usage():
print("USAGE: %s [interval [count]]" % argv[0])
exit()
# arguments
interval = 1
count = -1
if len(argv) > 1:
try:
interval = int(argv[1])
if interval == 0:
raise
if len(argv) > 2:
count = int(argv[2])
except: # also catches -h, --help
usage()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
enum stats {
S_REFS = 1,
S_SLOW,
S_MISS,
S_MAXSTAT
};
BPF_TABLE("array", int, u64, stats, S_MAXSTAT + 1);
/*
* How this is instrumented, and how to interpret the statistics, is very much
* tied to the current kernel implementation (this was written on Linux 4.4).
* This will need maintenance to keep working as the implementation changes. To
 * aid future adventurers, this is what the current code does, and why.
*
* First problem: the current implementation takes a path and then does a
* lookup of each component. So how do we count a reference? Once for the path
* lookup, or once for every component lookup? I've chosen the latter
* since it seems to map more closely to actual dcache lookups (via
* __d_lookup_rcu()). It's counted via calls to lookup_fast().
*
* The implementation tries different, progressively slower, approaches to
 * look up a file. At what point do we call it a dcache miss? I've chosen when
* a d_lookup() (which is called during lookup_slow()) returns zero.
*
* I've also included a "SLOW" statistic to show how often the fast lookup
* failed. Whether this exists or is interesting is an implementation detail,
* and the "SLOW" statistic may be removed in future versions.
*/
void count_fast(struct pt_regs *ctx) {
int key = S_REFS;
u64 *leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
}
void count_lookup(struct pt_regs *ctx) {
int key = S_SLOW;
u64 *leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
if (ctx->ax == 0) {
key = S_MISS;
leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
}
}
"""
# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="lookup_fast", fn_name="count_fast")
b.attach_kretprobe(event="d_lookup", fn_name="count_lookup")
# stat column labels and indexes
stats = {
"REFS": 1,
"SLOW": 2,
"MISS": 3
}
# header
print("%-8s " % "TIME", end="")
for stype, idx in sorted(stats.iteritems(), key=lambda (k, v): (v, k)):
print(" %8s" % (stype + "/s"), end="")
print(" %8s" % "HIT%")
# output
i = 0
while (1):
if count > 0:
i += 1
if i > count:
exit()
try:
sleep(interval)
except KeyboardInterrupt:
pass
exit()
print("%-8s: " % strftime("%H:%M:%S"), end="")
# print each statistic as a column
for stype, idx in sorted(stats.iteritems(), key=lambda (k, v): (v, k)):
try:
val = b["stats"][c_int(idx)].value / interval
print(" %8d" % val, end="")
except:
print(" %8d" % 0, end="")
# print hit ratio percentage
try:
ref = b["stats"][c_int(stats["REFS"])].value
miss = b["stats"][c_int(stats["MISS"])].value
hit = ref - miss
pct = float(100) * hit / ref
print(" %8.2f" % pct)
except:
print(" %7s%%" % "-")
b["stats"].clear()
Demonstrations of dcstat, the Linux eBPF/bcc version.
dcstat shows directory entry cache (dcache) statistics. For example:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:11:47: 2059 141 97 95.29
08:11:48: 79974 151 106 99.87
08:11:49: 192874 146 102 99.95
08:11:50: 2051 144 100 95.12
08:11:51: 73373 17239 17194 76.57
08:11:52: 54685 25431 25387 53.58
08:11:53: 18127 8182 8137 55.12
08:11:54: 22517 10345 10301 54.25
08:11:55: 7524 2881 2836 62.31
08:11:56: 2067 141 97 95.31
08:11:57: 2115 145 101 95.22
The output shows the total references per second ("REFS/s"), the number that
took a slower code path to be processed ("SLOW/s"), the number of dcache misses
("MISS/s"), and the hit ratio as a percentage. By default, an interval of 1
second is used.
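HIT% is derived from the other columns as (REFS - MISS) / REFS * 100. Taking
08:11:52 above: (54685 - 25387) / 54685 * 100 = 53.58.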
At 08:11:49, there were 192 thousand references, which almost entirely hit
from the dcache, with a hit ratio of 99.95%. A little later, starting at
08:11:51, a workload began that walked many uncached files, reducing the hit
ratio to 53%, and, more importantly, causing a miss rate of over 10 thousand
per second.
Here's an interesting workload:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:15:53: 250683 141 97 99.96
08:15:54: 266115 145 101 99.96
08:15:55: 268428 141 97 99.96
08:15:56: 260389 143 99 99.96
It's a 99.96% hit ratio, and these are all negative hits: accessing a file that
does not exist. Here's the C program that generated the workload:
# cat -n badopen.c
1 #include <sys/types.h>
2 #include <sys/stat.h>
3 #include <fcntl.h>
4
5 int
6 main(int argc, char *argv[])
7 {
8 int fd;
9 while (1) {
10 fd = open("bad", O_RDONLY);
11 }
12 return 0;
13 }
This is a simple workload generator that tries to open a missing file ("bad")
as quickly as possible.
Let's see what happens if the workload attempts to open a different filename
each time (which is also a missing file), using the following C code:
# cat -n badopen2.c
1 #include <sys/types.h>
2 #include <sys/stat.h>
3 #include <fcntl.h>
4 #include <stdio.h>
5
6 int
7 main(int argc, char *argv[])
8 {
9 int fd, i = 0;
10 char buf[128] = {};
11
12 while (1) {
13 sprintf(buf, "bad%d", i++);
14 fd = open(buf, O_RDONLY);
15 }
16 return 0;
17 }
Here's dcstat:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:18:52: 241131 237544 237505 1.51
08:18:53: 238210 236323 236278 0.82
08:18:54: 235259 233307 233261 0.85
08:18:55: 233144 231256 231214 0.83
08:18:56: 231981 230097 230053 0.83
dcstat also supports an optional interval and optional count. For example,
printing 5 second summaries 3 times:
# ./dcstat 5 3
TIME REFS/s SLOW/s MISS/s HIT%
08:20:03: 2085 143 99 95.23
08:20:08: 2077 143 98 95.24
08:20:14: 2071 144 100 95.15
USAGE message:
# ./dcstat -h
USAGE: ./dcstat [interval [count]]
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# execsnoop Trace new processes via exec() syscalls.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: execsnoop [-h] [-t] [-x] [-n NAME]
#
# This currently will print up to a maximum of 19 arguments, plus the process
# name, so 20 fields in total (MAXARG).
#
# This won't catch all new processes: an application may fork() but not exec().
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 07-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import re
# arguments
examples = """examples:
./execsnoop # trace all exec() syscalls
./execsnoop -x # include failed exec()s
./execsnoop -t # include timestamps
./execsnoop -n main # only print command lines containing "main"
"""
parser = argparse.ArgumentParser(
description="Trace exec() syscalls",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-t", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-x", "--fails", action="store_true",
help="include failed exec()s")
parser.add_argument("-n", "--name",
help="only print commands matching this name (regex), any arg")
args = parser.parse_args()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/fs.h>
#define MAXARG 20
#define ARGSIZE 64
static int print_arg(void *ptr) {
// Fetch an argument, and print using bpf_trace_printk(). This is a work
// around until we have a binary trace interface for passing event data to
// bcc. Since exec()s should be low frequency, the additional overhead in
// this case should not be a problem.
const char *argp = NULL;
char buf[ARGSIZE] = {};
bpf_probe_read(&argp, sizeof(argp), ptr);
if (argp == NULL) return 0;
bpf_probe_read(&buf, sizeof(buf), (void *)(argp));
bpf_trace_printk("ARG %s\\n", buf);
return 1;
}
int kprobe__sys_execve(struct pt_regs *ctx, struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp)
{
char fname[ARGSIZE] = {};
bpf_probe_read(&fname, sizeof(fname), (void *)(filename));
bpf_trace_printk("ARG %s\\n", fname);
int i = 1; // skip first arg, as we printed fname
// unrolled loop to walk argv[] (MAXARG)
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++; // X
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++; // XX
bpf_trace_printk("ARG ...\\n"); // truncated
out:
return 0;
}
int kretprobe__sys_execve(struct pt_regs *ctx)
{
bpf_trace_printk("RET %d\\n", ctx->ax);
return 0;
}
"""
# initialize BPF
b = BPF(text=bpf_text)
# header
if args.timestamp:
print("%-8s" % ("TIME(s)"), end="")
print("%-16s %-6s %3s %s" % ("PCOMM", "PID", "RET", "ARGS"))
start_ts = 0
cmd = {}
pcomm = {}
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
try:
(type, arg) = msg.split(" ", 1)
except ValueError:
continue
if start_ts == 0:
start_ts = ts
if type == "RET":
if pid not in cmd:
# zero args
cmd[pid] = ""
pcomm[pid] = ""
skip = 0
if args.name:
if not re.search(args.name, cmd[pid]):
skip = 1
if not args.fails and int(arg) < 0:
skip = 1
if skip:
del cmd[pid]
del pcomm[pid]
continue
# output
if args.timestamp:
print("%-8.3f" % (ts - start_ts), end="")
print("%-16s %-6s %3s %s" % (pcomm[pid], pid, arg, cmd[pid]))
del cmd[pid]
del pcomm[pid]
else:
# build command line string
if pid in cmd:
cmd[pid] = cmd[pid] + " " + arg
else:
cmd[pid] = arg
if pid not in pcomm:
pcomm[pid] = task
Demonstrations of execsnoop, the Linux eBPF/bcc version.
execsnoop traces new processes. For example, tracing the commands invoked when
running "man ls":
# ./execsnoop
PCOMM PID RET ARGS
bash 15887 0 /usr/bin/man ls
preconv 15894 0 /usr/bin/preconv -e UTF-8
man 15896 0 /usr/bin/tbl
man 15897 0 /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
man 15898 0 /usr/bin/pager -s
nroff 15900 0 /usr/bin/locale charmap
nroff 15901 0 /usr/bin/groff -mtty-char -Tutf8 -mandoc -rLL=169n -rLT=169n
groff 15902 0 /usr/bin/troff -mtty-char -mandoc -rLL=169n -rLT=169n -Tutf8
groff 15903 0 /usr/bin/grotty
The output shows the parent process/command name (PCOMM), the PID, the return
value of the exec() (RET), and the filename with arguments (ARGS).
This works by tracing the execve() system call (the commonly used exec()
variant), and showing details of the arguments and return value. This catches
new processes
that follow the fork->exec sequence, as well as processes that re-exec()
themselves. Some applications fork() but do not exec(), eg, for worker
processes, which won't be included in the execsnoop output.
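To see this blind spot for yourself, a workload that fork()s but never calls
exec() will create children that execsnoop does not report. A minimal sketch
in Python (a hypothetical test program, not part of bcc):

#!/usr/bin/python
# forkonly.py: create worker processes via fork() alone, with no exec().
# execsnoop will not show these children, since no execve() ever occurs.
import os
import time

for i in range(5):
    pid = os.fork()
    if pid == 0:
        time.sleep(1)       # child: pretend to do work, then exit
        os._exit(0)
    os.waitpid(pid, 0)      # parent: reap the child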
The -x option can be used to include failed exec()s. For example:
# ./execsnoop -x
PCOMM PID RET ARGS
supervise 9660 0 ./run
supervise 9661 0 ./run
mkdir 9662 0 /bin/mkdir -p ./main
run 9663 0 ./run
chown 9664 0 /bin/chown nobody:nobody ./main
run 9665 0 /bin/mkdir -p ./main
supervise 9667 0 ./run
run 9660 -2 /usr/local/bin/setuidgid nobody /command/multilog t ./main
chown 9668 0 /bin/chown nobody:nobody ./main
run 9666 0 /bin/chmod 0777 main
run 9663 -2 /usr/local/bin/setuidgid nobody /command/multilog t ./main
run 9669 0 /bin/mkdir -p ./main
run 9661 -2 /usr/local/bin/setuidgid nobody /command/multilog t ./main
supervise 9670 0 ./run
[...]
This example shows various regular system daemon activity, including some
failures (trying to execute a /usr/local/bin/setuidgid, which I just noticed
doesn't exist).
A -t option can be used to include a timestamp column, and a -n option to match
on a name or substring from the full command line (filename + args). Regular
expressions are allowed. For example, matching commands containing "mount":
# ./execsnoop -tn mount
TIME(s) PCOMM PID RET ARGS
2.849 bash 18049 0 /bin/mount -p
USAGE message:
# ./execsnoop -h
usage: execsnoop [-h] [-t] [-x] [-n NAME]
Trace exec() syscalls
optional arguments:
-h, --help show this help message and exit
-t, --timestamp include timestamp on output
-x, --fails include failed exec()s
-n NAME, --name NAME only print commands matching this name (regex), any
arg
examples:
./execsnoop # trace all exec() syscalls
./execsnoop -x # include failed exec()s
./execsnoop -t # include timestamps
./execsnoop -n main # only print command lines containing "main"
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# filelife Trace the lifespan of short-lived files.
# For Linux, uses BCC, eBPF. Embedded C.
#
# This traces the creation and deletion of files, providing information
# on who deleted the file, the file age, and the file name. The intent is to
# provide information on short-lived files, for debugging or performance
# analysis.
#
# USAGE: filelife [-h] [-p PID]
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 08-Feb-2016   Brendan Gregg   Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
# arguments
examples = """examples:
    ./filelife           # trace all short-lived files
./filelife -p 181 # only trace PID 181
"""
parser = argparse.ArgumentParser(
description="Trace stat() syscalls",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
args = parser.parse_args()
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
BPF_HASH(birth, struct dentry *);
// trace file creation time
int trace_create(struct pt_regs *ctx, struct inode *dir, struct dentry *dentry)
{
u32 pid = bpf_get_current_pid_tgid();
FILTER
u64 ts = bpf_ktime_get_ns();
birth.update(&dentry, &ts);
return 0;
};
// trace file deletion and output details
int trace_unlink(struct pt_regs *ctx, struct inode *dir, struct dentry *dentry)
{
u32 pid = bpf_get_current_pid_tgid();
FILTER
u64 *tsp, delta;
tsp = birth.lookup(&dentry);
if (tsp == 0) {
return 0; // missed create
}
delta = (bpf_ktime_get_ns() - *tsp) / 1000000;
birth.delete(&dentry);
if (dentry->d_iname[0] == 0)
return 0;
bpf_trace_printk("%d %s\\n", delta, dentry->d_iname);
return 0;
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
'if (pid != %s) { return 0; }' % args.pid)
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="vfs_create", fn_name="trace_create")
b.attach_kprobe(event="vfs_unlink", fn_name="trace_unlink")
# header
print("%-8s %-6s %-16s %-7s %s" % ("TIME", "PID", "COMM", "AGE(s)", "FILE"))
start_ts = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
(delta, filename) = msg.split(" ", 1)
# print columns
print("%-8s %-6d %-16s %-7.2f %s" % (strftime("%H:%M:%S"), pid, task,
float(delta) / 1000, filename))
Demonstrations of filelife, the Linux eBPF/bcc version.
filelife traces short-lived files: those that have been created and then
deleted while tracing. For example:
# ./filelife
TIME PID COMM AGE(s) FILE
05:57:59 8556 gcc 0.04 ccCB5EDe.s
05:57:59 8560 rm 0.02 .entry_64.o.d
05:57:59 8563 gcc 0.02 cc5UFHXf.s
05:57:59 8567 rm 0.01 .thunk_64.o.d
05:57:59 8578 rm 0.02 .syscall_64.o.d
05:58:00 8589 rm 0.03 .common.o.d
05:58:00 8596 rm 0.01 .8592.tmp
05:58:00 8601 rm 0.01 .8597.tmp
05:58:00 8606 rm 0.01 .8602.tmp
05:58:00 8639 rm 0.02 .vma.o.d
05:58:00 8650 rm 0.02 .vdso32-setup.o.d
05:58:00 8656 rm 0.00 .vdso.lds.d
05:58:00 8659 gcc 0.01 ccveeJAz.s
05:58:00 8663 rm 0.01 .vdso-note.o.d
05:58:00 8674 rm 0.02 .vclock_gettime.o.d
05:58:01 8684 rm 0.01 .vgetcpu.o.d
05:58:01 8690 collect2 0.00 ccvKMxdm.ld
This has caught short-lived files that were created during a Linux kernel
build. The PID shows the process ID that finally deleted the file, and COMM
is its process name. The AGE(s) column shows the age of the file, in seconds,
when it was deleted. These are all short-lived, and existed for less than
one tenth of a second.
Creating, populating, and then deleting files as part of another process can
be an inefficient method of inter-process communication. It can cause disk I/O
as files are closed and their file descriptors flushed, only later to be
deleted. As such, short-lived files can be a target of performance
optimizations.
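As a quick way to exercise filelife, a throwaway workload can create files and
promptly delete them (my own sketch, not shipped with bcc):

#!/usr/bin/python
# mkrm.py: create files and delete them ~50 ms later, for filelife to catch.
import os
import time

for i in range(10):
    name = "junk%d" % i
    with open(name, "w") as f:      # creating a new file -> vfs_create()
        f.write("x" * 4096)
    time.sleep(0.05)                # the file lives for about 50 ms
    os.unlink(name)                 # deletion -> vfs_unlink()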
USAGE message:
# ./filelife -h
usage: filelife [-h] [-p PID]
Trace the lifespan of short-lived files
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
examples:
    ./filelife           # trace all short-lived files
./filelife -p 181 # only trace PID 181
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# fileslower Trace slow synchronous file reads and writes.
# For Linux, uses BCC, eBPF.
#
# USAGE: fileslower [-h] [-p PID] [min_ms]
#
# This script uses kernel dynamic tracing of synchronous reads and writes
# at the VFS interface, to identify slow file reads and writes for any file
# system.
#
# This works by tracing __vfs_read() and __vfs_write(), and filtering for
# synchronous I/O (the path to new_sync_read() and new_sync_write()), and
# for I/O with filenames. This approach provides a view of just two file
# system request types. There are typically many others: asynchronous I/O,
# directory operations, file handle operations, etc, that this tool does not
# instrument.
#
# WARNING: This traces VFS reads and writes, which can be extremely frequent,
# and so the overhead of this tool can become severe depending on the
# workload.
#
# By default, a minimum millisecond threshold of 10 is used.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 06-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import signal
# arguments
examples = """examples:
./fileslower # trace sync file I/O slower than 10 ms (default)
./fileslower 1 # trace sync file I/O slower than 1 ms
./fileslower -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Trace slow synchronous file reads and writes",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("min_ms", nargs="?", default='10',
help="minimum I/O duration to trace, in ms (default 10)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
debug = 0
# signal handler
def signal_ignore(signal, frame):
print()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#define TRACE_READ 0
#define TRACE_WRITE 1
struct val_t {
u32 sz;
u64 ts;
char name[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, pid_t, struct val_t);
// store timestamp and size on entry
static int trace_rw_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER)
return 0;
// skip I/O lacking a filename
struct dentry *de = file->f_path.dentry;
if (de->d_iname[0] == 0)
return 0;
// store size and timestamp by pid
struct val_t val = {};
val.sz = count;
val.ts = bpf_ktime_get_ns();
__builtin_memcpy(&val.name, de->d_iname, sizeof(val.name));
entryinfo.update(&pid, &val);
return 0;
}
int trace_read_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
// skip non-sync I/O; see kernel code for __vfs_read()
if (!(file->f_op->read_iter))
return 0;
return trace_rw_entry(ctx, file, buf, count);
}
int trace_write_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
// skip non-sync I/O; see kernel code for __vfs_write()
if (!(file->f_op->write_iter))
return 0;
return trace_rw_entry(ctx, file, buf, count);
}
// output
static int trace_rw_return(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = entryinfo.lookup(&pid);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
u64 delta_us = (bpf_ktime_get_ns() - valp->ts) / 1000;
entryinfo.delete(&pid);
if (delta_us < MIN_US)
return 0;
if (type == TRACE_READ) {
bpf_trace_printk("R %d %d %s\\n", valp->sz, delta_us, valp->name);
} else {
bpf_trace_printk("W %d %d %s\\n", valp->sz, delta_us, valp->name);
}
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_rw_return(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_rw_return(ctx, TRACE_WRITE);
}
"""
bpf_text = bpf_text.replace('MIN_US', str(min_ms * 1000))
if args.pid:
bpf_text = bpf_text.replace('FILTER', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER', '0')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
# I'd rather trace these via new_sync_read/new_sync_write (which used to be
# do_sync_read/do_sync_write), but those became static. So trace these from
# the parent functions, at the cost of more overhead, instead.
# Ultimately, we should be using [V]FS tracepoints.
b.attach_kprobe(event="__vfs_read", fn_name="trace_read_entry")
b.attach_kprobe(event="__vfs_write", fn_name="trace_write_entry")
b.attach_kretprobe(event="__vfs_read", fn_name="trace_read_return")
b.attach_kretprobe(event="__vfs_write", fn_name="trace_write_return")
# header
print("Tracing sync read/writes slower than %d ms" % min_ms)
print("%-8s %-14s %-6s %1s %-7s %7s %s" % ("TIME(s)", "COMM", "PID", "D",
"BYTES", "LAT(ms)", "FILENAME"))
start_ts = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
args = msg.split(" ", 3)
(type_s, sz, delta_us_s) = (args[0], args[1], args[2])
try:
filename = args[3]
    except IndexError:
filename = "?"
if start_ts == 0:
start_ts = ts
ms = float(int(delta_us_s, 10)) / 1000
print("%-8.3f %-14.14s %-6s %1s %-7s %7.2f %s" % (
ts - start_ts, task, pid, type_s, sz, ms, filename))
Demonstrations of fileslower, the Linux eBPF/bcc version.
fileslower shows file-based synchronous reads and writes slower than a
threshold. For example:
# ./fileslower
Tracing sync read/writes slower than 10 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 randread.pl 4762 R 8192 12.70 data1
8.850 randread.pl 4762 R 8192 11.26 data1
12.852 randread.pl 4762 R 8192 10.43 data1
This showed a few reads from a "randread.pl" program, each 8 Kbytes in size,
and from a "data1" file. These all had over 10 ms latency.
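randread.pl is just a test workload. An equivalent can be improvised in a few
lines of Python (an illustrative stand-in, not the original script), issuing
random 8 Kbyte reads; run it against a file larger than memory, or after
dropping caches, so that some reads must come from disk:

#!/usr/bin/python
# randread.py: issue random 8 Kbyte synchronous reads from a file ("data1").
import os
import random

fd = os.open("data1", os.O_RDONLY)
size = os.fstat(fd).st_size
while True:
    off = random.randint(0, max(size - 8192, 0))
    os.lseek(fd, off, os.SEEK_SET)
    os.read(fd, 8192)               # synchronous read, via __vfs_read()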
This "latency" is measured from when the read or write was issued at the VFS
interface, to when it completed. This spans everything: block device I/O (disk
I/O), file system CPU cycles, file system locks, run queue latency, etc. This
is a better measure of the latency suffered by applications reading from the
file system than measuring this down at the block device interface.
Note that this only traces file reads and writes: other file system operations
(eg, directory operations, open(), fflush()) are not traced.
The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
# ./fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 randread.pl 6925 R 8192 1.06 data1
0.082 randread.pl 6925 R 8192 2.42 data1
0.116 randread.pl 6925 R 8192 1.78 data1
0.153 randread.pl 6925 R 8192 2.31 data1
0.330 randread.pl 6925 R 8192 1.14 data1
0.345 randread.pl 6925 R 8192 1.52 data1
0.359 randread.pl 6925 R 8192 1.04 data1
0.532 randread.pl 6925 R 8192 2.56 data1
0.609 supervise 1892 W 18 3.65 status.new
0.610 randread.pl 6925 R 8192 1.37 data1
0.614 randread.pl 6925 R 8192 3.04 data1
0.729 randread.pl 6925 R 8192 2.90 data1
0.755 randread.pl 6925 R 8192 1.12 data1
0.762 randread.pl 6925 R 8192 2.62 data1
0.771 randread.pl 6925 R 8192 1.07 data1
0.816 randread.pl 6925 R 8192 10.50 data1
0.983 randread.pl 6925 R 8192 1.73 data1
0.989 randread.pl 6925 R 8192 2.12 data1
0.992 randread.pl 6925 R 8192 2.17 data1
1.001 randread.pl 6925 R 8192 1.93 data1
1.007 randread.pl 6925 R 8192 2.03 data1
1.210 randread.pl 6925 R 8192 1.82 data1
1.213 randread.pl 6925 R 8192 2.58 data1
1.219 randread.pl 6925 R 8192 2.20 data1
1.430 randread.pl 6925 R 8192 1.01 data1
1.448 randread.pl 6925 R 8192 2.22 data1
[...]
There's now much more output (this spans only 1.4 seconds, the previous output
spanned 12 seconds), and the lower threshold is catching more I/O.
In the following example, the file system caches were dropped before running
fileslower, and then in another session a "man ls" was executed. The command
and files read from disk can be seen:
# echo 3 > /proc/sys/vm/drop_caches; ./fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 bash 9647 R 128 5.83 man
0.050 man 9647 R 832 19.52 libmandb-2.6.7.1.so
0.066 man 9647 R 832 15.79 libman-2.6.7.1.so
0.123 man 9647 R 832 56.36 libpipeline.so.1.3.0
0.135 man 9647 R 832 9.79 libgdbm.so.3.0.0
0.323 man 9647 R 4096 59.52 locale.alias
0.540 man 9648 R 8192 11.11 ls.1.gz
0.558 man 9647 R 72 6.97 index.db
0.563 man 9647 R 4096 5.12 index.db
0.723 man 9658 R 128 12.06 less
0.725 man 9656 R 128 14.52 nroff
0.779 man 9655 R 128 68.86 tbl
0.814 nroff 9660 R 128 14.55 locale
0.830 pager 9658 R 4096 28.27 .lesshst
0.866 man 9654 R 128 163.12 preconv
0.980 nroff 9684 R 128 13.80 groff
0.999 groff 9684 R 4096 14.29 DESC
1.036 groff 9685 R 128 5.94 troff
1.038 groff 9686 R 128 7.76 grotty
1.065 troff 9685 R 4096 6.33 R
1.082 troff 9685 R 4096 10.52 BI
1.096 troff 9685 R 4096 8.70 troffrc
1.176 troff 9685 R 4096 80.12 composite.tmac
1.195 troff 9685 R 4096 19.20 fallbacks.tmac
1.202 troff 9685 R 4096 6.79 tty.tmac
1.221 troff 9685 R 4096 7.87 man.local
2.977 supervise 1876 W 18 4.23 status.new
This caught an individual I/O reaching 163.12 ms, for the "preconv" file. While
the file system cache was flushed, causing these to need to be read from disk,
the duration here may not be entirely disk I/O: it can include file system
locks, run queue latency, etc. These can be explored using other commands.
USAGE message:
# ./fileslower -h
usage: fileslower [-h] [-p PID] [min_ms]
Trace slow synchronous file reads and writes
positional arguments:
min_ms minimum I/O duration to trace, in ms (default 10)
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
examples:
./fileslower # trace sync file I/O slower than 10 ms (default)
./fileslower 1 # trace sync file I/O slower than 1 ms
./fileslower -p 185 # trace PID 185 only
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# filetop file reads and writes by process.
# For Linux, uses BCC, eBPF.
#
# USAGE: filetop.py [-h] [-C] [-r MAXROWS] [interval] [count]
#
# This uses in-kernel eBPF maps to store per process summaries for efficiency.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 06-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
from subprocess import call
# arguments
examples = """examples:
./filetop # file I/O top, 1 second refresh
./filetop -C # don't clear the screen
./filetop -p 181 # PID 181 only
./filetop 5 # 5 second summaries
./filetop 5 10 # 5 second summaries, 10 times only
"""
parser = argparse.ArgumentParser(
description="File reads and writes by process",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-C", "--noclear", action="store_true",
help="don't clear the screen")
parser.add_argument("-r", "--maxrows", default=20,
help="maximum rows to print, default 20")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("interval", nargs="?", default=1,
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
interval = int(args.interval)
countdown = int(args.count)
maxrows = int(args.maxrows)
clear = not int(args.noclear)
debug = 0
# linux stats
loadavg = "/proc/loadavg"
# signal handler
def signal_ignore(signal, frame):
print()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
#define MAX_FILE_LEN 32
// the key for the output summary
struct info_t {
u32 pid;
char name[TASK_COMM_LEN];
char file[MAX_FILE_LEN];
char type;
};
// the value of the output summary
struct val_t {
u64 reads;
u64 writes;
u64 rbytes;
u64 wbytes;
};
BPF_HASH(counts, struct info_t, struct val_t);
static int do_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count, int is_read)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER)
return 0;
// skip I/O lacking a filename
struct dentry *de = file->f_path.dentry;
if (de->d_iname[0] == 0)
return 0;
// store counts and sizes by pid & file
struct info_t info = {.pid = pid};
bpf_get_current_comm(&info.name, sizeof(info.name));
__builtin_memcpy(&info.file, de->d_iname, sizeof(info.file));
int mode = file->f_inode->i_mode;
if (S_ISREG(mode)) {
info.type = 'R';
} else if (S_ISSOCK(mode)) {
info.type = 'S';
} else {
info.type = 'O';
}
struct val_t *valp, zero = {};
valp = counts.lookup_or_init(&info, &zero);
if (is_read) {
valp->reads++;
valp->rbytes += count;
} else {
valp->writes++;
valp->wbytes += count;
}
return 0;
}
int trace_read_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
return do_entry(ctx, file, buf, count, 1);
}
int trace_write_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
return do_entry(ctx, file, buf, count, 0);
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER', 'pid != %s' % args.pid)
else:
bpf_text = bpf_text.replace('FILTER', '0')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="__vfs_read", fn_name="trace_read_entry")
b.attach_kprobe(event="__vfs_write", fn_name="trace_write_entry")
print('Tracing... Output every %d secs. Hit Ctrl-C to end' % interval)
# output
exiting = 0
while 1:
try:
sleep(interval)
except KeyboardInterrupt:
exiting = 1
# header
if clear:
call("clear")
else:
print()
with open(loadavg) as stats:
print("%-8s loadavg: %s" % (strftime("%H:%M:%S"), stats.read()))
print("%-6s %-16s %-6s %-6s %-7s %-7s %1s %s" % ("PID", "COMM",
"READS", "WRITES", "R_Kb", "W_Kb", "T", "FILE"))
# by-PID output
counts = b.get_table("counts")
line = 0
for k, v in reversed(sorted(counts.items(),
key=lambda counts: counts[1].rbytes)):
# print line
print("%-6d %-16s %-6d %-6d %-7d %-7d %1s %s" % (k.pid, k.name,
v.reads, v.writes, v.rbytes / 1024, v.wbytes / 1024, k.type,
k.file))
line += 1
if line >= maxrows:
break
counts.clear()
countdown -= 1
if exiting or countdown == 0:
print("Detaching...")
exit()
Demonstrations of filetop, the Linux eBPF/bcc version.
filetop shows reads and writes by file, with process details. For example:
# ./filetop -C
Tracing... Output every 1 secs. Hit Ctrl-C to end
08:00:23 loadavg: 0.91 0.33 0.23 3/286 26635
PID COMM READS WRITES R_Kb W_Kb T FILE
26628 ld 161 186 643 152 R built-in.o
26634 cc1 1 0 200 0 R autoconf.h
26618 cc1 1 0 200 0 R autoconf.h
26634 cc1 12 0 192 0 R tracepoint.h
26584 cc1 2 0 143 0 R mm.h
26634 cc1 2 0 143 0 R mm.h
26631 make 34 0 136 0 R auto.conf
26634 cc1 1 0 98 0 R fs.h
26584 cc1 1 0 98 0 R fs.h
26634 cc1 1 0 91 0 R sched.h
26634 cc1 1 0 78 0 R printk.c
26634 cc1 3 0 73 0 R mmzone.h
26628 ld 18 0 72 0 R hibernate.o
26628 ld 16 0 64 0 R suspend.o
26628 ld 16 0 64 0 R snapshot.o
26630 cat 1 0 64 0 O null
26628 ld 16 0 64 0 R qos.o
26628 ld 13 0 52 0 R main.o
26628 ld 12 0 52 0 R swap.o
12421 sshd 3 0 48 0 O ptmx
[...]
This shows various files read and written during a Linux kernel build. The
output is sorted by the total read size in Kbytes (R_Kb). This is instrumenting
at the VFS interface, so these are reads and writes that may be returned
entirely from the file system cache (page cache).
While not printed, the average read and write size can be calculated by
dividing R_Kb by READS, and the same for writes.
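For example, the ld process at the top of the output averaged 643 / 161, or
about 4 Kbytes per read.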
The "T" column indicates the type of the file: "R" for regular files, "S" for
sockets, and "O" for other (including pipes).
This script works by tracing the __vfs_read() and __vfs_write() functions using
kernel dynamic tracing, which instruments explicit read and write calls. If
files are read or written using another means (eg, via mmap()), then they
will not be visible using this tool.
This should be useful for file system workload characterization when analyzing
the performance of applications.
Note that tracing VFS level reads and writes can be a frequent activity, and
this tool can begin to cost measurable overhead at high I/O rates.
A -C option will stop clearing the screen, and -r with a number will restrict
the output to that many rows (20 by default). For example, not clearing
the screen and showing the top 5 only:
# ./filetop -Cr 5
Tracing... Output every 1 secs. Hit Ctrl-C to end
08:05:11 loadavg: 0.75 0.35 0.25 3/285 822
PID COMM READS WRITES R_Kb W_Kb T FILE
32672 cksum 5006 0 320384 0 R data1
12296 sshd 2 0 32 0 O ptmx
809 run 2 0 8 0 R nsswitch.conf
811 run 2 0 8 0 R nsswitch.conf
804 chown 2 0 8 0 R nsswitch.conf
08:05:12 loadavg: 0.75 0.35 0.25 3/285 845
PID COMM READS WRITES R_Kb W_Kb T FILE
32672 cksum 4986 0 319104 0 R data1
845 chown 2 0 8 0 R nsswitch.conf
828 run 2 0 8 0 R nsswitch.conf
835 run 2 0 8 0 R nsswitch.conf
830 run 2 0 8 0 R nsswitch.conf
08:05:13 loadavg: 0.75 0.35 0.25 3/285 868
PID COMM READS WRITES R_Kb W_Kb T FILE
32672 cksum 4985 0 319040 0 R data1
857 run 2 0 8 0 R nsswitch.conf
858 run 2 0 8 0 R nsswitch.conf
859 run 2 0 8 0 R nsswitch.conf
848 run 2 0 8 0 R nsswitch.conf
[...]
This output shows a cksum command reading data1.
An optional interval and optional count can also be added to the end of the
command line. For example, for 1 second interval, and 3 summaries in total:
# ./filetop -Cr 5 1 3
Tracing... Output every 1 secs. Hit Ctrl-C to end
08:08:20 loadavg: 0.30 0.42 0.31 3/282 5187
PID COMM READS WRITES R_Kb W_Kb T FILE
12421 sshd 14101 0 225616 0 O ptmx
12296 sshd 4 0 64 0 O ptmx
12421 sshd 3 14104 48 778 S TCP
5178 run 2 0 8 0 R nsswitch.conf
5165 run 2 0 8 0 R nsswitch.conf
08:08:21 loadavg: 0.30 0.42 0.31 5/282 5210
PID COMM READS WRITES R_Kb W_Kb T FILE
12421 sshd 9159 0 146544 0 O ptmx
12421 sshd 3 9161 48 534 S TCP
12296 sshd 1 0 16 0 S TCP
5188 run 2 0 8 0 R nsswitch.conf
5203 run 2 0 8 0 R nsswitch.conf
08:08:22 loadavg: 0.30 0.42 0.31 2/282 5233
PID COMM READS WRITES R_Kb W_Kb T FILE
12421 sshd 26166 0 418656 0 O ptmx
12421 sshd 4 26171 64 1385 S TCP
12296 sshd 1 0 16 0 O ptmx
5214 run 2 0 8 0 R nsswitch.conf
5227 run 2 0 8 0 R nsswitch.conf
Detaching...
This example has caught heavy socket I/O from an sshd process, showing up as
non-regular file types (the "O" for other, and "S" for socket, in the type
column: "T").
USAGE message:
# ./filetop -h
usage: filetop [-h] [-C] [-r MAXROWS] [-p PID] [interval] [count]
File reads and writes by process
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-C, --noclear don't clear the screen
-r MAXROWS, --maxrows MAXROWS
maximum rows to print, default 20
-p PID, --pid PID trace this PID only
examples:
./filetop # file I/O top, 1 second refresh
./filetop -C # don't clear the screen
./filetop -p 181 # PID 181 only
./filetop 5 # 5 second summaries
./filetop 5 10 # 5 second summaries, 10 times only
...@@ -166,7 +166,7 @@ The current implementation can take many seconds to detach from tracing, after
Ctrl-C has been hit.
Counting all vfs functions for process ID 5276 only:
# ./funccount -p 5276 'vfs_*'
Tracing... Ctrl-C to end.
...
...@@ -37,12 +37,12 @@ the function began executing (was called) to when it finished (returned).
This example output shows that most of the time, do_sys_open() took between
2048 and 65536 nanoseconds (2 to 65 microseconds). The peak of this distribution
shows 291 calls of between 4096 and 8191 nanoseconds. There was also one
occurrence, an outlier, in the 2 to 4 millisecond range.
How this works: the function entry and return are traced using the kernel kprobe
and kretprobe tracer. Timestamps are collected and the delta time calculated,
which is then bucketized and stored as an in-kernel histogram for efficiency.
The histogram is visible in the output: it's the "count" column; everything else
is decoration. Only the count column is copied to user-level on output. This is an
efficient way to time kernel functions and examine their latency distribution.
...@@ -242,7 +242,7 @@ USAGE message:
usage: funclatency [-h] [-p PID] [-i INTERVAL] [-T] [-u] [-m] [-F] [-r]
pattern
Time kernel functions and print latency as a histogram
positional arguments:
pattern search expression for kernel functions
...@@ -260,7 +260,7 @@ optional arguments:
only.
examples:
./funclatency do_sys_open # time the do_sys_open() kernel function
./funclatency -u vfs_read # time vfs_read(), in microseconds
./funclatency -m do_nanosleep # time do_nanosleep(), in milliseconds
./funclatency -mTi 5 vfs_read # output every 5 seconds, with timestamps
...
#!/usr/bin/env python
#
# memleak.py Trace and display outstanding allocations to detect
# memory leaks in user-mode processes and the kernel.
#
# USAGE: memleak.py [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
# [-s SAMPLE_RATE] [-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE]
# [-Z MAX_SIZE]
# [interval] [count]
#
# Licensed under the Apache License, Version 2.0 (the "License")
# Copyright (C) 2016 Sasha Goldshtein.
from bcc import BPF
from time import sleep
from datetime import datetime
import argparse
import subprocess
import ctypes
import os
class Time(object):
# BPF timestamps come from the monotonic clock. To be able to filter
# and compare them from Python, we need to invoke clock_gettime.
# Adapted from http://stackoverflow.com/a/1205762
CLOCK_MONOTONIC_RAW = 4 # see <linux/time.h>
class timespec(ctypes.Structure):
_fields_ = [
('tv_sec', ctypes.c_long),
('tv_nsec', ctypes.c_long)
]
librt = ctypes.CDLL('librt.so.1', use_errno=True)
clock_gettime = librt.clock_gettime
clock_gettime.argtypes = [ctypes.c_int, ctypes.POINTER(timespec)]
@staticmethod
def monotonic_time():
t = Time.timespec()
if Time.clock_gettime(
Time.CLOCK_MONOTONIC_RAW, ctypes.pointer(t)) != 0:
errno_ = ctypes.get_errno()
raise OSError(errno_, os.strerror(errno_))
return t.tv_sec * 1e9 + t.tv_nsec
class StackDecoder(object):
def __init__(self, pid, bpf):
self.pid = pid
self.bpf = bpf
self.ranges_cache = {}
self.refresh_code_ranges()
def refresh_code_ranges(self):
if self.pid == -1:
return
self.code_ranges = self._get_code_ranges()
@staticmethod
def _is_binary_segment(parts):
return len(parts) == 6 and \
parts[5][0] != '[' and 'x' in parts[1]
def _get_code_ranges(self):
ranges = {}
raw_ranges = open("/proc/%d/maps" % self.pid).readlines()
# A typical line from /proc/PID/maps looks like this:
# 7f21b6635000-7f21b67eb000 r-xp ... /usr/lib64/libc-2.21.so
# We are looking for executable segments that have a .so file
# or the main executable. The first two lines are the range of
# that memory segment, which we index by binary name.
for raw_range in raw_ranges:
parts = raw_range.split()
if not StackDecoder._is_binary_segment(parts):
continue
binary = parts[5]
range_parts = parts[0].split('-')
addr_range = (int(range_parts[0], 16),
int(range_parts[1], 16))
ranges[binary] = addr_range
return ranges
@staticmethod
def _is_function_symbol(parts):
return len(parts) == 6 and parts[3] == ".text" \
and parts[2] == "F"
def _get_sym_ranges(self, binary):
if binary in self.ranges_cache:
return self.ranges_cache[binary]
sym_ranges = {}
raw_symbols = run_command_get_output("objdump -t %s" % binary)
for raw_symbol in raw_symbols:
# A typical line from objdump -t looks like this:
# 00000000004007f5 g F .text 000000000000010e main
# We only care about functions in the .text segment.
# The first number is the start address, and the second
# number is the length.
parts = raw_symbol.split()
if not StackDecoder._is_function_symbol(parts):
continue
sym_start = int(parts[0], 16)
sym_len = int(parts[4], 16)
sym_name = parts[5]
sym_ranges[sym_name] = (sym_start, sym_len)
self.ranges_cache[binary] = sym_ranges
return sym_ranges
def _decode_sym(self, binary, offset):
sym_ranges = self._get_sym_ranges(binary)
# Find the symbol that contains the specified offset.
# There might not be one.
for name, (start, length) in sym_ranges.items():
if offset >= start and offset <= (start + length):
return "%s+0x%x" % (name, offset - start)
return "%x" % offset
def _decode_addr(self, addr):
code_ranges = self._get_code_ranges()
# Find the binary that contains the specified address.
# For .so files, look at the relative address; for the main
# executable, look at the absolute address.
for binary, (start, end) in code_ranges.items():
if addr >= start and addr <= end:
offset = addr - start \
if binary.endswith(".so") else addr
return "%s [%s]" % (self._decode_sym(binary,
offset), binary)
return "%x" % addr
def decode_stack(self, info, is_kernel_trace):
stack = ""
if info.num_frames <= 0:
return "???"
for i in range(0, info.num_frames):
addr = info.callstack[i]
if is_kernel_trace:
stack += " %s [kernel] (%x) ;" % \
(self.bpf.ksym(addr), addr)
else:
# At some point, we hope to have native BPF
# user-mode symbol decoding, but for now we
# have to use our own.
stack += " %s (%x) ;" % \
(self._decode_addr(addr), addr)
return stack
def run_command_get_output(command):
p = subprocess.Popen(command.split(),
stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
return iter(p.stdout.readline, b'')
def run_command_get_pid(command):
p = subprocess.Popen(command.split())
return p.pid
examples = """
EXAMPLES:
./memleak.py -p $(pidof allocs)
Trace allocations and display a summary of "leaked" (outstanding)
allocations every 5 seconds
./memleak.py -p $(pidof allocs) -t
Trace allocations and display each individual call to malloc/free
./memleak.py -ap $(pidof allocs) 10
Trace allocations and display allocated addresses, sizes, and stacks
every 10 seconds for outstanding allocations
./memleak.py -c "./allocs"
Run the specified command and trace its allocations
./memleak.py
Trace allocations in kernel mode and display a summary of outstanding
allocations every 5 seconds
./memleak.py -o 60000
Trace allocations in kernel mode and display a summary of outstanding
allocations that are at least one minute (60 seconds) old
./memleak.py -s 5
Trace roughly every 5th allocation, to reduce overhead
"""
description = """
Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with malloc/free and kernel-mode
allocations made with kmalloc/kfree.
"""
parser = argparse.ArgumentParser(description=description,
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid", type=int, default=-1,
help="the PID to trace; if not specified, trace kernel allocs")
parser.add_argument("-t", "--trace", action="store_true",
help="print trace messages for each alloc/free call")
parser.add_argument("interval", nargs="?", default=5, type=int,
help="interval in seconds to print outstanding allocations")
parser.add_argument("count", nargs="?", type=int,
help="number of times to print the report before exiting")
parser.add_argument("-a", "--show-allocs", default=False, action="store_true",
help="show allocation addresses and sizes as well as call stacks")
parser.add_argument("-o", "--older", default=500, type=int,
help="prune allocations younger than this age in milliseconds")
parser.add_argument("-c", "--command",
help="execute and trace the specified command")
parser.add_argument("-s", "--sample-rate", default=1, type=int,
help="sample every N-th allocation to decrease the overhead")
parser.add_argument("-d", "--stack-depth", default=10, type=int,
help="maximum stack depth to capture")
parser.add_argument("-T", "--top", type=int, default=10,
help="display only this many top allocating stacks (by size)")
parser.add_argument("-z", "--min-size", type=int,
help="capture only allocations larger than this size")
parser.add_argument("-Z", "--max-size", type=int,
help="capture only allocations smaller than this size")
args = parser.parse_args()
pid = args.pid
command = args.command
kernel_trace = (pid == -1 and command is None)
trace_all = args.trace
interval = args.interval
min_age_ns = 1e6 * args.older
sample_every_n = args.sample_rate
num_prints = args.count
max_stack_size = args.stack_depth + 2
top_stacks = args.top
min_size = args.min_size
max_size = args.max_size
if min_size is not None and max_size is not None and min_size > max_size:
print("min_size (-z) can't be greater than max_size (-Z)")
exit(1)
if command is not None:
print("Executing '%s' and tracing the resulting process." % command)
pid = run_command_get_pid(command)
bpf_source = """
#include <uapi/linux/ptrace.h>
struct alloc_info_t {
u64 size;
u64 timestamp_ns;
int num_frames;
u64 callstack[MAX_STACK_SIZE];
};
BPF_HASH(sizes, u64);
BPF_HASH(allocs, u64, struct alloc_info_t);
// Adapted from https://github.com/iovisor/bcc/tools/offcputime.py
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
return ret;
}
return 0;
}
static int grab_stack(struct pt_regs *ctx, struct alloc_info_t *info)
{
int depth = 0;
u64 bp = ctx->bp;
GRAB_ONE_FRAME
return depth;
}
int alloc_enter(struct pt_regs *ctx, size_t size)
{
SIZE_FILTER
if (SAMPLE_EVERY_N > 1) {
u64 ts = bpf_ktime_get_ns();
if (ts % SAMPLE_EVERY_N != 0)
return 0;
}
u64 pid = bpf_get_current_pid_tgid();
u64 size64 = size;
sizes.update(&pid, &size64);
if (SHOULD_PRINT)
bpf_trace_printk("alloc entered, size = %u\\n", size);
return 0;
}
int alloc_exit(struct pt_regs *ctx)
{
u64 address = ctx->ax;
u64 pid = bpf_get_current_pid_tgid();
u64* size64 = sizes.lookup(&pid);
struct alloc_info_t info = {0};
if (size64 == 0)
return 0; // missed alloc entry
info.size = *size64;
sizes.delete(&pid);
info.timestamp_ns = bpf_ktime_get_ns();
info.num_frames = grab_stack(ctx, &info) - 2;
allocs.update(&address, &info);
if (SHOULD_PRINT) {
bpf_trace_printk("alloc exited, size = %lu, result = %lx, frames = %d\\n",
info.size, address, info.num_frames);
}
return 0;
}
int free_enter(struct pt_regs *ctx, void *address)
{
u64 addr = (u64)address;
struct alloc_info_t *info = allocs.lookup(&addr);
if (info == 0)
return 0;
allocs.delete(&addr);
if (SHOULD_PRINT) {
bpf_trace_printk("free entered, address = %lx, size = %lu\\n",
address, info->size);
}
return 0;
}
"""
bpf_source = bpf_source.replace("SHOULD_PRINT", "1" if trace_all else "0")
bpf_source = bpf_source.replace("SAMPLE_EVERY_N", str(sample_every_n))
bpf_source = bpf_source.replace("GRAB_ONE_FRAME", max_stack_size *
"\tif (!(info->callstack[depth++] = get_frame(&bp))) return depth;\n")
bpf_source = bpf_source.replace("MAX_STACK_SIZE", str(max_stack_size))
size_filter = ""
if min_size is not None and max_size is not None:
size_filter = "if (size < %d || size > %d) return 0;" % \
(min_size, max_size)
elif min_size is not None:
size_filter = "if (size < %d) return 0;" % min_size
elif max_size is not None:
size_filter = "if (size > %d) return 0;" % max_size
bpf_source = bpf_source.replace("SIZE_FILTER", size_filter)
bpf_program = BPF(text=bpf_source)
if not kernel_trace:
print("Attaching to malloc and free in pid %d, Ctrl+C to quit." % pid)
bpf_program.attach_uprobe(name="c", sym="malloc",
fn_name="alloc_enter", pid=pid)
bpf_program.attach_uretprobe(name="c", sym="malloc",
fn_name="alloc_exit", pid=pid)
bpf_program.attach_uprobe(name="c", sym="free",
fn_name="free_enter", pid=pid)
else:
print("Attaching to kmalloc and kfree, Ctrl+C to quit.")
bpf_program.attach_kprobe(event="__kmalloc", fn_name="alloc_enter")
bpf_program.attach_kretprobe(event="__kmalloc", fn_name="alloc_exit")
bpf_program.attach_kprobe(event="kfree", fn_name="free_enter")
decoder = StackDecoder(pid, bpf_program)
def print_outstanding():
stacks = {}
print("[%s] Top %d stacks with outstanding allocations:" %
(datetime.now().strftime("%H:%M:%S"), top_stacks))
allocs = bpf_program.get_table("allocs")
for address, info in sorted(allocs.items(), key=lambda a: a[1].size):
if Time.monotonic_time() - min_age_ns < info.timestamp_ns:
continue
stack = decoder.decode_stack(info, kernel_trace)
if stack in stacks:
stacks[stack] = (stacks[stack][0] + 1,
stacks[stack][1] + info.size)
else:
stacks[stack] = (1, info.size)
if args.show_allocs:
print("\taddr = %x size = %s" %
(address.value, info.size))
to_show = sorted(stacks.items(), key=lambda s: s[1][1])[-top_stacks:]
for stack, (count, size) in to_show:
print("\t%d bytes in %d allocations from stack\n\t\t%s" %
(size, count, stack.replace(";", "\n\t\t")))
count_so_far = 0
while True:
if trace_all:
        print(bpf_program.trace_fields())
else:
try:
sleep(interval)
except KeyboardInterrupt:
exit()
decoder.refresh_code_ranges()
print_outstanding()
count_so_far += 1
if num_prints is not None and count_so_far >= num_prints:
exit()
Demonstrations of memleak.
memleak traces and matches memory allocation and deallocation requests, and
collects call stacks for each allocation. memleak can then print a summary
of which call stacks performed allocations that weren't subsequently freed.
For example:
# ./memleak.py -p $(pidof allocs)
Attaching to malloc and free in pid 5193, Ctrl+C to quit.
[11:16:33] Top 2 stacks with outstanding allocations:
80 bytes in 5 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
[11:16:34] Top 2 stacks with outstanding allocations:
160 bytes in 10 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
Each entry printed is a set of allocations that originate from the same call
stack, and that weren't freed yet. The number of bytes and number of allocs
are followed by the call stack, top to bottom, of the allocation site.
As time goes on, it becomes apparent that the main function in the allocs
process is leaking memory, 16 bytes at a time. Fortunately, you don't have to
inspect each allocation individually -- you get a nice summary of which stack
is responsible for a large leak.
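The "allocs" program in these examples is just a test workload that leaks 16
bytes at a time. Something similar can be improvised with Python and ctypes,
calling libc's malloc() directly so that memleak's uprobes fire (a rough
sketch; the Python runtime itself calls malloc()/free() heavily, so a small C
program gives cleaner output):

#!/usr/bin/python
# allocs.py: leak 16 bytes per second by calling malloc() and never free().
import ctypes
import time

libc = ctypes.CDLL("libc.so.6")     # assumes glibc; adjust the path if needed
while True:
    libc.malloc(16)                 # leaked: address discarded, never free()d
    time.sleep(1)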
Occasionally, you do want the individual allocation details. Perhaps the same
stack is allocating various sizes and you want to confirm which sizes are
prevalent. Use the -a switch:
# ./memleak.py -p $(pidof allocs) -a
Attaching to malloc and free in pid 5193, Ctrl+C to quit.
[11:16:33] Top 2 stacks with outstanding allocations:
addr = 948cd0 size = 16
addr = 948d10 size = 16
addr = 948d30 size = 16
addr = 948cf0 size = 16
64 bytes in 4 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
[11:16:34] Top 2 stacks with outstanding allocations:
addr = 948d50 size = 16
addr = 948cd0 size = 16
addr = 948d10 size = 16
addr = 948d30 size = 16
addr = 948cf0 size = 16
addr = 948dd0 size = 16
addr = 948d90 size = 16
addr = 948db0 size = 16
addr = 948d70 size = 16
addr = 948df0 size = 16
160 bytes in 10 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
When using the -p switch, memleak traces the allocations of a particular
process. Without this switch, kernel allocations (kmalloc) are traced instead.
For example:
# ./memleak.py
Attaching to kmalloc and kfree, Ctrl+C to quit.
...
248 bytes in 4 allocations from stack
bpf_prog_load [kernel] (ffffffff8118c471)
sys_bpf [kernel] (ffffffff8118c8b5)
328 bytes in 1 allocations from stack
perf_mmap [kernel] (ffffffff811990fd)
mmap_region [kernel] (ffffffff811df5d4)
do_mmap [kernel] (ffffffff811dfb83)
vm_mmap_pgoff [kernel] (ffffffff811c494f)
sys_mmap_pgoff [kernel] (ffffffff811ddf02)
sys_mmap [kernel] (ffffffff8101b0ab)
464 bytes in 1 allocations from stack
traceprobe_command [kernel] (ffffffff81187cf2)
traceprobe_probes_write [kernel] (ffffffff81187d86)
probes_write [kernel] (ffffffff81181580)
__vfs_write [kernel] (ffffffff812237b7)
vfs_write [kernel] (ffffffff81223ec6)
sys_write [kernel] (ffffffff81224b85)
entry_SYSCALL_64_fastpath [kernel] (ffffffff8178182e)
8192 bytes in 1 allocations from stack
alloc_and_copy_ftrace_hash.constprop.59 [kernel] (ffffffff8115d17e)
ftrace_set_hash [kernel] (ffffffff8115e767)
ftrace_set_filter_ip [kernel] (ffffffff8115e9a8)
arm_kprobe [kernel] (ffffffff81148600)
enable_kprobe [kernel] (ffffffff811486f6)
kprobe_register [kernel] (ffffffff81182399)
perf_trace_init [kernel] (ffffffff8117c4e0)
perf_tp_event_init [kernel] (ffffffff81192479)
Here you can see that arming the kprobe to which our eBPF program is attached
consumed 8KB of memory. Loading the BPF program also consumed a couple hundred
bytes (in bpf_prog_load).
memleak stores each allocated block along with its size, timestamp, and the
stack that allocated it. When the block is deleted, this information is freed
to reduce the memory overhead.
To avoid false positives, allocations younger than a certain age (500ms by
default) are not printed. To change this threshold, use the -o switch.
By default, memleak prints its output every 5 seconds. To change this
interval, pass the interval as a positional parameter to memleak. You can
also control the number of times the output will be printed before exiting.
For example:
# ./memleak.py 1 10
... will print the outstanding allocation statistics every second, for ten
times, and then exit.
memleak may introduce considerable overhead if your application or kernel is
allocating and freeing memory at a very high rate. In that case, you can
control the overhead by sampling every N-th allocation. For example, to sample
roughly 10% of the allocations and print the outstanding allocations every 5
seconds, 3 times before quitting:
# ./memleak.py -p $(pidof allocs) -s 10 5 3
Attaching to malloc and free in pid 2614, Ctrl+C to quit.
[11:16:33] Top 2 stacks with outstanding allocations:
16 bytes in 1 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fdc11ce8790)
[11:16:38] Top 2 stacks with outstanding allocations:
16 bytes in 1 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fdc11ce8790)
[11:16:43] Top 2 stacks with outstanding allocations:
32 bytes in 2 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fdc11ce8790)
Note that even though the application leaks 16 bytes of memory every second,
the report (printed every 5 seconds) doesn't "see" all the allocations because
of the sampling rate applied.
USAGE message:
# ./memleak.py -h
usage: memleak.py [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
[-s SAMPLE_RATE] [-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE]
[-Z MAX_SIZE]
[interval] [count]
Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with malloc/free and kernel-mode
allocations made with kmalloc/kfree.
positional arguments:
interval interval in seconds to print outstanding allocations
count number of times to print the report before exiting
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID the PID to trace; if not specified, trace kernel
allocs
-t, --trace print trace messages for each alloc/free call
-a, --show-allocs show allocation addresses and sizes as well as call
stacks
-o OLDER, --older OLDER
prune allocations younger than this age in
milliseconds
-c COMMAND, --command COMMAND
execute and trace the specified command
-s SAMPLE_RATE, --sample-rate SAMPLE_RATE
sample every N-th allocation to decrease the overhead
-d STACK_DEPTH, --stack-depth STACK_DEPTH
maximum stack depth to capture
-T TOP, --top TOP display only this many top allocating stacks (by size)
-z MIN_SIZE, --min-size MIN_SIZE
capture only allocations larger than this size
-Z MAX_SIZE, --max-size MAX_SIZE
capture only allocations smaller than this size
EXAMPLES:
./memleak.py -p $(pidof allocs)
Trace allocations and display a summary of "leaked" (outstanding)
allocations every 5 seconds
./memleak.py -p $(pidof allocs) -t
Trace allocations and display each individual call to malloc/free
./memleak.py -ap $(pidof allocs) 10
Trace allocations and display allocated addresses, sizes, and stacks
every 10 seconds for outstanding allocations
./memleak.py -c "./allocs"
Run the specified command and trace its allocations
./memleak.py
Trace allocations in kernel mode and display a summary of outstanding
allocations every 5 seconds
./memleak.py -o 60000
Trace allocations in kernel mode and display a summary of outstanding
allocations that are at least one minute (60 seconds) old
./memleak.py -s 5
Trace roughly every 5th allocation, to reduce overhead
...@@ -743,4 +743,4 @@ examples:
./offcputime 5 # trace for 5 seconds only
./offcputime -f 5 # 5 seconds, and output in folded format
./offcputime -u # don't include kernel threads (user only)
./offcputime -p 185 # trace for PID 185 only
...@@ -165,7 +165,7 @@ via vfs_read() and the other doing a link_path_walk(). There is also a vmstat(8)
stack showing it sleeping between intervals, and an sshd(8) stack showing it
waiting on a file descriptor for input.
The stack shown at the bottom is the off-CPU stack belonging to the task name
shown after "target:". Then there is a separator, "-", and above it the waker
stack and the waker task name after "waker:". The wakeup stack is printed
in reverse order.
...
#!/usr/bin/env python
#
# oomkill Trace oom_kill_process(). For Linux, uses BCC, eBPF.
#
# This traces the kernel out-of-memory killer, and prints basic details,
# including the system load averages. This can provide more context on the
# system state at the time of OOM: was it getting busier or steady, based
# on the load averages? This tool may also be useful to customize for
# investigations; for example, by adding other task_struct details at the time
# of OOM.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from bcc import BPF
from time import strftime
import ctypes as ct
# linux stats
loadavg = "/proc/loadavg"
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/oom.h>
struct data_t {
u64 fpid;
u64 tpid;
u64 pages;
char fcomm[TASK_COMM_LEN];
char tcomm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);
void kprobe__oom_kill_process(struct pt_regs *ctx, struct oom_control *oc,
struct task_struct *p, unsigned int points, unsigned long totalpages)
{
struct data_t data = {};
u32 pid = bpf_get_current_pid_tgid();
data.fpid = pid;
data.tpid = p->pid;
data.pages = totalpages;
bpf_get_current_comm(&data.fcomm, sizeof(data.fcomm));
bpf_probe_read(&data.tcomm, sizeof(data.tcomm), p->comm);
events.perf_submit(ctx, &data, sizeof(data));
}
"""
# kernel->user event data: struct data_t
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("fpid", ct.c_ulonglong),
("tpid", ct.c_ulonglong),
("pages", ct.c_ulonglong),
("fcomm", ct.c_char * TASK_COMM_LEN),
("tcomm", ct.c_char * TASK_COMM_LEN)
]
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
with open(loadavg) as stats:
avgline = stats.read().rstrip()
print(("%s Triggered by PID %d (\"%s\"), OOM kill of PID %d (\"%s\")"
", %d pages, loadavg: %s") % (strftime("%H:%M:%S"), event.fpid,
event.fcomm, event.tpid, event.tcomm, event.pages, avgline))
# initialize BPF
b = BPF(text=bpf_text)
print("Tracing OOM kills... Ctrl-C to stop.")
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
Demonstrations of oomkill, the Linux eBPF/bcc version.
oomkill is a simple program that traces the Linux out-of-memory (OOM) killer,
and shows basic details on one line per OOM kill:
# ./oomkill
Tracing OOM kills... Ctrl-C to stop.
21:03:39 Triggered by PID 3297 ("ntpd"), OOM kill of PID 22516 ("perl"), 3850642 pages, loadavg: 0.99 0.39 0.30 3/282 22724
21:03:48 Triggered by PID 22517 ("perl"), OOM kill of PID 22517 ("perl"), 3850642 pages, loadavg: 0.99 0.41 0.30 2/282 22932
The first line shows that PID 22516, with process name "perl", was OOM killed
when it reached 3850642 pages (usually 4 Kbytes per page). This OOM kill
happened to be triggered by PID 3297, process name "ntpd", doing some memory
allocation.
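At the usual 4 Kbytes per page, 3850642 pages works out to roughly 15 Gbytes.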
The system log (dmesg) shows pages of details and system context about an OOM
kill. What it currently lacks, however, is context on how the system had been
changing over time. I've seen OOM kills where I wanted to know if the system
was at steady state at the time, or if there had been a recent increase in
workload that triggered the OOM event. oomkill provides some context: at the
end of the line is the load average information from /proc/loadavg. For both
of the oomkills here, we can see that the system was getting busier at the
time (a higher 1 minute "average" of 0.99, compared to the 15 minute "average"
of 0.30).
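(The remaining /proc/loadavg fields are the count of runnable tasks over the
total task count, and the most recently assigned PID: "3/282 22724" above.)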
oomkill can also be the basis of other tools and customizations. For example,
you can edit it to include other task_struct details from the target PID at
the time of the OOM kill.
The following commands can be used to test this program, and invoke a memory
consuming process that exhausts system memory and is OOM killed:
sysctl -w vm.overcommit_memory=1 # always overcommit
perl -e 'while (1) { $a .= "A" x 1024; }' # eat all memory
WARNING: This exhausts system memory after disabling some overcommit checks.
Only test in a lab environment.
...@@ -63,7 +63,7 @@ int kretprobe__sys_open(struct pt_regs *ctx)
        return 0;
    }
    bpf_trace_printk("%d %s\\n", ret, *filenamep);
    args_filename.delete(&pid);
    return 0;
...@@ -90,7 +90,7 @@ start_ts = 0
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    (ret_s, filename) = msg.split(" ", 1)
    ret = int(ret_s)
    if (args.failed and (ret >= 0)):
...
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# runqlat Run queue (scheduler) latency as a histogram.
# For Linux, uses BCC, eBPF.
#
# USAGE: runqlat [-h] [-T] [-m] [-P] [-p PID] [interval] [count]
#
# This measures the time a task spends waiting on a run queue for a turn
# on-CPU, and shows this time as a histogram. This time should be small, but a
# task may need to wait its turn due to CPU load.
#
# This measures two types of run queue latency:
# 1. The time from a task being enqueued on a run queue to its context switch
# and execution. This traces enqueue_task_*() -> finish_task_switch(),
# and instruments the run queue latency after a voluntary context switch.
# 2. The time from when a task was involuntarily context switched and still
# in the runnable state, to when it next executed. This is instrumented
# from finish_task_switch() alone.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 07-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
# arguments
examples = """examples:
./runqlat # summarize run queue latency as a histogram
./runqlat 1 10 # print 1 second summaries, 10 times
./runqlat -mT 1 # 1s summaries, milliseconds, and timestamps
./runqlat -P # show each PID separately
./runqlat -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Summarize run queue (scheduler) latency as a histogram",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-m", "--milliseconds", action="store_true",
help="millisecond histogram")
parser.add_argument("-P", "--pids", action="store_true",
help="print a histogram per process ID")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("interval", nargs="?", default=99999999,
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
typedef struct pid_key {
u64 pid; // work around: u64 (rather than u32) avoids a padding hole in the key
u64 slot;
} pid_key_t;
BPF_HASH(start, u32);
STORAGE
struct rq;
// record enqueue timestamp
int trace_enqueue(struct pt_regs *ctx, struct rq *rq, struct task_struct *p,
int flags)
{
u32 pid = p->pid;
if (FILTER)
return 0;
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
// calculate latency
int trace_run(struct pt_regs *ctx, struct task_struct *prev)
{
u32 pid;
// ivcsw: treat like an enqueue event and store timestamp
if (prev->state == TASK_RUNNING) {
pid = prev->pid;
if (!(FILTER)) {
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
}
}
pid = bpf_get_current_pid_tgid();
if (FILTER)
return 0;
u64 *tsp, delta;
// fetch timestamp and calculate delta
tsp = start.lookup(&pid);
if (tsp == 0) {
return 0; // missed enqueue
}
delta = bpf_ktime_get_ns() - *tsp;
FACTOR
// store as histogram
STORE
start.delete(&pid);
return 0;
}
"""
# code substitutions
if args.pid:
bpf_text = bpf_text.replace('FILTER', 'pid != %s' % args.pid)
else:
bpf_text = bpf_text.replace('FILTER', '0')
if args.milliseconds:
bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;')
label = "msecs"
else:
bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;')
label = "usecs"
if args.pids:
bpf_text = bpf_text.replace('STORAGE',
'BPF_HISTOGRAM(dist, pid_key_t);')
bpf_text = bpf_text.replace('STORE',
'pid_key_t key = {.pid = pid, .slot = bpf_log2l(delta)}; ' +
'dist.increment(key);')
else:
bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
bpf_text = bpf_text.replace('STORE',
'dist.increment(bpf_log2l(delta));')
if debug:
print(bpf_text)
# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event_re="enqueue_task_*", fn_name="trace_enqueue")
b.attach_kprobe(event="finish_task_switch", fn_name="trace_run")
print("Tracing run queue latency... Hit Ctrl-C to end.")
# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
try:
sleep(int(args.interval))
except KeyboardInterrupt:
exiting = 1
print()
if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
dist.print_log2_hist(label, "pid", section_print_fn=int)
dist.clear()
countdown -= 1
if exiting or countdown == 0:
exit()
Demonstrations of runqlat, the Linux eBPF/bcc version.
This program summarizes scheduler run queue latency as a histogram, showing
how long tasks spent waiting their turn to run on-CPU.
Here is a heavily loaded system:
# ./runqlat
Tracing run queue latency... Hit Ctrl-C to end.
^C
usecs : count distribution
0 -> 1 : 233 |*********** |
2 -> 3 : 742 |************************************ |
4 -> 7 : 203 |********** |
8 -> 15 : 173 |******** |
16 -> 31 : 24 |* |
32 -> 63 : 0 | |
64 -> 127 : 30 |* |
128 -> 255 : 6 | |
256 -> 511 : 3 | |
512 -> 1023 : 5 | |
1024 -> 2047 : 27 |* |
2048 -> 4095 : 30 |* |
4096 -> 8191 : 20 | |
8192 -> 16383 : 29 |* |
16384 -> 32767 : 809 |****************************************|
32768 -> 65535 : 64 |*** |
The distribution is bimodal, with one mode between 0 and 15 microseconds,
and another between 16 and 65 milliseconds. These modes are visible as the
spikes in the ASCII distribution (which is merely a visual representation
of the "count" column). As an example of reading one line: 809 events fell
into the 16384 to 32767 microsecond range (16 to 32 ms) while tracing.
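For reference, each event's row is chosen by bpf_log2l(delta), an integer log2
of the latency. A quick way to check which row a latency lands in (plain
Python mirroring that bucketing; kernel-side edge cases may differ slightly):

    import math
    delta = 20000                              # 20 ms in usecs
    slot = int(math.log(delta, 2))             # 14
    print(1 << slot, (1 << (slot + 1)) - 1)    # bucket range: 16384 32767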
I would expect the two modes to be due to the workload: 16 hot CPU-bound threads,
and many other mostly idle threads doing occasional work. I suspect the mostly
idle threads will run with a higher priority when they wake up, and are
the reason for the low latency mode. The high latency mode will be the
CPU-bound threads. More analysis with this and other tools can confirm.
A -m option can be used to show milliseconds instead, as well as an interval
and a count. For example, showing three five-second summaries in milliseconds:
# ./runqlat -m 5 3
Tracing run queue latency... Hit Ctrl-C to end.
msecs : count distribution
0 -> 1 : 3818 |****************************************|
2 -> 3 : 39 | |
4 -> 7 : 39 | |
8 -> 15 : 62 | |
16 -> 31 : 2214 |*********************** |
32 -> 63 : 226 |** |
msecs : count distribution
0 -> 1 : 3775 |****************************************|
2 -> 3 : 52 | |
4 -> 7 : 37 | |
8 -> 15 : 65 | |
16 -> 31 : 2230 |*********************** |
32 -> 63 : 212 |** |
msecs : count distribution
0 -> 1 : 3816 |****************************************|
2 -> 3 : 49 | |
4 -> 7 : 40 | |
8 -> 15 : 53 | |
16 -> 31 : 2228 |*********************** |
32 -> 63 : 221 |** |
This shows a similar distribution across the three summaries.
A -p option can be used to show one PID only, which is filtered in kernel for
efficiency. For example, PID 4505, and one second summaries:
# ./runqlat -mp 4505 1
Tracing run queue latency... Hit Ctrl-C to end.
msecs : count distribution
0 -> 1 : 1 |* |
2 -> 3 : 2 |*** |
4 -> 7 : 1 |* |
8 -> 15 : 0 | |
16 -> 31 : 25 |****************************************|
32 -> 63 : 3 |**** |
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 2 |** |
4 -> 7 : 0 | |
8 -> 15 : 1 |* |
16 -> 31 : 30 |****************************************|
32 -> 63 : 1 |* |
msecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 28 |****************************************|
32 -> 63 : 2 |** |
msecs : count distribution
0 -> 1 : 1 |* |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 27 |****************************************|
32 -> 63 : 4 |***** |
[...]
For comparison, here is pidstat(1) for that process:
# pidstat -p 4505 1
Linux 4.4.0-virtual (bgregg-xxxxxxxx) 02/08/2016 _x86_64_ (8 CPU)
08:56:11 AM UID PID %usr %system %guest %CPU CPU Command
08:56:12 AM 0 4505 9.00 3.00 0.00 12.00 0 bash
08:56:13 AM 0 4505 7.00 5.00 0.00 12.00 0 bash
08:56:14 AM 0 4505 10.00 2.00 0.00 12.00 0 bash
08:56:15 AM 0 4505 11.00 2.00 0.00 13.00 0 bash
08:56:16 AM 0 4505 9.00 3.00 0.00 12.00 0 bash
[...]
This is a synthetic workload that is CPU bound. It's only spending 12% on-CPU
each second because of high CPU demand on this server: the remaining time
is spent waiting on a run queue, as visualized by runqlat. The numbers are
consistent: the one-second summaries above showed roughly 25 to 30 waits in
the 16-31 ms bucket, which works out to around 0.4 to 0.9 seconds of queueing
per second.
Here is the same system, but when it is CPU idle:
# ./runqlat 5 1
Tracing run queue latency... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 2250 |******************************** |
2 -> 3 : 2340 |********************************** |
4 -> 7 : 2746 |****************************************|
8 -> 15 : 418 |****** |
16 -> 31 : 93 |* |
32 -> 63 : 28 | |
64 -> 127 : 119 |* |
128 -> 255 : 9 | |
256 -> 511 : 4 | |
512 -> 1023 : 20 | |
1024 -> 2047 : 22 | |
2048 -> 4095 : 5 | |
4096 -> 8191 : 2 | |
Back to a microsecond scale, this time there is little run queue latency past 1
millisecond, as would be expected.
Now 16 threads are performing heavy disk I/O:
# ./runqlat 5 1
Tracing run queue latency... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 204 | |
2 -> 3 : 944 |* |
4 -> 7 : 16315 |********************* |
8 -> 15 : 29897 |****************************************|
16 -> 31 : 1044 |* |
32 -> 63 : 23 | |
64 -> 127 : 128 | |
128 -> 255 : 24 | |
256 -> 511 : 5 | |
512 -> 1023 : 13 | |
1024 -> 2047 : 15 | |
2048 -> 4095 : 13 | |
4096 -> 8191 : 10 | |
The distribution hasn't changed too much. While the disks are 100% busy, there
is still plenty of CPU headroom, and threads still don't spend much time
waiting their turn.
A -P option will print a distribution for each PID:
# ./runqlat -P
Tracing run queue latency... Hit Ctrl-C to end.
^C
pid = 0
usecs : count distribution
0 -> 1 : 351 |******************************** |
2 -> 3 : 96 |******** |
4 -> 7 : 437 |****************************************|
8 -> 15 : 12 |* |
16 -> 31 : 10 | |
32 -> 63 : 0 | |
64 -> 127 : 16 |* |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 1 | |
pid = 12929
usecs : count distribution
0 -> 1 : 1 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************|
pid = 12930
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 1 |****************************************|
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 12931
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |******************** |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 2 |****************************************|
pid = 12932
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 1 |****************************************|
256 -> 511 : 0 | |
512 -> 1023 : 1 |****************************************|
pid = 7
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 426 |************************************* |
4 -> 7 : 457 |****************************************|
8 -> 15 : 16 |* |
pid = 9
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 425 |****************************************|
8 -> 15 : 16 |* |
pid = 11
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 10 |****************************************|
pid = 14
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 8 |****************************************|
4 -> 7 : 2 |********** |
pid = 18
usecs : count distribution
0 -> 1 : 414 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 20 |* |
8 -> 15 : 8 | |
pid = 12928
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************|
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 1867
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 15 |****************************************|
16 -> 31 : 1 |** |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 4 |********** |
pid = 1871
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 2 |****************************************|
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |******************** |
pid = 1876
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 1 |****************************************|
pid = 1878
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 3 |****************************************|
pid = 1880
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 |****************************************|
pid = 9307
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
pid = 1886
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |******************** |
8 -> 15 : 2 |****************************************|
pid = 1888
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 |****************************************|
pid = 3297
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
pid = 1892
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 1 |******************** |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 2 |****************************************|
pid = 7024
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 4 |****************************************|
pid = 16468
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 |****************************************|
pid = 12922
usecs : count distribution
0 -> 1 : 1 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
16 -> 31 : 1 |****************************************|
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 12923
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1 |******************** |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 2 |****************************************|
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |******************** |
1024 -> 2047 : 1 |******************** |
pid = 12924
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 2 |******************** |
8 -> 15 : 4 |****************************************|
16 -> 31 : 1 |********** |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 1 |********** |
pid = 12925
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 |****************************************|
pid = 12926
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 1 |****************************************|
4 -> 7 : 0 | |
8 -> 15 : 1 |****************************************|
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1 |****************************************|
pid = 12927
usecs : count distribution
0 -> 1 : 1 |****************************************|
2 -> 3 : 0 | |
4 -> 7 : 1 |****************************************|
USAGE message:
# ./runqlat -h
usage: runqlat [-h] [-T] [-m] [-P] [-p PID] [interval] [count]
Summarize run queue (scheduler) latency as a histogram
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-T, --timestamp include timestamp on output
-m, --milliseconds millisecond histogram
-P, --pids print a histogram per process ID
-p PID, --pid PID trace this PID only
examples:
./runqlat # summarize run queue latency as a histogram
./runqlat 1 10 # print 1 second summaries, 10 times
./runqlat -mT 1 # 1s summaries, milliseconds, and timestamps
./runqlat -P # show each PID separately
./runqlat -p 185 # trace PID 185 only
...
@@ -60,7 +60,7 @@ net_rx_action 15656
 This can be useful for quantifying where CPU cycles are spent among the soft
 interrupts (summarized as the %softirq column from mpstat(1), and shown as
 event counts in /proc/softirqs). The output above shows that most time was spent
-processing net_rx_action(), which was around 15 milleconds per second (total
+processing net_rx_action(), which was around 15 milliseconds per second (total
 time across all CPUs).
...
...
@@ -376,7 +376,7 @@ Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
 Detaching...
 If it wasn't clear how one function called another, knowing the instruction
-offset can help you locate the lines of code from a dissassembly dump.
+offset can help you locate the lines of code from a disassembly dump.
 A wildcard can also be used. Eg, all functions beginning with "tcp_send":
...
...
@@ -3,7 +3,7 @@ Demonstrations of stacksnoop, the Linux eBPF/bcc version.
 This program traces the given kernel function and prints the kernel stack trace
 for every call. This tool is useful for studying low frequency kernel functions,
-to see how they were invoked. For exmaple, tracing the ext4_sync_fs() call:
+to see how they were invoked. For example, tracing the ext4_sync_fs() call:
 # ./stacksnoop ext4_sync_fs
 TIME(s)       STACK
...
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# statsnoop Trace stat() syscalls.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: statsnoop [-h] [-t] [-x] [-p PID]
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 08-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
# arguments
examples = """examples:
./statsnoop # trace all stat() syscalls
./statsnoop -t # include timestamps
./statsnoop -x # only show failed stats
./statsnoop -p 181 # only trace PID 181
"""
parser = argparse.ArgumentParser(
description="Trace stat() syscalls",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-t", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-x", "--failed", action="store_true",
help="only show failed stats")
parser.add_argument("-p", "--pid",
help="trace this PID only")
args = parser.parse_args()
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
BPF_HASH(args_filename, u32, const char *);
int trace_entry(struct pt_regs *ctx, const char __user *filename)
{
u32 pid = bpf_get_current_pid_tgid();
FILTER
args_filename.update(&pid, &filename);
return 0;
}
int trace_return(struct pt_regs *ctx)
{
const char **filenamep;
int ret = ctx->ax;
u32 pid = bpf_get_current_pid_tgid();
filenamep = args_filename.lookup(&pid);
if (filenamep == 0) {
// missed entry
return 0;
}
bpf_trace_printk("%d %s\\n", ret, *filenamep);
args_filename.delete(&pid);
return 0;
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
'if (pid != %s) { return 0; }' % args.pid)
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
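# stat() has several kernel entry points: sys_newstat is the modern stat(2)
# syscall on x86_64, sys_stat is the legacy variant, and sys_statfs serves
# statfs(2); all three are traced so that no variant is missed.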
b.attach_kprobe(event="sys_stat", fn_name="trace_entry")
b.attach_kprobe(event="sys_statfs", fn_name="trace_entry")
b.attach_kprobe(event="sys_newstat", fn_name="trace_entry")
b.attach_kretprobe(event="sys_stat", fn_name="trace_return")
b.attach_kretprobe(event="sys_statfs", fn_name="trace_return")
b.attach_kretprobe(event="sys_newstat", fn_name="trace_return")
# header
if args.timestamp:
print("%-14s" % ("TIME(s)"), end="")
print("%-6s %-16s %4s %3s %s" % ("PID", "COMM", "FD", "ERR", "PATH"))
start_ts = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
(ret_s, filename) = msg.split(" ", 1)
ret = int(ret_s)
if (args.failed and (ret >= 0)):
continue
# split return value into FD and errno columns
if ret >= 0:
fd_s = ret
err = 0
else:
fd_s = "-1"
err = -ret
# print columns
if args.timestamp:
if start_ts == 0:
start_ts = ts
print("%-14.9f" % (ts - start_ts), end="")
print("%-6d %-16s %4s %3s %s" % (pid, task, fd_s, err, filename))
Demonstrations of statsnoop, the Linux eBPF/bcc version.
statsnoop traces the different stat() syscalls system-wide, and prints various
details. Example output:
# ./statsnoop
PID COMM FD ERR PATH
31126 bash 0 0 .
31126 bash -1 2 /usr/local/sbin/iconfig
31126 bash -1 2 /usr/local/bin/iconfig
31126 bash -1 2 /usr/sbin/iconfig
31126 bash -1 2 /usr/bin/iconfig
31126 bash -1 2 /sbin/iconfig
31126 bash -1 2 /bin/iconfig
31126 bash -1 2 /usr/games/iconfig
31126 bash -1 2 /usr/local/games/iconfig
31126 bash -1 2 /apps/python/bin/iconfig
31126 bash -1 2 /mnt/src/llvm/build/bin/iconfig
8902 command-not-fou -1 2 /usr/bin/Modules/Setup
8902 command-not-fou -1 2 /usr/bin/lib/python3.4/os.py
8902 command-not-fou -1 2 /usr/bin/lib/python3.4/os.pyc
8902 command-not-fou 0 0 /usr/lib/python3.4/os.py
8902 command-not-fou -1 2 /usr/bin/pybuilddir.txt
8902 command-not-fou -1 2 /usr/bin/lib/python3.4/lib-dynload
8902 command-not-fou 0 0 /usr/lib/python3.4/lib-dynload
8902 command-not-fou 0 0 /apps/python/lib/python2.7/site-packages
8902 command-not-fou 0 0 /apps/python/lib/python2.7/site-packages
8902 command-not-fou 0 0 /apps/python/lib/python2.7/site-packages
8902 command-not-fou 0 0 /usr/lib/python3.4/
8902 command-not-fou 0 0 /usr/lib/python3.4/
[...]
This output has caught me mistyping a command in another shell, "iconfig"
instead of "ifconfig". The first several lines show the bash shell searching
the $PATH, and failing to find it (ERR == 2 is file not found). Then, a
"command-not-found" program executes (the name is truncated to 16 characters
in the COMM field), which begins the process of searching for and suggesting
a package. ie, this:
# iconfig
No command 'iconfig' found, did you mean:
Command 'vconfig' from package 'vlan' (main)
Command 'fconfig' from package 'redboot-tools' (universe)
Command 'mconfig' from package 'mono-devel' (main)
Command 'iwconfig' from package 'wireless-tools' (main)
Command 'zconfig' from package 'python-zconfig' (universe)
Command 'ifconfig' from package 'net-tools' (main)
iconfig: command not found
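The ERR column is the raw errno value. To decode one (standard Python, shown
here only for reference):

    import errno, os
    print("%s: %s" % (errno.errorcode[2], os.strerror(2)))  # ENOENT: No such file or directory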
statsnoop can be used for general debugging, to see what file information has
been requested, and whether those files exist. It can be used as a companion
to opensnoop, which shows what files were actually opened.
USAGE message:
# ./statsnoop -h
usage: statsnoop [-h] [-t] [-x] [-p PID]
Trace stat() syscalls
optional arguments:
-h, --help show this help message and exit
-t, --timestamp include timestamp on output
-x, --failed only show failed stats
-p PID, --pid PID trace this PID only
examples:
./statsnoop # trace all stat() syscalls
./statsnoop -t # include timestamps
./statsnoop -x # only show failed stats
./statsnoop -p 181 # only trace PID 181
...
@@ -467,4 +467,4 @@ examples:
     ./wakeuptime 5        # trace for 5 seconds only
     ./wakeuptime -f 5     # 5 seconds, and output in folded format
     ./wakeuptime -u       # don't include kernel threads (user only)
-    ./wakeuptime -p 185   # trace fo PID 185 only
+    ./wakeuptime -p 185   # trace for PID 185 only