Commit 07175d05 authored by Sasha Goldshtein, committed by 4ast

stackcount: Support uprobes, tracepoints, and USDT (#730)

* stackcount: Support user-space functions

Add support for user-space functions in `stackcount` by taking an additional
`-l` command-line parameter specifying the name of the user-space library.
When a user-space library is specified, `stackcount` attaches to a specific
process and traces a user-space function with user-space stacks only.
Regex support for uprobes (similar to what is available for kprobes) is
not currently provided.

Also add a couple of functions to the `BPF` object for consistency.

* bcc: Support regex in attach_uprobe

attach_kprobe allows a regular expression for the function name,
while attach_uprobe does not. Add support in libbcc for enumerating
all the function symbols in a binary, and use that in the BPF module
to attach uprobes according to a regular expression. For example:

```python
bpf = BPF(text="...")
bpf.attach_uprobe(name="c", sym_re=".*write$", fn_name="probe")
```

* python: Support regex in attach_tracepoint

Modify attach_tracepoint to take a regex argument, in which case
it enumerates all tracepoints matching that regex and attaches to
all of them. The logic for enumerating tracepoints should eventually
belong in libbcc and be shared across all the tools (tplist, trace
and so on).
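
The enumeration described above can be sketched without bcc by walking any
directory laid out like the tracefs `events` tree (one subdirectory per
category, one per event). This is an illustration only: `list_tracepoints`
and its `events_dir` parameter are hypothetical names, not part of the bcc API.

```python
import os
import re


def list_tracepoints(events_dir, tp_re):
    """Return the 'category:event' names under events_dir that match tp_re."""
    results = []
    for category in sorted(os.listdir(events_dir)):
        cat_dir = os.path.join(events_dir, category)
        if not os.path.isdir(cat_dir):
            continue  # skip plain files such as 'enable'
        for event in sorted(os.listdir(cat_dir)):
            # Only directories are events; files here are control knobs.
            if os.path.isdir(os.path.join(cat_dir, event)):
                tp = "%s:%s" % (category, event)
                if re.match(tp_re, tp):
                    results.append(tp)
    return results
```

Because `re.match` anchors only at the start of the string, a caller that
wants an exact match has to end the expression with `$`.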

* cc: Fix termination condition bug in symbol enumeration

bcc_elf would not terminate the enumeration correctly when the
user-provided callback returned -1 but there were still more
sections remaining in the ELF to be enumerated.
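
The control flow of the fix can be modeled in Python, assuming sections are
plain lists of `(name, addr)` pairs. The function names below are
illustrative stand-ins for the C helpers, and the error path for a failed
ELF read is left out of the model.

```python
def list_in_section(symbols, callback):
    """Return 1 if the callback asked to stop, 0 if the section was exhausted."""
    for name, addr in symbols:
        if callback(name, addr) < 0:
            return 1  # signal termination to the caller
    return 0


def list_symbols(sections, callback):
    """Visit symbols section by section until the callback returns < 0."""
    for symbols in sections:
        if list_in_section(symbols, callback) == 1:
            break  # callback signaled termination; skip remaining sections
    return 0


seen = []


def stop_at_b(name, addr):
    seen.append(name)
    return -1 if name == "b" else 0


list_symbols([[("a", 1), ("b", 2)], [("c", 3)]], stop_at_b)
```

With a plain `break` in the inner loop and no return code, the outer loop
would move on to the next section and `c` would still be visited.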

* stackcount: Support uprobes and tracepoints

Refactored stackcount and added support for uprobes and tracepoints,
which also required changes to the BPF module. USDT support still
pending.

* bcc: Refactor symbol listing to use foreach-style

Refactor symbol listing from paging style to foreach-style with a
callback function per-symbol. Even though we're now performing a
callback from C to Python for each symbol, this is preferable to the
paging approach because we need all the symbols in the current use
case.
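
A minimal sketch of the per-symbol callback, assuming Python 3 and using
`ctypes` alone. The loop over `symbols` stands in for the C side invoking
the callback, and `collect_addresses` is a hypothetical helper rather than
a bcc function.

```python
import ctypes as ct
import re

# Same shape as the C callback: int cb(const char *symname, uint64_t addr)
SYM_CB_TYPE = ct.CFUNCTYPE(ct.c_int, ct.c_char_p, ct.c_ulonglong)


def collect_addresses(symbols, sym_re):
    """Return the unique addresses whose symbol name matches sym_re."""
    addresses = []

    def sym_cb(sym_name, addr):
        # ctypes hands c_char_p arguments to Python as bytes.
        if re.match(sym_re, sym_name.decode()) and addr not in addresses:
            addresses.append(addr)
        return 0  # a negative value would ask the enumerator to stop

    cb = SYM_CB_TYPE(sym_cb)
    for name, addr in symbols:  # stand-in for the C enumeration loop
        cb(name, addr)
    return addresses
```

Collecting addresses rather than names keeps the result free of duplicates
when the same symbol appears in more than one symbol table.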

Also refactored `stackcount` slightly; only missing support for USDT
probes now.

* stackcount: Support per-process displays

For user-space functions, or when requested for kernel-space
functions or tracepoints, group the output by process. Toggled
with the -P switch, off by default (except for user-space).
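
Grouping by process means the map key carries the process ID next to the
stack ID. The sketch below models that in plain Python, assuming each event
is a `(pid_tgid, stackid)` pair in the layout returned by
`bpf_get_current_pid_tgid()` (process ID in the upper 32 bits).
`count_stacks` and `NO_PID` are illustrative names.

```python
from collections import Counter

NO_PID = 0xffffffff  # placeholder key used when stacks are not split by process


def count_stacks(events, per_pid):
    """Count (pid, stackid) pairs; collapse all pids together unless per_pid."""
    counts = Counter()
    for pid_tgid, stackid in events:
        pid = (pid_tgid >> 32) if per_pid else NO_PID
        counts[(pid, stackid)] += 1
    return counts
```

Two threads of the same process share the upper 32 bits, so their stacks are
counted under one key.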

* Fix rebase issues, print pid only when there is one

* stackcount: Add USDT support

Now, stackcount supports USDT tracepoints in addition to
kernel functions, user functions, and kernel tracepoints.
The format is the same as with the other general-purpose
tools (argdist, trace):

```
stackcount -p $(pidof node) u:node:gc*
stackcount -p 185 u:pthread:pthread_create
```
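
The way a pattern splits into probe type, library, and target can be
sketched as a standalone function. `parse_pattern` is a hypothetical name,
and the sketch omits the library path lookup that the tool performs for
user-space and USDT probes.

```python
def parse_pattern(pattern, use_regex=False):
    """Split a stackcount pattern into (type, library, target regex)."""
    parts = pattern.split(':')
    if len(parts) == 1:
        parts = ["p", "", parts[0]]          # func
    elif len(parts) == 2:
        parts = ["p", parts[0], parts[1]]    # lib:func
    elif len(parts) == 3:
        if parts[0] == "t":
            # t:cat:event -> the target is 'cat:event', no library
            parts = ["t", "", "%s:%s" % tuple(parts[1:])]
        if parts[0] not in ["p", "t", "u"]:
            raise ValueError("Type must be 'p', 't', or 'u', but got %s" %
                             parts[0])
    else:
        raise ValueError("Too many ':'-separated components in pattern %s" %
                         pattern)
    probe_type, library, target = parts
    if not use_regex:
        # Turn file-glob wildcards into an anchored regular expression.
        target = '^' + target.replace('*', '.*') + '$'
    return probe_type, library, target
```

For instance, `parse_pattern("u:node:gc*")` yields the type `u`, the library
`node`, and the anchored expression `^gc.*$`.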

* stackcount: Update examples and man page

Add examples and man page documentation for kernel
tracepoints, USDT tracepoints, and other features.

* stackcount: Change printing format slightly

When -p is specified, don't print the comm and pid. Also,
when -P is specified for kernel probes (kprobes and
tracepoints), use -1 for symbol resolution so that we
don't try to resolve kernel functions as user symbols.
Finally, print the comm and pid at the end of the stack
output and not at the beginning.
parent ba404cfe
.TH stackcount 8 "2016-01-14" "USER COMMANDS" .TH stackcount 8 "2016-01-14" "USER COMMANDS"
.SH NAME .SH NAME
stackcount \- Count kernel function calls and their stack traces. Uses Linux eBPF/bcc. stackcount \- Count function calls and their stack traces. Uses Linux eBPF/bcc.
.SH SYNOPSIS .SH SYNOPSIS
.B stackcount [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-r] pattern .B stackcount [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-r] [\-s]
[\-P] [\-v] [\-d] pattern
.SH DESCRIPTION .SH DESCRIPTION
stackcount traces kernel functions and frequency counts them with their entire stackcount traces functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. This allows higher stack trace, summarized in-kernel for efficiency. This allows higher
frequency events to be studied. The output consists of unique stack traces, frequency events to be studied. The output consists of unique stack traces,
and their occurrence counts. and their occurrence counts. In addition to kernel and user functions, kernel
tracepoints and USDT tracepoints are also supported.
The pattern is a string with optional '*' wildcards, similar to file globbing. The pattern is a string with optional '*' wildcards, similar to file globbing.
If you'd prefer to use regular expressions, use the \-r option. If you'd prefer to use regular expressions, use the \-r option.
...@@ -35,14 +37,18 @@ Include a timestamp with interval output. ...@@ -35,14 +37,18 @@ Include a timestamp with interval output.
\-v \-v
Show raw addresses. Show raw addresses.
.TP .TP
\-d
Print the source of the BPF program when loading it (for debugging purposes).
.TP
\-i interval \-i interval
Summary interval, in seconds. Summary interval, in seconds.
.TP .TP
\-p PID \-p PID
Trace this process ID only (filtered in-kernel). Trace this process ID only (filtered in-kernel).
.TP .TP
.TP
pattern pattern
A kernel function name, or a search pattern. Can include wildcards ("*"). If the A function name, or a search pattern. Can include wildcards ("*"). If the
\-r option is used, can include regular expressions. \-r option is used, can include regular expressions.
.SH EXAMPLES .SH EXAMPLES
.TP .TP
...@@ -77,6 +83,18 @@ Output every 5 seconds, with timestamps: ...@@ -77,6 +83,18 @@ Output every 5 seconds, with timestamps:
Only count stacks when PID 185 is on-CPU: Only count stacks when PID 185 is on-CPU:
# #
.B stackcount -p 185 ip_output .B stackcount -p 185 ip_output
.TP
Count user stacks for dynamic heap allocations with malloc in PID 185:
#
.B stackcount -p 185 c:malloc
.TP
Count user stacks for thread creation (USDT tracepoint) in PID 185:
#
.B stackcount -p 185 u:pthread:pthread_create
.TP
Count kernel stacks for context switch events using a kernel tracepoint:
#
.B stackcount t:sched:sched_switch
.SH OVERHEAD .SH OVERHEAD
This summarizes unique stack traces in-kernel for efficiency, allowing it to This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of function calls than methods that post-process in user trace a higher rate of function calls than methods that post-process in user
...@@ -99,6 +117,6 @@ Linux ...@@ -99,6 +117,6 @@ Linux
.SH STABILITY .SH STABILITY
Unstable - in development. Unstable - in development.
.SH AUTHOR .SH AUTHOR
Brendan Gregg Brendan Gregg, Sasha Goldshtein
.SH SEE ALSO .SH SEE ALSO
stacksnoop(8), funccount(8) stacksnoop(8), funccount(8)
...@@ -165,7 +165,7 @@ static int list_in_scn(Elf *e, Elf_Scn *section, size_t stridx, size_t symsize, ...@@ -165,7 +165,7 @@ static int list_in_scn(Elf *e, Elf_Scn *section, size_t stridx, size_t symsize,
continue; continue;
if (callback(name, sym.st_value, sym.st_size, sym.st_info, payload) < 0) if (callback(name, sym.st_value, sym.st_size, sym.st_info, payload) < 0)
break; return 1; // signal termination to caller
} }
} }
...@@ -184,9 +184,13 @@ static int listsymbols(Elf *e, bcc_elf_symcb callback, void *payload) { ...@@ -184,9 +184,13 @@ static int listsymbols(Elf *e, bcc_elf_symcb callback, void *payload) {
if (header.sh_type != SHT_SYMTAB && header.sh_type != SHT_DYNSYM) if (header.sh_type != SHT_SYMTAB && header.sh_type != SHT_DYNSYM)
continue; continue;
if (list_in_scn(e, section, header.sh_link, header.sh_entsize, callback, int rc = list_in_scn(e, section, header.sh_link, header.sh_entsize,
payload) < 0) callback, payload);
return -1; if (rc == 1)
break; // callback signaled termination
if (rc < 0)
return rc;
} }
return 0; return 0;
......
...@@ -270,6 +270,32 @@ int bcc_find_symbol_addr(struct bcc_symbol *sym) { ...@@ -270,6 +270,32 @@ int bcc_find_symbol_addr(struct bcc_symbol *sym) {
return bcc_elf_foreach_sym(sym->module, _find_sym, sym); return bcc_elf_foreach_sym(sym->module, _find_sym, sym);
} }
struct sym_search_t {
struct bcc_symbol *syms;
int start;
int requested;
int *actual;
};
// see <elf.h>
#define ELF_TYPE_IS_FUNCTION(flags) (((flags) & 0xf) == 2)
static int _list_sym(const char *symname, uint64_t addr, uint64_t end,
int flags, void *payload) {
if (!ELF_TYPE_IS_FUNCTION(flags) || addr == 0)
return 0;
SYM_CB cb = (SYM_CB) payload;
return cb(symname, addr);
}
int bcc_foreach_symbol(const char *module, SYM_CB cb) {
if (module == 0 || cb == 0)
return -1;
return bcc_elf_foreach_sym(module, _list_sym, (void *)cb);
}
int bcc_resolve_symname(const char *module, const char *symname, int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym) { const uint64_t addr, struct bcc_symbol *sym) {
uint64_t load_addr; uint64_t load_addr;
......
...@@ -29,6 +29,8 @@ struct bcc_symbol { ...@@ -29,6 +29,8 @@ struct bcc_symbol {
uint64_t offset; uint64_t offset;
}; };
typedef int(* SYM_CB)(const char *symname, uint64_t addr);
void *bcc_symcache_new(int pid); void *bcc_symcache_new(int pid);
int bcc_symcache_resolve(void *symcache, uint64_t addr, struct bcc_symbol *sym); int bcc_symcache_resolve(void *symcache, uint64_t addr, struct bcc_symbol *sym);
int bcc_symcache_resolve_name(void *resolver, const char *name, uint64_t *addr); int bcc_symcache_resolve_name(void *resolver, const char *name, uint64_t *addr);
...@@ -36,6 +38,7 @@ void bcc_symcache_refresh(void *resolver); ...@@ -36,6 +38,7 @@ void bcc_symcache_refresh(void *resolver);
int bcc_resolve_global_addr(int pid, const char *module, const uint64_t address, int bcc_resolve_global_addr(int pid, const char *module, const uint64_t address,
uint64_t *global); uint64_t *global);
int bcc_foreach_symbol(const char *module, SYM_CB cb);
int bcc_find_symbol_addr(struct bcc_symbol *sym); int bcc_find_symbol_addr(struct bcc_symbol *sym);
int bcc_resolve_symname(const char *module, const char *symname, int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym); const uint64_t addr, struct bcc_symbol *sym);
......
...@@ -25,7 +25,7 @@ import errno ...@@ -25,7 +25,7 @@ import errno
import sys import sys
basestring = (unicode if sys.version_info[0] < 3 else str) basestring = (unicode if sys.version_info[0] < 3 else str)
from .libbcc import lib, _CB_TYPE, bcc_symbol from .libbcc import lib, _CB_TYPE, bcc_symbol, _SYM_CB_TYPE
from .table import Table from .table import Table
from .perf import Perf from .perf import Perf
from .usyms import ProcessSymbols from .usyms import ProcessSymbols
...@@ -531,8 +531,25 @@ class BPF(object): ...@@ -531,8 +531,25 @@ class BPF(object):
res = lib.bcc_procutils_which_so(libname.encode("ascii")) res = lib.bcc_procutils_which_so(libname.encode("ascii"))
return res if res is None else res.decode() return res if res is None else res.decode()
def attach_tracepoint(self, tp="", fn_name="", pid=-1, cpu=0, group_fd=-1): def _get_tracepoints(self, tp_re):
"""attach_tracepoint(tp="", fn_name="", pid=-1, cpu=0, group_fd=-1) results = []
events_dir = os.path.join(TRACEFS, "events")
for category in os.listdir(events_dir):
cat_dir = os.path.join(events_dir, category)
if not os.path.isdir(cat_dir):
continue
for event in os.listdir(cat_dir):
evt_dir = os.path.join(cat_dir, event)
if os.path.isdir(evt_dir):
tp = ("%s:%s" % (category, event))
if re.match(tp_re, tp):
results.append(tp)
return results
def attach_tracepoint(self, tp="", tp_re="", fn_name="", pid=-1,
cpu=0, group_fd=-1):
"""attach_tracepoint(tp="", tp_re="", fn_name="", pid=-1,
cpu=0, group_fd=-1)
Run the bpf function denoted by fn_name every time the kernel tracepoint Run the bpf function denoted by fn_name every time the kernel tracepoint
specified by 'tp' is hit. The optional parameters pid, cpu, and group_fd specified by 'tp' is hit. The optional parameters pid, cpu, and group_fd
...@@ -540,12 +557,24 @@ class BPF(object): ...@@ -540,12 +557,24 @@ class BPF(object):
the tracepoint category and the tracepoint name, separated by a colon. the tracepoint category and the tracepoint name, separated by a colon.
For example: sched:sched_switch, syscalls:sys_enter_bind, etc. For example: sched:sched_switch, syscalls:sys_enter_bind, etc.
Instead of a tracepoint name, a regular expression can be provided in
tp_re. The program will then attach to tracepoints that match the
provided regular expression.
To obtain a list of kernel tracepoints, use the tplist tool or cat the To obtain a list of kernel tracepoints, use the tplist tool or cat the
file /sys/kernel/debug/tracing/available_events. file /sys/kernel/debug/tracing/available_events.
Example: BPF(text).attach_tracepoint("sched:sched_switch", "on_switch") Examples:
BPF(text).attach_tracepoint(tp="sched:sched_switch", fn_name="on_switch")
BPF(text).attach_tracepoint(tp_re="sched:.*", fn_name="on_switch")
""" """
if tp_re:
for tp in self._get_tracepoints(tp_re):
self.attach_tracepoint(tp=tp, fn_name=fn_name, pid=pid,
cpu=cpu, group_fd=group_fd)
return
fn = self.load_func(fn_name, BPF.TRACEPOINT) fn = self.load_func(fn_name, BPF.TRACEPOINT)
(tp_category, tp_name) = tp.split(':') (tp_category, tp_name) = tp.split(':')
res = lib.bpf_attach_tracepoint(fn.fd, tp_category.encode("ascii"), res = lib.bpf_attach_tracepoint(fn.fd, tp_category.encode("ascii"),
...@@ -586,9 +615,29 @@ class BPF(object): ...@@ -586,9 +615,29 @@ class BPF(object):
del self.open_uprobes[name] del self.open_uprobes[name]
_num_open_probes -= 1 _num_open_probes -= 1
def attach_uprobe(self, name="", sym="", addr=None, def _get_user_functions(self, name, sym_re):
"""
We are returning addresses here instead of symbol names because it
turns out that the same name may appear multiple times with different
addresses, and the same address may appear multiple times with the same
name. We can't attach a uprobe to the same address more than once, so
it makes sense to return the unique set of addresses that are mapped to
a symbol that matches the provided regular expression.
"""
addresses = []
def sym_cb(sym_name, addr):
if re.match(sym_re, sym_name) and addr not in addresses:
addresses.append(addr)
return 0
res = lib.bcc_foreach_symbol(name, _SYM_CB_TYPE(sym_cb))
if res < 0:
raise Exception("Error %d enumerating symbols in %s" % (res, name))
return addresses
def attach_uprobe(self, name="", sym="", sym_re="", addr=None,
fn_name="", pid=-1, cpu=0, group_fd=-1): fn_name="", pid=-1, cpu=0, group_fd=-1):
"""attach_uprobe(name="", sym="", addr=None, fn_name="" """attach_uprobe(name="", sym="", sym_re="", addr=None, fn_name=""
pid=-1, cpu=0, group_fd=-1) pid=-1, cpu=0, group_fd=-1)
Run the bpf function denoted by fn_name every time the symbol sym in Run the bpf function denoted by fn_name every time the symbol sym in
...@@ -596,6 +645,10 @@ class BPF(object): ...@@ -596,6 +645,10 @@ class BPF(object):
be supplied in place of sym. Optional parameters pid, cpu, and group_fd be supplied in place of sym. Optional parameters pid, cpu, and group_fd
can be used to filter the probe. can be used to filter the probe.
Instead of a symbol name, a regular expression can be provided in
sym_re. The uprobe will then attach to symbols that match the provided
regular expression.
Libraries can be given in the name argument without the lib prefix, or Libraries can be given in the name argument without the lib prefix, or
with the full path (/usr/lib/...). Binaries can be given only with the with the full path (/usr/lib/...). Binaries can be given only with the
full path (/bin/sh). full path (/bin/sh).
...@@ -605,6 +658,14 @@ class BPF(object): ...@@ -605,6 +658,14 @@ class BPF(object):
""" """
name = str(name) name = str(name)
if sym_re:
for sym_addr in self._get_user_functions(name, sym_re):
self.attach_uprobe(name=name, addr=sym_addr,
fn_name=fn_name, pid=pid, cpu=cpu,
group_fd=group_fd)
return
(path, addr) = BPF._check_path_symbol(name, sym, addr) (path, addr) = BPF._check_path_symbol(name, sym, addr)
self._check_probe_quota(1) self._check_probe_quota(1)
...@@ -798,6 +859,17 @@ class BPF(object): ...@@ -798,6 +859,17 @@ class BPF(object):
name, _ = BPF._sym_cache(pid).resolve(addr) name, _ = BPF._sym_cache(pid).resolve(addr)
return name return name
@staticmethod
def symaddr(addr, pid):
"""symaddr(addr, pid)
Translate a memory address into a function name plus the instruction
offset as a hexadecimal number, which is returned as a string.
A pid of less than zero will access the kernel symbol cache.
"""
name, offset = BPF._sym_cache(pid).resolve(addr)
return "%s+0x%x" % (name, offset)
@staticmethod @staticmethod
def ksym(addr): def ksym(addr):
"""ksym(addr) """ksym(addr)
...@@ -815,8 +887,7 @@ class BPF(object): ...@@ -815,8 +887,7 @@ class BPF(object):
instruction offset as a hexidecimal number, which is returned as a instruction offset as a hexidecimal number, which is returned as a
string. string.
""" """
name, offset = BPF._sym_cache(-1).resolve(addr) return BPF.symaddr(addr, -1)
return "%s+0x%x" % (name, offset)
@staticmethod @staticmethod
def ksymname(name): def ksymname(name):
...@@ -835,6 +906,20 @@ class BPF(object): ...@@ -835,6 +906,20 @@ class BPF(object):
""" """
return len([k for k in self.open_kprobes.keys() if isinstance(k, str)]) return len([k for k in self.open_kprobes.keys() if isinstance(k, str)])
def num_open_uprobes(self):
"""num_open_uprobes()
Get the number of open U[ret]probes.
"""
return len(self.open_uprobes)
def num_open_tracepoints(self):
"""num_open_tracepoints()
Get the number of open tracepoints.
"""
return len(self.open_tracepoints)
def kprobe_poll(self, timeout = -1): def kprobe_poll(self, timeout = -1):
"""kprobe_poll(self) """kprobe_poll(self)
......
...@@ -129,6 +129,10 @@ lib.bcc_resolve_symname.restype = ct.c_int ...@@ -129,6 +129,10 @@ lib.bcc_resolve_symname.restype = ct.c_int
lib.bcc_resolve_symname.argtypes = [ lib.bcc_resolve_symname.argtypes = [
ct.c_char_p, ct.c_char_p, ct.c_ulonglong, ct.POINTER(bcc_symbol)] ct.c_char_p, ct.c_char_p, ct.c_ulonglong, ct.POINTER(bcc_symbol)]
_SYM_CB_TYPE = ct.CFUNCTYPE(ct.c_int, ct.c_char_p, ct.c_ulonglong)
lib.bcc_foreach_symbol.restype = ct.c_int
lib.bcc_foreach_symbol.argtypes = [ct.c_char_p, _SYM_CB_TYPE]
lib.bcc_symcache_new.restype = ct.c_void_p lib.bcc_symcache_new.restype = ct.c_void_p
lib.bcc_symcache_new.argtypes = [ct.c_int] lib.bcc_symcache_new.argtypes = [ct.c_int]
......
#!/usr/bin/python #!/usr/bin/env python
# #
# stackcount Count kernel function calls and their stack traces. # stackcount Count events and their stack traces.
# For Linux, uses BCC, eBPF. # For Linux, uses BCC, eBPF.
# #
# USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] pattern # USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s]
# [-P] [-v] pattern
# #
# The pattern is a string with optional '*' wildcards, similar to file # The pattern is a string with optional '*' wildcards, similar to file
# globbing. If you'd prefer to use regular expressions, use the -r option. # globbing. If you'd prefer to use regular expressions, use the -r option.
# #
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 11 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc. # Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License") # Licensed under the Apache License, Version 2.0 (the "License")
# #
# 12-Jan-2016 Brendan Gregg Created this. # 12-Jan-2016 Brendan Gregg Created this.
# 09-Jul-2016 Sasha Goldshtein Generalized for uprobes and tracepoints.
from __future__ import print_function from __future__ import print_function
from bcc import BPF from bcc import BPF, USDT
from time import sleep, strftime from time import sleep, strftime
import argparse import argparse
import re
import signal import signal
import sys
import traceback
# arguments debug = False
examples = """examples:
./stackcount submit_bio # count kernel stack traces for submit_bio class Probe(object):
./stackcount ip_output # count kernel stack traces for ip_output def __init__(self, pattern, use_regex=False, pid=None, per_pid=False):
./stackcount -s ip_output # show symbol offsets """Init a new probe.
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send* Init the probe from the pattern provided by the user. The supported
./stackcount -r '^tcp_send.*' # same as above, using regular expressions patterns mimic the 'trace' and 'argdist' tools, but are simpler because
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps we don't have to distinguish between probes and retprobes.
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
""" func -- probe a kernel function
parser = argparse.ArgumentParser( lib:func -- probe a user-space function in the library 'lib'
description="Count kernel function calls and their stack traces", p::func -- same thing as 'func'
formatter_class=argparse.RawDescriptionHelpFormatter, p:lib:func -- same thing as 'lib:func'
epilog=examples) t:cat:event -- probe a kernel tracepoint
parser.add_argument("-p", "--pid", u:lib:probe -- probe a USDT tracepoint
help="trace this PID only") """
parser.add_argument("-i", "--interval", default=99999999, parts = pattern.split(':')
help="summary interval, seconds") if len(parts) == 1:
parser.add_argument("-T", "--timestamp", action="store_true", parts = ["p", "", parts[0]]
help="include timestamp on output") elif len(parts) == 2:
parser.add_argument("-r", "--regexp", action="store_true", parts = ["p", parts[0], parts[1]]
help="use regular expressions. Default is \"*\" wildcards only.") elif len(parts) == 3:
parser.add_argument("-s", "--offset", action="store_true", if parts[0] == "t":
help="show address offsets") parts = ["t", "", "%s:%s" % tuple(parts[1:])]
parser.add_argument("-v", "--verbose", action="store_true", if parts[0] not in ["p", "t", "u"]:
help="show raw addresses") raise Exception("Type must be 'p', 't', or 'u', but got %s" %
parser.add_argument("pattern", parts[0])
help="search expression for kernel functions") else:
args = parser.parse_args() raise Exception("Too many ':'-separated components in pattern %s" %
pattern = args.pattern pattern)
if not args.regexp:
pattern = pattern.replace('*', '.*') (self.type, self.library, self.pattern) = parts
pattern = '^' + pattern + '$' if not use_regex:
offset = args.offset self.pattern = self.pattern.replace('*', '.*')
verbose = args.verbose self.pattern = '^' + self.pattern + '$'
debug = 0
maxdepth = 10 # and MAXDEPTH if (self.type == "p" and self.library) or self.type == "u":
libpath = BPF.find_library(self.library)
# signal handler if libpath is None:
def signal_ignore(signal, frame): # This might be an executable (e.g. 'bash')
print() libpath = BPF.find_exe(self.library)
if libpath is None or len(libpath) == 0:
# load BPF program raise Exception("unable to find library %s" % self.library)
bpf_text = """ self.library = libpath
#include <uapi/linux/ptrace.h>
BPF_HASH(counts, int);
BPF_STACK_TRACE(stack_traces, 1024);
int trace_count(struct pt_regs *ctx) { self.pid = pid
self.per_pid = per_pid
self.matched = 0
def is_kernel_probe(self):
return self.type == "t" or (self.type == "p" and self.library == "")
def attach(self):
if self.type == "p":
if self.library:
self.bpf.attach_uprobe(name=self.library,
sym_re=self.pattern,
fn_name="trace_count",
pid=self.pid or -1)
self.matched = self.bpf.num_open_uprobes()
else:
self.bpf.attach_kprobe(event_re=self.pattern,
fn_name="trace_count",
pid=self.pid or -1)
self.matched = self.bpf.num_open_kprobes()
elif self.type == "t":
self.bpf.attach_tracepoint(tp_re=self.pattern,
fn_name="trace_count",
pid=self.pid or -1)
self.matched = self.bpf.num_open_tracepoints()
elif self.type == "u":
pass # Nothing to do -- attach already happened in `load`
if self.matched == 0:
raise Exception("No functions matched by pattern %s" % self.pattern)
def load(self):
trace_count_text = """
int trace_count(void *ctx) {
FILTER FILTER
int key = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID); struct key_t key = {};
key.pid = GET_PID;
key.stackid = stack_traces.get_stackid(ctx, STACK_FLAGS);
u64 zero = 0; u64 zero = 0;
u64 *val = counts.lookup_or_init(&key, &zero); u64 *val = counts.lookup_or_init(&key, &zero);
(*val)++; (*val)++;
return 0; return 0;
} }
""" """
if args.pid: bpf_text = """#include <uapi/linux/ptrace.h>
bpf_text = bpf_text.replace('FILTER',
('u32 pid; pid = bpf_get_current_pid_tgid(); ' + struct key_t {
'if (pid != %s) { return 0; }') % (args.pid)) u32 pid;
else: int stackid;
bpf_text = bpf_text.replace('FILTER', '') };
if debug:
print(bpf_text) BPF_HASH(counts, struct key_t);
b = BPF(text=bpf_text) BPF_STACK_TRACE(stack_traces, 1024);
b.attach_kprobe(event_re=pattern, fn_name="trace_count")
matched = b.num_open_kprobes() """
if matched == 0:
print("0 functions matched by \"%s\". Exiting." % args.pattern) # We really mean the tgid from the kernel's perspective, which is in
exit() # the top 32 bits of bpf_get_current_pid_tgid().
if self.is_kernel_probe() and self.pid:
# header trace_count_text = trace_count_text.replace('FILTER',
print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." % ('u32 pid; pid = bpf_get_current_pid_tgid() >> 32; ' +
(matched, args.pattern)) 'if (pid != %d) { return 0; }') % (self.pid))
else:
def print_frame(addr): trace_count_text = trace_count_text.replace('FILTER', '')
print(" ", end="")
if verbose: # We need per-pid statistics when tracing a user-space process, because
print("%-16x " % addr, end="") # the meaning of the symbols depends on the pid. We also need them if
if offset: # per-pid statistics were requested with -P.
print("%s" % b.ksymaddr(addr)) if self.per_pid or not self.is_kernel_probe():
else: trace_count_text = trace_count_text.replace('GET_PID',
print("%s" % b.ksym(addr)) 'bpf_get_current_pid_tgid() >> 32')
else:
# output trace_count_text = trace_count_text.replace('GET_PID', '0xffffffff')
exiting = 0 if args.interval else 1
while (1): stack_flags = 'BPF_F_REUSE_STACKID'
if not self.is_kernel_probe():
stack_flags += '| BPF_F_USER_STACK' # can't do both U *and* K
trace_count_text = trace_count_text.replace('STACK_FLAGS', stack_flags)
self.usdt = None
if self.type == "u":
self.usdt = USDT(path=self.library, pid=self.pid)
for probe in self.usdt.enumerate_probes():
if not self.pid and (probe.bin_path != self.library):
continue
if re.match(self.pattern, probe.name):
# This hack is required because the bpf_usdt_readarg
# functions generated need different function names for
# each attached probe. If we just stick to trace_count,
# we'd get multiple bpf_usdt_readarg helpers with the same
# name when enabling more than one USDT probe.
new_func = "trace_count_%d" % self.matched
bpf_text += trace_count_text.replace(
"trace_count", new_func)
self.usdt.enable_probe(probe.name, new_func)
self.matched += 1
if debug:
print(self.usdt.get_text())
else:
bpf_text += trace_count_text
if debug:
print(bpf_text)
self.bpf = BPF(text=bpf_text, usdt_contexts=
[self.usdt] if self.usdt else [])
class Tool(object):
def __init__(self):
examples = """examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
./stackcount -p 185 c:malloc # count stacks for malloc in PID 185
./stackcount t:sched:sched_fork # count stacks for the sched_fork tracepoint
./stackcount -p 185 u:node:* # count stacks for all USDT probes in node
"""
parser = argparse.ArgumentParser(
description="Count events and their stack traces",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid", type=int,
help="trace this PID only")
parser.add_argument("-i", "--interval", default=99999999,
help="summary interval, seconds")
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-r", "--regexp", action="store_true",
help="use regular expressions. Default is \"*\" wildcards only.")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-P", "--perpid", action="store_true",
help="display stacks separately for each process")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("-d", "--debug", action="store_true",
help="print BPF program before starting (for debugging purposes)")
parser.add_argument("pattern",
help="search expression for events")
self.args = parser.parse_args()
global debug
debug = self.args.debug
self.probe = Probe(self.args.pattern, self.args.regexp,
self.args.pid, self.args.perpid)
def _print_frame(self, addr, pid):
print(" ", end="")
if self.args.verbose:
print("%-16x " % addr, end="")
if self.args.offset:
print("%s" % self.probe.bpf.symaddr(addr, pid))
else:
print("%s" % self.probe.bpf.sym(addr, pid))
@staticmethod
def _signal_ignore(signal, frame):
print()
def _comm_for_pid(self, pid):
if pid in self.comm_cache:
return self.comm_cache[pid]
try:
comm = " %s [%d]" % (
open("/proc/%d/comm" % pid).read().strip(),
pid)
self.comm_cache[pid] = comm
return comm
except:
return " unknown process [%d]" % pid
def run(self):
self.probe.load()
self.probe.attach()
print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." %
(self.probe.matched, self.args.pattern))
exiting = 0 if self.args.interval else 1
while True:
try:
sleep(int(self.args.interval))
except KeyboardInterrupt:
exiting = 1
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, Tool._signal_ignore)
print()
if self.args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
counts = self.probe.bpf["counts"]
stack_traces = self.probe.bpf["stack_traces"]
self.comm_cache = {}
for k, v in sorted(counts.items(),
key=lambda counts: counts[1].value):
for addr in stack_traces.walk(k.stackid):
pid = -1 if self.probe.is_kernel_probe() else k.pid
self._print_frame(addr, pid)
if not self.args.pid and k.pid != 0xffffffff:
print(self._comm_for_pid(k.pid))
print(" %d\n" % v.value)
counts.clear()
if exiting:
print("Detaching...")
exit()
if __name__ == "__main__":
try: try:
sleep(int(args.interval)) Tool().run()
except KeyboardInterrupt: except Exception:
exiting = 1 if debug:
# as cleanup can take many seconds, trap Ctrl-C: traceback.print_exc()
signal.signal(signal.SIGINT, signal_ignore) elif sys.exc_info()[0] is not SystemExit:
print(sys.exc_info()[1])
print()
if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
counts = b["counts"]
stack_traces = b["stack_traces"]
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
for addr in stack_traces.walk(k.value):
print_frame(addr)
print(" %d\n" % v.value)
counts.clear()
if exiting:
print("Detaching...")
exit()
 Demonstrations of stackcount, the Linux eBPF/bcc version.
-This program traces kernel functions and frequency counts them with their entire
-kernel stack trace, summarized in-kernel for efficiency. For example, counting
+This program traces functions and frequency counts them with their entire
+stack trace, summarized in-kernel for efficiency. For example, counting
 stack traces that led to submit_bio(), which creates block device I/O:
 # ./stackcount submit_bio
@@ -268,6 +268,76 @@ As may be obvious, this is a great tool for quickly understanding kernel code
 flow.
+User-space functions can also be traced if a library name is provided. For
+example, to quickly identify code locations that allocate heap memory:
+
+# ./stackcount -l c -p 4902 malloc
+Tracing 1 functions for "malloc"... Hit Ctrl-C to end.
+^C
+  malloc
+  rbtree_new
+  main
+  [unknown]
+    12
+
+  malloc
+  _rbtree_node_new_internal
+  _rbtree_node_insert
+  rbtree_insert
+  main
+  [unknown]
+    1189
+
+Detaching...
+
+Note that user-space uses of stackcount can be somewhat more limited because
+a lot of user-space libraries and binaries are compiled without debuginfo, or
+with frame-pointer omission (-fomit-frame-pointer), which makes it impossible
+to reliably obtain the stack trace.
+
+In addition to kernel and user-space functions, kernel tracepoints and USDT
+tracepoints are also supported.
+
+For example, to determine where threads are being created in a particular
+process, use the pthread_create USDT tracepoint:
+
+# ./stackcount -p $(pidof parprimes) u:pthread:pthread_create
+Tracing 1 functions for "u:pthread:pthread_create"... Hit Ctrl-C to end.
+^C
+
+    parprimes [11923]
+  pthread_create@@GLIBC_2.2.5
+  main
+  __libc_start_main
+  [unknown]
+    7
+
+Similarly, to determine where context switching is happening in the kernel,
+use the sched:sched_switch kernel tracepoint:
+
+# ./stackcount t:sched:sched_switch
+... (omitted for brevity)
+
+  __schedule
+  schedule
+  schedule_hrtimeout_range_clock
+  schedule_hrtimeout_range
+  poll_schedule_timeout
+  do_select
+  core_sys_select
+  SyS_select
+  entry_SYSCALL_64_fastpath
+    40
+
+  __schedule
+  schedule
+  schedule_preempt_disabled
+  cpu_startup_entry
+  start_secondary
+    85
+
 A -i option can be used to set an output interval, and -T to include a
 timestamp. For example:
@@ -434,12 +504,13 @@ Use -r to allow regular expressions.
 USAGE message:
 # ./stackcount -h
-usage: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s] [-v] pattern
+usage: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s]
+                  [-l LIBRARY] [-v] [-d] pattern
-Count kernel function calls and their stack traces
+Count function calls and their stack traces
 positional arguments:
-  pattern               search expression for kernel functions
+  pattern               search expression for functions
 optional arguments:
   -h, --help            show this help message and exit
@@ -450,14 +521,19 @@ optional arguments:
   -r, --regexp          use regular expressions. Default is "*" wildcards
                         only.
   -s, --offset          show address offsets
+  -l, --library         trace user-space functions from this library or executable
   -v, --verbose         show raw addresses
+  -d, --debug           print BPF program before starting (for debugging purposes)
 examples:
     ./stackcount submit_bio         # count kernel stack traces for submit_bio
     ./stackcount ip_output          # count kernel stack traces for ip_output
     ./stackcount -s ip_output       # show symbol offsets
     ./stackcount -sv ip_output      # show offsets and raw addresses (verbose)
     ./stackcount 'tcp_send*'        # count stacks for funcs matching tcp_send*
     ./stackcount -r '^tcp_send.*'   # same as above, using regular expressions
     ./stackcount -Ti 5 ip_output    # output every 5 seconds, with timestamps
     ./stackcount -p 185 ip_output   # count ip_output stacks for PID 185 only
+    ./stackcount -p 185 -l c malloc # count stacks for malloc in PID 185
+    ./stackcount t:sched:sched_fork # count stacks for the sched_fork tracepoint
+    ./stackcount -p 185 u:node:*    # count stacks for all USDT probes in node
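
The examples above mix three pattern shapes: a bare function name (optionally with `-l`), a `t:category:event` tracepoint, and a `u:library:probe` USDT probe, with "*" wildcards unless `-r` is given. As a rough illustration of how such a pattern could be classified and its wildcard turned into an anchored regular expression, here is a minimal sketch. `parse_probe` and its return shape are hypothetical names invented for this note; this is not the code added by the commit, and the tool's real parsing may differ.

```python
import re


def parse_probe(pattern, library=None, use_regex=False):
    """Hypothetical helper: split a stackcount-style pattern into
    (kind, scope, function_regex).

    kind  -- "p" for a function probe, "t" for a kernel tracepoint,
             "u" for a USDT probe
    scope -- library name, tracepoint category, or "" for kernel functions
    """
    parts = pattern.split(":")
    if len(parts) == 1:
        # Bare name: a kernel function, or a user-space function when a
        # library (the -l option) was supplied.
        kind, scope, func = "p", library or "", parts[0]
    elif len(parts) == 3 and parts[0] in ("t", "u"):
        kind, scope, func = parts
    else:
        raise ValueError("unrecognized pattern: %r" % pattern)

    if not use_regex:
        # Treat "*" as the only wildcard: escape everything else, then
        # anchor the expression so that it must match the whole name.
        func = "^" + re.escape(func).replace(r"\*", ".*") + "$"
    return kind, scope, func


if __name__ == "__main__":
    for spec in ("tcp_send*", "t:sched:sched_fork", "u:node:*"):
        print(spec, "->", parse_probe(spec))
```

Under these assumptions, `'tcp_send*'` and `-r '^tcp_send.*'` select the same functions in practice, which is the equivalence the help text describes; the wildcard form is additionally anchored at the end of the name.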