Commit 07175d05 authored by Sasha Goldshtein, committed by 4ast

stackcount: Support uprobes, tracepoints, and USDT (#730)

* stackcount: Support user-space functions

Add support for user-space functions in `stackcount` by taking an additional
`-l` command-line parameter specifying the name of the user-space library.
When a user-space library is specified, `stackcount` attaches to a specific
process and traces a user-space function with user-space stacks only.
Regex support for uprobes (similar to what is available for kprobes) is
not currently provided.

Also add a couple of functions to the `BPF` object for consistency.

* bcc: Support regex in attach_uprobe

attach_kprobe allows a regular expression for the function name,
while attach_uprobe does not. Add support in libbcc for enumerating
all the function symbols in a binary, and use that in the BPF module
to attach uprobes according to a regular expression. For example:

```python
bpf = BPF(text="...")
bpf.attach_uprobe(name="c", sym_re=".*write$", fn_name="probe")
```
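The regex path attaches by address rather than by name, because a symbol table can list the same address under several names and the same name more than once. The sketch below illustrates that de-duplication on a made-up symbol list. `unique_matching_addresses` and the sample symbols are hypothetical and are not part of bcc.

```python
import re

def unique_matching_addresses(symbols, sym_re):
    """Return addresses of symbols matching sym_re, without duplicates.

    `symbols` is an iterable of (name, address) pairs, standing in for
    what a symbol enumeration callback would receive.
    """
    pattern = re.compile(sym_re)
    addresses = []
    seen = set()
    for name, addr in symbols:
        # re.match anchors at the start of the name only, so ".*write$"
        # matches any name that ends in "write".
        if pattern.match(name) and addr not in seen:
            seen.add(addr)
            addresses.append(addr)
    return addresses

# Hypothetical symbol table: `write` and `__write` alias one address.
sample = [
    ("write", 0x1000),
    ("__write", 0x1000),
    ("fwrite", 0x2000),
    ("writev", 0x3000),
]
matched = unique_matching_addresses(sample, ".*write$")
print([hex(a) for a in matched])
```

Each address appears once in the result, so a caller attaching one uprobe per entry would not try to probe the same location twice.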

* python: Support regex in attach_tracepoint

Modify attach_tracepoint to take a regex argument, in which case
it enumerates all tracepoints matching that regex and attaches to
all of them. The logic for enumerating tracepoints should eventually
belong in libbcc and be shared across all the tools (tplist, trace
and so on).
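
The enumeration walks the tracefs `events` directory, where each category is a directory that holds one directory per event. The following is a minimal sketch of that walk, run against a temporary directory tree so it needs neither root nor a mounted tracefs. `list_tracepoints` is a hypothetical name, and the sorting is added here only to make the output deterministic.

```python
import os
import re
import tempfile

def list_tracepoints(events_dir, tp_re):
    """Return "category:event" names under events_dir that match tp_re."""
    results = []
    for category in sorted(os.listdir(events_dir)):
        cat_dir = os.path.join(events_dir, category)
        if not os.path.isdir(cat_dir):
            continue  # skip control files such as "enable"
        for event in sorted(os.listdir(cat_dir)):
            if os.path.isdir(os.path.join(cat_dir, event)):
                tp = "%s:%s" % (category, event)
                if re.match(tp_re, tp):
                    results.append(tp)
    return results

# Build a throwaway tree that mimics the tracefs layout.
with tempfile.TemporaryDirectory() as root:
    for tp in ("sched/sched_switch", "sched/sched_wakeup",
               "syscalls/sys_enter_bind"):
        os.makedirs(os.path.join(root, tp))
    open(os.path.join(root, "enable"), "w").close()
    found = list_tracepoints(root, "sched:.*")
    print(found)
```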

* cc: Fix termination condition bug in symbol enumeration

bcc_elf would not terminate the enumeration correctly when the
user-provided callback returned -1 but there were still more
sections remaining in the ELF to be enumerated.
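
The fix can be modeled in a few lines of Python. This is only a sketch of the control flow, not the bcc_elf code: the function names are hypothetical, and the sections are plain lists of symbol names. The inner loop returns 1 when the callback asks to stop, and the outer loop checks for that value instead of moving on to the next section.

```python
def list_in_section(symbols, callback):
    """Visit symbols in one section; return 1 if the callback asked to stop."""
    for name in symbols:
        if callback(name) < 0:
            return 1  # propagate the stop request instead of a bare break
    return 0

def list_symbols(sections, callback):
    """Visit every section until done or until the callback asks to stop."""
    for symbols in sections:
        if list_in_section(symbols, callback) == 1:
            break
    return 0

visited = []

def stop_at_write(name):
    visited.append(name)
    return -1 if name == "write" else 0

# Two hypothetical sections, standing in for .symtab and .dynsym.
sections = [["read", "write", "open"], ["close", "write"]]
list_symbols(sections, stop_at_write)
print(visited)
```

With a bare `break` in the inner loop, the second section would still be visited after the callback returned -1.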

* stackcount: Support uprobes and tracepoints

Refactored stackcount and added support for uprobes and tracepoints,
which also required changes to the BPF module. USDT support still
pending.

* bcc: Refactor symbol listing to use foreach-style

Refactor symbol listing from paging style to foreach-style with a
callback function per-symbol. Even though we're now performing a
callback from C to Python for each symbol, this is preferable to the
paging approach because we need all the symbols in the current use
case.

Also refactored `stackcount` slightly; only missing support for USDT
probes now.
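
A rough sketch of the foreach-style callback from the Python side follows. It assumes the same callback shape as the `_SYM_CB_TYPE` added to libbcc.py in this commit, but replaces the C-side enumeration with a hypothetical Python loop (`fake_foreach_symbol`) so that it runs without libbcc.

```python
import ctypes as ct

# Same shape as the callback type added to libbcc.py in this commit:
# int (*)(const char *symname, uint64_t addr)
SYM_CB_TYPE = ct.CFUNCTYPE(ct.c_int, ct.c_char_p, ct.c_ulonglong)

collected = []

def on_symbol(sym_name, addr):
    # ctypes hands a c_char_p argument to Python as bytes.
    collected.append((sym_name.decode(), addr))
    return 0  # a negative value would ask the enumerator to stop

c_callback = SYM_CB_TYPE(on_symbol)

def fake_foreach_symbol(cb):
    """Hypothetical stand-in for the C-side loop over the symbol table."""
    for name, addr in ((b"malloc", 0x1000), (b"free", 0x2000)):
        if cb(name, addr) < 0:
            break
    return 0

rc = fake_foreach_symbol(c_callback)
print(rc, collected)
```

Keeping a reference to the wrapped callback (`c_callback` here) for as long as C code may call it prevents it from being garbage-collected mid-enumeration.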

* stackcount: Support per-process displays

For user-space functions, or when requested for kernel-space
functions or tracepoints, group the output by process. Toggled
with the -P switch, off by default (except for user-space).

* Fix rebase issues, print pid only when there is one

* stackcount: Add USDT support

Now, stackcount supports USDT tracepoints in addition to
kernel functions, user functions, and kernel tracepoints.
The format is the same as with the other general-purpose
tools (argdist, trace):

```
stackcount -p $(pidof node) u:node:gc*
stackcount -p 185 u:pthread:pthread_create
```
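
To make the specifier format concrete, here is a hypothetical parser for the forms that appear in the examples and man page (`ip_output`, `c:malloc`, `t:sched:sched_switch`, `u:pthread:pthread_create`). The stackcount diff is collapsed on this page, so this is an illustration under those assumptions and not the tool's actual parsing code. `parse_probe` and its return tuple are invented for the sketch.

```python
def parse_probe(spec):
    """Split a probe specifier into (kind, library, pattern).

    kind is "p" for a function probe, "t" for a kernel tracepoint, or
    "u" for a USDT probe. Hypothetical helper, for illustration only.
    """
    parts = spec.split(":")
    if len(parts) == 1:
        return ("p", "", parts[0])            # kernel function
    if len(parts) == 2:
        return ("p", parts[0], parts[1])      # library:function
    if len(parts) == 3 and parts[0] == "t":
        # category:event is kept together as the tracepoint name
        return ("t", "", "%s:%s" % (parts[1], parts[2]))
    if len(parts) == 3 and parts[0] == "u":
        return ("u", parts[1], parts[2])      # library and USDT probe name
    raise ValueError("unrecognized probe specifier: %r" % spec)

for spec in ("ip_output", "c:malloc", "t:sched:sched_switch",
             "u:pthread:pthread_create"):
    print(spec, "->", parse_probe(spec))
```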

* stackcount: Update examples and man page

Add examples and man page documentation for kernel
tracepoints, USDT tracepoints, and other features.

* stackcount: Change printing format slightly

When -p is specified, don't print the comm and pid. Also,
when -P is specified for kernel probes (kprobes and
tracepoints), use -1 for symbol resolution so that we
don't try to resolve kernel functions as user symbols.
Finally, print the comm and pid at the end of the stack
output and not at the beginning.
parent ba404cfe
.TH stackcount 8 "2016-01-14" "USER COMMANDS"
.SH NAME
stackcount \- Count kernel function calls and their stack traces. Uses Linux eBPF/bcc.
stackcount \- Count function calls and their stack traces. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B stackcount [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-r] pattern
.B stackcount [\-h] [\-p PID] [\-i INTERVAL] [\-T] [\-r] [\-s]
[\-P] [\-v] [\-d] pattern
.SH DESCRIPTION
stackcount traces kernel functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. This allows higher
stackcount traces functions and frequency counts them with their entire
stack trace, summarized in-kernel for efficiency. This allows higher
frequency events to be studied. The output consists of unique stack traces,
and their occurrence counts.
and their occurrence counts. In addition to kernel and user functions, kernel
tracepoints and USDT tracepoints are also supported.
The pattern is a string with optional '*' wildcards, similar to file globbing.
If you'd prefer to use regular expressions, use the \-r option.
......@@ -35,14 +37,18 @@ Include a timestamp with interval output.
\-v
Show raw addresses.
.TP
\-d
Print the source of the BPF program when loading it (for debugging purposes).
.TP
\-i interval
Summary interval, in seconds.
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
.TP
pattern
A kernel function name, or a search pattern. Can include wildcards ("*"). If the
A function name, or a search pattern. Can include wildcards ("*"). If the
\-r option is used, can include regular expressions.
.SH EXAMPLES
.TP
......@@ -77,6 +83,18 @@ Output every 5 seconds, with timestamps:
Only count stacks when PID 185 is on-CPU:
#
.B stackcount -p 185 ip_output
.TP
Count user stacks for dynamic heap allocations with malloc in PID 185:
#
.B stackcount -p 185 c:malloc
.TP
Count user stacks for thread creation (USDT tracepoint) in PID 185:
#
.B stackcount -p 185 u:pthread:pthread_create
.TP
Count kernel stacks for context switch events using a kernel tracepoint:
#
.B stackcount t:sched:sched_switch
.SH OVERHEAD
This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of function calls than methods that post-process in user
......@@ -99,6 +117,6 @@ Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
Brendan Gregg, Sasha Goldshtein
.SH SEE ALSO
stacksnoop(8), funccount(8)
......@@ -165,7 +165,7 @@ static int list_in_scn(Elf *e, Elf_Scn *section, size_t stridx, size_t symsize,
continue;
if (callback(name, sym.st_value, sym.st_size, sym.st_info, payload) < 0)
break;
return 1; // signal termination to caller
}
}
......@@ -184,9 +184,13 @@ static int listsymbols(Elf *e, bcc_elf_symcb callback, void *payload) {
if (header.sh_type != SHT_SYMTAB && header.sh_type != SHT_DYNSYM)
continue;
if (list_in_scn(e, section, header.sh_link, header.sh_entsize, callback,
payload) < 0)
return -1;
int rc = list_in_scn(e, section, header.sh_link, header.sh_entsize,
callback, payload);
if (rc == 1)
break; // callback signaled termination
if (rc < 0)
return rc;
}
return 0;
......
......@@ -270,6 +270,32 @@ int bcc_find_symbol_addr(struct bcc_symbol *sym) {
return bcc_elf_foreach_sym(sym->module, _find_sym, sym);
}
struct sym_search_t {
struct bcc_symbol *syms;
int start;
int requested;
int *actual;
};
// see <elf.h>
#define ELF_TYPE_IS_FUNCTION(flags) (((flags) & 0xf) == 2)
static int _list_sym(const char *symname, uint64_t addr, uint64_t end,
int flags, void *payload) {
if (!ELF_TYPE_IS_FUNCTION(flags) || addr == 0)
return 0;
SYM_CB cb = (SYM_CB) payload;
return cb(symname, addr);
}
int bcc_foreach_symbol(const char *module, SYM_CB cb) {
if (module == 0 || cb == 0)
return -1;
return bcc_elf_foreach_sym(module, _list_sym, (void *)cb);
}
int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym) {
uint64_t load_addr;
......
......@@ -29,6 +29,8 @@ struct bcc_symbol {
uint64_t offset;
};
typedef int(* SYM_CB)(const char *symname, uint64_t addr);
void *bcc_symcache_new(int pid);
int bcc_symcache_resolve(void *symcache, uint64_t addr, struct bcc_symbol *sym);
int bcc_symcache_resolve_name(void *resolver, const char *name, uint64_t *addr);
......@@ -36,6 +38,7 @@ void bcc_symcache_refresh(void *resolver);
int bcc_resolve_global_addr(int pid, const char *module, const uint64_t address,
uint64_t *global);
int bcc_foreach_symbol(const char *module, SYM_CB cb);
int bcc_find_symbol_addr(struct bcc_symbol *sym);
int bcc_resolve_symname(const char *module, const char *symname,
const uint64_t addr, struct bcc_symbol *sym);
......
......@@ -25,7 +25,7 @@ import errno
import sys
basestring = (unicode if sys.version_info[0] < 3 else str)
from .libbcc import lib, _CB_TYPE, bcc_symbol
from .libbcc import lib, _CB_TYPE, bcc_symbol, _SYM_CB_TYPE
from .table import Table
from .perf import Perf
from .usyms import ProcessSymbols
......@@ -531,8 +531,25 @@ class BPF(object):
res = lib.bcc_procutils_which_so(libname.encode("ascii"))
return res if res is None else res.decode()
def attach_tracepoint(self, tp="", fn_name="", pid=-1, cpu=0, group_fd=-1):
"""attach_tracepoint(tp="", fn_name="", pid=-1, cpu=0, group_fd=-1)
def _get_tracepoints(self, tp_re):
results = []
events_dir = os.path.join(TRACEFS, "events")
for category in os.listdir(events_dir):
cat_dir = os.path.join(events_dir, category)
if not os.path.isdir(cat_dir):
continue
for event in os.listdir(cat_dir):
evt_dir = os.path.join(cat_dir, event)
if os.path.isdir(evt_dir):
tp = ("%s:%s" % (category, event))
if re.match(tp_re, tp):
results.append(tp)
return results
def attach_tracepoint(self, tp="", tp_re="", fn_name="", pid=-1,
cpu=0, group_fd=-1):
"""attach_tracepoint(tp="", tp_re="", fn_name="", pid=-1,
cpu=0, group_fd=-1)
Run the bpf function denoted by fn_name every time the kernel tracepoint
specified by 'tp' is hit. The optional parameters pid, cpu, and group_fd
......@@ -540,12 +557,24 @@ class BPF(object):
the tracepoint category and the tracepoint name, separated by a colon.
For example: sched:sched_switch, syscalls:sys_enter_bind, etc.
Instead of a tracepoint name, a regular expression can be provided in
tp_re. The program will then attach to tracepoints that match the
provided regular expression.
To obtain a list of kernel tracepoints, use the tplist tool or cat the
file /sys/kernel/debug/tracing/available_events.
Example: BPF(text).attach_tracepoint("sched:sched_switch", "on_switch")
Examples:
BPF(text).attach_tracepoint(tp="sched:sched_switch", fn_name="on_switch")
BPF(text).attach_tracepoint(tp_re="sched:.*", fn_name="on_switch")
"""
if tp_re:
for tp in self._get_tracepoints(tp_re):
self.attach_tracepoint(tp=tp, fn_name=fn_name, pid=pid,
cpu=cpu, group_fd=group_fd)
return
fn = self.load_func(fn_name, BPF.TRACEPOINT)
(tp_category, tp_name) = tp.split(':')
res = lib.bpf_attach_tracepoint(fn.fd, tp_category.encode("ascii"),
......@@ -586,9 +615,29 @@ class BPF(object):
del self.open_uprobes[name]
_num_open_probes -= 1
def attach_uprobe(self, name="", sym="", addr=None,
def _get_user_functions(self, name, sym_re):
"""
We are returning addresses here instead of symbol names because it
turns out that the same name may appear multiple times with different
addresses, and the same address may appear multiple times with the same
name. We can't attach a uprobe to the same address more than once, so
it makes sense to return the unique set of addresses that are mapped to
a symbol that matches the provided regular expression.
"""
addresses = []
def sym_cb(sym_name, addr):
if re.match(sym_re, sym_name) and addr not in addresses:
addresses.append(addr)
return 0
res = lib.bcc_foreach_symbol(name, _SYM_CB_TYPE(sym_cb))
if res < 0:
raise Exception("Error %d enumerating symbols in %s" % (res, name))
return addresses
def attach_uprobe(self, name="", sym="", sym_re="", addr=None,
fn_name="", pid=-1, cpu=0, group_fd=-1):
"""attach_uprobe(name="", sym="", addr=None, fn_name=""
"""attach_uprobe(name="", sym="", sym_re="", addr=None, fn_name="",
pid=-1, cpu=0, group_fd=-1)
Run the bpf function denoted by fn_name every time the symbol sym in
......@@ -596,6 +645,10 @@ class BPF(object):
be supplied in place of sym. Optional parameters pid, cpu, and group_fd
can be used to filter the probe.
Instead of a symbol name, a regular expression can be provided in
sym_re. The uprobe will then attach to symbols that match the provided
regular expression.
Libraries can be given in the name argument without the lib prefix, or
with the full path (/usr/lib/...). Binaries can be given only with the
full path (/bin/sh).
......@@ -605,6 +658,14 @@ class BPF(object):
"""
name = str(name)
if sym_re:
for sym_addr in self._get_user_functions(name, sym_re):
self.attach_uprobe(name=name, addr=sym_addr,
fn_name=fn_name, pid=pid, cpu=cpu,
group_fd=group_fd)
return
(path, addr) = BPF._check_path_symbol(name, sym, addr)
self._check_probe_quota(1)
......@@ -798,6 +859,17 @@ class BPF(object):
name, _ = BPF._sym_cache(pid).resolve(addr)
return name
@staticmethod
def symaddr(addr, pid):
"""symaddr(addr, pid)
Translate a memory address into a function name plus the instruction
offset as a hexadecimal number, which is returned as a string.
A pid of less than zero will access the kernel symbol cache.
"""
name, offset = BPF._sym_cache(pid).resolve(addr)
return "%s+0x%x" % (name, offset)
@staticmethod
def ksym(addr):
"""ksym(addr)
......@@ -815,8 +887,7 @@ class BPF(object):
instruction offset as a hexadecimal number, which is returned as a
string.
"""
name, offset = BPF._sym_cache(-1).resolve(addr)
return "%s+0x%x" % (name, offset)
return BPF.symaddr(addr, -1)
@staticmethod
def ksymname(name):
......@@ -835,6 +906,20 @@ class BPF(object):
"""
return len([k for k in self.open_kprobes.keys() if isinstance(k, str)])
def num_open_uprobes(self):
"""num_open_uprobes()
Get the number of open U[ret]probes.
"""
return len(self.open_uprobes)
def num_open_tracepoints(self):
"""num_open_tracepoints()
Get the number of open tracepoints.
"""
return len(self.open_tracepoints)
def kprobe_poll(self, timeout = -1):
"""kprobe_poll(self)
......
......@@ -129,6 +129,10 @@ lib.bcc_resolve_symname.restype = ct.c_int
lib.bcc_resolve_symname.argtypes = [
ct.c_char_p, ct.c_char_p, ct.c_ulonglong, ct.POINTER(bcc_symbol)]
_SYM_CB_TYPE = ct.CFUNCTYPE(ct.c_int, ct.c_char_p, ct.c_ulonglong)
lib.bcc_foreach_symbol.restype = ct.c_int
lib.bcc_foreach_symbol.argtypes = [ct.c_char_p, _SYM_CB_TYPE]
lib.bcc_symcache_new.restype = ct.c_void_p
lib.bcc_symcache_new.argtypes = [ct.c_int]
......
Demonstrations of stackcount, the Linux eBPF/bcc version.
This program traces kernel functions and frequency counts them with their entire
kernel stack trace, summarized in-kernel for efficiency. For example, counting
This program traces functions and frequency counts them with their entire
stack trace, summarized in-kernel for efficiency. For example, counting
stack traces that led to submit_bio(), which creates block device I/O:
# ./stackcount submit_bio
......@@ -268,6 +268,76 @@ As may be obvious, this is a great tool for quickly understanding kernel code
flow.
User-space functions can also be traced if a library name is provided. For
example, to quickly identify code locations that allocate heap memory:
# ./stackcount -l c -p 4902 malloc
Tracing 1 functions for "malloc"... Hit Ctrl-C to end.
^C
malloc
rbtree_new
main
[unknown]
12
malloc
_rbtree_node_new_internal
_rbtree_node_insert
rbtree_insert
main
[unknown]
1189
Detaching...
Note that user-space uses of stackcount can be somewhat more limited because
a lot of user-space libraries and binaries are compiled without debuginfo, or
with frame-pointer omission (-fomit-frame-pointer), which makes it impossible
to reliably obtain the stack trace.
In addition to kernel and user-space functions, kernel tracepoints and USDT
tracepoints are also supported.
For example, to determine where threads are being created in a particular
process, use the pthread_create USDT tracepoint:
# ./stackcount -p $(pidof parprimes) u:pthread:pthread_create
Tracing 1 functions for "u:pthread:pthread_create"... Hit Ctrl-C to end.
^C
parprimes [11923]
pthread_create@@GLIBC_2.2.5
main
__libc_start_main
[unknown]
7
Similarly, to determine where context switching is happening in the kernel,
use the sched:sched_switch kernel tracepoint:
# ./stackcount t:sched:sched_switch
... (omitted for brevity)
__schedule
schedule
schedule_hrtimeout_range_clock
schedule_hrtimeout_range
poll_schedule_timeout
do_select
core_sys_select
SyS_select
entry_SYSCALL_64_fastpath
40
__schedule
schedule
schedule_preempt_disabled
cpu_startup_entry
start_secondary
85
A -i option can be used to set an output interval, and -T to include a
timestamp. For example:
......@@ -434,12 +504,13 @@ Use -r to allow regular expressions.
USAGE message:
# ./stackcount -h
usage: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s] [-v] pattern
usage: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] [-s]
[-l LIBRARY] [-v] [-d] pattern
Count kernel function calls and their stack traces
Count function calls and their stack traces
positional arguments:
pattern search expression for kernel functions
pattern search expression for functions
optional arguments:
-h, --help show this help message and exit
......@@ -450,14 +521,19 @@ optional arguments:
-r, --regexp use regular expressions. Default is "*" wildcards
only.
-s, --offset show address offsets
-l, --library trace user-space functions from this library or executable
-v, --verbose show raw addresses
-d, --debug print BPF program before starting (for debugging purposes)
examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
./stackcount -p 185 -l c malloc # count stacks for malloc in PID 185
./stackcount t:sched:sched_fork # count stacks for the sched_fork tracepoint
./stackcount -p 185 u:node:* # count stacks for all USDT probes in node