Commit 59d525fa authored by Zaafar Ahmed's avatar Zaafar Ahmed

updating branch

Merge remote-tracking branch 'origin/master' into percpu
parents 64258256 6c08dcbb
......@@ -100,6 +100,7 @@ Examples:
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[solisten](tools/solisten.py): Trace TCP socket listen. [Examples](tools/solisten_example.txt).
- tools/[stackcount](tools/stackcount.py): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
- tools/[stacksnoop](tools/stacksnoop.py): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
- tools/[statsnoop](tools/statsnoop.py): Trace stat() syscalls. [Examples](tools/statsnoop_example.txt).
......@@ -108,6 +109,7 @@ Examples:
- tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
- tools/[tcpconnlat](tools/tcpconnlat.py): Trace TCP active connection latency (connect()). [Examples](tools/tcpconnlat_example.txt).
- tools/[tcpretrans](tools/tcpretrans.py): Trace TCP retransmits and TLPs. [Examples](tools/tcpretrans_example.txt).
- tools/[tplist](tools/tplist.py): Display kernel tracepoints or USDT probes and their formats. [Examples](tools/tplist_example.txt).
- tools/[trace](tools/trace.py): Trace arbitrary functions, with filters. [Examples](tools/trace_example.txt)
- tools/[vfscount](tools/vfscount.py) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
- tools/[vfsstat](tools/vfsstat.py) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).
......
#!/usr/bin/python
#
# mallocstacks Trace malloc() calls in a process and print the full
# stack trace for all callsites.
# For Linux, uses BCC, eBPF. Embedded C.
#
# This script is a basic example of the new Linux 4.6+ BPF_STACK_TRACE
# table API.
#
# Copyright 2016 GitHub, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
from __future__ import print_function
from bcc import BPF, ProcessSymbols
from time import sleep
import sys
if len(sys.argv) < 2:
print("USAGE: mallocstacks PID")
exit()
pid = int(sys.argv[1])
# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
BPF_HASH(calls, int);
BPF_STACK_TRACE(stack_traces, 1024)
int alloc_enter(struct pt_regs *ctx, size_t size) {
int key = stack_traces.get_stackid(ctx,
BPF_F_USER_STACK|BPF_F_REUSE_STACKID);
if (key < 0)
return 0;
u64 zero = 0, *val;
val = calls.lookup_or_init(&key, &zero);
(*val) += size;
return 0;
};
""")
b.attach_uprobe(name="c", sym="malloc", fn_name="alloc_enter", pid=pid)
print("Attaching to malloc in pid %d, Ctrl+C to quit." % pid)
decoder = ProcessSymbols(pid)
# sleep until Ctrl-C
try:
sleep(99999999)
except KeyboardInterrupt:
pass
calls = b.get_table("calls")
stack_traces = b.get_table("stack_traces")
for k, v in reversed(sorted(calls.items(), key=lambda c: c[1].value)):
print("%d bytes allocated at:" % v.value)
for addr in stack_traces.walk(k.value):
print("\t%s (%x)" % (decoder.decode_addr(addr), addr))
......@@ -50,11 +50,11 @@ many cases, argdist will deduce the necessary header files automatically.
.SH SPECIFIER SYNTAX
The general specifier syntax is as follows:
.B {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
.B {p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
.TP
.B {p,r}
Probe type \- "p" for function entry, "r" for function return;
\-H for histogram collection, \-C for frequency count.
.B {p,r,t,u}
Probe type \- "p" for function entry, "r" for function return, "t" for kernel
tracepoint, "u" for USDT probe; \-H for histogram collection, \-C for frequency count.
Indicates where to place the probe and whether the probe should collect frequency
count information, or aggregate the collected values into a histogram. Counting
probes will collect the number of times every parameter value was observed,
......@@ -68,12 +68,19 @@ Specify the full path to the .so or executable file where the function to probe
resides. Alternatively, you can specify just the lib name: for example, "c"
refers to libc. If no library name is specified, the kernel is assumed.
.TP
.B category
The category of the kernel tracepoint. For example: net, sched, block.
.TP
.B function(signature)
The function to probe, and its signature.
The function name must match exactly for the probe to be placed. The signature,
on the other hand, is only required if you plan to collect parameter values
based on that signature. For example, if you only want to collect the first
parameter, you don't have to specify the rest of the parameters in the signature.
When capturing kernel tracepoints, this should be the name of the event, e.g.
net_dev_start_xmit. The signature for kernel tracepoints should be empty. When
capturing USDT probes, this should be the name of the probe, e.g. reloc_complete.
The signature for USDT probes should be empty.
.TP
.B [type[,type...]]
The type(s) of the expression(s) to capture.
......@@ -85,6 +92,12 @@ The expression(s) to capture.
These are the values that are assigned to the histogram or raw event collection.
You may use the parameters directly, or valid C expressions that involve the
parameters, such as "size % 10".
Tracepoints may access a special structure called "tp" that is formatted according
to the tracepoint format (which you can obtain using tplist). For example, the
block:block_rq_complete tracepoint can access tp.nr_sector. You may also use the
members of the "tp" struct directly, e.g. "nr_sector" instead of "tp.nr_sector".
USDT probes may access the arguments defined by the tracing program in the
special arg1, arg2, ... variables. To obtain their types, use the tplist tool.
Return probes can use the argument values received by the
function when it was entered, through the $entry(paramname) special variable.
Return probes can also access the function's return value in $retval, and the
......@@ -137,6 +150,18 @@ Count fork() calls in libc across all processes, grouped by pid:
#
.B argdist -C 'p:c:fork():int:$PID;fork per process'
.TP
Print histogram of number of sectors in completing block I/O requests:
#
.B argdist -H 't:block:block_rq_complete():u32:nr_sector'
.TP
Aggregate interrupts by interrupt request (IRQ):
#
.B argdist -C 't:irq:irq_handler_entry():int:irq'
.TP
Print the functions used as thread entry points and how common they are:
#
.B argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
.TP
Print histograms of sleep() and nanosleep() parameter values:
#
.B argdist -H 'p:c:sleep(u32 seconds):u32:seconds' 'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
......
......@@ -3,7 +3,7 @@
memleak \- Print a summary of outstanding allocations and their call stacks to detect memory leaks. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND] [-s SAMPLE_RATE]
[-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
[-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
.SH DESCRIPTION
memleak traces and matches memory allocation and deallocation requests, and
collects call stacks for each allocation. memleak can then print a summary
......@@ -15,10 +15,7 @@ When tracing all processes, memleak instruments kmalloc and kfree.
memleak may introduce significant overhead when tracing processes that allocate
and free many blocks very quickly. See the OVERHEAD section below.
The stack depth is limited to 10 by default (+1 for the current instruction pointer),
but it can be controlled using the \-d switch if deeper stacks are required.
This currently only works on x86_64. Check for future versions.
This tool only works on Linux 4.6+. Stack traces are obtained using the new BPF_STACK_TRACE` APIs.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......@@ -45,10 +42,6 @@ Run the specified command and trace its allocations only. This traces malloc and
\-s SAMPLE_RATE
Record roughly every SAMPLE_RATE-th allocation to reduce overhead.
.TP
\-d STACK_DEPTH
Capture STACK_DEPTH frames (or less) when obtaining allocation call stacks.
The default value is 10.
.TP
\-t TOP
Print only the top TOP stacks (sorted by size).
The default value is 10.
......@@ -106,9 +99,6 @@ placed in a typical period of 10 seconds:
#
.B perf stat -a -e 'probe:__kmalloc' -- sleep 10
Another setting that may help reduce overhead is lowering the number of stack
frames captured and parsed by memleak for each allocation, using the \-d switch.
.SH SOURCE
This is from bcc.
.IP
......
......@@ -22,10 +22,8 @@ especially the -f format, can be used to generate an "off-CPU time flame graph".
See http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html
The stack depth is currently limited to 20, and the stack traces are kernel
mode only. Check for newer versions where either may be improved.
This currently only works on x86_64. Check for future versions.
This tool only works on Linux 4.6+. It uses the new `BPF_STACK_TRACE` table
APIs to generate the in-kernel stack traces.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......
......@@ -12,10 +12,8 @@ and their occurrence counts.
The pattern is a string with optional '*' wildcards, similar to file globbing.
If you'd prefer to use regular expressions, use the \-r option.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This tool only works on Linux 4.6+. Stack traces are obtained using the new `BPF_STACK_TRACE` APIs.
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......@@ -82,13 +80,12 @@ Only count stacks when PID 185 is on-CPU:
This summarizes unique stack traces in-kernel for efficiency, allowing it to
trace a higher rate of function calls than methods that post-process in user
space. The stack trace data is only copied to user space when the output is
printed, which usually only happens once. Given these techniques, I'd suspect
that call rates of < 10,000/sec would incur negligible overhead (for this
current version; future versions may improve this). Beyond that,
there will be a point where the overhead is measurable, as this does add
a number of instructions to each function call to walk and save stacks.
Test before production use. You can also use funccount to get a handle on
function call rates first.
printed, which usually only happens once. The stack walking also happens in an
optimized code path in the kernel thanks to the new BPF_STACK_TRACE table APIs,
which should be more efficient than the manual walker in the eBPF tracer which
older versions of this script used. With this in mind, call rates of <
10,000/sec would incur negligible overhead. Test before production use. You can
also use funccount to get a handle on function call rates first.
.SH SOURCE
This is from bcc.
.IP
......
......@@ -9,10 +9,7 @@ kernel stack back trace for that call. This shows the ancestry of function
calls, and is a quick way to investigate low frequency kernel functions and
their cause. For high frequency kernel functions, see stackcount.
The stack depth is currently limited to 10 (+1 for the current instruction
pointer).
This currently only works on x86_64. Check for future versions.
This tool only works on Linux 4.6+. Stack traces are obtained using the new BPF_STACK_TRACE` APIs.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
......
.TH tplist 8 "2016-03-20" "USER COMMANDS"
.SH NAME
tplist \- Display kernel tracepoints or USDT probes and their formats.
.SH SYNOPSIS
.B tplist [-p PID] [-l LIB] [-v] [filter]
.SH DESCRIPTION
tplist lists all kernel tracepoints, and can optionally print out the tracepoint
format; namely, the variables that you can trace when the tracepoint is hit.
tplist can also list USDT probes embedded in a specific library or executable,
and can list USDT probes for all the libraries loaded by a specific process.
These features are usually used in conjunction with the argdist and/or trace tools.
On a typical system, accessing the tracepoint list and format requires root.
However, accessing USDT probes does not require root.
.SH OPTIONS
.TP
\-p PID
Display the USDT probes from all the libraries loaded by the specified process.
.TP
\-l LIB
Display the USDT probes from the specified library or executable. If the librar
or executable can be found in the standard paths, a full path is not required.
.TP
\-v
Display the variables associated with the tracepoint or USDT probe.
.TP
[filter]
A wildcard expression that specifies which tracepoints or probes to print.
For example, block:* will print all block tracepoints (block:block_rq_complete,
etc.). Regular expressions are not supported.
.SH EXAMPLES
.TP
Print all kernel tracepoints:
#
.B tplist
.TP
Print all net tracepoints with their format:
#
.B tplist -v 'net:*'
.TP
Print all USDT probes in libpthread:
$
.B tplist -l pthread
.TP
Print all USDT probes in process 4717 from the libc provider:
$
.B tplist -p 4717 'libc:*'
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Sasha Goldshtein
......@@ -45,10 +45,12 @@ information. See PROBE SYNTAX below.
The general probe syntax is as follows:
.B [{p,r}]:[library]:function [(predicate)] ["format string"[, arguments]]
.B {t:category:event,u:library:probe} [(predicate)] ["format string"[, arguments]]
.TP
.B [{p,r}]
Probe type \- "p" for function entry, "r" for function return. The default
probe type is "p".
.B {[{p,r}],t,u}
Probe type \- "p" for function entry, "r" for function return, "t" for kernel
tracepoint, "u" for USDT probe. The default probe type is "p".
.TP
.B [library]
Library containing the probe.
......@@ -58,9 +60,18 @@ refers to libc. If no library name is specified, the kernel is assumed. Also,
you can specify an executable name (without a full path) if it is in the PATH.
For example, "bash".
.TP
.B category
The tracepoint category. For example, "sched" or "irq".
.TP
.B function
The function to probe.
.TP
.B event
The tracepoint event. For example, "block_rq_complete".
.TP
.B probe
The USDT probe name. For example, "pthread_create".
.TP
.B [(predicate)]
The filter applied to the captured data. Only if the filter evaluates as true,
the trace message will be printed. The filter can use any valid C expression
......@@ -81,6 +92,16 @@ number of arguments as there are placeholders in the format string. The
format specifier replacements may be any C expressions, and may refer to the
same special keywords as in the predicate (arg1, arg2, etc.).
In tracepoints, both the predicate and the arguments may refer to the tracepoint
format structure, which is stored in the special "tp" variable. For example, the
block:block_rq_complete tracepoint can print or filter by tp.nr_sector. To
discover the format of your tracepoint, use the tplist tool. Note that you can
also use the members of the "tp" struct directly, e.g "nr_sector" instead of
"tp.nr_sector".
In USDT probes, the arg1, ..., argN variables refer to the probe's arguments.
To determine which arguments your probe has, use the tplist tool.
The predicate expression and the format specifier replacements for printing
may also use the following special keywords: $pid, $tgid to refer to the
current process' pid and tgid; $uid, $gid to refer to the current user's
......@@ -102,6 +123,14 @@ Trace all malloc calls and print the size of the requested allocation:
Trace returns from the readline function in bash and print the return value as a string:
#
.B trace 'r:bash:readline """%s"", retval'
.TP
Trace the block:block_rq_complete tracepoint and print the number of sectors completed:
#
.B trace 't:block:block_rq_complete """%d sectors"", nr_sector'
.TP
Trace the pthread_create USDT probe from the pthread library and print the address of the thread's start function:
#
.B trace 'u:pthread:pthread_create """start addr = %llx"", arg3'
.SH SOURCE
This is from bcc.
.IP
......
......@@ -42,16 +42,18 @@ set_target_properties(bcc-static PROPERTIES OUTPUT_NAME bcc)
# BPF is still experimental otherwise it should be available
#llvm_map_components_to_libnames(llvm_libs bpf mcjit irreader passes)
llvm_map_components_to_libnames(llvm_libs mcjit irreader passes linker
instrumentation objcarcopts bitwriter option x86codegen)
llvm_map_components_to_libnames(llvm_libs bitwriter bpfcodegen irreader linker
mcjit objcarcopts option passes x86codegen)
llvm_expand_dependencies(expanded_libs ${llvm_libs})
# order is important
set(clang_libs ${libclangFrontend} ${libclangSerialization} ${libclangDriver} ${libclangParse}
${libclangSema} ${libclangCodeGen} ${libclangAnalysis} ${libclangRewrite} ${libclangEdit}
${libclangAST} ${libclangLex} ${libclangBasic})
# Link against LLVM libraries
target_link_libraries(bcc-shared b_frontend clang_frontend ${clang_libs} ${llvm_libs} LLVMBPFCodeGen)
target_link_libraries(bcc-static b_frontend clang_frontend bcc-loader-static ${clang_libs} ${llvm_libs} LLVMBPFCodeGen)
target_link_libraries(bcc-shared b_frontend clang_frontend ${clang_libs} ${expanded_libs})
target_link_libraries(bcc-static b_frontend clang_frontend bcc-loader-static ${clang_libs} ${expanded_libs})
install(TARGETS bcc-shared LIBRARY COMPONENT libbcc
DESTINATION ${CMAKE_INSTALL_LIBDIR})
......
......@@ -692,7 +692,7 @@ BFrontendAction::BFrontendAction(llvm::raw_ostream &os, unsigned flags)
}
void BFrontendAction::EndSourceFileAction() {
if (flags_ & 0x4)
if (flags_ & DEBUG_PREPROCESSOR)
rewriter_->getEditBuffer(rewriter_->getSourceMgr().getMainFileID()).write(llvm::errs());
rewriter_->getEditBuffer(rewriter_->getSourceMgr().getMainFileID()).write(os_);
os_.flush();
......
......@@ -26,6 +26,8 @@
#include "table_desc.h"
#define DEBUG_PREPROCESSOR 0x4
namespace clang {
class ASTConsumer;
class ASTContext;
......
......@@ -82,6 +82,9 @@ int KBuildHelper::get_flags(const char *uname_machine, vector<string> *cflags) {
cflags->push_back("-include");
cflags->push_back("./include/linux/kconfig.h");
cflags->push_back("-D__KERNEL__");
cflags->push_back("-D__HAVE_BUILTIN_BSWAP16__");
cflags->push_back("-D__HAVE_BUILTIN_BSWAP32__");
cflags->push_back("-D__HAVE_BUILTIN_BSWAP64__");
cflags->push_back("-Wno-unused-value");
cflags->push_back("-Wno-pointer-sign");
......
......@@ -155,7 +155,7 @@ int ClangLoader::parse(unique_ptr<llvm::Module> *mod, unique_ptr<vector<TableDes
// Initialize a compiler invocation object from the clang (-cc1) arguments.
const driver::ArgStringList &ccargs = cmd.getArguments();
if (flags_ & 0x4) {
if (flags_ & DEBUG_PREPROCESSOR) {
llvm::errs() << "clang";
for (auto arg : ccargs)
llvm::errs() << " " << arg;
......
......@@ -20,13 +20,16 @@ import json
import multiprocessing
import os
import re
from subprocess import Popen, PIPE
from subprocess import Popen, PIPE, STDOUT
import struct
import sys
basestring = (unicode if sys.version_info[0] < 3 else str)
from .libbcc import lib, _CB_TYPE
from .procstat import ProcStat
from .table import Table
from .tracepoint import Perf, Tracepoint
from .usyms import ProcessSymbols
open_kprobes = {}
open_uprobes = {}
......@@ -38,6 +41,10 @@ ksym_names = {}
ksym_loaded = 0
_kprobe_limit = 1000
DEBUG_LLVM_IR = 0x1
DEBUG_BPF = 0x2
DEBUG_PREPROCESSOR = 0x4
@atexit.register
def cleanup_kprobes():
for k, v in open_kprobes.items():
......@@ -72,6 +79,30 @@ class BPF(object):
_lib_load_address_cache = {}
_lib_symbol_cache = {}
_auto_includes = {
"linux/time.h" : ["time"],
"linux/fs.h" : ["fs", "file"],
"linux/blkdev.h" : ["bio", "request"],
"linux/slab.h" : ["alloc"],
"linux/netdevice.h" : ["sk_buff", "net_device"]
}
@classmethod
def generate_auto_includes(cls, program_words):
"""
Generates #include statements automatically based on a set of
recognized types such as sk_buff and bio. The input is all the words
that appear in the BPF program, and the output is a (possibly empty)
string of #include statements, such as "#include <linux/fs.h>".
"""
headers = ""
for header, keywords in cls._auto_includes.items():
for keyword in keywords:
for word in program_words:
if keyword in word and header not in headers:
headers += "#include <%s>\n" % header
return headers
# defined for compatibility reasons, to be removed
Table = Table
......@@ -105,8 +136,9 @@ class BPF(object):
hdr_file (Optional[str]): Path to a helper header file for the `src_file`
text (Optional[str]): Contents of a source file for the module
debug (Optional[int]): Flags used for debug prints, can be |'d together
0x1: print LLVM IR to stderr
0x2: print BPF bytecode to stderr
DEBUG_LLVM_IR: print LLVM IR to stderr
DEBUG_BPF: print BPF bytecode to stderr
DEBUG_PREPROCESSOR: print Preprocessed C file to stderr
"""
self._reader_cb_impl = _CB_TYPE(BPF._reader_cb)
......@@ -165,7 +197,7 @@ class BPF(object):
lib.bpf_module_kern_version(self.module),
log_buf, ct.sizeof(log_buf) if log_buf else 0)
if self.debug & 0x2:
if self.debug & DEBUG_BPF:
print(log_buf.value.decode(), file=sys.stderr)
if fd < 0:
......@@ -325,6 +357,11 @@ class BPF(object):
global open_kprobes
return open_kprobes
@staticmethod
def open_uprobes():
global open_uprobes
return open_uprobes
@staticmethod
def detach_kprobe(event):
ev_name = "p_" + event.replace("+", "_").replace(".", "_")
......@@ -701,7 +738,7 @@ class BPF(object):
if idx == -1:
return "[unknown]"
offset = int(addr - ksyms[idx][1])
return ksyms[idx][0] + hex(offset)
return "%s+0x%x" % (ksyms[idx][0], offset)
@staticmethod
def ksymname(name):
......@@ -717,6 +754,28 @@ class BPF(object):
return 0
return ksyms[idx][1]
@classmethod
def usymaddr(cls, pid, addr, refresh_symbols=False):
"""usymaddr(pid, addr, refresh_symbols=False)
Decode the specified address in the specified process to a symbolic
representation that includes the symbol name, offset within the symbol,
and the module name. See the ProcessSymbols class for more details.
Specify refresh_symbols=True if you suspect the set of loaded modules
or their load addresses has changed since the last time you called
usymaddr() on this pid.
"""
proc_sym = None
if pid in cls._process_symbols:
proc_sym = cls._process_symbols[pid]
if refresh_symbols:
proc_sym.refresh_code_ranges()
else:
proc_sym = ProcessSymbols(pid)
cls._process_symbols[pid] = proc_sym
return proc_sym.decode_addr(addr)
@staticmethod
def num_open_kprobes():
"""num_open_kprobes()
......@@ -740,3 +799,5 @@ class BPF(object):
except KeyboardInterrupt:
exit()
from .usdt import USDTReader
# Copyright 2016 Sasha Goldshtein
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
class ProcStat(object):
def __init__(self, pid):
self.pid = pid
self.exe = self._get_exe()
self.start_time = self._get_start_time()
def is_stale(self):
return self.exe != self._get_exe() or \
self.start_time != self._get_start_time()
def _get_exe(self):
return os.popen("readlink -f /proc/%d/exe" % self.pid).read()
def _get_start_time(self):
return os.popen("cut -d' ' -f 22 /proc/%d/stat" %
self.pid).read()
......@@ -523,9 +523,37 @@ class PerCpuArray(ArrayBase):
return result
class StackTrace(TableBase):
MAX_DEPTH = 127
def __init__(self, *args, **kwargs):
super(StackTrace, self).__init__(*args, **kwargs)
class StackWalker(object):
def __init__(self, stack, resolve=None):
self.stack = stack
self.n = -1
self.resolve = resolve
def __iter__(self):
return self
def __next__(self):
return self.next()
def next(self):
self.n += 1
if self.n == StackTrace.MAX_DEPTH:
raise StopIteration()
addr = self.stack.ip[self.n]
if addr == 0 :
raise StopIteration()
return self.resolve(addr) if self.resolve else addr
def walk(self, stack_id, resolve=None):
return StackTrace.StackWalker(self[self.Key(stack_id)], resolve)
def __len__(self):
i = 0
for k in self: i += 1
......
# Copyright 2016 Sasha Goldshtein
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import ctypes as ct
import multiprocessing
import os
import re
class Perf(object):
class perf_event_attr(ct.Structure):
_fields_ = [
('type', ct.c_uint),
('size', ct.c_uint),
('config', ct.c_ulong),
('sample_period', ct.c_ulong),
('sample_type', ct.c_ulong),
('IGNORE1', ct.c_ulong),
('IGNORE2', ct.c_ulong),
('wakeup_events', ct.c_uint),
('IGNORE3', ct.c_uint),
('IGNORE4', ct.c_ulong),
('IGNORE5', ct.c_ulong),
('IGNORE6', ct.c_ulong),
('IGNORE7', ct.c_uint),
('IGNORE8', ct.c_int),
('IGNORE9', ct.c_ulong),
('IGNORE10', ct.c_uint),
('IGNORE11', ct.c_uint)
]
NR_PERF_EVENT_OPEN = 298
PERF_TYPE_TRACEPOINT = 2
PERF_SAMPLE_RAW = 1024
PERF_FLAG_FD_CLOEXEC = 8
PERF_EVENT_IOC_SET_FILTER = 1074275334
PERF_EVENT_IOC_ENABLE = 9216
libc = ct.CDLL('libc.so.6', use_errno=True)
syscall = libc.syscall # not declaring vararg types
ioctl = libc.ioctl # not declaring vararg types
@staticmethod
def _open_for_cpu(cpu, attr):
pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
-1, cpu, -1, Perf.PERF_FLAG_FD_CLOEXEC)
if pfd < 0:
errno_ = ct.get_errno()
raise OSError(errno_, os.strerror(errno_))
if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
"common_pid == -17") < 0:
errno_ = ct.get_errno()
raise OSError(errno_, os.strerror(errno_))
if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
errno_ = ct.get_errno()
raise OSError(errno_, os.strerror(errno_))
@staticmethod
def perf_event_open(tpoint_id):
attr = Perf.perf_event_attr()
attr.config = tpoint_id
attr.type = Perf.PERF_TYPE_TRACEPOINT
attr.sample_type = Perf.PERF_SAMPLE_RAW
attr.sample_period = 1
attr.wakeup_events = 1
for cpu in range(0, multiprocessing.cpu_count()):
Perf._open_for_cpu(cpu, attr)
class Tracepoint(object):
enabled_tracepoints = []
trace_root = "/sys/kernel/debug/tracing"
event_root = os.path.join(trace_root, "events")
@classmethod
def _any_tracepoints_enabled(cls):
return len(cls.enabled_tracepoints) > 0
@classmethod
def generate_decl(cls):
if not cls._any_tracepoints_enabled():
return ""
return "\nBPF_HASH(__trace_di, u64, u64);\n"
@classmethod
def generate_entry_probe(cls):
if not cls._any_tracepoints_enabled():
return ""
return """
int __trace_entry_update(struct pt_regs *ctx)
{
u64 tid = bpf_get_current_pid_tgid();
u64 val = ctx->di;
__trace_di.update(&tid, &val);
return 0;
}
"""
def __init__(self, category, event, tp_id):
self.category = category
self.event = event
self.tp_id = tp_id
self._retrieve_struct_fields()
def _retrieve_struct_fields(self):
self.struct_fields = []
format_lines = Tracepoint.get_tpoint_format(self.category,
self.event)
for line in format_lines:
match = re.search(r'field:([^;]*);.*size:(\d+);', line)
if match is None:
continue
parts = match.group(1).split()
field_name = parts[-1:][0]
field_type = " ".join(parts[:-1])
field_size = int(match.group(2))
if "__data_loc" in field_type:
continue
if field_name.startswith("common_"):
continue
self.struct_fields.append((field_type, field_name))
def _generate_struct_fields(self):
text = ""
for field_type, field_name in self.struct_fields:
text += " %s %s;\n" % (field_type, field_name)
return text
def generate_struct(self):
self.struct_name = self.event + "_trace_entry"
return """
struct %s {
u64 __do_not_use__;
%s
};
""" % (self.struct_name, self._generate_struct_fields())
def _generate_struct_locals(self):
text = ""
for field_type, field_name in self.struct_fields:
if field_type == "char" and field_name.endswith(']'):
# Special case for 'char whatever[N]', should
# be assigned to a 'char *'
field_type = "char *"
field_name = re.sub(r'\[\d+\]$', '', field_name)
text += " %s %s = tp.%s;\n" % (
field_type, field_name, field_name)
return text
def generate_get_struct(self):
return """
u64 tid = bpf_get_current_pid_tgid();
u64 *di = __trace_di.lookup(&tid);
if (di == 0) { return 0; }
struct %s tp = {};
bpf_probe_read(&tp, sizeof(tp), (void *)*di);
%s
""" % (self.struct_name, self._generate_struct_locals())
@classmethod
def enable_tracepoint(cls, category, event):
tp_id = cls.get_tpoint_id(category, event)
if tp_id == -1:
raise ValueError("no such tracepoint found: %s:%s" %
(category, event))
Perf.perf_event_open(tp_id)
new_tp = Tracepoint(category, event, tp_id)
cls.enabled_tracepoints.append(new_tp)
return new_tp
@staticmethod
def get_tpoint_id(category, event):
evt_dir = os.path.join(Tracepoint.event_root, category, event)
try:
return int(
open(os.path.join(evt_dir, "id")).read().strip())
except:
return -1
@staticmethod
def get_tpoint_format(category, event):
evt_dir = os.path.join(Tracepoint.event_root, category, event)
try:
return open(os.path.join(evt_dir, "format")).readlines()
except:
return ""
@classmethod
def attach(cls, bpf):
if cls._any_tracepoints_enabled():
bpf.attach_kprobe(event="tracing_generic_entry_update",
fn_name="__trace_entry_update")
This diff is collapsed.
# Copyright 2016 Sasha Goldshtein
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from subprocess import Popen, PIPE, STDOUT
class ProcessSymbols(object):
def __init__(self, pid):
"""
Initializes the process symbols store for the specified pid.
Call refresh_code_ranges() periodically if you anticipate changes
in the set of loaded libraries or their addresses.
"""
self.pid = pid
self.refresh_code_ranges()
def refresh_code_ranges(self):
self.code_ranges = self._get_code_ranges()
self.ranges_cache = {}
self.procstat = ProcStat(self.pid)
@staticmethod
def _is_binary_segment(parts):
return len(parts) == 6 and parts[5][0] != '[' and 'x' in parts[1]
def _get_code_ranges(self):
ranges = {}
raw_ranges = open("/proc/%d/maps" % self.pid).readlines()
# A typical line from /proc/PID/maps looks like this:
# 7f21b6635000-7f21b67eb000 r-xp ... /usr/lib64/libc-2.21.so
# We are looking for executable segments that have a .so file
# or the main executable. The first two lines are the range of
# that memory segment, which we index by binary name.
for raw_range in raw_ranges:
parts = raw_range.split()
if not ProcessSymbols._is_binary_segment(parts):
continue
binary = parts[5]
range_parts = parts[0].split('-')
addr_range = (int(range_parts[0], 16), int(range_parts[1], 16))
ranges[binary] = addr_range
return ranges
@staticmethod
def _is_function_symbol(parts):
return len(parts) == 6 and parts[3] == ".text" and parts[2] == "F"
@staticmethod
def _run_command_get_output(command):
p = Popen(command.split(), stdout=PIPE, stderr=STDOUT)
return iter(p.stdout.readline, b'')
def _get_sym_ranges(self, binary):
if binary in self.ranges_cache:
return self.ranges_cache[binary]
sym_ranges = {}
raw_symbols = ProcessSymbols._run_command_get_output(
"objdump -t %s" % binary)
for raw_symbol in raw_symbols:
# A typical line from objdump -t looks like this:
# 00000000004007f5 g F .text 000000000000010e main
# We only care about functions in the .text segment.
# The first number is the start address, and the second
# number is the length.
parts = raw_symbol.split()
if not ProcessSymbols._is_function_symbol(parts):
continue
sym_start = int(parts[0], 16)
sym_len = int(parts[4], 16)
sym_name = parts[5]
sym_ranges[sym_name] = (sym_start, sym_len)
self.ranges_cache[binary] = sym_ranges
return sym_ranges
def _decode_sym(self, binary, offset):
sym_ranges = self._get_sym_ranges(binary)
# Find the symbol that contains the specified offset.
# There might not be one.
for name, (start, length) in sym_ranges.items():
if offset >= start and offset <= (start + length):
return "%s+0x%x" % (name, offset - start)
return "%x" % offset
def _check_pid_wrap(self):
if self.procstat.is_stale():
self.refresh_code_ranges()
def decode_addr(self, addr):
"""
Given an address, return the best symbolic representation of it.
If it doesn't fall in any module, return its hex string. If it
falls within a module but we don't have a symbol for it, return
the hex string and the module. If we do have a symbol for it,
return the symbol and the module, e.g. "readline+0x10 [bash]".
"""
self._check_pid_wrap()
# Find the binary that contains the specified address.
# For .so files, look at the relative address; for the main
# executable, look at the absolute address.
for binary, (start, end) in self.code_ranges.items():
if addr >= start and addr <= end:
offset = addr - start \
if binary.endswith(".so") else addr
return "%s [%s]" % (self._decode_sym(binary, offset),
binary)
return "%x" % addr
from . import ProcStat
This diff is collapsed.
......@@ -262,6 +262,27 @@ p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size
The flags value must be expanded by hand, but it's still helpful to eliminate
certain kinds of allocations or visually group them together.
argdist also has basic support for kernel tracepoints. It is sometimes more
convenient to use tracepoints because they are documented and don't vary a lot
between kernel versions like function signatures tend to. For example, let's
trace the net:net_dev_start_xmit tracepoint and print the interface name that
is transmitting:
# argdist -C 't:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name' -n 2
[05:01:10]
t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
COUNT EVENT
4 c->name = eth0
[05:01:11]
t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
COUNT EVENT
6 c->name = lo
92 c->name = eth0
Note that to determine the necessary function signature you need to look at the
TP_PROTO declaration in the kernel headers. For example, the net_dev_start_xmit
tracepoint is defined in the include/trace/events/net.h header file.
Here's a final example that finds how many write() system calls are performed
by each process on the system:
......@@ -311,13 +332,14 @@ optional arguments:
additional header files to include in the BPF program
Probe specifier syntax:
{p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
{p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
Where:
p,r -- probe at function entry or at function exit
p,r,t,u -- probe at function entry, function exit, kernel tracepoint,
or USDT probe
in exit probes: can use $retval, $entry(param), $latency
library -- the library that contains the function
(leave empty for kernel functions)
function -- the function name to trace
category -- the category of the kernel tracepoint (e.g. net, sched)
signature -- the function's parameters, as in the C header
type -- the type of the expression to collect (supports multiple)
expr -- the expression to collect (supports multiple)
......@@ -365,7 +387,17 @@ argdist -C 'p:c:fork()#fork calls'
Count fork() calls in libc across all processes
Can also use funccount.py, which is easier and more flexible
argdist -H \
argdist -H 't:block:block_rq_complete():u32:tp.nr_sector'
Print histogram of number of sectors in completing block I/O requests
argdist -C 't:irq:irq_handler_entry():int:tp.irq'
Aggregate interrupts by interrupt request (IRQ)
argdist -C 'u:pthread:pthread_start():u64:arg2' -p 1337
Print frequency of function addresses used as a pthread start function,
relying on the USDT pthread_start probe in process 1337
argdist -H \
'p:c:sleep(u32 seconds):u32:seconds' \
'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
Print histograms of sleep() and nanosleep() parameter values
......
This diff is collapsed.
......@@ -9,8 +9,6 @@
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 21 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
......@@ -48,7 +46,6 @@ args = parser.parse_args()
folded = args.folded
duration = int(args.duration)
debug = 0
maxdepth = 20 # and MAXDEPTH
if args.pid and args.useronly:
print("ERROR: use either -p or -u.")
exit()
......@@ -62,31 +59,15 @@ bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#define MAXDEPTH 20
#define MINBLOCK_US 1
struct key_t {
char name[TASK_COMM_LEN];
// Skip saving the ip
u64 ret[MAXDEPTH];
int stack_id;
};
BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
BPF_STACK_TRACE(stack_traces, 1024)
int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
u32 pid;
......@@ -111,36 +92,12 @@ int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
return 0;
// create map key
u64 zero = 0, *val, bp = 0;
int depth = 0;
u64 zero = 0, *val;
struct key_t key = {};
bpf_get_current_comm(&key.name, sizeof(key.name));
bp = ctx->bp;
// unrolled loop (MAXDEPTH):
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
key.stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
val = counts.lookup_or_init(&key, &zero);
(*val) += delta;
return 0;
......@@ -183,24 +140,17 @@ while (1):
if not folded:
print()
counts = b.get_table("counts")
stack_traces = b.get_table("stack_traces")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
if folded:
# print folded stack output
line = k.name.decode() + ";"
for i in reversed(range(0, maxdepth)):
if k.ret[i] == 0:
continue
line = line + b.ksym(k.ret[i])
if i != 0:
line = line + ";"
print("%s %d" % (line, v.value))
stack = list(stack_traces.walk(k.stack_id))[1:]
line = [k.name.decode()] + [b.ksym(addr) for addr in reversed(stack)]
print("%s %d" % (";".join(line), v.value))
else:
# print default multi-line stack output
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print(" %-16x %s" % (k.ret[i],
b.ksym(k.ret[i])))
for addr in stack_traces.walk(k.stack_id):
print(" %-16x %s" % (addr, b.ksym(addr)))
print(" %-16s %s" % ("-", k.name))
print(" %d\n" % v.value)
counts.clear()
......
This diff is collapsed.
#!/usr/bin/python
#
# offcputime Summarize off-CPU time by kernel stack trace
# For Linux, uses BCC, eBPF.
#
# USAGE: offcputime [-h] [-u] [-p PID] [-v] [-f] [duration]
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 21 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 13-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./offcputime # trace off-CPU stack time until Ctrl-C
./offcputime 5 # trace for 5 seconds only
./offcputime -f 5 # 5 seconds, and output in folded format
./offcputime -u # don't include kernel threads (user only)
./offcputime -p 185 # trace fo PID 185 only
"""
parser = argparse.ArgumentParser(
description="Summarize off-CPU time by kernel stack trace",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-u", "--useronly", action="store_true",
help="user threads only (no kernel threads)")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("-f", "--folded", action="store_true",
help="output folded format")
parser.add_argument("duration", nargs="?", default=99999999,
help="duration of trace, in seconds")
args = parser.parse_args()
folded = args.folded
duration = int(args.duration)
debug = 0
maxdepth = 20 # and MAXDEPTH
if args.pid and args.useronly:
print("ERROR: use either -p or -u.")
exit()
# signal handler
def signal_ignore(signal, frame):
print()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#define MAXDEPTH 20
#define MINBLOCK_US 1
struct key_t {
char name[TASK_COMM_LEN];
// Skip saving the ip
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
BPF_HASH(start, u32);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
u32 pid;
u64 ts, *tsp;
// record previous thread sleep time
if (FILTER) {
pid = prev->pid;
ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
}
// calculate current thread's delta time
pid = bpf_get_current_pid_tgid();
tsp = start.lookup(&pid);
if (tsp == 0)
return 0; // missed start or filtered
u64 delta = bpf_ktime_get_ns() - *tsp;
start.delete(&pid);
delta = delta / 1000;
if (delta < MINBLOCK_US)
return 0;
// create map key
u64 zero = 0, *val, bp = 0;
int depth = 0;
struct key_t key = {};
bpf_get_current_comm(&key.name, sizeof(key.name));
bp = ctx->bp;
// unrolled loop (MAXDEPTH):
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val) += delta;
return 0;
}
"""
if args.pid:
filter = 'pid == %s' % args.pid
elif args.useronly:
filter = '!(prev->flags & PF_KTHREAD)'
else:
filter = '1'
bpf_text = bpf_text.replace('FILTER', filter)
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="finish_task_switch", fn_name="oncpu")
matched = b.num_open_kprobes()
if matched == 0:
print("0 functions traced. Exiting.")
exit()
# header
if not folded:
print("Tracing off-CPU time (us) by kernel stack", end="")
if duration < 99999999:
print(" for %d secs." % duration)
else:
print("... Hit Ctrl-C to end.")
# output
while (1):
try:
sleep(duration)
except KeyboardInterrupt:
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, signal_ignore)
if not folded:
print()
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
if folded:
# print folded stack output
line = k.name.decode() + ";"
for i in reversed(range(0, maxdepth)):
if k.ret[i] == 0:
continue
line = line + b.ksym(k.ret[i])
if i != 0:
line = line + ";"
print("%s %d" % (line, v.value))
else:
# print default multi-line stack output
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print(" %-16x %s" % (k.ret[i],
b.ksym(k.ret[i])))
print(" %-16s %s" % ("-", k.name))
print(" %d\n" % v.value)
counts.clear()
if not folded:
print("Detaching...")
exit()
#!/usr/bin/python
#
# stackcount Count kernel function calls and their stack traces.
# For Linux, uses BCC, eBPF.
#
# USAGE: stackcount [-h] [-p PID] [-i INTERVAL] [-T] [-r] pattern
#
# The pattern is a string with optional '*' wildcards, similar to file
# globbing. If you'd prefer to use regular expressions, use the -r option.
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# Currently limited to a stack trace depth of 11 (maxdepth + 1).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
import signal
# arguments
examples = """examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount ip_output # count kernel stack traces for ip_output
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
"""
parser = argparse.ArgumentParser(
description="Count kernel function calls and their stack traces",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-i", "--interval", default=99999999,
help="summary interval, seconds")
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-r", "--regexp", action="store_true",
help="use regular expressions. Default is \"*\" wildcards only.")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="show raw addresses")
parser.add_argument("pattern",
help="search expression for kernel functions")
args = parser.parse_args()
pattern = args.pattern
if not args.regexp:
pattern = pattern.replace('*', '.*')
pattern = '^' + pattern + '$'
offset = args.offset
verbose = args.verbose
debug = 0
maxdepth = 10 # and MAXDEPTH
# signal handler
def signal_ignore(signal, frame):
print()
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#define MAXDEPTH 10
struct key_t {
u64 ip;
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
int trace_count(struct pt_regs *ctx) {
FILTER
struct key_t key = {};
u64 zero = 0, *val, bp = 0;
int depth = 0;
key.ip = ctx->ip;
bp = ctx->bp;
// unrolled loop, 10 (MAXDEPTH) frames deep:
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
'if (pid != %s) { return 0; }') % (args.pid))
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
b = BPF(text=bpf_text)
b.attach_kprobe(event_re=pattern, fn_name="trace_count")
matched = b.num_open_kprobes()
if matched == 0:
print("0 functions matched by \"%s\". Exiting." % args.pattern)
exit()
# header
print("Tracing %d functions for \"%s\"... Hit Ctrl-C to end." %
(matched, args.pattern))
def print_frame(addr):
print(" ", end="")
if verbose:
print("%-16x " % addr, end="")
if offset:
print("%s" % b.ksymaddr(addr))
else:
print("%s" % b.ksym(addr))
# output
exiting = 0 if args.interval else 1
while (1):
try:
sleep(int(args.interval))
except KeyboardInterrupt:
exiting = 1
# as cleanup can take many seconds, trap Ctrl-C:
signal.signal(signal.SIGINT, signal_ignore)
print()
if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
counts = b.get_table("counts")
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
print_frame(k.ip)
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print_frame(k.ret[i])
print(" %d\n" % v.value)
counts.clear()
if exiting:
print("Detaching...")
exit()
#!/usr/bin/python
#
# stacksnoop Trace a kernel function and print all kernel stack traces.
# For Linux, uses BCC, eBPF, and currently x86_64 only. Inline C.
#
# USAGE: stacksnoop [-h] [-p PID] [-s] [-v] function
#
# The current implementation uses an unrolled loop for x86_64, and was written
# as a proof of concept. This implementation should be replaced in the future
# with an appropriate bpf_ call, when available.
#
# The stack depth is limited to 10 (+1 for the current instruction pointer).
# This could be tunable in a future version.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
# arguments
examples = """examples:
./stacksnoop ext4_sync_fs # print kernel stack traces for ext4_sync_fs
./stacksnoop -s ext4_sync_fs # ... also show symbol offsets
./stacksnoop -v ext4_sync_fs # ... show extra columns
./stacksnoop -p 185 ext4_sync_fs # ... only when PID 185 is on-CPU
"""
parser = argparse.ArgumentParser(
description="Trace and print kernel stack traces for a kernel function",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-s", "--offset", action="store_true",
help="show address offsets")
parser.add_argument("-v", "--verbose", action="store_true",
help="print more fields")
parser.add_argument("function",
help="kernel function name")
args = parser.parse_args()
function = args.function
offset = args.offset
verbose = args.verbose
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
static int print_frame(u64 *bp, int *depth) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (ret < __START_KERNEL_map)
return 0;
bpf_trace_printk("r%d: %llx\\n", *depth, ret);
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
return 0;
*depth += 1;
return 1;
}
return 0;
}
void trace_stack(struct pt_regs *ctx) {
FILTER
u64 bp = 0;
int depth = 0;
bpf_trace_printk("\\n");
if (ctx->ip)
bpf_trace_printk("ip: %llx\\n", ctx->ip);
bp = ctx->bp;
// unrolled loop, 10 frames deep:
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
};
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
('u32 pid; pid = bpf_get_current_pid_tgid(); ' +
'if (pid != %s) { return; }') % (args.pid))
else:
bpf_text = bpf_text.replace('FILTER', '')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event=function, fn_name="trace_stack")
matched = b.num_open_kprobes()
if matched == 0:
print("Function \"%s\" not found. Exiting." % function)
exit()
# header
if verbose:
print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
"STACK"))
else:
print("%-18s %s" % ("TIME(s)", "STACK"))
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
if msg != "":
(reg, addr) = msg.split(" ")
if offset:
ip = b.ksymaddr(int(addr, 16))
else:
ip = b.ksym(int(addr, 16))
msg = msg + " " + ip
if verbose:
print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
else:
print("%-18.9f %s" % (ts, msg))
#!/usr/bin/env python
#
# solisten Trace TCP listen events
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: solisten.py [-h] [-p PID] [--show-netns]
#
# This is provided as a basic example of TCP connection & socket tracing.
# It could be usefull in scenarios where load balancers needs to be updated
# dynamically as application is fully initialized.
#
# All IPv4 listen attempts are traced, even if they ultimately fail or the
# the listening program is not willing to accept().
#
# Copyright (c) 2016 Jean-Tiare Le Bigot.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 04-Mar-2016 Jean-Tiare Le Bigot Created this.
import os
import socket
import netaddr
import argparse
from bcc import BPF
import ctypes as ct
# Arguments
examples = """Examples:
./solisten.py # Stream socket listen
./solisten.py -p 1234 # Stream socket listen for specified PID only
./solisten.py --netns 4242 # Stream socket listen for specified network namespace ID only
./solisten.py --show-netns # Show network namespace ID. Probably usefull if you run containers
"""
parser = argparse.ArgumentParser(
description="Stream sockets listen",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("--show-netns", action="store_true",
help="show network namespace")
parser.add_argument("-p", "--pid", default=0, type=int,
help="trace this PID only")
parser.add_argument("-n", "--netns", default=0, type=int,
help="trace this Network Namespace only")
# BPF Program
bpf_text = """
#include <net/sock.h>
#include <net/inet_sock.h>
#include <net/net_namespace.h>
#include <bcc/proto.h>
// Endian conversion. We can't use kernel version here as it uses inline
// assembly, neither libc version as we can't import it here. Adapted from both.
#if defined(__LITTLE_ENDIAN)
#define bcc_be32_to_cpu(x) ((u32)(__builtin_bswap32)((x)))
#define bcc_be64_to_cpu(x) ((u64)(__builtin_bswap64)((x)))
#elif defined(__BIG_ENDIAN)
#define bcc_be32_to_cpu(x) (x)
#define bcc_be64_to_cpu(x) (x)
#else
#error Host endianness not defined
#endif
// Common structure for UDP/TCP IPv4/IPv6
struct listen_evt_t {
u64 ts_us;
u64 pid_tgid;
u64 backlog;
u64 netns;
u64 proto; // familiy << 16 | type
u64 lport; // use only 16 bits
u64 laddr[2]; // IPv4: store in laddr[0]
char task[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(listen_evt);
// Send an event for each IPv4 listen with PID, bound address and port
int kprobe__inet_listen(struct pt_regs *ctx, struct socket *sock, int backlog)
{
// cast types. Intermediate cast not needed, kept for readability
struct sock *sk = sock->sk;
struct inet_sock *inet = inet_sk(sk);
// Built event for userland
struct listen_evt_t evt = {
.ts_us = bpf_ktime_get_ns() / 1000,
.backlog = backlog,
};
// Get process comm. Needs LLVM >= 3.7.1 see https://github.com/iovisor/bcc/issues/393
bpf_get_current_comm(evt.task, TASK_COMM_LEN);
// Get socket IP family
u16 family = sk->__sk_common.skc_family;
evt.proto = family << 16 | SOCK_STREAM;
// Get PID
evt.pid_tgid = bpf_get_current_pid_tgid();
##FILTER_PID##
// Get port
bpf_probe_read(&evt.lport, sizeof(u16), &(inet->inet_sport));
evt.lport = ntohs(evt.lport);
// Get network namespace id, if kernel supports it
#ifdef CONFIG_NET_NS
evt.netns = sk->__sk_common.skc_net.net->ns.inum;
#else
evt.netns = 0;
#endif
##FILTER_NETNS##
// Get IP
if (family == AF_INET) {
bpf_probe_read(evt.laddr, sizeof(u32), &(inet->inet_rcv_saddr));
evt.laddr[0] = bcc_be32_to_cpu(evt.laddr[0]);
} else if (family == AF_INET6) {
bpf_probe_read(evt.laddr, sizeof(evt.laddr), sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
evt.laddr[0] = bcc_be64_to_cpu(evt.laddr[0]);
evt.laddr[1] = bcc_be64_to_cpu(evt.laddr[1]);
}
// Send event to userland
listen_evt.perf_submit(ctx, &evt, sizeof(evt));
return 0;
};
"""
# event data
TASK_COMM_LEN = 16 # linux/sched.h
class ListenEvt(ct.Structure):
_fields_ = [
("ts_us", ct.c_ulonglong),
("pid_tgid", ct.c_ulonglong),
("backlog", ct.c_ulonglong),
("netns", ct.c_ulonglong),
("proto", ct.c_ulonglong),
("lport", ct.c_ulonglong),
("laddr", ct.c_ulonglong * 2),
("task", ct.c_char * TASK_COMM_LEN)
]
# TODO: properties to unpack protocol / ip / pid / tgid ...
# Format output
def event_printer(show_netns):
def print_event(cpu, data, size):
# Decode event
event = ct.cast(data, ct.POINTER(ListenEvt)).contents
pid = event.pid_tgid & 0xffffffff
proto_family = event.proto & 0xff
proto_type = event.proto >> 16 & 0xff
if proto_family == socket.SOCK_STREAM:
protocol = "TCP"
elif proto_family == socket.SOCK_DGRAM:
protocol = "UDP"
else:
protocol = "UNK"
address = ""
if proto_type == socket.AF_INET:
protocol += "v4"
address = netaddr.IPAddress(event.laddr[0])
elif proto_type == socket.AF_INET6:
address = netaddr.IPAddress(event.laddr[0]<<64 | event.laddr[1], version=6)
protocol += "v6"
# Display
if show_netns:
print("%-6d %-12.12s %-12s %-6s %-8s %-5s %-39s" % (
pid, event.task, event.netns, protocol, event.backlog,
event.lport, address,
))
else:
print("%-6d %-12.12s %-6s %-8s %-5s %-39s" % (
pid, event.task, protocol, event.backlog,
event.lport, address,
))
return print_event
if __name__ == "__main__":
# Parse arguments
args = parser.parse_args()
pid_filter = ""
netns_filter = ""
if args.pid:
pid_filter = "if (evt.pid_tgid != %d) return 0;" % args.pid
if args.netns:
netns_filter = "if (evt.netns != %d) return 0;" % args.netns
bpf_text = bpf_text.replace("##FILTER_PID##", pid_filter)
bpf_text = bpf_text.replace("##FILTER_NETNS##", netns_filter)
# Initialize BPF
b = BPF(text=bpf_text)
b["listen_evt"].open_perf_buffer(event_printer(args.show_netns))
# Print headers
if args.show_netns:
print("%-6s %-12s %-12s %-6s %-8s %-5s %-39s" % ("PID", "COMM", "NETNS", "PROTO", "BACKLOG", "PORT", "ADDR"))
else:
print("%-6s %-12s %-6s %-8s %-5s %-39s" % ("PID", "COMM", "PROTO", "BACKLOG", "PORT", "ADDR"))
# Read events
while 1:
b.kprobe_poll()
Demonstrations of solisten.py, the Linux eBPF/bcc version.
This tool traces the kernel function called when a program wants to listen
for TCP connections. It will not see UDP neither UNIX domain sockets.
It can be used to dynamically update a load balancer as a program is actually
ready to accept connexion, hence avoiding the "downtime" while it is initializing.
# ./solisten.py --show-netns
PID COMM NETNS PROTO BACKLOG ADDR PORT
3643 nc 4026531957 TCPv4 1 0.0.0.0 4242
3659 nc 4026531957 TCPv6 1 2001:f0d0:1002:51::4 4242
4221 redis-server 4026532165 TCPv6 128 :: 6379
4221 redis-server 4026532165 TCPv4 128 0.0.0.0 6379
6067 nginx 4026531957 TCPv4 128 0.0.0.0 80
6067 nginx 4026531957 TCPv6 128 :: 80
6069 nginx 4026531957 TCPv4 128 0.0.0.0 80
6069 nginx 4026531957 TCPv6 128 :: 80
6069 nginx 4026531957 TCPv4 128 0.0.0.0 80
6069 nginx 4026531957 TCPv6 128 :: 80
This output show the listen event from 3 programs. Netcat was started twice as
shown by the 2 different PIDs. The first time on the wilcard IPv4, the second
time on an IPv6. Netcat being a "one shot" program. It can accept a single
connection, hence the backlog of "1".
The next program is redis-server. As the netns column shows, it is in a
different network namespace than netcat and nginx. In this specific case
it was launched in a docker container. It listens both on IPv4 and IPv4
with up to 128 pending connections.
Determining the actual container is out if the scope of this tool. It could
be derived by scrapping /proc/<PID>/cgroup. Note that this is racy.
The overhead of this tool is negligeable as it traces listen() calls which are
invoked in the initialization path of a program. The operation part will remain
unaffected. In particular, accept() calls will not be affected. Neither
individual read() and write().
......@@ -72,52 +72,14 @@ def signal_ignore(signal, frame):
bpf_text = """
#include <uapi/linux/ptrace.h>
#define MAXDEPTH 10
struct key_t {
u64 ip;
u64 ret[MAXDEPTH];
};
BPF_HASH(counts, struct key_t);
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
if (ret < __START_KERNEL_map)
return 0;
return ret;
}
return 0;
}
BPF_HASH(counts, int);
BPF_STACK_TRACE(stack_traces, 1024);
int trace_count(struct pt_regs *ctx) {
FILTER
struct key_t key = {};
u64 zero = 0, *val, bp = 0;
int depth = 0;
key.ip = ctx->ip;
bp = ctx->bp;
// unrolled loop, 10 (MAXDEPTH) frames deep:
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
if (!(key.ret[depth++] = get_frame(&bp))) goto out;
out:
val = counts.lookup_or_init(&key, &zero);
int key = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
u64 zero = 0;
u64 *val = counts.lookup_or_init(&key, &zero);
(*val)++;
return 0;
}
......@@ -164,13 +126,11 @@ while (1):
if args.timestamp:
print("%-8s\n" % strftime("%H:%M:%S"), end="")
counts = b.get_table("counts")
counts = b["counts"]
stack_traces = b["stack_traces"]
for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
print_frame(k.ip)
for i in range(0, maxdepth):
if k.ret[i] == 0:
break
print_frame(k.ret[i])
for addr in stack_traces.walk(k.value):
print_frame(addr)
print(" %d\n" % v.value)
counts.clear()
......
......@@ -82,9 +82,6 @@ tracing.
It can be useful to trace the path to submit_bio to explain unusual rates of
disk IOPS. These could have in-kernel origins (eg, background scrub).
This version of stackcount truncates stacks to 10 levels deep (plus 1 for
the traced function, so 11).
As another example, here are the code paths that led to ip_output(), which
sends a packet at the IP level:
......
......@@ -20,6 +20,7 @@
from __future__ import print_function
from bcc import BPF
import argparse
import re
# arguments
examples = """examples:
......@@ -50,45 +51,14 @@ debug = 0
bpf_text = """
#include <uapi/linux/ptrace.h>
static int print_frame(u64 *bp, int *depth) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (ret < __START_KERNEL_map)
return 0;
bpf_trace_printk("r%d: %llx\\n", *depth, ret);
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
return 0;
*depth += 1;
return 1;
}
return 0;
}
BPF_STACK_TRACE(stack_traces, 128)
void trace_stack(struct pt_regs *ctx) {
FILTER
u64 bp = 0;
int depth = 0;
bpf_trace_printk("\\n");
if (ctx->ip)
bpf_trace_printk("ip: %llx\\n", ctx->ip);
bp = ctx->bp;
// unrolled loop, 10 frames deep:
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
if (!print_frame(&bp, &depth)) return;
};
int stack_id = stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID);
if (stack_id >= 0)
bpf_trace_printk("stack_id=%d\\n", stack_id);
}
"""
if args.pid:
bpf_text = bpf_text.replace('FILTER',
......@@ -107,24 +77,28 @@ if matched == 0:
print("Function \"%s\" not found. Exiting." % function)
exit()
stack_traces = b.get_table("stack_traces")
msg_regexp = re.compile("stack_id=(\d+)")
# header
if verbose:
print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU",
"STACK"))
print("%-18s %-12s %-6s %-3s %s" % ("TIME(s)", "COMM", "PID", "CPU", "SYSCALL"))
else:
print("%-18s %s" % ("TIME(s)", "STACK"))
print("%-18s %s" % ("TIME(s)", "SYSCALL"))
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
if msg != "":
(reg, addr) = msg.split(" ")
if offset:
ip = b.ksymaddr(int(addr, 16))
m = msg_regexp.match(msg)
if m:
if verbose:
print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, function))
else:
ip = b.ksym(int(addr, 16))
msg = msg + " " + ip
if verbose:
print("%-18.9f %-12.12s %-6d %-3d %s" % (ts, task, pid, cpu, msg))
else:
print("%-18.9f %s" % (ts, msg))
print("%-18.9f %s" % (ts, function))
stack_id = int(m.group(1))
for addr in stack_traces.walk(stack_id):
sym = b.ksymaddr(addr) if offset else b.ksym(addr)
print("\t%016x %s" % (addr, sym))
print()
......@@ -3,27 +3,20 @@ Demonstrations of stacksnoop, the Linux eBPF/bcc version.
This program traces the given kernel function and prints the kernel stack trace
for every call. This tool is useful for studying low frequency kernel functions,
to see how they were invoked. For example, tracing the ext4_sync_fs() call:
# ./stacksnoop ext4_sync_fs
TIME(s) STACK
42005194.132250004
42005194.132253997 ip: ffffffff81280461 ext4_sync_fs
42005194.132256001 r0: ffffffff811ed7f9 iterate_supers
42005194.132257000 r1: ffffffff8121ba25 sys_sync
42005194.132257000 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005194.132275000
42005194.132275999 ip: ffffffff81280461 ext4_sync_fs
42005194.132275999 r0: ffffffff811ed7f9 iterate_supers
42005194.132276997 r1: ffffffff8121ba35 sys_sync
42005194.132276997 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This shows that ext4_sync_fs() was called by iterate_supers(), which was called
by sys_sync(), and so on. (It tells me that this was a syscall invoked sync,
so an application has requested it.)
The "ip" refers to the instruction pointer, and the "r#" refers to the return
address for each stack frame.
to see how they were invoked. For example, tracing the submit_bio() call:
# ./stacksnoop submit_bio
TIME(s) SYSCALL
3592.838736000 submit_bio
ffffffff813bd961 submit_bio
ffffffff81257c12 submit_bh
ffffffff81301948 jbd2_journal_commit_transaction
ffffffff8130653a kjournald2
ffffffff810a2df8 kthread
ffffffff8183a122 ret_from_fork
This shows that submit_bio() was called by submit_bh(), which was called
by jbd2_journal_commit_transaction(), and so on.
For high frequency functions, see stackcount, which summarizes in-kernel for
efficiency. If you don't know if your function is low or high frequency, try
......@@ -32,20 +25,17 @@ funccount.
The -v option includes more fields, including the on-CPU process (COMM and PID):
# ./stacksnoop -v ext4_sync_fs
TIME(s) COMM PID CPU STACK
42005557.056332998 sync 22352 1
42005557.056336999 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056339003 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056340002 sync 22352 1 r1: ffffffff8121ba25 sys_sync
42005557.056340002 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
42005557.056358002 sync 22352 1
42005557.056358002 sync 22352 1 ip: ffffffff81280461 ext4_sync_fs
42005557.056359001 sync 22352 1 r0: ffffffff811ed7f9 iterate_supers
42005557.056359999 sync 22352 1 r1: ffffffff8121ba35 sys_sync
42005557.056359999 sync 22352 1 r2: ffffffff81775cb6 entry_SYSCALL_64_fastpath
This identifies the application issuing the sync syscall: the sync(1) command
# ./stacksnoop -v submit_bio
TIME(s) COMM PID CPU SYSCALL
3734.855027000 jbd2/dm-0-8 313 0 submit_bio
ffffffff813bd961 submit_bio
ffffffff81257c12 submit_bh
ffffffff81301948 jbd2_journal_commit_transaction
ffffffff8130653a kjournald2
ffffffff810a2df8 kthread
ffffffff8183a122 ret_from_fork
This identifies the application issuing the sync syscall: the jbd2 process
(COMM column).
......@@ -53,29 +43,32 @@ Here's another example, showing the path to second_overflow() and on-CPU
process:
# ./stacksnoop -v second_overflow
TIME(s) COMM PID CPU STACK
42005696.529449999 <idle> 0 0
42005696.529457003 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005696.529459000 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005696.529459998 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005696.529459998 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005696.529460996 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005696.529460996 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005697.616295002 <idle> 0 0
42005697.616301000 <idle> 0 0 ip: ffffffff810e5701 second_overflow
42005697.616302997 <idle> 0 0 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005697.616304003 <idle> 0 0 r1: ffffffff810ed6e0 tick_irq_enter
42005697.616304003 <idle> 0 0 r2: ffffffff8107a195 irq_enter
42005697.616305001 <idle> 0 0 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005697.616305001 <idle> 0 0 r4: ffffffff81777a2e xen_do_hypervisor_callback
42005698.556240998 <idle> 0 1
42005698.556247003 <idle> 0 1 ip: ffffffff810e5701 second_overflow
42005698.556249000 <idle> 0 1 r0: ffffffff810ecb1b tick_do_update_jiffies64
42005698.556249000 <idle> 0 1 r1: ffffffff810ed6e0 tick_irq_enter
42005698.556249999 <idle> 0 1 r2: ffffffff8107a195 irq_enter
42005698.556249999 <idle> 0 1 r3: ffffffff8146bb6f xen_evtchn_do_upcall
42005698.556250997 <idle> 0 1 r4: ffffffff81777a2e xen_do_hypervisor_callback
[...]
TIME(s) COMM PID CPU SYSCALL
3837.526433000 <idle> 0 1 second_overflow
ffffffff810fac41 second_overflow
ffffffff81102320 tick_do_update_jiffies64
ffffffff81102bf0 tick_irq_enter
ffffffff810882ac irq_enter
ffffffff8183c7df smp_apic_timer_interrupt
ffffffff8183aae2 apic_timer_interrupt
ffffffff81038f9e default_idle
ffffffff8103979f arch_cpu_idle
ffffffff810c69da default_idle_call
ffffffff810c6cd7 cpu_startup_entry
ffffffff81051cbe start_secondary
3838.526953000 <idle> 0 1 second_overflow
ffffffff810fac41 second_overflow
ffffffff81102320 tick_do_update_jiffies64
ffffffff81102bf0 tick_irq_enter
ffffffff810882ac irq_enter
ffffffff8183c7df smp_apic_timer_interrupt
ffffffff8183aae2 apic_timer_interrupt
ffffffff81038f9e default_idle
ffffffff8103979f arch_cpu_idle
ffffffff810c69da default_idle_call
ffffffff810c6cd7 cpu_startup_entry
ffffffff81051cbe start_secondary
This fires every second (see TIME(s)), and is from tick_do_update_jiffies64().
......
#!/usr/bin/env python
#
# tplist Display kernel tracepoints or USDT probes and their formats.
#
# USAGE: tplist [-p PID] [-l LIB] [-v] [filter]
#
# Licensed under the Apache License, Version 2.0 (the "License")
# Copyright (C) 2016 Sasha Goldshtein.
import argparse
import fnmatch
import os
import re
import sys
from bcc import USDTReader
trace_root = "/sys/kernel/debug/tracing"
event_root = os.path.join(trace_root, "events")
parser = argparse.ArgumentParser(description=
"Display kernel tracepoints or USDT probes and their formats.",
formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument("-p", "--pid", type=int, default=-1, help=
"List USDT probes in the specified process")
parser.add_argument("-l", "--lib", default="", help=
"List USDT probes in the specified library or executable")
parser.add_argument("-v", dest="variables", action="store_true", help=
"Print the format (available variables)")
parser.add_argument(dest="filter", nargs="?", help=
"A filter that specifies which probes/tracepoints to print")
args = parser.parse_args()
def print_tpoint_format(category, event):
fmt = open(os.path.join(event_root, category, event, "format")
).readlines()
for line in fmt:
match = re.search(r'field:([^;]*);', line)
if match is None:
continue
parts = match.group(1).split()
field_name = parts[-1:][0]
field_type = " ".join(parts[:-1])
if "__data_loc" in field_type:
continue
if field_name.startswith("common_"):
continue
print(" %s %s;" % (field_type, field_name))
def print_tpoint(category, event):
tpoint = "%s:%s" % (category, event)
if not args.filter or fnmatch.fnmatch(tpoint, args.filter):
print(tpoint)
if args.variables:
print_tpoint_format(category, event)
def print_tracepoints():
for category in os.listdir(event_root):
cat_dir = os.path.join(event_root, category)
if not os.path.isdir(cat_dir):
continue
for event in os.listdir(cat_dir):
evt_dir = os.path.join(cat_dir, event)
if os.path.isdir(evt_dir):
print_tpoint(category, event)
def print_usdt(pid, lib):
reader = USDTReader(bin_path=lib, pid=pid)
probes_seen = []
for probe in reader.probes:
probe_name = "%s:%s" % (probe.provider, probe.name)
if not args.filter or fnmatch.fnmatch(probe_name, args.filter):
if probe_name in probes_seen:
continue
probes_seen.append(probe_name)
if args.variables:
print(probe.display_verbose())
else:
print("%s %s:%s" % (probe.bin_path,
probe.provider, probe.name))
if __name__ == "__main__":
try:
if args.pid != -1 or args.lib != "":
print_usdt(args.pid, args.lib)
else:
print_tracepoints()
except:
if sys.exc_type is not SystemExit:
print(sys.exc_value)
Demonstrations of tplist.
tplist displays kernel tracepoints and USDT probes, including their
format. It can be used to discover probe points for use with the trace
and argdist tools. Kernel tracepoints are scattered around the kernel
and provide valuable static tracing on block and network I/O, scheduling,
power events, and many other subjects. USDT probes are placed in libraries
(such as libc) and executables (such as node) and provide static tracing
information that can (optionally) be turned on and off at runtime.
For example, suppose you want to discover which USDT probes a particular
executable contains. Just run tplist on that executable (or library):
$ tplist -l basic_usdt
/home/vagrant/basic_usdt basic_usdt:start_main
/home/vagrant/basic_usdt basic_usdt:loop_iter
/home/vagrant/basic_usdt basic_usdt:end_main
The loop_iter probe sounds interesting. What are the locations of that
probe, and which variables are available?
$ tplist '*loop_iter' -l basic_usdt -v
/home/vagrant/basic_usdt basic_usdt:loop_iter [sema 0x601036]
location 0x400550 raw args: -4@$42 8@%rax
4 signed bytes @ constant 42
8 unsigned bytes @ register %rax
location 0x40056f raw args: 8@-8(%rbp) 8@%rax
8 unsigned bytes @ -8(%rbp)
8 unsigned bytes @ register %rax
This output indicates that the loop_iter probe is used in two locations
in the basic_usdt executable. The first location passes a constant value,
42, to the probe. The second location passes a variable value located at
an offset from the %rbp register. Don't worry -- you don't have to trace
the register values yourself. The argdist and trace tools understand the
probe format and can print out the arguments automatically -- you can
refer to them as arg1, arg2, and so on.
Try to explore with some common libraries on your system and see if they
contain UDST probes. Here are two examples you might find interesting:
$ tplist -l pthread # list probes in libpthread
/lib64/libpthread.so.0 libpthread:pthread_start
/lib64/libpthread.so.0 libpthread:pthread_create
/lib64/libpthread.so.0 libpthread:pthread_join
/lib64/libpthread.so.0 libpthread:pthread_join_ret
/lib64/libpthread.so.0 libpthread:mutex_init
... more output truncated
$ tplist -l c # list probes in libc
/lib64/libc.so.6 libc:setjmp
/lib64/libc.so.6 libc:longjmp
/lib64/libc.so.6 libc:longjmp_target
/lib64/libc.so.6 libc:memory_arena_reuse_free_list
/lib64/libc.so.6 libc:memory_heap_new
... more output truncated
tplist also understands kernel tracepoints, and can list their format
as well. For example, let's look for all block I/O-related tracepoints:
# tplist 'block*'
block:block_touch_buffer
block:block_dirty_buffer
block:block_rq_abort
block:block_rq_requeue
block:block_rq_complete
block:block_rq_insert
block:block_rq_issue
block:block_bio_bounce
block:block_bio_complete
block:block_bio_backmerge
block:block_bio_frontmerge
block:block_bio_queue
block:block_getrq
block:block_sleeprq
block:block_plug
block:block_unplug
block:block_split
block:block_bio_remap
block:block_rq_remap
The block:block_rq_complete tracepoints sounds interesting. Let's print
its format to see what we can trace with argdist and trace:
$ tplist -v block:block_rq_complete
block:block_rq_complete
dev_t dev;
sector_t sector;
unsigned int nr_sector;
int errors;
char rwbs[8];
The dev, sector, nr_sector, etc. variables can now all be used in probes
you specify with argdist or trace.
USAGE message:
$ tplist -h
usage: tplist.py [-h] [-p PID] [-l LIB] [-v] [filter]
Display kernel tracepoints or USDT probes and their formats.
positional arguments:
filter A filter that specifies which probes/tracepoints to print
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID List USDT probes in the specified process
-l LIB, --lib LIB List USDT probes in the specified library or executable
-v Print the format (available variables)
This diff is collapsed.
......@@ -80,6 +80,31 @@ Note that the retval variable must be cast to int before comparing to zero.
The reason is that the default type for argN and retval is an unsigned 64-bit
integer, which can never be smaller than 0.
trace has also some basic support for kernel tracepoints. For example, let's
trace the block:block_rq_complete tracepoint and print out the number of sectors
transferred:
# trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
TIME PID COMM FUNC -
01:23:51 0 swapper/0 block_rq_complete sectors=8
01:23:55 10017 kworker/u64: block_rq_complete sectors=1
01:23:55 0 swapper/0 block_rq_complete sectors=8
^C
To discover the tracepoint structure format (which you can refer to as the "tp"
variable), use the tplist tool. For example:
# tplist -v block:block_rq_complete
block:block_rq_complete
dev_t dev;
sector_t sector;
unsigned int nr_sector;
int errors;
char rwbs[8];
This output tells you that you can use "tp.dev", "tp.sector", etc. in your
predicate and trace arguments.
As a final example, let's trace open syscalls for a specific process. By
default, tracing is system-wide, but the -p switch overrides this:
......@@ -144,4 +169,8 @@ trace 'r::__kmalloc (retval == 0) "kmalloc failed!"
Trace returns from __kmalloc which returned a null pointer
trace 'r:c:malloc (retval) "allocated = %p", retval
Trace returns from malloc and print non-NULL allocated buffers
trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
trace 'u:pthread:pthread_create (arg4 != 0)'
Trace the USDT probe pthread_create when its 4th argument is non-zero
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment