Commit 5f1b9764 authored by Sasha Goldshtein's avatar Sasha Goldshtein

Merge branch 'master' into alloc-hist

parents d1cc126c dd221c4e
...@@ -72,6 +72,8 @@ Tools:
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[filelife](tools/filelife.py): Trace the lifespan of short-lived files. [Examples](tools/filelife_example.txt).
- tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
- tools/[filetop](tools/filetop.py): File reads and writes by filename and process. Top for files. [Examples](tools/filetop_example.txt).
...@@ -83,6 +85,7 @@ Tools:
- tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_examples.txt).
- tools/[offcputime](tools/offcputime.py): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
- tools/[oomkill](tools/oomkill.py): Trace the out-of-memory (OOM) killer. [Examples](tools/oomkill_example.txt).
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
...
.TH dcsnoop 8 "2016-02-10" "USER COMMANDS"
.SH NAME
dcsnoop \- Trace directory entry cache (dcache) lookups. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B dcsnoop [\-h] [\-a]
.SH DESCRIPTION
By default, this traces every failed dcache lookup (cache miss), and shows the
process performing the lookup and the filename requested. A \-a option can be
used to show all lookups, not just failed ones.
The output of this tool can be verbose, and is intended for further
investigations of dcache performance beyond dcstat(8), which prints
per-second summaries.
This uses kernel dynamic tracing of the d_lookup() function, and will need
updating to match any changes to this function.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-a
Trace references, not just failed lookups.
.SH EXAMPLES
.TP
Trace failed dcache lookups:
#
.B dcsnoop
.TP
Trace all dcache lookups:
#
.B dcsnoop \-a
.SH FIELDS
.TP
TIME(s)
Time of lookup, in seconds.
.TP
PID
Process ID.
.TP
COMM
Process name.
.TP
T
Type: R == reference (only visible with \-a), M == miss. A miss will print two
lines, one for the reference, and one for the miss.
.TP
FILE
The file name component that was being looked up. This contains trailing
pathname components (after '/'), which will be the subject of subsequent
lookups.
.SH OVERHEAD
File name lookups can be frequent (depending on the workload), and this tool
prints a line for each failed lookup, and with \-a, each reference as well. The
output may be verbose, and the incurred overhead, while optimized to some
extent, may still range from noticeable to significant. This is only really
intended for deeper investigations beyond dcstat(8), when absolutely necessary.
Measure and quantify the overhead in a test environment before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
dcstat(8)
.TH dcstat 8 "2016-02-09" "USER COMMANDS"
.SH NAME
dcstat \- Directory entry cache (dcache) stats. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B dcstat
[interval [count]]
.SH DESCRIPTION
The Linux directory entry cache (dcache) improves the performance of file and
directory name lookups. This tool provides per-second summary statistics of
dcache performance.
This uses kernel dynamic tracing of kernel functions, lookup_fast() and
d_lookup(), which will need to be modified to match kernel changes.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Print summaries each second:
#
.B dcstat
.TP
Print output every five seconds, three times:
#
.B dcstat 5 3
.SH FIELDS
.TP
REFS/s
Number of dcache lookups (references) per second.
.TP
SLOW/s
Number of dcache lookups that failed the lookup_fast() path and executed the
lookup_slow() path instead.
.TP
MISS/s
Number of dcache misses (failed both fast and slow lookups).
.TP
HIT%
Percentage of dcache hits over total references.
.SH OVERHEAD
The overhead depends on the frequency of file and directory name lookups.
While the per-event overhead is low, some applications may make over 100k
lookups per second, and the low per-event overhead will begin to add up, and
could begin to be measurable (over 10% CPU usage). Measure in a test
environment.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
dcsnoop(8)
...@@ -2,7 +2,8 @@
.SH NAME
memleak \- Print a summary of outstanding allocations and their call stacks to detect memory leaks. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND] [-s SAMPLE_RATE]
[-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE] [-Z MAX_SIZE] [INTERVAL] [COUNT]
.SH DESCRIPTION
memleak traces and matches memory allocation and deallocation requests, and
collects call stacks for each allocation. memleak can then print a summary
...@@ -11,7 +12,11 @@ of which call stacks performed allocations that weren't subsequently freed.
When tracing a specific process, memleak instruments malloc and free from libc.
When tracing all processes, memleak instruments kmalloc and kfree.
memleak may introduce significant overhead when tracing processes that allocate
and free many blocks very quickly. See the OVERHEAD section below.
The stack depth is limited to 10 by default (+1 for the current instruction pointer),
but it can be controlled using the \-d switch if deeper stacks are required.
This currently only works on x86_64. Check for future versions.
.SH REQUIREMENTS
...@@ -27,10 +32,6 @@ Trace this process ID only (filtered in-kernel). This traces malloc and free fro
\-t
Print a trace of all allocation and free requests and results.
.TP
\-i INTERVAL
Print a summary of outstanding allocations and their call stacks every INTERVAL seconds.
The default interval is 5 seconds.
.TP
\-a
Print a list of allocations that weren't freed (and their sizes) in addition to their call stacks.
.TP
...@@ -43,25 +44,56 @@ Run the specified command and trace its allocations only. This traces malloc and
.TP
\-s SAMPLE_RATE
Record roughly every SAMPLE_RATE-th allocation to reduce overhead.
.TP
\-d STACK_DEPTH
Capture STACK_DEPTH frames (or less) when obtaining allocation call stacks.
The default value is 10.
.TP
\-T TOP
Print only the top TOP stacks (sorted by size).
The default value is 10.
.TP
\-z MIN_SIZE
Capture only allocations that are larger than or equal to MIN_SIZE bytes.
.TP
\-Z MAX_SIZE
Capture only allocations that are smaller than or equal to MAX_SIZE bytes.
.TP
INTERVAL
Print a summary of outstanding allocations and their call stacks every INTERVAL seconds.
The default interval is 5 seconds.
.TP
COUNT
Print the outstanding allocations summary COUNT times and then exit.
.SH EXAMPLES
.TP
Print outstanding kernel allocation stacks every 3 seconds:
#
.B memleak 3
.TP
Print user outstanding allocation stacks and allocation details for the process 1005:
#
.B memleak -p 1005 -a
.TP
Sample roughly every 5th allocation (~20%) of the call stacks and print the top 5
stacks 10 times before quitting.
#
.B memleak -s 5 --top=5 10
.TP
Run ./allocs and print outstanding allocation stacks for that process:
#
.B memleak -c "./allocs"
.TP
Capture only allocations between 16 and 32 bytes in size:
#
.B memleak -z 16 -Z 32
.SH OVERHEAD
memleak can have significant overhead if the target process or kernel performs
allocations at a very high rate. Pathological cases may exhibit up to 100x
degradation in running time. Most of the time, however, memleak shouldn't cause
a significant slowdown. You can use the \-s switch to reduce the overhead
further by capturing only every N-th allocation. The \-z and \-Z switches can
also reduce overhead by capturing only allocations of specific sizes.
To determine the rate at which your application is calling malloc/free, or the
rate at which your kernel is calling kmalloc/kfree, place a probe with perf and
...@@ -73,6 +105,10 @@ placed in a typical period of 10 seconds:
#
.B perf stat -a -e 'probe:__kmalloc' -- sleep 10
Another setting that may help reduce overhead is lowering the number of stack
frames captured and parsed by memleak for each allocation, using the \-d switch.
.SH SOURCE
This is from bcc.
.IP
...
.TH oomkill 8 "2016-02-09" "USER COMMANDS"
.SH NAME
oomkill \- Trace oom_kill_process(). Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B oomkill
.SH DESCRIPTION
This traces the kernel out-of-memory killer, and prints basic details,
including the system load averages at the time of the OOM kill. This can
provide more context on the system state at the time: was it getting busier
or steady, based on the load averages? This tool may also be useful to
customize for investigations; for example, by adding other task_struct
details at the time of OOM.
This program is also a basic example of eBPF/bcc.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Trace OOM kill events:
#
.B oomkill
.SH FIELDS
.TP
Triggered by ...
The process ID and process name of the task that was running when another task was OOM
killed.
.TP
OOM kill of ...
The process ID and name of the target process that was OOM killed.
.TP
loadavg
Contents of /proc/loadavg. The first three numbers are 1, 5, and 15 minute
load averages (where the average is an exponentially damped moving sum, and
those numbers are constants in the equation); then there is the number of
running tasks, a slash, and the total number of tasks; and then the last number
is the last PID to be created.
.SH OVERHEAD
Negligible.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
memleak(8)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dcsnoop Trace directory entry cache (dcache) lookups.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: dcsnoop [-h] [-a]
#
# By default, this traces every failed dcache lookup, and shows the process
# performing the lookup and the filename requested. A -a option can be used
# to show all lookups, not just failed ones.
#
# This uses kernel dynamic tracing of the d_lookup() function, and will need
# to be modified to match kernel changes.
#
# Also see dcstat(8), for per-second summaries.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import re
# arguments
examples = """examples:
./dcsnoop # trace failed dcache lookups
./dcsnoop -a # trace all dcache lookups
"""
parser = argparse.ArgumentParser(
description="Trace directory entry cache (dcache) lookups",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-a", "--all", action="store_true",
help="trace all lookups (default is fails only)")
args = parser.parse_args()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#define MAX_FILE_LEN 64
struct entry_t {
char name[MAX_FILE_LEN];
};
BPF_HASH(entrybypid, u32, struct entry_t);
/* from fs/namei.c: */
struct nameidata {
struct path path;
struct qstr last;
// [...]
};
int trace_fast(struct pt_regs *ctx, struct nameidata *nd, struct path *path)
{
bpf_trace_printk("R %s\\n", nd->last.name);
return 1;
}
int kprobe__d_lookup(struct pt_regs *ctx, const struct dentry *parent,
const struct qstr *name)
{
u32 pid = bpf_get_current_pid_tgid();
struct entry_t entry = {};
const char *fname = name->name;
if (fname) {
bpf_probe_read(&entry.name, sizeof(entry.name), (void *)fname);
}
entrybypid.update(&pid, &entry);
return 0;
}
int kretprobe__d_lookup(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
struct entry_t *ep;
ep = entrybypid.lookup(&pid);
if (ep == 0) {
return 0; // missed entry
}
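// d_lookup() returns a struct dentry pointer in rax; zero (NULL) means the
// name was not found in the dcache, so report this lookup as a miss ("M").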
if (ctx->ax == 0) {
bpf_trace_printk("M %s\\n", ep->name);
}
entrybypid.delete(&pid);
return 0;
}
"""
# initialize BPF
b = BPF(text=bpf_text)
if args.all:
    b.attach_kprobe(event="lookup_fast", fn_name="trace_fast")
# header
print("%-11s %-6s %-16s %1s %s" % ("TIME(s)", "PID", "COMM", "T", "FILE"))
start_ts = 0
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    try:
        (type, file) = msg.split(" ", 1)
    except ValueError:
        continue
    if start_ts == 0:
        start_ts = ts
    print("%-11.6f %-6s %-16s %1s %s" % (ts - start_ts, pid, task, type, file))
Demonstrations of dcsnoop, the Linux eBPF/bcc version.
dcsnoop traces directory entry cache (dcache) lookups, and can be used for
further investigation beyond dcstat(8). The output is likely verbose, as
dcache lookups are likely frequent. By default, only failed lookups are shown.
For example:
# ./dcsnoop.py
TIME(s) PID COMM T FILE
0.002837 1643 snmpd M net/dev
0.002852 1643 snmpd M 1643
0.002856 1643 snmpd M net
0.002863 1643 snmpd M dev
0.002952 1643 snmpd M net/if_inet6
0.002964 1643 snmpd M if_inet6
0.003180 1643 snmpd M net/ipv4/neigh/eth0/retrans_time_ms
0.003192 1643 snmpd M ipv4/neigh/eth0/retrans_time_ms
0.003197 1643 snmpd M neigh/eth0/retrans_time_ms
0.003203 1643 snmpd M eth0/retrans_time_ms
0.003206 1643 snmpd M retrans_time_ms
0.003245 1643 snmpd M ipv6/neigh/eth0/retrans_time_ms
0.003249 1643 snmpd M neigh/eth0/retrans_time_ms
0.003252 1643 snmpd M eth0/retrans_time_ms
0.003255 1643 snmpd M retrans_time_ms
0.003287 1643 snmpd M conf/eth0/forwarding
0.003292 1643 snmpd M eth0/forwarding
0.003295 1643 snmpd M forwarding
0.003326 1643 snmpd M base_reachable_time_ms
[...]
I ran a cache drop at the same time as executing this tool. The output shows
the processes, the type of event ("T" column: M == miss, R == reference),
and the filename for the dcache lookup.
The way the dcache is currently implemented, each component of a path is
checked in turn. The first line, showing "net/dev" from snmp, will be a lookup
for "net" in a directory (that isn't shown here). If it finds "net", it will
then lookup "dev" inside net. You can see this sequence a little later,
starting at time 0.003180, where a pathname is being searched
directory by directory.
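If you want to reproduce a burst of failed lookups for testing, a small
throwaway generator like the one below can be run in another terminal. This
helper is hypothetical and not part of bcc; each stat() of a path that does
not exist triggers failed lookups for its uncached components:
#!/usr/bin/env python
# gen_dcache_misses.py: hypothetical helper to exercise dcsnoop/dcstat.
import errno
import os
for i in range(100000):
    try:
        os.stat("/tmp/no_such_dir_%d/leaf" % i)   # nonexistent path: dcache miss
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise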
The -a option will show all lookups, although be warned, the output will be
very verbose. For example:
# ./dcsnoop.py -a
TIME(s) PID COMM T FILE
0.000000 20279 dcsnoop.py M p_lookup_fast
0.000010 20279 dcsnoop.py M enable
0.000013 20279 dcsnoop.py M id
0.000015 20279 dcsnoop.py M filter
0.000017 20279 dcsnoop.py M trigger
0.000019 20279 dcsnoop.py M format
0.006148 20279 dcsnoop.py R sys/kernel/debug/tracing/trace_pipe
0.006158 20279 dcsnoop.py R kernel/debug/tracing/trace_pipe
0.006161 20279 dcsnoop.py R debug/tracing/trace_pipe
0.006164 20279 dcsnoop.py R tracing/trace_pipe
0.006166 20279 dcsnoop.py R trace_pipe
0.015900 1643 snmpd R proc/sys/net/ipv6/conf/lo/forwarding
0.015901 1643 snmpd R sys/net/ipv6/conf/lo/forwarding
0.015901 1643 snmpd R net/ipv6/conf/lo/forwarding
0.015902 1643 snmpd R ipv6/conf/lo/forwarding
0.015903 1643 snmpd R conf/lo/forwarding
0.015904 1643 snmpd R lo/forwarding
0.015905 1643 snmpd M lo/forwarding
0.015908 1643 snmpd R forwarding
0.015909 1643 snmpd M forwarding
0.015937 1643 snmpd R proc/sys/net/ipv6/neigh/lo/base_reachable_time_ms
0.015937 1643 snmpd R sys/net/ipv6/neigh/lo/base_reachable_time_ms
0.015938 1643 snmpd R net/ipv6/neigh/lo/base_reachable_time_ms
0.015939 1643 snmpd R ipv6/neigh/lo/base_reachable_time_ms
0.015940 1643 snmpd R neigh/lo/base_reachable_time_ms
0.015941 1643 snmpd R lo/base_reachable_time_ms
0.015941 1643 snmpd R base_reachable_time_ms
0.015943 1643 snmpd M base_reachable_time_ms
0.043569 1876 supervise M 20281
0.043573 1886 supervise M 20280
0.043582 1886 supervise R supervise/status.new
[...]
USAGE message:
# ./dcsnoop.py -h
usage: dcsnoop.py [-h] [-a]
Trace directory entry cache (dcache) lookups
optional arguments:
-h, --help show this help message and exit
-a, --all trace all lookups (default is fails only)
examples:
./dcsnoop # trace failed dcache lookups
./dcsnoop -a # trace all dcache lookups
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dcstat Directory entry cache (dcache) stats.
# For Linux, uses BCC, eBPF.
#
# USAGE: dcstat [interval [count]]
#
# This uses kernel dynamic tracing of kernel functions, lookup_fast() and
# d_lookup(), which will need to be modified to match kernel changes. See
# code comments.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from ctypes import c_int
from time import sleep, strftime
from sys import argv
def usage():
    print("USAGE: %s [interval [count]]" % argv[0])
    exit()
# arguments
interval = 1
count = -1
if len(argv) > 1:
    try:
        interval = int(argv[1])
        if interval == 0:
            raise
        if len(argv) > 2:
            count = int(argv[2])
    except: # also catches -h, --help
        usage()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
enum stats {
S_REFS = 1,
S_SLOW,
S_MISS,
S_MAXSTAT
};
BPF_TABLE("array", int, u64, stats, S_MAXSTAT + 1);
/*
* How this is instrumented, and how to interpret the statistics, is very much
* tied to the current kernel implementation (this was written on Linux 4.4).
* This will need maintenance to keep working as the implementation changes. To
* aid future adventurers, this is what the current code does, and why.
*
* First problem: the current implementation takes a path and then does a
* lookup of each component. So how do we count a reference? Once for the path
* lookup, or once for every component lookup? I've chosen the latter
* since it seems to map more closely to actual dcache lookups (via
* __d_lookup_rcu()). It's counted via calls to lookup_fast().
*
* The implementation tries different, progressively slower, approaches to
* look up a file. At what point do we call it a dcache miss? I've chosen when
* a d_lookup() (which is called during lookup_slow()) returns zero.
*
* I've also included a "SLOW" statistic to show how often the fast lookup
* failed. Whether this exists or is interesting is an implementation detail,
* and the "SLOW" statistic may be removed in future versions.
*/
void count_fast(struct pt_regs *ctx) {
int key = S_REFS;
u64 *leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
}
void count_lookup(struct pt_regs *ctx) {
int key = S_SLOW;
u64 *leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
if (ctx->ax == 0) {
key = S_MISS;
leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
}
}
"""
# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="lookup_fast", fn_name="count_fast")
b.attach_kretprobe(event="d_lookup", fn_name="count_lookup")
# stat column labels and indexes
stats = {
    "REFS": 1,
    "SLOW": 2,
    "MISS": 3
}
# header
print("%-8s " % "TIME", end="")
for stype, idx in sorted(stats.iteritems(), key=lambda (k, v): (v, k)):
    print(" %8s" % (stype + "/s"), end="")
print(" %8s" % "HIT%")
# output
i = 0
while (1):
    if count > 0:
        i += 1
        if i > count:
            exit()
    try:
        sleep(interval)
    except KeyboardInterrupt:
        pass
        exit()
    print("%-8s: " % strftime("%H:%M:%S"), end="")
    # print each statistic as a column
    for stype, idx in sorted(stats.iteritems(), key=lambda (k, v): (v, k)):
        try:
            val = b["stats"][c_int(idx)].value / interval
            print(" %8d" % val, end="")
        except:
            print(" %8d" % 0, end="")
    # print hit ratio percentage
    try:
        ref = b["stats"][c_int(stats["REFS"])].value
        miss = b["stats"][c_int(stats["MISS"])].value
        hit = ref - miss
        pct = float(100) * hit / ref
        print(" %8.2f" % pct)
    except:
        print(" %7s%%" % "-")
    b["stats"].clear()
Demonstrations of dcstat, the Linux eBPF/bcc version.
dcstat shows directory entry cache (dcache) statistics. For example:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:11:47: 2059 141 97 95.29
08:11:48: 79974 151 106 99.87
08:11:49: 192874 146 102 99.95
08:11:50: 2051 144 100 95.12
08:11:51: 73373 17239 17194 76.57
08:11:52: 54685 25431 25387 53.58
08:11:53: 18127 8182 8137 55.12
08:11:54: 22517 10345 10301 54.25
08:11:55: 7524 2881 2836 62.31
08:11:56: 2067 141 97 95.31
08:11:57: 2115 145 101 95.22
The output shows the total references per second ("REFS/s"), the number that
took a slower code path to be processed ("SLOW/s"), the number of dcache misses
("MISS/s"), and the hit ratio as a percentage. By default, an interval of 1
second is used.
At 08:11:49, there were 192 thousand references, which almost entirely hit
from the dcache, with a hit ratio of 99.95%. A little later, starting at
08:11:51, a workload began that walked many uncached files, reducing the hit
ratio to 53%, and, more importantly, driving a miss rate of over 10 thousand
per second.
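The HIT% column is derived from the other counters: hits are references minus
misses. For example, the 08:11:51 line works out as follows:
# HIT% as dcstat computes it: 100 * (REFS - MISS) / REFS
refs, miss = 73373, 17194   # REFS/s and MISS/s from the 08:11:51 line above
print("%.2f" % (100.0 * (refs - miss) / refs))   # -> 76.57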
Here's an interesting workload:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:15:53: 250683 141 97 99.96
08:15:54: 266115 145 101 99.96
08:15:55: 268428 141 97 99.96
08:15:56: 260389 143 99 99.96
It's a 99.96% hit ratio, and these are all negative hits: accessing a file that
does not exist. Here's the C program that generated the workload:
# cat -n badopen.c
1 #include <sys/types.h>
2 #include <sys/stat.h>
3 #include <fcntl.h>
4
5 int
6 main(int argc, char *argv[])
7 {
8 int fd;
9 while (1) {
10 fd = open("bad", O_RDONLY);
11 }
12 return 0;
13 }
This is a simple workload generator that tries to open a missing file ("bad")
as quickly as possible.
Let's see what happens if the workload attempts to open a different filename
each time (which is also a missing file), using the following C code:
# cat -n badopen2.c
1 #include <sys/types.h>
2 #include <sys/stat.h>
3 #include <fcntl.h>
4 #include <stdio.h>
5
6 int
7 main(int argc, char *argv[])
8 {
9 int fd, i = 0;
10 char buf[128] = {};
11
12 while (1) {
13 sprintf(buf, "bad%d", i++);
14 fd = open(buf, O_RDONLY);
15 }
16 return 0;
17 }
Here's dcstat:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:18:52: 241131 237544 237505 1.51
08:18:53: 238210 236323 236278 0.82
08:18:54: 235259 233307 233261 0.85
08:18:55: 233144 231256 231214 0.83
08:18:56: 231981 230097 230053 0.83
dcstat also supports an optional interval and optional count. For example,
printing 5 second summaries 3 times:
# ./dcstat 5 3
TIME REFS/s SLOW/s MISS/s HIT%
08:20:03: 2085 143 99 95.23
08:20:08: 2077 143 98 95.24
08:20:14: 2071 144 100 95.15
USAGE message:
# ./dcstat -h
USAGE: ./dcstat [interval [count]]
...@@ -97,7 +97,7 @@ start_ts = 0
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    (delta, filename) = msg.split(" ", 1)
    # print columns
    print("%-8s %-6d %-16s %-7.2f %s" % (strftime("%H:%M:%S"), pid, task,
...
#include <uapi/linux/ptrace.h>
#define MAX_STACK_SIZE 10
struct alloc_info_t {
u64 size;
u64 timestamp_ns;
int num_frames;
u64 callstack[MAX_STACK_SIZE];
};
BPF_HASH(sizes, u64);
BPF_HASH(allocs, u64, struct alloc_info_t);
// Adapted from https://github.com/iovisor/bcc/tools/offcputime.py
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
return ret;
}
return 0;
}
static int grab_stack(struct pt_regs *ctx, struct alloc_info_t *info)
{
int depth = 0;
u64 bp = ctx->bp;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
if (!(info->callstack[depth++] = get_frame(&bp))) return depth;
return depth;
}
int alloc_enter(struct pt_regs *ctx, size_t size)
{
// Ideally, this should use a random number source, such as
// BPF_FUNC_get_prandom_u32, but that's currently not supported
// by the bcc front-end.
if (SAMPLE_EVERY_N > 1) {
u64 ts = bpf_ktime_get_ns();
if (ts % SAMPLE_EVERY_N != 0)
return 0;
}
u64 pid = bpf_get_current_pid_tgid();
u64 size64 = size;
sizes.update(&pid, &size64);
if (SHOULD_PRINT)
bpf_trace_printk("alloc entered, size = %u\n", size);
return 0;
}
int alloc_exit(struct pt_regs *ctx)
{
u64 address = ctx->ax;
u64 pid = bpf_get_current_pid_tgid();
u64* size64 = sizes.lookup(&pid);
struct alloc_info_t info = {0};
if (size64 == 0)
return 0; // missed alloc entry
info.size = *size64;
sizes.delete(&pid);
info.timestamp_ns = bpf_ktime_get_ns();
info.num_frames = grab_stack(ctx, &info) - 2;
allocs.update(&address, &info);
if (SHOULD_PRINT)
bpf_trace_printk("alloc exited, size = %lu, result = %lx, frames = %d\n", info.size, address, info.num_frames);
return 0;
}
int free_enter(struct pt_regs *ctx, void *address)
{
u64 addr = (u64)address;
struct alloc_info_t *info = allocs.lookup(&addr);
if (info == 0)
return 0;
allocs.delete(&addr);
if (SHOULD_PRINT)
bpf_trace_printk("free entered, address = %lx, size = %lu\n", address, info->size);
return 0;
}
#!/usr/bin/env python
#
# memleak.py Trace and display outstanding allocations to detect
# memory leaks in user-mode processes and the kernel.
#
# USAGE: memleak.py [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
# [-s SAMPLE_RATE] [-d STACK_DEPTH] [-T TOP] [-z MIN_SIZE]
# [-Z MAX_SIZE]
# [interval] [count]
#
# Licensed under the Apache License, Version 2.0 (the "License")
# Copyright (C) 2016 Sasha Goldshtein.
from bcc import BPF
from time import sleep
from datetime import datetime
import argparse
import subprocess
import ctypes
...@@ -150,7 +162,7 @@ EXAMPLES:
allocations every 5 seconds
./memleak.py -p $(pidof allocs) -t
Trace allocations and display each individual call to malloc/free
./memleak.py -ap $(pidof allocs) 10
Trace allocations and display allocated addresses, sizes, and stacks
every 10 seconds for outstanding allocations
./memleak.py -c "./allocs"
...@@ -174,38 +186,161 @@ allocations made with kmalloc/kfree.
parser = argparse.ArgumentParser(description=description,
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-p", "--pid", type=int, default=-1,
    help="the PID to trace; if not specified, trace kernel allocs")
parser.add_argument("-t", "--trace", action="store_true",
    help="print trace messages for each alloc/free call")
parser.add_argument("interval", nargs="?", default=5, type=int,
    help="interval in seconds to print outstanding allocations")
parser.add_argument("count", nargs="?", type=int,
    help="number of times to print the report before exiting")
parser.add_argument("-a", "--show-allocs", default=False, action="store_true",
    help="show allocation addresses and sizes as well as call stacks")
parser.add_argument("-o", "--older", default=500, type=int,
    help="prune allocations younger than this age in milliseconds")
parser.add_argument("-c", "--command",
    help="execute and trace the specified command")
parser.add_argument("-s", "--sample-rate", default=1, type=int,
    help="sample every N-th allocation to decrease the overhead")
parser.add_argument("-d", "--stack-depth", default=10, type=int,
    help="maximum stack depth to capture")
parser.add_argument("-T", "--top", type=int, default=10,
    help="display only this many top allocating stacks (by size)")
parser.add_argument("-z", "--min-size", type=int,
    help="capture only allocations larger than this size")
parser.add_argument("-Z", "--max-size", type=int,
    help="capture only allocations smaller than this size")
args = parser.parse_args()
pid = args.pid
command = args.command
kernel_trace = (pid == -1 and command is None)
trace_all = args.trace
interval = args.interval
min_age_ns = 1e6 * args.older
sample_every_n = args.sample_rate
num_prints = args.count
max_stack_size = args.stack_depth + 2
top_stacks = args.top
min_size = args.min_size
max_size = args.max_size
if min_size is not None and max_size is not None and min_size > max_size:
    print("min_size (-z) can't be greater than max_size (-Z)")
    exit(1)
if command is not None:
    print("Executing '%s' and tracing the resulting process." % command)
    pid = run_command_get_pid(command)
bpf_source = """
#include <uapi/linux/ptrace.h>
struct alloc_info_t {
u64 size;
u64 timestamp_ns;
int num_frames;
u64 callstack[MAX_STACK_SIZE];
};
BPF_HASH(sizes, u64);
BPF_HASH(allocs, u64, struct alloc_info_t);
// Adapted from https://github.com/iovisor/bcc/tools/offcputime.py
static u64 get_frame(u64 *bp) {
if (*bp) {
// The following stack walker is x86_64 specific
u64 ret = 0;
if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
return 0;
if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
*bp = 0;
return ret;
}
return 0;
}
static int grab_stack(struct pt_regs *ctx, struct alloc_info_t *info)
{
int depth = 0;
u64 bp = ctx->bp;
GRAB_ONE_FRAME
return depth;
}
int alloc_enter(struct pt_regs *ctx, size_t size)
{
SIZE_FILTER
if (SAMPLE_EVERY_N > 1) {
u64 ts = bpf_ktime_get_ns();
if (ts % SAMPLE_EVERY_N != 0)
return 0;
}
u64 pid = bpf_get_current_pid_tgid();
u64 size64 = size;
sizes.update(&pid, &size64);
if (SHOULD_PRINT)
bpf_trace_printk("alloc entered, size = %u\\n", size);
return 0;
}
int alloc_exit(struct pt_regs *ctx)
{
u64 address = ctx->ax;
u64 pid = bpf_get_current_pid_tgid();
u64* size64 = sizes.lookup(&pid);
struct alloc_info_t info = {0};
if (size64 == 0)
return 0; // missed alloc entry
info.size = *size64;
sizes.delete(&pid);
info.timestamp_ns = bpf_ktime_get_ns();
info.num_frames = grab_stack(ctx, &info) - 2;
allocs.update(&address, &info);
if (SHOULD_PRINT) {
bpf_trace_printk("alloc exited, size = %lu, result = %lx, frames = %d\\n",
info.size, address, info.num_frames);
}
return 0;
}
int free_enter(struct pt_regs *ctx, void *address)
{
u64 addr = (u64)address;
struct alloc_info_t *info = allocs.lookup(&addr);
if (info == 0)
return 0;
allocs.delete(&addr);
if (SHOULD_PRINT) {
bpf_trace_printk("free entered, address = %lx, size = %lu\\n",
address, info->size);
}
return 0;
}
"""
bpf_source = bpf_source.replace("SHOULD_PRINT", "1" if trace_all else "0") bpf_source = bpf_source.replace("SHOULD_PRINT", "1" if trace_all else "0")
bpf_source = bpf_source.replace("SAMPLE_EVERY_N", str(sample_every_n)) bpf_source = bpf_source.replace("SAMPLE_EVERY_N", str(sample_every_n))
bpf_source = bpf_source.replace("GRAB_ONE_FRAME", max_stack_size *
"\tif (!(info->callstack[depth++] = get_frame(&bp))) return depth;\n")
bpf_source = bpf_source.replace("MAX_STACK_SIZE", str(max_stack_size))
size_filter = ""
if min_size is not None and max_size is not None:
size_filter = "if (size < %d || size > %d) return 0;" % \
(min_size, max_size)
elif min_size is not None:
size_filter = "if (size < %d) return 0;" % min_size
elif max_size is not None:
size_filter = "if (size > %d) return 0;" % max_size
bpf_source = bpf_source.replace("SIZE_FILTER", size_filter)
bpf_program = BPF(text=bpf_source)
...@@ -227,7 +362,8 @@ decoder = StackDecoder(pid, bpf_program)
def print_outstanding():
    stacks = {}
    print("[%s] Top %d stacks with outstanding allocations:" %
        (datetime.now().strftime("%H:%M:%S"), top_stacks))
    allocs = bpf_program.get_table("allocs")
    for address, info in sorted(allocs.items(), key=lambda a: a[1].size):
        if Time.monotonic_time() - min_age_ns < info.timestamp_ns:
...@@ -241,11 +377,12 @@ def print_outstanding():
        if args.show_allocs:
            print("\taddr = %x size = %s" %
                (address.value, info.size))
    to_show = sorted(stacks.items(), key=lambda s: s[1][1])[-top_stacks:]
    for stack, (count, size) in to_show:
        print("\t%d bytes in %d allocations from stack\n\t\t%s" %
            (size, count, stack.replace(";", "\n\t\t")))
count_so_far = 0
while True:
    if trace_all:
        print bpf_program.trace_fields()
...@@ -256,3 +393,6 @@ while True:
        exit()
    decoder.refresh_code_ranges()
    print_outstanding()
    count_so_far += 1
    if num_prints is not None and count_so_far >= num_prints:
        exit()
...@@ -8,12 +8,12 @@ For example:
# ./memleak.py -p $(pidof allocs)
Attaching to malloc and free in pid 5193, Ctrl+C to quit.
[11:16:33] Top 2 stacks with outstanding allocations:
80 bytes in 5 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
[11:16:34] Top 2 stacks with outstanding allocations:
160 bytes in 10 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
...@@ -34,7 +34,7 @@ prevalent. Use the -a switch:
# ./memleak.py -p $(pidof allocs) -a
Attaching to malloc and free in pid 5193, Ctrl+C to quit.
[11:16:33] Top 2 stacks with outstanding allocations:
addr = 948cd0 size = 16
addr = 948d10 size = 16
addr = 948d30 size = 16
...@@ -43,7 +43,7 @@ Attaching to malloc and free in pid 5193, Ctrl+C to quit.
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fd460ac2790)
[11:16:34] Top 2 stacks with outstanding allocations:
addr = 948d50 size = 16
addr = 948cd0 size = 16
addr = 948d10 size = 16
...@@ -110,26 +110,62 @@ To avoid false positives, allocations younger than a certain age (500ms by
default) are not printed. To change this threshold, use the -o switch.
By default, memleak prints its output every 5 seconds. To change this
interval, pass the interval as a positional parameter to memleak. You can
also control the number of times the output will be printed before exiting.
For example:
# ./memleak.py 1 10
... will print the outstanding allocation statistics every second, for ten
times, and then exit.
memleak may introduce considerable overhead if your application or kernel is
allocating and freeing memory at a very high rate. In that case, you can
control the overhead by sampling every N-th allocation. For example, to sample
roughly 10% of the allocations and print the outstanding allocations every 5
seconds, 3 times before quitting:
# ./memleak.py -p $(pidof allocs) -s 10 5 3
Attaching to malloc and free in pid 2614, Ctrl+C to quit.
[11:16:33] Top 2 stacks with outstanding allocations:
16 bytes in 1 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fdc11ce8790)
[11:16:38] Top 2 stacks with outstanding allocations:
16 bytes in 1 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fdc11ce8790)
[11:16:43] Top 2 stacks with outstanding allocations:
32 bytes in 2 allocations from stack
main+0x6d [/home/vagrant/allocs] (400862)
__libc_start_main+0xf0 [/usr/lib64/libc-2.21.so] (7fdc11ce8790)
Note that even though the application leaks 16 bytes of memory every second,
the report (printed every 5 seconds) doesn't "see" all the allocations because
of the sampling rate applied.
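The sampling decision in the BPF program is a simple modulo test on the
allocation timestamp (see alloc_enter above), so -s N records roughly one in
N allocations. A rough sketch of the same idea in plain Python, using random
numbers in place of the timestamp:
import random
SAMPLE_EVERY_N = 10   # corresponds to -s 10
kept = sum(1 for _ in range(100000)
           if random.getrandbits(64) % SAMPLE_EVERY_N == 0)
print(kept)   # ~10000, i.e. roughly 10% of the simulated allocations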
USAGE message:
# ./memleak.py -h
usage: memleak.py [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
[-s SAMPLE_RATE] [-d STACK_DEPTH] [-T TOP]
[interval] [count]
Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with malloc/free and kernel-mode
allocations made with kmalloc/kfree.
interval interval in seconds to print outstanding allocations
count number of times to print the report before exiting
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID the PID to trace; if not specified, trace kernel
allocs
-t, --trace print trace messages for each alloc/free call
-i INTERVAL, --interval INTERVAL
interval in seconds to print outstanding allocations
-a, --show-allocs show allocation addresses and sizes as well as call
stacks
-o OLDER, --older OLDER
...@@ -139,6 +175,13 @@ optional arguments:
execute and trace the specified command
-s SAMPLE_RATE, --sample-rate SAMPLE_RATE
sample every N-th allocation to decrease the overhead
-d STACK_DEPTH, --stack-depth STACK_DEPTH
maximum stack depth to capture
-T TOP, --top TOP display only this many top allocating stacks (by size)
-z MIN_SIZE, --min-size MIN_SIZE
capture only allocations larger than this size
-Z MAX_SIZE, --max-size MAX_SIZE
capture only allocations smaller than this size
EXAMPLES:
...@@ -147,7 +190,16 @@ EXAMPLES:
allocations every 5 seconds
./memleak.py -p $(pidof allocs) -t
Trace allocations and display each individual call to malloc/free
./memleak.py -ap $(pidof allocs) 10
Trace allocations and display allocated addresses, sizes, and stacks
every 10 seconds for outstanding allocations
./memleak.py -c "./allocs"
Run the specified command and trace its allocations
./memleak.py
Trace allocations in kernel mode and display a summary of outstanding
allocations every 5 seconds
./memleak.py -o 60000
Trace allocations in kernel mode and display a summary of outstanding
allocations that are at least one minute (60 seconds) old
./memleak.py -s 5
Trace roughly every 5th allocation, to reduce overhead
#!/usr/bin/env python
#
# oomkill Trace oom_kill_process(). For Linux, uses BCC, eBPF.
#
# This traces the kernel out-of-memory killer, and prints basic details,
# including the system load averages. This can provide more context on the
# system state at the time of OOM: was it getting busier or steady, based
# on the load averages? This tool may also be useful to customize for
# investigations; for example, by adding other task_struct details at the time
# of OOM.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from bcc import BPF
from time import strftime
import ctypes as ct
# linux stats
loadavg = "/proc/loadavg"
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/oom.h>
struct data_t {
u64 fpid;
u64 tpid;
u64 pages;
char fcomm[TASK_COMM_LEN];
char tcomm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);
void kprobe__oom_kill_process(struct pt_regs *ctx, struct oom_control *oc,
struct task_struct *p, unsigned int points, unsigned long totalpages)
{
struct data_t data = {};
u32 pid = bpf_get_current_pid_tgid();
data.fpid = pid;
data.tpid = p->pid;
data.pages = totalpages;
bpf_get_current_comm(&data.fcomm, sizeof(data.fcomm));
bpf_probe_read(&data.tcomm, sizeof(data.tcomm), p->comm);
events.perf_submit(ctx, &data, sizeof(data));
}
"""
# kernel->user event data: struct data_t
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
    _fields_ = [
        ("fpid", ct.c_ulonglong),
        ("tpid", ct.c_ulonglong),
        ("pages", ct.c_ulonglong),
        ("fcomm", ct.c_char * TASK_COMM_LEN),
        ("tcomm", ct.c_char * TASK_COMM_LEN)
    ]
# process event
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    with open(loadavg) as stats:
        avgline = stats.read().rstrip()
    print(("%s Triggered by PID %d (\"%s\"), OOM kill of PID %d (\"%s\")"
        ", %d pages, loadavg: %s") % (strftime("%H:%M:%S"), event.fpid,
        event.fcomm, event.tpid, event.tcomm, event.pages, avgline))
# initialize BPF
b = BPF(text=bpf_text)
print("Tracing OOM kills... Ctrl-C to stop.")
b["events"].open_perf_buffer(print_event)
while 1:
    b.kprobe_poll()
Demonstrations of oomkill, the Linux eBPF/bcc version.
oomkill is a simple program that traces the Linux out-of-memory (OOM) killer,
and shows basic details on one line per OOM kill:
# ./oomkill
Tracing oom_kill_process()... Ctrl-C to end.
21:03:39 Triggered by PID 3297 ("ntpd"), OOM kill of PID 22516 ("perl"), 3850642 pages, loadavg: 0.99 0.39 0.30 3/282 22724
21:03:48 Triggered by PID 22517 ("perl"), OOM kill of PID 22517 ("perl"), 3850642 pages, loadavg: 0.99 0.41 0.30 2/282 22932
The first line shows that PID 22516, with process name "perl", was OOM killed
when it reached 3850642 pages (usually 4 Kbytes per page). This OOM kill
happened to be triggered by PID 3297, process name "ntpd", doing some memory
allocation.
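Assuming the usual 4 Kbyte page size, that page count corresponds to roughly
14.7 Gbytes:
pages = 3850642
print(pages * 4096 / 2.0 ** 30)   # ~14.69, assuming 4 Kbyte pages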
The system log (dmesg) shows pages of details and system context about an OOM
kill. What it currently lacks, however, is context on how the system had been
changing over time. I've seen OOM kills where I wanted to know if the system
was at steady state at the time, or if there had been a recent increase in
workload that triggered the OOM event. oomkill provides some context: at the
end of the line is the load average information from /proc/loadavg. For both
of the oomkills here, we can see that the system was getting busier at the
time (a higher 1 minute "average" of 0.99, compared to the 15 minute "average"
of 0.30).
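oomkill prints the /proc/loadavg line verbatim. If you want the individual
fields (as described in the FIELDS section of the man page), a small sketch:
# split /proc/loadavg into its five fields
with open("/proc/loadavg") as f:
    one, five, fifteen, tasks, last_pid = f.read().split()
# one, five, fifteen: the 1-, 5-, and 15-minute load averages
# tasks: running tasks "/" total tasks (e.g. "3/282")
# last_pid: the last PID to be created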
oomkill can also be the basis of other tools and customizations. For example,
you can edit it to include other task_struct details from the target PID at
the time of the OOM kill.
The following commands can be used to test this program, and invoke a memory
consuming process that exhausts system memory and is OOM killed:
sysctl -w vm.overcommit_memory=1 # always overcommit
perl -e 'while (1) { $a .= "A" x 1024; }' # eat all memory
WARNING: This exhausts system memory after disabling some overcommit checks.
Only test in a lab environment.
...@@ -63,7 +63,7 @@ int kretprobe__sys_open(struct pt_regs *ctx)
return 0;
}
bpf_trace_printk("%d %s\\n", ret, *filenamep);
args_filename.delete(&pid);
return 0;
...@@ -90,7 +90,7 @@ start_ts = 0
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    (ret_s, filename) = msg.split(" ", 1)
    ret = int(ret_s)
    if (args.failed and (ret >= 0)):
...
...@@ -63,7 +63,7 @@ int trace_return(struct pt_regs *ctx)
return 0;
}
bpf_trace_printk("%d %s\\n", ret, *filenamep);
args_filename.delete(&pid);
return 0;
...@@ -96,7 +96,7 @@ start_ts = 0
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    (ret_s, filename) = msg.split(" ", 1)
    ret = int(ret_s)
    if (args.failed and (ret >= 0)):
...