Commit dd221c4e authored by Brenden Blanco's avatar Brenden Blanco

Merge pull request #360 from brendangregg/master

3 tools: oomkill, dcstat, dcsnoop
parents f1788962 ee31f61c
...@@ -72,6 +72,8 @@ Tools: ...@@ -72,6 +72,8 @@ Tools:
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt). - tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt). - tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt). - tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[filelife](tools/filelife.py): Trace the lifespan of short-lived files. [Examples](tools/filelife_example.txt). - tools/[filelife](tools/filelife.py): Trace the lifespan of short-lived files. [Examples](tools/filelife_example.txt).
- tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt). - tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
- tools/[filetop](tools/filetop.py): File reads and writes by filename and process. Top for files. [Examples](tools/filetop_example.txt). - tools/[filetop](tools/filetop.py): File reads and writes by filename and process. Top for files. [Examples](tools/filetop_example.txt).
...@@ -83,6 +85,7 @@ Tools: ...@@ -83,6 +85,7 @@ Tools:
- tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_examples.txt). - tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_examples.txt).
- tools/[offcputime](tools/offcputime.py): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt). - tools/[offcputime](tools/offcputime.py): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt). - tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
- tools/[oomkill](tools/oomkill.py): Trace the out-of-memory (OOM) killer. [Examples](tools/oomkill_example.txt).
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt). - tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt). - tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt). - tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
......
.TH dcsnoop 8 "2016-02-10" "USER COMMANDS"
.SH NAME
dcsnoop \- Trace directory entry cache (dcache) lookups. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B dcsnoop [\-h] [\-a]
.SH DESCRIPTION
By default, this traces every failed dcache lookup (cache miss), and shows the
process performing the lookup and the filename requested. A \-a option can be
used to show all lookups, not just failed ones.
The output of this tool can be verbose, and is intended for further
investigations of dcache performance beyond dcstat(8), which prints
per-second summaries.
This uses kernel dynamic tracing of the d_lookup() function, and will need
and will need updating to match any changes to this function.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-a
Trace references, not just failed lookups.
.SH EXAMPLES
.TP
Trace failed dcache lookups:
#
.B dcsnoop
.TP
Trace all dcache lookups:
#
.B dcsnoop \-a
.SH FIELDS
.TP
TIME(s)
Time of lookup, in seconds.
.TP
PID
Process ID.
.TP
COMM
Process name.
.TP
T
Type: R == reference (only visible with \-a), M == miss. A miss will print two
lines, one for the reference, and one for the miss.
.TP
FILE
The file name component that was being looked up. This contains trailing
pathname components (after '/'), which will be the subject of subsequent
lookups.
.SH OVERHEAD
File name lookups can be frequent (depending on the workload), and this tool
prints a line for each failed lookup, and with \-a, each reference as well. The
output may be verbose, and the incurred overhead, while optimized to some
extent, may still be from noticeable to significant. This is only really
intended for deeper investigations beyond dcstat(8), when absolutely necessary.
Measure and quantify the overhead in a test environment before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
dcstat(1)
.TH dcstat 8 "2016-02-09" "USER COMMANDS"
.SH NAME
dcstat \- Directory entry cache (dcache) stats. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B dcstat
[interval [count]]
.SH DESCRIPTION
The Linux directory entry cache (dcache) improves the performance of file and
directory name lookups. This tool provides per-second summary statistics of
dcache performance.
This uses kernel dynamic tracing of kernel functions, lookup_fast() and
d_lookup(), which will need to be modified to match kernel changes.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Print summaries each second:
#
.B dcstat
.TP
Print output every five seconds, three times:
#
.B dcstat 5 3
.SH FIELDS
.TP
REFS/s
Number dcache lookups (references) per second.
.TP
SLOW/s
Number of dcache lookups that failed the lookup_fast() path and executed the
lookup_slow() path instead.
.TP
MISS/s
Number of dcache misses (failed both fast and slow lookups).
.TP
HIT%
Percentage of dcache hits over total references.
.SH OVERHEAD
The overhead depends on the frequency of file and directory name lookups.
While the per-event overhead is low, some applications may make over 100k
lookups per second, and the low per-event overhead will begin to add up, and
could begin to be measurable (over 10% CPU usage). Measure in a test
environment.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
dcsnoop(8)
.TH oomkill 8 "2016-02-09" "USER COMMANDS"
.SH NAME
oomkill \- Trace oom_kill_process(). Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B oomkill
.SH DESCRIPTION
This traces the kernel out-of-memory killer, and prints basic details,
including the system load averages at the time of the OOM kill. This can
provide more context on the system state at the time: was it getting busier
or steady, based on the load averages? This tool may also be useful to
customize for investigations; for example, by adding other task_struct
details at the time of OOM.
This program is also a basic example of eBPF/bcc.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Trace OOM kill events:
#
.B oomkill
.SH FIELDS
.TP
Triggered by ...
The process ID and process name of the task that was running when another task was OOM
killed.
.TP
OOM kill of ...
The process ID and name of the target process that was OOM killed.
.TP
loadavg
Contents of /proc/loadavg. The first three numbers are 1, 5, and 15 minute
load averages (where the average is an exponentially damped moving sum, and
those numbers are constants in the equation); then there is the number of
running tasks, a slash, and the total number of tasks; and then the last number
is the last PID to be created.
.SH OVERHEAD
Negligible.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
memleak(8)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dcsnoop Trace directory entry cache (dcache) lookups.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: dcsnoop [-h] [-a]
#
# By default, this traces every failed dcache lookup, and shows the process
# performing the lookup and the filename requested. A -a option can be used
# to show all lookups, not just failed ones.
#
# This uses kernel dynamic tracing of the d_lookup() function, and will need
# to be modified to match kernel changes.
#
# Also see dcstat(8), for per-second summaries.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import re
# arguments
examples = """examples:
./dcsnoop # trace failed dcache lookups
./dcsnoop -a # trace all dcache lookups
"""
parser = argparse.ArgumentParser(
description="Trace directory entry cache (dcache) lookups",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-a", "--all", action="store_true",
help="trace all lookups (default is fails only)")
args = parser.parse_args()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#define MAX_FILE_LEN 64
struct entry_t {
char name[MAX_FILE_LEN];
};
BPF_HASH(entrybypid, u32, struct entry_t);
/* from fs/namei.c: */
struct nameidata {
struct path path;
struct qstr last;
// [...]
};
int trace_fast(struct pt_regs *ctx, struct nameidata *nd, struct path *path)
{
bpf_trace_printk("R %s\\n", nd->last.name);
return 1;
}
int kprobe__d_lookup(struct pt_regs *ctx, const struct dentry *parent,
const struct qstr *name)
{
u32 pid = bpf_get_current_pid_tgid();
struct entry_t entry = {};
const char *fname = name->name;
if (fname) {
bpf_probe_read(&entry.name, sizeof(entry.name), (void *)fname);
}
entrybypid.update(&pid, &entry);
return 0;
}
int kretprobe__d_lookup(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
struct entry_t *ep;
ep = entrybypid.lookup(&pid);
if (ep == 0) {
return 0; // missed entry
}
if (ctx->ax == 0) {
bpf_trace_printk("M %s\\n", ep->name);
}
entrybypid.delete(&pid);
return 0;
}
"""
# initialize BPF
b = BPF(text=bpf_text)
if args.all:
b.attach_kprobe(event="lookup_fast", fn_name="trace_fast")
# header
print("%-11s %-6s %-16s %1s %s" % ("TIME(s)", "PID", "COMM", "T", "FILE"))
start_ts = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
try:
(type, file) = msg.split(" ", 1)
except ValueError:
continue
if start_ts == 0:
start_ts = ts
print("%-11.6f %-6s %-16s %1s %s" % (ts - start_ts, pid, task, type, file))
Demonstrations of dcsnoop, the Linux eBPF/bcc version.
dcsnoop traces directory entry cache (dcache) lookups, and can be used for
further investigation beyond dcstat(8). The output is likely verbose, as
dcache lookups are likely frequent. By default, only failed lookups are shown.
For example:
# ./dcsnoop.py
TIME(s) PID COMM T FILE
0.002837 1643 snmpd M net/dev
0.002852 1643 snmpd M 1643
0.002856 1643 snmpd M net
0.002863 1643 snmpd M dev
0.002952 1643 snmpd M net/if_inet6
0.002964 1643 snmpd M if_inet6
0.003180 1643 snmpd M net/ipv4/neigh/eth0/retrans_time_ms
0.003192 1643 snmpd M ipv4/neigh/eth0/retrans_time_ms
0.003197 1643 snmpd M neigh/eth0/retrans_time_ms
0.003203 1643 snmpd M eth0/retrans_time_ms
0.003206 1643 snmpd M retrans_time_ms
0.003245 1643 snmpd M ipv6/neigh/eth0/retrans_time_ms
0.003249 1643 snmpd M neigh/eth0/retrans_time_ms
0.003252 1643 snmpd M eth0/retrans_time_ms
0.003255 1643 snmpd M retrans_time_ms
0.003287 1643 snmpd M conf/eth0/forwarding
0.003292 1643 snmpd M eth0/forwarding
0.003295 1643 snmpd M forwarding
0.003326 1643 snmpd M base_reachable_time_ms
[...]
I ran a drop caches at the same time as executing this tool. The output shows
the processes, the type of event ("T" column: M == miss, R == reference),
and the filename for the dcache lookup.
The way the dcache is currently implemented, each component of a path is
checked in turn. The first line, showing "net/dev" from snmp, will be a lookup
for "net" in a directory (that isn't shown here). If it finds "net", it will
then lookup "dev" inside net. You can see this sequence a little later,
starting at time 0.003180, where a pathname is being searched
directory by directory.
The -a option will show all lookups, although be warned, the output will be
very verbose. For example:
# ./dcsnoop
TIME(s) PID COMM T FILE
0.000000 20279 dcsnoop.py M p_lookup_fast
0.000010 20279 dcsnoop.py M enable
0.000013 20279 dcsnoop.py M id
0.000015 20279 dcsnoop.py M filter
0.000017 20279 dcsnoop.py M trigger
0.000019 20279 dcsnoop.py M format
0.006148 20279 dcsnoop.py R sys/kernel/debug/tracing/trace_pipe
0.006158 20279 dcsnoop.py R kernel/debug/tracing/trace_pipe
0.006161 20279 dcsnoop.py R debug/tracing/trace_pipe
0.006164 20279 dcsnoop.py R tracing/trace_pipe
0.006166 20279 dcsnoop.py R trace_pipe
0.015900 1643 snmpd R proc/sys/net/ipv6/conf/lo/forwarding
0.015901 1643 snmpd R sys/net/ipv6/conf/lo/forwarding
0.015901 1643 snmpd R net/ipv6/conf/lo/forwarding
0.015902 1643 snmpd R ipv6/conf/lo/forwarding
0.015903 1643 snmpd R conf/lo/forwarding
0.015904 1643 snmpd R lo/forwarding
0.015905 1643 snmpd M lo/forwarding
0.015908 1643 snmpd R forwarding
0.015909 1643 snmpd M forwarding
0.015937 1643 snmpd R proc/sys/net/ipv6/neigh/lo/base_reachable_time_ms
0.015937 1643 snmpd R sys/net/ipv6/neigh/lo/base_reachable_time_ms
0.015938 1643 snmpd R net/ipv6/neigh/lo/base_reachable_time_ms
0.015939 1643 snmpd R ipv6/neigh/lo/base_reachable_time_ms
0.015940 1643 snmpd R neigh/lo/base_reachable_time_ms
0.015941 1643 snmpd R lo/base_reachable_time_ms
0.015941 1643 snmpd R base_reachable_time_ms
0.015943 1643 snmpd M base_reachable_time_ms
0.043569 1876 supervise M 20281
0.043573 1886 supervise M 20280
0.043582 1886 supervise R supervise/status.new
[...]
USAGE message:
# ./dcsnoop.py -h
usage: dcsnoop.py [-h] [-a]
Trace directory entry cache (dcache) lookups
optional arguments:
-h, --help show this help message and exit
-a, --all trace all lookups (default is fails only)
examples:
./dcsnoop # trace failed dcache lookups
./dcsnoop -a # trace all dcache lookups
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# dcstat Directory entry cache (dcache) stats.
# For Linux, uses BCC, eBPF.
#
# USAGE: dcstat [interval [count]]
#
# This uses kernel dynamic tracing of kernel functions, lookup_fast() and
# d_lookup(), which will need to be modified to match kernel changes. See
# code comments.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from ctypes import c_int
from time import sleep, strftime
from sys import argv
def usage():
print("USAGE: %s [interval [count]]" % argv[0])
exit()
# arguments
interval = 1
count = -1
if len(argv) > 1:
try:
interval = int(argv[1])
if interval == 0:
raise
if len(argv) > 2:
count = int(argv[2])
except: # also catches -h, --help
usage()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
enum stats {
S_REFS = 1,
S_SLOW,
S_MISS,
S_MAXSTAT
};
BPF_TABLE("array", int, u64, stats, S_MAXSTAT + 1);
/*
* How this is instrumented, and how to interpret the statistics, is very much
* tied to the current kernel implementation (this was written on Linux 4.4).
* This will need maintenance to keep working as the implementation changes. To
* aid future adventurers, this is is what the current code does, and why.
*
* First problem: the current implementation takes a path and then does a
* lookup of each component. So how do we count a reference? Once for the path
* lookup, or once for every component lookup? I've chosen the latter
* since it seems to map more closely to actual dcache lookups (via
* __d_lookup_rcu()). It's counted via calls to lookup_fast().
*
* The implementation tries different, progressively slower, approaches to
* lookup a file. At what point do we call it a dcache miss? I've choosen when
* a d_lookup() (which is called during lookup_slow()) returns zero.
*
* I've also included a "SLOW" statistic to show how often the fast lookup
* failed. Whether this exists or is interesting is an implementation detail,
* and the "SLOW" statistic may be removed in future versions.
*/
void count_fast(struct pt_regs *ctx) {
int key = S_REFS;
u64 *leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
}
void count_lookup(struct pt_regs *ctx) {
int key = S_SLOW;
u64 *leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
if (ctx->ax == 0) {
key = S_MISS;
leaf = stats.lookup(&key);
if (leaf) (*leaf)++;
}
}
"""
# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="lookup_fast", fn_name="count_fast")
b.attach_kretprobe(event="d_lookup", fn_name="count_lookup")
# stat column labels and indexes
stats = {
"REFS": 1,
"SLOW": 2,
"MISS": 3
}
# header
print("%-8s " % "TIME", end="")
for stype, idx in sorted(stats.iteritems(), key=lambda (k, v): (v, k)):
print(" %8s" % (stype + "/s"), end="")
print(" %8s" % "HIT%")
# output
i = 0
while (1):
if count > 0:
i += 1
if i > count:
exit()
try:
sleep(interval)
except KeyboardInterrupt:
pass
exit()
print("%-8s: " % strftime("%H:%M:%S"), end="")
# print each statistic as a column
for stype, idx in sorted(stats.iteritems(), key=lambda (k, v): (v, k)):
try:
val = b["stats"][c_int(idx)].value / interval
print(" %8d" % val, end="")
except:
print(" %8d" % 0, end="")
# print hit ratio percentage
try:
ref = b["stats"][c_int(stats["REFS"])].value
miss = b["stats"][c_int(stats["MISS"])].value
hit = ref - miss
pct = float(100) * hit / ref
print(" %8.2f" % pct)
except:
print(" %7s%%" % "-")
b["stats"].clear()
Demonstrations of dcstat, the Linux eBPF/bcc version.
dcstat shows directory entry cache (dcache) statistics. For example:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:11:47: 2059 141 97 95.29
08:11:48: 79974 151 106 99.87
08:11:49: 192874 146 102 99.95
08:11:50: 2051 144 100 95.12
08:11:51: 73373 17239 17194 76.57
08:11:52: 54685 25431 25387 53.58
08:11:53: 18127 8182 8137 55.12
08:11:54: 22517 10345 10301 54.25
08:11:55: 7524 2881 2836 62.31
08:11:56: 2067 141 97 95.31
08:11:57: 2115 145 101 95.22
The output shows the total references per second ("REFS/s"), the number that
took a slower code path to be processed ("SLOW/s"), the number of dcache misses
("MISS/s"), and the hit ratio as a percentage. By default, an interval of 1
second is used.
At 08:11:49, there were 192 thousand references, which almost entirely hit
from the dcache, with a hit ration of 99.95%. A little later, starting at
08:11:51, a workload began that walked many uncached files, reducing the hit
ratio to 53%, and more importantly, a miss rate of over 10 thousand per second.
Here's an interesting workload:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:15:53: 250683 141 97 99.96
08:15:54: 266115 145 101 99.96
08:15:55: 268428 141 97 99.96
08:15:56: 260389 143 99 99.96
It's a 99.96% hit ratio, and these are all negative hits: accessing a file that
does not exist. Here's the C program that generated the workload:
# cat -n badopen.c
1 #include <sys/types.h>
2 #include <sys/stat.h>
3 #include <fcntl.h>
4
5 int
6 main(int argc, char *argv[])
7 {
8 int fd;
9 while (1) {
10 fd = open("bad", O_RDONLY);
11 }
12 return 0;
13 }
This is a simple workload generator than tries to open a missing file ("bad")
as quickly as possible.
Lets see what happens if the workload attempts to open a different filename
each time (which is also a missing file), using the following C code:
# cat -n badopen2.c
1 #include <sys/types.h>
2 #include <sys/stat.h>
3 #include <fcntl.h>
4 #include <stdio.h>
5
6 int
7 main(int argc, char *argv[])
8 {
9 int fd, i = 0;
10 char buf[128] = {};
11
12 while (1) {
13 sprintf(buf, "bad%d", i++);
14 fd = open(buf, O_RDONLY);
15 }
16 return 0;
17 }
Here's dcstat:
# ./dcstat
TIME REFS/s SLOW/s MISS/s HIT%
08:18:52: 241131 237544 237505 1.51
08:18:53: 238210 236323 236278 0.82
08:18:54: 235259 233307 233261 0.85
08:18:55: 233144 231256 231214 0.83
08:18:56: 231981 230097 230053 0.83
dcstat also supports an optional interval and optional count. For example,
printing 5 second summaries 3 times:
# ./dcstat 5 3
TIME REFS/s SLOW/s MISS/s HIT%
08:20:03: 2085 143 99 95.23
08:20:08: 2077 143 98 95.24
08:20:14: 2071 144 100 95.15
USAGE message:
# ./dcstat -h
USAGE: ./dcstat [interval [count]]
...@@ -97,7 +97,7 @@ start_ts = 0 ...@@ -97,7 +97,7 @@ start_ts = 0
# format output # format output
while 1: while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields() (task, pid, cpu, flags, ts, msg) = b.trace_fields()
(delta, filename) = msg.split(" ") (delta, filename) = msg.split(" ", 1)
# print columns # print columns
print("%-8s %-6d %-16s %-7.2f %s" % (strftime("%H:%M:%S"), pid, task, print("%-8s %-6d %-16s %-7.2f %s" % (strftime("%H:%M:%S"), pid, task,
......
#!/usr/bin/env python
#
# oomkill Trace oom_kill_process(). For Linux, uses BCC, eBPF.
#
# This traces the kernel out-of-memory killer, and prints basic details,
# including the system load averages. This can provide more context on the
# system state at the time of OOM: was it getting busier or steady, based
# on the load averages? This tool may also be useful to customize for
# investigations; for example, by adding other task_struct details at the time
# of OOM.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from bcc import BPF
from time import strftime
import ctypes as ct
# linux stats
loadavg = "/proc/loadavg"
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/oom.h>
struct data_t {
u64 fpid;
u64 tpid;
u64 pages;
char fcomm[TASK_COMM_LEN];
char tcomm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);
void kprobe__oom_kill_process(struct pt_regs *ctx, struct oom_control *oc,
struct task_struct *p, unsigned int points, unsigned long totalpages)
{
struct data_t data = {};
u32 pid = bpf_get_current_pid_tgid();
data.fpid = pid;
data.tpid = p->pid;
data.pages = totalpages;
bpf_get_current_comm(&data.fcomm, sizeof(data.fcomm));
bpf_probe_read(&data.tcomm, sizeof(data.tcomm), p->comm);
events.perf_submit(ctx, &data, sizeof(data));
}
"""
# kernel->user event data: struct data_t
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("fpid", ct.c_ulonglong),
("tpid", ct.c_ulonglong),
("pages", ct.c_ulonglong),
("fcomm", ct.c_char * TASK_COMM_LEN),
("tcomm", ct.c_char * TASK_COMM_LEN)
]
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
with open(loadavg) as stats:
avgline = stats.read().rstrip()
print(("%s Triggered by PID %d (\"%s\"), OOM kill of PID %d (\"%s\")"
", %d pages, loadavg: %s") % (strftime("%H:%M:%S"), event.fpid,
event.fcomm, event.tpid, event.tcomm, event.pages, avgline))
# initialize BPF
b = BPF(text=bpf_text)
print("Tracing OOM kills... Ctrl-C to stop.")
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
Demonstrations of oomkill, the Linux eBPF/bcc version.
oomkill is a simple program that traces the Linux out-of-memory (OOM) killer,
and shows basic details on one line per OOM kill:
# ./oomkill
Tracing oom_kill_process()... Ctrl-C to end.
21:03:39 Triggered by PID 3297 ("ntpd"), OOM kill of PID 22516 ("perl"), 3850642 pages, loadavg: 0.99 0.39 0.30 3/282 22724
21:03:48 Triggered by PID 22517 ("perl"), OOM kill of PID 22517 ("perl"), 3850642 pages, loadavg: 0.99 0.41 0.30 2/282 22932
The first line shows that PID 22516, with process name "perl", was OOM killed
when it reached 3850642 pages (usually 4 Kbytes per page). This OOM kill
happened to be triggered by PID 3297, process name "ntpd", doing some memory
allocation.
The system log (dmesg) shows pages of details and system context about an OOM
kill. What it currently lacks, however, is context on how the system had been
changing over time. I've seen OOM kills where I wanted to know if the system
was at steady state at the time, or if there had been a recent increase in
workload that triggered the OOM event. oomkill provides some context: at the
end of the line is the load average information from /proc/loadavg. For both
of the oomkills here, we can see that the system was getting busier at the
time (a higher 1 minute "average" of 0.99, compared to the 15 minute "average"
of 0.30).
oomkill can also be the basis of other tools and customizations. For example,
you can edit it to include other task_struct details from the target PID at
the time of the OOM kill.
The following commands can be used to test this program, and invoke a memory
consuming process that exhausts system memory and is OOM killed:
sysctl -w vm.overcommit_memory=1 # always overcommit
perl -e 'while (1) { $a .= "A" x 1024; }' # eat all memory
WARNING: This exhausts system memory after disabling some overcommit checks.
Only test in a lab environment.
...@@ -63,7 +63,7 @@ int kretprobe__sys_open(struct pt_regs *ctx) ...@@ -63,7 +63,7 @@ int kretprobe__sys_open(struct pt_regs *ctx)
return 0; return 0;
} }
bpf_trace_printk("%s %d\\n", *filenamep, ret); bpf_trace_printk("%d %s\\n", ret, *filenamep);
args_filename.delete(&pid); args_filename.delete(&pid);
return 0; return 0;
...@@ -90,7 +90,7 @@ start_ts = 0 ...@@ -90,7 +90,7 @@ start_ts = 0
# format output # format output
while 1: while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields() (task, pid, cpu, flags, ts, msg) = b.trace_fields()
(filename, ret_s) = msg.split(" ") (ret_s, filename) = msg.split(" ", 1)
ret = int(ret_s) ret = int(ret_s)
if (args.failed and (ret >= 0)): if (args.failed and (ret >= 0)):
......
...@@ -63,7 +63,7 @@ int trace_return(struct pt_regs *ctx) ...@@ -63,7 +63,7 @@ int trace_return(struct pt_regs *ctx)
return 0; return 0;
} }
bpf_trace_printk("%s %d\\n", *filenamep, ret); bpf_trace_printk("%d %s\\n", ret, *filenamep);
args_filename.delete(&pid); args_filename.delete(&pid);
return 0; return 0;
...@@ -96,7 +96,7 @@ start_ts = 0 ...@@ -96,7 +96,7 @@ start_ts = 0
# format output # format output
while 1: while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields() (task, pid, cpu, flags, ts, msg) = b.trace_fields()
(filename, ret_s) = msg.split(" ") (ret_s, filename) = msg.split(" ", 1)
ret = int(ret_s) ret = int(ret_s)
if (args.failed and (ret >= 0)): if (args.failed and (ret >= 0)):
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment