Commit fe430e55 authored by Brendan Gregg

oomkill

parent 21878139
@@ -83,6 +83,7 @@ Tools:
- tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_examples.txt).
- tools/[offcputime](tools/offcputime.py): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
- tools/[oomkill](tools/oomkill.py): Trace the out-of-memory (OOM) killer. [Examples](tools/oomkill_example.txt).
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
......
.TH oomkill 8 "2016-02-09" "USER COMMANDS"
.SH NAME
oomkill \- Trace oom_kill_process(). Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B oomkill
.SH DESCRIPTION
This traces the kernel out-of-memory killer, and prints basic details,
including the system load averages at the time of the OOM kill. This can
provide more context on the system state at the time: was it getting busier
or steady, based on the load averages? This tool may also be useful to
customize for investigations; for example, by adding other task_struct
details at the time of OOM.
This program is also a basic example of eBPF/bcc.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Trace OOM kill events:
#
.B oomkill
.SH FIELDS
.TP
Triggered by ...
The process ID and process name of the task that was running when another task was OOM
killed.
.TP
OOM kill of ...
The process ID and name of the target process that was OOM killed.
.TP
loadavg
Contents of /proc/loadavg. The first three numbers are 1, 5, and 15 minute
load averages (where the average is an exponentially damped moving sum, and
those numbers are constants in the equation); then there is the number of
running tasks, a slash, and the total number of tasks; and then the last number
is the last PID to be created.
.SH OVERHEAD
Negligible.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
memleak(8)
#!/usr/bin/env python
#
# oomkill Trace oom_kill_process(). For Linux, uses BCC, eBPF.
#
# This traces the kernel out-of-memory killer, and prints basic details,
# including the system load averages. This can provide more context on the
# system state at the time of OOM: was it getting busier or steady, based
# on the load averages? This tool may also be useful to customize for
# investigations; for example, by adding other task_struct details at the time
# of OOM.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 09-Feb-2016 Brendan Gregg Created this.
from bcc import BPF
from time import strftime
# linux stats
loadavg = "/proc/loadavg"
# initialize BPF
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/oom.h>
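// bcc convention: a BPF C function named kprobe__<kernel_function> is
// automatically attached as a kprobe on that kernel function (here
// oom_kill_process()); its first argument is the register context, followed
// by the traced function's own arguments.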
void kprobe__oom_kill_process(struct pt_regs *ctx, struct oom_control *oc,
    struct task_struct *p, unsigned int points, unsigned long totalpages)
{
    bpf_trace_printk("OOM kill of PID %d (\\"%s\\"), %d pages\\n", p->pid,
        p->comm, totalpages);
}
""")
# print output
print("Tracing oom_kill_process()... Ctrl-C to end.")
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    with open(loadavg) as stats:
        avgline = stats.read().rstrip()
    print("%s Triggered by PID %d (\"%s\"), %s, loadavg: %s" % (
        strftime("%H:%M:%S"), pid, task, msg, avgline))
Demonstrations of oomkill, the Linux eBPF/bcc version.
oomkill is a simple program that traces the Linux out-of-memory (OOM) killer,
and shows basic details on one line per OOM kill:
# ./oomkill
Tracing oom_kill_process()... Ctrl-C to end.
21:03:39 Triggered by PID 3297 ("ntpd"), OOM kill of PID 22516 ("perl"), 3850642 pages, loadavg: 0.99 0.39 0.30 3/282 22724
21:03:48 Triggered by PID 22517 ("perl"), OOM kill of PID 22517 ("perl"), 3850642 pages, loadavg: 0.99 0.41 0.30 2/282 22932
The first line shows that PID 22516, with process name "perl", was OOM killed
when it reached 3850642 pages (usually 4 Kbytes per page). This OOM kill
happened to be triggered by PID 3297, process name "ntpd", doing some memory
allocation.
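As a quick sanity check on that figure (assuming the usual 4 Kbyte page size;
check getconf PAGESIZE on your system if unsure), the page count converts to
bytes like so:

pages = 3850642                                       # page count from the oomkill output
print("%.1f GiB" % (pages * 4096 / float(1 << 30)))   # -> ~14.7 GiB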
The system log (dmesg) shows pages of details and system context about an OOM
kill. What it currently lacks, however, is context on how the system had been
changing over time. I've seen OOM kills where I wanted to know if the system
was at steady state at the time, or if there had been a recent increase in
workload that triggered the OOM event. oomkill provides some context: at the
end of the line is the load average information from /proc/loadavg. For both
of the oomkills here, we can see that the system was getting busier at the
time (a higher 1 minute "average" of 0.99, compared to the 15 minute "average"
of 0.30).
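If you want to post-process these lines, the loadavg portion splits cleanly
into its five fields. A minimal sketch in Python, reusing the avgline string
that oomkill.py already reads (field layout as documented in proc(5)):

one, five, fifteen, tasks, lastpid = avgline.split()  # e.g. "0.99 0.39 0.30 3/282 22724"
running, total = tasks.split("/")                     # runnable tasks / total tasks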
oomkill can also be the basis of other tools and customizations. For example,
you can edit it to include other task_struct details from the target PID at
the time of the OOM kill.
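A minimal sketch of one such customization (the extra fields chosen here, the
victim's thread group ID and the OOM badness points, are only an illustration):
bpf_trace_printk() accepts at most three format arguments, so extra details
need a second call, and the Python read loop would then have to consume two
trace lines per event. Inside the BPF text of oomkill.py:

void kprobe__oom_kill_process(struct pt_regs *ctx, struct oom_control *oc,
    struct task_struct *p, unsigned int points, unsigned long totalpages)
{
    bpf_trace_printk("OOM kill of PID %d (\\"%s\\"), %d pages\\n", p->pid,
        p->comm, totalpages);
    // illustrative extra detail; remember to read this extra line in user space
    bpf_trace_printk("victim tgid %d, badness points %d\\n", p->tgid, points);
}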
The following commands can be used to test this program, and invoke a memory
consuming process that exhausts system memory and is OOM killed:
sysctl -w vm.overcommit_memory=1 # always overcommit
perl -e 'while (1) { $a .= "A" x 1024; }' # eat all memory
WARNING: This exhausts system memory after disabling some overcommit checks.
Only test in a lab environment.