Commit 59d6c350 authored by Brenden Blanco

Merge pull request #353 from brendangregg/master

3 tools: fileslower, execsnoop, runqlat
parents 919045aa c4fa80e7
@@ -71,6 +71,8 @@ Tools:
- tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
- tools/[funccount](tools/funccount.py): Count kernel function calls. [Examples](tools/funccount_example.txt).
- tools/[funclatency](tools/funclatency.py): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
- tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
@@ -80,6 +82,7 @@ Tools:
- tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[stackcount](tools/stackcount.py): Count kernel function calls and their stack traces. [Examples](tools/stackcount_example.txt).
- tools/[stacksnoop](tools/stacksnoop.py): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
@@ -65,7 +65,7 @@ Show a latency histogram for each disk device separately:
usecs
Microsecond range
.TP
-mecs
+msecs
Millisecond range
.TP
count
.TH execsnoop 8 "2016-02-07" "USER COMMANDS"
.SH NAME
execsnoop \- Trace new processes via exec() syscalls. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B execsnoop [\-h] [\-t] [\-X] [\-n NAME]
.SH DESCRIPTION
execsnoop traces new processes, showing the filename executed, argument
list, and return value (0 for success).
It works by tracing the execve() system call (the most commonly used exec() variant).
This catches new processes that follow the fork->exec sequence, as well as
processes that re-exec() themselves. Some applications fork() but do not
exec(), eg, when creating worker processes; these won't be included in the
execsnoop output.
This works by tracing the kernel sys_execve() function using dynamic tracing,
and will need updating to match any changes to this function.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-t
Include a timestamp column.
.TP
\-X
Exclude failed exec()s.
.TP
\-n NAME
Only print command lines matching this name (regex), matched anywhere in the
full command line (filename plus arguments).
.SH EXAMPLES
.TP
Trace all exec() syscalls:
#
.B execsnoop
.TP
Trace all exec() syscalls, and include timestamps:
#
.B execsnoop \-t
.TP
Only trace successful exec()s:
#
.B execsnoop \-X
.TP
Only trace exec()s where the filename or arguments contain "mount":
#
.B execsnoop \-n mount
.SH FIELDS
.TP
TIME(s)
Time of exec() return, in seconds.
.TP
PCOMM
Parent process/command name.
.TP
PID
Process ID.
.TP
RET
Return value of exec(). 0 == success.
.TP
ARGS
Filename for the exec(), followed by up to 19 arguments. An ellipsis "..." is
shown if the argument list is known to be truncated.
.SH OVERHEAD
This traces the kernel execve function and prints output for each event. As the
rate of exec()s is generally expected to be low (< 1000/s), the overhead is also
expected to be negligible. If you have an application that calls exec() at a
high rate, test and understand the overhead before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
opensnoop(8)
.TH fileslower 8 "2016-02-07" "USER COMMANDS"
.SH NAME
fileslower \- Trace slow synchronous file reads and writes.
.SH SYNOPSIS
.B fileslower [\-h] [\-p PID] [min_ms]
.SH DESCRIPTION
This script uses kernel dynamic tracing of synchronous reads and writes
at the VFS interface, to identify slow file reads and writes for any file
system.
This version traces __vfs_read() and __vfs_write(), showing only synchronous
I/O (the path to new_sync_read() and new_sync_write()) and only
I/O with filenames. This approach provides a view of just two file
system request types: file reads and writes. There are typically many others:
asynchronous I/O, directory operations, file handle operations, file open()s,
fflush(), etc.
WARNING: See the OVERHEAD section.
By default, a minimum millisecond threshold of 10 is used.
Since this works by tracing various kernel __vfs_*() functions using dynamic
tracing, it will need updating to match any changes to these functions. A
future version should switch to using FS tracepoints instead.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-p PID
Trace this PID only.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace synchronous file reads and writes slower than 10 ms:
#
.B fileslower
.TP
Trace slower than 1 ms:
#
.B fileslower 1
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B fileslower \-p 181 1
.SH FIELDS
.TP
TIME(s)
Time of I/O completion since the first I/O seen, in seconds.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
D
Direction of I/O. R == read, W == write.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when the application issued it to VFS
to when it completed. This time is inclusive of block device I/O, file system
CPU cycles, file system locks, run queue latency, etc. It is a more accurate
measure of the latency suffered by applications performing file system I/O
than a measurement taken down at the block device interface.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.SH OVERHEAD
Depending on the frequency of application reads and writes, overhead can become
severe, in the worst case slowing applications by 2x. In the best case, the
overhead is negligible. Hopefully for real world workloads the overhead is
often at the lower end of the spectrum -- test before use. The reason for
high overhead is that this traces VFS reads and writes, which includes FS
cache reads and writes, and can exceed one million events per second if the
application is I/O heavy. While the instrumentation is extremely lightweight,
and uses in-kernel eBPF maps for efficient timing and filtering, even a tiny
per-event cost multiplied by one million events per second becomes significant.
You can get an idea of the possible cost by just counting the
instrumented events using the bcc funccount tool, eg:
.PP
# ./funccount.py -i 1 -r '^__vfs_(read|write)$'
.PP
This also costs overhead, but is somewhat less than fileslower.
.PP
If the overhead is prohibitive for your workload, I'd recommend moving
down-stack a little from VFS into the file system functions (ext4, xfs, etc).
Look for updates to bcc for specific file system tools that do this. The
advantage of a per-file system approach is that we can trace post-cache,
greatly reducing events and overhead. The disadvantage is needing custom
tracing approaches for each different file system (whereas VFS is generic).
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
biosnoop(8), funccount(8)
@@ -88,7 +88,7 @@ Nanosecond range
usecs
Microsecond range
.TP
-mecs
+msecs
Millisecond range
.TP
count
.TH runqlat 8 "2016-02-07" "USER COMMANDS"
.SH NAME
runqlat \- Run queue (scheduler) latency as a histogram.
.SH SYNOPSIS
.B runqlat [\-h] [\-T] [\-m] [\-P] [\-p PID] [interval] [count]
.SH DESCRIPTION
This measures the time a task spends waiting on a run queue (or equivalent
scheduler data structure) for a turn on-CPU, and shows this time as a
histogram. This time should be small, but a task may need to wait its turn due
to CPU load. The higher the CPU load, the longer a task will generally need to
wait its turn.
This tool measures two types of run queue latency:
1. The time from a task being enqueued on a run queue to its context switch
and execution. This traces enqueue_task_*() -> finish_task_switch(),
and instruments the run queue latency after a voluntary context switch.
2. The time from when a task was involuntary context switched and still
in the runnable state, to when it next executed. This is instrumented
from finish_task_switch() alone.
This tool uses in-kernel eBPF maps for storing timestamps and the histogram,
for efficiency. Despite this, the overhead of this tool may become significant
for some workloads: see the OVERHEAD section.
This works by tracing various kernel scheduler functions using dynamic tracing,
and will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-T
Include timestamps on output.
.TP
\-m
Output histogram in milliseconds.
.TP
\-P
Print a histogram for each PID.
.TP
\-p PID
Only show this PID (filtered in kernel for efficiency).
.TP
interval
Output interval, in seconds.
.TP
count
Number of outputs.
.SH EXAMPLES
.TP
Summarize run queue latency as a histogram:
#
.B runqlat
.TP
Print 1 second summaries, 10 times:
#
.B runqlat 1 10
.TP
Print 1 second summaries, using milliseconds as units for the histogram, and include timestamps on output:
#
.B runqlat \-mT 1
.TP
Trace PID 185 only, 1 second summaries:
#
.B runqlat \-p 185 1
.SH FIELDS
.TP
usecs
Microsecond range
.TP
msecs
Millisecond range
.TP
count
How many times a task event fell into this range
.TP
distribution
An ASCII bar chart to visualize the distribution (count column)
.SH OVERHEAD
This traces scheduler functions, which can become very frequent. While eBPF
has very low overhead, and this tool uses in-kernel maps for efficiency, the
frequency of scheduler events for some workloads may be high enough that the
overhead of this tool becomes significant. Measure in a lab environment
to quantify the overhead before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
pidstat(1)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# execsnoop Trace new processes via exec() syscalls.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: execsnoop [-h] [-t] [-X] [-n NAME]
#
# This currently will print up to a maximum of 19 arguments, plus the process
# name, so 20 fields in total (MAXARG).
#
# This won't catch all new processes: an application may fork() but not exec().
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 07-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import re
# arguments
examples = """examples:
./execsnoop # trace all exec() syscalls
./execsnoop -X # only show successful exec()s
./execsnoop -t # include timestamps
./execsnoop -n main # only print command lines containing "main"
"""
parser = argparse.ArgumentParser(
description="Trace exec() syscalls",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-t", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-X", "--excludefails", action="store_true",
help="exclude failed exec()s")
parser.add_argument("-n", "--name",
help="only print commands matching this name (regex), any arg")
args = parser.parse_args()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/fs.h>
#define MAXARG 20
#define ARGSIZE 64
static int print_arg(void *ptr) {
// Fetch an argument, and print using bpf_trace_printk(). This is a work
// around until we have a binary trace interface for passing event data to
// bcc. Since exec()s should be low frequency, the additional overhead in
// this case should not be a problem.
const char *argp = NULL;
char buf[ARGSIZE] = {};
bpf_probe_read(&argp, sizeof(argp), ptr);
if (argp == NULL) return 0;
bpf_probe_read(&buf, sizeof(buf), (void *)(argp));
bpf_trace_printk("ARG %s\\n", buf);
return 1;
}
int kprobe__sys_execve(struct pt_regs *ctx, struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp)
{
char fname[ARGSIZE] = {};
bpf_probe_read(&fname, sizeof(fname), (void *)(filename));
bpf_trace_printk("ARG %s\\n", fname);
int i = 1; // skip first arg, as we printed fname
// unrolled loop to walk argv[] (MAXARG)
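// (the BPF verifier rejects loops, hence the hand-unrolled walk below)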
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++; // X
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++;
if (print_arg((void *)&__argv[i]) == 0) goto out; i++; // XX
bpf_trace_printk("ARG ...\\n"); // truncated
out:
return 0;
}
int kretprobe__sys_execve(struct pt_regs *ctx)
{
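// on x86_64, the probed function's return value is in ctx->ax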
bpf_trace_printk("RET %d\\n", ctx->ax);
return 0;
}
"""
# initialize BPF
b = BPF(text=bpf_text)
# header
if args.timestamp:
    print("%-8s" % ("TIME(s)"), end="")
print("%-16s %-6s %3s %s" % ("PCOMM", "PID", "RET", "ARGS"))
start_ts = 0
cmd = {}
pcomm = {}
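# cmd accumulates each PID's argument string from ARG events; pcomm records the
# command name seen at exec() time, printed as PCOMM when the RET event arrives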
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    (type, arg) = msg.split(" ", 1)
    if start_ts == 0:
        start_ts = ts
    if type == "RET":
        skip = 0
        if args.name:
            if not re.search(args.name, cmd[pid]):
                skip = 1
        if args.excludefails and int(arg) < 0:
            skip = 1
        if skip:
            del cmd[pid]
            del pcomm[pid]
            continue
        # output
        if args.timestamp:
            print("%-8.3f" % (ts - start_ts), end="")
        print("%-16s %-6s %3s %s" % (pcomm[pid], pid, arg, cmd[pid]))
        del cmd[pid]
        del pcomm[pid]
    else:
        # build command line string
        if pid in cmd:
            cmd[pid] = cmd[pid] + " " + arg
        else:
            cmd[pid] = arg
        if pid not in pcomm:
            pcomm[pid] = task
Demonstrations of execsnoop, the Linux eBPF/bcc version.
execsnoop traces new processes. For example:
# ./execsnoop
PCOMM PID RET ARGS
supervise 9660 0 ./run
supervise 9661 0 ./run
mkdir 9662 0 /bin/mkdir -p ./main
run 9663 0 ./run
chown 9664 0 /bin/chown nobody:nobody ./main
run 9665 0 /bin/mkdir -p ./main
supervise 9667 0 ./run
run 9660 -2 /usr/local/bin/setuidgid nobody /command/multilog t ./main
chown 9668 0 /bin/chown nobody:nobody ./main
run 9666 0 /bin/chmod 0777 main
run 9663 -2 /usr/local/bin/setuidgid nobody /command/multilog t ./main
run 9669 0 /bin/mkdir -p ./main
run 9661 -2 /usr/local/bin/setuidgid nobody /command/multilog t ./main
supervise 9670 0 ./run
[...]
The output shows the parent process/command name (PCOMM), the PID, the return
value of the exec() (RET), and the filename with arguments (ARGS). The example
above shows various regular system daemon activity, including some failures
(trying to execute a /usr/local/bin/setuidgid, which I just noticed doesn't
exist).
It works by tracing the execve() system call (the most commonly used exec() variant), and
shows details of the arguments and return value. This catches new processes
that follow the fork->exec sequence, as well as processes that re-exec()
themselves. Some applications fork() but do not exec(), eg, when creating worker
processes; these won't be included in the execsnoop output.
The -X option can be used to only show successful exec()s. For example, tracing
a "man ls":
# ./execsnoop -X
PCOMM PID RET ARGS
bash 15887 0 /usr/bin/man ls
preconv 15894 0 /usr/bin/preconv -e UTF-8
man 15896 0 /usr/bin/tbl
man 15897 0 /usr/bin/nroff -mandoc -rLL=169n -rLT=169n -Tutf8
man 15898 0 /usr/bin/pager -s
nroff 15900 0 /usr/bin/locale charmap
nroff 15901 0 /usr/bin/groff -mtty-char -Tutf8 -mandoc -rLL=169n -rLT=169n
groff 15902 0 /usr/bin/troff -mtty-char -mandoc -rLL=169n -rLT=169n -Tutf8
groff 15903 0 /usr/bin/grotty
This shows the various commands used to process the "man ls" command.
A -t option can be used to include a timestamp column, and a -n option to match
on a name or substring from the full command line (filename + args). Regular
expressions are allowed. For example, matching commands containing "mount":
# ./execsnoop -tn mount
TIME(s) PCOMM PID RET ARGS
2.849 bash 18049 0 /bin/mount -p
USAGE message:
# ./execsnoop -h
usage: execsnoop [-h] [-t] [-X] [-n NAME]
Trace exec() syscalls
optional arguments:
-h, --help show this help message and exit
-t, --timestamp include timestamp on output
-X, --excludefails exclude failed exec()s
-n NAME, --name NAME only print commands matching this name (regex), any
arg
examples:
./execsnoop # trace all exec() syscalls
./execsnoop -X # only show successful exec()s
./execsnoop -t # include timestamps
./execsnoop -n main # only print command lines containing "main"
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# fileslower Trace slow synchronous file reads and writes.
# For Linux, uses BCC, eBPF.
#
# USAGE: fileslower [-h] [-p PID] [min_ms]
#
# This script uses kernel dynamic tracing of synchronous reads and writes
# at the VFS interface, to identify slow file reads and writes for any file
# system.
#
# This works by tracing __vfs_read() and __vfs_write(), and filtering for
# synchronous I/O (the path to new_sync_read() and new_sync_write()), and
# for I/O with filenames. This approach provides a view of just two file
# system request types. There are typically many others: asynchronous I/O,
# directory operations, file handle operations, etc, that this tool does not
# instrument.
#
# WARNING: This traces VFS reads and writes, which can be extremely frequent,
# and so the overhead of this tool can become severe depending on the
# workload.
#
# By default, a minimum millisecond threshold of 10 is used.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 06-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import signal
# arguments
examples = """examples:
./fileslower # trace sync file I/O slower than 10 ms (default)
./fileslower 1 # trace sync file I/O slower than 1 ms
./fileslower -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Trace slow synchronous file reads and writes",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("min_ms", nargs="?", default='10',
help="minimum I/O duration to trace, in ms (default 10)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
debug = 0
# signal handler
def signal_ignore(signal, frame):
    print()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#define TRACE_READ 0
#define TRACE_WRITE 1
struct val_t {
u32 sz;
u64 ts;
char name[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, pid_t, struct val_t);
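// in-flight request info (size, start timestamp, filename), keyed by thread ID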
// store timestamp and size on entry
static int trace_rw_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
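// bpf_get_current_pid_tgid() returns tgid << 32 | pid; the u32 assignment keeps
// the low 32 bits, i.e. the kernel thread ID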
if (FILTER)
return 0;
// skip I/O lacking a filename
struct dentry *de = file->f_path.dentry;
if (de->d_iname[0] == 0)
return 0;
// store size and timestamp by pid
struct val_t val = {};
val.sz = count;
val.ts = bpf_ktime_get_ns();
__builtin_memcpy(&val.name, de->d_iname, sizeof(val.name));
entryinfo.update(&pid, &val);
return 0;
}
int trace_read_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
// skip non-sync I/O; see kernel code for __vfs_read()
if (!(file->f_op->read_iter))
return 0;
return trace_rw_entry(ctx, file, buf, count);
}
int trace_write_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
// skip non-sync I/O; see kernel code for __vfs_write()
if (!(file->f_op->write_iter))
return 0;
return trace_rw_entry(ctx, file, buf, count);
}
// output
static int trace_rw_return(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = entryinfo.lookup(&pid);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
u64 delta_us = (bpf_ktime_get_ns() - valp->ts) / 1000;
entryinfo.delete(&pid);
if (delta_us < MIN_US)
return 0;
if (type == TRACE_READ) {
bpf_trace_printk("R %d %d %s\\n", valp->sz, delta_us, valp->name);
} else {
bpf_trace_printk("W %d %d %s\\n", valp->sz, delta_us, valp->name);
}
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_rw_return(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_rw_return(ctx, TRACE_WRITE);
}
"""
bpf_text = bpf_text.replace('MIN_US', str(min_ms * 1000))
if args.pid:
    bpf_text = bpf_text.replace('FILTER', 'pid != %s' % pid)
else:
    bpf_text = bpf_text.replace('FILTER', '0')
if debug:
    print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
# I'd rather trace these via new_sync_read/new_sync_write (which used to be
# do_sync_read/do_sync_write), but those became static. So trace these from
# the parent functions, at the cost of more overhead, instead.
# Ultimately, we should be using [V]FS tracepoints.
b.attach_kprobe(event="__vfs_read", fn_name="trace_read_entry")
b.attach_kprobe(event="__vfs_write", fn_name="trace_write_entry")
b.attach_kretprobe(event="__vfs_read", fn_name="trace_read_return")
b.attach_kretprobe(event="__vfs_write", fn_name="trace_write_return")
# header
print("Tracing sync read/writes slower than %d ms" % min_ms)
print("%-8s %-14s %-6s %1s %-7s %7s %s" % ("TIME(s)", "COMM", "PID", "D",
"BYTES", "LAT(ms)", "FILENAME"))
start_ts = 0
# format output
while 1:
    (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    args = msg.split(" ", 3)
    (type_s, sz, delta_us_s) = (args[0], args[1], args[2])
    try:
        filename = args[3]
    except:
        filename = "?"
    if start_ts == 0:
        start_ts = ts
    ms = float(int(delta_us_s, 10)) / 1000
    print("%-8.3f %-14.14s %-6s %1s %-7s %7.2f %s" % (
        ts - start_ts, task, pid, type_s, sz, ms, filename))
Demonstrations of fileslower, the Linux eBPF/bcc version.
fileslower shows file-based synchronous reads and writes slower than a
threshold. For example:
# ./fileslower
Tracing sync read/writes slower than 10 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 randread.pl 4762 R 8192 12.70 data1
8.850 randread.pl 4762 R 8192 11.26 data1
12.852 randread.pl 4762 R 8192 10.43 data1
This showed a few reads from a "randread.pl" program, each 8 Kbytes in size,
and from a "data1" file. These all had over 10 ms latency.
This "latency" is measured from when the read or write was issued at the VFS
interface, to when it completed. This spans everything: block device I/O (disk
I/O), file system CPU cycles, file system locks, run queue latency, etc. This
is a better measure of the latency suffered by applications reading from the
file system than measuring this down at the block device interface.
Note that this only traces file reads and writes: other file system operations
(eg, directory operations, open(), fflush()) are not traced.
The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
# ./fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 randread.pl 6925 R 8192 1.06 data1
0.082 randread.pl 6925 R 8192 2.42 data1
0.116 randread.pl 6925 R 8192 1.78 data1
0.153 randread.pl 6925 R 8192 2.31 data1
0.330 randread.pl 6925 R 8192 1.14 data1
0.345 randread.pl 6925 R 8192 1.52 data1
0.359 randread.pl 6925 R 8192 1.04 data1
0.532 randread.pl 6925 R 8192 2.56 data1
0.609 supervise 1892 W 18 3.65 status.new
0.610 randread.pl 6925 R 8192 1.37 data1
0.614 randread.pl 6925 R 8192 3.04 data1
0.729 randread.pl 6925 R 8192 2.90 data1
0.755 randread.pl 6925 R 8192 1.12 data1
0.762 randread.pl 6925 R 8192 2.62 data1
0.771 randread.pl 6925 R 8192 1.07 data1
0.816 randread.pl 6925 R 8192 10.50 data1
0.983 randread.pl 6925 R 8192 1.73 data1
0.989 randread.pl 6925 R 8192 2.12 data1
0.992 randread.pl 6925 R 8192 2.17 data1
1.001 randread.pl 6925 R 8192 1.93 data1
1.007 randread.pl 6925 R 8192 2.03 data1
1.210 randread.pl 6925 R 8192 1.82 data1
1.213 randread.pl 6925 R 8192 2.58 data1
1.219 randread.pl 6925 R 8192 2.20 data1
1.430 randread.pl 6925 R 8192 1.01 data1
1.448 randread.pl 6925 R 8192 2.22 data1
[...]
There's now much more output (this spans only 1.4 seconds, the previous output
spanned 12 seconds), and the lower threshold is catching more I/O.
In the following example, the file system caches were dropped before running
fileslower, and then in another session a "man ls" was executed. The command
and files read from disk can be seen:
# echo 3 > /proc/sys/vm/drop_caches; ./fileslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 bash 9647 R 128 5.83 man
0.050 man 9647 R 832 19.52 libmandb-2.6.7.1.so
0.066 man 9647 R 832 15.79 libman-2.6.7.1.so
0.123 man 9647 R 832 56.36 libpipeline.so.1.3.0
0.135 man 9647 R 832 9.79 libgdbm.so.3.0.0
0.323 man 9647 R 4096 59.52 locale.alias
0.540 man 9648 R 8192 11.11 ls.1.gz
0.558 man 9647 R 72 6.97 index.db
0.563 man 9647 R 4096 5.12 index.db
0.723 man 9658 R 128 12.06 less
0.725 man 9656 R 128 14.52 nroff
0.779 man 9655 R 128 68.86 tbl
0.814 nroff 9660 R 128 14.55 locale
0.830 pager 9658 R 4096 28.27 .lesshst
0.866 man 9654 R 128 163.12 preconv
0.980 nroff 9684 R 128 13.80 groff
0.999 groff 9684 R 4096 14.29 DESC
1.036 groff 9685 R 128 5.94 troff
1.038 groff 9686 R 128 7.76 grotty
1.065 troff 9685 R 4096 6.33 R
1.082 troff 9685 R 4096 10.52 BI
1.096 troff 9685 R 4096 8.70 troffrc
1.176 troff 9685 R 4096 80.12 composite.tmac
1.195 troff 9685 R 4096 19.20 fallbacks.tmac
1.202 troff 9685 R 4096 6.79 tty.tmac
1.221 troff 9685 R 4096 7.87 man.local
2.977 supervise 1876 W 18 4.23 status.new
This caught an individual I/O reaching 163.12 ms, for the "preconv" file. While
the file system cache was flushed, causing these to need to be read from disk,
the duration here may not be entirely disk I/O: it can include file system
locks, run queue latency, etc. These can be explored using other commands.
USAGE message:
# ./fileslower -h
usage: fileslower [-h] [-p PID] [min_ms]
Trace slow synchronous file reads and writes
positional arguments:
min_ms minimum I/O duration to trace, in ms (default 10)
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
examples:
./fileslower # trace sync file I/O slower than 10 ms (default)
./fileslower 1 # trace sync file I/O slower than 1 ms
./fileslower -p 185 # trace PID 185 only
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# runqlat Run queue (scheduler) latency as a histogram.
# For Linux, uses BCC, eBPF.
#
# USAGE: runqlat [-h] [-T] [-m] [-P] [-p PID] [interval] [count]
#
# This measures the time a task spends waiting on a run queue for a turn
# on-CPU, and shows this time as a histogram. This time should be small, but a
# task may need to wait its turn due to CPU load.
#
# This measures two types of run queue latency:
# 1. The time from a task being enqueued on a run queue to its context switch
# and execution. This traces enqueue_task_*() -> finish_task_switch(),
# and instruments the run queue latency after a voluntary context switch.
# 2. The time from when a task was involuntary context switched and still
# in the runnable state, to when it next executed. This is instrumented
# from finish_task_switch() alone.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 07-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
# arguments
examples = """examples:
./runqlat # summarize run queue latency as a histogram
./runqlat 1 10 # print 1 second summaries, 10 times
./runqlat -mT 1 # 1s summaries, milliseconds, and timestamps
./runqlat -P # show each PID separately
./runqlat -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Summarize run queue (schedular) latency as a histogram",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("-m", "--milliseconds", action="store_true",
help="millisecond histogram")
parser.add_argument("-P", "--pids", action="store_true",
help="print a histogram per process ID")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("interval", nargs="?", default=99999999,
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include "kernel/sched/sched.h"
typedef struct pid_key {
u64 pid; // work around
u64 slot;
} pid_key_t;
BPF_HASH(start, u32);
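// enqueue timestamp (ns) keyed by pid; the value type defaults to u64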
STORAGE
// record enqueue timestamp
int trace_enqueue(struct pt_regs *ctx, struct rq *rq, struct task_struct *p,
int flags)
{
u32 pid = p->pid;
if (FILTER)
return 0;
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
// calculate latency
int trace_run(struct pt_regs *ctx, struct task_struct *prev)
{
u32 pid;
// ivcsw: treat like an enqueue event and store timestamp
if (prev->state == TASK_RUNNING) {
pid = prev->pid;
if (!(FILTER)) {
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
}
}
pid = bpf_get_current_pid_tgid();
if (FILTER)
return 0;
u64 *tsp, delta;
// fetch timestamp and calculate delta
tsp = start.lookup(&pid);
if (tsp == 0) {
return 0; // missed enqueue
}
delta = bpf_ktime_get_ns() - *tsp;
FACTOR
// store as histogram
STORE
start.delete(&pid);
return 0;
}
"""
# code substitutions
if args.pid:
    bpf_text = bpf_text.replace('FILTER', 'pid != %s' % args.pid)
else:
    bpf_text = bpf_text.replace('FILTER', '0')
if args.milliseconds:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;')
    label = "msecs"
else:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;')
    label = "usecs"
if args.pids:
    bpf_text = bpf_text.replace('STORAGE',
        'BPF_HISTOGRAM(dist, pid_key_t);')
    bpf_text = bpf_text.replace('STORE',
        'pid_key_t key = {.pid = pid, .slot = bpf_log2l(delta)}; ' +
        'dist.increment(key);')
else:
    bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
    bpf_text = bpf_text.replace('STORE',
        'dist.increment(bpf_log2l(delta));')
if debug:
    print(bpf_text)
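# For illustration, with no options the placeholders above resolve to:
#   FILTER  -> 0
#   FACTOR  -> delta /= 1000;               (microsecond histogram)
#   STORAGE -> BPF_HISTOGRAM(dist);
#   STORE   -> dist.increment(bpf_log2l(delta));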
# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event_re="enqueue_task_*", fn_name="trace_enqueue")
b.attach_kprobe(event="finish_task_switch", fn_name="trace_run")
print("Tracing run queue latency... Hit Ctrl-C to end.")
# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1
    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")
    dist.print_log2_hist(label, "pid", section_print_fn=int)
    dist.clear()
    countdown -= 1
    if exiting or countdown == 0:
        exit()