Commit a32fbafb authored by Brendan Gregg

fsslower

parent 788f18d2
@@ -71,6 +71,7 @@ Tools:
- tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[fsslower](tools/fsslower.py): Trace slow file system synchronous reads and writes. [Examples](tools/fsslower_example.txt).
- tools/[funccount](tools/funccount.py): Count kernel function calls. [Examples](tools/funccount_example.txt).
- tools/[funclatency](tools/funclatency.py): Time kernel functions and show their latency distribution. [Examples](tools/funclatency_example.txt).
- tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
.TH fsslower 8 "2016-02-07" "USER COMMANDS"
.SH NAME
fsslower \- Trace slow file system synchronous reads and writes.
.SH SYNOPSIS
.B fsslower [\-h] [\-p PID] [min_ms]
.SH DESCRIPTION
This script uses kernel dynamic tracing of synchronous reads and writes
at the VFS interface, to identify slow file system I/O for any file system.
It traces __vfs_read() and __vfs_write(), showing only
synchronous I/O (the path to new_sync_read() and new_sync_write()) and
I/O with filenames. This approach provides a view of just two file
system request types: reads and writes. There are typically many others:
asynchronous I/O, directory operations, file handle operations, file open()s,
fflush(), etc, that this tool does not currently instrument. This
implementation is a workaround until suitable file system tracepoints are
available.
.PP
WARNING: See the OVERHEAD section.
.PP
By default, a minimum millisecond threshold of 10 is used.
.PP
Since this works by tracing various kernel __vfs_*() functions using dynamic
tracing, it will need updating to match any changes to these functions. A
future version should switch to using FS tracepoints instead.
.PP
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-p PID
Trace this PID only.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace synchronous file system reads and writes slower than 10 ms:
#
.B fsslower
.TP
Trace slower than 1 ms:
#
.B fsslower 1
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B fsslower \-p 181 1
.SH FIELDS
.TP
TIME(s)
Time of I/O completion since the first I/O seen, in seconds.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
D
Direction of I/O. R == read, W == write.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when the application issued it to VFS
to when it completed. This time is inclusive of block device I/O, file system
CPU cycles, file system locks, run queue latency, etc. It is a more accurate
measure of the latency suffered by applications performing file system I/O
than measuring latency down at the block device interface.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.SH OVERHEAD
Depending on the frequency of application reads and writes, overhead can become
severe, in the worst case slowing applications by 2x. In the best case, the
overhead is negligible. For real-world workloads the overhead is hopefully
often at the lower end of the spectrum -- test before use. The reason overhead
can be high is that this traces VFS reads and writes, which includes file
system cache reads and writes, and can exceed one million events per second if
the application is I/O heavy. While the instrumentation is extremely
lightweight, and uses in-kernel eBPF maps for efficient timing and filtering,
even a tiny per-event cost multiplied by one million events per second becomes
significant. You can get an idea of the possible cost by counting the
instrumented events using the bcc funccount tool, eg:
.PP
# ./funccount.py -i 1 -r '^__vfs_(read|write)$'
.PP
This also costs overhead, but is somewhat less than fsslower.
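.PP
As a rough, illustrative estimate (an assumed per-event cost, not a
measurement): if each traced event costs on the order of one microsecond,
then one million traced events per second adds on the order of one
CPU-second of overhead per second of runtime.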
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
biosnoop(8), funccount(8)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# fsslower Trace slow file system synchronous reads and writes.
# For Linux, uses BCC, eBPF.
#
# USAGE: fsslower [-h] [-p PID] [min_ms]
#
# This script uses kernel dynamic tracing of synchronous reads and writes
# at the VFS interface, to identify slow file system I/O for any file system.
#
# This works by tracing __vfs_read() and __vfs_write(), and filtering for
# synchronous I/O (the path to new_sync_read() and new_sync_write()), and
# for I/O with filenames. This approach provides a view of just two file
# system request types. There are typically many others: asynchronous I/O,
# directory operations, file handle operations, etc, that this tool does not
# instrument. This implementation is a workaround until we have suitable fs
# tracepoints.
#
# WARNING: This traces VFS reads and writes, which can be extremely frequent,
# and so the overhead of this tool can become severe depending on the
# workload.
#
# By default, a minimum millisecond threshold of 10 is used.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 06-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
import signal
# arguments
examples = """examples:
./fsslower # trace sync I/O slower than 10 ms (default)
./fsslower 1 # trace sync I/O slower than 1 ms
./fsslower -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Trace slow file system sync reads and writes",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("min_ms", nargs="?", default='10',
help="minimum I/O duration to trace, in ms (default 10)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
debug = 0
# signal handler
def signal_ignore(signal, frame):
print()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#define TRACE_READ 0
#define TRACE_WRITE 1
struct val_t {
u32 sz;
u64 ts;
char name[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, pid_t, struct val_t);
// store timestamp and size on entry
static int trace_rw_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
u32 pid;
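    // bpf_get_current_pid_tgid() returns tgid << 32 | pid; assigning to a
    // u32 keeps the lower 32 bits, which is the kernel thread ID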
pid = bpf_get_current_pid_tgid();
if (FILTER)
return 0;
// skip I/O lacking a filename
struct dentry *de = file->f_path.dentry;
if (de->d_iname[0] == 0)
return 0;
// store size and timestamp by pid
struct val_t val = {};
val.sz = count;
val.ts = bpf_ktime_get_ns();
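    // d_iname is the dentry's inline name buffer; names longer than
    // DNAME_INLINE_LEN are stored elsewhere and may appear truncated here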
__builtin_memcpy(&val.name, de->d_iname, sizeof(val.name));
entryinfo.update(&pid, &val);
return 0;
}
int trace_read_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
// skip non-sync I/O; see kernel code for __vfs_read()
if (!(file->f_op->read_iter))
return 0;
return trace_rw_entry(ctx, file, buf, count);
}
int trace_write_entry(struct pt_regs *ctx, struct file *file,
char __user *buf, size_t count)
{
// skip non-sync I/O; see kernel code for __vfs_write()
if (!(file->f_op->write_iter))
return 0;
return trace_rw_entry(ctx, file, buf, count);
}
// output
static int trace_rw_return(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = entryinfo.lookup(&pid);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
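    // elapsed time since entry, converted from nanoseconds to microseconds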
u64 delta_us = (bpf_ktime_get_ns() - valp->ts) / 1000;
entryinfo.delete(&pid);
if (delta_us < MIN_US)
return 0;
if (type == TRACE_READ) {
bpf_trace_printk("R %d %d %s\\n", valp->sz, delta_us, valp->name);
} else {
bpf_trace_printk("W %d %d %s\\n", valp->sz, delta_us, valp->name);
}
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_rw_return(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_rw_return(ctx, TRACE_WRITE);
}
"""
bpf_text = bpf_text.replace('MIN_US', str(min_ms * 1000))
if args.pid:
bpf_text = bpf_text.replace('FILTER', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER', '0')
if debug:
print(bpf_text)
# initialize BPF
b = BPF(text=bpf_text)
# I'd rather trace these via new_sync_read/new_sync_write (which used to be
# do_sync_read/do_sync_write), but those became static. So trace these from
# the parent functions, at the cost of more overhead, instead.
# Ultimately, we should be using [V]FS tracepoints.
b.attach_kprobe(event="__vfs_read", fn_name="trace_read_entry")
b.attach_kprobe(event="__vfs_write", fn_name="trace_write_entry")
b.attach_kretprobe(event="__vfs_read", fn_name="trace_read_return")
b.attach_kretprobe(event="__vfs_write", fn_name="trace_write_return")
# header
print("Tracing sync read/writes slower than %d ms" % min_ms)
print("%-8s %-14s %-6s %1s %-7s %7s %s" % ("TIME(s)", "COMM", "PID", "D",
"BYTES", "LAT(ms)", "FILENAME"))
start_ts = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
    fields = msg.split(" ", 3)
    (type_s, sz, delta_us_s) = (fields[0], fields[1], fields[2])
    try:
        filename = fields[3]
    except IndexError:
        filename = "?"
if start_ts == 0:
start_ts = ts
ms = float(int(delta_us_s, 10)) / 1000
print("%-8.3f %-14.14s %-6s %1s %-7s %7.2f %s" % (
ts - start_ts, task, pid, type_s, sz, ms, filename))
Demonstrations of fsslower, the Linux eBPF/bcc version.
fsslower shows file-based synchronous reads and writes slower than a threshold.
For example:
# ./fsslower
Tracing sync read/writes slower than 10 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 randread.pl 4762 R 8192 12.70 data1
8.850 randread.pl 4762 R 8192 11.26 data1
12.852 randread.pl 4762 R 8192 10.43 data1
This showed a few reads by a "randread.pl" program, each 8 Kbytes in size,
from a "data1" file. These all had over 10 ms latency.
This "latency" is measured from when the read or write was issued at the VFS
interface, to when it completed. This spans everything: block device I/O (disk
I/O), file system CPU cycles, file system locks, run queue latency, etc. This
is a better measure of the latency suffered by applications reading from the
file system than measuring this down at the block device interface.
Note that this only traces reads and writes: other file system operations
(eg, directory operations, open(), fflush()) are not currently traced.
The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
# ./fsslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 randread.pl 6925 R 8192 1.06 data1
0.082 randread.pl 6925 R 8192 2.42 data1
0.116 randread.pl 6925 R 8192 1.78 data1
0.153 randread.pl 6925 R 8192 2.31 data1
0.330 randread.pl 6925 R 8192 1.14 data1
0.345 randread.pl 6925 R 8192 1.52 data1
0.359 randread.pl 6925 R 8192 1.04 data1
0.532 randread.pl 6925 R 8192 2.56 data1
0.609 supervise 1892 W 18 3.65 status.new
0.610 randread.pl 6925 R 8192 1.37 data1
0.614 randread.pl 6925 R 8192 3.04 data1
0.729 randread.pl 6925 R 8192 2.90 data1
0.755 randread.pl 6925 R 8192 1.12 data1
0.762 randread.pl 6925 R 8192 2.62 data1
0.771 randread.pl 6925 R 8192 1.07 data1
0.816 randread.pl 6925 R 8192 10.50 data1
0.983 randread.pl 6925 R 8192 1.73 data1
0.989 randread.pl 6925 R 8192 2.12 data1
0.992 randread.pl 6925 R 8192 2.17 data1
1.001 randread.pl 6925 R 8192 1.93 data1
1.007 randread.pl 6925 R 8192 2.03 data1
1.210 randread.pl 6925 R 8192 1.82 data1
1.213 randread.pl 6925 R 8192 2.58 data1
1.219 randread.pl 6925 R 8192 2.20 data1
1.430 randread.pl 6925 R 8192 1.01 data1
1.448 randread.pl 6925 R 8192 2.22 data1
[...]
There's now much more output (this spans only 1.4 seconds, whereas the previous
output spanned 12 seconds), as the lower threshold is catching more I/O.
In the following example, the file system caches were dropped before running
fsslower, and then in another session a "man ls" was executed. The command
and files read from disk can be seen:
# echo 3 > /proc/sys/vm/drop_caches; ./fsslower 1
Tracing sync read/writes slower than 1 ms
TIME(s) COMM PID D BYTES LAT(ms) FILENAME
0.000 bash 9647 R 128 5.83 man
0.050 man 9647 R 832 19.52 libmandb-2.6.7.1.so
0.066 man 9647 R 832 15.79 libman-2.6.7.1.so
0.123 man 9647 R 832 56.36 libpipeline.so.1.3.0
0.135 man 9647 R 832 9.79 libgdbm.so.3.0.0
0.323 man 9647 R 4096 59.52 locale.alias
0.540 man 9648 R 8192 11.11 ls.1.gz
0.558 man 9647 R 72 6.97 index.db
0.563 man 9647 R 4096 5.12 index.db
0.723 man 9658 R 128 12.06 less
0.725 man 9656 R 128 14.52 nroff
0.779 man 9655 R 128 68.86 tbl
0.814 nroff 9660 R 128 14.55 locale
0.830 pager 9658 R 4096 28.27 .lesshst
0.866 man 9654 R 128 163.12 preconv
0.980 nroff 9684 R 128 13.80 groff
0.999 groff 9684 R 4096 14.29 DESC
1.036 groff 9685 R 128 5.94 troff
1.038 groff 9686 R 128 7.76 grotty
1.065 troff 9685 R 4096 6.33 R
1.082 troff 9685 R 4096 10.52 BI
1.096 troff 9685 R 4096 8.70 troffrc
1.176 troff 9685 R 4096 80.12 composite.tmac
1.195 troff 9685 R 4096 19.20 fallbacks.tmac
1.202 troff 9685 R 4096 6.79 tty.tmac
1.221 troff 9685 R 4096 7.87 man.local
2.977 supervise 1876 W 18 4.23 status.new
This caught an individual I/O reaching 163.12 ms, for the "preconv" file. While
the file system cache was flushed, causing these files to be read from disk,
the duration here may not be entirely disk I/O: it can also include file system
locks, run queue latency, etc. These can be explored using other commands.
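For example, the block device component of this latency can be examined
separately using the companion biosnoop tool from bcc:

# ./biosnoop.py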
USAGE message:
# ./fsslower -h
usage: fsslower [-h] [-p PID] [min_ms]
Trace slow file system sync reads and writes
positional arguments:
min_ms minimum I/O duration to trace, in ms (default 10)
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
examples:
./fsslower # trace sync I/O slower than 10 ms (default)
./fsslower 1 # trace sync I/O slower than 1 ms
./fsslower -p 185 # trace PID 185 only