Commit 166bf0fa authored by samuelnair, committed by Sasha Goldshtein

nfsslower: trace slow NFS operations

* Initial commit of nfsslower, tracking NFS4_READ, NFS4_WRITE and NFS4_OPEN

* Added in documentation, examples, support for csv and tracing for GETATTR

* Added man pages, README mods and an example file, to comply with
https://github.com/samuelnair/bcc/blob/master/CONTRIBUTING-SCRIPTS.md

* Changes to address comments from @brendangregg and a small bug regarding
the output header not being printed when tracing all NFS operations

* Added nfsslower to the correct alphabetical position

* Addressing Sasha's comments. I appreciate the thoroughness of the review

* Added test case for nfsslower and an extra function to check if the NFS kernel
module is loaded
parent 58889898
@@ -117,6 +117,7 @@ pair of .c and .py files, and some are directories of files.
- tools/[mdflush](tools/mdflush.py): Trace md flush events. [Examples](tools/mdflush_example.txt).
- tools/[mysqld_qslower](tools/mysqld_qslower.py): Trace MySQL server queries slower than a threshold. [Examples](tools/mysqld_qslower_example.txt).
- tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_example.txt).
- tools/[nfsslower](tools/nfsslower.py): Trace slow NFS operations. [Examples](tools/nfsslower_example.txt).
- tools/[offcputime](tools/offcputime.py): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
- tools/[oomkill](tools/oomkill.py): Trace the out-of-memory (OOM) killer. [Examples](tools/oomkill_example.txt).
......
.TH nfsslower 8 "2017-09-01" "USER COMMANDS"
.SH NAME
nfsslower \- Trace slow NFS file operations, with per-event details.
.SH SYNOPSIS
.B nfsslower [\-h] [\-j] [\-p PID] [min_ms]
.SH DESCRIPTION
This tool traces common NFSv3 & NFSv4 file operations: reads, writes, opens, and
getattrs. It measures the time spent in these operations, and prints details
for each that exceeded a threshold.
WARNING: See the OVERHEAD section.
By default, a minimum millisecond threshold of 10 is used. If a threshold of 0
is used, all events are printed (warning: verbose).
Since this works by tracing the nfs_file_operations interface functions, it
will need updating to match any changes to these functions.
This tool uses kprobes to instrument the kernel for entry and exit
information; in the future, tracepoints would be the preferred approach.
Currently there aren't any tracepoints available for nfs_file_read,
nfs_file_write and nfs_file_open. nfs_getattr does have entry and exit
tracepoints, but kprobes are used throughout for consistency.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-p PID
Trace this PID only.
.TP
\-j
Trace output in CSV format.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace NFS reads, writes, opens and getattrs slower than 10 ms:
#
.B nfsslower
.TP
Trace slower than 1 ms:
#
.B nfsslower 1
.TP
Trace slower than 1 ms, and output just the fields in parsable format (CSV):
#
.B nfsslower \-j 1
.TP
Trace all NFS operations (warning: the output will be verbose):
#
.B nfsslower 0
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B nfsslower \-p 181 1
.SH FIELDS
.TP
TIME
Time of I/O completion, as a wall-clock time (HH:MM:SS).
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
T
Type of operation. R == read, W == write, O == open, G == getattr.
.TP
OFF_KB
File offset for the I/O, in Kbytes.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when it was issued by VFS to the
filesystem, to when it completed. This time is inclusive of RPC latency,
network latency, cache lookup, remote fileserver processing latency, etc.
It is a more accurate measure of the latency suffered by applications performing
NFS read/write calls to a fileserver.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.TP
ENDTIME_us
Completion timestamp, microseconds (\-j only).
.TP
OFFSET_b
File offset, bytes (\-j only).
.TP
LATENCY_us
Latency (duration) of the I/O, in microseconds (\-j only).
.SH OVERHEAD
This adds low-overhead instrumentation to NFS operations,
including reads and writes from the file system cache. Such reads, writes, and
particularly getattrs can be very frequent (depending on the workload; eg, 1M/sec),
at which point the overhead of this tool (even if it prints no "slower" events) can
begin to become significant. Measure and quantify before use. If this
continues to be a problem, consider switching to a tool that prints in-kernel
summaries only. This tool has been tested with NFSv3 & NFSv4, but it might work
with NFSv{1,2}, since it traces the generic functions from nfs_file_operations.
.PP
Note that the overhead of this tool should be lower than that of fileslower(8),
as this tool targets NFS functions only, and not all file read/write paths.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion nfsslower_example.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Samuel Nair
.SH SEE ALSO
biosnoop(8), funccount(8), fileslower(8)
@@ -5,6 +5,7 @@
import distutils.version
import subprocess
import os
import re
from unittest import main, skipUnless, TestCase
TOOLS_DIR = "../../tools/"
@@ -50,6 +51,14 @@ class SmokeTests(TestCase):
        self.assertTrue((rc == 0 and allow_early) or rc == 124
                        or (rc == 137 and kill), "rc was %d" % rc)
    def kmod_loaded(self, mod):
        reg = re.compile(r"^%s\s" % mod)
        with open("/proc/modules", "r") as mods:
            for line in mods:
                if reg.match(line):
                    return 1
        return 0

    def setUp(self):
        pass
@@ -209,6 +218,13 @@ class SmokeTests(TestCase):
        # MySQL to be running, or it fails to attach.
        pass

    @skipUnless(kernel_version_ge(4,4), "requires kernel >= 4.4")
    def test_nfsslower(self):
        if self.kmod_loaded("nfs"):
            self.run_with_int("nfsslower.py")
        else:
            pass

    @skipUnless(kernel_version_ge(4,6), "requires kernel >= 4.6")
    def test_offcputime(self):
        self.run_with_duration("offcputime.py 1")
......
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# nfsslower Trace slow NFS operations
# for Linux using BCC & eBPF
#
# Usage: nfsslower [-h] [-p PID] [min_ms]
#
# This script traces some common NFS operations: reads, writes, opens and
# getattrs. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
#
# WARNING: This adds low-overhead instrumentation to these NFS operations,
# including reads and writes from the file system cache. Such reads and writes
# can be very frequent (depending on the workload; eg, 1M/sec), at which
# point the overhead of this tool (even if it prints no "slower" events) can
# begin to become significant.
#
# Most of this code is copied from similar tools (ext4slower, zfsslower etc)
#
# By default, a minimum millisecond threshold of 10 is used.
#
# This tool uses kprobes to instrument the kernel for entry and exit
# information; in the future, tracepoints would be the preferred way.
# Currently there aren't any tracepoints available for nfs_file_read,
# nfs_file_write and nfs_file_open. nfs_getattr does have entry and exit
# tracepoints, but kprobes are used for consistency.
#
# 31-Aug-2017 Samuel Nair created this. Should work with NFSv{3,4}
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct
examples = """
./nfsslower # trace operations slower than 10ms
./nfsslower 1 # trace operations slower than 1ms
./nfsslower -j 1 # ... 1 ms, parsable output (csv)
./nfsslower 0 # trace all nfs operations
./nfsslower -p 121 # trace pid 121 only
"""
parser = argparse.ArgumentParser(
    description="""Trace READ, WRITE, OPEN \
and GETATTR NFS calls slower than a threshold; supports NFSv{3,4}""",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-j", "--csv", action="store_true",
    help="just print fields: comma-separated values")
parser.add_argument("-p", "--pid", help="Trace this pid only")
parser.add_argument("min_ms", nargs="?", default='10',
    help="Minimum IO duration to trace in ms (default=10ms)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
csv = args.csv
debug = 0
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_GETATTR 3
struct val_t {
u64 ts;
u64 offset;
struct file *fp;
struct dentry *d;
};
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
u64 type;
u64 size;
u64 offset;
u64 delta_us;
u64 pid;
char task[TASK_COMM_LEN];
char file[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, u64, struct val_t);
BPF_PERF_OUTPUT(events);
int trace_rw_entry(struct pt_regs *ctx, struct kiocb *iocb,
struct iov_iter *data)
{
u64 id = bpf_get_current_pid_tgid();
u32 pid = id >> 32; // PID is higher part
if(FILTER_PID)
return 0;
// store filep and timestamp by id
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = iocb->ki_filp;
val.d = NULL;
val.offset = iocb->ki_pos;
if (val.fp)
entryinfo.update(&id, &val);
return 0;
}
int trace_file_open_entry (struct pt_regs *ctx, struct inode *inode,
struct file *filp)
{
u64 id = bpf_get_current_pid_tgid();
u32 pid = id >> 32; // PID is higher part
if(FILTER_PID)
return 0;
// store filep and timestamp by id
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = filp;
val.d = NULL;
val.offset = 0;
if (val.fp)
entryinfo.update(&id, &val);
return 0;
}
int trace_getattr_entry(struct pt_regs *ctx, struct vfsmount *mnt,
struct dentry *dentry, struct kstat *stat)
{
u64 id = bpf_get_current_pid_tgid();
u32 pid = id >> 32; // PID is higher part
if(FILTER_PID)
return 0;
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = NULL;
val.d = dentry;
val.offset = 0;
if (val.d)
entryinfo.update(&id, &val);
return 0;
}
static int trace_exit(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u64 id = bpf_get_current_pid_tgid();
u32 pid = id >> 32; // PID is higher part
valp = entryinfo.lookup(&id);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
// calculate delta
u64 ts = bpf_ktime_get_ns();
u64 delta_us = (ts - valp->ts) / 1000;
entryinfo.delete(&id);
if (FILTER_US)
return 0;
// populate output struct
u32 size = PT_REGS_RC(ctx);
struct data_t data = {.type = type, .size = size, .delta_us = delta_us,
.pid = pid};
data.ts_us = ts / 1000;
data.offset = valp->offset;
bpf_get_current_comm(&data.task, sizeof(data.task));
// workaround (rewriter should handle file to d_name in one step):
struct dentry *de = NULL;
struct qstr qs = {};
if(type == TRACE_GETATTR)
{
bpf_probe_read(&de,sizeof(de), &valp->d);
}
else
{
bpf_probe_read(&de, sizeof(de), &valp->fp->f_path.dentry);
}
bpf_probe_read(&qs, sizeof(qs), (void *)&de->d_name);
if (qs.len == 0)
return 0;
bpf_probe_read(&data.file, sizeof(data.file), (void *)qs.name);
// output
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
int trace_file_open_return(struct pt_regs *ctx)
{
return trace_exit(ctx, TRACE_OPEN);
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_exit(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_exit(ctx, TRACE_WRITE);
}
int trace_getattr_return(struct pt_regs *ctx)
{
return trace_exit(ctx, TRACE_GETATTR);
}
"""
if min_ms == 0:
    bpf_text = bpf_text.replace('FILTER_US', '0')
else:
    bpf_text = bpf_text.replace('FILTER_US',
        'delta_us <= %s' % str(min_ms * 1000))
if args.pid:
    bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
    bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
    print(bpf_text)
# kernel->user event data: struct data_t
DNAME_INLINE_LEN = 32 # linux/dcache.h
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
    _fields_ = [
        ("ts_us", ct.c_ulonglong),
        ("type", ct.c_ulonglong),
        ("size", ct.c_ulonglong),
        ("offset", ct.c_ulonglong),
        ("delta_us", ct.c_ulonglong),
        ("pid", ct.c_ulonglong),
        ("task", ct.c_char * TASK_COMM_LEN),
        ("file", ct.c_char * DNAME_INLINE_LEN)
    ]
# process event
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    type = 'R'
    if event.type == 1:
        type = 'W'
    elif event.type == 2:
        type = 'O'
    elif event.type == 3:
        type = 'G'
    if csv:
        print("%d,%s,%d,%s,%d,%d,%d,%s" % (
            event.ts_us, event.task, event.pid, type, event.size,
            event.offset, event.delta_us, event.file))
        return
    print("%-8s %-14.14s %-6s %1s %-7s %-8d %7.2f %s" % (
        strftime("%H:%M:%S"), event.task.decode(), event.pid, type,
        event.size, event.offset / 1024, float(event.delta_us) / 1000,
        event.file.decode()))
# The nfs4_file_open kprobe is NFSv4-specific; the other kprobes are generic,
# so this should also work with earlier NFS versions.
b = BPF(text=bpf_text)
b.attach_kprobe(event="nfs_file_read", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs_file_write", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs4_file_open", fn_name="trace_file_open_entry")
b.attach_kprobe(event="nfs_file_open", fn_name="trace_file_open_entry")
b.attach_kprobe(event="nfs_getattr", fn_name="trace_getattr_entry")
b.attach_kretprobe(event="nfs_file_read", fn_name="trace_read_return")
b.attach_kretprobe(event="nfs_file_write", fn_name="trace_write_return")
b.attach_kretprobe(event="nfs4_file_open", fn_name="trace_file_open_return")
b.attach_kretprobe(event="nfs_file_open", fn_name="trace_file_open_return")
b.attach_kretprobe(event="nfs_getattr", fn_name="trace_getattr_return")
if csv:
    print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:
    if min_ms == 0:
        print("Tracing NFS operations... Ctrl-C to quit")
    else:
        print("Tracing NFS operations that are slower than %d ms... Ctrl-C to quit"
            % min_ms)
    print("%-8s %-14s %-6s %1s %-7s %-8s %7s %s" % ("TIME",
        "COMM", "PID", "T", "BYTES", "OFF_KB", "LAT(ms)", "FILENAME"))
b["events"].open_perf_buffer(print_event, page_cnt=64)
while 1:
    b.kprobe_poll()
Demonstrations of nfsslower, the Linux eBPF/bcc version.
nfsslower shows NFS reads, writes, opens and getattrs slower than a
threshold. For example:
./nfsslower.py
Tracing NFS operations that are slower than 10 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
11:25:16 dd 21295 W 1048576 15360 14.84 1.test
11:25:16 dd 21295 W 1048576 16384 12.73 1.test
11:25:16 dd 21295 W 1048576 17408 24.27 1.test
11:25:16 dd 21295 W 1048576 18432 22.93 1.test
11:25:16 dd 21295 W 1048576 19456 14.65 1.test
11:25:16 dd 21295 W 1048576 20480 12.58 1.test
11:25:16 dd 21297 W 1048576 6144 10.50 1.test.w
11:25:16 dd 21297 W 1048576 7168 16.65 1.test.w
11:25:16 dd 21297 W 1048576 8192 13.01 1.test.w
11:25:16 dd 21297 W 1048576 9216 14.06 1.test.w
This shows 1 Mbyte NFS writes from dd to two different files. The
writes all had latency higher than 10 ms.
This "latency" is measured from when the operation was issued from the VFS
interface to the file system, to when it completed. This spans everything:
RPC latency, network latency, file system CPU cycles, file system locks, run
queue latency, etc. This is a better measure of the latency suffered by
applications reading from an NFS share, and can better expose problems
experienced by NFS clients.
Note that this only traces the common NFS operations (read, write, open and
getattr). I chose to include getattr because a significant percentage of NFS
traffic ends up being getattr calls, which are a good indicator of problems
with an NFS server.
The threshold can be provided as an argument. E.g. I/O slower than 1 ms:
./nfsslower.py 1
Tracing NFS operations that are slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
11:40:16 cp 21583 R 131072 0 4.35 1.test
11:40:16 cp 21583 R 131072 256 1.87 1.test
11:40:16 cp 21583 R 131072 384 2.99 1.test
11:40:16 cp 21583 R 131072 512 4.19 1.test
11:40:16 cp 21583 R 131072 640 4.25 1.test
11:40:16 cp 21583 R 131072 768 4.65 1.test
11:40:16 cp 21583 R 131072 1280 1.08 1.test
11:40:16 cp 21583 R 131072 1408 3.29 1.test
11:40:16 cp 21583 R 131072 1792 3.12 1.test
11:40:16 cp 21583 R 131072 3712 3.55 1.test
11:40:16 cp 21583 R 131072 3840 1.12 1.test
11:40:16 cp 21583 R 131072 4096 3.23 1.test
11:40:16 cp 21583 R 131072 4224 2.73 1.test
11:40:16 cp 21583 R 131072 4352 2.73 1.test
11:40:16 cp 21583 R 131072 4480 6.09 1.test
11:40:16 cp 21583 R 131072 5120 4.40 1.test
[...]
This shows all NFS reads that took more than 1 ms. Depending on the
latency to your fileserver, you might need to tweak this value to filter
out expected, uninteresting I/O.
A threshold of 0 will trace all operations. Warning: the output will be
verbose, as it will include all file system cache hits.
./nfsslower.py 0
Tracing NFS operations
11:56:50 dd 21852 W 1048576 0 0.42 1.test
11:56:50 dd 21852 W 1048576 1024 0.46 1.test
11:56:50 dd 21852 W 1048576 2048 0.36 1.test
11:56:50 cp 21854 G 0 0 0.35 1.test
11:56:50 cp 21854 O 0 0 0.33 1.test
11:56:50 cp 21854 G 0 0 0.00 1.test
11:56:50 cp 21854 R 131072 0 0.07 1.test
11:56:50 cp 21854 R 131072 128 0.02 1.test
11:56:50 cp 21854 R 131072 256 0.02 1.test
11:56:50 cp 21854 R 131072 384 0.02 1.test
11:56:50 cp 21854 R 131072 512 0.02 1.test
11:56:50 cp 21854 R 131072 640 0.02 1.test
11:56:50 cp 21854 R 131072 768 0.02 1.test
11:56:50 cp 21854 R 131072 896 0.02 1.test
11:56:50 cp 21854 R 131072 1024 0.02 1.test
11:56:50 cp 21854 R 131072 1152 0.02 1.test
11:56:50 cp 21854 R 131072 1280 0.02 1.test
11:56:50 cp 21854 R 131072 1408 0.02 1.test
11:56:50 cp 21854 R 131072 1536 0.02 1.test
11:56:50 cp 21854 R 131072 1664 0.02 1.test
11:56:50 cp 21854 R 131072 1792 0.02 1.test
11:56:50 cp 21854 R 131072 1920 0.02 1.test
11:56:50 cp 21854 R 131072 2048 0.02 1.test
11:56:50 cp 21854 R 131072 2176 0.04 1.test
11:56:50 cp 21854 R 131072 2304 0.02 1.test
11:56:50 cp 21854 R 131072 2432 0.03 1.test
11:56:50 cp 21854 R 131072 2560 0.03 1.test
11:56:50 cp 21854 R 131072 2688 0.02 1.test
11:56:50 cp 21854 R 131072 2816 0.03 1.test
11:56:50 cp 21854 R 131072 2944 0.02 1.test
11:56:50 cp 21854 R 0 3072 0.00 1.test
11:56:50 ls 21855 G 0 0 0.00 1.test
11:56:50 ls 21856 G 0 0 0.36 music
11:56:50 ls 21856 G 0 0 0.00 music
11:56:50 ls 21856 G 0 0 0.00 test
11:56:50 ls 21856 G 0 0 0.00 ff
11:56:50 ls 21856 G 0 0 0.00 34.log
11:56:50 ls 21856 G 0 0 0.00 vmlinuz-linux
11:56:50 ls 21856 G 0 0 0.00 2.test
11:56:50 ls 21856 G 0 0 0.00 rt.log
11:56:50 ls 21856 G 0 0 0.00 1.lod
11:56:50 ls 21856 G 0 0 0.00 COPYRIGHT.txt
11:56:50 ls 21856 G 0 0 0.00 gg
11:56:50 ls 21856 G 0 0 0.00 qw.log
11:56:50 ls 21856 G 0 0 0.00 README.md
11:56:50 ls 21856 G 0 0 0.00 1.log
The output now includes open operations ("O"), reads ("R") and getattrs ("G").
Here a cp operation produced the open, getattr and read events, and an ls of
the share produced the trailing getattrs.
A -j option will print just the fields (parsable output, csv):
./nfsslower.py -j 0
ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
87054476520,dd,22754,W,1048576,0,425,1.test
87054482916,dd,22754,W,1048576,1048576,320,1.test
87054488179,dd,22754,W,1048576,2097152,389,1.test
87054511340,cp,22756,G,0,0,371,1.test
87054511685,cp,22756,O,0,0,306,1.test
87054511700,cp,22756,G,0,0,2,1.test
87054512325,cp,22756,R,131072,0,56,1.test
87054512432,cp,22756,R,131072,131072,22,1.test
87054512520,cp,22756,R,131072,262144,32,1.test
87054512600,cp,22756,R,131072,393216,21,1.test
87054512678,cp,22756,R,131072,524288,21,1.test
87054512803,cp,22756,R,131072,655360,56,1.test
This may be useful for visualizing with another tool, for example, for
producing a scatter plot of ENDTIME vs LATENCY, to look for time-based
patterns.
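
As a rough sketch of that idea (not part of bcc), the CSV output could be saved to a
file and then plotted with a small Python helper. The file name nfs.csv, the helper
script itself, and the use of matplotlib are all assumptions for illustration:

./nfsslower.py -j 1 > nfs.csv

#!/usr/bin/python
# plot_nfs_latency.py: hypothetical helper, not shipped with bcc.
# Reads the CSV output of nfsslower -j (header line included) and draws a
# scatter plot of completion time (ENDTIME_us) vs latency (LATENCY_us).
import csv
import matplotlib.pyplot as plt

end_us, lat_us = [], []
with open("nfs.csv") as f:
    for row in csv.DictReader(f):
        end_us.append(int(row["ENDTIME_us"]))
        lat_us.append(int(row["LATENCY_us"]))

plt.scatter(end_us, lat_us, s=4)
plt.xlabel("ENDTIME_us")
plt.ylabel("LATENCY_us")
plt.title("NFS operation latency over time")
plt.savefig("nfs_latency.png")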
USAGE message:
usage: nfsslower.py [-h] [-j] [-p PID] [min_ms]
Trace READ, WRITE, OPEN and GETATTR NFS calls slower than a threshold; supports NFSv{3,4}
positional arguments:
min_ms Minimum IO duration to trace in ms (default=10ms)
optional arguments:
-h, --help show this help message and exit
-j, --csv just print fields: comma-separated values
-p PID, --pid PID Trace this pid only
./nfsslower # trace operations slower than 10ms
./nfsslower 1 # trace operations slower than 1ms
./nfsslower -j 1 # ... 1 ms, parsable output (csv)
./nfsslower 0 # trace all nfs operations
./nfsslower -p 121 # trace pid 121 only