Commit ee74c37a authored by Brendan Gregg's avatar Brendan Gregg

btrfsslower

parent 06f3b9a7
......@@ -71,6 +71,7 @@ Tools:
- tools/[biotop](tools/biotop.py): Top for disks: Summarize block device I/O by process. [Examples](tools/biotop_example.txt).
- tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[btrfsslower](tools/btrfsslower.py): Trace slow btrfs operations. [Examples](tools/btrfsslower_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
......
.TH btrfsslower 8 "2016-02-15" "USER COMMANDS"
.SH NAME
btrfsslower \- Trace slow btrfs file operations, with per-event details.
.SH SYNOPSIS
.B btrfsslower [\-h] [\-j] [\-p PID] [min_ms]
.SH DESCRIPTION
This tool traces common btrfs file operations: reads, writes, opens, and
syncs. It measures the time spent in these operations, and prints details
for each that exceeded a threshold.
WARNING: See the OVERHEAD section.
By default, a minimum millisecond threshold of 10 is used. If a threshold of 0
is used, all events are printed (warning: verbose).
Since this works by tracing the btrfs_file_operations interface functions, it
will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
\-p PID
Trace this PID only.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace synchronous file reads and writes slower than 10 ms:
#
.B btrfsslower
.TP
Trace slower than 1 ms:
#
.B btrfsslower 1
.TP
Trace slower than 1 ms, and output just the fields in parsable format (csv):
#
.B btrfsslower \-j 1
.TP
Trace all file reads and writes (warning: the output will be verbose):
#
.B btrfsslower 0
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B btrfsslower \-p 181 1
.SH FIELDS
.TP
TIME(s)
Time of I/O completion since the first I/O seen, in seconds.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
T
Type of operation. R == read, W == write, O == open, S == fsync.
.TP
OFF_KB
File offset for the I/O, in Kbytes.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when it was issued by VFS to the
filesystem, to when it completed. This time is inclusive of block device I/O,
file system CPU cycles, file system locks, run queue latency, etc. It's a more
accurate measure of the latency suffered by applications performing file
system I/O, than to measure this down at the block device interface.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.TP
ENDTIME_us
Completion timestamp, microseconds (\-j only).
.TP
OFFSET_b
File offset, bytes (\-j only).
.TP
LATENCY_us
Latency (duration) of the I/O, in microseconds (\-j only).
.SH OVERHEAD
This adds low-overhead instrumentation to btrfs writes and fsyncs, as well
as all system reads and opens (due to the current implementation of the
btrfs_file_operations interface). Particularly, all reads and writes from
the file system cache will incur extra overhead while tracing. Such reads and
writes can be very frequent (depending on the workload; eg, 1M/sec), at which
point the overhead of this tool may become noticeable.
Measure and quantify before use. If this
continues to be a problem, consider switching to a tool that prints in-kernel
summaries only, such as btrfsdist(8).
.PP
Note that the overhead of this tool should be less than fileslower(8), as
this tool targets btrfs functions only, and not all file read/write paths
(which can include socket I/O).
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
btrfsdist(8), biosnoop(8), funccount(8), fileslower(8)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# btrfsslower Trace slow btrfs operations.
# For Linux, uses BCC, eBPF.
#
# USAGE: btrfsslower [-h] [-j] [-p PID] [min_ms]
#
# This script traces common btrfs file operations: reads, writes, opens, and
# syncs. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
#
# WARNING: This adds low-overhead instrumentation to these btrfs operations,
# including reads and writes from the file system cache. Such reads and writes
# can be very frequent (depending on the workload; eg, 1M/sec), at which
# point the overhead of this tool (even if it prints no "slower" events) can
# begin to become significant.
#
# By default, a minimum millisecond threshold of 10 is used.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 15-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct
# symbols
kallsyms = "/proc/kallsyms"
# arguments
examples = """examples:
./btrfsslower # trace operations slower than 10 ms (default)
./btrfsslower 1 # trace operations slower than 1 ms
./btrfsslower -j 1 # ... 1 ms, parsable output (csv)
./btrfsslower 0 # trace all operations (warning: verbose)
./btrfsslower -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Trace common btrfs file operations slower than a threshold",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-j", "--csv", action="store_true",
help="just print fields: comma-separated values")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("min_ms", nargs="?", default='10',
help="minimum I/O duration to trace, in ms (default 10)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
csv = args.csv
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
// XXX: switch these to char's when supported
#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_FSYNC 3
struct val_t {
u64 ts;
u64 offset;
struct file *fp;
};
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
u64 type;
u64 size;
u64 offset;
u64 delta_us;
u64 pid;
char task[TASK_COMM_LEN];
char file[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, pid_t, struct val_t);
BPF_PERF_OUTPUT(events);
//
// Store timestamp and size on entry
//
// The current btrfs (Linux 4.5) uses generic_file_read_iter() instead of it's
// own read function. So we need to trace that and then filter on btrfs, which
// I do by checking file->f_op.
int trace_read_entry(struct pt_regs *ctx, struct kiocb *iocb)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// btrfs filter on file->f_op == btrfs_file_operations
struct file *fp = iocb->ki_filp;
if ((u64)fp->f_op != BTRFS_FILE_OPERATIONS)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = fp;
val.offset = iocb->ki_pos;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// btrfs_file_write_iter():
int trace_write_entry(struct pt_regs *ctx, struct kiocb *iocb)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = iocb->ki_filp;
val.offset = iocb->ki_pos;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// The current btrfs (Linux 4.5) uses generic_file_open(), instead of it's own
// function. Same as with reads. Trace the generic path and filter:
int trace_open_entry(struct pt_regs *ctx, struct inode *inode,
struct file *file)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// btrfs filter on file->f_op == btrfs_file_operations
if ((u64)file->f_op != BTRFS_FILE_OPERATIONS)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = file;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// btrfs_sync_file():
int trace_fsync_entry(struct pt_regs *ctx, struct file *file)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = file;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
//
// Output
//
static int trace_return(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = entryinfo.lookup(&pid);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
// calculate delta
u64 ts = bpf_ktime_get_ns();
u64 delta_us = (ts - valp->ts) / 1000;
entryinfo.delete(&pid);
if (FILTER_US)
return 0;
// workaround (rewriter should handle file to d_iname in one step):
struct dentry *de = NULL;
bpf_probe_read(&de, sizeof(de), &valp->fp->f_path.dentry);
// populate output struct
u32 size = ctx->ax;
struct data_t data = {.type = type, .size = size, .delta_us = delta_us,
.pid = pid};
data.ts_us = ts / 1000;
data.offset = valp->offset;
bpf_probe_read(&data.file, sizeof(data.file), de->d_iname);
bpf_get_current_comm(&data.task, sizeof(data.task));
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_WRITE);
}
int trace_open_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_OPEN);
}
int trace_fsync_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_FSYNC);
}
"""
# code replacements
with open(kallsyms) as syms:
ops = ''
for line in syms:
a = line.rstrip().split()
(addr, name) = (a[0], a[2])
if name == "btrfs_file_operations":
ops = "0x" + addr
break
if ops == '':
print("ERROR: no btrfs_file_operations in /proc/kallsyms. Exiting.")
exit()
bpf_text = bpf_text.replace('BTRFS_FILE_OPERATIONS', ops)
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
else:
bpf_text = bpf_text.replace('FILTER_US',
'delta_us <= %s' % str(min_ms * 1000))
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
print(bpf_text)
# kernel->user event data: struct data_t
DNAME_INLINE_LEN = 32 # linux/dcache.h
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("ts_us", ct.c_ulonglong),
("type", ct.c_ulonglong),
("size", ct.c_ulonglong),
("offset", ct.c_ulonglong),
("delta_us", ct.c_ulonglong),
("pid", ct.c_ulonglong),
("task", ct.c_char * TASK_COMM_LEN),
("file", ct.c_char * DNAME_INLINE_LEN)
]
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
type = 'R'
if event.type == 1:
type = 'W'
elif event.type == 2:
type = 'O'
elif event.type == 3:
type = 'S'
if (csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
event.ts_us, event.task, event.pid, type, event.size,
event.offset, event.delta_us, event.file))
return
print("%-8s %-14.14s %-6s %1s %-7s %-8d %7.2f %s" % (strftime("%H:%M:%S"),
event.task, event.pid, type, event.size, event.offset / 1024,
float(event.delta_us) / 1000, event.file))
# initialize BPF
b = BPF(text=bpf_text)
# Common file functions. See earlier comment about generic_*().
b.attach_kprobe(event="generic_file_read_iter", fn_name="trace_read_entry")
b.attach_kprobe(event="btrfs_file_write_iter", fn_name="trace_write_entry")
b.attach_kprobe(event="generic_file_open", fn_name="trace_open_entry")
b.attach_kprobe(event="btrfs_sync_file", fn_name="trace_fsync_entry")
b.attach_kretprobe(event="generic_file_read_iter", fn_name="trace_read_return")
b.attach_kretprobe(event="btrfs_file_write_iter", fn_name="trace_write_return")
b.attach_kretprobe(event="generic_file_open", fn_name="trace_open_return")
b.attach_kretprobe(event="btrfs_sync_file", fn_name="trace_fsync_return")
# header
if (csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:
if min_ms == 0:
print("Tracing btrfs operations")
else:
print("Tracing btrfs operations slower than %d ms" % min_ms)
print("%-8s %-14s %-6s %1s %-7s %-8s %7s %s" % ("TIME", "COMM", "PID", "T",
"BYTES", "OFF_KB", "LAT(ms)", "FILENAME"))
# read events
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
Demonstrations of btrfsslower, the Linux eBPF/bcc version.
btrfsslower shows btrfs reads, writes, opens, and fsyncs, slower than a
threshold. For example:
# ./btrfsslower
Tracing btrfs operations slower than 10 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
01:22:03 randread.pl 13602 R 8192 391384 10.40 data1
01:22:03 randread.pl 13602 R 8192 92632 10.41 data1
01:22:06 randread.pl 13602 R 8192 199800 17.33 data1
01:22:06 randread.pl 13602 R 8192 415160 17.21 data1
01:22:07 randread.pl 13602 R 8192 729984 11.93 data1
01:22:09 randread.pl 13602 R 8192 342784 11.90 data1
[...]
This shows several reads from a "randread.pl" program, each 8 Kbytes in size,
and from a "data1" file. These all had over 10 ms latency.
This "latency" is measured from when the operation was issued from the VFS
interface to the file system, to when it completed. This spans everything:
block device I/O (disk I/O), file system CPU cycles, file system locks, run
queue latency, etc. This is a better measure of the latency suffered by
applications reading from the file system than measuring this down at the
block device interface.
Note that this only traces the common file system operations previously
listed: other file system operations (eg, inode operations including
getattr()) are not traced.
The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
# ./btrfsslower 1
Tracing btrfs operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
03:26:54 randread.pl 30578 R 8192 214864 1.87 data1
03:26:54 randread.pl 30578 R 8192 267600 1.48 data1
03:26:54 randread.pl 30578 R 8192 704200 1.30 data1
03:26:54 randread.pl 30578 R 8192 492352 3.09 data1
03:26:55 randread.pl 30578 R 8192 319448 1.34 data1
03:26:55 randread.pl 30578 R 8192 676032 1.88 data1
03:26:55 randread.pl 30578 R 8192 646712 2.24 data1
03:26:55 randread.pl 30578 R 8192 124376 1.02 data1
03:26:55 randread.pl 30578 R 8192 223064 2.64 data1
03:26:55 randread.pl 30578 R 8192 521280 1.55 data1
03:26:55 randread.pl 30578 R 8192 272992 2.48 data1
03:26:55 randread.pl 30578 R 8192 450112 2.67 data1
03:26:55 randread.pl 30578 R 8192 361808 1.78 data1
03:26:55 randread.pl 30578 R 8192 41088 1.46 data1
03:26:55 randread.pl 30578 R 8192 756576 1.67 data1
03:26:55 randread.pl 30578 R 8192 711776 2.74 data1
03:26:55 randread.pl 30578 R 8192 129472 1.34 data1
03:26:55 randread.pl 30578 R 8192 526928 1.82 data1
03:26:56 randread.pl 30578 R 8192 312768 1.44 data1
03:26:56 randread.pl 30578 R 8192 34720 1.14 data1
03:26:56 randread.pl 30578 R 8192 258376 1.13 data1
03:26:56 randread.pl 30578 R 8192 308456 1.44 data1
03:26:56 randread.pl 30578 R 8192 759656 1.27 data1
03:26:56 randread.pl 30578 R 8192 387424 3.24 data1
03:26:56 randread.pl 30578 R 8192 168864 3.38 data1
03:26:56 randread.pl 30578 R 8192 699296 1.38 data1
03:26:56 randread.pl 30578 R 8192 405688 2.37 data1
03:26:56 randread.pl 30578 R 8192 559064 1.18 data1
03:26:56 randread.pl 30578 R 8192 264808 1.13 data1
03:26:56 randread.pl 30578 R 8192 369240 2.20 data1
[...]
There's now much more output (this spans less than 3 seconds, the previous output
spanned 6 seconds), as the lower threshold is catching more I/O.
A threshold of 0 will trace all operations. Warning: the output will be
verbose, as it will include all file system cache hits.
# ./btrfsslower 0
Tracing btrfs operations
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
03:28:17 bash 32597 O 0 0 0.00 date.txt
03:28:17 date 32597 W 29 0 0.02 date.txt
03:28:23 cksum 32743 O 0 0 0.00 date.txt
03:28:23 cksum 32743 R 29 0 0.01 date.txt
03:28:23 cksum 32743 R 0 0 0.00 date.txt
While tracing, the following commands were run in another window:
# date > date.txt
# cksum date.txt
The output of btrfsslower now includes open operations ("O"), and writes ("W").
The first read from cksum(1) returned 29 bytes, and the second returned 0:
causing cksum(1) to stop reading.
A -j option will print just the fields (parsable output, csv):
# ./btrfsslower -j 1
ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
8930665366,randread.pl,2717,R,8192,230391808,4312,data1
8930670746,randread.pl,2717,R,8192,347832320,1296,data1
8930675995,randread.pl,2717,R,8192,409812992,4207,data1
8930680213,randread.pl,2717,R,8192,498204672,3104,data1
8930685970,randread.pl,2717,R,8192,553164800,1843,data1
8930687568,randread.pl,2717,R,8192,339492864,1475,data1
8930694108,randread.pl,2717,R,8192,500711424,6276,data1
8930697139,randread.pl,2717,R,8192,485801984,2180,data1
8930705755,randread.pl,2717,R,8192,376922112,7535,data1
8930711340,randread.pl,2717,R,8192,380084224,3314,data1
8930740964,randread.pl,2717,R,8192,226091008,24762,data1
8930743169,randread.pl,2717,R,8192,361570304,1809,data1
8930748789,randread.pl,2717,R,8192,346931200,1530,data1
8930763514,randread.pl,2717,R,8192,59719680,13938,data1
8930764870,randread.pl,2717,R,8192,406511616,1313,data1
8930774327,randread.pl,2717,R,8192,661430272,7361,data1
8930780360,randread.pl,2717,R,8192,406904832,2220,data1
8930785736,randread.pl,2717,R,8192,523419648,2005,data1
8930794560,randread.pl,2717,R,8192,342974464,8388,data1
[...]
This may be useful for visualizing with another tool, for example, for
producing a scatter plot of ENDTIME vs LATENCY, to look for time-based
patterns.
USAGE message:
# ./btrfsslower -h
usage: btrfsslower [-h] [-j] [-p PID] [min_ms]
Trace common btrfs file operations slower than a threshold
positional arguments:
min_ms minimum I/O duration to trace, in ms (default 10)
optional arguments:
-h, --help show this help message and exit
-j, --csv just print fields: comma-separated values
-p PID, --pid PID trace this PID only
examples:
./btrfsslower # trace operations slower than 10 ms (default)
./btrfsslower 1 # trace operations slower than 1 ms
./btrfsslower -j 1 # ... 1 ms, parsable output (csv)
./btrfsslower 0 # trace all operations (warning: verbose)
./btrfsslower -p 185 # trace PID 185 only
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment