Commit 0cc65a73 authored by Sasha Goldshtein's avatar Sasha Goldshtein

Merge branch 'master' into argdist-enh

parents e350115b 4d911d51
......@@ -75,6 +75,8 @@ Tools:
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[ext4dist](tools/ext4dist.py): Summarize ext4 operation latency. [Examples](tools/ext4dist_example.txt).
- tools/[ext4slower](tools/ext4slower.py): Trace slow ext4 operations. [Examples](tools/ext4slower_example.txt).
- tools/[filelife](tools/filelife.py): Trace the lifespan of short-lived files. [Examples](tools/filelife_example.txt).
- tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
- tools/[filetop](tools/filetop.py): File reads and writes by filename and process. Top for files. [Examples](tools/filetop_example.txt).
......@@ -100,6 +102,8 @@ Tools:
- tools/[vfscount](tools/vfscount.py) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
- tools/[vfsstat](tools/vfsstat.py) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).
- tools/[wakeuptime](tools/wakeuptime.py): Summarize sleep to wakeup time by waker kernel stack. [Examples](tools/wakeuptime_example.txt).
- tools/[xfsdist](tools/xfsdist.py): Summarize XFS operation latency. [Examples](tools/xfsdist_example.txt).
- tools/[xfsslower](tools/xfsslower.py): Trace slow XFS operations. [Examples](tools/xfsslower_example.txt).
### Networking
......
......@@ -10,6 +10,10 @@ entered command may fail: this is just showing what was entered.
This program is also a basic example of eBPF/bcc and uprobes.
This makes use of a Linux 4.5 feature (bpf_perf_event_output());
for kernels older than 4.5, see the version under tools/old,
which uses an older mechanism
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
......
......@@ -15,6 +15,10 @@ request, as well as a starting timestamp for calculating I/O latency.
This works by tracing various kernel blk_*() functions using dynamic tracing,
and will need updating to match any changes to these functions.
This makes use of a Linux 4.5 feature (bpf_perf_event_output());
for kernels older than 4.5, see the version under tools/old,
which uses an older mechanism
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
......
.TH ext4dist 8 "2016-02-12" "USER COMMANDS"
.SH NAME
ext4dist \- Summarize ext4 operation latency. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B ext4dist [\-h] [\-T] [\-N] [\-d] [interval] [count]
.SH DESCRIPTION
This tool summarizes time (latency) spent in common ext4 file operations: reads,
writes, opens, and syncs, and presents it as a power-of-2 histogram. It uses an
in-kernel eBPF map to store the histogram for efficiency.
Since this works by tracing the ext4_file_operations interface functions, it
will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-T
Don't include timestamps on interval output.
.TP
\-m
Output in milliseconds.
.TP
\-p PID
Trace this PID only.
.SH EXAMPLES
.TP
Trace ext4 operation time, and print a summary on Ctrl-C:
#
.B ext4dist
.TP
Trace PID 181 only:
#
.B ext4dist -p 181
.TP
Print 1 second summaries, 10 times:
#
.B ext4dist 1 10
.TP
1 second summaries, printed in milliseconds
#
.B ext4dist \-m 1
.SH FIELDS
.TP
msecs
Range of milliseconds for this bucket.
.TP
usecs
Range of microseconds for this bucket.
.TP
count
Number of operations in this time range.
.TP
distribution
ASCII representation of the distribution (the count column).
.SH OVERHEAD
This adds low-overhead instrumentation to these ext4 operations,
including reads and writes from the file system cache. Such reads and writes
can be very frequent (depending on the workload; eg, 1M/sec), at which
point the overhead of this tool may become noticeable.
Measure and quantify before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
ext4snoop(8)
.TH ext4slower 8 "2016-02-11" "USER COMMANDS"
.SH NAME
ext4slower \- Trace slow ext4 file operations, with per-event details.
.SH SYNOPSIS
.B ext4slower [\-h] [\-j] [\-p PID] [min_ms]
.SH DESCRIPTION
This tool traces common ext4 file operations: reads, writes, opens, and
syncs. It measures the time spent in these operations, and prints details
for each that exceeded a threshold.
WARNING: See the OVERHEAD section.
By default, a minimum millisecond threshold of 10 is used. If a threshold of 0
is used, all events are printed (warning: verbose).
Since this works by tracing the ext4_file_operations interface functions, it
will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
\-p PID
Trace this PID only.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace synchronous file reads and writes slower than 10 ms:
#
.B ext4slower
.TP
Trace slower than 1 ms:
#
.B ext4slower 1
.TP
Trace slower than 1 ms, and output just the fields in parsable format (csv):
#
.B ext4slower \-j 1
.TP
Trace all file reads and writes (warning: the output will be verbose):
#
.B ext4slower 0
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B ext4slower \-p 181 1
.SH FIELDS
.TP
TIME(s)
Time of I/O completion since the first I/O seen, in seconds.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
T
Type of operation. R == read, W == write, O == open, S == fsync.
.TP
OFF_KB
File offset for the I/O, in Kbytes.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when it was issued by VFS to the
filesystem, to when it completed. This time is inclusive of block device I/O,
file system CPU cycles, file system locks, run queue latency, etc. It's a more
accurate measure of the latency suffered by applications performing file
system I/O, than to measure this down at the block device interface.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.TP
ENDTIME_us
Completion timestamp, microseconds (\-j only).
.TP
OFFSET_b
File offset, bytes (\-j only).
.TP
LATENCY_us
Latency (duration) of the I/O, in microseconds (\-j only).
.SH OVERHEAD
This adds low-overhead instrumentation to these ext4 operations,
including reads and writes from the file system cache. Such reads and writes
can be very frequent (depending on the workload; eg, 1M/sec), at which
point the overhead of this tool (even if it prints no "slower" events) can
begin to become significant. Measure and quantify before use. If this
continues to be a problem, consider switching to a tool that prints in-kernel
summaries only.
.PP
Note that the overhead of this tool should be less than fileslower(8), as
this tool targets ext4 functions only, and not all file read/write paths
(which can include socket I/O).
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
biosnoop(8), funccount(8), fileslower(8)
.TH xfsdist 8 "2016-02-12" "USER COMMANDS"
.SH NAME
xfsdist \- Summarize XFS operation latency. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B xfsdist [\-h] [\-T] [\-N] [\-d] [interval] [count]
.SH DESCRIPTION
This tool summarizes time (latency) spent in common XFS file operations: reads,
writes, opens, and syncs, and presents it as a power-of-2 histogram. It uses an
in-kernel eBPF map to store the histogram for efficiency.
Since this works by tracing the xfs_file_operations interface functions, it
will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-T
Don't include timestamps on interval output.
.TP
\-m
Output in milliseconds.
.TP
\-p PID
Trace this PID only.
.SH EXAMPLES
.TP
Trace XFS operation time, and print a summary on Ctrl-C:
#
.B xfsdist
.TP
Trace PID 181 only:
#
.B xfsdist -p 181
.TP
Print 1 second summaries, 10 times:
#
.B xfsdist 1 10
.TP
1 second summaries, printed in milliseconds
#
.B xfsdist \-m 1
.SH FIELDS
.TP
msecs
Range of milliseconds for this bucket.
.TP
usecs
Range of microseconds for this bucket.
.TP
count
Number of operations in this time range.
.TP
distribution
ASCII representation of the distribution (the count column).
.SH OVERHEAD
This adds low-overhead instrumentation to these XFS operations,
including reads and writes from the file system cache. Such reads and writes
can be very frequent (depending on the workload; eg, 1M/sec), at which
point the overhead of this tool may become noticeable.
Measure and quantify before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
xfssnoop(8)
.TH xfsslower 8 "2016-02-11" "USER COMMANDS"
.SH NAME
xfsslower \- Trace slow xfs file operations, with per-event details.
.SH SYNOPSIS
.B xfsslower [\-h] [\-j] [\-p PID] [min_ms]
.SH DESCRIPTION
This tool traces common XFS file operations: reads, writes, opens, and
syncs. It measures the time spent in these operations, and prints details
for each that exceeded a threshold.
WARNING: See the OVERHEAD section.
By default, a minimum millisecond threshold of 10 is used. If a threshold of 0
is used, all events are printed (warning: verbose).
Since this works by tracing the xfs_file_operations interface functions, it
will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
\-p PID
Trace this PID only.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace synchronous file reads and writes slower than 10 ms:
#
.B xfsslower
.TP
Trace slower than 1 ms:
#
.B xfsslower 1
.TP
Trace slower than 1 ms, and output just the fields in parsable format (csv):
#
.B xfsslower \-j 1
.TP
Trace all file reads and writes (warning: the output will be verbose):
#
.B xfsslower 0
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B xfsslower \-p 181 1
.SH FIELDS
.TP
TIME(s)
Time of I/O completion since the first I/O seen, in seconds.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
T
Type of operation. R == read, W == write, O == open, S == fsync.
.TP
OFF_KB
File offset for the I/O, in Kbytes.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when it was issued by VFS to the
filesystem, to when it completed. This time is inclusive of block device I/O,
file system CPU cycles, file system locks, run queue latency, etc. It's a more
accurate measure of the latency suffered by applications performing file
system I/O, than to measure this down at the block device interface.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.TP
ENDTIME_us
Completion timestamp, microseconds (\-j only).
.TP
OFFSET_b
File offset, bytes (\-j only).
.TP
LATENCY_us
Latency (duration) of the I/O, in microseconds (\-j only).
.SH OVERHEAD
This adds low-overhead instrumentation to these XFS operations,
including reads and writes from the file system cache. Such reads and writes
can be very frequent (depending on the workload; eg, 1M/sec), at which
point the overhead of this tool (even if it prints no "slower" events) can
begin to become significant. Measure and quantify before use. If this
continues to be a problem, consider switching to a tool that prints in-kernel
summaries only.
.PP
Note that the overhead of this tool should be less than fileslower(8), as
this tool targets xfs functions only, and not all file read/write paths
(which can include socket I/O).
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
biosnoop(8), funccount(8), fileslower(8)
......@@ -9,35 +9,55 @@
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 28-Jan-2016 Brendan Gregg Created this.
# 12-Feb-2016 Allan McAleavy migrated to BPF_PERF_OUTPUT
from __future__ import print_function
from bcc import BPF
from time import strftime
import ctypes as ct
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
struct str_t {
u64 pid;
char str[80];
};
BPF_PERF_OUTPUT(events);
int printret(struct pt_regs *ctx) {
struct str_t data = {};
u32 pid;
if (!ctx->ax)
return 0;
char str[80] = {};
bpf_probe_read(&str, sizeof(str), (void *)ctx->ax);
bpf_trace_printk("%s\\n", &str);
pid = bpf_get_current_pid_tgid();
data.pid = pid;
bpf_probe_read(&data.str, sizeof(data.str), (void *)ctx->ax);
events.perf_submit(ctx,&data,sizeof(data));
return 0;
};
"""
STR_DATA = 80
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("str", ct.c_char * STR_DATA)
]
b = BPF(text=bpf_text)
b.attach_uretprobe(name="/bin/bash", sym="readline", fn_name="printret")
# header
print("%-9s %-6s %s" % ("TIME", "PID", "COMMAND"))
# format output
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
print("%-9s %-6d %s" % (strftime("%H:%M:%S"), event.pid, event.str))
b["events"].open_perf_buffer(print_event)
while 1:
try:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
except ValueError:
continue
print("%-9s %-6d %s" % (strftime("%H:%M:%S"), pid, msg))
b.kprobe_poll()
......@@ -11,9 +11,12 @@
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 16-Sep-2015 Brendan Gregg Created this.
# 11-Feb-2016 Allan McAleavy updated for BPF_PERF_OUTPUT
from __future__ import print_function
from bcc import BPF
import ctypes as ct
import re
# load BPF program
b = BPF(text="""
......@@ -25,8 +28,20 @@ struct val_t {
char name[TASK_COMM_LEN];
};
struct data_t {
u32 pid;
u64 rwflag;
u64 delta;
u64 sector;
u64 len;
u64 ts;
char disk_name[DISK_NAME_LEN];
char name[TASK_COMM_LEN];
};
BPF_HASH(start, struct request *);
BPF_HASH(infobyreq, struct request *, struct val_t);
BPF_PERF_OUTPUT(events);
// cache PID and comm by-req
int trace_pid_start(struct pt_regs *ctx, struct request *req)
......@@ -37,7 +52,6 @@ int trace_pid_start(struct pt_regs *ctx, struct request *req)
val.pid = bpf_get_current_pid_tgid();
infobyreq.update(&req, &val);
}
return 0;
}
......@@ -58,6 +72,8 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
u64 *tsp, delta;
u32 *pidp = 0;
struct val_t *valp;
struct data_t data ={};
u64 ts;
// fetch timestamp and calculate delta
tsp = start.lookup(&req);
......@@ -65,30 +81,29 @@ int trace_req_completion(struct pt_regs *ctx, struct request *req)
// missed tracing issue
return 0;
}
delta = bpf_ktime_get_ns() - *tsp;
ts = bpf_ktime_get_ns();
data.delta = ts - *tsp;
data.ts = ts / 1000;
//
// Fetch and output issuing pid and comm.
// As bpf_trace_prink() is limited to a maximum of 1 string and 2
// integers, we'll use more than one to output the data.
//
valp = infobyreq.lookup(&req);
if (valp == 0) {
bpf_trace_printk("0 0 ? %d\\n", req->__data_len);
data.len = req->__data_len;
strcpy(data.name,"?");
} else {
bpf_trace_printk("0 %d %s %d\\n", valp->pid, valp->name,
req->__data_len);
data.pid = valp->pid;
data.len = req->__data_len;
data.sector = req->__sector;
bpf_probe_read(&data.name, sizeof(data.name), valp->name);
bpf_probe_read(&data.disk_name, sizeof(data.disk_name),
req->rq_disk->disk_name);
}
// output remaining details
if (req->cmd_flags & REQ_WRITE) {
bpf_trace_printk("1 W %s %d %d ?\\n", req->rq_disk->disk_name,
req->__sector, delta / 1000);
data.rwflag=1;
} else {
bpf_trace_printk("1 R %s %d %d ?\\n", req->rq_disk->disk_name,
req->__sector, delta / 1000);
data.rwflag=0;
}
events.perf_submit(ctx,&data,sizeof(data));
start.delete(&req);
infobyreq.delete(&req);
......@@ -101,29 +116,61 @@ b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_completion",
fn_name="trace_req_completion")
TASK_COMM_LEN = 16 # linux/sched.h
DISK_NAME_LEN = 32 # linux/genhd.h
class Data(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("rwflag", ct.c_ulonglong),
("delta", ct.c_ulonglong),
("sector", ct.c_ulonglong),
("len", ct.c_ulonglong),
("ts", ct.c_ulonglong),
("disk_name", ct.c_char * DISK_NAME_LEN),
("name", ct.c_char * TASK_COMM_LEN)
]
# header
print("%-14s %-14s %-6s %-7s %-2s %-9s %-7s %7s" % ("TIME(s)", "COMM", "PID",
"DISK", "T", "SECTOR", "BYTES", "LAT(ms)"))
rwflg = ""
start_ts = 0
prev_ts = 0
delta = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
args = msg.split(" ")
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
if start_ts == 0:
start_ts = ts
val = -1
global start_ts
global prev_ts
global delta
if event.rwflag == 1:
rwflg = "W"
if event.rwflag == 0:
rwflg = "R"
if not re.match('\?', event.name):
val = event.sector
if args[0] == "0":
(real_pid, real_comm, bytes_s) = (args[1], args[2], args[3])
continue
else:
(type_s, disk_s, sector_s, us_s) = (args[1], args[2], args[3],
args[4])
if start_ts == 0:
prev_ts = start_ts
ms = float(int(us_s, 10)) / 1000
if start_ts == 1:
delta = float(delta) + (event.ts - prev_ts)
print("%-14.9f %-14.14s %-6s %-7s %-2s %-9s %-7s %7.2f" % (
ts - start_ts, real_comm, real_pid, disk_s, type_s, sector_s,
bytes_s, ms))
delta / 1000000, event.name, event.pid, event.disk_name, rwflg, val,
event.len, float(event.delta) / 1000000))
prev_ts = event.ts
start_ts = 1
# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# ext4dist Summarize ext4 operation latency.
# For Linux, uses BCC, eBPF.
#
# USAGE: ext4dist [-h] [-T] [-m] [-p PID] [interval] [count]
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
# symbols
kallsyms = "/proc/kallsyms"
# arguments
examples = """examples:
./ext4dist # show operation latency as a histogram
./ext4dist -p 181 # trace PID 181 only
./ext4dist 1 10 # print 1 second summaries, 10 times
./ext4dist -m 5 # 5s summaries, milliseconds
"""
parser = argparse.ArgumentParser(
description="Summarize ext4 operation latency",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-T", "--notimestamp", action="store_true",
help="don't include timestamp on interval output")
parser.add_argument("-m", "--milliseconds", action="store_true",
help="output in milliseconds")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("interval", nargs="?",
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
pid = args.pid
countdown = int(args.count)
if args.milliseconds:
factor = 1000000
label = "msecs"
else:
factor = 1000
label = "usecs"
if args.interval and int(args.interval) == 0:
print("ERROR: interval 0. Exiting.")
exit()
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#define OP_NAME_LEN 8
typedef struct dist_key {
char op[OP_NAME_LEN];
u64 slot;
} dist_key_t;
BPF_HASH(start, u32);
BPF_HISTOGRAM(dist, dist_key_t);
// time operation
int trace_entry(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
// The current ext4 (Linux 4.5) uses generic_file_read_iter(), instead of it's
// own function, for reads. So we need to trace that and then filter on ext4,
// which I do by checking file->f_op.
int trace_read_entry(struct pt_regs *ctx, struct kiocb *iocb)
{
u32 pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// ext4 filter on file->f_op == ext4_file_operations
struct file *fp = iocb->ki_filp;
if ((u64)fp->f_op != EXT4_FILE_OPERATIONS)
return 0;
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
static int trace_return(struct pt_regs *ctx, const char *op)
{
u64 *tsp;
u32 pid = bpf_get_current_pid_tgid();
// fetch timestamp and calculate delta
tsp = start.lookup(&pid);
if (tsp == 0) {
return 0; // missed start or filtered
}
u64 delta = (bpf_ktime_get_ns() - *tsp) / FACTOR;
// store as histogram
dist_key_t key = {.slot = bpf_log2l(delta)};
__builtin_memcpy(&key.op, op, sizeof(key.op));
dist.increment(key);
start.delete(&pid);
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
char *op = "read";
return trace_return(ctx, op);
}
int trace_write_return(struct pt_regs *ctx)
{
char *op = "write";
return trace_return(ctx, op);
}
int trace_open_return(struct pt_regs *ctx)
{
char *op = "open";
return trace_return(ctx, op);
}
int trace_fsync_return(struct pt_regs *ctx)
{
char *op = "fsync";
return trace_return(ctx, op);
}
"""
# code replacements
with open(kallsyms) as syms:
ops = ''
for line in syms:
(addr, size, name) = line.rstrip().split(" ", 2)
if name == "ext4_file_operations":
ops = "0x" + addr
break
if ops == '':
print("ERROR: no ext4_file_operations in /proc/kallsyms. Exiting.")
exit()
bpf_text = bpf_text.replace('EXT4_FILE_OPERATIONS', ops)
bpf_text = bpf_text.replace('FACTOR', str(factor))
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
print(bpf_text)
# load BPF program
b = BPF(text=bpf_text)
# Common file functions. See earlier comment about generic_file_read_iter().
b.attach_kprobe(event="generic_file_read_iter", fn_name="trace_read_entry")
b.attach_kprobe(event="ext4_file_write_iter", fn_name="trace_entry")
b.attach_kprobe(event="ext4_file_open", fn_name="trace_entry")
b.attach_kprobe(event="ext4_sync_file", fn_name="trace_entry")
b.attach_kretprobe(event="generic_file_read_iter", fn_name="trace_read_return")
b.attach_kretprobe(event="ext4_file_write_iter", fn_name="trace_write_return")
b.attach_kretprobe(event="ext4_file_open", fn_name="trace_open_return")
b.attach_kretprobe(event="ext4_sync_file", fn_name="trace_fsync_return")
print("Tracing ext4 operation latency... Hit Ctrl-C to end.")
# output
exiting = 0
dist = b.get_table("dist")
while (1):
try:
if args.interval:
sleep(int(args.interval))
else:
sleep(99999999)
except KeyboardInterrupt:
exiting = 1
print()
if args.interval and (not args.notimestamp):
print(strftime("%H:%M:%S:"))
dist.print_log2_hist(label, "operation")
dist.clear()
countdown -= 1
if exiting or countdown == 0:
exit()
Demonstrations of ext4dist, the Linux eBPF/bcc version.
ext4dist traces ext4 reads, writes, opens, and fsyncs, and summarizes their
latency as a power-of-2 histogram. For example:
# ./ext4dist
Tracing ext4 operation latency... Hit Ctrl-C to end.
^C
operation = 'read'
usecs : count distribution
0 -> 1 : 1210 |****************************************|
2 -> 3 : 126 |**** |
4 -> 7 : 376 |************ |
8 -> 15 : 86 |** |
16 -> 31 : 9 | |
32 -> 63 : 47 |* |
64 -> 127 : 6 | |
128 -> 255 : 24 | |
256 -> 511 : 137 |**** |
512 -> 1023 : 66 |** |
1024 -> 2047 : 13 | |
2048 -> 4095 : 7 | |
4096 -> 8191 : 13 | |
8192 -> 16383 : 3 | |
operation = 'write'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 75 |****************************************|
16 -> 31 : 5 |** |
operation = 'open'
usecs : count distribution
0 -> 1 : 1278 |****************************************|
2 -> 3 : 40 |* |
4 -> 7 : 4 | |
8 -> 15 : 1 | |
16 -> 31 : 1 | |
This output shows a bi-modal distribution for read latency, with a faster
mode of less than 7 microseconds, and a slower mode of between 256 and 1023
microseconds. The count column shows how many events fell into that latency
range. It's likely that the faster mode was a hit from the in-memory file
system cache, and the slower mode is a read from a storage device (disk).
This "latency" is measured from when the operation was issued from the VFS
interface to the file system, to when it completed. This spans everything:
block device I/O (disk I/O), file system CPU cycles, file system locks, run
queue latency, etc. This is a better measure of the latency suffered by
applications reading from the file system than measuring this down at the
block device interface.
Note that this only traces the common file system operations previously
listed: other file system operations (eg, inode operations including
getattr()) are not traced.
An optional interval and a count can be provided, as well as -m to show the
distributions in milliseconds. For example:
# ./ext4dist -m 1 5
Tracing ext4 operation latency... Hit Ctrl-C to end.
10:19:00:
operation = 'read'
msecs : count distribution
0 -> 1 : 576 |****************************************|
2 -> 3 : 5 | |
4 -> 7 : 6 | |
8 -> 15 : 13 | |
16 -> 31 : 17 |* |
32 -> 63 : 5 | |
64 -> 127 : 1 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 20 |****************************************|
operation = 'open'
msecs : count distribution
0 -> 1 : 346 |****************************************|
10:19:01:
operation = 'read'
msecs : count distribution
0 -> 1 : 584 |****************************************|
2 -> 3 : 10 | |
4 -> 7 : 11 | |
8 -> 15 : 16 |* |
16 -> 31 : 6 | |
32 -> 63 : 4 | |
64 -> 127 : 2 | |
128 -> 255 : 1 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 20 |****************************************|
operation = 'open'
msecs : count distribution
0 -> 1 : 336 |****************************************|
10:19:02:
operation = 'read'
msecs : count distribution
0 -> 1 : 678 |****************************************|
2 -> 3 : 7 | |
4 -> 7 : 9 | |
8 -> 15 : 25 |* |
16 -> 31 : 10 | |
32 -> 63 : 3 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 19 |****************************************|
2 -> 3 : 1 |** |
operation = 'open'
msecs : count distribution
0 -> 1 : 390 |****************************************|
10:19:03:
operation = 'read'
msecs : count distribution
0 -> 1 : 567 |****************************************|
2 -> 3 : 7 | |
4 -> 7 : 9 | |
8 -> 15 : 20 |* |
16 -> 31 : 15 |* |
32 -> 63 : 5 | |
64 -> 127 : 2 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 20 |****************************************|
operation = 'open'
msecs : count distribution
0 -> 1 : 417 |****************************************|
10:19:04:
operation = 'read'
msecs : count distribution
0 -> 1 : 762 |****************************************|
2 -> 3 : 9 | |
4 -> 7 : 9 | |
8 -> 15 : 11 | |
16 -> 31 : 20 |* |
32 -> 63 : 4 | |
64 -> 127 : 1 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 20 |****************************************|
operation = 'open'
msecs : count distribution
0 -> 1 : 427 |****************************************|
This shows a mixed read/write workload.
USAGE message:
# ./ext4dist -h
usage: ext4dist [-h] [-T] [-m] [-p PID] [interval] [count]
Summarize ext4 operation latency
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-T, --notimestamp don't include timestamp on interval output
-m, --milliseconds output in milliseconds
-p PID, --pid PID trace this PID only
examples:
./ext4dist # show operation latency as a histogram
./ext4dist -p 181 # trace PID 181 only
./ext4dist 1 10 # print 1 second summaries, 10 times
./ext4dist -m 5 # 5s summaries, milliseconds
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# ext4slower Trace slow ext4 operations.
# For Linux, uses BCC, eBPF.
#
# USAGE: ext4slower [-h] [-j] [-p PID] [min_ms]
#
# This script traces common ext4 file operations: reads, writes, opens, and
# syncs. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
#
# WARNING: This adds low-overhead instrumentation to these ext4 operations,
# including reads and writes from the file system cache. Such reads and writes
# can be very frequent (depending on the workload; eg, 1M/sec), at which
# point the overhead of this tool (even if it prints no "slower" events) can
# begin to become significant.
#
# By default, a minimum millisecond threshold of 10 is used.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 11-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct
# symbols
kallsyms = "/proc/kallsyms"
# arguments
examples = """examples:
./ext4slower # trace operations slower than 10 ms (default)
./ext4slower 1 # trace operations slower than 1 ms
./ext4slower -j 1 # ... 1 ms, parsable output (csv)
./ext4slower 0 # trace all operations (warning: verbose)
./ext4slower -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Trace common ext4 file operations slower than a threshold",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-j", "--csv", action="store_true",
help="just print fields: comma-separated values")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("min_ms", nargs="?", default='10',
help="minimum I/O duration to trace, in ms (default 10)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
csv = args.csv
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
// XXX: switch these to char's when supported
#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_FSYNC 3
struct val_t {
u64 ts;
u64 offset;
struct file *fp;
};
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
u64 type;
u64 size;
u64 offset;
u64 delta_us;
u64 pid;
char task[TASK_COMM_LEN];
char file[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, pid_t, struct val_t);
BPF_PERF_OUTPUT(events);
//
// Store timestamp and size on entry
//
// The current ext4 (Linux 4.5) uses generic_file_read_iter(), instead of it's
// own function, for reads. So we need to trace that and then filter on ext4,
// which I do by checking file->f_op.
int trace_read_entry(struct pt_regs *ctx, struct kiocb *iocb)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// ext4 filter on file->f_op == ext4_file_operations
struct file *fp = iocb->ki_filp;
if ((u64)fp->f_op != EXT4_FILE_OPERATIONS)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = fp;
val.offset = iocb->ki_pos;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// ext4_file_write_iter():
int trace_write_entry(struct pt_regs *ctx, struct kiocb *iocb)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = iocb->ki_filp;
val.offset = iocb->ki_pos;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// ext4_file_open():
int trace_open_entry(struct pt_regs *ctx, struct inode *inode,
struct file *file)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = file;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// ext4_sync_file():
int trace_fsync_entry(struct pt_regs *ctx, struct file *file)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = file;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
//
// Output
//
static int trace_return(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = entryinfo.lookup(&pid);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
// calculate delta
u64 ts = bpf_ktime_get_ns();
u64 delta_us = (ts - valp->ts) / 1000;
entryinfo.delete(&pid);
if (FILTER_US)
return 0;
// workaround (rewriter should handle file to d_iname in one step):
struct dentry *de = NULL;
bpf_probe_read(&de, sizeof(de), &valp->fp->f_path.dentry);
// populate output struct
u32 size = ctx->ax;
struct data_t data = {.type = type, .size = size, .delta_us = delta_us,
.pid = pid};
data.ts_us = ts / 1000;
data.offset = valp->offset;
bpf_probe_read(&data.file, sizeof(data.file), de->d_iname);
bpf_get_current_comm(&data.task, sizeof(data.task));
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_WRITE);
}
int trace_open_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_OPEN);
}
int trace_fsync_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_FSYNC);
}
"""
# code replacements
with open(kallsyms) as syms:
ops = ''
for line in syms:
(addr, size, name) = line.rstrip().split(" ", 2)
if name == "ext4_file_operations":
ops = "0x" + addr
break
if ops == '':
print("ERROR: no ext4_file_operations in /proc/kallsyms. Exiting.")
exit()
bpf_text = bpf_text.replace('EXT4_FILE_OPERATIONS', ops)
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
else:
bpf_text = bpf_text.replace('FILTER_US',
'delta_us <= %s' % str(min_ms * 1000))
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
print(bpf_text)
# kernel->user event data: struct data_t
DNAME_INLINE_LEN = 32 # linux/dcache.h
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("ts_us", ct.c_ulonglong),
("type", ct.c_ulonglong),
("size", ct.c_ulonglong),
("offset", ct.c_ulonglong),
("delta_us", ct.c_ulonglong),
("pid", ct.c_ulonglong),
("task", ct.c_char * TASK_COMM_LEN),
("file", ct.c_char * DNAME_INLINE_LEN)
]
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
type = 'R'
if event.type == 1:
type = 'W'
elif event.type == 2:
type = 'O'
elif event.type == 3:
type = 'S'
if (csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
event.ts_us, event.task, event.pid, type, event.size,
event.offset, event.delta_us, event.file))
return
print("%-8s %-14.14s %-6s %1s %-7s %-8d %7.2f %s" % (strftime("%H:%M:%S"),
event.task, event.pid, type, event.size, event.offset / 1024,
float(event.delta_us) / 1000, event.file))
# initialize BPF
b = BPF(text=bpf_text)
# Common file functions. See earlier comment about generic_file_read_iter().
b.attach_kprobe(event="generic_file_read_iter", fn_name="trace_read_entry")
b.attach_kprobe(event="ext4_file_write_iter", fn_name="trace_write_entry")
b.attach_kprobe(event="ext4_file_open", fn_name="trace_open_entry")
b.attach_kprobe(event="ext4_sync_file", fn_name="trace_fsync_entry")
b.attach_kretprobe(event="generic_file_read_iter", fn_name="trace_read_return")
b.attach_kretprobe(event="ext4_file_write_iter", fn_name="trace_write_return")
b.attach_kretprobe(event="ext4_file_open", fn_name="trace_open_return")
b.attach_kretprobe(event="ext4_sync_file", fn_name="trace_fsync_return")
# header
if (csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:
if min_ms == 0:
print("Tracing ext4 operations")
else:
print("Tracing ext4 operations slower than %d ms" % min_ms)
print("%-8s %-14s %-6s %1s %-7s %-8s %7s %s" % ("TIME", "COMM", "PID", "T",
"BYTES", "OFF_KB", "LAT(ms)", "FILENAME"))
# read events
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
This diff is collapsed.
#!/usr/bin/python
#
# bashreadline Print entered bash commands from all running shells.
# For Linux, uses BCC, eBPF. Embedded C.
#
# This works by tracing the readline() function using a uretprobe (uprobes).
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 28-Jan-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import strftime
# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int printret(struct pt_regs *ctx) {
if (!ctx->ax)
return 0;
char str[80] = {};
bpf_probe_read(&str, sizeof(str), (void *)ctx->ax);
bpf_trace_printk("%s\\n", &str);
return 0;
};
"""
b = BPF(text=bpf_text)
b.attach_uretprobe(name="/bin/bash", sym="readline", fn_name="printret")
# header
print("%-9s %-6s %s" % ("TIME", "PID", "COMMAND"))
# format output
while 1:
try:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
except ValueError:
continue
print("%-9s %-6d %s" % (strftime("%H:%M:%S"), pid, msg))
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# biosnoop Trace block device I/O and print details including issuing PID.
# For Linux, uses BCC, eBPF.
#
# This uses in-kernel eBPF maps to cache process details (PID and comm) by I/O
# request, as well as a starting timestamp for calculating I/O latency.
#
# Copyright (c) 2015 Brendan Gregg.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 16-Sep-2015 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>
struct val_t {
u32 pid;
char name[TASK_COMM_LEN];
};
BPF_HASH(start, struct request *);
BPF_HASH(infobyreq, struct request *, struct val_t);
// cache PID and comm by-req
int trace_pid_start(struct pt_regs *ctx, struct request *req)
{
struct val_t val = {};
if (bpf_get_current_comm(&val.name, sizeof(val.name)) == 0) {
val.pid = bpf_get_current_pid_tgid();
infobyreq.update(&req, &val);
}
return 0;
}
// time block I/O
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
u64 ts;
ts = bpf_ktime_get_ns();
start.update(&req, &ts);
return 0;
}
// output
int trace_req_completion(struct pt_regs *ctx, struct request *req)
{
u64 *tsp, delta;
u32 *pidp = 0;
struct val_t *valp;
// fetch timestamp and calculate delta
tsp = start.lookup(&req);
if (tsp == 0) {
// missed tracing issue
return 0;
}
delta = bpf_ktime_get_ns() - *tsp;
//
// Fetch and output issuing pid and comm.
// As bpf_trace_prink() is limited to a maximum of 1 string and 2
// integers, we'll use more than one to output the data.
//
valp = infobyreq.lookup(&req);
if (valp == 0) {
bpf_trace_printk("0 0 ? %d\\n", req->__data_len);
} else {
bpf_trace_printk("0 %d %s %d\\n", valp->pid, valp->name,
req->__data_len);
}
// output remaining details
if (req->cmd_flags & REQ_WRITE) {
bpf_trace_printk("1 W %s %d %d ?\\n", req->rq_disk->disk_name,
req->__sector, delta / 1000);
} else {
bpf_trace_printk("1 R %s %d %d ?\\n", req->rq_disk->disk_name,
req->__sector, delta / 1000);
}
start.delete(&req);
infobyreq.delete(&req);
return 0;
}
""", debug=0)
b.attach_kprobe(event="blk_account_io_start", fn_name="trace_pid_start")
b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_completion",
fn_name="trace_req_completion")
# header
print("%-14s %-14s %-6s %-7s %-2s %-9s %-7s %7s" % ("TIME(s)", "COMM", "PID",
"DISK", "T", "SECTOR", "BYTES", "LAT(ms)"))
start_ts = 0
# format output
while 1:
(task, pid, cpu, flags, ts, msg) = b.trace_fields()
args = msg.split(" ")
if start_ts == 0:
start_ts = ts
if args[0] == "0":
(real_pid, real_comm, bytes_s) = (args[1], args[2], args[3])
continue
else:
(type_s, disk_s, sector_s, us_s) = (args[1], args[2], args[3],
args[4])
ms = float(int(us_s, 10)) / 1000
print("%-14.9f %-14.14s %-6s %-7s %-2s %-9s %-7s %7.2f" % (
ts - start_ts, real_comm, real_pid, disk_s, type_s, sector_s,
bytes_s, ms))
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# xfsdist Summarize XFS operation latency.
# For Linux, uses BCC, eBPF.
#
# USAGE: xfsdist [-h] [-T] [-m] [-p PID] [interval] [count]
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 12-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
# arguments
examples = """examples:
./xfsdist # show operation latency as a histogram
./xfsdist -p 181 # trace PID 181 only
./xfsdist 1 10 # print 1 second summaries, 10 times
./xfsdist -m 5 # 5s summaries, milliseconds
"""
parser = argparse.ArgumentParser(
description="Summarize XFS operation latency",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-T", "--notimestamp", action="store_true",
help="don't include timestamp on interval output")
parser.add_argument("-m", "--milliseconds", action="store_true",
help="output in milliseconds")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("interval", nargs="?",
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
pid = args.pid
countdown = int(args.count)
if args.milliseconds:
factor = 1000000
label = "msecs"
else:
factor = 1000
label = "usecs"
if args.interval and int(args.interval) == 0:
print("ERROR: interval 0. Exiting.")
exit()
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#define OP_NAME_LEN 8
typedef struct dist_key {
char op[OP_NAME_LEN];
u64 slot;
} dist_key_t;
BPF_HASH(start, u32);
BPF_HISTOGRAM(dist, dist_key_t);
// time operation
int trace_entry(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
static int trace_return(struct pt_regs *ctx, const char *op)
{
u64 *tsp;
u32 pid = bpf_get_current_pid_tgid();
// fetch timestamp and calculate delta
tsp = start.lookup(&pid);
if (tsp == 0) {
return 0; // missed start or filtered
}
u64 delta = (bpf_ktime_get_ns() - *tsp) / FACTOR;
// store as histogram
dist_key_t key = {.slot = bpf_log2l(delta)};
__builtin_memcpy(&key.op, op, sizeof(key.op));
dist.increment(key);
start.delete(&pid);
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
char *op = "read";
return trace_return(ctx, op);
}
int trace_write_return(struct pt_regs *ctx)
{
char *op = "write";
return trace_return(ctx, op);
}
int trace_open_return(struct pt_regs *ctx)
{
char *op = "open";
return trace_return(ctx, op);
}
int trace_fsync_return(struct pt_regs *ctx)
{
char *op = "fsync";
return trace_return(ctx, op);
}
"""
bpf_text = bpf_text.replace('FACTOR', str(factor))
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
print(bpf_text)
# load BPF program
b = BPF(text=bpf_text)
# common file functions
b.attach_kprobe(event="xfs_file_read_iter", fn_name="trace_entry")
b.attach_kprobe(event="xfs_file_write_iter", fn_name="trace_entry")
b.attach_kprobe(event="xfs_file_open", fn_name="trace_entry")
b.attach_kprobe(event="xfs_file_fsync", fn_name="trace_entry")
b.attach_kretprobe(event="xfs_file_read_iter", fn_name="trace_read_return")
b.attach_kretprobe(event="xfs_file_write_iter", fn_name="trace_write_return")
b.attach_kretprobe(event="xfs_file_open", fn_name="trace_open_return")
b.attach_kretprobe(event="xfs_file_fsync", fn_name="trace_fsync_return")
print("Tracing XFS operation latency... Hit Ctrl-C to end.")
# output
exiting = 0
dist = b.get_table("dist")
while (1):
try:
if args.interval:
sleep(int(args.interval))
else:
sleep(99999999)
except KeyboardInterrupt:
exiting = 1
print()
if args.interval and (not args.notimestamp):
print(strftime("%H:%M:%S:"))
dist.print_log2_hist(label, "operation")
dist.clear()
countdown -= 1
if exiting or countdown == 0:
exit()
Demonstrations of xfsdist, the Linux eBPF/bcc version.
xfsdist traces XFS reads, writes, opens, and fsyncs, and summarizes their
latency as a power-of-2 histogram. For example:
# ./xfsdist
Tracing XFS operation latency... Hit Ctrl-C to end.
^C
operation = 'read'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 362 | |
4 -> 7 : 807 |* |
8 -> 15 : 20686 |****************************************|
16 -> 31 : 512 | |
32 -> 63 : 4 | |
64 -> 127 : 2744 |***** |
128 -> 255 : 7127 |************* |
256 -> 511 : 2483 |**** |
512 -> 1023 : 1281 |** |
1024 -> 2047 : 39 | |
2048 -> 4095 : 5 | |
4096 -> 8191 : 1 | |
operation = 'open'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 3 |****************************************|
This output shows a bi-modal distribution for read latency, with a faster
mode of 20,686 reads that took between 8 and 15 microseconds, and a slower
mode of over 10,000 reads that took between 64 and 1023 microseconds. It's
likely that the faster mode was a hit from the in-memory file system cache,
and the slower mode is a read from a storage device (disk).
This "latency" is measured from when the operation was issued from the VFS
interface to the file system, to when it completed. This spans everything:
block device I/O (disk I/O), file system CPU cycles, file system locks, run
queue latency, etc. This is a better measure of the latency suffered by
applications reading from the file system than measuring this down at the
block device interface.
Note that this only traces the common file system operations previously
listed: other file system operations (eg, inode operations including
getattr()) are not traced.
An optional interval and a count can be provided, as well as -m to show the
distributions in milliseconds. For example:
# ./xfsdist -m 1 5
Tracing XFS operation latency... Hit Ctrl-C to end.
10:14:15:
operation = 'read'
msecs : count distribution
0 -> 1 : 1366 |****************************************|
2 -> 3 : 86 |** |
4 -> 7 : 95 |** |
8 -> 15 : 132 |*** |
16 -> 31 : 72 |** |
operation = 'write'
msecs : count distribution
0 -> 1 : 685 |****************************************|
10:14:16:
operation = 'read'
msecs : count distribution
0 -> 1 : 984 |****************************************|
2 -> 3 : 66 |** |
4 -> 7 : 67 |** |
8 -> 15 : 104 |**** |
16 -> 31 : 70 |** |
32 -> 63 : 12 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 536 |****************************************|
10:14:17:
operation = 'read'
msecs : count distribution
0 -> 1 : 1262 |****************************************|
2 -> 3 : 75 |** |
4 -> 7 : 80 |** |
8 -> 15 : 119 |*** |
16 -> 31 : 75 |** |
32 -> 63 : 3 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 639 |****************************************|
10:14:18:
operation = 'read'
msecs : count distribution
0 -> 1 : 1070 |****************************************|
2 -> 3 : 58 |** |
4 -> 7 : 74 |** |
8 -> 15 : 140 |***** |
16 -> 31 : 60 |** |
32 -> 63 : 5 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 556 |****************************************|
10:14:19:
operation = 'read'
msecs : count distribution
0 -> 1 : 1176 |****************************************|
2 -> 3 : 53 |* |
4 -> 7 : 94 |*** |
8 -> 15 : 112 |*** |
16 -> 31 : 77 |** |
32 -> 63 : 3 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 613 |****************************************|
This shows a mixed read/write workload, where the slower read mode was around
10 ms.
USAGE message:
# ./xfsdist -h
usage: xfsdist [-h] [-T] [-m] [-p PID] [interval] [count]
Summarize XFS operation latency
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-T, --notimestamp don't include timestamp on interval output
-m, --milliseconds output in milliseconds
-p PID, --pid PID trace this PID only
examples:
./xfsdist # show operation latency as a histogram
./xfsdist -p 181 # trace PID 181 only
./xfsdist 1 10 # print 1 second summaries, 10 times
./xfsdist -m 5 # 5s summaries, milliseconds
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# xfsslower Trace slow XFS operations.
# For Linux, uses BCC, eBPF.
#
# USAGE: xfsslower [-h] [-j] [-p PID] [min_ms]
#
# This script traces common XFS file operations: reads, writes, opens, and
# syncs. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
#
# WARNING: This adds low-overhead instrumentation to these XFS operations,
# including reads and writes from the file system cache. Such reads and writes
# can be very frequent (depending on the workload; eg, 1M/sec), at which
# point the overhead of this tool (even if it prints no "slower" events) can
# begin to become significant.
#
# By default, a minimum millisecond threshold of 10 is used.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 11-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct
# arguments
examples = """examples:
./xfsslower # trace operations slower than 10 ms (default)
./xfsslower 1 # trace operations slower than 1 ms
./xfsslower -j 1 # ... 1 ms, parsable output (csv)
./xfsslower 0 # trace all operations (warning: verbose)
./xfsslower -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Trace common XFS file operations slower than a threshold",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-j", "--csv", action="store_true",
help="just print fields: comma-separated values")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("min_ms", nargs="?", default='10',
help="minimum I/O duration to trace, in ms (default 10)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
csv = args.csv
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
// XXX: switch these to char's when supported
#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_FSYNC 3
struct val_t {
u64 ts;
u64 offset;
struct file *fp;
};
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
u64 type;
u64 size;
u64 offset;
u64 delta_us;
u64 pid;
char task[TASK_COMM_LEN];
char file[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, pid_t, struct val_t);
BPF_PERF_OUTPUT(events);
//
// Store timestamp and size on entry
//
// xfs_file_read_iter(), xfs_file_write_iter():
int trace_rw_entry(struct pt_regs *ctx, struct kiocb *iocb)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = iocb->ki_filp;
val.offset = iocb->ki_pos;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// xfs_file_open():
int trace_open_entry(struct pt_regs *ctx, struct inode *inode,
struct file *file)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = file;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// xfs_file_fsync():
int trace_fsync_entry(struct pt_regs *ctx, struct file *file)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = file;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
//
// Output
//
static int trace_return(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = entryinfo.lookup(&pid);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
// calculate delta
u64 ts = bpf_ktime_get_ns();
u64 delta_us = (ts - valp->ts) / 1000;
entryinfo.delete(&pid);
if (FILTER_US)
return 0;
// workaround (rewriter should handle file to d_iname in one step):
struct dentry *de = NULL;
bpf_probe_read(&de, sizeof(de), &valp->fp->f_path.dentry);
// populate output struct
u32 size = ctx->ax;
struct data_t data = {.type = type, .size = size, .delta_us = delta_us,
.pid = pid};
data.ts_us = ts / 1000;
data.offset = valp->offset;
bpf_probe_read(&data.file, sizeof(data.file), de->d_iname);
bpf_get_current_comm(&data.task, sizeof(data.task));
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_WRITE);
}
int trace_open_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_OPEN);
}
int trace_fsync_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_FSYNC);
}
"""
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
else:
bpf_text = bpf_text.replace('FILTER_US',
'delta_us <= %s' % str(min_ms * 1000))
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
print(bpf_text)
# kernel->user event data: struct data_t
DNAME_INLINE_LEN = 32 # linux/dcache.h
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("ts_us", ct.c_ulonglong),
("type", ct.c_ulonglong),
("size", ct.c_ulonglong),
("offset", ct.c_ulonglong),
("delta_us", ct.c_ulonglong),
("pid", ct.c_ulonglong),
("task", ct.c_char * TASK_COMM_LEN),
("file", ct.c_char * DNAME_INLINE_LEN)
]
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
type = 'R'
if event.type == 1:
type = 'W'
elif event.type == 2:
type = 'O'
elif event.type == 3:
type = 'S'
if (csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
event.ts_us, event.task, event.pid, type, event.size,
event.offset, event.delta_us, event.file))
return
print("%-8s %-14.14s %-6s %1s %-7s %-8d %7.2f %s" % (strftime("%H:%M:%S"),
event.task, event.pid, type, event.size, event.offset / 1024,
float(event.delta_us) / 1000, event.file))
# initialize BPF
b = BPF(text=bpf_text)
# common file functions
b.attach_kprobe(event="xfs_file_read_iter", fn_name="trace_rw_entry")
b.attach_kprobe(event="xfs_file_write_iter", fn_name="trace_rw_entry")
b.attach_kprobe(event="xfs_file_open", fn_name="trace_open_entry")
b.attach_kprobe(event="xfs_file_fsync", fn_name="trace_fsync_entry")
b.attach_kretprobe(event="xfs_file_read_iter", fn_name="trace_read_return")
b.attach_kretprobe(event="xfs_file_write_iter", fn_name="trace_write_return")
b.attach_kretprobe(event="xfs_file_open", fn_name="trace_open_return")
b.attach_kretprobe(event="xfs_file_fsync", fn_name="trace_fsync_return")
# header
if (csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:
if min_ms == 0:
print("Tracing XFS operations")
else:
print("Tracing XFS operations slower than %d ms" % min_ms)
print("%-8s %-14s %-6s %1s %-7s %-8s %7s %s" % ("TIME", "COMM", "PID", "T",
"BYTES", "OFF_KB", "LAT(ms)", "FILENAME"))
# read events
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
Demonstrations of xfsslower, the Linux eBPF/bcc version.
xfsslower shows XFS reads, writes, opens, and fsyncs, slower than a threshold.
For example:
# ./xfsslower
Tracing XFS operations slower than 10 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:23:06 randread.pl 32497 R 8192 24938024 17.93 data1
06:23:06 randread.pl 32521 R 8192 13431528 18.27 data1
06:23:08 randread.pl 32497 R 8192 5070904 16.37 data1
06:23:08 randread.pl 32521 R 8192 12693016 16.06 data1
06:23:18 randread.pl 32521 R 8192 27049136 21.68 data1
06:23:18 randread.pl 32497 R 8192 257864 21.74 data1
06:23:20 randread.pl 32497 R 8192 17797208 13.37 data1
06:23:20 randread.pl 32497 R 8192 6088224 19.74 data1
This shows several reads from a "randread.pl" program, each 8 Kbytes in size,
and from a "data1" file. These all had over 10 ms latency.
This "latency" is measured from when the operation was issued from the VFS
interface to the file system, to when it completed. This spans everything:
block device I/O (disk I/O), file system CPU cycles, file system locks, run
queue latency, etc. This is a better measure of the latency suffered by
applications reading from the file system than measuring this down at the
block device interface.
Note that this only traces the common file system operations previously
listed: other file system operations (eg, inode operations including
getattr()) are not traced.
The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
# ./xfsslower 1
Tracing XFS operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:26:59 randread.pl 5394 R 8192 9045728 1.24 data1
06:26:59 randread.pl 5394 R 8192 23532136 1.17 data1
06:26:59 randread.pl 5442 R 8192 2192376 2.06 data1
06:27:00 randread.pl 5394 R 8192 3535176 1.27 data1
06:27:00 randread.pl 5442 R 8192 21361784 3.18 data1
06:27:00 randread.pl 5394 R 8192 2556336 3.23 data1
06:27:00 randread.pl 5394 R 8192 20020880 2.87 data1
06:27:00 randread.pl 5442 R 8192 20708888 3.32 data1
06:27:00 randread.pl 5394 R 8192 4654680 2.00 data1
06:27:00 randread.pl 5442 R 8192 5591744 1.98 data1
06:27:00 randread.pl 5394 R 8192 2431056 1.22 data1
06:27:00 randread.pl 5394 R 8192 384288 2.95 data1
06:27:00 randread.pl 5442 R 8192 29277672 3.07 data1
06:27:00 randread.pl 5442 R 8192 29508216 3.23 data1
06:27:00 randread.pl 5394 R 8192 17200008 2.86 data1
06:27:00 randread.pl 5442 R 8192 20693088 1.06 data1
06:27:00 randread.pl 5394 R 8192 28124192 1.38 data1
06:27:00 randread.pl 5442 R 8192 23821184 1.28 data1
06:27:00 randread.pl 5394 R 8192 1623200 1.47 data1
[...]
There's now much more output (this spans only 2 seconds, the previous output
spanned 14 seconds), as the lower threshold is catching more I/O.
A threshold of 0 will trace all operations. Warning: the output will be
verbose, as it will include all file system cache hits.
# ./xfsslower 0
Tracing XFS operations
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:29:43 ls 9291 O 0 0 0.00 bench
06:29:47 cat 9361 O 0 0 0.00 date.txt
06:29:47 cat 9361 R 29 0 0.01 date.txt
06:29:47 cat 9361 R 0 0 0.00 date.txt
06:29:50 bash 20500 O 0 0 0.00 bench
06:29:50 bash 20500 O 0 0 0.00 bench
06:29:50 bash 20500 O 0 0 0.00 bench
06:29:50 bash 9431 O 0 0 0.00 bench
06:29:50 bash 9432 O 0 0 0.00 bench
06:29:50 bash 9456 O 0 0 0.00 newdate.txt
06:29:50 date 9456 W 29 0 0.01 newdate.txt
06:29:53 cksum 9503 O 0 0 0.00 data1
06:29:53 cksum 9503 R 65536 0 0.06 data1
06:29:53 cksum 9503 R 65536 64 0.01 data1
06:29:53 cksum 9503 R 65536 128 0.02 data1
06:29:53 cksum 9503 R 65536 192 0.01 data1
06:29:53 cksum 9503 R 65536 256 0.01 data1
06:29:53 cksum 9503 R 65536 320 0.01 data1
06:29:53 cksum 9503 R 65536 384 0.01 data1
06:29:53 cksum 9503 R 65536 448 0.04 data1
06:29:53 cksum 9503 R 65536 512 0.01 data1
06:29:53 cksum 9503 R 65536 576 0.02 data1
06:29:53 cksum 9503 R 65536 640 0.01 data1
06:29:53 cksum 9503 R 65536 704 0.01 data1
06:29:53 cksum 9503 R 65536 768 0.01 data1
06:29:53 cksum 9503 R 65536 832 0.01 data1
06:29:53 cksum 9503 R 65536 896 0.01 data1
06:29:53 cksum 9503 R 65536 960 0.01 data1
06:29:53 cksum 9503 R 65536 1024 0.01 data1
06:29:53 cksum 9503 R 65536 1088 0.02 data1
06:29:53 cksum 9503 R 65536 1152 0.01 data1
06:29:53 cksum 9503 R 65536 1216 0.01 data1
[...]
The output now includes open operations ("O"), and writes ("W"). A cksum(1)
command can be seen reading from a data1 file, from progressively increasing
offsets: a sequential workload.
A -j option will print just the fields (parsable output, csv):
# ./xfsslower -j 1
ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
125563830632,randread.pl,12155,R,8192,27824193536,1057,data1
125565050578,randread.pl,12155,R,8192,16908525568,1969,data1
125566331140,randread.pl,12202,R,8192,16310689792,1738,data1
125566427955,randread.pl,12155,R,8192,11127439360,1058,data1
125567223494,randread.pl,12202,R,8192,8422031360,1131,data1
125567331145,randread.pl,12155,R,8192,9233088512,1230,data1
125567331220,randread.pl,12202,R,8192,12716326912,1148,data1
125567334983,randread.pl,12155,R,8192,24545206272,2182,data1
[...]
This may be useful for visualizing with another tool, for example, for
producing a scatter plot of ENDTIME vs LATENCY, to look for time-based
patterns.
USAGE message:
# ./xfsslower -h
usage: xfsslower [-h] [-j] [-p PID] [min_ms]
Trace common XFS file operations slower than a threshold
positional arguments:
min_ms minimum I/O duration to trace, in ms (default 10)
optional arguments:
-h, --help show this help message and exit
-j, --csv just print fields: comma-separated values
-p PID, --pid PID trace this PID only
examples:
./xfsslower # trace operations slower than 10 ms (default)
./xfsslower 1 # trace operations slower than 1 ms
./xfsslower -j 1 # ... 1 ms, parsable output (csv)
./xfsslower 0 # trace all operations (warning: verbose)
./xfsslower -p 185 # trace PID 185 only
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment