Merge pull request #388 from brendangregg/master

2 tools: btrfsdist, btrfsslower

Commit 8b127628 authored Feb 16, 2016 by 4ast
7 changed files
--- a/README.md
+++ b/README.md
@@ -71,6 +71,8 @@ Tools:
 - tools/[biotop](tools/biotop.py): Top for disks: Summarize block device I/O by process. [Examples](tools/biotop_example.txt).
 - tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
 - tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
+- tools/[btrfsdist](tools/btrfsdist.py): Summarize btrfs operation latency distribution as a histogram. [Examples](tools/btrfsdist_example.txt).
+- tools/[btrfsslower](tools/btrfsslower.py): Trace slow btrfs operations. [Examples](tools/btrfsslower_example.txt).
 - tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
 - tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
 - tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).

--- a/man/man8/btrfsdist.8
+++ b/man/man8/btrfsdist.8
+.TH btrfsdist 8  "2016-02-15" "USER COMMANDS"
+.SH NAME
+btrfsdist \- Summarize btrfs operation latency. Uses Linux eBPF/bcc.
+.SH SYNOPSIS
+.B btrfsdist [\-h] [\-T] [\-m] [\-p PID] [interval] [count]
+.SH DESCRIPTION
+This tool summarizes time (latency) spent in common btrfs file operations:
+reads, writes, opens, and syncs, and presents it as a power-of-2 histogram. It
+uses an in-kernel eBPF map to store the histogram for efficiency.
+
+Since this works by tracing the btrfs_file_operations interface functions, it
+will need updating to match any changes to these functions.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-T
+Don't include timestamps on interval output.
+.TP
+\-m
+Output in milliseconds.
+.TP
+\-p PID
+Trace this PID only.
+.SH EXAMPLES
+.TP
+Trace btrfs operation time, and print a summary on Ctrl-C:
+#
+.B btrfsdist
+.TP
+Trace PID 181 only:
+#
+.B btrfsdist -p 181
+.TP
+Print 1 second summaries, 10 times:
+#
+.B btrfsdist 1 10
+.TP
+Print 1 second summaries, in milliseconds:
+#
+.B btrfsdist \-m 1
+.SH FIELDS
+.TP
+msecs
+Range of milliseconds for this bucket.
+.TP
+usecs
+Range of microseconds for this bucket.
+.TP
+count
+Number of operations in this time range.
+.TP
+distribution
+ASCII representation of the distribution (the count column).
+.SH OVERHEAD
+This adds low-overhead instrumentation to btrfs writes and fsyncs, as well
+as all system reads and opens (due to the current implementation of the
+btrfs_file_operations interface). Particularly, all reads and writes from
+the file system cache will incur extra overhead while tracing. Such reads and
+writes can be very frequent (depending on the workload; eg, 1M/sec), at which
+point the overhead of this tool may become noticeable.
+Measure and quantify before use.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+btrfsslower(8)
--- a/man/man8/btrfsslower.8
+++ b/man/man8/btrfsslower.8
+.TH btrfsslower 8  "2016-02-15" "USER COMMANDS"
+.SH NAME
+btrfsslower \- Trace slow btrfs file operations, with per-event details.
+.SH SYNOPSIS
+.B btrfsslower [\-h] [\-j] [\-p PID] [min_ms]
+.SH DESCRIPTION
+This tool traces common btrfs file operations: reads, writes, opens, and
+syncs. It measures the time spent in these operations, and prints details
+for each that exceeded a threshold.
+
+WARNING: See the OVERHEAD section.
+
+By default, a minimum millisecond threshold of 10 is used. If a threshold of 0
+is used, all events are printed (warning: verbose).
+
+Since this works by tracing the btrfs_file_operations interface functions, it
+will need updating to match any changes to these functions.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+.SH OPTIONS
+.TP
+\-j
+Print just the fields, as comma-separated values (csv).
+.TP
+\-p PID
+Trace this PID only.
+.TP
+min_ms
+Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
+.SH EXAMPLES
+.TP
+Trace operations slower than 10 ms:
+#
+.B btrfsslower
+.TP
+Trace slower than 1 ms:
+#
+.B btrfsslower 1
+.TP
+Trace slower than 1 ms, and output just the fields in parsable format (csv):
+#
+.B btrfsslower \-j 1
+.TP
+Trace all operations (warning: the output will be verbose):
+#
+.B btrfsslower 0
+.TP
+Trace slower than 1 ms, for PID 181 only:
+#
+.B btrfsslower \-p 181 1
+.SH FIELDS
+.TP
+TIME
+Time of day when the event was printed, as HH:MM:SS.
+.TP
+COMM
+Process name.
+.TP
+PID
+Process ID.
+.TP
+T
+Type of operation. R == read, W == write, O == open, S == fsync.
+.TP
+OFF_KB
+File offset for the I/O, in Kbytes.
+.TP
+BYTES
+Size of I/O, in bytes.
+.TP
+LAT(ms)
+Latency (duration) of I/O, measured from when it was issued by VFS to the
+filesystem, to when it completed. This time is inclusive of block device I/O,
+file system CPU cycles, file system locks, run queue latency, etc. It is a more
+accurate measure of the latency suffered by applications performing file
+system I/O than measurements taken down at the block device interface.
+.TP
+FILENAME
+A cached kernel file name (comes from dentry->d_iname).
+.TP
+ENDTIME_us
+Completion timestamp, microseconds (\-j only).
+.TP
+OFFSET_b
+File offset, bytes (\-j only).
+.TP
+LATENCY_us
+Latency (duration) of the I/O, in microseconds (\-j only).
+.SH OVERHEAD
+This adds low-overhead instrumentation to btrfs writes and fsyncs, as well
+as all system reads and opens (due to the current implementation of the
+btrfs_file_operations interface). Particularly, all reads and writes from
+the file system cache will incur extra overhead while tracing. Such reads and
+writes can be very frequent (depending on the workload; eg, 1M/sec), at which
+point the overhead of this tool may become noticeable.
+Measure and quantify before use. If this
+continues to be a problem, consider switching to a tool that prints in-kernel
+summaries only, such as btrfsdist(8).
+.PP
+Note that the overhead of this tool should be less than fileslower(8), as
+this tool targets btrfs functions only, and not all file read/write paths
+(which can include socket I/O).
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+btrfsdist(8), biosnoop(8), funccount(8), fileslower(8)
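
Review note: the \-j mode documented in this man page prints the header `ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE` (as defined in btrfsslower.py in this commit), then one comma-separated line per event. The following is a minimal sketch of post-processing that output in Python. The sample rows and the `slowest` helper are hypothetical, made up for illustration, and are not real trace data or part of bcc.

```python
import csv
import io

# Header as printed by btrfsslower -j in this commit.
HEADER = "ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE"

# Hypothetical sample rows, invented for illustration only.
SAMPLE = HEADER + """
1000000,randread.pl,13602,R,8192,400777216,10400,data1
1000500,randread.pl,13602,R,8192,94855168,17330,data1
1000900,dd,13700,W,4096,0,1200,out.bin
"""

def slowest(text, min_us):
    """Return (latency_us, task, type, file) tuples at or above min_us,
    slowest first."""
    rows = csv.DictReader(io.StringIO(text))
    hits = [(int(r["LATENCY_us"]), r["TASK"], r["TYPE"], r["FILE"])
            for r in rows if int(r["LATENCY_us"]) >= min_us]
    return sorted(hits, reverse=True)

for lat_us, task, op, name in slowest(SAMPLE, 10000):
    print("%8.2f ms  %-12s %s %s" % (lat_us / 1000.0, task, op, name))
```

One caveat, judging from the print statement in btrfsslower.py: fields are not quoted, so a file or task name containing a comma would shift the columns. A consumer may want to validate the field count per line.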
--- a/tools/btrfsdist.py
+++ b/tools/btrfsdist.py
+#!/usr/bin/python
+# @lint-avoid-python-3-compatibility-imports
+#
+# btrfsdist  Summarize btrfs operation latency.
+#            For Linux, uses BCC, eBPF.
+#
+# USAGE: btrfsdist [-h] [-T] [-m] [-p PID] [interval] [count]
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 15-Feb-2016   Brendan Gregg   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+from time import sleep, strftime
+import argparse
+
+# symbols
+kallsyms = "/proc/kallsyms"
+
+# arguments
+examples = """examples:
+    ./btrfsdist            # show operation latency as a histogram
+    ./btrfsdist -p 181     # trace PID 181 only
+    ./btrfsdist 1 10       # print 1 second summaries, 10 times
+    ./btrfsdist -m 5       # 5s summaries, milliseconds
+"""
+parser = argparse.ArgumentParser(
+    description="Summarize btrfs operation latency",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-T", "--notimestamp", action="store_true",
+    help="don't include timestamp on interval output")
+parser.add_argument("-m", "--milliseconds", action="store_true",
+    help="output in milliseconds")
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("interval", nargs="?",
+    help="output interval, in seconds")
+parser.add_argument("count", nargs="?", default=99999999,
+    help="number of outputs")
+args = parser.parse_args()
+pid = args.pid
+countdown = int(args.count)
+if args.milliseconds:
+    factor = 1000000
+    label = "msecs"
+else:
+    factor = 1000
+    label = "usecs"
+if args.interval and int(args.interval) == 0:
+    print("ERROR: interval 0. Exiting.")
+    exit()
+debug = 0
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+
+#define OP_NAME_LEN 8
+typedef struct dist_key {
+    char op[OP_NAME_LEN];
+    u64 slot;
+} dist_key_t;
+BPF_HASH(start, u32);
+BPF_HISTOGRAM(dist, dist_key_t);
+
+// time operation
+int trace_entry(struct pt_regs *ctx)
+{
+    u32 pid = bpf_get_current_pid_tgid();
+    if (FILTER_PID)
+        return 0;
+    u64 ts = bpf_ktime_get_ns();
+    start.update(&pid, &ts);
+    return 0;
+}
+
+// The current btrfs (Linux 4.5) uses generic_file_read_iter() instead of its
+// own read function. So we need to trace that and then filter on btrfs, which
+// I do by checking file->f_op.
+int trace_read_entry(struct pt_regs *ctx, struct kiocb *iocb)
+{
+    u32 pid = bpf_get_current_pid_tgid();
+    if (FILTER_PID)
+        return 0;
+
+    // btrfs filter on file->f_op == btrfs_file_operations
+    struct file *fp = iocb->ki_filp;
+    if ((u64)fp->f_op != BTRFS_FILE_OPERATIONS)
+        return 0;
+
+    u64 ts = bpf_ktime_get_ns();
+    start.update(&pid, &ts);
+    return 0;
+}
+
+// The current btrfs (Linux 4.5) uses generic_file_open(), instead of its own
+// function. Same as with reads. Trace the generic path and filter:
+int trace_open_entry(struct pt_regs *ctx, struct inode *inode,
+    struct file *file)
+{
+    u32 pid;
+    pid = bpf_get_current_pid_tgid();
+    if (FILTER_PID)
+        return 0;
+
+    // btrfs filter on file->f_op == btrfs_file_operations
+    if ((u64)file->f_op != BTRFS_FILE_OPERATIONS)
+        return 0;
+
+    u64 ts = bpf_ktime_get_ns();
+    start.update(&pid, &ts);
+    return 0;
+}
+
+static int trace_return(struct pt_regs *ctx, const char *op)
+{
+    u64 *tsp;
+    u32 pid = bpf_get_current_pid_tgid();
+
+    // fetch timestamp and calculate delta
+    tsp = start.lookup(&pid);
+    if (tsp == 0) {
+        return 0;   // missed start or filtered
+    }
+    u64 delta = (bpf_ktime_get_ns() - *tsp) / FACTOR;
+
+    // store as histogram
+    dist_key_t key = {.slot = bpf_log2l(delta)};
+    __builtin_memcpy(&key.op, op, sizeof(key.op));
+    dist.increment(key);
+
+    start.delete(&pid);
+    return 0;
+}
+
+int trace_read_return(struct pt_regs *ctx)
+{
+    char *op = "read";
+    return trace_return(ctx, op);
+}
+
+int trace_write_return(struct pt_regs *ctx)
+{
+    char *op = "write";
+    return trace_return(ctx, op);
+}
+
+int trace_open_return(struct pt_regs *ctx)
+{
+    char *op = "open";
+    return trace_return(ctx, op);
+}
+
+int trace_fsync_return(struct pt_regs *ctx)
+{
+    char *op = "fsync";
+    return trace_return(ctx, op);
+}
+"""
+
+# code replacements
+with open(kallsyms) as syms:
+    ops = ''
+    for line in syms:
+        a = line.rstrip().split()
+        (addr, name) = (a[0], a[2])
+        if name == "btrfs_file_operations":
+            ops = "0x" + addr
+            break
+    if ops == '':
+        print("ERROR: no btrfs_file_operations in /proc/kallsyms. Exiting.")
+        exit()
+    bpf_text = bpf_text.replace('BTRFS_FILE_OPERATIONS', ops)
+bpf_text = bpf_text.replace('FACTOR', str(factor))
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
+else:
+    bpf_text = bpf_text.replace('FILTER_PID', '0')
+if debug:
+    print(bpf_text)
+
+# load BPF program
+b = BPF(text=bpf_text)
+
+# Common file functions. See earlier comment about generic_file_read_iter().
+b.attach_kprobe(event="generic_file_read_iter", fn_name="trace_read_entry")
+b.attach_kprobe(event="btrfs_file_write_iter", fn_name="trace_entry")
+b.attach_kprobe(event="generic_file_open", fn_name="trace_open_entry")
+b.attach_kprobe(event="btrfs_sync_file", fn_name="trace_entry")
+b.attach_kretprobe(event="generic_file_read_iter", fn_name="trace_read_return")
+b.attach_kretprobe(event="btrfs_file_write_iter", fn_name="trace_write_return")
+b.attach_kretprobe(event="generic_file_open", fn_name="trace_open_return")
+b.attach_kretprobe(event="btrfs_sync_file", fn_name="trace_fsync_return")
+
+print("Tracing btrfs operation latency... Hit Ctrl-C to end.")
+
+# output
+exiting = 0
+dist = b.get_table("dist")
+while (1):
+    try:
+        if args.interval:
+            sleep(int(args.interval))
+        else:
+            sleep(99999999)
+    except KeyboardInterrupt:
+        exiting = 1
+
+    print()
+    if args.interval and (not args.notimestamp):
+        print(strftime("%H:%M:%S:"))
+
+    dist.print_log2_hist(label, "operation")
+    dist.clear()
+
+    countdown -= 1
+    if exiting or countdown == 0:
+        exit()
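
Review note: btrfsdist.py filters the generic read and open paths by comparing `file->f_op` against the address of `btrfs_file_operations`, which it reads from /proc/kallsyms. The lookup can be sketched as a small standalone function. The `find_symbol` name and the sample lines (including the addresses) are hypothetical and only illustrate the `<address> <type> <name> [module]` line layout the script relies on.

```python
def find_symbol(lines, wanted):
    """Return the address of `wanted` as a '0x...' string, or None if absent."""
    for line in lines:
        fields = line.split()
        # kallsyms lines look like: <address> <type> <name> [module]
        if len(fields) >= 3 and fields[2] == wanted:
            return "0x" + fields[0]
    return None

# Hypothetical kallsyms excerpt; the addresses are made up.
sample = [
    "ffffffff811e3d60 T generic_file_open",
    "ffffffffa03d8a40 r btrfs_file_operations\t[btrfs]",
]

print(find_symbol(sample, "btrfs_file_operations"))
print(find_symbol(sample, "no_such_symbol"))
```

Returning `None` instead of exiting inside the loop keeps the lookup testable; the caller can then print the "no btrfs_file_operations" error and exit, as the script does.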
--- a/tools/btrfsdist_example.txt
+++ b/tools/btrfsdist_example.txt
+Demonstrations of btrfsdist, the Linux eBPF/bcc version.
+
+
+btrfsdist traces btrfs reads, writes, opens, and fsyncs, and summarizes their
+latency as a power-of-2 histogram. For example:
+
+# ./btrfsdist
+Tracing btrfs operation latency... Hit Ctrl-C to end.
+^C
+
+operation = 'read'
+     usecs               : count     distribution
+         0 -> 1          : 15       |                                        |
+         2 -> 3          : 1308     |*******                                 |
+         4 -> 7          : 198      |*                                       |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 11       |                                        |
+        32 -> 63         : 361      |*                                       |
+        64 -> 127        : 55       |                                        |
+       128 -> 255        : 104      |                                        |
+       256 -> 511        : 7312     |****************************************|
+       512 -> 1023       : 387      |**                                      |
+      1024 -> 2047       : 10       |                                        |
+      2048 -> 4095       : 4        |                                        |
+
+operation = 'write'
+     usecs               : count     distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 0        |                                        |
+         4 -> 7          : 0        |                                        |
+         8 -> 15         : 4        |****************************************|
+
+operation = 'open'
+     usecs               : count     distribution
+         0 -> 1          : 1        |**********                              |
+         2 -> 3          : 4        |****************************************|
+
+This output shows a bi-modal distribution for read latency, with a faster
+mode of 1,308 reads that took between 2 and 3 microseconds, and a slower
+mode of 7,312 reads that took between 256 and 511 microseconds. It's
+likely that the faster mode was a hit from the in-memory file system cache,
+and the slower mode was a read from a storage device (disk).
+
+This "latency" is measured from when the operation was issued from the VFS
+interface to the file system, to when it completed. This spans everything:
+block device I/O (disk I/O), file system CPU cycles, file system locks, run
+queue latency, etc. This is a better measure of the latency suffered by
+applications reading from the file system than measuring this down at the
+block device interface.
+
+Note that this only traces the common file system operations previously
+listed: other file system operations (eg, inode operations including
+getattr()) are not traced.
+
+
+An optional interval and a count can be provided, as well as -m to show the
+distributions in milliseconds. For example, two second summaries, five times:
+
+# ./btrfsdist 2 5
+Tracing btrfs operation latency... Hit Ctrl-C to end.
+
+03:40:49:
+
+operation = 'read'
+     usecs               : count     distribution
+         0 -> 1          : 15       |                                        |
+         2 -> 3          : 833      |********                                |
+         4 -> 7          : 127      |*                                       |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 8        |                                        |
+        32 -> 63         : 907      |********                                |
+        64 -> 127        : 91       |                                        |
+       128 -> 255        : 246      |**                                      |
+       256 -> 511        : 4164     |****************************************|
+       512 -> 1023       : 193      |*                                       |
+      1024 -> 2047       : 4        |                                        |
+      2048 -> 4095       : 6        |                                        |
+      4096 -> 8191       : 2        |                                        |
+
+03:40:51:
+
+operation = 'read'
+     usecs               : count     distribution
+         0 -> 1          : 25       |                                        |
+         2 -> 3          : 1491     |***************                         |
+         4 -> 7          : 218      |**                                      |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 16       |                                        |
+        32 -> 63         : 1527     |***************                         |
+        64 -> 127        : 319      |***                                     |
+       128 -> 255        : 429      |****                                    |
+       256 -> 511        : 3841     |****************************************|
+       512 -> 1023       : 232      |**                                      |
+      1024 -> 2047       : 3        |                                        |
+      2048 -> 4095       : 6        |                                        |
+      4096 -> 8191       : 1        |                                        |
+      8192 -> 16383      : 1        |                                        |
+
+03:40:53:
+
+operation = 'read'
+     usecs               : count     distribution
+         0 -> 1          : 27       |                                        |
+         2 -> 3          : 2999     |*********************************       |
+         4 -> 7          : 407      |****                                    |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 46       |                                        |
+        32 -> 63         : 3538     |****************************************|
+        64 -> 127        : 595      |******                                  |
+       128 -> 255        : 621      |*******                                 |
+       256 -> 511        : 3532     |*************************************** |
+       512 -> 1023       : 212      |**                                      |
+      1024 -> 2047       : 1        |                                        |
+      2048 -> 4095       : 0        |                                        |
+      4096 -> 8191       : 0        |                                        |
+      8192 -> 16383      : 0        |                                        |
+     16384 -> 32767      : 1        |                                        |
+
+03:40:55:
+
+operation = 'read'
+     usecs               : count     distribution
+         0 -> 1          : 221      |                                        |
+         2 -> 3          : 12580    |****************************************|
+         4 -> 7          : 1366     |****                                    |
+         8 -> 15         : 0        |                                        |
+        16 -> 31         : 289      |                                        |
+        32 -> 63         : 10782    |**********************************      |
+        64 -> 127        : 1232     |***                                     |
+       128 -> 255        : 807      |**                                      |
+       256 -> 511        : 2299     |*******                                 |
+       512 -> 1023       : 135      |                                        |
+      1024 -> 2047       : 5        |                                        |
+      2048 -> 4095       : 2        |                                        |
+
+03:40:57:
+
+operation = 'read'
+     usecs               : count     distribution
+         0 -> 1          : 73951    |*************************               |
+         2 -> 3          : 117639   |****************************************|
+         4 -> 7          : 7943     |**                                      |
+         8 -> 15         : 1841     |                                        |
+        16 -> 31         : 1143     |                                        |
+        32 -> 63         : 5006     |*                                       |
+        64 -> 127        : 483      |                                        |
+       128 -> 255        : 242      |                                        |
+       256 -> 511        : 253      |                                        |
+       512 -> 1023       : 84       |                                        |
+      1024 -> 2047       : 23       |                                        |
+
+This shows a read workload that begins bimodal, and eventually the second
+mode disappears. The reason for this is that the workload's data became
+cached during tracing. Note that the rate also increased, with over 200k
+reads for the final two second sample.
+
+
+USAGE message:
+
+# ./btrfsdist -h
+usage: btrfsdist [-h] [-T] [-m] [-p PID] [interval] [count]
+
+Summarize btrfs operation latency
+
+positional arguments:
+  interval            output interval, in seconds
+  count               number of outputs
+
+optional arguments:
+  -h, --help          show this help message and exit
+  -T, --notimestamp   don't include timestamp on interval output
+  -m, --milliseconds  output in milliseconds
+  -p PID, --pid PID   trace this PID only
+
+examples:
+    ./btrfsdist            # show operation latency as a histogram
+    ./btrfsdist -p 181     # trace PID 181 only
+    ./btrfsdist 1 10       # print 1 second summaries, 10 times
+    ./btrfsdist -m 5       # 5s summaries, milliseconds
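
Review note: the histograms in this example file group latencies into power-of-2 buckets (0 -> 1, 2 -> 3, 4 -> 7, and so on). The following is a plain Python sketch of that bucketing, written to reproduce the ranges shown in the example output. It is an illustration only and not bcc's implementation; the `bucket_range` and `to_bucket` names are hypothetical.

```python
def bucket_range(value):
    """Return the (low, high) power-of-2 bucket that `value` falls into."""
    # bit_length() is floor(log2(value)) + 1 for value > 0; treat 0 as slot 1.
    slot = max(int(value).bit_length(), 1)
    low = (1 << slot) >> 1
    high = (1 << slot) - 1
    if low == high:
        # The first bucket is displayed as 0 -> 1 rather than 1 -> 1.
        low -= 1
    return (low, high)

def to_bucket(delta_ns, factor=1000):
    """Scale a nanosecond delta (factor 1000 gives usecs, 1000000 gives
    msecs, as in btrfsdist.py) and return its bucket."""
    return bucket_range(delta_ns // factor)

for ns in (1500, 3000, 300000, 3000000):
    print("%d ns: %d -> %d usecs" % ((ns,) + to_bucket(ns)))
```

For instance, a 300,000 ns read scales to 300 usecs and lands in the 256 -> 511 bucket, the slower mode in the first histogram above.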
--- a/tools/btrfsslower.py
+++ b/tools/btrfsslower.py
+#!/usr/bin/python
+# @lint-avoid-python-3-compatibility-imports
+#
+# btrfsslower  Trace slow btrfs operations.
+#              For Linux, uses BCC, eBPF.
+#
+# USAGE: btrfsslower [-h] [-j] [-p PID] [min_ms]
+#
+# This script traces common btrfs file operations: reads, writes, opens, and
+# syncs. It measures the time spent in these operations, and prints details
+# for each that exceeded a threshold.
+#
+# WARNING: This adds low-overhead instrumentation to these btrfs operations,
+# including reads and writes from the file system cache. Such reads and writes
+# can be very frequent (depending on the workload; eg, 1M/sec), at which
+# point the overhead of this tool (even if it prints no "slower" events) can
+# begin to become significant.
+#
+# By default, a minimum millisecond threshold of 10 is used.
+#
+# Copyright 2016 Netflix, Inc.
+# Licensed under the Apache License, Version 2.0 (the "License")
+#
+# 15-Feb-2016   Brendan Gregg   Created this.
+
+from __future__ import print_function
+from bcc import BPF
+import argparse
+from time import strftime
+import ctypes as ct
+
+# symbols
+kallsyms = "/proc/kallsyms"
+
+# arguments
+examples = """examples:
+    ./btrfsslower             # trace operations slower than 10 ms (default)
+    ./btrfsslower 1           # trace operations slower than 1 ms
+    ./btrfsslower -j 1        # ... 1 ms, parsable output (csv)
+    ./btrfsslower 0           # trace all operations (warning: verbose)
+    ./btrfsslower -p 185      # trace PID 185 only
+"""
+parser = argparse.ArgumentParser(
+    description="Trace common btrfs file operations slower than a threshold",
+    formatter_class=argparse.RawDescriptionHelpFormatter,
+    epilog=examples)
+parser.add_argument("-j", "--csv", action="store_true",
+    help="just print fields: comma-separated values")
+parser.add_argument("-p", "--pid",
+    help="trace this PID only")
+parser.add_argument("min_ms", nargs="?", default='10',
+    help="minimum I/O duration to trace, in ms (default 10)")
+args = parser.parse_args()
+min_ms = int(args.min_ms)
+pid = args.pid
+csv = args.csv
+debug = 0
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/dcache.h>
+
+// XXX: switch these to char's when supported
+#define TRACE_READ      0
+#define TRACE_WRITE     1
+#define TRACE_OPEN      2
+#define TRACE_FSYNC     3
+
+struct val_t {
+    u64 ts;
+    u64 offset;
+    struct file *fp;
+};
+
+struct data_t {
+    // XXX: switch some to u32's when supported
+    u64 ts_us;
+    u64 type;
+    u64 size;
+    u64 offset;
+    u64 delta_us;
+    u64 pid;
+    char task[TASK_COMM_LEN];
+    char file[DNAME_INLINE_LEN];
+};
+
+BPF_HASH(entryinfo, pid_t, struct val_t);
+BPF_PERF_OUTPUT(events);
+
+//
+// Store timestamp and size on entry
+//
+
+// The current btrfs (Linux 4.5) uses generic_file_read_iter() instead of its
+// own read function. So we need to trace that and then filter on btrfs, which
+// I do by checking file->f_op.
+int trace_read_entry(struct pt_regs *ctx, struct kiocb *iocb)
+{
+    u32 pid;
+    pid = bpf_get_current_pid_tgid();
+    if (FILTER_PID)
+        return 0;
+
+    // btrfs filter on file->f_op == btrfs_file_operations
+    struct file *fp = iocb->ki_filp;
+    if ((u64)fp->f_op != BTRFS_FILE_OPERATIONS)
+        return 0;
+
+    // store filep and timestamp by pid
+    struct val_t val = {};
+    val.ts = bpf_ktime_get_ns();
+    val.fp = fp;
+    val.offset = iocb->ki_pos;
+    if (val.fp)
+        entryinfo.update(&pid, &val);
+
+    return 0;
+}
+
+// btrfs_file_write_iter():
+int trace_write_entry(struct pt_regs *ctx, struct kiocb *iocb)
+{
+    u32 pid;
+    pid = bpf_get_current_pid_tgid();
+    if (FILTER_PID)
+        return 0;
+
+    // store filep and timestamp by pid
+    struct val_t val = {};
+    val.ts = bpf_ktime_get_ns();
+    val.fp = iocb->ki_filp;
+    val.offset = iocb->ki_pos;
+    if (val.fp)
+        entryinfo.update(&pid, &val);
+
+    return 0;
+}
+
+// The current btrfs (Linux 4.5) uses generic_file_open(), instead of its own
+// function. Same as with reads. Trace the generic path and filter:
+int trace_open_entry(struct pt_regs *ctx, struct inode *inode,
+    struct file *file)
+{
+    u32 pid;
+    pid = bpf_get_current_pid_tgid();
+    if (FILTER_PID)
+        return 0;
+
+    // btrfs filter on file->f_op == btrfs_file_operations
+    if ((u64)file->f_op != BTRFS_FILE_OPERATIONS)
+        return 0;
+
+    // store filep and timestamp by pid
+    struct val_t val = {};
+    val.ts = bpf_ktime_get_ns();
+    val.fp = file;
+    val.offset = 0;
+    if (val.fp)
+        entryinfo.update(&pid, &val);
+
+    return 0;
+}
+
+// btrfs_sync_file():
+int trace_fsync_entry(struct pt_regs *ctx, struct file *file)
+{
+    u32 pid;
+    pid = bpf_get_current_pid_tgid();
+    if (FILTER_PID)
+        return 0;
+
+    // store filep and timestamp by pid
+    struct val_t val = {};
+    val.ts = bpf_ktime_get_ns();
+    val.fp = file;
+    val.offset = 0;
+    if (val.fp)
+        entryinfo.update(&pid, &val);
+
+    return 0;
+}
+
+//
+// Output
+//
+
+static int trace_return(struct pt_regs *ctx, int type)
+{
+    struct val_t *valp;
+    u32 pid = bpf_get_current_pid_tgid();
+
+    valp = entryinfo.lookup(&pid);
+    if (valp == 0) {
+        // missed tracing issue or filtered
+        return 0;
+    }
+
+    // calculate delta
+    u64 ts = bpf_ktime_get_ns();
+    u64 delta_us = (ts - valp->ts) / 1000;
+    entryinfo.delete(&pid);
+    if (FILTER_US)
+        return 0;
+
+    // workaround (rewriter should handle file to d_iname in one step):
+    struct dentry *de = NULL;
+    bpf_probe_read(&de, sizeof(de), &valp->fp->f_path.dentry);
+
+    // populate output struct
+    u32 size = ctx->ax;
+    struct data_t data = {.type = type, .size = size, .delta_us = delta_us,
+        .pid = pid};
+    data.ts_us = ts / 1000;
+    data.offset = valp->offset;
+    bpf_probe_read(&data.file, sizeof(data.file), de->d_iname);
+    bpf_get_current_comm(&data.task, sizeof(data.task));
+
+    events.perf_submit(ctx, &data, sizeof(data));
+
+    return 0;
+}
+
+int trace_read_return(struct pt_regs *ctx)
+{
+    return trace_return(ctx, TRACE_READ);
+}
+
+int trace_write_return(struct pt_regs *ctx)
+{
+    return trace_return(ctx, TRACE_WRITE);
+}
+
+int trace_open_return(struct pt_regs *ctx)
+{
+    return trace_return(ctx, TRACE_OPEN);
+}
+
+int trace_fsync_return(struct pt_regs *ctx)
+{
+    return trace_return(ctx, TRACE_FSYNC);
+}
+
+"""
+
+# code replacements
+with open(kallsyms) as syms:
+    ops = ''
+    for line in syms:
+        a = line.rstrip().split()
+        (addr, name) = (a[0], a[2])
+        if name == "btrfs_file_operations":
+            ops = "0x" + addr
+            break
+    if ops == '':
+        print("ERROR: no btrfs_file_operations in /proc/kallsyms. Exiting.")
+        exit()
+    bpf_text = bpf_text.replace('BTRFS_FILE_OPERATIONS', ops)
+if min_ms == 0:
+    bpf_text = bpf_text.replace('FILTER_US', '0')
+else:
+    bpf_text = bpf_text.replace('FILTER_US',
+        'delta_us <= %s' % str(min_ms * 1000))
+if args.pid:
+    bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
+else:
+    bpf_text = bpf_text.replace('FILTER_PID', '0')
+if debug:
+    print(bpf_text)
+
+# kernel->user event data: struct data_t
+DNAME_INLINE_LEN = 32   # linux/dcache.h
+TASK_COMM_LEN = 16      # linux/sched.h
+class Data(ct.Structure):
+    _fields_ = [
+        ("ts_us", ct.c_ulonglong),
+        ("type", ct.c_ulonglong),
+        ("size", ct.c_ulonglong),
+        ("offset", ct.c_ulonglong),
+        ("delta_us", ct.c_ulonglong),
+        ("pid", ct.c_ulonglong),
+        ("task", ct.c_char * TASK_COMM_LEN),
+        ("file", ct.c_char * DNAME_INLINE_LEN)
+    ]
+
+# process event
+def print_event(cpu, data, size):
+    event = ct.cast(data, ct.POINTER(Data)).contents
+
+    type = 'R'
+    if event.type == 1:
+        type = 'W'
+    elif event.type == 2:
+        type = 'O'
+    elif event.type == 3:
+        type = 'S'
+
+    if (csv):
+        print("%d,%s,%d,%s,%d,%d,%d,%s" % (
+            event.ts_us, event.task, event.pid, type, event.size,
+            event.offset, event.delta_us, event.file))
+        return
+    print("%-8s %-14.14s %-6s %1s %-7s %-8d %7.2f %s" % (strftime("%H:%M:%S"),
+        event.task, event.pid, type, event.size, event.offset / 1024,
+        float(event.delta_us) / 1000, event.file))
+
+# initialize BPF
+b = BPF(text=bpf_text)
+
+# Common file functions. See earlier comment about generic_*().
+b.attach_kprobe(event="generic_file_read_iter", fn_name="trace_read_entry")
+b.attach_kprobe(event="btrfs_file_write_iter", fn_name="trace_write_entry")
+b.attach_kprobe(event="generic_file_open", fn_name="trace_open_entry")
+b.attach_kprobe(event="btrfs_sync_file", fn_name="trace_fsync_entry")
+b.attach_kretprobe(event="generic_file_read_iter", fn_name="trace_read_return")
+b.attach_kretprobe(event="btrfs_file_write_iter", fn_name="trace_write_return")
+b.attach_kretprobe(event="generic_file_open", fn_name="trace_open_return")
+b.attach_kretprobe(event="btrfs_sync_file", fn_name="trace_fsync_return")
+
+# header
+if (csv):
+    print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
+else:
+    if min_ms == 0:
+        print("Tracing btrfs operations")
+    else:
+        print("Tracing btrfs operations slower than %d ms" % min_ms)
+    print("%-8s %-14s %-6s %1s %-7s %-8s %7s %s" % ("TIME", "COMM", "PID", "T",
+        "BYTES", "OFF_KB", "LAT(ms)", "FILENAME"))
+
+# read events
+b["events"].open_perf_buffer(print_event)
+while 1:
+    b.kprobe_poll()
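The way print_event() maps the raw perf buffer bytes onto the Data structure can be tried without a kernel or bcc. The following is a minimal sketch, not part of the commit: decode_event() and the hand-packed sample record are hypothetical, and it assumes the kernel-side struct data_t has the same field order and sizes as the Data class above.

```python
import ctypes as ct
import struct

DNAME_INLINE_LEN = 32   # linux/dcache.h
TASK_COMM_LEN = 16      # linux/sched.h

class Data(ct.Structure):
    # Same layout as the Data class in btrfsslower.py.
    _fields_ = [
        ("ts_us", ct.c_ulonglong),
        ("type", ct.c_ulonglong),
        ("size", ct.c_ulonglong),
        ("offset", ct.c_ulonglong),
        ("delta_us", ct.c_ulonglong),
        ("pid", ct.c_ulonglong),
        ("task", ct.c_char * TASK_COMM_LEN),
        ("file", ct.c_char * DNAME_INLINE_LEN),
    ]

# Event type codes as interpreted by print_event().
TYPE_NAMES = {0: 'R', 1: 'W', 2: 'O', 3: 'S'}

def decode_event(raw):
    """Hypothetical helper: decode one raw event (bytes) into a dict."""
    event = Data.from_buffer_copy(raw)
    return {
        "comm": event.task.decode("utf-8", "replace"),
        "pid": event.pid,
        "type": TYPE_NAMES.get(event.type, '?'),
        "bytes": event.size,
        "off_kb": event.offset // 1024,
        "lat_ms": event.delta_us / 1000.0,
        "file": event.file.decode("utf-8", "replace"),
    }

# Hand-packed sample record (six 64-bit fields, then the two char arrays),
# standing in for what the perf buffer would deliver.
sample = struct.pack("=6Q16s32s", 8930665366, 0, 8192, 230391808, 4312, 2717,
                     b"randread.pl", b"data1")
decoded = decode_event(sample)
print(decoded)
```

Using from_buffer_copy() here instead of ct.cast() is a choice made for the sketch, since it works directly on a bytes object; the tool itself receives a pointer from bcc and casts it.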
--- a/tools/btrfsslower_example.txt
+++ b/tools/btrfsslower_example.txt
+Demonstrations of btrfsslower, the Linux eBPF/bcc version.
+
+
+btrfsslower shows btrfs reads, writes, opens, and fsyncs, slower than a
+threshold. For example:
+
+# ./btrfsslower
+Tracing btrfs operations slower than 10 ms
+TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
+01:22:03 randread.pl    13602  R 8192    391384     10.40 data1
+01:22:03 randread.pl    13602  R 8192    92632      10.41 data1
+01:22:06 randread.pl    13602  R 8192    199800     17.33 data1
+01:22:06 randread.pl    13602  R 8192    415160     17.21 data1
+01:22:07 randread.pl    13602  R 8192    729984     11.93 data1
+01:22:09 randread.pl    13602  R 8192    342784     11.90 data1
+[...]
+
+This shows several reads by a "randread.pl" program, each 8 Kbytes in size,
+all from a file named "data1". Each had over 10 ms latency.
+
+This "latency" is measured from when the operation was issued from the VFS
+interface to the file system, to when it completed. This spans everything:
+block device I/O (disk I/O), file system CPU cycles, file system locks, run
+queue latency, etc. This is a better measure of the latency suffered by
+applications reading from the file system than measuring this down at the
+block device interface.
+
+Note that this only traces the common file system operations previously
+listed: other file system operations (eg, inode operations including
+getattr()) are not traced.
+
+
+The threshold can be provided as an argument. Eg, I/O slower than 1 ms:
+
+# ./btrfsslower 1
+Tracing btrfs operations slower than 1 ms
+TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
+03:26:54 randread.pl    30578  R 8192    214864      1.87 data1
+03:26:54 randread.pl    30578  R 8192    267600      1.48 data1
+03:26:54 randread.pl    30578  R 8192    704200      1.30 data1
+03:26:54 randread.pl    30578  R 8192    492352      3.09 data1
+03:26:55 randread.pl    30578  R 8192    319448      1.34 data1
+03:26:55 randread.pl    30578  R 8192    676032      1.88 data1
+03:26:55 randread.pl    30578  R 8192    646712      2.24 data1
+03:26:55 randread.pl    30578  R 8192    124376      1.02 data1
+03:26:55 randread.pl    30578  R 8192    223064      2.64 data1
+03:26:55 randread.pl    30578  R 8192    521280      1.55 data1
+03:26:55 randread.pl    30578  R 8192    272992      2.48 data1
+03:26:55 randread.pl    30578  R 8192    450112      2.67 data1
+03:26:55 randread.pl    30578  R 8192    361808      1.78 data1
+03:26:55 randread.pl    30578  R 8192    41088       1.46 data1
+03:26:55 randread.pl    30578  R 8192    756576      1.67 data1
+03:26:55 randread.pl    30578  R 8192    711776      2.74 data1
+03:26:55 randread.pl    30578  R 8192    129472      1.34 data1
+03:26:55 randread.pl    30578  R 8192    526928      1.82 data1
+03:26:56 randread.pl    30578  R 8192    312768      1.44 data1
+03:26:56 randread.pl    30578  R 8192    34720       1.14 data1
+03:26:56 randread.pl    30578  R 8192    258376      1.13 data1
+03:26:56 randread.pl    30578  R 8192    308456      1.44 data1
+03:26:56 randread.pl    30578  R 8192    759656      1.27 data1
+03:26:56 randread.pl    30578  R 8192    387424      3.24 data1
+03:26:56 randread.pl    30578  R 8192    168864      3.38 data1
+03:26:56 randread.pl    30578  R 8192    699296      1.38 data1
+03:26:56 randread.pl    30578  R 8192    405688      2.37 data1
+03:26:56 randread.pl    30578  R 8192    559064      1.18 data1
+03:26:56 randread.pl    30578  R 8192    264808      1.13 data1
+03:26:56 randread.pl    30578  R 8192    369240      2.20 data1
+[...]
+
+There's now much more output (this spans less than 3 seconds, whereas the
+previous output spanned 6 seconds), as the lower threshold is catching more I/O.
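The threshold handling can be sketched in plain Python. btrfsslower.py substitutes a C expression for the FILTER_US placeholder in its BPF program text. The sketch below reproduces that substitution and adds a Python equivalent of the resulting test. Both function names are hypothetical, and is_traced() assumes the BPF program drops an event whenever the substituted expression is true.

```python
def latency_filter(min_ms):
    """Return the C expression substituted for FILTER_US (hypothetical helper)."""
    if min_ms == 0:
        return '0'          # never true: nothing is filtered out
    return 'delta_us <= %s' % str(min_ms * 1000)

def is_traced(delta_us, min_ms):
    """Python equivalent of the filter, assuming a true expression drops the event."""
    return min_ms == 0 or delta_us > min_ms * 1000

print(latency_filter(10))       # expression used for the default threshold
print(is_traced(10400, 10))     # the 10.40 ms read from the first example
print(is_traced(1870, 10))      # a 1.87 ms read, only visible with "btrfsslower 1"
```

Under this assumption, an operation that takes exactly the threshold (eg, 10000 us with a 10 ms threshold) is not printed: only strictly slower operations are.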
+
+
+A threshold of 0 will trace all operations. Warning: the output will be
+verbose, as it will include all file system cache hits.
+
+# ./btrfsslower 0
+Tracing btrfs operations
+TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
+03:28:17 bash           32597  O 0       0           0.00 date.txt
+03:28:17 date           32597  W 29      0           0.02 date.txt
+03:28:23 cksum          32743  O 0       0           0.00 date.txt
+03:28:23 cksum          32743  R 29      0           0.01 date.txt
+03:28:23 cksum          32743  R 0       0           0.00 date.txt
+
+While tracing, the following commands were run in another window:
+
+# date > date.txt
+# cksum date.txt
+
+The output of btrfsslower now includes open operations ("O") and writes ("W").
+The first read from cksum(1) returned 29 bytes, and the second returned 0,
+causing cksum(1) to stop reading.
+
+
+The -j option prints just the fields as comma-separated values (parsable
+output):
+
+# ./btrfsslower -j 1
+ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
+8930665366,randread.pl,2717,R,8192,230391808,4312,data1
+8930670746,randread.pl,2717,R,8192,347832320,1296,data1
+8930675995,randread.pl,2717,R,8192,409812992,4207,data1
+8930680213,randread.pl,2717,R,8192,498204672,3104,data1
+8930685970,randread.pl,2717,R,8192,553164800,1843,data1
+8930687568,randread.pl,2717,R,8192,339492864,1475,data1
+8930694108,randread.pl,2717,R,8192,500711424,6276,data1
+8930697139,randread.pl,2717,R,8192,485801984,2180,data1
+8930705755,randread.pl,2717,R,8192,376922112,7535,data1
+8930711340,randread.pl,2717,R,8192,380084224,3314,data1
+8930740964,randread.pl,2717,R,8192,226091008,24762,data1
+8930743169,randread.pl,2717,R,8192,361570304,1809,data1
+8930748789,randread.pl,2717,R,8192,346931200,1530,data1
+8930763514,randread.pl,2717,R,8192,59719680,13938,data1
+8930764870,randread.pl,2717,R,8192,406511616,1313,data1
+8930774327,randread.pl,2717,R,8192,661430272,7361,data1
+8930780360,randread.pl,2717,R,8192,406904832,2220,data1
+8930785736,randread.pl,2717,R,8192,523419648,2005,data1
+8930794560,randread.pl,2717,R,8192,342974464,8388,data1
+[...]
+
+This may be useful for visualizing with another tool, for example, for
+producing a scatter plot of ENDTIME vs LATENCY, to look for time-based
+patterns.
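As a starting point for such a visualization, the CSV output can be turned into (time, latency) pairs with the Python standard library. This is a minimal sketch, not part of the tool: load_points() is a hypothetical helper, SAMPLE holds three rows copied from the output above, and the plotting step itself is left to whichever tool is preferred.

```python
import csv
import io

# Header plus three rows copied from the -j output above.
SAMPLE = """\
ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
8930665366,randread.pl,2717,R,8192,230391808,4312,data1
8930670746,randread.pl,2717,R,8192,347832320,1296,data1
8930740964,randread.pl,2717,R,8192,226091008,24762,data1
"""

def load_points(text):
    """Return (end_time_s, latency_ms) pairs, with time relative to the first row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return []
    t0 = int(rows[0]["ENDTIME_us"])
    return [((int(r["ENDTIME_us"]) - t0) / 1e6, int(r["LATENCY_us"]) / 1e3)
            for r in rows]

points = load_points(SAMPLE)
for end_s, lat_ms in points:
    print("%.6f s  %.3f ms" % (end_s, lat_ms))
```

One caveat: the tool prints its fields without any CSV quoting, so a task or file name that contains a comma would shift the columns and needs special handling.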
+
+
+USAGE message:
+
+# ./btrfsslower -h
+usage: btrfsslower [-h] [-j] [-p PID] [min_ms]
+
+Trace common btrfs file operations slower than a threshold
+
+positional arguments:
+  min_ms             minimum I/O duration to trace, in ms (default 10)
+
+optional arguments:
+  -h, --help         show this help message and exit
+  -j, --csv          just print fields: comma-separated values
+  -p PID, --pid PID  trace this PID only
+
+examples:
+    ./btrfsslower             # trace operations slower than 10 ms (default)
+    ./btrfsslower 1           # trace operations slower than 1 ms
+    ./btrfsslower -j 1        # ... 1 ms, parsable output (csv)
+    ./btrfsslower 0           # trace all operations (warning: verbose)
+    ./btrfsslower -p 185      # trace PID 185 only