Commit b712c66a authored by 4ast's avatar 4ast

Merge pull request #383 from brendangregg/master

3 tools: tcpretrans, zfsslower, zfsdist
parents fb7ae106 157fee3f
@@ -72,10 +72,10 @@ Tools:
- tools/[biosnoop](tools/biosnoop.py): Trace block device I/O with PID and latency. [Examples](tools/biosnoop_example.txt).
- tools/[bitesize](tools/bitesize.py): Show per process I/O size histogram. [Examples](tools/bitesize_example.txt).
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[ext4dist](tools/ext4dist.py): Summarize ext4 operation latency. [Examples](tools/ext4dist_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
- tools/[ext4dist](tools/ext4dist.py): Summarize ext4 operation latency distribution as a histogram. [Examples](tools/ext4dist_example.txt).
- tools/[ext4slower](tools/ext4slower.py): Trace slow ext4 operations. [Examples](tools/ext4slower_example.txt).
- tools/[filelife](tools/filelife.py): Trace the lifespan of short-lived files. [Examples](tools/filelife_example.txt).
- tools/[fileslower](tools/fileslower.py): Trace slow synchronous file reads and writes. [Examples](tools/fileslower_example.txt).
@@ -85,7 +85,7 @@ Tools:
- tools/[gethostlatency](tools/gethostlatency.py): Show latency for getaddrinfo/gethostbyname[2] calls. [Examples](tools/gethostlatency_example.txt).
- tools/[hardirqs](tools/hardirqs.py): Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt).
- tools/[killsnoop](tools/killsnoop.py): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
- tools/[mdflush](tools/mdflush.py): Trace md flush events. [Examples](tools/mdflush.txt).
- tools/[mdflush](tools/mdflush.py): Trace md flush events. [Examples](tools/mdflush_example.txt).
- tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_example.txt).
- tools/[offcputime](tools/offcputime.py): Summarize off-CPU time by kernel stack trace. [Examples](tools/offcputime_example.txt).
- tools/[offwaketime](tools/offwaketime.py): Summarize blocked time by kernel off-CPU stack and waker stack. [Examples](tools/offwaketime_example.txt).
@@ -100,11 +100,14 @@ Tools:
- tools/[syncsnoop](tools/syncsnoop.py): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
- tools/[tcpaccept](tools/tcpaccept.py): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
- tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
- tools/[tcpretrans](tools/tcpretrans.py): Trace TCP retransmits and TLPs. [Examples](tools/tcpretrans_example.txt).
- tools/[vfscount](tools/vfscount.py) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
- tools/[vfsstat](tools/vfsstat.py) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).
- tools/[wakeuptime](tools/wakeuptime.py): Summarize sleep to wakeup time by waker kernel stack. [Examples](tools/wakeuptime_example.txt).
- tools/[xfsdist](tools/xfsdist.py): Summarize XFS operation latency. [Examples](tools/xfsdist_example.txt).
- tools/[xfsdist](tools/xfsdist.py): Summarize XFS operation latency distribution as a histogram. [Examples](tools/xfsdist_example.txt).
- tools/[xfsslower](tools/xfsslower.py): Trace slow XFS operations. [Examples](tools/xfsslower_example.txt).
- tools/[zfsdist](tools/zfsdist.py): Summarize ZFS operation latency distribution as a histogram. [Examples](tools/zfsdist_example.txt).
- tools/[zfsslower](tools/zfsslower.py): Trace slow ZFS operations. [Examples](tools/zfsslower_example.txt).
### Networking
......
.TH tcpretrans 8 "2016-02-14" "USER COMMANDS"
.SH NAME
tcpretrans \- Trace TCP retransmits and TLPs. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B tcpretrans [\-h] [\-l]
.SH DESCRIPTION
This traces TCP retransmits, showing address, port, and TCP state information,
and sometimes the PID (although usually not, since retransmits are usually
sent by the kernel on timeouts). To keep overhead very low, only
the TCP retransmit functions are traced. This does not trace every packet
(like tcpdump(8) or a packet sniffer).
This uses dynamic tracing of the kernel tcp_retransmit_skb() and
tcp_send_loss_probe() functions, and will need to be updated to
match kernel changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-l
Include tail loss probe attempts (in some cases the kernel may not
complete the TLP send).
.SH EXAMPLES
.TP
Trace TCP retransmits:
#
.B tcpretrans
.TP
Trace TCP retransmits and TLP attempts:
#
.B tcpretrans \-l
.SH FIELDS
.TP
TIME
Time of the retransmit.
.TP
PID
Process ID that was on-CPU. This is less useful than it might sound, as it
may usually be 0, for the kernel, for timer-based retransmits.
.TP
IP
IP address family (4 or 6).
.TP
LADDR
Local IP address. IPv4 as a dotted quad, IPv6 shows "..." then the last 4
bytes (check for newer versions of this tool for the full address).
.TP
LPORT
Local port.
.TP
T>
Type of event: R> == retransmit, L> == tail loss probe.
.TP
RADDR
Remote IP address. IPv4 as a dotted quad, IPv6 shows "..." then the last 4
bytes (check for newer versions of this tool for the full address).
.TP
RPORT
Remote port.
.TP
STATE
TCP session state.
.SH OVERHEAD
Should be negligible: TCP retransmit events should occur at a low rate
(<1000/s), and this tool adds only low overhead to each event.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
tcpconnect(8), tcpaccept(8)
.TH zfsdist 8 "2016-02-12" "USER COMMANDS"
.SH NAME
zfsdist \- Summarize ZFS operation latency. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B zfsdist [\-h] [\-T] [\-m] [\-p PID] [interval] [count]
.SH DESCRIPTION
This tool summarizes time (latency) spent in common ZFS file operations: reads,
writes, opens, and syncs, and presents it as a power-of-2 histogram. It uses an
in-kernel eBPF map to store the histogram for efficiency.
This uses kernel dynamic tracing of the ZPL interface (ZFS POSIX
Layer), and will need updates to match any changes to this interface.
.TP
This is intended to work with the ZFS on Linux project:
http://zfsonlinux.org
.PP
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-T
Don't include timestamps on interval output.
.TP
\-m
Output in milliseconds.
.TP
\-p PID
Trace this PID only.
.TP
interval
Output interval, in seconds.
.TP
count
Number of outputs.
.SH EXAMPLES
.TP
Trace ZFS operation time, and print a summary on Ctrl-C:
#
.B zfsdist
.TP
Trace PID 181 only:
#
.B zfsdist -p 181
.TP
Print 1 second summaries, 10 times:
#
.B zfsdist 1 10
.TP
1 second summaries, printed in milliseconds:
#
.B zfsdist \-m 1
.SH FIELDS
.TP
msecs
Range of milliseconds for this bucket.
.TP
usecs
Range of microseconds for this bucket.
.TP
count
Number of operations in this time range.
.TP
distribution
ASCII representation of the distribution (the count column).
.SH OVERHEAD
This adds low-overhead instrumentation to these ZFS operations,
including reads and writes from the file system cache. Such reads and writes
can be very frequent (depending on the workload; eg, 1M/sec), at which
point the overhead of this tool may become noticeable.
Measure and quantify before use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
zfsslower(8)
.TH zfsslower 8 "2016-02-11" "USER COMMANDS"
.SH NAME
zfsslower \- Trace slow ZFS file operations, with per-event details.
.SH SYNOPSIS
.B zfsslower [\-h] [\-j] [\-p PID] [min_ms]
.SH DESCRIPTION
This tool traces common ZFS file operations: reads, writes, opens, and
syncs. It measures the time spent in these operations, and prints details
for each that exceeded a threshold.
WARNING: See the OVERHEAD section.
By default, a minimum millisecond threshold of 10 is used. If a threshold of 0
is used, all events are printed (warning: verbose).
This uses kernel dynamic tracing of the ZPL interface (ZFS POSIX
Layer), and will need updates to match any changes to this interface.
.TP
This is intended to work with the ZFS on Linux project:
http://zfsonlinux.org
.PP
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-j
Just print fields: comma-separated values.
.TP
\-p PID
Trace this PID only.
.TP
min_ms
Minimum I/O latency (duration) to trace, in milliseconds. Default is 10 ms.
.SH EXAMPLES
.TP
Trace synchronous file reads and writes slower than 10 ms:
#
.B zfsslower
.TP
Trace slower than 1 ms:
#
.B zfsslower 1
.TP
Trace slower than 1 ms, and output just the fields in parsable format (csv):
#
.B zfsslower \-j 1
.TP
Trace all file reads and writes (warning: the output will be verbose):
#
.B zfsslower 0
.TP
Trace slower than 1 ms, for PID 181 only:
#
.B zfsslower \-p 181 1
.SH FIELDS
.TP
TIME(s)
Time of I/O completion since the first I/O seen, in seconds.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
T
Type of operation. R == read, W == write, O == open, S == fsync.
.TP
OFF_KB
File offset for the I/O, in Kbytes.
.TP
BYTES
Size of I/O, in bytes.
.TP
LAT(ms)
Latency (duration) of I/O, measured from when it was issued by VFS to the
filesystem, to when it completed. This time is inclusive of block device I/O,
file system CPU cycles, file system locks, run queue latency, etc. It's a more
accurate measure of the latency suffered by applications performing file
system I/O, than to measure this down at the block device interface.
.TP
FILENAME
A cached kernel file name (comes from dentry->d_iname).
.TP
ENDTIME_us
Completion timestamp, microseconds (\-j only).
.TP
OFFSET_b
File offset, bytes (\-j only).
.TP
LATENCY_us
Latency (duration) of the I/O, in microseconds (\-j only).
.SH OVERHEAD
This adds low-overhead instrumentation to these ZFS operations,
including reads and writes from the file system cache. Such reads and writes
can be very frequent (depending on the workload; eg, 1M/sec), at which
point the overhead of this tool (even if it prints no "slower" events) can
begin to become significant. Measure and quantify before use. If this
continues to be a problem, consider switching to a tool that prints in-kernel
summaries only.
.PP
Note that the overhead of this tool should be less than fileslower(8), as
this tool targets zfs functions only, and not all file read/write paths
(which can include socket I/O).
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
biosnoop(8), funccount(8), fileslower(8)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# tcpretrans Trace TCP retransmits and TLPs.
# For Linux, uses BCC, eBPF. Embedded C.
#
# USAGE: tcpretrans [-h] [-l]
#
# This uses dynamic tracing of kernel functions, and will need to be updated
# to match kernel changes.
#
# IPv4 addresses are printed as dotted quads. For IPv6 addresses, the last four
# bytes are printed after "..."; check for future versions with better IPv6
# support.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 14-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct
# arguments
examples = """examples:
./tcpretrans # trace TCP retransmits
./tcpretrans -l # include TLP attempts
"""
parser = argparse.ArgumentParser(
description="Trace TCP retransmits",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-l", "--lossprobe", action="store_true",
help="include tail loss probe attempts")
args = parser.parse_args()
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#define RETRANSMIT 1
#define TLP 2
// separate data structs for ipv4 and ipv6
struct ipv4_data_t {
// XXX: switch some to u32's when supported
u64 pid;
u64 ip;
u64 saddr;
u64 daddr;
u64 lport;
u64 dport;
u64 state;
u64 type;
};
BPF_PERF_OUTPUT(ipv4_events);
struct ipv6_data_t {
// XXX: update to transfer full ipv6 addrs
u64 pid;
u64 ip;
u64 saddr;
u64 daddr;
u64 lport;
u64 dport;
u64 state;
u64 type;
};
BPF_PERF_OUTPUT(ipv6_events);
static int trace_event(struct pt_regs *ctx, struct sock *sk, int type)
{
if (sk == NULL)
return 0;
u32 pid = bpf_get_current_pid_tgid();
struct sock *skp = NULL;
bpf_probe_read(&skp, sizeof(skp), &sk);
// pull in details
u16 family = 0, lport = 0, dport = 0;
char state = 0;
bpf_probe_read(&family, sizeof(family), &skp->__sk_common.skc_family);
bpf_probe_read(&lport, sizeof(lport), &skp->__sk_common.skc_num);
bpf_probe_read(&dport, sizeof(dport), &skp->__sk_common.skc_dport);
bpf_probe_read(&state, sizeof(state), (void *)&skp->__sk_common.skc_state);
if (family == AF_INET) {
struct ipv4_data_t data4 = {.pid = pid, .ip = 4, .type = type};
bpf_probe_read(&data4.saddr, sizeof(u32),
&skp->__sk_common.skc_rcv_saddr);
bpf_probe_read(&data4.daddr, sizeof(u32),
&skp->__sk_common.skc_daddr);
data4.lport = lport;
data4.dport = dport;
data4.state = state;
ipv4_events.perf_submit(ctx, &data4, sizeof(data4));
} else if (family == AF_INET6) {
struct ipv6_data_t data6 = {.pid = pid, .ip = 6, .type = type};
// just grab the last 4 bytes for now
u32 saddr = 0, daddr = 0;
bpf_probe_read(&saddr, sizeof(saddr),
&skp->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32[3]);
bpf_probe_read(&daddr, sizeof(daddr),
&skp->__sk_common.skc_v6_daddr.in6_u.u6_addr32[3]);
data6.saddr = bpf_ntohl(saddr);
data6.daddr = bpf_ntohl(daddr);
data6.lport = lport;
data6.dport = dport;
data6.state = state;
ipv6_events.perf_submit(ctx, &data6, sizeof(data6));
}
// else drop
return 0;
}
int trace_retransmit(struct pt_regs *ctx, struct sock *sk)
{
trace_event(ctx, sk, RETRANSMIT);
return 0;
}
int trace_tlp(struct pt_regs *ctx, struct sock *sk)
{
trace_event(ctx, sk, TLP);
return 0;
}
"""
# event data
TASK_COMM_LEN = 16 # linux/sched.h
class Data_ipv4(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ip", ct.c_ulonglong),
("saddr", ct.c_ulonglong),
("daddr", ct.c_ulonglong),
("lport", ct.c_ulonglong),
("dport", ct.c_ulonglong),
("state", ct.c_ulonglong),
("type", ct.c_ulonglong)
]
class Data_ipv6(ct.Structure):
_fields_ = [
("pid", ct.c_ulonglong),
("ip", ct.c_ulonglong),
("saddr", ct.c_ulonglong),
("daddr", ct.c_ulonglong),
("lport", ct.c_ulonglong),
("dport", ct.c_ulonglong),
("state", ct.c_ulonglong),
("type", ct.c_ulonglong)
]
# from bpf_text:
type = {}
type[1] = 'R'
type[2] = 'L'
def inet_ntoa(addr):
dq = ''
for i in range(0, 4):
dq = dq + str(addr & 0xff)
if (i != 3):
dq = dq + '.'
addr = addr >> 8
return dq
# from include/net/tcp_states.h:
tcpstate = {}
tcpstate[1] = 'ESTABLISHED'
tcpstate[2] = 'SYN_SENT'
tcpstate[3] = 'SYN_RECV'
tcpstate[4] = 'FIN_WAIT1'
tcpstate[5] = 'FIN_WAIT2'
tcpstate[6] = 'TIME_WAIT'
tcpstate[7] = 'CLOSE'
tcpstate[8] = 'CLOSE_WAIT'
tcpstate[9] = 'LAST_ACK'
tcpstate[10] = 'LISTEN'
tcpstate[11] = 'CLOSING'
tcpstate[12] = 'NEW_SYN_RECV'
# process event
def print_ipv4_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data_ipv4)).contents
print("%-8s %-6d %-2d %-20s %1s> %-20s %s" % (
strftime("%H:%M:%S"), event.pid, event.ip,
"%s:%s" % (inet_ntoa(event.saddr), event.lport),
type[event.type],
"%s:%s" % (inet_ntoa(event.daddr), event.dport),
tcpstate[event.state]))
def print_ipv6_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data_ipv6)).contents
    print("%-8s %-6d %-2d %-20s %1s> %-20s %s" % (
strftime("%H:%M:%S"), event.pid, event.ip,
"...%x:%d" % (event.saddr, event.lport),
type[event.type],
"...%x:%d" % (event.daddr, event.dport),
tcpstate[event.state]))
# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")
b.attach_kprobe(event="tcp_send_loss_probe", fn_name="trace_tlp")
# header
print("%-8s %-6s %-2s %-20s %1s> %-20s %-4s" % ("TIME", "PID", "IP",
"LADDR:LPORT", "T", "RADDR:RPORT", "STATE"))
# read events
b["ipv4_events"].open_perf_buffer(print_ipv4_event)
b["ipv6_events"].open_perf_buffer(print_ipv6_event)
while 1:
b.kprobe_poll()
Demonstrations of tcpretrans, the Linux eBPF/bcc version.
This tool traces the kernel TCP retransmit function to show details of these
retransmits. For example:
# ./tcpretrans
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:05 0 4 10.153.223.157:22 R> 69.53.245.40:34619 ESTABLISHED
01:55:17 0 4 10.153.223.157:22 R> 69.53.245.40:22957 ESTABLISHED
[...]
This output shows three TCP retransmits; the first two were for an IPv4
connection from 10.153.223.157 port 22 to 69.53.245.40 port 34619. The TCP
state was "ESTABLISHED" at the time of the retransmit. The on-CPU PID at the
time of the retransmit is printed, in this case 0 (the kernel, which will
be the case most of the time).
Retransmits are usually a sign of poor network health, and this tool is
useful for their investigation. Unlike using tcpdump, this tool has very
low overhead, as it only traces the retransmit function. It also prints
additional kernel details: the state of the TCP session at the time of the
retransmit.
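The dotted-quad addresses in the LADDR and RADDR columns are decoded from the
raw 32-bit value the tool reads from the kernel socket struct. A minimal
standalone sketch of that byte-by-byte decode (mirroring the script's own
inet_ntoa() helper):

```python
def inet_ntoa(addr):
    # Decode a 32-bit IPv4 address as read from the kernel socket
    # (least significant byte is the first octet) into a dotted quad.
    octets = []
    for _ in range(4):
        octets.append(str(addr & 0xff))
        addr >>= 8
    return '.'.join(octets)

print(inet_ntoa(0x0100007f))  # 127.0.0.1 (loopback)
```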
A -l option will include TCP tail loss probe attempts:
# ./tcpretrans -l
TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
01:55:45 0 4 10.153.223.157:22 R> 69.53.245.40:51601 ESTABLISHED
01:55:46 0 4 10.153.223.157:22 R> 69.53.245.40:51601 ESTABLISHED
01:55:46 0 4 10.153.223.157:22 R> 69.53.245.40:51601 ESTABLISHED
01:55:53 0 4 10.153.223.157:22 L> 69.53.245.40:46444 ESTABLISHED
01:56:06 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:06 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 1938 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
01:56:08 0 4 10.153.223.157:22 R> 69.53.245.40:46444 ESTABLISHED
[...]
See the "L>" in the "T>" column. These are attempts: the kernel probably
sent a TLP, but in some cases it might not have been ultimately sent.
USAGE message:
# ./tcpretrans -h
usage: tcpretrans [-h] [-l]
Trace TCP retransmits
optional arguments:
-h, --help show this help message and exit
-l, --lossprobe include tail loss probe attempts
examples:
./tcpretrans # trace TCP retransmits
./tcpretrans -l # include TLP attempts
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# zfsdist Summarize ZFS operation latency.
# For Linux, uses BCC, eBPF.
#
# USAGE: zfsdist [-h] [-T] [-m] [-p PID] [interval] [count]
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 14-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse
# arguments
examples = """examples:
./zfsdist # show operation latency as a histogram
./zfsdist -p 181 # trace PID 181 only
./zfsdist 1 10 # print 1 second summaries, 10 times
./zfsdist -m 5 # 5s summaries, milliseconds
"""
parser = argparse.ArgumentParser(
description="Summarize ZFS operation latency",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-T", "--notimestamp", action="store_true",
help="don't include timestamp on interval output")
parser.add_argument("-m", "--milliseconds", action="store_true",
help="output in milliseconds")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("interval", nargs="?",
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
pid = args.pid
countdown = int(args.count)
if args.milliseconds:
factor = 1000000
label = "msecs"
else:
factor = 1000
label = "usecs"
if args.interval and int(args.interval) == 0:
print("ERROR: interval 0. Exiting.")
exit()
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#define OP_NAME_LEN 8
typedef struct dist_key {
char op[OP_NAME_LEN];
u64 slot;
} dist_key_t;
BPF_HASH(start, u32);
BPF_HISTOGRAM(dist, dist_key_t);
// time operation
int trace_entry(struct pt_regs *ctx)
{
u32 pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
static int trace_return(struct pt_regs *ctx, const char *op)
{
u64 *tsp;
u32 pid = bpf_get_current_pid_tgid();
// fetch timestamp and calculate delta
tsp = start.lookup(&pid);
if (tsp == 0) {
return 0; // missed start or filtered
}
u64 delta = (bpf_ktime_get_ns() - *tsp) / FACTOR;
// store as histogram
dist_key_t key = {.slot = bpf_log2l(delta)};
__builtin_memcpy(&key.op, op, sizeof(key.op));
dist.increment(key);
start.delete(&pid);
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
char *op = "read";
return trace_return(ctx, op);
}
int trace_write_return(struct pt_regs *ctx)
{
char *op = "write";
return trace_return(ctx, op);
}
int trace_open_return(struct pt_regs *ctx)
{
char *op = "open";
return trace_return(ctx, op);
}
int trace_fsync_return(struct pt_regs *ctx)
{
char *op = "fsync";
return trace_return(ctx, op);
}
"""
bpf_text = bpf_text.replace('FACTOR', str(factor))
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
print(bpf_text)
# load BPF program
b = BPF(text=bpf_text)
# common file functions
b.attach_kprobe(event="zpl_read", fn_name="trace_entry")
b.attach_kprobe(event="zpl_write", fn_name="trace_entry")
b.attach_kprobe(event="zpl_open", fn_name="trace_entry")
b.attach_kprobe(event="zpl_fsync", fn_name="trace_entry")
b.attach_kretprobe(event="zpl_read", fn_name="trace_read_return")
b.attach_kretprobe(event="zpl_write", fn_name="trace_write_return")
b.attach_kretprobe(event="zpl_open", fn_name="trace_open_return")
b.attach_kretprobe(event="zpl_fsync", fn_name="trace_fsync_return")
print("Tracing ZFS operation latency... Hit Ctrl-C to end.")
# output
exiting = 0
dist = b.get_table("dist")
while (1):
try:
if args.interval:
sleep(int(args.interval))
else:
sleep(99999999)
except KeyboardInterrupt:
exiting = 1
print()
if args.interval and (not args.notimestamp):
print(strftime("%H:%M:%S:"))
dist.print_log2_hist(label, "operation")
dist.clear()
countdown -= 1
if exiting or countdown == 0:
exit()
Demonstrations of zfsdist, the Linux eBPF/bcc version.
zfsdist traces ZFS reads, writes, opens, and fsyncs, and summarizes their
latency as a power-of-2 histogram. It has been written to work on ZFS on Linux
(http://zfsonlinux.org). For example:
# ./zfsdist
Tracing ZFS operation latency... Hit Ctrl-C to end.
^C
operation = 'read'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 4479 |****************************************|
8 -> 15 : 1028 |********* |
16 -> 31 : 14 | |
32 -> 63 : 1 | |
64 -> 127 : 2 | |
128 -> 255 : 6 | |
256 -> 511 : 1 | |
512 -> 1023 : 1256 |*********** |
1024 -> 2047 : 9 | |
2048 -> 4095 : 1 | |
4096 -> 8191 : 2 | |
operation = 'write'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 75 |****************************************|
256 -> 511 : 11 |***** |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | |
65536 -> 131071 : 13 |****** |
131072 -> 262143 : 1 | |
operation = 'open'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 2 |****************************************|
This output shows a bimodal distribution for read latency, with a faster
mode of around 5 thousand reads that took between 4 and 15 microseconds, and a
slower mode of 1256 reads that took between 512 and 1023 microseconds. It's
likely that the faster mode was a hit from the in-memory file system cache,
and the slower mode is a read from a storage device (disk).
The write latency is also bimodal, with a faster mode between 128 and 511 us,
and the slower mode between 65 and 131 ms.
This "latency" is measured from when the operation was issued from the VFS
interface to the file system (via the ZFS POSIX layer), to when it completed.
This spans everything: block device I/O (disk I/O), file system CPU cycles,
file system locks, run queue latency, etc. This is a better measure of the
latency suffered by applications reading from the file system than measuring
this down at the block device interface.
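The power-of-2 bucket ranges printed above can be reproduced with a small
Python sketch (the helper name here is illustrative; the tool itself uses
bcc's bpf_log2l() and print_log2_hist()):

```python
def bucket_range(usecs):
    # Map a latency value to the (low, high) microsecond range of the
    # power-of-2 histogram bucket it lands in, as printed by zfsdist.
    slot = max(usecs.bit_length() - 1, 0)
    if slot == 0:
        return (0, 1)
    return (1 << slot, (1 << (slot + 1)) - 1)

print(bucket_range(600))  # a ~600 us read lands in the 512 -> 1023 bucket
```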
Note that this only traces the common file system operations previously
listed: other file system operations (eg, inode operations including
getattr()) are not traced.
An optional interval and a count can be provided, as well as -m to show the
distributions in milliseconds. For example:
# ./zfsdist 1 5
Tracing ZFS operation latency... Hit Ctrl-C to end.
06:55:41:
operation = 'read'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 3976 |****************************************|
8 -> 15 : 1181 |*********** |
16 -> 31 : 18 | |
32 -> 63 : 4 | |
64 -> 127 : 17 | |
128 -> 255 : 16 | |
256 -> 511 : 0 | |
512 -> 1023 : 1275 |************ |
1024 -> 2047 : 36 | |
2048 -> 4095 : 3 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 1 | |
16384 -> 32767 : 1 | |
06:55:42:
operation = 'read'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 12751 |****************************************|
8 -> 15 : 1190 |*** |
16 -> 31 : 38 | |
32 -> 63 : 7 | |
64 -> 127 : 85 | |
128 -> 255 : 47 | |
256 -> 511 : 0 | |
512 -> 1023 : 1010 |*** |
1024 -> 2047 : 49 | |
2048 -> 4095 : 12 | |
06:55:43:
operation = 'read'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 80925 |****************************************|
8 -> 15 : 1645 | |
16 -> 31 : 251 | |
32 -> 63 : 24 | |
64 -> 127 : 16 | |
128 -> 255 : 12 | |
256 -> 511 : 0 | |
512 -> 1023 : 80 | |
1024 -> 2047 : 1 | |
06:55:44:
operation = 'read'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 81207 |****************************************|
8 -> 15 : 2075 |* |
16 -> 31 : 2005 | |
32 -> 63 : 177 | |
64 -> 127 : 3 | |
06:55:45:
operation = 'read'
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 74364 |****************************************|
8 -> 15 : 865 | |
16 -> 31 : 4960 |** |
32 -> 63 : 625 | |
64 -> 127 : 2 | |
This workload was randomly reading from a file that became cached. The slower
mode can be seen to disappear by the final summaries.
USAGE message:
# ./zfsdist -h
usage: zfsdist [-h] [-T] [-m] [-p PID] [interval] [count]
Summarize ZFS operation latency
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-T, --notimestamp don't include timestamp on interval output
-m, --milliseconds output in milliseconds
-p PID, --pid PID trace this PID only
examples:
./zfsdist # show operation latency as a histogram
./zfsdist -p 181 # trace PID 181 only
./zfsdist 1 10 # print 1 second summaries, 10 times
./zfsdist -m 5 # 5s summaries, milliseconds
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# zfsslower Trace slow ZFS operations.
# For Linux, uses BCC, eBPF.
#
# USAGE: zfsslower [-h] [-j] [-p PID] [min_ms]
#
# This script traces common ZFS file operations: reads, writes, opens, and
# syncs. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
#
# WARNING: This adds low-overhead instrumentation to these ZFS operations,
# including reads and writes from the file system cache. Such reads and writes
# can be very frequent (depending on the workload; eg, 1M/sec), at which
# point the overhead of this tool (even if it prints no "slower" events) can
# begin to become significant.
#
# This works by using kernel dynamic tracing of the ZPL interface, and will
# need updates to match any changes to this interface.
#
# By default, a minimum millisecond threshold of 10 is used.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 14-Feb-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct
# arguments
examples = """examples:
./zfsslower # trace operations slower than 10 ms (default)
./zfsslower 1 # trace operations slower than 1 ms
./zfsslower -j 1 # ... 1 ms, parsable output (csv)
./zfsslower 0 # trace all operations (warning: verbose)
./zfsslower -p 185 # trace PID 185 only
"""
parser = argparse.ArgumentParser(
description="Trace common ZFS file operations slower than a threshold",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-j", "--csv", action="store_true",
help="just print fields: comma-separated values")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("min_ms", nargs="?", default='10',
help="minimum I/O duration to trace, in ms (default 10)")
args = parser.parse_args()
min_ms = int(args.min_ms)
pid = args.pid
csv = args.csv
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
// XXX: switch these to char's when supported
#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_FSYNC 3
struct val_t {
u64 ts;
u64 offset;
struct file *fp;
};
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
u64 type;
u64 size;
u64 offset;
u64 delta_us;
u64 pid;
char task[TASK_COMM_LEN];
char file[DNAME_INLINE_LEN];
};
BPF_HASH(entryinfo, pid_t, struct val_t);
BPF_PERF_OUTPUT(events);
//
// Store timestamp and size on entry
//
// zpl_read(), zpl_write():
int trace_rw_entry(struct pt_regs *ctx, struct file *filp, char __user *buf,
size_t len, loff_t *ppos)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = filp;
val.offset = *ppos;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// zpl_open():
int trace_open_entry(struct pt_regs *ctx, struct inode *inode,
struct file *filp)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filep and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = filp;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
// zpl_fsync():
int trace_fsync_entry(struct pt_regs *ctx, struct file *filp)
{
u32 pid;
pid = bpf_get_current_pid_tgid();
if (FILTER_PID)
return 0;
// store filp and timestamp by pid
struct val_t val = {};
val.ts = bpf_ktime_get_ns();
val.fp = filp;
val.offset = 0;
if (val.fp)
entryinfo.update(&pid, &val);
return 0;
}
//
// Output
//
static int trace_return(struct pt_regs *ctx, int type)
{
struct val_t *valp;
u32 pid = bpf_get_current_pid_tgid();
valp = entryinfo.lookup(&pid);
if (valp == 0) {
// missed tracing issue or filtered
return 0;
}
// calculate delta
u64 ts = bpf_ktime_get_ns();
u64 delta_us = (ts - valp->ts) / 1000;
entryinfo.delete(&pid);
if (FILTER_US)
return 0;
// workaround (rewriter should handle file to d_iname in one step):
struct dentry *de = NULL;
bpf_probe_read(&de, sizeof(de), &valp->fp->f_path.dentry);
// populate output struct
u32 size = ctx->ax;
struct data_t data = {.type = type, .size = size, .delta_us = delta_us,
.pid = pid};
data.ts_us = ts / 1000;
data.offset = valp->offset;
bpf_probe_read(&data.file, sizeof(data.file), de->d_iname);
bpf_get_current_comm(&data.task, sizeof(data.task));
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
int trace_read_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_READ);
}
int trace_write_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_WRITE);
}
int trace_open_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_OPEN);
}
int trace_fsync_return(struct pt_regs *ctx)
{
return trace_return(ctx, TRACE_FSYNC);
}
"""
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
else:
bpf_text = bpf_text.replace('FILTER_US',
'delta_us <= %s' % str(min_ms * 1000))
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % pid)
else:
bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug:
print(bpf_text)
# kernel->user event data: struct data_t
DNAME_INLINE_LEN = 32 # linux/dcache.h
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
_fields_ = [
("ts_us", ct.c_ulonglong),
("type", ct.c_ulonglong),
("size", ct.c_ulonglong),
("offset", ct.c_ulonglong),
("delta_us", ct.c_ulonglong),
("pid", ct.c_ulonglong),
("task", ct.c_char * TASK_COMM_LEN),
("file", ct.c_char * DNAME_INLINE_LEN)
]
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
type = 'R'
if event.type == 1:
type = 'W'
elif event.type == 2:
type = 'O'
elif event.type == 3:
type = 'S'
if (csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
event.ts_us, event.task, event.pid, type, event.size,
event.offset, event.delta_us, event.file))
return
print("%-8s %-14.14s %-6s %1s %-7s %-8d %7.2f %s" % (strftime("%H:%M:%S"),
event.task, event.pid, type, event.size, event.offset / 1024,
float(event.delta_us) / 1000, event.file))
# initialize BPF
b = BPF(text=bpf_text)
# common file functions
b.attach_kprobe(event="zpl_read", fn_name="trace_rw_entry")
b.attach_kprobe(event="zpl_write", fn_name="trace_rw_entry")
b.attach_kprobe(event="zpl_open", fn_name="trace_open_entry")
b.attach_kprobe(event="zpl_fsync", fn_name="trace_fsync_entry")
b.attach_kretprobe(event="zpl_read", fn_name="trace_read_return")
b.attach_kretprobe(event="zpl_write", fn_name="trace_write_return")
b.attach_kretprobe(event="zpl_open", fn_name="trace_open_return")
b.attach_kretprobe(event="zpl_fsync", fn_name="trace_fsync_return")
# header
if (csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:
if min_ms == 0:
print("Tracing ZFS operations")
else:
print("Tracing ZFS operations slower than %d ms" % min_ms)
print("%-8s %-14s %-6s %1s %-7s %-8s %7s %s" % ("TIME", "COMM", "PID", "T",
"BYTES", "OFF_KB", "LAT(ms)", "FILENAME"))
# read events
b["events"].open_perf_buffer(print_event)
while 1:
b.kprobe_poll()
Demonstrations of zfsslower, the Linux eBPF/bcc version.
zfsslower shows ZFS reads, writes, opens, and fsyncs slower than a threshold.
It has been written to work on ZFS on Linux (http://zfsonlinux.org). For
example:
# ./zfsslower
Tracing ZFS operations slower than 10 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:31:28 dd 25570 W 131072 38784 303.92 data1
06:31:34 dd 25686 W 131072 38784 388.28 data1
06:31:35 dd 25686 W 131072 78720 519.66 data1
06:31:35 dd 25686 W 131072 116992 405.94 data1
06:31:35 dd 25686 W 131072 153600 433.52 data1
06:31:36 dd 25686 W 131072 188672 314.37 data1
06:31:36 dd 25686 W 131072 222336 372.33 data1
06:31:36 dd 25686 W 131072 254592 309.59 data1
06:31:37 dd 25686 W 131072 285440 304.52 data1
06:31:37 dd 25686 W 131072 315008 236.45 data1
06:31:37 dd 25686 W 131072 343424 193.54 data1
06:31:38 dd 25686 W 131072 370560 286.07 data1
06:31:38 dd 25686 W 131072 396672 251.92 data1
[...]
This shows writes to a "data1" file, each taking well over the 10 ms threshold.
The slowest, on the 3rd line of output, reached 519.66 ms for a 128 Kbyte
write by the "dd" command.
This "latency" is measured from when the operation was issued from the VFS
interface to the file system (via the ZFS POSIX layer), to when it completed.
This spans everything: block device I/O (disk I/O), file system CPU cycles,
file system locks, run queue latency, etc. This is a better measure of the
latency suffered by applications than measuring latency down at the block
device interface alone.
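The unit arithmetic behind the LAT(ms) column can be checked with made-up
timestamps (the values below are hypothetical, chosen to reproduce the 519.66
ms outlier shown above): the BPF program converts a nanosecond delta to
microseconds, and the printer divides by 1000 again for milliseconds.

```python
# Hypothetical entry/return timestamps in nanoseconds, as returned by
# bpf_ktime_get_ns(). Chosen so the delta matches the 519.66 ms outlier.
ts_entry = 1_000_000_000
ts_return = 1_519_660_000

delta_us = (ts_return - ts_entry) // 1000   # ns -> us, as in trace_return()
lat_ms = delta_us / 1000.0                  # us -> ms, as in print_event()
print(lat_ms)                               # 519.66
```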
Note that this only traces the common file system operations previously
listed: other file system operations (e.g., inode operations including
getattr()) are not traced.
A threshold of 0 will trace all operations. Warning: the output will be
verbose, as it will include all file system cache hits.
# ./zfsslower 0
Tracing ZFS operations
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:36:07 dd 32242 O 0 0 0.01 data1
06:36:07 dd 32242 W 131072 0 0.25 data1
06:36:07 dd 32242 W 131072 128 0.03 data1
06:36:07 dd 32242 W 131072 256 0.04 data1
06:36:07 dd 32242 W 131072 384 0.04 data1
06:36:07 dd 32242 W 131072 512 0.04 data1
06:36:07 dd 32242 W 131072 640 0.03 data1
06:36:07 dd 32242 W 131072 768 0.03 data1
06:36:07 dd 32242 W 131072 896 0.04 data1
06:36:07 dd 32242 W 131072 1024 0.28 data1
06:36:07 dd 32242 W 131072 1152 0.04 data1
06:36:07 dd 32242 W 131072 1280 0.03 data1
[...]
06:36:07 dd 32242 W 131072 13824 0.04 data1
06:36:07 dd 32242 W 131072 13952 0.04 data1
06:36:07 dd 32242 W 131072 14080 0.04 data1
06:36:07 dd 32242 W 131072 14208 398.92 data1
06:36:07 dd 32242 W 131072 14336 0.04 data1
06:36:07 dd 32242 W 131072 14464 0.04 data1
06:36:07 dd 32242 W 131072 15104 0.03 data1
[...]
The output now includes the open operation for this file ("O"), and then the
writes. Most of the writes are very fast, with only an occasional outlier that
is in the hundreds of milliseconds.
Fortunately this is not a real-world environment: I set up a zpool on top of
an XFS file system for testing purposes. More debugging using other tools
would explain these outliers: possibly XFS flushing.
Here's a random read workload, and showing operations slower than 1 ms:
# ./zfsslower 1
Tracing ZFS operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
06:47:30 randread.pl 15431 R 8192 97840 1.03 data1
06:47:30 randread.pl 15431 R 8192 416744 1.12 data1
06:47:31 randread.pl 15431 R 8192 228856 1.96 data1
06:47:31 randread.pl 15431 R 8192 452248 1.02 data1
06:47:31 randread.pl 15431 R 8192 315288 5.90 data1
06:47:31 randread.pl 15431 R 8192 752696 1.20 data1
06:47:31 randread.pl 15431 R 8192 481832 1.39 data1
06:47:31 randread.pl 15431 R 8192 673752 1.39 data1
06:47:31 randread.pl 15431 R 8192 691736 1.01 data1
06:47:31 randread.pl 15431 R 8192 694776 1.78 data1
06:47:31 randread.pl 15431 R 8192 403328 3.75 data1
06:47:31 randread.pl 15431 R 8192 567688 1.08 data1
06:47:31 randread.pl 15431 R 8192 694280 1.31 data1
06:47:31 randread.pl 15431 R 8192 669280 1.06 data1
06:47:31 randread.pl 15431 R 8192 426608 1.56 data1
06:47:31 randread.pl 15431 R 8192 42512 1.01 data1
06:47:31 randread.pl 15431 R 8192 22944 1.33 data1
06:47:31 randread.pl 15431 R 8192 427432 1.48 data1
06:47:31 randread.pl 15431 R 8192 261320 1.28 data1
06:47:31 randread.pl 15431 R 8192 132248 1.23 data1
06:47:31 randread.pl 15431 R 8192 96936 1.04 data1
06:47:31 randread.pl 15431 R 8192 482800 2.63 data1
[...]
A -j option will print just the fields (parsable output, csv):
# ./zfsslower -j 1
ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE
252305490911,randread.pl,17922,R,8192,163446784,1156,data1
252305493852,randread.pl,17922,R,8192,321437696,1129,data1
252305498839,randread.pl,17922,R,8192,475152384,1154,data1
252305505515,randread.pl,17922,R,8192,49094656,1082,data1
252305506774,randread.pl,17922,R,8192,470401024,1245,data1
252305509265,randread.pl,17922,R,8192,553246720,2412,data1
252305512365,randread.pl,17922,R,8192,20963328,1093,data1
252305513755,randread.pl,17922,R,8192,304111616,1350,data1
252305583330,randread.pl,17922,R,8192,166174720,1154,data1
252305593913,randread.pl,17922,R,8192,175079424,1241,data1
252305602833,randread.pl,17922,R,8192,305340416,3307,data1
252305608663,randread.pl,17922,R,8192,655958016,2704,data1
252305611212,randread.pl,17922,R,8192,40951808,1033,data1
252305614609,randread.pl,17922,R,8192,318922752,2687,data1
252305623800,randread.pl,17922,R,8192,246734848,2983,data1
252305711125,randread.pl,17922,R,8192,581795840,1091,data1
252305728694,randread.pl,17922,R,8192,710483968,1034,data1
252305762046,randread.pl,17922,R,8192,329367552,1405,data1
252305798215,randread.pl,17922,R,8192,44482560,1030,data1
252305806748,randread.pl,17922,R,8192,660602880,1069,data1
252305826360,randread.pl,17922,R,8192,616144896,2327,data1
[...]
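The CSV output lends itself to post-processing. As a sketch (not part of the
tool), each line can be parsed in Python using the field names from the
header printed above; the sample line is taken from the output shown earlier:

```python
# Minimal sketch: parse one line of zfsslower -j (CSV) output into a dict.
# Field names follow the header the tool prints in CSV mode.
FIELDS = ["ENDTIME_us", "TASK", "PID", "TYPE", "BYTES", "OFFSET_b",
          "LATENCY_us", "FILE"]

def parse_line(line):
    rec = dict(zip(FIELDS, line.strip().split(",")))
    # numeric fields arrive as strings; convert them
    for k in ("ENDTIME_us", "PID", "BYTES", "OFFSET_b", "LATENCY_us"):
        rec[k] = int(rec[k])
    return rec

rec = parse_line("252305490911,randread.pl,17922,R,8192,163446784,1156,data1")
print(rec["TYPE"], rec["LATENCY_us"])   # R 1156
```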
USAGE message:
# ./zfsslower -h
usage: zfsslower [-h] [-j] [-p PID] [min_ms]
Trace common ZFS file operations slower than a threshold
positional arguments:
min_ms minimum I/O duration to trace, in ms (default 10)
optional arguments:
-h, --help show this help message and exit
-j, --csv just print fields: comma-separated values
-p PID, --pid PID trace this PID only
examples:
./zfsslower # trace operations slower than 10 ms (default)
./zfsslower 1 # trace operations slower than 1 ms
./zfsslower -j 1 # ... 1 ms, parsable output (csv)
./zfsslower 0 # trace all operations (warning: verbose)
./zfsslower -p 185 # trace PID 185 only