tracepoint support for argdist and trace, and new tplist tool for printing tracepoints

fd60d55c · Sasha Goldshtein · f6bf78f5 · fd60d55c · fd60d55c · fd60d55c
Commit fd60d55c authored Mar 01, 2016 by Sasha Goldshtein
10 changed files
--- a/README.md
+++ b/README.md
@@ -108,6 +108,7 @@ Examples:
 - tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
 - tools/[tcpconnlat](tools/tcpconnlat.py): Trace TCP active connection latency (connect()). [Examples](tools/tcpconnlat_example.txt).
 - tools/[tcpretrans](tools/tcpretrans.py): Trace TCP retransmits and TLPs. [Examples](tools/tcpretrans_example.txt).
+- tools/[tplist](tools/tplist.py): Display kernel tracepoints and their format.
 - tools/[trace](tools/trace.py): Trace arbitrary functions, with filters. [Examples](tools/trace_example.txt)
 - tools/[vfscount](tools/vfscount.py) tools/[vfscount.c](tools/vfscount.c): Count VFS calls. [Examples](tools/vfscount_example.txt).
 - tools/[vfsstat](tools/vfsstat.py) tools/[vfsstat.c](tools/vfsstat.c): Count some VFS calls, with column output. [Examples](tools/vfsstat_example.txt).

--- a/man/man8/argdist.8
+++ b/man/man8/argdist.8
@@ -50,11 +50,11 @@ many cases, argdist will deduce the necessary header files automatically.
 .SH SPECIFIER SYNTAX
 The general specifier syntax is as follows:

-.B {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+.B {p,r,t}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 .TP
-.B {p,r}
-Probe type \- "p" for function entry, "r" for function return;
-\-H for histogram collection, \-C for frequency count.
+.B {p,r,t}
+Probe type \- "p" for function entry, "r" for function return, "t" for kernel
+tracepoint; \-H for histogram collection, \-C for frequency count.
 Indicates where to place the probe and whether the probe should collect frequency
 count information, or aggregate the collected values into a histogram. Counting 
 probes will collect the number of times every parameter value was observed,
@@ -68,12 +68,17 @@ Specify the full path to the .so or executable file where the function to probe
 resides. Alternatively, you can specify just the lib name: for example, "c"
 refers to libc. If no library name is specified, the kernel is assumed.
 .TP
+.B category
+The category of the kernel tracepoint. For example: net, sched, block.
+.TP
 .B function(signature)
 The function to probe, and its signature.
 The function name must match exactly for the probe to be placed. The signature,
 on the other hand, is only required if you plan to collect parameter values 
 based on that signature. For example, if you only want to collect the first
 parameter, you don't have to specify the rest of the parameters in the signature.
+When capturing kernel tracepoints, this should be the name of the event, e.g.
+net_dev_start_xmit. The signature for kernel tracepoints should be empty.
 .TP
 .B [type[,type...]]
 The type(s) of the expression(s) to capture.
@@ -85,6 +90,9 @@ The expression(s) to capture.
 These are the values that are assigned to the histogram or raw event collection.
 You may use the parameters directly, or valid C expressions that involve the
 parameters, such as "size % 10".
+Tracepoints may access a special structure called "tp" that is formatted according
+to the tracepoint format (which you can obtain using tplist). For example, the
+block:block_rq_complete tracepoint can access tp.nr_sector.
 Return probes can use the argument values received by the
 function when it was entered, through the $entry(paramname) special variable.
 Return probes can also access the function's return value in $retval, and the
@@ -137,6 +145,14 @@ Count fork() calls in libc across all processes, grouped by pid:
 #
 .B argdist -C 'p:c:fork():int:$PID;fork per process'
 .TP
+Print histogram of number of sectors in completing block I/O requests:
+#
+.B argdist -H 't:block:block_rq_complete():u32:tp.nr_sector'
+.TP
+Aggregate interrupts by interrupt request (IRQ):
+#
+.B argdist -C 't:irq:irq_handler_entry():int:tp.irq'
+.TP
 Print histograms of sleep() and nanosleep() parameter values:
 #
 .B argdist -H 'p:c:sleep(u32 seconds):u32:seconds' 'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'

--- a/man/man8/tplist.8
+++ b/man/man8/tplist.8
+.TH tplist 8  "2016-03-20" "USER COMMANDS"
+.SH NAME
+tplist \- Display kernel tracepoints and their format.
+.SH SYNOPSIS
+.B tplist [-v] [tracepoint]
+.SH DESCRIPTION
+tplist lists all kernel tracepoints, and can optionally print out the tracepoint
+format; namely, the variables that you can trace when the tracepoint is hit. This
+is usually used in conjunction with the argdist and/or trace tools.
+
+On a typical system, accessing the tracepoint list and format requires root.
+.SH OPTIONS
+.TP
+\-v
+Display the variables associated with the tracepoint or tracepoints.
+.TP
+[tracepoint]
+A wildcard expression that specifies which tracepoints to print. For example,
+block:* will print all block tracepoints (block:block_rq_complete, etc.).
+Regular expressions are not supported.
+.SH EXAMPLES
+.TP
+Print all kernel tracepoints:
+#
+.B tplist
+.TP
+Print all net tracepoints with their format:
+#
+.B tplist -v 'net:*'
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Sasha Goldshtein
--- a/man/man8/trace.8
+++ b/man/man8/trace.8
@@ -45,10 +45,12 @@ information. See PROBE SYNTAX below.
 The general probe syntax is as follows:

 .B [{p,r}]:[library]:function [(predicate)] ["format string"[, arguments]]
+
+.B t:category:event [(predicate)] ["format string"[, arguments]]
 .TP
-.B [{p,r}]
-Probe type \- "p" for function entry, "r" for function return. The default
-probe type is "p".
+.B {[{p,r}],t}
+Probe type \- "p" for function entry, "r" for function return, "t" for kernel
+tracepoint. The default probe type is "p".
 .TP
 .B [library]
 Library containing the probe.
@@ -58,9 +60,15 @@ refers to libc. If no library name is specified, the kernel is assumed. Also,
 you can specify an executable name (without a full path) if it is in the PATH.
 For example, "bash".
 .TP
+.B category
+The tracepoint category. For example, "sched" or "irq".
+.TP
 .B function
 The function to probe.
 .TP
+.B event
+The tracepoint event. For example, "block_rq_complete".
+.TP
 .B [(predicate)]
 The filter applied to the captured data. Only if the filter evaluates as true,
 the trace message will be printed. The filter can use any valid C expression
@@ -81,6 +89,11 @@ number of arguments as there are placeholders in the format string. The
 format specifier replacements may be any C expressions, and may refer to the
 same special keywords as in the predicate (arg1, arg2, etc.).

+In tracepoints, both the predicate and the arguments may refer to the tracepoint
+format structure, which is stored in the special "tp" variable. For example, the
+block:block_rq_complete tracepoint can print or filter by tp.nr_sector. To 
+discover the format of your tracepoint, use the tplist tool.
+
 The predicate expression and the format specifier replacements for printing
 may also use the following special keywords: $pid, $tgid to refer to the 
 current process' pid and tgid; $uid, $gid to refer to the current user's
@@ -102,6 +115,10 @@ Trace all malloc calls and print the size of the requested allocation:
 Trace returns from the readline function in bash and print the return value as a string:
 #
 .B trace 'r:bash:readline """%s"", retval' 
+.TP
+Trace the block:block_rq_complete tracepoint and print the number of sectors completed:
+#
+.B trace 't:block:block_rq_complete """%d sectors"", tp.nr_sector'
 .SH SOURCE
 This is from bcc.
 .IP

--- a/src/python/bcc/__init__.py
+++ b/src/python/bcc/__init__.py
@@ -325,6 +325,11 @@ class BPF(object):
        global open_kprobes
        return open_kprobes

+    @staticmethod
+    def open_uprobes():
+            global open_uprobes
+            return open_uprobes
+
    @staticmethod
    def detach_kprobe(event):
        ev_name = "p_" + event.replace("+", "_").replace(".", "_")

--- a/tools/argdist.py
+++ b/tools/argdist.py
 #!/usr/bin/env python
 #
 # argdist   Trace a function and display a distribution of its
-#              parameter values as a histogram or frequency count.
+#           parameter values as a histogram or frequency count.
 #
 # USAGE: argdist [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL]
-#                   [-n COUNT] [-v] [-T TOP]
-#                   [-C specifier [specifier ...]]
-#                   [-H specifier [specifier ...]]
-#                   [-I header [header ...]]
+#                [-n COUNT] [-v] [-T TOP]
+#                [-C specifier [specifier ...]]
+#                [-H specifier [specifier ...]]
+#                [-I header [header ...]]
 #
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.
@@ -15,10 +15,129 @@
 from bcc import BPF
 from time import sleep, strftime
 import argparse
+import ctypes as ct
 import re
 import traceback
+import os
+import multiprocessing
 import sys

+class Perf(object):
+        class perf_event_attr(ct.Structure):
+                _fields_ = [
+                        ('type', ct.c_uint),
+                        ('size', ct.c_uint),
+                        ('config', ct.c_ulong),
+                        ('sample_period', ct.c_ulong),
+                        ('sample_type', ct.c_ulong),
+                        ('IGNORE1', ct.c_ulong),
+                        ('IGNORE2', ct.c_ulong),
+                        ('wakeup_events', ct.c_uint),
+                        ('IGNORE3', ct.c_uint),
+                        ('IGNORE4', ct.c_ulong),
+                        ('IGNORE5', ct.c_ulong),
+                        ('IGNORE6', ct.c_ulong),
+                        ('IGNORE7', ct.c_uint),
+                        ('IGNORE8', ct.c_int),
+                        ('IGNORE9', ct.c_ulong),
+                        ('IGNORE10', ct.c_uint),
+                        ('IGNORE11', ct.c_uint)
+                ]
+
+        NR_PERF_EVENT_OPEN = 298
+        PERF_TYPE_TRACEPOINT = 2
+        PERF_SAMPLE_RAW = 1024
+        PERF_FLAG_FD_CLOEXEC = 8
+        PERF_EVENT_IOC_SET_FILTER = 1074275334
+        PERF_EVENT_IOC_ENABLE = 9216
+
+        libc = ct.CDLL('libc.so.6', use_errno=True)
+        syscall = libc.syscall          # not declaring vararg types
+        ioctl = libc.ioctl              # not declaring vararg types
+
+        @staticmethod
+        def _open_for_cpu(cpu, attr):
+                pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
+                                   -1, cpu, -1, Perf.PERF_FLAG_FD_CLOEXEC)
+                if pfd < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
+                              "common_pid == -17") < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+
+        @staticmethod
+        def perf_event_open(tpoint_id):
+                attr = Perf.perf_event_attr()
+                attr.config = tpoint_id
+                attr.type = Perf.PERF_TYPE_TRACEPOINT
+                attr.sample_type = Perf.PERF_SAMPLE_RAW
+                attr.sample_period = 1
+                attr.wakeup_events = 1
+                for cpu in range(0, multiprocessing.cpu_count()):
+                        Perf._open_for_cpu(cpu, attr)
+
+class Tracepoint(object):
+        tracepoints_enabled = 0
+        trace_root = "/sys/kernel/debug/tracing"
+        event_root = os.path.join(trace_root, "events")
+
+        @staticmethod
+        def generate_decl():
+                if Tracepoint.tracepoints_enabled == 0:
+                        return ""
+                return "\nBPF_HASH(__trace_di, u64, u64);\n"
+
+        @staticmethod
+        def generate_entry_probe():
+                if Tracepoint.tracepoints_enabled == 0:
+                        return ""
+                return """
+int __trace_entry_update(struct pt_regs *ctx)
+{
+        u64 tid = bpf_get_current_pid_tgid();
+        u64 val = ctx->di;
+        __trace_di.update(&tid, &val);
+        return 0;
+}
+"""
+
+        @staticmethod
+        def enable_tracepoint(category, event):
+                tp_id = Tracepoint.get_tpoint_id(category, event)
+                if tp_id == -1:
+                        raise ValueError("no such tracepoint found: %s:%s" %
+                                         (category, event))
+                Perf.perf_event_open(tp_id)
+                Tracepoint.tracepoints_enabled += 1
+
+        @staticmethod
+        def get_tpoint_id(category, event):
+                evt_dir = os.path.join(Tracepoint.event_root, category, event)
+                try:
+                        return int(
+                          open(os.path.join(evt_dir, "id")).read().strip())
+                except:
+                        return -1
+
+        @staticmethod
+        def get_tpoint_format(category, event):
+                evt_dir = os.path.join(Tracepoint.event_root, category, event)
+                try:
+                        return open(os.path.join(evt_dir, "format")).readlines()
+                except:
+                        return ""
+
+        @staticmethod
+        def attach(bpf):
+                if Tracepoint.tracepoints_enabled > 0:
+                        bpf.attach_kprobe(event="tracing_generic_entry_update",
+                                          fn_name="__trace_entry_update")
+
 class Specifier(object):
        probe_text = """
 DATA_DECL
@@ -36,10 +155,11 @@ int PROBENAME(struct pt_regs *ctx SIGNATURE)
        next_probe_index = 0
        aliases = { "$PID": "bpf_get_current_pid_tgid()" }
        auto_includes = {
-                "linux/time.h"    : ["time"],
-                "linux/fs.h"      : ["fs", "file"],
-                "linux/blkdev.h"  : ["bio", "request"],
-                "linux/slab.h"    : ["alloc"]
+                "linux/time.h"      : ["time"],
+                "linux/fs.h"        : ["fs", "file"],
+                "linux/blkdev.h"    : ["bio", "request"],
+                "linux/slab.h"      : ["alloc"],
+                "linux/netdevice.h" : ["sk_buff", "net_device"]
        }

        @staticmethod
@@ -189,8 +309,8 @@ u64 __time = bpf_ktime_get_ns();
                                   "function signature must be specified")
                if len(parts) > 6:
                        self._bail("extraneous ':'-separated parts detected")
-                if parts[0] not in ["r", "p"]:
-                        self._bail("probe type must be either 'p' or 'r', " +
+                if parts[0] not in ["r", "p", "t"]:
+                        self._bail("probe type must be 'p', 'r', or 't', " +
                                   "but got '%s'" % parts[0])
                if re.match(r"\w+\(.*\)", parts[2]) is None:
                        self._bail(("function signature '%s' has an invalid " +
@@ -216,11 +336,19 @@ u64 __time = bpf_ktime_get_ns();

                parts = spec_and_label[0].strip().split(':')
                self.type = type    # hist or freq
-                self.is_ret_probe = parts[0] == "r"
-                self.library = parts[1]
-                self.is_user = len(self.library) > 0
+                self.probe_type = parts[0]
                fparts = parts[2].split('(')
                self.function = fparts[0].strip()
+                if self.probe_type == "t":
+                        self.library = ""       # kernel
+                        self.tp_category = parts[1]
+                        self.tp_event = self.function
+                        Tracepoint.enable_tracepoint(
+                                        self.tp_category, self.tp_event)
+                        self.function = "perf_trace_" + self.function
+                else:
+                        self.library = parts[1]
+                self.is_user = len(self.library) > 0
                self.signature = fparts[1].strip()[:-1]
                self._parse_signature()

@@ -235,12 +363,12 @@ u64 __time = bpf_ktime_get_ns();
                        if self.type == "hist" and len(self.expr_types) > 1:
                                self._bail("histograms can only have 1 expr")
                else:
-                        if not self.is_ret_probe and self.type == "hist":
+                        if not self.probe_type == "r" and self.type == "hist":
                                self._bail("histograms must have expr")
                        self.expr_types = \
-                                ["u64" if not self.is_ret_probe else "int"]
+                          ["u64" if not self.probe_type == "r" else "int"]
                        self.exprs = \
-                                ["1" if not self.is_ret_probe else "$retval"]
+                          ["1" if not self.probe_type == "r" else "$retval"]
                self.filter = "" if len(parts) != 6 else parts[5]
                self._substitute_exprs()

@@ -249,7 +377,7 @@ u64 __time = bpf_ktime_get_ns();
                def check(expr):
                        keywords = ["$entry", "$latency"]
                        return any(map(lambda kw: kw in expr, keywords))
-                self.entry_probe_required = self.is_ret_probe and \
+                self.entry_probe_required = self.probe_type == "r" and \
                        (any(map(check, self.exprs)) or check(self.filter))

                self.pid = pid
@@ -278,11 +406,11 @@ u64 __time = bpf_ktime_get_ns();

        def _generate_field_assignment(self, i):
                if self._is_string(self.expr_types[i]):
-                        return "bpf_probe_read(" + \
+                        return "        bpf_probe_read(" + \
                               "&__key.v%d.s, sizeof(__key.v%d.s), %s);\n" % \
                                (i, i, self.exprs[i])
                else:
-                        return "__key.v%d = %s;\n" % (i, self.exprs[i])
+                        return "        __key.v%d = %s;\n" % (i, self.exprs[i])

        def _generate_hash_decl(self):
                if self.type == "hist":
@@ -324,28 +452,71 @@ u64 __time = bpf_ktime_get_ns();
                else:
                        return ""

-        def generate_text(self):
-                # We don't like tools writing tools (Brendan Gregg), but this
-                # is an exception because we're letting the user fully
-                # customize the values we probe. As a rule of thumb though,
-                # try to build a custom tool for a specific purpose.
+        def _generate_tpoint_entry_struct_fields(self):
+                format_lines = Tracepoint.get_tpoint_format(self.tp_category,
+                                                            self.tp_event)
+                text = ""
+                for line in format_lines:
+                        match = re.search(r'field:([^;]*);.*size:(\d+);', line)
+                        if match is None:
+                                continue
+                        parts = match.group(1).split()
+                        field_name = parts[-1:][0]
+                        field_type = " ".join(parts[:-1])
+                        field_size = int(match.group(2))
+                        if "__data_loc" in field_type:
+                                continue
+                        if field_name.startswith("common_"):
+                                continue
+                        text += "        %s %s;\n" % (field_type, field_name)
+                return text

+        def _generate_tpoint_entry_struct(self):
+                text = """
+struct %s {
+        u64 __do_not_use__;
+%s
+};
+                """
+                self.tp_entry_struct_name = self.probe_func_name + \
+                                            "_trace_entry"
+                fields = self._generate_tpoint_entry_struct_fields()
+                return text % (self.tp_entry_struct_name, fields)
+
+        def _generate_tpoint_entry_prefix(self):
+                text = """
+        u64 tid = bpf_get_current_pid_tgid();
+        u64 *di = __trace_di.lookup(&tid);
+        if (di == 0) { return 0; }
+        struct %s tp = {};
+        bpf_probe_read(&tp, sizeof(tp), (void *)*di);
+                """ % self.tp_entry_struct_name
+                return text
+
+        def generate_text(self):
                program = ""

                # If any entry arguments are probed in a ret probe, we need
                # to generate an entry probe to collect them
                prefix = ""
                if self.entry_probe_required:
-                        program = self._generate_entry_probe()
-                        prefix = self._generate_retprobe_prefix()
+                        program += self._generate_entry_probe()
+                        prefix += self._generate_retprobe_prefix()
                        # Replace $entry(paramname) with a reference to the
                        # value we collected when entering the function:
                        self._replace_entry_exprs()

+                # If this is a tracepoint probe, generate a local variable
+                # that enables access to the tracepoint structure and also
+                # the structure definition itself
+                if self.probe_type == "t":
+                        program += self._generate_tpoint_entry_struct()
+                        prefix += self._generate_tpoint_entry_prefix()
+
                program += self.probe_text.replace("PROBENAME",
                                                   self.probe_func_name)
                signature = "" if len(self.signature) == 0 \
-                                  or self.is_ret_probe \
+                                  or self.probe_type == "r" \
                               else ", " + self.signature
                program = program.replace("SIGNATURE", signature)
                program = program.replace("PID_FILTER",
@@ -364,8 +535,10 @@ u64 __time = bpf_ktime_get_ns();

        def attach(self, bpf):
                self.bpf = bpf
+                uprobes_start = len(BPF.open_uprobes())
+                kprobes_start = len(BPF.open_kprobes())
                if self.is_user:
-                        if self.is_ret_probe:
+                        if self.probe_type == "r":
                                bpf.attach_uretprobe(name=self.library,
                                                  sym=self.function,
                                                  fn_name=self.probe_func_name,
@@ -375,13 +548,17 @@ u64 __time = bpf_ktime_get_ns();
                                                  sym=self.function,
                                                  fn_name=self.probe_func_name,
                                                  pid=self.pid or -1)
+                        if len(BPF.open_uprobes()) != uprobes_start + 1:
+                                self._bail("error attaching probe")
                else:
-                        if self.is_ret_probe:
+                        if self.probe_type == "r" or self.probe_type == "t":
                                bpf.attach_kretprobe(event=self.function,
                                                  fn_name=self.probe_func_name)
                        else:
                                bpf.attach_kprobe(event=self.function,
                                                  fn_name=self.probe_func_name)
+                        if len(BPF.open_kprobes()) != kprobes_start + 1:
+                                self._bail("error attaching probe")
                if self.entry_probe_required:
                        self._attach_entry_probe()

@@ -406,7 +583,7 @@ u64 __time = bpf_ktime_get_ns();

        def _display_key(self, key):
                if self.is_default_expr:
-                        if not self.is_ret_probe:
+                        if not self.probe_type == "r":
                                return "total calls"
                        else:
                                return "retval = %s" % str(key.v0)
@@ -431,7 +608,7 @@ u64 __time = bpf_ktime_get_ns();
                                # Print some nice values if the user didn't
                                # specify an expression to probe
                                if self.is_default_expr:
-                                        if not self.is_ret_probe:
+                                        if not self.probe_type == "r":
                                                key_str = "total calls"
                                        else:
                                                key_str = "retval = %s" % \
@@ -448,13 +625,14 @@ u64 __time = bpf_ktime_get_ns();
 class Tool(object):
        examples = """
 Probe specifier syntax:
-        {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+        {p,r,t}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 Where:
-        p,r        -- probe at function entry or at function exit
+        p,r,t      -- probe at function entry, function exit, or kernel tracepoint
                      in exit probes: can use $retval, $entry(param), $latency
        library    -- the library that contains the function
                      (leave empty for kernel functions)
-        function   -- the function name to trace
+        category   -- the category of the kernel tracepoint (e.g. net, sched)
+        function   -- the function name to trace (or tracepoint name)
        signature  -- the function's parameters, as in the C header
        type       -- the type of the expression to collect (supports multiple)
        expr       -- the expression to collect (supports multiple)
@@ -502,6 +680,12 @@ argdist -C 'p:c:fork()#fork calls'
        Count fork() calls in libc across all processes
        Can also use funccount.py, which is easier and more flexible

+argdist -H 't:block:block_rq_complete():u32:tp.nr_sector'
+        Print histogram of number of sectors in completing block I/O requests
+
+argdist -C 't:irq:irq_handler_entry():int:tp.irq'
+        Aggregate interrupts by interrupt request (IRQ)
+
 argdist  -H \\
        'p:c:sleep(u32 seconds):u32:seconds' \\
        'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
@@ -555,7 +739,7 @@ argdist -p 2780 -z 120 \\
                                Specifier("hist", histspecifier, self.args.pid))
                if len(self.specifiers) == 0:
                        print("at least one specifier is required")
-                        exit(1)
+                        exit()

        def _generate_program(self):
                bpf_source = """
@@ -567,6 +751,8 @@ struct __string_t { char s[%d]; };
                        bpf_source += "#include <%s>\n" % include
                bpf_source += Specifier.generate_auto_includes(
                                map(lambda s: s.raw_spec, self.specifiers))
+                bpf_source += Tracepoint.generate_decl()
+                bpf_source += Tracepoint.generate_entry_probe()
                for specifier in self.specifiers:
                        bpf_source += specifier.generate_text()
                if self.args.verbose:
@@ -574,8 +760,12 @@ struct __string_t { char s[%d]; };
                self.bpf = BPF(text=bpf_source)

        def _attach(self):
+                Tracepoint.attach(self.bpf)
                for specifier in self.specifiers:
                        specifier.attach(self.bpf)
+                if self.args.verbose:
+                        print("open uprobes: %s" % BPF.open_uprobes())
+                        print("open kprobes: %s" % BPF.open_kprobes())

        def _main_loop(self):
                count_so_far = 0
@@ -601,7 +791,7 @@ struct __string_t { char s[%d]; };
                except:
                        if self.args.verbose:
                                traceback.print_exc()
-                        else:
+                        elif sys.exc_type is not SystemExit:
                                print(sys.exc_value)

 if __name__ == "__main__":

--- a/tools/argdist_example.txt
+++ b/tools/argdist_example.txt
@@ -262,6 +262,27 @@ p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size
 The flags value must be expanded by hand, but it's still helpful to eliminate
 certain kinds of allocations or visually group them together.

+argdist also has basic support for kernel tracepoints. It is sometimes more
+convenient to use tracepoints because they are documented and don't vary a lot
+between kernel versions like function signatures tend to. For example, let's
+trace the net:net_dev_start_xmit tracepoint and print the interface name that
+is transmitting:
+
+# argdist -C 't:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name' -n 2
+[05:01:10]
+t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
+        COUNT      EVENT
+        4          c->name = eth0
+[05:01:11]
+t:net:net_dev_start_xmit(void *a, void *b, struct net_device *c):char*:c->name
+        COUNT      EVENT
+        6          c->name = lo
+        92         c->name = eth0
+
+Note that to determine the necessary function signature you need to look at the
+TP_PROTO declaration in the kernel headers. For example, the net_dev_start_xmit
+tracepoint is defined in the include/trace/events/net.h header file.
+
 Here's a final example that finds how many write() system calls are performed
 by each process on the system:

@@ -311,13 +332,13 @@ optional arguments:
                        additional header files to include in the BPF program

 Probe specifier syntax:
-        {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
+        {p,r,t}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 Where:
-        p,r        -- probe at function entry or at function exit
+        p,r,t      -- probe at function entry, function exit, or kernel tracepoint
                      in exit probes: can use $retval, $entry(param), $latency
        library    -- the library that contains the function
                      (leave empty for kernel functions)
-        function   -- the function name to trace
+        category   -- the category of the kernel tracepoint (e.g. net, sched)
        signature  -- the function's parameters, as in the C header
        type       -- the type of the expression to collect (supports multiple)
        expr       -- the expression to collect (supports multiple)
@@ -365,7 +386,13 @@ argdist -C 'p:c:fork()#fork calls'
        Count fork() calls in libc across all processes
        Can also use funccount.py, which is easier and more flexible 

-argdist  -H \
+argdist -H 't:block:block_rq_complete():u32:tp.nr_sector'
+        Print histogram of number of sectors in completing block I/O requests
+
+argdist -C 't:irq:irq_handler_entry():int:tp.irq'
+        Aggregate interrupts by interrupt request (IRQ)
+
+argdist -H \
        'p:c:sleep(u32 seconds):u32:seconds' \
        'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
        Print histograms of sleep() and nanosleep() parameter values

--- a/tools/tplist.py
+++ b/tools/tplist.py
+#!/usr/bin/env python
+#
+# tplist    Display kernel tracepoints and their formats.
+#
+# USAGE:    tplist [-v] [tracepoint]
+#
+# Licensed under the Apache License, Version 2.0 (the "License")
+# Copyright (C) 2016 Sasha Goldshtein.
+
+import argparse
+import fnmatch
+import re
+import os
+
+trace_root = "/sys/kernel/debug/tracing"
+event_root = os.path.join(trace_root, "events")
+
+parser = argparse.ArgumentParser(description=
+                "Display kernel tracepoints and their formats.",
+                formatter_class=argparse.RawDescriptionHelpFormatter)
+parser.add_argument("-v", dest="variables", action="store_true", help=
+                "Print the format (available variables) for each tracepoint")
+parser.add_argument(dest="tracepoint", nargs="?",
+                help="The tracepoint name to print (wildcards allowed)")
+args = parser.parse_args()
+
+def print_tpoint_format(category, event):
+        fmt = open(os.path.join(event_root, category, event, "format")
+                  ).readlines()
+        for line in fmt:
+                match = re.search(r'field:([^;]*);', line)
+                if match is None:
+                        continue
+                parts = match.group(1).split()
+                field_name = parts[-1:][0]
+                field_type = " ".join(parts[:-1])
+                if "__data_loc" in field_type:
+                        continue
+                if field_name.startswith("common_"):
+                        continue
+                print("    %s %s;" % (field_type, field_name))
+
+def print_tpoint(category, event):
+        tpoint = "%s:%s" % (category, event)
+        if not args.tracepoint or fnmatch.fnmatch(tpoint, args.tracepoint):
+                print(tpoint)
+                if args.variables:
+                        print_tpoint_format(category, event)
+
+def print_all():
+        for category in os.listdir(event_root):
+                cat_dir = os.path.join(event_root, category)
+                if not os.path.isdir(cat_dir):
+                        continue
+                for event in os.listdir(cat_dir):
+                        evt_dir = os.path.join(cat_dir, event)
+                        if os.path.isdir(evt_dir):
+                                print_tpoint(category, event)
+
+if __name__ == "__main__":
+        print_all()
--- a/tools/trace.py
+++ b/tools/trace.py
@@ -5,6 +5,7 @@
 #
 # USAGE: trace [-h] [-p PID] [-v] [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-o]
 #              probe [probe ...]
+#
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.

@@ -13,6 +14,7 @@ from time import sleep, strftime
 import argparse
 import re
 import ctypes as ct
+import multiprocessing
 import os
 import traceback
 import sys
@@ -42,6 +44,122 @@ class Time(object):
                        raise OSError(errno_, os.strerror(errno_))
                return t.tv_sec * 1e9 + t.tv_nsec

+class Perf(object):
+        class perf_event_attr(ct.Structure):
+                _fields_ = [
+                        ('type', ct.c_uint),
+                        ('size', ct.c_uint),
+                        ('config', ct.c_ulong),
+                        ('sample_period', ct.c_ulong),
+                        ('sample_type', ct.c_ulong),
+                        ('IGNORE1', ct.c_ulong),
+                        ('IGNORE2', ct.c_ulong),
+                        ('wakeup_events', ct.c_uint),
+                        ('IGNORE3', ct.c_uint),
+                        ('IGNORE4', ct.c_ulong),
+                        ('IGNORE5', ct.c_ulong),
+                        ('IGNORE6', ct.c_ulong),
+                        ('IGNORE7', ct.c_uint),
+                        ('IGNORE8', ct.c_int),
+                        ('IGNORE9', ct.c_ulong),
+                        ('IGNORE10', ct.c_uint),
+                        ('IGNORE11', ct.c_uint)
+                ]
+
+        NR_PERF_EVENT_OPEN = 298
+        PERF_TYPE_TRACEPOINT = 2
+        PERF_SAMPLE_RAW = 1024
+        PERF_FLAG_FD_CLOEXEC = 8
+        PERF_EVENT_IOC_SET_FILTER = 1074275334
+        PERF_EVENT_IOC_ENABLE = 9216
+
+        libc = ct.CDLL('libc.so.6', use_errno=True)
+        syscall = libc.syscall          # not declaring vararg types
+        ioctl = libc.ioctl              # not declaring vararg types
+
+        @staticmethod
+        def _open_for_cpu(cpu, attr):
+                pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
+                                   -1, cpu, -1, Perf.PERF_FLAG_FD_CLOEXEC)
+                if pfd < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
+                              "common_pid == -17") < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+
+        @staticmethod
+        def perf_event_open(tpoint_id):
+                attr = Perf.perf_event_attr()
+                attr.config = tpoint_id
+                attr.type = Perf.PERF_TYPE_TRACEPOINT
+                attr.sample_type = Perf.PERF_SAMPLE_RAW
+                attr.sample_period = 1
+                attr.wakeup_events = 1
+                for cpu in range(0, multiprocessing.cpu_count()):
+                        Perf._open_for_cpu(cpu, attr)
+
+class Tracepoint(object):
+        tracepoints_enabled = 0
+        trace_root = "/sys/kernel/debug/tracing"
+        event_root = os.path.join(trace_root, "events")
+
+        @staticmethod
+        def generate_decl():
+                if Tracepoint.tracepoints_enabled == 0:
+                        return ""
+                return "\nBPF_HASH(__trace_di, u64, u64);\n"
+
+        @staticmethod
+        def generate_entry_probe():
+                if Tracepoint.tracepoints_enabled == 0:
+                        return ""
+                return """
+int __trace_entry_update(struct pt_regs *ctx)
+{
+        u64 tid = bpf_get_current_pid_tgid();
+        u64 val = ctx->di;
+        __trace_di.update(&tid, &val);
+        return 0;
+}
+"""
+
+        @staticmethod
+        def enable_tracepoint(category, event):
+                tp_id = Tracepoint.get_tpoint_id(category, event)
+                if tp_id == -1:
+                        raise ValueError("no such tracepoint found: %s:%s" %
+                                         (category, event))
+                Perf.perf_event_open(tp_id)
+                Tracepoint.tracepoints_enabled += 1
+
+        @staticmethod
+        def get_tpoint_id(category, event):
+                evt_dir = os.path.join(Tracepoint.event_root, category, event)
+                try:
+                        return int(
+                          open(os.path.join(evt_dir, "id")).read().strip())
+                except:
+                        return -1
+
+        @staticmethod
+        def get_tpoint_format(category, event):
+                evt_dir = os.path.join(Tracepoint.event_root, category, event)
+                try:
+                        return open(os.path.join(evt_dir, "format")).readlines()
+                except:
+                        return ""
+
+        @staticmethod
+        def attach(bpf):
+                if Tracepoint.tracepoints_enabled > 0:
+                        bpf.attach_kprobe(event="tracing_generic_entry_update",
+                                          fn_name="__trace_entry_update")
+
 class Probe(object):
        probe_count = 0
        max_events = None
@@ -72,6 +190,47 @@ class Probe(object):
        def is_default_action(self):
                return self.python_format == ""

+        def _generate_tpoint_entry_struct_fields(self):
+                format_lines = Tracepoint.get_tpoint_format(self.tp_category,
+                                                            self.tp_event)
+                text = ""
+                for line in format_lines:
+                        match = re.search(r'field:([^;]*);.*size:(\d+);', line)
+                        if match is None:
+                                continue
+                        parts = match.group(1).split()
+                        field_name = parts[-1:][0]
+                        field_type = " ".join(parts[:-1])
+                        field_size = int(match.group(2))
+                        if "__data_loc" in field_type:
+                                continue
+                        if field_name.startswith("common_"):
+                                continue
+                        text += "        %s %s;\n" % (field_type, field_name)
+                return text
+
+        def _generate_tpoint_entry_struct(self):
+                text = """
+struct %s {
+        u64 __do_not_use__;
+%s
+};
+                """
+                self.tp_entry_struct_name = self.probe_name + \
+                                            "_trace_entry"
+                fields = self._generate_tpoint_entry_struct_fields()
+                return text % (self.tp_entry_struct_name, fields)
+
+        def _generate_tpoint_entry_prefix(self):
+                text = """
+        u64 tid = bpf_get_current_pid_tgid();
+        u64 *di = __trace_di.lookup(&tid);
+        if (di == 0) { return 0; }
+        struct %s tp = {};
+        bpf_probe_read(&tp, sizeof(tp), (void *)*di);
+                """ % self.tp_entry_struct_name
+                return text
+
        def _bail(self, error):
                raise ValueError("error parsing probe '%s': %s" %
                                 (self.raw_probe, error))
@@ -123,13 +282,21 @@ class Probe(object):
                        parts = ["p", parts[0], parts[1]]
                if len(parts[0]) == 0:
                        self.probe_type = "p"
-                elif parts[0] in ["p", "r"]:
+                elif parts[0] in ["p", "r", "t"]:
                        self.probe_type = parts[0]
                else:
-                        self._bail("expected '', 'p', or 'r', got '%s'" %
+                        self._bail("expected '', 'p', 't', or 'r', got '%s'" %
                                   parts[0])
-                self.library = parts[1]
-                self.function = parts[2]
+                if self.probe_type == "t":
+                        self.tp_category = parts[1]
+                        self.tp_event = parts[2]
+                        Tracepoint.enable_tracepoint(self.tp_category,
+                                                     self.tp_event)
+                        self.library = ""       # kernel
+                        self.function = "perf_trace_%s" % self.tp_event
+                else:
+                        self.library = parts[1]
+                        self.function = parts[2]

        def _parse_filter(self, filt):
                self.filter = self._replace_args(filt)
@@ -149,12 +316,17 @@ class Probe(object):
                if len(action) == 0:
                        return

-                parts = action.split(',')
-                self.raw_format = parts[0]
+                action = action.strip()
+                match = re.search(r'(\".*\"),?(.*)', action)
+                if match is None:
+                        self._bail("expected format string in \"s")
+
+                self.raw_format = match.group(1)
                self._parse_types(self.raw_format)
-                for part in parts[1:]:
+                for part in match.group(2).split(','):
                        part = self._replace_args(part)
-                        self.values.append(part)
+                        if len(part) > 0:
+                                self.values.append(part)

        aliases = {
                "retval": "ctx->ax",
@@ -283,9 +455,15 @@ BPF_PERF_OUTPUT(%s);
                for i, expr in enumerate(self.values):
                        data_fields += self._generate_field_assign(i)

+                prefix = ""
+                if self.probe_type == "t":
+                        data_decl += self._generate_tpoint_entry_struct()
+                        prefix = self._generate_tpoint_entry_prefix()
+
                text = """
 int %s(struct pt_regs *ctx)
 {
+        %s
        %s
        if (!(%s)) return 0;

@@ -298,7 +476,7 @@ int %s(struct pt_regs *ctx)
        return 0;
 }
 """
-                text = text % (self.probe_name, pid_filter,
+                text = text % (self.probe_name, pid_filter, prefix,
                               self.filter, self.struct_name,
                               data_fields, self.events_name)

@@ -308,6 +486,31 @@ int %s(struct pt_regs *ctx)
        def _time_off_str(cls, timestamp_ns):
                return "%.6f" % (1e-9 * (timestamp_ns - cls.first_ts))

+        auto_includes = {
+                "linux/time.h"      : ["time"],
+                "linux/fs.h"        : ["fs", "file"],
+                "linux/blkdev.h"    : ["bio", "request"],
+                "linux/slab.h"      : ["alloc"],
+                "linux/netdevice.h" : ["sk_buff"]
+        }
+
+        @classmethod
+        def generate_auto_includes(cls, probes):
+                headers = ""
+                for header, keywords in cls.auto_includes.items():
+                        for keyword in keywords:
+                                for probe in probes:
+                                        if keyword in probe:
+                                                headers += "#include <%s>\n" \
+                                                           % header
+                return headers
+
+        def _display_function(self):
+                if self.probe_type != 't':
+                        return self.function
+                else:
+                        return self.function.replace("perf_trace_", "")
+
        def print_event(self, cpu, data, size):
                # Cast as the generated structure type and display
                # according to the format string in the probe.
@@ -318,7 +521,8 @@ int %s(struct pt_regs *ctx)
                time = strftime("%H:%M:%S") if Probe.use_localtime else \
                       Probe._time_off_str(event.timestamp_ns)
                print("%-8s %-6d %-12s %-16s %s" % \
-                    (time[:8], event.pid, event.comm[:12], self.function, msg))
+                    (time[:8], event.pid, event.comm[:12],
+                     self._display_function(), msg))

                Probe.event_count += 1
                if Probe.max_events is not None and \
@@ -337,7 +541,7 @@ int %s(struct pt_regs *ctx)
                if self.probe_type == "r":
                        bpf.attach_kretprobe(event=self.function,
                                             fn_name=self.probe_name)
-                elif self.probe_type == "p":
+                elif self.probe_type == "p" or self.probe_type == "t":
                        bpf.attach_kprobe(event=self.function,
                                          fn_name=self.probe_name)

@@ -384,6 +588,8 @@ trace 'r::__kmalloc (retval == 0) "kmalloc failed!"
        Trace returns from __kmalloc which returned a null pointer
 trace 'r:c:malloc (retval) "allocated = %p", retval
        Trace returns from malloc and print non-NULL allocated buffers
+trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+        Trace the block_rq_complete kernel tracepoint and print # of tx sectors
 """

        def __init__(self):
@@ -420,6 +626,10 @@ trace 'r:c:malloc (retval) "allocated = %p", retval
 #include <linux/sched.h>        /* For TASK_COMM_LEN */

 """
+                self.program += Probe.generate_auto_includes(
+                        map(lambda p: p.raw_probe, self.probes))
+                self.program += Tracepoint.generate_decl()
+                self.program += Tracepoint.generate_entry_probe()
                for probe in self.probes:
                        self.program += probe.generate_program(
                                self.args.pid or -1, self.args.include_self)
@@ -429,6 +639,7 @@ trace 'r:c:malloc (retval) "allocated = %p", retval

        def _attach_probes(self):
                self.bpf = BPF(text=self.program)
+                Tracepoint.attach(self.bpf)
                for probe in self.probes:
                        if self.args.verbose:
                                print(probe)
@@ -455,7 +666,7 @@ trace 'r:c:malloc (retval) "allocated = %p", retval
                except:
                        if self.args.verbose:
                                traceback.print_exc()
-                        else:
+                        elif sys.exc_type is not SystemExit:
                                print(sys.exc_value)

 if __name__ == "__main__":

--- a/tools/trace_example.txt
+++ b/tools/trace_example.txt
@@ -80,6 +80,31 @@ Note that the retval variable must be cast to int before comparing to zero.
 The reason is that the default type for argN and retval is an unsigned 64-bit
 integer, which can never be smaller than 0.

+trace has also some basic support for kernel tracepoints. For example, let's
+trace the block:block_rq_complete tracepoint and print out the number of sectors
+transferred:
+
+# trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+TIME     PID    COMM         FUNC             -
+01:23:51 0      swapper/0    block_rq_complete sectors=8
+01:23:55 10017  kworker/u64: block_rq_complete sectors=1
+01:23:55 0      swapper/0    block_rq_complete sectors=8
+^C
+
+To discover the tracepoint structure format (which you can refer to as the "tp"
+variable), use the tplist tool. For example:
+
+# tplist -v block:block_rq_complete
+block:block_rq_complete
+    dev_t dev;
+    sector_t sector;
+    unsigned int nr_sector;
+    int errors;
+    char rwbs[8];
+
+This output tells you that you can use "tp.dev", "tp.sector", etc. in your
+predicate and trace arguments.
+
 As a final example, let's trace open syscalls for a specific process. By 
 default, tracing is system-wide, but the -p switch overrides this:

@@ -144,4 +169,6 @@ trace 'r::__kmalloc (retval == 0) "kmalloc failed!"
        Trace returns from __kmalloc which returned a null pointer
 trace 'r:c:malloc (retval) "allocated = %p", retval
        Trace returns from malloc and print non-NULL allocated buffers
+trace 't:block:block_rq_complete "sectors=%d", tp.nr_sector'
+        Trace the block_rq_complete kernel tracepoint and print # of tx sectors