Merge pull request #377 from goldshtn/argdist-tuples

Enhancements to argdist

Merge pull request #377 from goldshtn/argdist-tuples
Enhancements to argdist
61f5535f · 4ast · e01b6544 · 7df65da7 · 61f5535f · 61f5535f
Commit 61f5535f authored Feb 14, 2016 by 4ast
Hide whitespace changes
Inline Side-by-side

Showing with 584 additions and 169 deletions

man/man8/argdist.8 man/man8/argdist.8 +50 -21

tools/argdist.py tools/argdist.py +396 -103

tools/argdist_examples.txt tools/argdist_examples.txt +138 -45

No files found.
--- a/man/man8/argdist.8
+++ b/man/man8/argdist.8
@@ -2,14 +2,14 @@
 .SH NAME
 argdist \- Trace a function and display a histogram or frequency count of its parameter values. Uses Linux eBPF/bcc.
 .SH SYNOPSIS
-.B argdist [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL] [-n COUNT] [-H specifier [specifier ...]] [-C specifier [specifier ...]]
+.B argdist [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL] [-n COUNT] [-v] [-T TOP] [-H specifier [specifier ...]] [-C specifier [specifier ...]] [-I header [header ...]]
 .SH DESCRIPTION
 argdist attaches to function entry and exit points, collects specified parameter
 values, and stores them in a histogram or a frequency collection that counts
 the number of times a parameter value occurred. It can also filter parameter
 values and instrument multiple entry points at once.
-This currently only works on x86_64. Check for future versions.
+Since this uses BPF, only the root user can use this tool.
 .SH REQUIREMENTS
 CONFIG_BPF and bcc.
 .SH OPTIONS
@@ -24,20 +24,33 @@ Trace only functions in the process PID.
 When collecting string arguments (of type char*), collect up to STRING_SIZE 
 characters. Longer strings will be truncated.
 .TP
-i INTERVAL
+\-i INTERVAL
 Print the collected data every INTERVAL seconds. The default is 1 second.
 .TP
-n NUMBER
+\-n NUMBER
 Print the collected data COUNT times and then exit.
 .TP
-H SPECIFIER, -C SPECIFIER
+\-v
+Display the generated BPF program, for debugging purposes.
+.TP
+\-T TOP
+When collecting frequency counts, display only the top TOP entries.
+.TP
+\-H SPECIFIER, \-C SPECIFIER
 One or more probe specifications that instruct argdist which functions to
 probe, which parameters to collect, how to aggregate them, and whether to perform
 any filtering. See SPECIFIER SYNTAX below.
+.TP
+\-I HEADER
+One or more header files that should be included in the BPF program. This 
+enables the use of structure definitions, enumerations, and constants that
+are available in these headers. You should provide the same path you would
+include in the BPF program, e.g. 'linux/blkdev.h' or 'linux/time.h'. Note: in
+many cases, argdist will deduce the necessary header files automatically. 
 .SH SPECIFIER SYNTAX
 The general specifier syntax is as follows:
-.B {p,r}:[library]:function(signature)[:type:expr[:filter]][;label]
+.B {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 .TP
 .B {p,r}
 Probe type \- "p" for function entry, "r" for function return;
@@ -47,8 +60,7 @@ count information, or aggregate the collected values into a histogram. Counting
 probes will collect the number of times every parameter value was observed,
 whereas histogram probes will collect the parameter values into a histogram.
 Only integral types can be used with histogram probes; there is no such limitation
-for counting probes. Return probes can only use the pseudo-variable $retval, which
+for counting probes.
-is an alias for the function's return value.
 .TP
 .B [library]
 Library containing the probe.
@@ -63,21 +75,30 @@ on the other hand, is only required if you plan to collect parameter values
 based on that signature. For example, if you only want to collect the first
 parameter, you don't have to specify the rest of the parameters in the signature.
 .TP
-.B [type]
+.B [type[,type...]]
-The type of the expression to capture.
+The type(s) of the expression(s) to capture.
 This is the type of the keys in the histogram or raw event collection that are
 collected by the probes.
 .TP
-.B [expr]
+.B [expr[,expr...]]
-The expression to capture.
+The expression(s) to capture.
 These are the values that are assigned to the histogram or raw event collection.
 You may use the parameters directly, or valid C expressions that involve the
 parameters, such as "size % 10".
+Return probes can use the argument values received by the
+function when it was entered, through the $entry(paramname) special variable.
+Return probes can also access the function's return value in $retval, and the
+function's execution time in nanoseconds in $latency. Note that adding the
+$latency or $entry(paramname) variables to the expression will introduce an
+additional probe at the function's entry to collect this data, and therefore
+introduce additional overhead.
 .TP
 .B [filter]
 The filter applied to the captured data.
 Only parameter values that pass the filter will be collected. This is any valid
 C expression that refers to the parameter values, such as "fd == 1 && length > 16".
+The $entry, $retval, and $latency variables can be used here as well, in return
+probes.
 .TP
 .B [label]
 The label that will be displayed when printing the probed values. By default,
@@ -86,39 +107,47 @@ this is the probe specifier.
 .TP
 Print a histogram of allocation sizes passed to kmalloc:
 #
-.B argdist.py -H 'p::__kmalloc(u64 size):u64:size'
+.B argdist -H 'p::__kmalloc(u64 size):u64:size'
 .TP
 Print a count of how many times process 1005 called malloc with an allocation size of 16 bytes:
 #
-.B argdist.py -p 1005 -C 'p:c:malloc(size_t size):size_t:size:size==16'
+.B argdist -p 1005 -C 'p:c:malloc(size_t size):size_t:size:size==16'
 .TP
 Snoop on all strings returned by gets():
 #
-.B argdist.py -C 'r:c:gets():char*:$retval'
+.B argdist -C 'r:c:gets():char*:$retval'
+.TP
+Print a histogram of read sizes that were longer than 1ms:
+#
+.B argdist -H 'r::__vfs_read(void *file, void *buf, size_t count):size_t:$entry(count):$latency > 1000000'
 .TP
 Print frequency counts of how many times writes were issued to a particular file descriptor number, in process 1005:
 #
-.B argdist.py -p 1005 -C 'p:c:write(int fd):int:fd'
+.B argdist -p 1005 -C 'p:c:write(int fd):int:fd'
 .TP
 Print a histogram of error codes returned by read() in process 1005:
 #
-.B argdist.py -p 1005 -H 'r:c:read()'
+.B argdist -p 1005 -H 'r:c:read()'
 .TP
 Print a histogram of buffer sizes passed to write() across all processes, where the file descriptor was 1 (STDOUT):
 #
-.B argdist.py -H 'p:c:write(int fd, const void *buf, size_t count):size_t:count:fd==1'
+.B argdist -H 'p:c:write(int fd, const void *buf, size_t count):size_t:count:fd==1'
 .TP
 Count fork() calls in libc across all processes, grouped by pid:
 #
-.B argdist.py -C 'p:c:fork():int:$PID;fork per process'
+.B argdist -C 'p:c:fork():int:$PID;fork per process'
 .TP
 Print histograms of sleep() and nanosleep() parameter values:
 #
-.B argdist.py -H 'p:c:sleep(u32 seconds):u32:seconds' -H 'p:c:nanosleep(struct timespec { time_t tv_sec; long tv_nsec; } *req):long:req->tv_nsec'
+.B argdist -H 'p:c:sleep(u32 seconds):u32:seconds' 'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
 .TP
 Spy on writes to STDOUT performed by process 2780, up to a string size of 120 characters:
 #
-.B argdist.py -p 2780 -z 120 -C 'p:c:write(int fd, char* buf, size_t len):char*:buf:fd==1'
+.B argdist -p 2780 -z 120 -C 'p:c:write(int fd, char* buf, size_t len):char*:buf:fd==1'
+.TP
+Group files being read from and the read sizes from __vfs_read:
+#
+.B argdist -C 'p::__vfs_read(struct file *file, void *buf, size_t count):char*,size_t:file->f_path.dentry->d_iname,count:file->f_path.dentry->d_iname[0]!=0'
 .SH SOURCE
 This is from bcc.
 .IP

--- a/tools/argdist.py
+++ b/tools/argdist.py
 #!/usr/bin/env python
 #
-# argdist.py   Trace a function and display a distribution of its
+# argdist   Trace a function and display a distribution of its
 #              parameter values as a histogram or frequency count.
 #
-# USAGE: argdist.py [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL]
+# USAGE: argdist [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL]
-#                   [-n COUNT] [-C specifier [specifier ...]]
+#                   [-n COUNT] [-v] [-T TOP]
+#                   [-C specifier [specifier ...]]
 #                   [-H specifier [specifier ...]]
+#                   [-I header [header ...]]
 #
 # Licensed under the Apache License, Version 2.0 (the "License")
 # Copyright (C) 2016 Sasha Goldshtein.
@@ -13,22 +15,41 @@
 from bcc import BPF
 from time import sleep, strftime
 import argparse
+import re
 class Specifier(object):
-        text = """
+        probe_text = """
 DATA_DECL
 int PROBENAME(struct pt_regs *ctx SIGNATURE)
 {
+        PREFIX
        PID_FILTER
-        KEY_EXPR
        if (!(FILTER)) return 0;
+        KEY_EXPR
        COLLECT
        return 0;
 }
 """
        next_probe_index = 0
        aliases = { "$PID": "bpf_get_current_pid_tgid()" }
+        auto_includes = {
+                "linux/time.h"    : ["time"],
+                "linux/fs.h"      : ["fs", "file"],
+                "linux/blkdev.h"  : ["bio", "request"],
+                "linux/slab.h"    : ["alloc"]
+        }
+        @staticmethod
+        def generate_auto_includes(specifiers):
+                headers = ""
+                for header, keywords in Specifier.auto_includes.items():
+                        for keyword in keywords:
+                                for specifier in specifiers:                            
+                                        if keyword in specifier:
+                                                headers += "#include <%s>\n" \
+                                                           % header
+                return headers
        def _substitute_aliases(self, expr):
                if expr is None:
@@ -37,45 +58,198 @@ int PROBENAME(struct pt_regs *ctx SIGNATURE)
                        expr = expr.replace(alias, subst)
                return expr
+        def _parse_signature(self):
+                params = map(str.strip, self.signature.split(','))
+                self.param_types = {}
+                for param in params:
+                        # If the type is a pointer, the * can be next to the
+                        # param name. Other complex types like arrays are not
+                        # supported right now.
+                        index = param.rfind('*')
+                        index = index if index != -1 else param.rfind(' ')
+                        param_type = param[0:index+1].strip()
+                        param_name = param[index+1:].strip()
+                        self.param_types[param_name] = param_type
+        entry_probe_text = """
+int PROBENAME(struct pt_regs *ctx SIGNATURE)
+{
+        u32 pid = bpf_get_current_pid_tgid();
+        PID_FILTER
+        COLLECT
+        return 0;
+}
+"""
+        def _generate_entry(self):
+                self.entry_probe_func = self.probe_func_name + "_entry"
+                text = self.entry_probe_text
+                text = text.replace("PROBENAME", self.entry_probe_func)
+                text = text.replace("SIGNATURE",
+                     "" if len(self.signature) == 0 else ", " + self.signature)
+                pid_filter = "" if self.is_user or self.pid is None \
+                                else "if (pid != %d) { return 0; }" % self.pid
+                text = text.replace("PID_FILTER", pid_filter)
+                collect = ""
+                for pname in self.args_to_probe:
+                        param_hash = self.hashname_prefix + pname
+                        if pname == "__latency":
+                                collect += """
+u64 __time = bpf_ktime_get_ns();
+%s.update(&pid, &__time);
+"""                             % param_hash
+                        else:
+                                collect += "%s.update(&pid, &%s);\n" % \
+                                           (param_hash, pname)
+                text = text.replace("COLLECT", collect)
+                return text
+        def _generate_entry_probe(self):
+                # Any $entry(name) expressions result in saving that argument
+                # when entering the function.
+                self.args_to_probe = set()
+                regex = r"\$entry\((\w+)\)"
+                for expr in self.exprs:
+                        for arg in re.finditer(regex, expr):
+                                self.args_to_probe.add(arg.group(1))
+                for arg in re.finditer(regex, self.filter):
+                        self.args_to_probe.add(arg.group(1))
+                if any(map(lambda expr: "$latency" in expr, self.exprs)) or \
+                   "$latency" in self.filter:
+                        self.args_to_probe.add("__latency")
+                        self.param_types["__latency"] = "u64"    # nanoseconds
+                for pname in self.args_to_probe:
+                        if pname not in self.param_types:
+                                raise ValueError("$entry(%s): no such param" \
+                                                % arg)
+                self.hashname_prefix = "%s_param_" % self.probe_hash_name
+                text = ""
+                for pname in self.args_to_probe:
+                        # Each argument is stored in a separate hash that is
+                        # keyed by pid.
+                        text += "BPF_HASH(%s, u32, %s);\n" % \
+                             (self.hashname_prefix + pname,
+                              self.param_types[pname])
+                text += self._generate_entry()
+                return text
+        def _generate_retprobe_prefix(self):
+                # After we're done here, there are __%s_val variables for each
+                # argument we needed to probe using $entry(name), and they all
+                # have values (which isn't necessarily the case if we missed
+                # the method entry probe).
+                text = "u32 __pid = bpf_get_current_pid_tgid();\n"
+                self.param_val_names = {}
+                for pname in self.args_to_probe:
+                        val_name = "__%s_val" % pname
+                        text += "%s *%s = %s.lookup(&__pid);\n" % \
+                                (self.param_types[pname], val_name,
+                                 self.hashname_prefix + pname)
+                        text += "if (%s == 0) { return 0 ; }\n" % val_name
+                        self.param_val_names[pname] = val_name
+                return text
+        def _replace_entry_exprs(self):
+                for pname, vname in self.param_val_names.items():
+                        if pname == "__latency":
+                                entry_expr = "$latency"
+                                val_expr = "(bpf_ktime_get_ns() - *%s)" % vname
+                        else:
+                                entry_expr = "$entry(%s)" % pname
+                                val_expr = "(*%s)" % vname
+                        for i in range(0, len(self.exprs)):
+                                self.exprs[i] = self.exprs[i].replace(
+                                                entry_expr, val_expr)
+                        self.filter = self.filter.replace(entry_expr,
+                                                          val_expr)
+        def _attach_entry_probe(self):
+                if self.is_user:
+                        self.bpf.attach_uprobe(name=self.library,
+                                               sym=self.function,
+                                               fn_name=self.entry_probe_func,
+                                               pid=self.pid or -1)
+                else:
+                        self.bpf.attach_kprobe(event=self.function,
+                                               fn_name=self.entry_probe_func)
+        def _bail(self, error):
+                raise ValueError("error parsing probe '%s': %s" %
+                                 (self.raw_spec, error))
+        def _validate_specifier(self):
+                # Everything after '#' is the probe label, ignore it
+                spec = self.raw_spec.split('#')[0]
+                parts = spec.strip().split(':')
+                if len(parts) < 3:
+                        self._bail("at least the probe type, library, and " +
+                                   "function signature must be specified")
+                if len(parts) > 6:
+                        self._bail("extraneous ':'-separated parts detected")
+                if parts[0] not in ["r", "p"]:
+                        self._bail("probe type must be either 'p' or 'r', " +
+                                   "but got '%s'" % parts[0])
+                if re.match(r"\w+\(.*\)", parts[2]) is None:
+                        self._bail(("function signature '%s' has an invalid " +
+                                    "format") % parts[2])
+        def _parse_expr_types(self, expr_types):
+                if len(expr_types) == 0:
+                        self._bail("no expr types specified")
+                self.expr_types = expr_types.split(',')
+        def _parse_exprs(self, exprs):
+                if len(exprs) == 0:
+                        self._bail("no exprs specified")
+                self.exprs = exprs.split(',')
        def __init__(self, type, specifier, pid):
-                self.raw_spec = specifier 
+                self.raw_spec = specifier
-                spec_and_label = specifier.split(';')
+                self._validate_specifier()
+                spec_and_label = specifier.split('#')
                self.label = spec_and_label[1] \
                             if len(spec_and_label) == 2 else None
                parts = spec_and_label[0].strip().split(':')
-                if len(parts) < 3 or len(parts) > 6:
-                        raise ValueError("invalid specifier format")
                self.type = type    # hist or freq
                self.is_ret_probe = parts[0] == "r"
-                if self.type != "hist" and self.type != "freq":
-                        raise ValueError("unrecognized probe type")
-                if parts[0] not in ["r", "p"]:
-                        raise ValueError("unrecognized probe type")
                self.library = parts[1]
                self.is_user = len(self.library) > 0
                fparts = parts[2].split('(')
-                if len(fparts) != 2:
+                self.function = fparts[0].strip()
-                        raise ValueError("invalid specifier format")
+                self.signature = fparts[1].strip()[:-1]
-                self.function = fparts[0]
+                self._parse_signature()
-                self.signature = fparts[1][:-1]
+                # If the user didn't specify an expression to probe, we probe
+                # the retval in a ret probe, or simply the value "1" otherwise.
                self.is_default_expr = len(parts) < 5
                if not self.is_default_expr:
-                        self.expr_type = parts[3]
+                        self._parse_expr_types(parts[3])
-                        self.expr = parts[4]
+                        self._parse_exprs(parts[4])
+                        if len(self.exprs) != len(self.expr_types):
+                                self._bail("mismatched # of exprs and types")
+                        if self.type == "hist" and len(self.expr_types) > 1:
+                                self._bail("histograms can only have 1 expr")
                else:
                        if not self.is_ret_probe and self.type == "hist":
-                                raise ValueError("dist probes must have expr")
+                                self._bail("histograms must have expr")
-                        self.expr_type = \
+                        self.expr_types = \
-                                "u64" if not self.is_ret_probe else "int"
+                                ["u64" if not self.is_ret_probe else "int"]
-                        self.expr = "1" if not self.is_ret_probe else "$retval"
+                        self.exprs = \
-                self.expr = self.expr.replace("$retval",
+                                ["1" if not self.is_ret_probe else "$retval"]
-                                              "(%s)ctx->ax" % self.expr_type)
+                self.filter = "" if len(parts) != 6 else parts[5]
-                self.filter = None if len(parts) != 6 else parts[5]
+                self._substitute_exprs()
-                if self.filter is not None:
-                        self.filter = self.filter.replace("$retval",
+                # Do we need to attach an entry probe so that we can collect an 
-                                "(%s)ctx->ax" % self.expr_type)
+                # argument that is required for an exit (return) probe?
-                self.expr = self._substitute_aliases(self.expr)
+                def check(expr):
-                self.filter = self._substitute_aliases(self.filter)
+                        keywords = ["$entry", "$latency"]
+                        return any(map(lambda kw: kw in expr, keywords))
+                self.entry_probe_required = self.is_ret_probe and \
+                        (any(map(check, self.exprs)) or check(self.filter))
                self.pid = pid
                self.probe_func_name = "%s_probe%d" % \
                        (self.function, Specifier.next_probe_index)
@@ -83,51 +257,107 @@ int PROBENAME(struct pt_regs *ctx SIGNATURE)
                        (self.function, Specifier.next_probe_index)
                Specifier.next_probe_index += 1
-        def _is_string_probe(self):
+        def _substitute_exprs(self):
-                return self.expr_type == "char*" or self.expr_type == "char *"
+                def repl(expr):
+                        expr = self._substitute_aliases(expr)
+                        return expr.replace("$retval", "ctx->ax")
+                for i in range(0, len(self.exprs)):
+                        self.exprs[i] = repl(self.exprs[i])
+                self.filter = repl(self.filter)
-        def generate_text(self, string_size):
+        def _is_string(self, expr_type):
-                program = self.text.replace("PROBENAME", self.probe_func_name)
+                return expr_type == "char*" or expr_type == "char *"
-                signature = "" if len(self.signature) == 0 \
-                               else "," + self.signature
+        def _generate_hash_field(self, i):
-                program = program.replace("SIGNATURE", signature)
+                if self._is_string(self.expr_types[i]):
+                        return "struct __string_t v%d;\n" % i
+                else:
+                        return "%s v%d;\n" % (self.expr_types[i], i)
+        def _generate_field_assignment(self, i):
+                if self._is_string(self.expr_types[i]):
+                        return "bpf_probe_read(" + \
+                               "&__key.v%d.s, sizeof(__key.v%d.s), %s);\n" % \
+                                (i, i, self.exprs[i])
+                else:
+                        return "__key.v%d = %s;\n" % (i, self.exprs[i])
+        def _generate_hash_decl(self):
+                if self.type == "hist":
+                        return "BPF_HISTOGRAM(%s, %s);" % \
+                               (self.probe_hash_name, self.expr_types[0])
+                else:
+                        text = "struct %s_key_t {\n" % self.probe_hash_name
+                        for i in range(0, len(self.expr_types)):
+                                text += self._generate_hash_field(i)
+                        text += "};\n"
+                        text += "BPF_HASH(%s, struct %s_key_t, u64);\n" % \
+                                (self.probe_hash_name, self.probe_hash_name)
+                        return text
+        def _generate_key_assignment(self):
+                if self.type == "hist":
+                        return "%s __key = %s;\n" % \
+                                (self.expr_types[0], self.exprs[0])
+                else:
+                        text = "struct %s_key_t __key = {};\n" % \
+                                self.probe_hash_name
+                        for i in range(0, len(self.exprs)):
+                                text += self._generate_field_assignment(i) 
+                        return text
+        def _generate_hash_update(self):
+                if self.type == "hist":
+                        return "%s.increment(bpf_log2l(__key));" % \
+                                self.probe_hash_name 
+                else:
+                        return "%s.increment(__key);" % self.probe_hash_name
+        def _generate_pid_filter(self):
+                # Kernel probes need to explicitly filter pid, because the
+                # attach interface doesn't support pid filtering
                if self.pid is not None and not self.is_user:
-                        # kernel probes need to explicitly filter pid
+                        return "u32 pid = bpf_get_current_pid_tgid();\n" + \
-                        program = program.replace("PID_FILTER",
+                               "if (pid != %d) { return 0; }" % self.pid
-                                "u32 pid = bpf_get_current_pid_tgid();\n" + \
-                                "if (pid != %d) { return 0; }" % self.pid)
                else:
-                        program = program.replace("PID_FILTER", "")
+                        return ""
-                if self._is_string_probe():
-                        decl = """
+        def generate_text(self):
-struct %s_key_t { char key[%d]; };
+                # We don't like tools writing tools (Brendan Gregg), but this
-BPF_HASH(%s, struct %s_key_t, u64);
+                # is an exception because we're letting the user fully
-""" \
+                # customize the values we probe. As a rule of thumb though,
-                        % (self.function, string_size,
+                # try to build a custom tool for a specific purpose.
-                           self.probe_hash_name, self.function)
-                        collect = "%s.increment(__key);" % self.probe_hash_name
+                program = ""
-                        key_expr = """
-struct %s_key_t __key = {0};
+                # If any entry arguments are probed in a ret probe, we need
-bpf_probe_read(&__key.key, sizeof(__key.key), %s);
+                # to generate an entry probe to collect them
-""" \
+                prefix = ""
-                        % (self.function, self.expr)
+                if self.entry_probe_required:
-                elif self.type == "freq":
+                        program = self._generate_entry_probe()
-                        decl = "BPF_HASH(%s, %s, u64);" % \
+                        prefix = self._generate_retprobe_prefix()                                         
-                                (self.probe_hash_name, self.expr_type)
+                        # Replace $entry(paramname) with a reference to the
-                        collect = "%s.increment(__key);" % self.probe_hash_name
+                        # value we collected when entering the function:
-                        key_expr = "%s __key = %s;" % \
+                        self._replace_entry_exprs()
-                                   (self.expr_type, self.expr)
-                elif self.type == "hist":
+                program += self.probe_text.replace("PROBENAME",
-                        decl = "BPF_HISTOGRAM(%s, %s);" % \
+                                                   self.probe_func_name)
-                                (self.probe_hash_name, self.expr_type)
+                signature = "" if len(self.signature) == 0 \
-                        collect = "%s.increment(bpf_log2l(__key));" % \
+                                  or self.is_ret_probe \
-                                  self.probe_hash_name 
+                               else ", " + self.signature
-                        key_expr = "%s __key = %s;" % \
+                program = program.replace("SIGNATURE", signature)
-                                   (self.expr_type, self.expr)
+                program = program.replace("PID_FILTER",
+                                          self._generate_pid_filter())
+                decl = self._generate_hash_decl()
+                key_expr = self._generate_key_assignment()
+                collect = self._generate_hash_update()
                program = program.replace("DATA_DECL", decl)
                program = program.replace("KEY_EXPR", key_expr) 
-                program = program.replace("FILTER", self.filter or "1") 
+                program = program.replace("FILTER",
+                        "1" if len(self.filter) == 0 else self.filter) 
                program = program.replace("COLLECT", collect)
+                program = program.replace("PREFIX", prefix)
                return program
        def attach(self, bpf):
@@ -150,81 +380,131 @@ bpf_probe_read(&__key.key, sizeof(__key.key), %s);
                        else:
                                bpf.attach_kprobe(event=self.function,
                                                  fn_name=self.probe_func_name)
+                if self.entry_probe_required:
+                        self._attach_entry_probe()
+        def _v2s(self, v):
+                # Most fields can be converted with plain str(), but strings
+                # are wrapped in a __string_t which has an .s field
+                if "__string_t" in type(v).__name__:
+                        return str(v.s)
+                return str(v)
+        def _display_expr(self, i):
+                # Replace ugly latency calculation with $latency
+                expr = self.exprs[i].replace(
+                        "(bpf_ktime_get_ns() - *____latency_val)", "$latency")
+                # Replace alias values back with the alias name
+                for alias, subst in Specifier.aliases.items():
+                        expr = expr.replace(subst, alias) 
+                # Replace retval expression with $retval
+                expr = expr.replace("ctx->ax", "$retval")
+                # Replace ugly (*__param_val) expressions with param name
+                return re.sub(r"\(\*__(\w+)_val\)", r"\1", expr)
-        def display(self):
+        def _display_key(self, key):
-                print(self.label or self.raw_spec)
+                if self.is_default_expr:
+                        if not self.is_ret_probe:
+                                return "total calls"
+                        else:
+                                return "retval = %s" % str(key.v0)
+                else:
+                        # The key object has v0, ..., vk fields containing
+                        # the values of the expressions from self.exprs
+                        def str_i(i):
+                                key_i = self._v2s(getattr(key, "v%d" % i))
+                                return "%s = %s" % \
+                                        (self._display_expr(i), key_i)
+                        return ", ".join(map(str_i, range(0, len(self.exprs))))
+        def display(self, top):
                data = self.bpf.get_table(self.probe_hash_name)
                if self.type == "freq":
+                        print(self.label or self.raw_spec)
                        print("\t%-10s %s" % ("COUNT", "EVENT"))
-                        for key, value in sorted(data.items(),
+                        data = sorted(data.items(), key=lambda kv: kv[1].value)
-                                                 key=lambda kv: kv[1].value):
+                        if top is not None:
-                                key_val = key.key if self._is_string_probe() \
+                                data = data[-top:]
-                                                  else str(key.value)
+                        for key, value in data: 
+                                # Print some nice values if the user didn't
+                                # specify an expression to probe
                                if self.is_default_expr:
                                        if not self.is_ret_probe:
                                                key_str = "total calls"
                                        else:
                                                key_str = "retval = %s" % \
-                                                          key_val
+                                                          self._v2s(key.v0)
                                else:
-                                        key_str = "%s = %s" % \
+                                        key_str = self._display_key(key) 
-                                                  (self.expr, key_val)
                                print("\t%-10s %s" % \
                                      (str(value.value), key_str))
                elif self.type == "hist":
-                        label = self.expr if not self.is_default_expr \
+                        label = self.label or (self._display_expr(0)
-                                          else "retval"
+                                if not self.is_default_expr  else "retval")
                        data.print_log2_hist(val_type=label)
 examples = """
 Probe specifier syntax:
-        {p,r}:[library]:function(signature)[:type:expr[:filter]][;label]
+        {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 Where:
        p,r        -- probe at function entry or at function exit
-                      in exit probes, only $retval is accessible
+                      in exit probes: can use $retval, $entry(param), $latency
        library    -- the library that contains the function
                      (leave empty for kernel functions)
        function   -- the function name to trace
        signature  -- the function's parameters, as in the C header
-        type       -- the type of the expression to collect
+        type       -- the type of the expression to collect (supports multiple)
-        expr       -- the expression to collect
+        expr       -- the expression to collect (supports multiple)
        filter     -- the filter that is applied to collected values
        label      -- the label for this probe in the resulting output
 EXAMPLES:
-argdist.py -H 'p::__kmalloc(u64 size):u64:size'
+argdist -H 'p::__kmalloc(u64 size):u64:size'
        Print a histogram of allocation sizes passed to kmalloc
-argdist.py -p 1005 -C 'p:c:malloc(size_t size):size_t:size:size==16'
+argdist -p 1005 -C 'p:c:malloc(size_t size):size_t:size:size==16'
        Print a frequency count of how many times process 1005 called malloc
        with an allocation size of 16 bytes
-argdist.py -C 'r:c:gets():char*:$retval;snooped strings'
+argdist -C 'r:c:gets():char*:(char*)$retval#snooped strings'
        Snoop on all strings returned by gets()
-argdist.py -p 1005 -C 'p:c:write(int fd):int:fd'
+argdist -H 'r::__kmalloc(size_t size):u64:$latency/$entry(size)#ns per byte'
+        Print a histogram of nanoseconds per byte from kmalloc allocations
+argdist -C 'p::__kmalloc(size_t size, gfp_t flags):size_t:size:flags&GFP_ATOMIC'
+        Print frequency count of kmalloc allocation sizes that have GFP_ATOMIC
+argdist -p 1005 -C 'p:c:write(int fd):int:fd' -T 5
        Print frequency counts of how many times writes were issued to a
-        particular file descriptor number, in process 1005
+        particular file descriptor number, in process 1005, but only show
+        the top 5 busiest fds
+argdist -p 1005 -H 'r:c:read()'
+        Print a histogram of results (sizes) returned by read() in process 1005
-argdist.py -p 1005 -H 'r:c:read()'
+argdist -C 'r::__vfs_read():u32:$PID:$latency > 100000'
-        Print a histogram of error codes returned by read() in process 1005
+        Print frequency of reads by process where the latency was >0.1ms
-argdist.py -H \\
+argdist -H 'r::__vfs_read(void *file, void *buf, size_t count):size_t:$entry(count):$latency > 1000000' 
+        Print a histogram of read sizes that were longer than 1ms
+argdist -H \\
        'p:c:write(int fd, const void *buf, size_t count):size_t:count:fd==1'
        Print a histogram of buffer sizes passed to write() across all
        processes, where the file descriptor was 1 (STDOUT)
-argdist.py -C 'p:c:fork();fork calls'
+argdist -C 'p:c:fork()#fork calls'
        Count fork() calls in libc across all processes
        Can also use funccount.py, which is easier and more flexible 
-argdist.py \\
+argdist  -H \\
-        -H 'p:c:sleep(u32 seconds):u32:seconds' \\
+        'p:c:sleep(u32 seconds):u32:seconds' \\
-        -H 'p:c:nanosleep(struct timespec { time_t tv_sec; long tv_nsec; } *req):long:req->tv_nsec'
+        'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
        Print histograms of sleep() and nanosleep() parameter values
-argdist.py -p 2780 -z 120 \\
+argdist -p 2780 -z 120 \\
        -C 'p:c:write(int fd, char* buf, size_t len):char*:buf:fd==1'
        Spy on writes to STDOUT performed by process 2780, up to a string size
        of 120 characters 
@@ -242,12 +522,18 @@ parser.add_argument("-i", "--interval", default=1, type=int,
        help="output interval, in seconds")
 parser.add_argument("-n", "--number", type=int, dest="count",
        help="number of outputs")
+parser.add_argument("-v", "--verbose", action="store_true",
+        help="print resulting BPF program code before executing")
+parser.add_argument("-T", "--top", type=int,
+        help="number of top results to show (not applicable to histograms)")
 parser.add_argument("-H", "--histogram", nargs="*", dest="histspecifier",
+        metavar="specifier",
        help="probe specifier to capture histogram of (see examples below)")
 parser.add_argument("-C", "--count", nargs="*", dest="countspecifier",
+        metavar="specifier",
        help="probe specifier to capture count of (see examples below)")
-parser.add_argument("-v", "--verbose", action="store_true",
+parser.add_argument("-I", "--include", nargs="*", metavar="header",
-        help="print resulting BPF program code before executing")
+        help="additional header files to include in the BPF program")
 args = parser.parse_args()
 specifiers = []
@@ -259,9 +545,16 @@ if len(specifiers) == 0:
        print("at least one specifier is required")
        exit(1)
-bpf_source = "#include <uapi/linux/ptrace.h>\n"
+bpf_source = """
+struct __string_t { char s[%d]; };
+#include <uapi/linux/ptrace.h>
+""" % args.string_size
+for include in (args.include or []):
+        bpf_source += "#include <%s>\n" % include
+bpf_source += Specifier.generate_auto_includes(map(lambda s: s.raw_spec, specifiers))
 for specifier in specifiers:
-        bpf_source += specifier.generate_text(args.string_size)
+        bpf_source += specifier.generate_text()
 if args.verbose:
        print(bpf_source)
@@ -279,7 +572,7 @@ while True:
                exit()
        print("[%s]" % strftime("%H:%M:%S"))
        for specifier in specifiers:
-                specifier.display()
+                specifier.display(args.top)
        count_so_far += 1
        if args.count is not None and count_so_far >= args.count:
                exit()
--- a/tools/argdist_examples.txt
+++ b/tools/argdist_examples.txt
@@ -10,7 +10,7 @@ various functions.
 For example, suppose you want to find what allocation sizes are common in
 your application:
-# ./argdist.py -p 2420 -C 'p:c:malloc(size_t size):size_t:size'
+# ./argdist -p 2420 -C 'p:c:malloc(size_t size):size_t:size'
 [01:42:29]
 p:c:malloc(size_t size):size_t:size
        COUNT      EVENT
@@ -43,7 +43,7 @@ probed and its value was 16, repeatedly.
 Now, suppose you wanted a histogram of buffer sizes passed to the write()
 function across the system:
-# ./argdist.py -H 'p:c:write(int fd, void *buf, size_t len):size_t:len'
+# ./argdist -H 'p:c:write(int fd, void *buf, size_t len):size_t:len'
 [01:45:22]
 p:c:write(int fd, void *buf, size_t len):size_t:len
     len                 : count     distribution
@@ -81,7 +81,7 @@ bytes, medium writes of 32-63 bytes, and larger writes of 64-127 bytes.
 But these are writes across the board -- what if you wanted to focus on writes
 to STDOUT?
-# ./argdist.py -H 'p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1'
+# ./argdist -H 'p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1'
 [01:47:17]
 p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1
     len                 : count     distribution
@@ -120,7 +120,7 @@ You can also use argdist to trace kernel functions. For example, suppose you
 wanted a histogram of kernel allocation (kmalloc) sizes across the system,
 printed twice with 3 second intervals:
-# ./argdist.py -i 3 -n 2 -H 'p::__kmalloc(size_t size):size_t:size'
+# ./argdist -i 3 -n 2 -H 'p::__kmalloc(size_t size):size_t:size'
 [01:50:00]
 p::__kmalloc(size_t size):size_t:size
     size                : count     distribution
@@ -143,7 +143,7 @@ p::__kmalloc(size_t size):size_t:size
 Occasionally, numeric information isn't enough and you want to capture strings.
 What are the strings printed by puts() across the system?
-# ./argdist.py -i 10 -n 1 -C 'p:c:puts(char *str):char*:str'
+# ./argdist -i 10 -n 1 -C 'p:c:puts(char *str):char*:str'
 [01:53:54]
 p:c:puts(char *str):char*:str
        COUNT      EVENT
@@ -156,17 +156,17 @@ What about reads? You could trace gets() across the system and print the
 strings input by the user (note how "r" is used instead of "p" to attach a
 probe to the function's return):
-# ./argdist.py -i 10 -n 1 -C 'r:c:gets():char*:$retval:$retval!=0'
+# ./argdist -i 10 -n 1 -C 'r:c:gets():char*:(char*)$retval:$retval!=0'
 [02:12:23]
 r:c:gets():char*:$retval:$retval!=0
        COUNT      EVENT
-        1          (char*)ctx->ax = hi there
+        1          (char*)$retval = hi there
-        3          (char*)ctx->ax = sasha
+        3          (char*)$retval = sasha
-        8          (char*)ctx->ax = hello
+        8          (char*)$retval = hello
 Similarly, we could get a histogram of the error codes returned by read():
-# ./argdist.py -i 10 -c 1 -H 'r:c:read()'
+# ./argdist -i 10 -c 1 -H 'r:c:read()'
 [02:15:36]
 r:c:read()
     retval              : count     distribution
@@ -183,32 +183,109 @@ r:c:read()
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 2        |**                                      |
+In return probes, you can also trace the latency of the function (unless it is
+recursive) and the parameters it had on entry. For example, we can identify
+which processes are performing slow synchronous filesystem reads -- say,
+longer than 0.1ms (100,000ns):
+# ./argdist -C 'r::__vfs_read():u32:$PID:$latency > 100000'
+[01:08:48]
+r::__vfs_read():u32:$PID:$latency > 100000
+        COUNT      EVENT
+        1          $PID = 10457
+        21         $PID = 2780
+[01:08:49]
+r::__vfs_read():u32:$PID:$latency > 100000
+        COUNT      EVENT
+        1          $PID = 10457
+        21         $PID = 2780
+^C
+It looks like process 2780 performed 21 slow reads.
+Occasionally, entry parameter values are also interesting. For example, you
+might be curious how long it takes malloc() to allocate memory -- nanoseconds
+per byte allocated. Let's go:
+# ./argdist -H 'r:c:malloc(size_t size):u64:$latency/$entry(size);ns per byte' -n 1 -i 10
+[01:11:13]
+     ns per byte         : count     distribution
+         0 -> 1          : 0        |                                        |
+         2 -> 3          : 4        |*****************                       |
+         4 -> 7          : 3        |*************                           |
+         8 -> 15         : 2        |********                                |
+        16 -> 31         : 1        |****                                    |
+        32 -> 63         : 0        |                                        |
+        64 -> 127        : 7        |*******************************         |
+       128 -> 255        : 1        |****                                    |
+       256 -> 511        : 0        |                                        |
+       512 -> 1023       : 1        |****                                    |
+      1024 -> 2047       : 1        |****                                    |
+      2048 -> 4095       : 9        |****************************************|
+      4096 -> 8191       : 1        |****                                    | 
+It looks like a tri-modal distribution. Some allocations are extremely cheap,
+and take 2-15 nanoseconds per byte. Other allocations are slower, and take
+64-127 nanoseconds per byte. And some allocations are slower still, and take
+multiple microseconds per byte.
+You could also group results by more than one field. For example, __kmalloc
+takes an additional flags parameter that describes how to allocate memory:
+# ./argdist -C 'p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size'
+[03:42:29]
+p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size
+        COUNT      EVENT
+        1          flags = 16, size = 152
+        2          flags = 131280, size = 8
+        7          flags = 131280, size = 16
+[03:42:30]
+p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size
+        COUNT      EVENT
+        1          flags = 16, size = 152
+        6          flags = 131280, size = 8
+        19         flags = 131280, size = 16
+[03:42:31]
+p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size
+        COUNT      EVENT
+        2          flags = 16, size = 152
+        10         flags = 131280, size = 8
+        31         flags = 131280, size = 16
+[03:42:32]
+p::__kmalloc(size_t size, gfp_t flags):gfp_t,size_t:flags,size
+        COUNT      EVENT
+        2          flags = 16, size = 152
+        14         flags = 131280, size = 8
+        43         flags = 131280, size = 16
+^C
+The flags value must be expanded by hand, but it's still helpful to eliminate
+certain kinds of allocations or visually group them together.
 Here's a final example that finds how many write() system calls are performed
 by each process on the system:
-# argdist.py -C 'p:c:write():int:$PID;write per process' -n 2
+# argdist -C 'p:c:write():int:$PID;write per process' -n 2
 [06:47:18]
 write by process
        COUNT      EVENT
-        3          bpf_get_current_pid_tgid() = 8889
+        3          $PID = 8889
-        7          bpf_get_current_pid_tgid() = 7615
+        7          $PID = 7615
-        7          bpf_get_current_pid_tgid() = 2480
+        7          $PID = 2480
 [06:47:19]
 write by process
        COUNT      EVENT
-        9          bpf_get_current_pid_tgid() = 8889
+        9          $PID = 8889
-        23         bpf_get_current_pid_tgid() = 7615
+        23         $PID = 7615
-        23         bpf_get_current_pid_tgid() = 2480
+        23         $PID = 2480
-As you see, the $PID alias is expanded to the BPF function bpf_get_current_pid_tgid(),
-which returns the current process' pid.
 USAGE message:
-usage: argdist.py [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL] [-n COUNT]
+# argdist -h
-                  [-H [HISTSPECIFIER [HISTSPECIFIER ...]]]
+usage: argdist [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL] [-n COUNT] [-v]
-                  [-C [COUNTSPECIFIER [COUNTSPECIFIER ...]]] [-v]
+                  [-T TOP] [-H [specifier [specifier ...]]]
+                  [-C [specifier [specifier ...]]] [-I [header [header ...]]]
 Trace a function and display a summary of its parameter values.
@@ -221,63 +298,79 @@ optional arguments:
                        output interval, in seconds
  -n COUNT, --number COUNT
                        number of outputs
-  -H [HISTSPECIFIER [HISTSPECIFIER ...]], --histogram [HISTSPECIFIER [HISTSPECIFIER ...]]
+  -v, --verbose         print resulting BPF program code before executing
+  -T TOP, --top TOP     number of top results to show (not applicable to
+                        histograms)
+  -H [specifier [specifier ...]], --histogram [specifier [specifier ...]]
                        probe specifier to capture histogram of (see examples
                        below)
-  -C [COUNTSPECIFIER [COUNTSPECIFIER ...]], --count [COUNTSPECIFIER [COUNTSPECIFIER ...]]
+  -C [specifier [specifier ...]], --count [specifier [specifier ...]]
                        probe specifier to capture count of (see examples
                        below)
-  -v, --verbose         print resulting BPF program code before executing
+  -I [header [header ...]], --include [header [header ...]]
+                        additional header files to include in the BPF program
 Probe specifier syntax:
-        {p,r}:[library]:function(signature)[:type:expr[:filter]][;label]
+        {p,r}:[library]:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
 Where:
        p,r        -- probe at function entry or at function exit
-                      in exit probes, only $retval is accessible
+                      in exit probes: can use $retval, $entry(param), $latency
        library    -- the library that contains the function
                      (leave empty for kernel functions)
        function   -- the function name to trace
        signature  -- the function's parameters, as in the C header
-        type       -- the type of the expression to collect
+        type       -- the type of the expression to collect (supports multiple)
-        expr       -- the expression to collect
+        expr       -- the expression to collect (supports multiple)
        filter     -- the filter that is applied to collected values
        label      -- the label for this probe in the resulting output
 EXAMPLES:
-argdist.py -H 'p::__kmalloc(u64 size):u64:size'
+argdist -H 'p::__kmalloc(u64 size):u64:size'
        Print a histogram of allocation sizes passed to kmalloc
-argdist.py -p 1005 -C 'p:c:malloc(size_t size):size_t:size:size==16'
+argdist -p 1005 -C 'p:c:malloc(size_t size):size_t:size:size==16'
        Print a frequency count of how many times process 1005 called malloc
        with an allocation size of 16 bytes
-argdist.py -C 'r:c:gets():char*:$retval;snooped strings'
+argdist -C 'r:c:gets():char*:$retval#snooped strings'
        Snoop on all strings returned by gets()
-argdist.py -p 1005 -C 'p:c:write(int fd):int:fd'
+argdist -H 'r::__kmalloc(size_t size):u64:$latency/$entry(size)#ns per byte'
+        Print a histogram of nanoseconds per byte from kmalloc allocations
+argdist -C 'p::__kmalloc(size_t size, gfp_t flags):size_t:size:flags&GFP_ATOMIC'
+        Print frequency count of kmalloc allocation sizes that have GFP_ATOMIC
+argdist -p 1005 -C 'p:c:write(int fd):int:fd' -T 5
        Print frequency counts of how many times writes were issued to a
-        particular file descriptor number, in process 1005
+        particular file descriptor number, in process 1005, but only show
+        the top 5 busiest fds
-argdist.py -p 1005 -H 'r:c:read()'
+argdist -p 1005 -H 'r:c:read()'
        Print a histogram of error codes returned by read() in process 1005
-argdist.py -H \
+argdist -C 'r::__vfs_read():u32:$PID:$latency > 100000'
+        Print frequency of reads by process where the latency was >0.1ms
+argdist -H 'r::__vfs_read(void *file, void *buf, size_t count):size_t:$entry(count):$latency > 1000000' 
+        Print a histogram of read sizes that were longer than 1ms
+argdist -H \
        'p:c:write(int fd, const void *buf, size_t count):size_t:count:fd==1'
        Print a histogram of buffer sizes passed to write() across all
        processes, where the file descriptor was 1 (STDOUT)
-argdist.py -C 'p:c:fork();fork calls'
+argdist -C 'p:c:fork()#fork calls'
        Count fork() calls in libc across all processes
-        Can also use funccount.py, which is easier and more flexible
+        Can also use funccount.py, which is easier and more flexible 
-argdist.py \
+argdist  -H \
-        -H 'p:c:sleep(u32 seconds):u32:seconds' \
+        'p:c:sleep(u32 seconds):u32:seconds' \
-        -H 'p:c:nanosleep(struct timespec { time_t tv_sec; long tv_nsec; } *req):long:req->tv_nsec'
+        'p:c:nanosleep(struct timespec *req):long:req->tv_nsec'
        Print histograms of sleep() and nanosleep() parameter values
-argdist.py -p 2780 -z 120 \
+argdist -p 2780 -z 120 \
        -C 'p:c:write(int fd, char* buf, size_t len):char*:buf:fd==1'
        Spy on writes to STDOUT performed by process 2780, up to a string size
-        of 120 characters
+        of 120 characters