Merge pull request #971 from goldshtn/syscount

syscount: Summarize syscall counts and latencies

Commit 8ca91fca, authored Feb 16, 2017 by Brenden Blanco; committed by GitHub on Feb 16, 2017
Showing with 728 additions and 0 deletions

README.md +1 -0
man/man8/syscount.8 +100 -0
tools/syscount.py +500 -0
tools/syscount_example.txt +127 -0

--- a/README.md
+++ b/README.md
@@ -122,6 +122,7 @@ Examples:
 - tools/[stacksnoop](tools/stacksnoop.py): Trace a kernel function and print all kernel stack traces. [Examples](tools/stacksnoop_example.txt).
 - tools/[statsnoop](tools/statsnoop.py): Trace stat() syscalls. [Examples](tools/statsnoop_example.txt).
 - tools/[syncsnoop](tools/syncsnoop.py): Trace sync() syscall. [Examples](tools/syncsnoop_example.txt).
+- tools/[syscount](tools/syscount.py): Summarize syscall counts and latencies. [Examples](tools/syscount_example.txt).
 - tools/[tcpaccept](tools/tcpaccept.py): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
 - tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
 - tools/[tcpconnlat](tools/tcpconnlat.py): Trace TCP active connection latency (connect()). [Examples](tools/tcpconnlat_example.txt).

--- a/man/man8/syscount.8
+++ b/man/man8/syscount.8
+.TH syscount 8  "2017-02-15" "USER COMMANDS"
+.SH NAME
+syscount \- Summarize syscall counts and latencies.
+.SH SYNOPSIS
+.B syscount [-h] [-p PID] [-i INTERVAL] [-T TOP] [-x] [-L] [-m] [-P] [-l]
+.SH DESCRIPTION
+This tool traces syscall entry and exit tracepoints and summarizes either the
+number of syscalls of each type, or the number of syscalls per process. It can
+also collect latency (invocation time) for each syscall or each process.
+
+Since this uses BPF, only the root user can use this tool.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc. Linux 4.7+ is required to attach a BPF program to the
+raw_syscalls:sys_{enter,exit} tracepoints, used by this tool.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-p PID
+Trace only this process.
+.TP
+\-i INTERVAL
+Print the summary at the specified interval (in seconds).
+.TP
+\-T TOP
+Print only this many entries. Default: 10.
+.TP
+\-x
+Trace only failed syscalls (i.e., the return value from the syscall was < 0).
+.TP
+\-m
+Display times in milliseconds. Default: microseconds.
+.TP
+\-P
+Summarize by process and not by syscall.
+.TP
+\-l
+List the syscalls recognized by the tool (hard-coded list). Syscalls beyond this
+list will still be displayed, as "[unknown: nnn]" where nnn is the syscall
+number.
+.SH EXAMPLES
+.TP
+Summarize all syscalls by syscall:
+#
+.B syscount
+.TP
+Summarize all syscalls by process:
+#
+.B syscount \-P
+.TP
+Summarize only failed syscalls:
+#
+.B syscount \-x
+.TP
+Trace PID 181 only:
+#
+.B syscount \-p 181
+.TP
+Summarize syscall counts and latencies:
+#
+.B syscount \-L
+.SH FIELDS
+.TP
+PID
+Process ID
+.TP
+COMM
+Process name
+.TP
+SYSCALL
+Syscall name, or "[unknown: nnn]" for syscalls that aren't recognized
+.TP
+COUNT
+The number of events
+.TP
+TIME
+The total elapsed time (in us or ms)
+.SH OVERHEAD
+For most applications, the overhead should be manageable if they perform
+thousands or even tens of thousands of syscalls per second. For higher rates,
+the overhead may become considerable. For example, tracing a loop of 4 million
+calls to geteuid() slows it down by 1.85x when tracing only syscall counts, and
+by more than 5x when tracing syscall counts and latencies. However, this rate
+(>3.5 million syscalls per second) should not be typical.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Sasha Goldshtein
+.SH SEE ALSO
+funccount(8), ucalls(8), argdist(8), trace(8), funclatency(8)
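
The OVERHEAD section of the man page above quotes figures from tracing a tight loop of geteuid() calls. The following is a minimal sketch of such a workload in Python, for measuring the overhead on your own machine. The function name `geteuid_loop` and its default iteration count are illustrative assumptions and not part of this commit; because of interpreter overhead, a Python loop will reach a far lower syscall rate than a native loop would, so the slowdown factors it shows need not match the ones quoted.

```python
import os
import time


def geteuid_loop(iterations=4_000_000):
    """Call geteuid() `iterations` times; return (calls made, elapsed seconds)."""
    start = time.perf_counter()
    calls = 0
    for _ in range(iterations):
        os.geteuid()  # a cheap syscall that returns immediately
        calls += 1
    return calls, time.perf_counter() - start
```

Calling `geteuid_loop()` once on an idle system, once while `syscount` is tracing, and once while `syscount -L` is tracing gives three elapsed times whose ratios approximate the tracing overhead for this workload.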
--- a/tools/syscount.py
+++ b/tools/syscount.py
--- a/tools/syscount_example.txt
+++ b/tools/syscount_example.txt
+Demonstrations of syscount, the Linux/eBPF version.
+
+
+syscount summarizes syscall counts across the system or a specific process,
+with optional latency information. It is very useful for general workload
+characterization, for example:
+
+# syscount
+Tracing syscalls, printing top 10... Ctrl+C to quit.
+[09:39:04]
+SYSCALL             COUNT
+write               10739
+read                10584
+wait4                1460
+nanosleep            1457
+select                795
+rt_sigprocmask        689
+clock_gettime         653
+rt_sigaction          128
+futex                  86
+ioctl                  83
+^C
+
+These are the top 10 entries; you can get more by using the -T switch. Here,
+the output indicates that the write and read syscalls were very common, followed
+immediately by wait4, nanosleep, and so on. By default, syscount counts across
+the entire system, but we can point it to a specific process of interest:
+
+# syscount -p $(pidof dd)
+Tracing syscalls, printing top 10... Ctrl+C to quit.
+[09:40:21]
+SYSCALL             COUNT
+read              7878397
+write             7878397
+^C
+
+Indeed, dd's workload is a bit easier to characterize. Occasionally, the count
+of syscalls is not enough, and you'd also want an aggregate latency:
+
+# syscount -L
+Tracing syscalls, printing top 10... Ctrl+C to quit.
+[09:41:32]
+SYSCALL                   COUNT        TIME (us)
+select                       16      3415860.022
+nanosleep                   291        12038.707
+ftruncate                     1          122.939
+write                         4           63.389
+stat                          1           23.431
+fstat                         1            5.088
+[unknown: 321]               32            4.965
+timerfd_settime               1            4.830
+ioctl                         3            4.802
+kill                          1            4.342
+^C
+
+The select and nanosleep calls are responsible for a lot of time, but remember
+these are blocking calls. This output was taken from a mostly idle system. Note
+the "unknown" entry -- syscall 321 is the bpf() syscall, which is not in the
+table used by this tool (borrowed from strace sources).
+
+Another direction would be to understand which processes are making a lot of
+syscalls, thus responsible for a lot of activity. This is what the -P switch
+does:
+
+# syscount -P
+Tracing syscalls, printing top 10... Ctrl+C to quit.
+[09:58:13]
+PID    COMM               COUNT
+13820  vim                  548
+30216  sshd                 149
+29633  bash                  72
+25188  screen                70
+25776  mysqld                30
+31285  python                10
+529    systemd-udevd          9
+1      systemd                8
+494    systemd-journal        5
+^C
+
+This is again from a mostly idle system over an interval of a few seconds.
+
+Sometimes, you'd only care about failed syscalls -- these are the ones that
+might be worth investigating with follow-up tools like opensnoop, execsnoop,
+or trace. Use the -x switch for this; the following example also demonstrates
+the -i switch, for printing at predefined intervals:
+
+# syscount -x -i 5
+Tracing failed syscalls, printing top 10... Ctrl+C to quit.
+[09:44:16]
+SYSCALL             COUNT
+futex                  13
+getxattr               10
+stat                    8
+open                    6
+wait4                   3
+access                  2
+[unknown: 321]          1
+
+[09:44:21]
+SYSCALL             COUNT
+futex                  12
+getxattr               10
+[unknown: 321]          2
+wait4                   1
+access                  1
+pause                   1
+^C
+
+USAGE:
+# syscount -h
+usage: syscount.py [-h] [-p PID] [-i INTERVAL] [-T TOP] [-x] [-L] [-m] [-P]
+                   [-l]
+
+Summarize syscall counts and latencies.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -p PID, --pid PID     trace only this pid
+  -i INTERVAL, --interval INTERVAL
+                        print summary at this interval (seconds)
+  -T TOP, --top TOP     print only the top syscalls by count or latency
+  -x, --failures        trace only failed syscalls (return < 0)
+  -L, --latency         collect syscall latency
+  -m, --milliseconds    display latency in milliseconds (default:
+                        microseconds)
+  -P, --process         count by process and not by syscall
+  -l, --list            print list of recognized syscalls and exit
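
The examples above show two behaviors that are easy to illustrate in isolation: only the top entries are printed (the -T switch), and syscall numbers missing from the tool's name table are shown as "[unknown: nnn]". The following is a minimal sketch of that ranking and naming logic in plain Python. It is not the code from tools/syscount.py; the helper names, the tiny name table, and the counts are hypothetical, and the column widths only approximate the layout shown in the examples.

```python
def syscall_name(nr, table):
    """Return the name for syscall number `nr`, or a placeholder if unknown."""
    return table.get(nr, "[unknown: %d]" % nr)


def top_entries(counts, table, top=10):
    """Return up to `top` (name, count) pairs, highest count first."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(syscall_name(nr, table), count) for nr, count in ranked[:top]]


def format_summary(entries):
    """Render rows in a two-column SYSCALL/COUNT layout."""
    lines = ["%-15s %10s" % ("SYSCALL", "COUNT")]
    lines += ["%-15s %10d" % (name, count) for name, count in entries]
    return "\n".join(lines)


# Hypothetical data: a tiny x86-64 name table and made-up counts.
# Syscall 321 (bpf on x86-64) is deliberately left out of the table.
TABLE = {0: "read", 1: "write", 16: "ioctl"}
COUNTS = {0: 10584, 1: 10739, 16: 83, 321: 32}

print(format_summary(top_entries(COUNTS, TABLE, top=3)))
```

Keeping the name lookup separate from the ranking means an unrecognized syscall number still gets counted and ranked like any other; only its display name falls back to the placeholder.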