Commit 3ff0a80a authored by Brendan Gregg, committed by Alastair Robertson

add runqlat tool

parent 5f95444d
@@ -151,6 +151,7 @@ bpftrace contains various tools, which also serve as examples of programming in
- tools/[opensnoop.bt](tools/opensnoop.bt): Trace open() syscalls showing filenames. [Examples](tools/opensnoop_example.txt).
- tools/[oomkill.bt](tools/oomkill.bt): Trace OOM killer. [Examples](tools/oomkill_example.txt).
- tools/[pidpersec.bt](tools/pidpersec.bt): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat.bt](tools/runqlat.bt): CPU scheduler run queue latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[runqlen.bt](tools/runqlen.bt): CPU scheduler run queue length as a histogram. [Examples](tools/runqlen_example.txt).
- tools/[statsnoop.bt](tools/statsnoop.bt): Trace stat() syscalls for general debugging. [Examples](tools/statsnoop_example.txt).
- tools/[syncsnoop.bt](tools/syncsnoop.bt): Trace sync() variety of syscalls. [Examples](tools/syncsnoop_example.txt).
.TH runqlat 8 "2018-09-17" "USER COMMANDS"
.SH NAME
runqlat.bt \- CPU scheduler run queue latency as a histogram. Uses bpftrace/eBPF.
.SH SYNOPSIS
.B runqlat.bt
.SH DESCRIPTION
This traces time spent waiting in the CPU scheduler for a turn on-CPU. This
metric is often called run queue latency, or scheduler latency. This tool shows
this latency as a power-of-2 histogram in microseconds, allowing multimodal
distributions to be studied, as well as latency outliers.
This tool uses the sched tracepoints.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bpftrace.
.SH EXAMPLES
.TP
Trace CPU run queue latency system wide, printing a histogram on Ctrl-C:
#
.B runqlat.bt
.SH FIELDS
.TP
1st, 2nd
This is a range of latency, in microseconds (shown in "[...)" set notation).
.TP
3rd
A column showing the count of scheduler events in this range.
.TP
4th
This is an ASCII histogram representing the count column.
.SH OVERHEAD
This traces scheduler functions, which can become very frequent. While eBPF
has very low overhead, and this tool uses in-kernel maps for efficiency, the
frequency of scheduler events for some workloads may be high enough that the
overhead of this tool becomes significant. Measure in a lab environment
to quantify the overhead before use.
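.PP
As a rough guide, the rate of these scheduler events can be estimated beforehand
with a bpftrace one-liner, for example:
.PP
# bpftrace -e 'tracepoint:sched:sched_wakeup, tracepoint:sched:sched_switch { @[probe] = count(); } interval:s:1 { print(@); clear(@); }'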
.SH SOURCE
This is from bpftrace.
.IP
https://github.com/iovisor/bpftrace
.PP
Also look in the bpftrace distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.PP
This is a bpftrace version of the bcc tool of the same name. The bcc tool
may provide more options and customizations.
.IP
https://github.com/iovisor/bcc
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
runqlen(8), mpstat(1), pidstat(1), uptime(1)
@@ -52,4 +52,4 @@ Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
mpstat(1), pidstat(1), uptime(1)
runqlat(8), mpstat(1), pidstat(1), uptime(1)
/*
* runqlat.bt CPU scheduler run queue latency as a histogram.
* For Linux, uses bpftrace, eBPF.
*
* This is a bpftrace version of the bcc tool of the same name.
*
* Copyright 2018 Netflix, Inc.
* Licensed under the Apache License, Version 2.0 (the "License")
*
* 17-Sep-2018 Brendan Gregg Created this.
*/
#include <linux/sched.h>

BEGIN
{
        printf("Tracing CPU scheduler... Hit Ctrl-C to end.\n");
}

// record the time a task is enqueued (woken up)
tracepoint:sched:sched_wakeup
{
        @qtime[args->pid] = nsecs;
}

tracepoint:sched:sched_wakeup_new
{
        @qtime[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
{
        $TASK_RUNNING = 0;      // from linux/sched.h, workaround for #153

        // a preempted but still-runnable task re-enters the run queue
        if (args->prev_state == $TASK_RUNNING) {
                @qtime[args->prev_pid] = nsecs;
        }

        // time from enqueue to running on-CPU, as a microsecond histogram
        $ns = @qtime[args->next_pid];
        if ($ns) {
                @usecs = hist((nsecs - $ns) / 1000);
        }
        delete(@qtime[args->next_pid]);
}

END
{
        clear(@qtime);
}
Demonstrations of runqlat, the Linux BPF/bpftrace version.
This traces time spent waiting in the CPU scheduler for a turn on-CPU. This
metric is often called run queue latency, or scheduler latency. This tool shows
this latency as a power-of-2 histogram in microseconds. For example:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 12 |@@ |
[2, 4) 16 |@@@ |
[4, 8) 43 |@@@@@@@@@@ |
[8, 16) 134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 220 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64) 117 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 84 |@@@@@@@@@@@@@@@@@@@ |
[128, 256) 10 |@@ |
[256, 512) 2 | |
[512, 1K) 5 |@ |
[1K, 2K) 5 |@ |
[2K, 4K) 5 |@ |
[4K, 8K) 4 | |
[8K, 16K) 1 | |
[16K, 32K) 2 | |
[32K, 64K) 0 | |
[64K, 128K) 1 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 1 | |
This is an idle system where most of the time we are waiting for less than
128 microseconds, shown by the mode above. As an example of reading the output,
the above histogram shows 220 scheduling events with a run queue latency of
between 16 and 32 microseconds.
The output also shows an outlier taking between 0.5 and 1 seconds: most likely
this work was scheduled behind another higher priority task and had to wait for
it. The kernel decides whether it is worth migrating such work to an idle CPU,
or leaving it to wait its turn on its current CPU run queue, where the CPU
caches should be hotter.
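To investigate such outliers further, the same tracepoints can be used in a
one-off bpftrace invocation. The following is only a rough sketch (it measures
from wakeup only, and ignores the preemption case that runqlat.bt also handles),
printing the process name whenever run queue latency exceeds an assumed 100 ms
threshold:

# bpftrace -e '
    tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new { @q[args->pid] = nsecs; }
    tracepoint:sched:sched_switch {
        $ns = @q[args->next_pid];
        // print the task about to run if it queued for more than 100 ms
        if ($ns && (nsecs - $ns) > 100000000) {
            printf("%s (pid %d) queued for %d ms\n", args->next_comm,
                args->next_pid, (nsecs - $ns) / 1000000);
        }
        delete(@q[args->next_pid]);
    }'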
I'll now add a single-threaded CPU bound workload to this system, and bind
it on one CPU:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 6 |@@@ |
[2, 4) 26 |@@@@@@@@@@@@@ |
[4, 8) 97 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 72 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 17 |@@@@@@@@@ |
[32, 64) 19 |@@@@@@@@@@ |
[64, 128) 20 |@@@@@@@@@@ |
[128, 256) 3 |@ |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 1 | |
[2K, 4K) 1 | |
[4K, 8K) 4 |@@ |
[8K, 16K) 3 |@ |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 1 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 1 | |
That didn't make much difference: with only a single runnable thread bound to
that CPU, it rarely has to queue behind anything else.
Now I'll add a second single-threaded CPU workload, and bind it to the same
CPU, causing contention:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 9 |@@@ |
[2, 4) 28 |@@@@@@@@@@@@ |
[4, 8) 95 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 120 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 22 |@@@@@@@@@ |
[32, 64) 10 |@@@@ |
[64, 128) 7 |@@@ |
[128, 256) 3 |@ |
[256, 512) 1 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 2 | |
[4K, 8K) 4 |@ |
[8K, 16K) 107 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 0 | |
[256K, 512K) 1 | |
There's now a second mode between 8 and 16 milliseconds, as each thread must
wait its turn on the one CPU.
Now I'll run 10 CPU-bound threads on one CPU:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 12 |@ |
[2, 4) 38 |@@@@ |
[4, 8) 63 |@@@@@@ |
[8, 16) 106 |@@@@@@@@@@@ |
[16, 32) 28 |@@@ |
[32, 64) 13 |@ |
[64, 128) 15 |@ |
[128, 256) 2 | |
[256, 512) 2 | |
[512, 1K) 1 | |
[1K, 2K) 1 | |
[2K, 4K) 2 | |
[4K, 8K) 4 | |
[8K, 16K) 3 | |
[16K, 32K) 0 | |
[32K, 64K) 478 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K) 1 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 1 | |
This shows that most of the time threads need to wait their turn, with the
largest mode between 32 and 64 milliseconds.
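
If it matters which threads are waiting, the histogram can be keyed by process
name. A minimal sketch along the same lines as runqlat.bt (again measuring from
wakeup only):

# bpftrace -e '
    tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new { @q[args->pid] = nsecs; }
    tracepoint:sched:sched_switch {
        $ns = @q[args->next_pid];
        // one histogram per process name, in microseconds
        if ($ns) { @usecs[args->next_comm] = hist((nsecs - $ns) / 1000); }
        delete(@q[args->next_pid]);
    }'

This prints a separate histogram for each process name on Ctrl-C, which shows
whether the millisecond-level waits all belong to the CPU-bound threads.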
There is another version of this tool in bcc: https://github.com/iovisor/bcc
The bcc version provides options to customize the output.