Commit 3ff0a80a authored by Brendan Gregg, committed by Alastair Robertson

add runqlat tool

parent 5f95444d
@@ -151,6 +151,7 @@ bpftrace contains various tools, which also serve as examples of programming in
- tools/[opensnoop.bt](tools/opensnoop.bt): Trace open() syscalls showing filenames. [Examples](tools/opensnoop_example.txt).
- tools/[oomkill.bt](tools/oomkill.bt): Trace OOM killer. [Examples](tools/oomkill_example.txt).
- tools/[pidpersec.bt](tools/pidpersec.bt): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[runqlat.bt](tools/runqlat.bt): CPU scheduler run queue latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[runqlen.bt](tools/runqlen.bt): CPU scheduler run queue length as a histogram. [Examples](tools/runqlen_example.txt).
- tools/[statsnoop.bt](tools/statsnoop.bt): Trace stat() syscalls for general debugging. [Examples](tools/statsnoop_example.txt).
- tools/[syncsnoop.bt](tools/syncsnoop.bt): Trace sync() variety of syscalls. [Examples](tools/syncsnoop_example.txt).
.TH runqlat 8 "2018-09-17" "USER COMMANDS"
.SH NAME
runqlat.bt \- CPU scheduler run queue latency as a histogram. Uses bpftrace/eBPF.
.SH SYNOPSIS
.B runqlat.bt
.SH DESCRIPTION
This traces time spent waiting in the CPU scheduler for a turn on-CPU. This
metric is often called run queue latency, or scheduler latency. This tool shows
this latency as a power-of-2 histogram in microseconds, allowing multimodal
distributions to be studied, as well as latency outliers.
This tool uses the sched tracepoints.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bpftrace.
.SH EXAMPLES
.TP
Trace CPU run queue latency system wide, printing a histogram on Ctrl-C:
#
.B runqlat.bt
.SH FIELDS
.TP
1st, 2nd
This is a range of latency, in microseconds (shown in "[...)" set notation).
.TP
3rd
A column showing the count of scheduler events in this range.
.TP
4th
This is an ASCII histogram representing the count column.
.SH OVERHEAD
This traces scheduler functions, which can become very frequent. While eBPF
has very low overhead, and this tool uses in-kernel maps for efficiency, the
frequency of scheduler events for some workloads may be high enough that the
overhead of this tool becomes significant. Measure in a lab environment
to quantify the overhead before use.
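.PP
As a rough guide, the rate of these scheduler events can be estimated beforehand
with a bpftrace one-liner, for example:
.PP
# bpftrace -e 'tracepoint:sched:sched_wakeup, tracepoint:sched:sched_switch { @[probe] = count(); } interval:s:1 { print(@); clear(@); }'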
.SH SOURCE
This is from bpftrace.
.IP
https://github.com/iovisor/bpftrace
.PP
Also look in the bpftrace distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.PP
This is a bpftrace version of the bcc tool of the same name. The bcc tool
may provide more options and customizations.
.IP
https://github.com/iovisor/bcc
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
runqlen(8), mpstat(1), pidstat(1), uptime(1)
@@ -52,4 +52,4 @@ Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
mpstat(1), pidstat(1), uptime(1)
runqlat(8), mpstat(1), pidstat(1), uptime(1)
/*
* runqlat.bt CPU scheduler run queue latency as a histogram.
* For Linux, uses bpftrace, eBPF.
*
* This is a bpftrace version of the bcc tool of the same name.
*
* Copyright 2018 Netflix, Inc.
* Licensed under the Apache License, Version 2.0 (the "License")
*
* 17-Sep-2018 Brendan Gregg Created this.
*/
#include <linux/sched.h>

BEGIN
{
        printf("Tracing CPU scheduler... Hit Ctrl-C to end.\n");
}

// record the time a task is enqueued (woken up)
tracepoint:sched:sched_wakeup
{
        @qtime[args->pid] = nsecs;
}

tracepoint:sched:sched_wakeup_new
{
        @qtime[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
{
        $TASK_RUNNING = 0;      // from linux/sched.h, workaround for #153

        // a preempted but still-runnable task re-enters the run queue
        if (args->prev_state == $TASK_RUNNING) {
                @qtime[args->prev_pid] = nsecs;
        }

        // time from enqueue to running on-CPU, as a microsecond histogram
        $ns = @qtime[args->next_pid];
        if ($ns) {
                @usecs = hist((nsecs - $ns) / 1000);
        }
        delete(@qtime[args->next_pid]);
}

END
{
        clear(@qtime);
}
Demonstrations of runqlat, the Linux BPF/bpftrace version.
This traces time spent waiting in the CPU scheduler for a turn on-CPU. This
metric is often called run queue latency, or scheduler latency. This tool shows
this latency as a power-of-2 histogram in microseconds. For example:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 12 |@@ |
[2, 4) 16 |@@@ |
[4, 8) 43 |@@@@@@@@@@ |
[8, 16) 134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 220 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64) 117 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 84 |@@@@@@@@@@@@@@@@@@@ |
[128, 256) 10 |@@ |
[256, 512) 2 | |
[512, 1K) 5 |@ |
[1K, 2K) 5 |@ |
[2K, 4K) 5 |@ |
[4K, 8K) 4 | |
[8K, 16K) 1 | |
[16K, 32K) 2 | |
[32K, 64K) 0 | |
[64K, 128K) 1 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 1 | |
This is an idle system where most of the time we are waiting for less than
128 microseconds, shown by the mode above. As an example of reading the output,
the above histogram shows 220 scheduling events with a run queue latency of
between 16 and 32 microseconds.
The output also shows an outlier taking between 0.5 and 1 seconds: most likely
this work was scheduled behind another higher priority task and had to wait for
it. The kernel decides whether it is worth migrating such work to an idle CPU,
or leaving it to wait its turn on its current CPU run queue, where the CPU
caches should be hotter.
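To investigate such outliers further, the same tracepoints can be used in a
one-off bpftrace invocation. The following is only a rough sketch (it measures
from wakeup only, and ignores the preemption case that runqlat.bt also handles),
printing the process name whenever run queue latency exceeds an assumed 100 ms
threshold:

# bpftrace -e '
    tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new { @q[args->pid] = nsecs; }
    tracepoint:sched:sched_switch {
        $ns = @q[args->next_pid];
        // print the task about to run if it queued for more than 100 ms
        if ($ns && (nsecs - $ns) > 100000000) {
            printf("%s (pid %d) queued for %d ms\n", args->next_comm,
                args->next_pid, (nsecs - $ns) / 1000000);
        }
        delete(@q[args->next_pid]);
    }'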
I'll now add a single-threaded CPU bound workload to this system, and bind
it on one CPU:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 6 |@@@ |
[2, 4) 26 |@@@@@@@@@@@@@ |
[4, 8) 97 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 72 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 17 |@@@@@@@@@ |
[32, 64) 19 |@@@@@@@@@@ |
[64, 128) 20 |@@@@@@@@@@ |
[128, 256) 3 |@ |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 1 | |
[2K, 4K) 1 | |
[4K, 8K) 4 |@@ |
[8K, 16K) 3 |@ |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 1 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 1 | |
That didn't make much difference: with only a single runnable thread bound to
that CPU, it rarely has to queue behind anything else.
Now I'll add a second single-threaded CPU workload, and bind it to the same
CPU, causing contention:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 9 |@@@ |
[2, 4) 28 |@@@@@@@@@@@@ |
[4, 8) 95 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 120 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 22 |@@@@@@@@@ |
[32, 64) 10 |@@@@ |
[64, 128) 7 |@@@ |
[128, 256) 3 |@ |
[256, 512) 1 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 2 | |
[4K, 8K) 4 |@ |
[8K, 16K) 107 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 0 | |
[256K, 512K) 1 | |
There's now a second mode between 8 and 16 milliseconds, as each thread must
wait its turn on the one CPU.
Now I'll run 10 CPU-bound threads on one CPU:
# bpftrace runqlat.bt
Attaching 5 probes...
Tracing CPU scheduler... Hit Ctrl-C to end.
^C
@usecs:
[0, 1] 12 |@ |
[2, 4) 38 |@@@@ |
[4, 8) 63 |@@@@@@ |
[8, 16) 106 |@@@@@@@@@@@ |
[16, 32) 28 |@@@ |
[32, 64) 13 |@ |
[64, 128) 15 |@ |
[128, 256) 2 | |
[256, 512) 2 | |
[512, 1K) 1 | |
[1K, 2K) 1 | |
[2K, 4K) 2 | |
[4K, 8K) 4 | |
[8K, 16K) 3 | |
[16K, 32K) 0 | |
[32K, 64K) 478 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K) 1 | |
[128K, 256K) 0 | |
[256K, 512K) 0 | |
[512K, 1M) 0 | |
[1M, 2M) 1 | |
This shows that most of the time threads need to wait their turn, with the
largest mode between 32 and 64 milliseconds.
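
If it matters which threads are waiting, the histogram can be keyed by process
name. A minimal sketch along the same lines as runqlat.bt (again measuring from
wakeup only):

# bpftrace -e '
    tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new { @q[args->pid] = nsecs; }
    tracepoint:sched:sched_switch {
        $ns = @q[args->next_pid];
        // one histogram per process name, in microseconds
        if ($ns) { @usecs[args->next_comm] = hist((nsecs - $ns) / 1000); }
        delete(@q[args->next_pid]);
    }'

This prints a separate histogram for each process name on Ctrl-C, which shows
whether the millisecond-level waits all belong to the CPU-bound threads.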
There is another version of this tool in bcc: https://github.com/iovisor/bcc
The bcc version provides options to customize the output.