Commit 5c48a3fa authored by Ivan Babrou, committed by yonghong-song

Add runqslower tool (#1728)

* Add runqslower tool

* Remove mentions of obsolete enqueue_task_* in tools/runq*

* Use u32 for pid field in runqslower
parent ad2d0d9f
......@@ -130,6 +130,7 @@ pair of .c and .py files, and some are directories of files.
- tools/[reset-trace](tools/reset-trace.sh): Reset the state of tracing. Maintenance tool only. [Examples](tools/reset-trace_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[runqlen](tools/runqlen.py): Run queue length as a histogram. [Examples](tools/runqlen_example.txt).
- tools/[runqslower](tools/runqslower.py): Trace long process scheduling delays. [Examples](tools/runqslower_example.txt).
- tools/[slabratetop](tools/slabratetop.py): Kernel SLAB/SLUB memory cache allocation rate top. [Examples](tools/slabratetop_example.txt).
- tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[solisten](tools/solisten.py): Trace TCP socket listen. [Examples](tools/solisten_example.txt).
......
......@@ -13,7 +13,8 @@ wait its turn.
This tool measures two types of run queue latency:
1. The time from a task being enqueued on a run queue to its context switch
and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
finish_task_switch() with either raw tracepoints (if supported) or kprobes,
and instruments the run queue latency after a voluntary context switch.
2. The time from when a task was involuntary context switched and still
......@@ -109,4 +110,4 @@ Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
runqlen(8), pidstat(1)
runqlen(8), runqslower(8), pidstat(1)
......@@ -83,4 +83,4 @@ Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
runqlat(8), pidstat(1)
runqlat(8), runqslower(8), pidstat(1)
.TH runqslower 8 "2018-05-02" "USER COMMANDS"
.SH NAME
runqslower \- Trace long process scheduling delays.
.SH SYNOPSIS
.B runqslower [\-p PID] [min_us]
.SH DESCRIPTION
This measures the time a task spends waiting on a run queue (or equivalent
scheduler data structure) for a turn on-CPU, and shows occurrences where this
time exceeds the given threshold. This time should be small, but a task may need
to wait its turn due to CPU load. The higher the CPU load, the longer a task
will generally need to wait its turn.
This tool measures two types of run queue latency:
1. The time from a task being enqueued on a run queue to its context switch
and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
finish_task_switch() with either raw tracepoints (if supported) or kprobes,
and instruments the run queue latency after a voluntary context switch.
2. The time from when a task was involuntarily context switched and still
in the runnable state, to when it next executed. This is instrumented
from finish_task_switch() alone.
The overhead of this tool may become significant for some workloads:
see the OVERHEAD section.
This works by tracing various kernel scheduler functions using dynamic tracing,
and will need updating to match any changes to these functions.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-p PID
Only show this PID (filtered in kernel for efficiency).
.TP
min_us
Minimum scheduling delay in microseconds to output.
.SH EXAMPLES
.TP
Show scheduling delays longer than 10ms:
#
.B runqslower
.TP
Show scheduling delays longer than 1ms for the process with PID 123:
#
.B runqslower -p 123 1000
.SH FIELDS
.TP
TIME
Time at which the scheduling event occurred.
.TP
COMM
Process name.
.TP
PID
Process ID.
.TP
LAT(us)
Scheduling latency from the time the task was ready to run to the time it
started running on a CPU.
.SH OVERHEAD
This traces scheduler functions, which can become very frequent. While eBPF
has very low overhead, and this tool uses in-kernel maps for efficiency, the
frequency of scheduler events for some workloads may be high enough that the
overhead of this tool becomes significant. Measure in a lab environment
to quantify the overhead before use.
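.PP
As a rough guide (assuming bcc's funccount(8) is also installed), the rate of
context switch events can be estimated beforehand by counting the sched_switch
tracepoint for a short interval, for example:
.PP
#
.B funccount -d 10 't:sched:sched_switch'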
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Ivan Babrou
.SH SEE ALSO
runqlen(8), runqlat(8), pidstat(1)
......@@ -12,7 +12,8 @@
#
# This measures two types of run queue latency:
# 1. The time from a task being enqueued on a run queue to its context switch
# and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
# finish_task_switch() with either raw tracepoints (if supported) or kprobes,
# and instruments the run queue latency after a voluntary context switch.
# 2. The time from when a task was involuntary context switched and still
# in the runnable state, to when it next executed. This is instrumented
......
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# runqslower Trace long process scheduling delays.
# For Linux, uses BCC, eBPF.
#
# This script traces high scheduling delays between a task becoming ready
# to run and it actually running on a CPU.
#
# USAGE: runqslower [-p PID] [min_us]
#
# REQUIRES: Linux 4.9+ (BPF_PROG_TYPE_PERF_EVENT support).
#
# This measures the time a task spends waiting on a run queue for a turn
# on-CPU, and shows this time as individual events. This time should be small,
# but a task may need to wait its turn due to CPU load.
#
# This measures two types of run queue latency:
# 1. The time from a task being enqueued on a run queue to its context switch
# and execution. This traces ttwu_do_wakeup(), wake_up_new_task() ->
# finish_task_switch() with either raw tracepoints (if supported) or kprobes,
# and instruments the run queue latency after a voluntary context switch.
# 2. The time from when a task was involuntarily context switched and still
# in the runnable state, to when it next executed. This is instrumented
# from finish_task_switch() alone.
#
# Copyright 2016 Cloudflare, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 02-May-2018 Ivan Babrou Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from time import strftime
import ctypes as ct
# arguments
examples = """examples:
./runqslower # trace run queue latency higher than 10000 us (default)
./runqslower 1000 # trace run queue latency higher than 1000 us
./runqslower -p 123 # trace pid 123 only
"""
parser = argparse.ArgumentParser(
    description="Trace high run queue latency",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-p", "--pid", type=int, metavar="PID", dest="pid",
    help="trace this PID only")
parser.add_argument("min_us", nargs="?", default='10000',
    help="minimum run queue latency to trace, in us (default 10000)")
parser.add_argument("--ebpf", action="store_true",
    help=argparse.SUPPRESS)
args = parser.parse_args()
min_us = int(args.min_us)
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/pid_namespace.h>
BPF_HASH(start, u32);
struct rq;
struct data_t {
    u32 pid;
    char task[TASK_COMM_LEN];
    u64 delta_us;
};
BPF_PERF_OUTPUT(events);
// record enqueue timestamp
static int trace_enqueue(u32 tgid, u32 pid)
{
    if (FILTER_PID)
        return 0;
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}
"""
bpf_text_kprobe = """
int trace_wake_up_new_task(struct pt_regs *ctx, struct task_struct *p)
{
    return trace_enqueue(p->tgid, p->pid);
}

int trace_ttwu_do_wakeup(struct pt_regs *ctx, struct rq *rq, struct task_struct *p,
    int wake_flags)
{
    return trace_enqueue(p->tgid, p->pid);
}
// calculate latency
int trace_run(struct pt_regs *ctx, struct task_struct *prev)
{
    u32 pid, tgid;

    // ivcsw: treat like an enqueue event and store timestamp
    if (prev->state == TASK_RUNNING) {
        tgid = prev->tgid;
        pid = prev->pid;
        if (!(FILTER_PID)) {
            u64 ts = bpf_ktime_get_ns();
            start.update(&pid, &ts);
        }
    }

    tgid = bpf_get_current_pid_tgid() >> 32;
    pid = bpf_get_current_pid_tgid();
    u64 *tsp, delta_us;

    // fetch timestamp and calculate delta
    tsp = start.lookup(&pid);
    if (tsp == 0) {
        return 0;   // missed enqueue
    }
    delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;

    if (FILTER_US)
        return 0;

    struct data_t data = {};
    data.pid = pid;
    data.delta_us = delta_us;
    bpf_get_current_comm(&data.task, sizeof(data.task));

    // output
    events.perf_submit(ctx, &data, sizeof(data));
    start.delete(&pid);
    return 0;
}
"""
bpf_text_raw_tp = """
RAW_TRACEPOINT_PROBE(sched_wakeup)
{
    // TP_PROTO(struct task_struct *p)
    struct task_struct *p = (struct task_struct *)ctx->args[0];
    u32 tgid, pid;

    bpf_probe_read(&tgid, sizeof(tgid), &p->tgid);
    bpf_probe_read(&pid, sizeof(pid), &p->pid);
    return trace_enqueue(tgid, pid);
}

RAW_TRACEPOINT_PROBE(sched_wakeup_new)
{
    // TP_PROTO(struct task_struct *p)
    struct task_struct *p = (struct task_struct *)ctx->args[0];
    u32 tgid, pid;

    bpf_probe_read(&tgid, sizeof(tgid), &p->tgid);
    bpf_probe_read(&pid, sizeof(pid), &p->pid);
    return trace_enqueue(tgid, pid);
}

RAW_TRACEPOINT_PROBE(sched_switch)
{
    // TP_PROTO(bool preempt, struct task_struct *prev, struct task_struct *next)
    struct task_struct *prev = (struct task_struct *)ctx->args[1];
    struct task_struct *next = (struct task_struct *)ctx->args[2];
    u32 pid, tgid;
    long state;

    // ivcsw: treat like an enqueue event and store timestamp
    bpf_probe_read(&state, sizeof(long), &prev->state);
    if (state == TASK_RUNNING) {
        bpf_probe_read(&tgid, sizeof(prev->tgid), &prev->tgid);
        bpf_probe_read(&pid, sizeof(prev->pid), &prev->pid);
        if (!(FILTER_PID)) {
            u64 ts = bpf_ktime_get_ns();
            start.update(&pid, &ts);
        }
    }

    bpf_probe_read(&tgid, sizeof(next->tgid), &next->tgid);
    bpf_probe_read(&pid, sizeof(next->pid), &next->pid);
    u64 *tsp, delta_us;

    // fetch timestamp and calculate delta
    tsp = start.lookup(&pid);
    if (tsp == 0) {
        return 0;   // missed enqueue
    }
    delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;

    if (FILTER_US)
        return 0;

    struct data_t data = {};
    data.pid = pid;
    data.delta_us = delta_us;
    bpf_get_current_comm(&data.task, sizeof(data.task));

    // output
    events.perf_submit(ctx, &data, sizeof(data));
    start.delete(&pid);
    return 0;
}
"""
is_support_raw_tp = BPF.support_raw_tracepoint()
if is_support_raw_tp:
    bpf_text += bpf_text_raw_tp
else:
    bpf_text += bpf_text_kprobe
# code substitutions
if min_us == 0:
    bpf_text = bpf_text.replace('FILTER_US', '0')
else:
    bpf_text = bpf_text.replace('FILTER_US', 'delta_us <= %s' % str(min_us))
if args.pid:
    bpf_text = bpf_text.replace('FILTER_PID', 'pid != %s' % args.pid)
else:
    bpf_text = bpf_text.replace('FILTER_PID', '0')
if debug or args.ebpf:
    print(bpf_text)
    if args.ebpf:
        exit()
# kernel->user event data: struct data_t
DNAME_INLINE_LEN = 32 # linux/dcache.h
TASK_COMM_LEN = 16 # linux/sched.h
class Data(ct.Structure):
    _fields_ = [
        ("pid", ct.c_uint),
        ("task", ct.c_char * TASK_COMM_LEN),
        ("delta_us", ct.c_ulonglong),
    ]
# process event
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    print("%-8s %-16s %-6s %14s" % (strftime("%H:%M:%S"), event.task, event.pid, event.delta_us))
# load BPF program
b = BPF(text=bpf_text)
if not is_support_raw_tp:
    b.attach_kprobe(event="ttwu_do_wakeup", fn_name="trace_ttwu_do_wakeup")
    b.attach_kprobe(event="wake_up_new_task", fn_name="trace_wake_up_new_task")
    b.attach_kprobe(event="finish_task_switch", fn_name="trace_run")
print("Tracing run queue latency higher than %d us" % min_us)
print("%-8s %-16s %-6s %14s" % ("TIME", "COMM", "PID", "LAT(us)"))
# read events
b["events"].open_perf_buffer(print_event, page_cnt=64)
while 1:
    b.perf_buffer_poll()
Demonstrations of runqslower, the Linux eBPF/bcc version.
runqslower shows high scheduling delays between a task becoming ready to run
and it actually running on a CPU. For example:
# runqslower
Tracing run queue latency higher than 10000 us
TIME COMM PID LAT(us)
04:16:32 cc1 12924 12739
04:16:32 sh 13640 12118
04:16:32 make 13639 12730
04:16:32 bash 13655 12047
04:16:32 bash 13657 12744
04:16:32 bash 13656 12880
04:16:32 sh 13660 10846
04:16:32 gcc 13663 12681
04:16:32 make 13668 10814
04:16:32 make 13670 12988
04:16:32 gcc 13677 11770
04:16:32 gcc 13678 23519
04:16:32 as 12999 20541
[...]
This shows various processes waiting for available CPU during a Linux kernel
build. By default the output contains delays of more than 10 ms.
These delays can be analyzed in depth with the "perf sched" tool, see:
* http://www.brendangregg.com/blog/2017-03-16/perf-sched.html
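For example, assuming a perf(1) build with the sched subcommand available, one
way to record and then summarize scheduler latency system-wide is (a sketch;
subcommands and output vary by perf version):

# perf sched record -- sleep 10
# perf sched latency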
USAGE message:
# ./runqslower -h
usage: runqslower.py [-h] [-p PID] [min_us]
Trace high run queue latency
positional arguments:
min_us minimum run queue latency to trace, in us (default 10000)
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
examples:
./runqslower # trace run queue latency higher than 10000 us (default)
./runqslower 1000 # trace run queue latency higher than 1000 us
./runqslower -p 123 # trace pid 123 only
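runqslower attaches raw tracepoints when the kernel supports them, and falls
back to kprobes otherwise. To check which mode will be used on a given system
(assuming the bcc Python bindings are installed), you can query the same helper
the tool calls internally:

# python -c 'from bcc import BPF; print(BPF.support_raw_tracepoint())'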