Commit 0abd93e5 authored by 4ast's avatar 4ast Committed by GitHub

Merge pull request #870 from brendangregg/master

add cpuunclaimed
parents dd5799bc 06223652
......@@ -86,6 +86,7 @@ Examples:
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[cachetop](tools/cachetop.py): Trace page cache hit/miss ratio by processes. [Examples](tools/cachetop_example.txt).
- tools/[cpudist](tools/cpudist.py): Summarize on- and off-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt)
- tools/[cpuunclaimed](tools/cpuunclaimed.py): Sample CPU run queues and calculate unclaimed idle CPU. [Examples](tools/cpuunclaimed_example.txt)
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
......
.TH cpuunclaimed 8 "2016-12-21" "USER COMMANDS"
.SH NAME
cpuunclaimed \- Sample CPU run queues and calculate unclaimed idle CPU. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B cpuunclaimed
[\-T] [\-j] [\-J] [interval [count]]
.SH DESCRIPTION
This tool samples the length of the run queues and determine when there are idle
CPUs, yet queued threads waiting their turn. It reports the amount of idle
(yet unclaimed by waiting threads) CPU as a system-wide percentage.
This situation can happen for a number of reasons:
.IP -
An application has been bound to some, but not all, CPUs, and has runnable
threads that cannot migrate to other CPUs due to this configuration.
.IP -
CPU affinity: an optimization that leaves threads on CPUs where the CPU
caches are warm, even if this means short periods of waiting while other
CPUs are idle. The wait period is tunale (see sysctl, kernel.sched*).
.IP -
Scheduler bugs.
.P
An unclaimed idle of < 1% is likely to be CPU affinity, and not usually a
cause for concern. By leaving the CPU idle, overall throughput of the system
may be improved. This tool is best for identifying larger issues, > 2%, due
to the coarseness of its 99 Hertz samples.
This is an experimental tool that currently works by use of sampling to
keep overheads low. Tool assumptions:
.IP -
CPU samples consistently fire around the same offset. There will sometimes
be a lag as a sample is delayed by higher-priority interrupts, but it is
assumed the subsequent samples will catch up to the expected offsets (as
is seen in practice). You can use -J to inspect sample offsets. Some
systems can power down CPUs when idle, and when they wake up again they
may begin firing at a skewed offset: this tool will detect the skew, print
an error, and exit.
.IP -
All CPUs are online (see ncpu).
.P
If this identifies unclaimed CPU, you can double check it by dumping raw
samples (-j), as well as using other tracing tools to instrument scheduler
events (although this latter approach has much higher overhead).
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Sample and calculate unclaimed idle CPUs, output every 1 second (default:
#
.B cpuunclaimed
.TP
Print 5 second summaries, 10 times:
#
.B cpuunclaimed 5 10
.TP
Print 1 second summaries with timestamps:
#
.B cpuunclaimed \-T 1
.TP
Raw dump of all samples (verbose), as comma-separated values:
#
.B cpuunclaimed \-j
.SH FIELDS
.TP
%CPU
CPU utilization as a system-wide percentage.
.TP
unclaimed idle
Percentage of CPU resources that were idle when work was queued on other CPUs,
as a system-wide percentage.
.TP
TIME
Time (HH:MM:SS)
.TP
TIMESTAMP_ns
Timestamp, nanoseconds.
.TP
CPU#
CPU ID.
.TP
OFFSET_ns_CPU#
Time offset that a sample fired within a sample group for this CPU.
.SH OVERHEAD
The overhead is expected to be low/negligible as this tool uses sampling at
99 Hertz (on all CPUs), which has a fixed and low cost, rather than sampling
every scheduler event as many other approaches use (which can involve
instrumenting millions of events per second). Sampled CPUs, run queue lengths,
and timestamps are written to ring buffers that are periodically read by
user space for reporting. Measure overhead in a test environment.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
runqlen(8)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# cpuunclaimed Sample CPU run queues and calculate unclaimed idle CPU.
# For Linux, uses BCC, eBPF.
#
# This samples the length of the run queues and determine when there are idle
# CPUs, yet queued threads waiting their turn. Report the amount of idle
# (yet unclaimed by waiting threads) CPU as a system-wide percentage.
#
# This situation can happen for a number of reasons:
#
# - An application has been bound to some, but not all, CPUs, and has runnable
# threads that cannot migrate to other CPUs due to this configuration.
# - CPU affinity: an optimization that leaves threads on CPUs where the CPU
# caches are warm, even if this means short periods of waiting while other
# CPUs are idle. The wait period is tunale (see sysctl, kernel.sched*).
# - Scheduler bugs.
#
# An unclaimed idle of < 1% is likely to be CPU affinity, and not usually a
# cause for concern. By leaving the CPU idle, overall throughput of the system
# may be improved. This tool is best for identifying larger issues, > 2%, due
# to the coarseness of its 99 Hertz samples.
#
# This is an experimental tool that currently works by use of sampling to
# keep overheads low. Tool assumptions:
#
# - CPU samples consistently fire around the same offset. There will sometimes
# be a lag as a sample is delayed by higher-priority interrupts, but it is
# assumed the subsequent samples will catch up to the expected offsets (as
# is seen in practice). You can use -J to inspect sample offsets. Some
# systems can power down CPUs when idle, and when they wake up again they
# may begin firing at a skewed offset: this tool will detect the skew, print
# an error, and exit.
# - All CPUs are online (see ncpu).
#
# If this identifies unclaimed CPU, you can double check it by dumping raw
# samples (-j), as well as using other tracing tools to instrument scheduler
# events (although this latter approach has much higher overhead).
#
# This tool passes all sampled events to user space for post processing.
# I originally wrote this to do the calculations entirerly in kernel context,
# and only pass a summary. That involves a number of challenges, and the
# overhead savings may not outweigh the caveats. You can see my WIP here:
# https://gist.github.com/brendangregg/731cf2ce54bf1f9a19d4ccd397625ad9
#
# USAGE: cpuunclaimed [-h] [-j] [-J] [-T] [interval] [count]
#
# If you see "Lost 1881 samples" warnings, try increasing wakeup_hz.
#
# REQUIRES: Linux 4.9+ (BPF_PROG_TYPE_PERF_EVENT support). Under tools/old is
# a version of this tool that may work on Linux 4.6 - 4.8.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 20-Dec-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF, PerfType, PerfSWConfig
from time import sleep, strftime
from ctypes import c_int
import argparse
import multiprocessing
from os import getpid, system
import ctypes as ct
# arguments
examples = """examples:
./cpuunclaimed # sample and calculate unclaimed idle CPUs,
# output every 1 second (default)
./cpuunclaimed 5 10 # print 5 second summaries, 10 times
./cpuunclaimed -T 1 # 1s summaries and timestamps
./cpuunclaimed -j # raw dump of all samples (verbose), CSV
"""
parser = argparse.ArgumentParser(
description="Sample CPU run queues and calculate unclaimed idle CPU",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-j", "--csv", action="store_true",
help="print sample summaries (verbose) as comma-separated values")
parser.add_argument("-J", "--fullcsv", action="store_true",
help="print sample summaries with extra fields: CPU sample offsets")
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("interval", nargs="?", default=-1,
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
frequency = 99
dobind = 1
wakeup_hz = 10 # frequency to read buffers
wakeup_s = float(1) / wakeup_hz
ncpu = multiprocessing.cpu_count() # assume all are online
debug = 0
# process arguments
if args.fullcsv:
args.csv = True
if args.csv:
interval = 0.2
if args.interval != -1 and (args.fullcsv or args.csv):
print("ERROR: cannot use interval with either -j or -J. Exiting.")
exit()
if args.interval == -1:
args.interval = "1"
interval = float(args.interval)
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <uapi/linux/bpf_perf_event.h>
#include <linux/sched.h>
struct data_t {
u64 ts;
u64 cpu;
u64 len;
};
BPF_PERF_OUTPUT(events);
// Declare enough of cfs_rq to find nr_running, since we can't #import the
// header. This will need maintenance. It is from kernel/sched/sched.h:
struct cfs_rq_partial {
struct load_weight load;
unsigned int nr_running, h_nr_running;
};
int do_perf_event(struct bpf_perf_event_data *ctx)
{
int cpu = bpf_get_smp_processor_id();
u64 now = bpf_ktime_get_ns();
/*
* Fetch the run queue length from task->se.cfs_rq->nr_running. This is an
* unstable interface and may need maintenance. Perhaps a future version
* of BPF will support task_rq(p) or something similar as a more reliable
* interface.
*/
unsigned int len = 0;
struct task_struct *task = NULL;
struct cfs_rq_partial *my_q = NULL;
task = (struct task_struct *)bpf_get_current_task();
bpf_probe_read(&my_q, sizeof(my_q), &task->se.cfs_rq);
bpf_probe_read(&len, sizeof(len), &my_q->nr_running);
struct data_t data = {.ts = now, .cpu = cpu, .len = len};
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
"""
# code substitutions
if debug:
print(bpf_text)
# initialize BPF & perf_events
b = BPF(text=bpf_text)
# TODO: check for HW counters first and use if more accurate
b.attach_perf_event(ev_type=PerfType.SOFTWARE,
ev_config=PerfSWConfig.TASK_CLOCK, fn_name="do_perf_event",
sample_period=0, sample_freq=frequency)
if args.csv:
if args.timestamp:
print("TIME", end=",")
print("TIMESTAMP_ns", end=",")
print(",".join("CPU" + str(c) for c in xrange(ncpu)), end="")
if args.fullcsv:
print(",", end="")
print(",".join("OFFSET_ns_CPU" + str(c) for c in xrange(ncpu)), end="")
print()
else:
print(("Sampling run queues... Output every %s seconds. " +
"Hit Ctrl-C to end.") % args.interval)
class Data(ct.Structure):
_fields_ = [
("ts", ct.c_ulonglong),
("cpu", ct.c_ulonglong),
("len", ct.c_ulonglong)
]
samples = {}
group = {}
last = 0
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
samples[event.ts] = {}
samples[event.ts]['cpu'] = event.cpu
samples[event.ts]['len'] = event.len
exiting = 0 if args.interval else 1
slept = float(0)
# Choose the elapsed time from one sample group to the next that identifies a
# new sample group (a group being a set of samples from all CPUs). The
# earliest timestamp is compared in each group. This trigger is also used
# for sanity testing, if a group's samples exceed half this value.
trigger = int(0.8 * (1000000000 / frequency))
# read events
b["events"].open_perf_buffer(print_event)
while 1:
# allow some buffering by calling sleep(), to reduce the context switch
# rate and lower overhead.
try:
if not exiting:
sleep(wakeup_s)
except KeyboardInterrupt:
exiting = 1
b.kprobe_poll()
slept += wakeup_s
if slept < 0.999 * interval: # floating point workaround
continue
slept = 0
positive = 0 # number of samples where an idle CPU could have run work
running = 0
idle = 0
if debug >= 2:
print("DEBUG: begin samples loop, count %d" % len(samples))
for e in sorted(samples):
if debug >= 2:
print("DEBUG: ts %d cpu %d len %d delta %d trig %d" % (e,
samples[e]['cpu'], samples[e]['len'], e - last,
e - last > trigger))
# look for time jumps to identify a new sample group
if e - last > trigger:
# first first group timestamp, and sanity test
g_time = 0
g_max = 0
for ge in sorted(group):
if g_time == 0:
g_time = ge
g_max = ge
# process previous sample group
if args.csv:
lens = [0] * ncpu
offs = [0] * ncpu
for ge in sorted(group):
lens[samples[ge]['cpu']] = samples[ge]['len']
if args.fullcsv:
offs[samples[ge]['cpu']] = ge - g_time
if g_time > 0: # else first sample
if args.timestamp:
print("%-8s" % strftime("%H:%M:%S"), end=",")
print("%d" % g_time, end=",")
print(",".join(str(lens[c]) for c in xrange(ncpu)), end="")
if args.fullcsv:
print(",", end="")
print(",".join(str(offs[c]) for c in xrange(ncpu)))
else:
print()
else:
# calculate stats
g_running = 0
g_queued = 0
for ge in group:
if samples[ge]['len'] > 0:
g_running += 1
if samples[ge]['len'] > 1:
g_queued += samples[ge]['len'] - 1
g_idle = ncpu - g_running
# calculate the number of threads that could have run as the
# minimum of idle and queued
if g_idle > 0 and g_queued > 0:
if g_queued > g_idle:
i = g_idle
else:
i = g_queued
positive += i
running += g_running
idle += g_idle
# now sanity test, after -J output
g_range = g_max - g_time
if g_range > trigger / 2:
# if a sample group exceeds half the interval, we can no
# longer draw conclusions about some CPUs idle while others
# have queued work. Error and exit. This can happen when
# CPUs power down, then start again on different offsets.
# TODO: Since this is a sampling tool, an error margin should
# be anticipated, so an improvement may be to bump a counter
# instead of exiting, and only exit if this counter shows
# a skewed sample rate of over, say, 1%. Such an approach
# would allow a small rate of outliers (sampling error),
# and, we could tighten the trigger to be, say, trigger / 5.
# In the case of a power down, if it's detectable, perhaps
# the tool could reinitialize the timers (although exiting
# is simple and works).
print(("ERROR: CPU samples arrived at skewed offsets " +
"(CPUs may have powered down when idle), " +
"spanning %d ns (expected < %d ns). Debug with -J, " +
"and see the man page. As output may begin to be " +
"unreliable, exiting.") % (g_range, trigger / 2))
exit()
# these are done, remove
for ge in sorted(group):
del samples[ge]
# begin next group
group = {}
last = e
# stash this timestamp in a sample group dict
group[e] = 1
if not args.csv:
total = running + idle
unclaimed = util = 0
if debug:
print("DEBUG: hit %d running %d idle %d total %d buffered %d" % (
positive, running, idle, total, len(samples)))
if args.timestamp:
print("%-8s " % strftime("%H:%M:%S"), end="")
# output
if total:
unclaimed = float(positive) / total
util = float(running) / total
print("%%CPU %6.2f%%, unclaimed idle %0.2f%%" % (100 * util,
100 * unclaimed))
countdown -= 1
if exiting or countdown == 0:
exit()
Demonstrations of cpuunclaimed, the Linux eBPF/bcc version.
This tool samples the length of the CPU run queues and determine when there are
idle CPUs, yet queued threads waiting their turn. It reports the amount of idle
(yet unclaimed by waiting threads) CPU as a system-wide percentage. For
example:
# ./cpuunclaimed.py
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
%CPU 83.00%, unclaimed idle 0.12%
%CPU 87.25%, unclaimed idle 0.38%
%CPU 85.00%, unclaimed idle 0.25%
%CPU 85.00%, unclaimed idle 0.25%
%CPU 80.88%, unclaimed idle 0.00%
%CPU 82.25%, unclaimed idle 0.00%
%CPU 83.50%, unclaimed idle 0.12%
%CPU 81.50%, unclaimed idle 0.00%
%CPU 81.38%, unclaimed idle 0.00%
[...]
This shows a system running at over 80% CPU utilization, and with less than
0.5% unclaimed idle CPUs.
Unclaimed idle CPUs can happen for a number of reasons:
- An application has been bound to some, but not all, CPUs, and has runnable
threads that cannot migrate to other CPUs due to this configuration.
- CPU affinity: an optimization that leaves threads on CPUs where the CPU
caches are warm, even if this means short periods of waiting while other
CPUs are idle. The wait period is tunale (see sysctl, kernel.sched*).
- Scheduler bugs.
An unclaimed idle of < 1% is likely to be CPU affinity, and not usually a
cause for concern. By leaving the CPU idle, overall throughput of the system
may be improved. This tool is best for identifying larger issues, > 2%, due
to the coarseness of its 99 Hertz samples.
This is an 8 CPU system, with an 8 CPU-bound threaded application running that
has been bound to one CPU (via taskset):
# ./cpuunclaimed.py
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
%CPU 12.63%, unclaimed idle 86.36%
%CPU 12.50%, unclaimed idle 87.50%
%CPU 12.63%, unclaimed idle 87.37%
%CPU 12.75%, unclaimed idle 87.25%
%CPU 12.50%, unclaimed idle 87.50%
%CPU 12.63%, unclaimed idle 87.37%
%CPU 12.50%, unclaimed idle 87.50%
%CPU 12.50%, unclaimed idle 87.50%
[...]
It shows that 7 of the 8 CPUs (87.5%) are idle at the same time there are
queued threads waiting to run on CPU. This is an artifical situation caused
by binding threads to the same CPU, to demonstrate how the tool works.
This is an 8 CPU system running a Linux kernel build with "make -j8", and -T
to print timestamps:
# ./cpuunclaimed.py -T
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
22:25:55 %CPU 98.88%, unclaimed idle 0.12%
22:25:56 %CPU 99.75%, unclaimed idle 0.25%
22:25:57 %CPU 99.50%, unclaimed idle 0.50%
22:25:58 %CPU 99.25%, unclaimed idle 0.75%
22:25:59 %CPU 99.75%, unclaimed idle 0.25%
22:26:00 %CPU 99.50%, unclaimed idle 0.50%
22:26:01 %CPU 99.25%, unclaimed idle 0.75%
22:26:02 %CPU 99.25%, unclaimed idle 0.75%
22:26:03 %CPU 99.01%, unclaimed idle 0.87%
22:26:04 %CPU 99.88%, unclaimed idle 0.12%
22:26:05 %CPU 99.38%, unclaimed idle 0.62%
There's now a consistent, yet small, amount of unclaimed idle CPU. This is
expected to be deliberate: CPU affinity, as mentioned earlier.
The -j option will print raw samples: around one hundred lines of output
every second. For the same system with a Linux kernel build of "make -j8":
# ./cpuunclaimed.py -j
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,CPU4,CPU5,CPU6,CPU7
514606928954752,1,1,1,1,1,1,1,1
514606939054312,1,1,1,1,1,1,1,2
514606949156518,1,1,1,1,1,1,1,1
514606959256596,2,2,1,1,1,1,1,1
514606969357989,1,1,1,1,1,2,1,1
514606979459700,1,2,1,1,1,2,1,1
514606989560481,1,1,1,1,1,1,1,1
514606999661396,1,1,1,1,1,1,2,1
514607009795601,1,1,1,1,1,1,1,2
514607019862711,1,1,1,1,1,1,1,1
514607029963734,1,1,1,1,1,1,1,1
514607040062372,1,1,1,1,1,1,1,1
514607050197735,1,1,1,2,1,1,1,1
514607060266464,1,1,1,1,1,1,1,2
514607070368025,1,1,1,1,1,2,1,1
514607080468375,1,1,1,1,1,1,1,2
514607090570292,3,2,1,1,1,1,1,1
514607100670725,1,1,1,1,1,2,1,1
514607110771946,1,2,1,1,1,1,1,1
514607120873489,1,1,1,1,2,1,2,1
514607130973857,2,1,1,1,3,1,1,1
514607141080056,0,1,1,1,1,2,1,3
514607151176312,1,1,1,2,1,1,1,1
514607161277753,1,1,1,1,1,1,2,1
514607171379095,1,1,1,1,1,1,1,1
514607181479262,1,1,1,1,1,1,1,1
514607191580794,3,1,1,1,1,1,1,1
514607201680952,1,1,1,1,1,1,2,1
514607211783683,1,1,1,1,1,1,1,1
514607221883274,1,1,1,1,1,1,0,1
514607231984244,1,1,1,1,1,1,1,1
514607242085698,1,1,1,1,1,1,1,1
514607252216898,1,2,1,1,1,1,1,1
514607262289420,1,1,1,1,1,2,1,1
514607272389922,1,1,1,1,1,1,1,1
514607282489413,1,1,1,1,1,1,1,1
514607292589950,1,3,1,1,1,1,1,1
514607302693367,1,1,1,1,2,1,1,1
514607312793792,1,1,1,1,1,1,1,1
514607322895249,1,1,1,3,1,1,3,1
514607332994278,1,0,1,1,1,2,1,2
514607343095836,1,1,1,1,1,2,1,1
514607353196533,1,1,1,1,2,1,1,1
514607363297749,1,1,1,1,1,1,1,2
514607373399011,1,1,1,1,1,1,1,2
514607383499730,1,1,1,1,1,1,1,2
514607393601510,1,1,1,1,1,1,1,2
514607403704117,2,1,1,1,1,1,1,2
514607413802700,1,1,1,1,2,1,0,1
514607423904559,1,1,1,1,1,1,1,1
[...]
The output is verbose: printing out a timestamp, and then the length of each
CPU's run queue. The second last line, of timestamp 514607413802700, is an
example of what this tool detects: CPU 4 has a run queue length of 4, which
means one thread running and one thread queued, while CPU 6 has a run queue
length of 0: idle. The very next sample shows all CPUs busy.
The -J option prints raw samples with time offsets showing when the samples
were collected on each CPU. It's mostly useful for debugging the tool itself.
For example, during a Linux kernel build:
# ./cpuunclaimed.py -J
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,CPU4,CPU5,CPU6,CPU7,OFFSET_ns_CPU0,OFFSET_ns_CPU1,OFFSET_ns_CPU2,OFFSET_ns_CPU3,OFFSET_ns_CPU4,OFFSET_ns_CPU5,OFFSET_ns_CPU6,OFFSET_ns_CPU7
514722625198188,1,1,1,1,1,1,1,2,0,28321,51655,73396,89654,111172,132803,159792
514722635299034,1,1,1,1,1,2,1,1,0,28809,51999,74183,89552,110011,131995,153519
514722645400274,1,1,1,1,1,1,1,2,0,28024,51333,73652,88964,110075,131973,153568
514722655501816,1,2,1,1,1,1,1,1,0,28893,51671,75233,89496,109430,131945,153694
514722665602594,1,1,2,1,1,2,1,1,0,28623,50988,73866,89383,109186,131786,154555
514722675703498,1,1,1,1,1,1,1,1,0,27379,51031,73175,89625,110380,131482,104811
514722685804942,1,1,1,1,1,2,1,1,0,27213,50501,72183,88797,108780,130659,152153
514722695906294,1,1,1,1,1,1,1,1,0,27036,51182,73420,87861,109585,130364,155089
514722706005778,1,1,1,1,1,1,1,1,0,28492,51851,74138,89744,110208,132462,154060
514722716060705,1,1,1,1,1,1,1,1,0,154499,152528,155232,155046,154502,178746,200001
514722726209615,1,1,1,1,1,1,1,1,0,28170,49580,72605,87741,108144,130723,152138
514722736309475,1,2,1,1,1,1,1,1,0,27421,51386,73061,89358,109457,131273,153005
514722746410845,1,2,1,1,1,2,1,1,0,27788,50840,72720,88920,109111,131143,152979
514722756511363,1,1,1,1,1,1,2,1,0,28280,50977,73559,89848,109659,131579,152693
514722766613044,1,1,1,1,1,1,1,1,0,28046,50812,72754,89160,110108,130735,152948
514722776712932,1,1,1,2,1,1,1,1,0,28586,51177,73974,89588,109947,132376,154162
514722786815477,1,1,1,1,1,1,1,1,0,27973,71104,72539,88302,108896,130414,152236
514722796914955,1,1,1,1,1,1,1,1,0,29054,52354,74214,89592,110615,132586,153925
514722807044060,1,1,1,1,1,1,1,1,1587130,0,24079,46633,61787,82325,104706,125278
514722817117432,2,1,2,1,1,1,1,1,0,27628,51038,75138,89724,109340,132426,155348
514722827218254,1,1,1,1,1,1,2,1,0,29111,51868,74347,88904,109911,132764,153851
514722837340158,1,1,1,1,1,1,1,1,0,7366,30760,53528,68622,89317,111095,132319
514722847421305,1,1,1,1,1,1,1,1,0,28257,51105,73841,89037,110820,131605,153368
514722857521112,1,1,1,1,1,1,1,1,0,28544,51441,73857,89530,110497,131915,153513
514722867626129,0,2,1,1,1,1,1,1,0,24621,47917,70568,85391,106670,128081,150329
514722877727183,2,1,1,1,1,1,1,1,0,24869,47630,71547,84761,106048,128444,149285
514722887824589,1,1,1,1,1,1,2,1,0,28793,51212,73863,89584,109773,132348,153194
514722897925481,1,1,1,1,1,1,2,1,0,29278,51163,73961,89774,109592,132029,153715
514722908026097,1,1,1,1,1,1,1,1,0,30630,35595,36210,188001,190815,190072,190732
514722918127439,1,1,1,1,1,1,1,1,0,28544,51885,73948,89987,109763,132632,154083
514722928227399,1,1,1,1,1,1,1,1,0,31882,51574,74769,89939,110578,132951,154356
514722938329471,1,1,1,1,1,1,1,1,0,28498,51304,74101,89670,110278,132653,153176
514722948430589,1,1,1,1,1,1,1,1,0,27868,50925,73477,89676,109583,132360,153014
514722958531802,1,1,1,1,1,1,1,1,0,28505,50886,73729,89919,109618,131988,152896
514722968632181,1,1,1,1,1,1,1,1,0,28492,51749,73977,90334,109816,132897,152890
514722978733584,1,1,1,1,1,1,1,1,0,28847,50957,74121,90014,110019,132377,152978
514722988834321,1,1,1,1,1,1,1,1,0,28601,51437,74021,89968,110252,132233,153623
514722998937170,1,1,2,1,1,1,1,1,0,27007,50044,73259,87725,108663,132194,152459
514723009036821,1,2,1,2,1,1,1,1,0,28226,50937,73983,89110,110476,131740,153663
514723019137577,1,1,1,1,1,1,1,1,0,30261,52357,75657,87803,61823,131850,153585
514723029238745,1,1,1,1,1,1,1,1,0,28030,50752,74452,89240,110791,132187,153327
514723039339069,1,1,1,1,1,1,1,1,0,29791,52636,75996,90475,110414,132232,154714
514723049439822,1,1,1,1,2,1,1,1,0,29133,56662,74153,89520,110683,132740,154708
514723059541617,1,1,1,1,1,1,1,1,0,27932,51480,74644,89656,109176,131499,153732
514723069642500,1,1,2,1,1,1,2,1,0,27678,51509,73984,90136,110124,131554,153459
514723079743525,2,1,1,1,1,1,1,1,0,28029,51424,74394,90056,110087,132383,152963
514723089844091,2,1,1,2,1,1,1,1,0,28944,51692,74440,90339,110402,132722,154083
514723099945957,1,1,2,1,1,1,1,1,0,28425,51267,73164,89322,115048,114630,115187
514723110047020,1,1,2,0,1,1,1,2,0,28192,50811,76814,89835,109370,131265,153511
514723120216662,1,1,2,1,1,2,1,1,29,34,0,4514,19268,40293,62674,84009
[...]
This is a Xen guest system, and it shows that CPU 0 usually completes first (an
offset of 0), followed by CPU 1 around 28000 nanoseconds later, and so on.
The spread of offsets is triggered by the bcc Python library that initializes
the timers, which steps through the CPUs in sequence, with a small delay
between them merely from executing its own loop code.
Here's more output during a Linux kernel build:
# ./cpuunclaimed.py -J
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,CPU4,CPU5,CPU6,CPU7,OFFSET_ns_CPU0,OFFSET_ns_CPU1,OFFSET_ns_CPU2,OFFSET_ns_CPU3,OFFSET_ns_CPU4,OFFSET_ns_CPU5,OFFSET_ns_CPU6,OFFSET_ns_CPU7
514722625198188,1,1,1,1,1,1,1,2,0,28321,51655,73396,89654,111172,132803,159792
515700745758947,2,1,1,1,1,1,1,1,0,19835,34891,49397,59364,71988,87571,102769
515700755860451,2,1,1,1,1,1,1,2,0,19946,34323,49855,59844,72741,87925,102891
515700765960560,1,1,1,1,1,1,1,1,0,20805,35339,50436,59677,73557,88661,104796
515700776061744,1,1,1,1,1,1,1,1,1626,77,0,190,153452,154665,178218,154116
515700786162017,1,1,1,1,1,1,1,1,0,20497,35361,51552,59787,74451,147789,104545
515700796262811,1,1,1,1,1,1,1,2,0,20910,35657,50805,60175,73953,88492,103527
515700806364951,1,1,1,1,1,1,1,1,0,20140,35023,50074,59726,72757,88040,102421
515700816465253,1,1,1,1,1,1,2,1,0,20952,34899,50262,60048,72890,88067,103545
515700826566573,1,1,1,1,1,1,1,1,0,20898,35490,50609,59805,74060,88550,103354
515700836667480,1,1,1,1,1,1,2,1,0,20548,34760,50959,59490,73059,87820,103006
515700846768182,1,1,1,1,1,1,2,1,0,20571,35113,50777,59962,74139,88543,103192
515700856869468,1,1,2,1,1,2,2,1,0,20932,35382,50510,60106,73739,91818,103684
515700866971905,1,1,1,2,1,1,1,1,0,19780,33018,49075,58375,71949,86537,102136
515700877073459,2,1,1,1,1,1,1,1,0,20065,73966,48989,58832,71408,85714,101067
515700887172772,1,1,1,1,1,1,1,1,0,20909,34608,51493,59890,73564,88668,103454
515700897273292,1,2,1,1,1,1,1,1,0,20353,35292,50114,59773,73948,88615,103383
515700907374341,1,1,2,1,1,1,1,1,0,20816,35206,50915,60062,73878,88857,103794
515700917475331,1,1,6,1,1,2,1,1,0,20752,34931,50280,59764,73781,88329,103234
515700927576958,1,1,1,1,1,1,1,1,0,19929,34703,50181,59364,73004,88053,103127
515700937677298,1,1,2,2,1,1,1,1,0,21178,34724,50740,61193,73452,89030,103390
515700947778409,2,1,1,1,1,1,1,1,0,21059,35604,50853,60098,73919,88675,103506
515700957879196,2,1,1,1,1,1,1,1,0,21326,35939,51492,60083,74249,89474,103761
[...]
Notice the tighter range of offsets? I began executing cpuunclaimed when the
system was idle, and it initialized the CPU timers more quickly, and then I
began the Linux kernel build.
Here's some different output, this time from a physical system with 4 CPUs,
also doing a kernel build,
# ./cpuunclaimed.py -J
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,OFFSET_ns_CPU0,OFFSET_ns_CPU1,OFFSET_ns_CPU2,OFFSET_ns_CPU3
4429382557480,1,1,1,1,0,6011,10895,16018
4429392655042,2,1,1,1,0,8217,13661,19378
4429402757604,1,1,1,1,0,6879,12433,18000
4429412857809,1,1,1,1,0,8303,13190,18719
4429422960709,2,1,1,1,0,6095,11234,17079
4429433060391,1,1,1,2,0,6747,12480,18070
4429443161699,1,1,1,1,0,6560,12264,17945
4429453262002,1,2,1,1,0,6992,12644,18341
4429463363706,1,2,1,1,0,6211,12071,17853
4429473465571,1,1,1,1,0,5766,11495,17638
4429483566920,1,1,1,1,0,5223,11736,16358
4429493666279,1,1,1,1,0,6964,12653,18410
4429503769113,1,1,1,1,0,5161,11399,16612
4429513870744,1,1,1,1,0,5943,10583,15768
4429523969826,1,1,1,1,0,6533,12336,18189
4429534070311,1,1,1,1,0,6834,12816,18488
4429544170456,1,1,1,1,0,7284,13401,19129
4429554274467,1,2,1,1,0,5941,11160,16594
4429564372365,1,2,1,1,0,7514,13618,19190
4429574474406,1,2,1,1,0,6687,12650,18248
4429584574220,1,2,1,1,0,7912,13705,19136
[...]
If the offset range becomes too great, we can no longer conclude about when
some CPUs were idle and others had queued work. The tool will detect this,
and print an error message and exit.
Some systems can power down CPUs when idle, and when they wake up again the
timed samples may resume from different offsets. If this happens, this tool
can no longer draw conclusions about when some CPUs were idle and others
had queued work, so it prints an error, and exits. Eg:
# ./cpuunclaimed.py 1
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
%CPU 0.25%, unclaimed idle 0.00%
%CPU 0.75%, unclaimed idle 0.00%
%CPU 0.25%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.12%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.25%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.12%, unclaimed idle 0.00%
%CPU 0.13%, unclaimed idle 0.00%
%CPU 0.12%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
ERROR: CPU samples arrived at skewed offsets (CPUs may have powered down when idle), spanning 4328176 ns (expected < 4040404 ns). Debug with -J, and see the man page. As output may begin to be unreliable, exiting.
It's expected that this will only really occur on idle systems.
USAGE:
# ./cpuunclaimed.py -h
usage: cpuunclaimed.py [-h] [-j] [-J] [-T] [interval] [count]
Sample CPU run queues and calculate unclaimed idle CPU
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-j, --csv print sample summaries (verbose) as comma-separated values
-J, --fullcsv print sample summaries with extra fields: CPU sample
offsets
-T, --timestamp include timestamp on output
examples:
./cpuunclaimed # sample and calculate unclaimed idle CPUs,
# output every 1 second (default)
./cpuunclaimed 5 10 # print 5 second summaries, 10 times
./cpuunclaimed -T 1 # 1s summaries and timestamps
./cpuunclaimed -j # raw dump of all samples (verbose), CSV
......@@ -84,8 +84,8 @@ int do_perf_event()
bpf_probe_read(&my_q, sizeof(my_q), &task->se.cfs_rq);
bpf_probe_read(&len, sizeof(len), &my_q->nr_running);
// Decrement idle thread by dropping the run queue by one. We could do
// this other ways if needed, like matching on task->pid.
// Calculate run queue length by subtracting the currently running task,
// if present. len 0 == idle, len 1 == one running task.
if (len > 0)
len--;
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment