Commit 3f3acd80 authored by Brendan Gregg

add cpuunclaimed

parent dd5799bc
@@ -86,6 +86,7 @@ Examples:
- tools/[cachestat](tools/cachestat.py): Trace page cache hit/miss ratio. [Examples](tools/cachestat_example.txt).
- tools/[cachetop](tools/cachetop.py): Trace page cache hit/miss ratio by processes. [Examples](tools/cachetop_example.txt).
- tools/[cpudist](tools/cpudist.py): Summarize on- and off-CPU time per task as a histogram. [Examples](tools/cpudist_example.txt)
- tools/[cpuunclaimed](tools/cpuunclaimed.py): Sample CPU run queues and calculate unclaimed idle CPU. [Examples](tools/cpuunclaimed_example.txt)
- tools/[dcsnoop](tools/dcsnoop.py): Trace directory entry cache (dcache) lookups. [Examples](tools/dcsnoop_example.txt).
- tools/[dcstat](tools/dcstat.py): Directory entry cache (dcache) stats. [Examples](tools/dcstat_example.txt).
- tools/[execsnoop](tools/execsnoop.py): Trace new processes via exec() syscalls. [Examples](tools/execsnoop_example.txt).
.TH cpuunclaimed 8 "2016-12-21" "USER COMMANDS"
.SH NAME
cpuunclaimed \- Sample CPU run queues and calculate unclaimed idle CPU. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B cpuunclaimed
[\-T] [\-j] [\-J] [interval [count]]
.SH DESCRIPTION
This tool samples the length of the run queues and determines when there are
idle CPUs, yet queued threads waiting their turn. It reports the amount of
idle (yet unclaimed by waiting threads) CPU as a system-wide percentage.
This situation can happen for a number of reasons:
.IP -
An application has been bound to some, but not all, CPUs, and has runnable
threads that cannot migrate to other CPUs due to this configuration.
.IP -
CPU affinity: an optimization that leaves threads on CPUs where the CPU
caches are warm, even if this means short periods of waiting while other
CPUs are idle. The wait period is tunable (see sysctl, kernel.sched*).
.IP -
Scheduler bugs.
.P
An unclaimed idle of < 1% is likely to be CPU affinity, and not usually a
cause for concern. By leaving the CPU idle, overall throughput of the system
may be improved. This tool is best for identifying larger issues, > 2%, due
to the coarseness of its 99 Hertz samples.
.P
This is an experimental tool that currently works by use of sampling to
keep overheads low. Tool assumptions:
.IP -
CPU samples consistently fire around the same offset. There will sometimes
be a lag as a sample is delayed by higher-priority interrupts, but it is
assumed the subsequent samples will catch up to the expected offsets (as
is seen in practice). You can use -J to inspect sample offsets. Some
systems can power down CPUs when idle, and when they wake up again they
may begin firing at a skewed offset: this tool will detect the skew, print
an error, and exit.
.IP -
All CPUs are online (see ncpu).
.P
If this identifies unclaimed CPU, you can double check it by dumping raw
samples (-j), as well as using other tracing tools to instrument scheduler
events (although this latter approach has much higher overhead).
.P
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH EXAMPLES
.TP
Sample and calculate unclaimed idle CPUs, output every 1 second (default):
#
.B cpuunclaimed
.TP
Print 5 second summaries, 10 times:
#
.B cpuunclaimed 5 10
.TP
Print 1 second summaries with timestamps:
#
.B cpuunclaimed \-T 1
.TP
Raw dump of all samples (verbose), as comma-separated values:
#
.B cpuunclaimed \-j
.SH FIELDS
.TP
%CPU
CPU utilization as a system-wide percentage.
.TP
unclaimed idle
Percentage of CPU resources that were idle when work was queued on other CPUs,
as a system-wide percentage.
.TP
TIME
Time (HH:MM:SS).
.TP
TIMESTAMP_ns
Timestamp, nanoseconds.
.TP
CPU#
CPU ID.
.TP
OFFSET_ns_CPU#
Time offset that a sample fired within a sample group for this CPU.
.SH OVERHEAD
The overhead is expected to be low/negligible as this tool uses sampling at
99 Hertz (on all CPUs), which has a fixed and low cost, rather than sampling
every scheduler event as many other approaches use (which can involve
instrumenting millions of events per second). Sampled CPUs, run queue lengths,
and timestamps are written to ring buffers that are periodically read by
user space for reporting. Measure overhead in a test environment.
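.P
As a rough worked estimate of the data rate (assuming 8 CPUs): sampling at
99 Hertz on every CPU produces about 8 x 99 = 792 samples per second, and
each sample writes a 24-byte record (three 64-bit values: timestamp, CPU,
run queue length) to the ring buffers, on the order of 19 Kbytes per second
for user space to process.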
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
runqlen(8)
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# cpuunclaimed Sample CPU run queues and calculate unclaimed idle CPU.
# For Linux, uses BCC, eBPF.
#
# This samples the length of the run queues and determines when there are idle
# CPUs, yet queued threads waiting their turn. It reports the amount of idle
# (yet unclaimed by waiting threads) CPU as a system-wide percentage.
#
# This situation can happen for a number of reasons:
#
# - An application has been bound to some, but not all, CPUs, and has runnable
# threads that cannot migrate to other CPUs due to this configuration.
# - CPU affinity: an optimization that leaves threads on CPUs where the CPU
# caches are warm, even if this means short periods of waiting while other
# CPUs are idle. The wait period is tunable (see sysctl, kernel.sched*).
# - Scheduler bugs.
#
# An unclaimed idle of < 1% is likely to be CPU affinity, and not usually a
# cause for concern. By leaving the CPU idle, overall throughput of the system
# may be improved. This tool is best for identifying larger issues, > 2%, due
# to the coarseness of its 99 Hertz samples.
#
# This is an experimental tool that currently works by use of sampling to
# keep overheads low. Tool assumptions:
#
# - CPU samples consistently fire around the same offset. There will sometimes
# be a lag as a sample is delayed by higher-priority interrupts, but it is
# assumed the subsequent samples will catch up to the expected offsets (as
# is seen in practice). You can use -J to inspect sample offsets. Some
# systems can power down CPUs when idle, and when they wake up again they
# may begin firing at a skewed offset: this tool will detect the skew, print
# an error, and exit.
# - All CPUs are online (see ncpu).
#
# If this identifies unclaimed CPU, you can double check it by dumping raw
# samples (-j), as well as using other tracing tools to instrument scheduler
# events (although this latter approach has much higher overhead).
#
# This tool passes all sampled events to user space for post processing.
# I originally wrote this to do the calculations entirely in kernel context,
# and only pass a summary. That involves a number of challenges, and the
# overhead savings may not outweigh the caveats. You can see my WIP here:
# https://gist.github.com/brendangregg/731cf2ce54bf1f9a19d4ccd397625ad9
#
# USAGE: cpuunclaimed [-h] [-j] [-J] [-T] [interval] [count]
#
# If you see "Lost 1881 samples" warnings, try increasing wakeup_hz.
#
# REQUIRES: Linux 4.9+ (BPF_PROG_TYPE_PERF_EVENT support). Under tools/old is
# a version of this tool that may work on Linux 4.6 - 4.8.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 20-Dec-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF, PerfType, PerfSWConfig
from time import sleep, strftime
from ctypes import c_int
import argparse
import multiprocessing
from os import getpid, system
import ctypes as ct
# arguments
examples = """examples:
./cpuunclaimed # sample and calculate unclaimed idle CPUs,
# output every 1 second (default)
./cpuunclaimed 5 10 # print 5 second summaries, 10 times
./cpuunclaimed -T 1 # 1s summaries and timestamps
./cpuunclaimed -j # raw dump of all samples (verbose), CSV
"""
parser = argparse.ArgumentParser(
description="Sample CPU run queues and calculate unclaimed idle CPU",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-j", "--csv", action="store_true",
help="print sample summaries (verbose) as comma-separated values")
parser.add_argument("-J", "--fullcsv", action="store_true",
help="print sample summaries with extra fields: CPU sample offsets")
parser.add_argument("-T", "--timestamp", action="store_true",
help="include timestamp on output")
parser.add_argument("interval", nargs="?", default=-1,
help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
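# sampling at 99 Hertz (rather than 100) is intended to avoid sampling in
# lockstep with the kernel's timer tick, which could bias run queue samples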
frequency = 99
dobind = 1
wakeup_hz = 10 # frequency to read buffers
wakeup_s = float(1) / wakeup_hz
ncpu = multiprocessing.cpu_count() # assume all are online
debug = 0
# process arguments
if args.fullcsv:
args.csv = True
if args.csv:
interval = 0.2
if args.interval != -1 and (args.fullcsv or args.csv):
print("ERROR: cannot use interval with either -j or -J. Exiting.")
exit()
if args.interval == -1:
args.interval = "1"
interval = float(args.interval)
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <uapi/linux/bpf_perf_event.h>
#include <linux/sched.h>
struct data_t {
u64 ts;
u64 cpu;
u64 len;
};
BPF_PERF_OUTPUT(events);
// Declare enough of cfs_rq to find nr_running, since we can't #import the
// header. This will need maintenance. It is from kernel/sched/sched.h:
struct cfs_rq_partial {
struct load_weight load;
unsigned int nr_running, h_nr_running;
};
int do_perf_event(struct bpf_perf_event_data *ctx)
{
int cpu = bpf_get_smp_processor_id();
u64 now = bpf_ktime_get_ns();
/*
* Fetch the run queue length from task->se.cfs_rq->nr_running. This is an
* unstable interface and may need maintenance. Perhaps a future version
* of BPF will support task_rq(p) or something similar as a more reliable
* interface.
*/
unsigned int len = 0;
struct task_struct *task = NULL;
struct cfs_rq_partial *my_q = NULL;
task = (struct task_struct *)bpf_get_current_task();
bpf_probe_read(&my_q, sizeof(my_q), &task->se.cfs_rq);
bpf_probe_read(&len, sizeof(len), &my_q->nr_running);
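    /*
     * len counts runnable tasks on this CPU's CFS run queue, including the
     * currently running task: len == 1 means running with nothing queued,
     * len > 1 means threads are queued waiting, and len == 0 means idle.
     */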
struct data_t data = {.ts = now, .cpu = cpu, .len = len};
events.perf_submit(ctx, &data, sizeof(data));
return 0;
}
"""
# code substitutions
if debug:
print(bpf_text)
# initialize BPF & perf_events
b = BPF(text=bpf_text)
# TODO: check for HW counters first and use if more accurate
b.attach_perf_event(ev_type=PerfType.SOFTWARE,
ev_config=PerfSWConfig.TASK_CLOCK, fn_name="do_perf_event",
sample_period=0, sample_freq=frequency)
if args.csv:
if args.timestamp:
print("TIME", end=",")
print("TIMESTAMP_ns", end=",")
print(",".join("CPU" + str(c) for c in xrange(ncpu)), end="")
if args.fullcsv:
print(",", end="")
print(",".join("OFFSET_ns_CPU" + str(c) for c in xrange(ncpu)), end="")
print()
else:
print(("Sampling run queues... Output every %s seconds. " +
"Hit Ctrl-C to end.") % args.interval)
class Data(ct.Structure):
_fields_ = [
("ts", ct.c_ulonglong),
("cpu", ct.c_ulonglong),
("len", ct.c_ulonglong)
]
samples = {}
group = {}
last = 0
# process event
def print_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data)).contents
samples[event.ts] = {}
samples[event.ts]['cpu'] = event.cpu
samples[event.ts]['len'] = event.len
exiting = 0 if args.interval else 1
slept = float(0)
# Choose the elapsed time from one sample group to the next that identifies a
# new sample group (a group being a set of samples from all CPUs). The
# earliest timestamp is compared in each group. This trigger is also used
# for sanity testing, if a group's samples exceed half this value.
trigger = int(0.8 * (1000000000 / frequency))
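# As a worked check of these constants: at 99 Hertz the expected gap between
# sample groups is 1e9 / 99 =~ 10101010 ns, so trigger =~ 8080808 ns. A gap
# larger than trigger starts a new sample group, and a group whose samples
# span more than trigger / 2 =~ 4040404 ns causes the skew error below.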
# read events
b["events"].open_perf_buffer(print_event)
while 1:
# allow some buffering by calling sleep(), to reduce the context switch
# rate and lower overhead.
try:
if not exiting:
sleep(wakeup_s)
except KeyboardInterrupt:
exiting = 1
b.kprobe_poll()
slept += wakeup_s
if slept < 0.999 * interval: # floating point workaround
continue
slept = 0
positive = 0 # number of samples where an idle CPU could have run work
running = 0
idle = 0
if debug >= 2:
print("DEBUG: begin samples loop, count %d" % len(samples))
for e in sorted(samples):
if debug >= 2:
print("DEBUG: ts %d cpu %d len %d delta %d trig %d" % (e,
samples[e]['cpu'], samples[e]['len'], e - last,
e - last > trigger))
# look for time jumps to identify a new sample group
if e - last > trigger:
            # find first group timestamp, and sanity test
g_time = 0
g_max = 0
for ge in sorted(group):
if g_time == 0:
g_time = ge
g_max = ge
# process previous sample group
if args.csv:
lens = [0] * ncpu
offs = [0] * ncpu
for ge in sorted(group):
lens[samples[ge]['cpu']] = samples[ge]['len']
if args.fullcsv:
offs[samples[ge]['cpu']] = ge - g_time
if g_time > 0: # else first sample
if args.timestamp:
print("%-8s" % strftime("%H:%M:%S"), end=",")
print("%d" % g_time, end=",")
print(",".join(str(lens[c]) for c in xrange(ncpu)), end="")
if args.fullcsv:
print(",", end="")
print(",".join(str(offs[c]) for c in xrange(ncpu)))
else:
print()
else:
# calculate stats
g_running = 0
g_queued = 0
for ge in group:
if samples[ge]['len'] > 0:
g_running += 1
if samples[ge]['len'] > 1:
g_queued += samples[ge]['len'] - 1
g_idle = ncpu - g_running
# calculate the number of threads that could have run as the
# minimum of idle and queued
if g_idle > 0 and g_queued > 0:
if g_queued > g_idle:
i = g_idle
else:
i = g_queued
positive += i
running += g_running
idle += g_idle
# now sanity test, after -J output
g_range = g_max - g_time
if g_range > trigger / 2:
# if a sample group exceeds half the interval, we can no
# longer draw conclusions about some CPUs idle while others
# have queued work. Error and exit. This can happen when
# CPUs power down, then start again on different offsets.
# TODO: Since this is a sampling tool, an error margin should
# be anticipated, so an improvement may be to bump a counter
# instead of exiting, and only exit if this counter shows
# a skewed sample rate of over, say, 1%. Such an approach
# would allow a small rate of outliers (sampling error),
# and, we could tighten the trigger to be, say, trigger / 5.
# In the case of a power down, if it's detectable, perhaps
# the tool could reinitialize the timers (although exiting
# is simple and works).
print(("ERROR: CPU samples arrived at skewed offsets " +
"(CPUs may have powered down when idle), " +
"spanning %d ns (expected < %d ns). Debug with -J, " +
"and see the man page. As output may begin to be " +
"unreliable, exiting.") % (g_range, trigger / 2))
exit()
# these are done, remove
for ge in sorted(group):
del samples[ge]
# begin next group
group = {}
last = e
# stash this timestamp in a sample group dict
group[e] = 1
if not args.csv:
total = running + idle
unclaimed = util = 0
if debug:
print("DEBUG: hit %d running %d idle %d total %d buffered %d" % (
positive, running, idle, total, len(samples)))
if args.timestamp:
print("%-8s " % strftime("%H:%M:%S"), end="")
# output
if total:
unclaimed = float(positive) / total
util = float(running) / total
print("%%CPU %6.2f%%, unclaimed idle %0.2f%%" % (100 * util,
100 * unclaimed))
countdown -= 1
if exiting or countdown == 0:
exit()
Demonstrations of cpuunclaimed, the Linux eBPF/bcc version.
This tool samples the length of the CPU run queues and determines when there
are idle CPUs, yet queued threads waiting their turn. It reports the amount of
(yet unclaimed by waiting threads) CPU as a system-wide percentage. For
example:
# ./cpuunclaimed.py
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
%CPU 83.00%, unclaimed idle 0.12%
%CPU 87.25%, unclaimed idle 0.38%
%CPU 85.00%, unclaimed idle 0.25%
%CPU 85.00%, unclaimed idle 0.25%
%CPU 80.88%, unclaimed idle 0.00%
%CPU 82.25%, unclaimed idle 0.00%
%CPU 83.50%, unclaimed idle 0.12%
%CPU 81.50%, unclaimed idle 0.00%
%CPU 81.38%, unclaimed idle 0.00%
[...]
This shows a system running at over 80% CPU utilization, and with less than
0.5% unclaimed idle CPUs.
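As a rough sanity check of these numbers (this arithmetic is inferred from the
tool's sampling design, not printed by the tool): on an 8 CPU system, as used
in the later examples, each one-second summary is based on about 8 x 99 = 792
run queue samples, so a single sample where an idle CPU could have run a
queued thread contributes about 1 / 792 =~ 0.13% of unclaimed idle. That is
why the percentages above move in steps of roughly 0.12%.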
Unclaimed idle CPUs can happen for a number of reasons:
- An application has been bound to some, but not all, CPUs, and has runnable
threads that cannot migrate to other CPUs due to this configuration.
- CPU affinity: an optimization that leaves threads on CPUs where the CPU
caches are warm, even if this means short periods of waiting while other
CPUs are idle. The wait period is tunable (see sysctl, kernel.sched*).
- Scheduler bugs.
An unclaimed idle of < 1% is likely to be CPU affinity, and not usually a
cause for concern. By leaving the CPU idle, overall throughput of the system
may be improved. This tool is best for identifying larger issues, > 2%, due
to the coarseness of its 99 Hertz samples.
This is an 8 CPU system running a CPU-bound multi-threaded application that
has been bound to one CPU (via taskset):
# ./cpuunclaimed.py
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
%CPU 12.63%, unclaimed idle 86.36%
%CPU 12.50%, unclaimed idle 87.50%
%CPU 12.63%, unclaimed idle 87.37%
%CPU 12.75%, unclaimed idle 87.25%
%CPU 12.50%, unclaimed idle 87.50%
%CPU 12.63%, unclaimed idle 87.37%
%CPU 12.50%, unclaimed idle 87.50%
%CPU 12.50%, unclaimed idle 87.50%
[...]
It shows that 7 of the 8 CPUs (87.5%) are idle while there are queued threads
waiting to run on CPU. This is an artificial situation caused by binding
threads to the same CPU, to demonstrate how the tool works.
This is an 8 CPU system running a Linux kernel build with "make -j8", and -T
to print timestamps:
# ./cpuunclaimed.py -T
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
22:25:55 %CPU 98.88%, unclaimed idle 0.12%
22:25:56 %CPU 99.75%, unclaimed idle 0.25%
22:25:57 %CPU 99.50%, unclaimed idle 0.50%
22:25:58 %CPU 99.25%, unclaimed idle 0.75%
22:25:59 %CPU 99.75%, unclaimed idle 0.25%
22:26:00 %CPU 99.50%, unclaimed idle 0.50%
22:26:01 %CPU 99.25%, unclaimed idle 0.75%
22:26:02 %CPU 99.25%, unclaimed idle 0.75%
22:26:03 %CPU 99.01%, unclaimed idle 0.87%
22:26:04 %CPU 99.88%, unclaimed idle 0.12%
22:26:05 %CPU 99.38%, unclaimed idle 0.62%
There's now a consistent, yet small, amount of unclaimed idle CPU. This is
expected to be deliberate: CPU affinity, as mentioned earlier.
The -j option will print raw samples: around one hundred lines of output
every second. For the same system with a Linux kernel build of "make -j8":
# ./cpuunclaimed.py -j
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,CPU4,CPU5,CPU6,CPU7
514606928954752,1,1,1,1,1,1,1,1
514606939054312,1,1,1,1,1,1,1,2
514606949156518,1,1,1,1,1,1,1,1
514606959256596,2,2,1,1,1,1,1,1
514606969357989,1,1,1,1,1,2,1,1
514606979459700,1,2,1,1,1,2,1,1
514606989560481,1,1,1,1,1,1,1,1
514606999661396,1,1,1,1,1,1,2,1
514607009795601,1,1,1,1,1,1,1,2
514607019862711,1,1,1,1,1,1,1,1
514607029963734,1,1,1,1,1,1,1,1
514607040062372,1,1,1,1,1,1,1,1
514607050197735,1,1,1,2,1,1,1,1
514607060266464,1,1,1,1,1,1,1,2
514607070368025,1,1,1,1,1,2,1,1
514607080468375,1,1,1,1,1,1,1,2
514607090570292,3,2,1,1,1,1,1,1
514607100670725,1,1,1,1,1,2,1,1
514607110771946,1,2,1,1,1,1,1,1
514607120873489,1,1,1,1,2,1,2,1
514607130973857,2,1,1,1,3,1,1,1
514607141080056,0,1,1,1,1,2,1,3
514607151176312,1,1,1,2,1,1,1,1
514607161277753,1,1,1,1,1,1,2,1
514607171379095,1,1,1,1,1,1,1,1
514607181479262,1,1,1,1,1,1,1,1
514607191580794,3,1,1,1,1,1,1,1
514607201680952,1,1,1,1,1,1,2,1
514607211783683,1,1,1,1,1,1,1,1
514607221883274,1,1,1,1,1,1,0,1
514607231984244,1,1,1,1,1,1,1,1
514607242085698,1,1,1,1,1,1,1,1
514607252216898,1,2,1,1,1,1,1,1
514607262289420,1,1,1,1,1,2,1,1
514607272389922,1,1,1,1,1,1,1,1
514607282489413,1,1,1,1,1,1,1,1
514607292589950,1,3,1,1,1,1,1,1
514607302693367,1,1,1,1,2,1,1,1
514607312793792,1,1,1,1,1,1,1,1
514607322895249,1,1,1,3,1,1,3,1
514607332994278,1,0,1,1,1,2,1,2
514607343095836,1,1,1,1,1,2,1,1
514607353196533,1,1,1,1,2,1,1,1
514607363297749,1,1,1,1,1,1,1,2
514607373399011,1,1,1,1,1,1,1,2
514607383499730,1,1,1,1,1,1,1,2
514607393601510,1,1,1,1,1,1,1,2
514607403704117,2,1,1,1,1,1,1,2
514607413802700,1,1,1,1,2,1,0,1
514607423904559,1,1,1,1,1,1,1,1
[...]
The output is verbose: it prints a timestamp, and then the length of each
CPU's run queue. The second-to-last line, with timestamp 514607413802700, is
an example of what this tool detects: CPU 4 has a run queue length of 2, which
means one thread running and one thread queued, while CPU 6 has a run queue
length of 0: idle. The very next sample shows all CPUs busy.
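If you want to recompute these percentages from the raw -j output yourself,
here is a minimal post-processing sketch (a hypothetical helper, not part of
bcc; it assumes the plain -j CSV layout shown above, TIMESTAMP_ns followed by
one run queue length per CPU, piped on stdin):

#!/usr/bin/python
# recompute %CPU and unclaimed idle from "cpuunclaimed -j" output on stdin
import sys

positive = running = total = 0
for line in sys.stdin:
    fields = line.strip().split(",")
    if not fields or not fields[0].isdigit():
        continue                                # skip the header line
    lens = [int(x) for x in fields[1:]]         # per-CPU run queue lengths
    idle = sum(1 for l in lens if l == 0)       # idle CPUs in this group
    queued = sum(l - 1 for l in lens if l > 1)  # threads waiting their turn
    positive += min(idle, queued)   # idle CPUs that could have run work
    running += sum(1 for l in lens if l > 0)
    total += len(lens)
if total:
    print("%%CPU %.2f%%, unclaimed idle %.2f%%" % (
        100.0 * running / total, 100.0 * positive / total))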
The -J option prints raw samples with time offsets showing when the samples
were collected on each CPU. It's mostly useful for debugging the tool itself.
For example, during a Linux kernel build:
# ./cpuunclaimed.py -J
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,CPU4,CPU5,CPU6,CPU7,OFFSET_ns_CPU0,OFFSET_ns_CPU1,OFFSET_ns_CPU2,OFFSET_ns_CPU3,OFFSET_ns_CPU4,OFFSET_ns_CPU5,OFFSET_ns_CPU6,OFFSET_ns_CPU7
514722625198188,1,1,1,1,1,1,1,2,0,28321,51655,73396,89654,111172,132803,159792
514722635299034,1,1,1,1,1,2,1,1,0,28809,51999,74183,89552,110011,131995,153519
514722645400274,1,1,1,1,1,1,1,2,0,28024,51333,73652,88964,110075,131973,153568
514722655501816,1,2,1,1,1,1,1,1,0,28893,51671,75233,89496,109430,131945,153694
514722665602594,1,1,2,1,1,2,1,1,0,28623,50988,73866,89383,109186,131786,154555
514722675703498,1,1,1,1,1,1,1,1,0,27379,51031,73175,89625,110380,131482,104811
514722685804942,1,1,1,1,1,2,1,1,0,27213,50501,72183,88797,108780,130659,152153
514722695906294,1,1,1,1,1,1,1,1,0,27036,51182,73420,87861,109585,130364,155089
514722706005778,1,1,1,1,1,1,1,1,0,28492,51851,74138,89744,110208,132462,154060
514722716060705,1,1,1,1,1,1,1,1,0,154499,152528,155232,155046,154502,178746,200001
514722726209615,1,1,1,1,1,1,1,1,0,28170,49580,72605,87741,108144,130723,152138
514722736309475,1,2,1,1,1,1,1,1,0,27421,51386,73061,89358,109457,131273,153005
514722746410845,1,2,1,1,1,2,1,1,0,27788,50840,72720,88920,109111,131143,152979
514722756511363,1,1,1,1,1,1,2,1,0,28280,50977,73559,89848,109659,131579,152693
514722766613044,1,1,1,1,1,1,1,1,0,28046,50812,72754,89160,110108,130735,152948
514722776712932,1,1,1,2,1,1,1,1,0,28586,51177,73974,89588,109947,132376,154162
514722786815477,1,1,1,1,1,1,1,1,0,27973,71104,72539,88302,108896,130414,152236
514722796914955,1,1,1,1,1,1,1,1,0,29054,52354,74214,89592,110615,132586,153925
514722807044060,1,1,1,1,1,1,1,1,1587130,0,24079,46633,61787,82325,104706,125278
514722817117432,2,1,2,1,1,1,1,1,0,27628,51038,75138,89724,109340,132426,155348
514722827218254,1,1,1,1,1,1,2,1,0,29111,51868,74347,88904,109911,132764,153851
514722837340158,1,1,1,1,1,1,1,1,0,7366,30760,53528,68622,89317,111095,132319
514722847421305,1,1,1,1,1,1,1,1,0,28257,51105,73841,89037,110820,131605,153368
514722857521112,1,1,1,1,1,1,1,1,0,28544,51441,73857,89530,110497,131915,153513
514722867626129,0,2,1,1,1,1,1,1,0,24621,47917,70568,85391,106670,128081,150329
514722877727183,2,1,1,1,1,1,1,1,0,24869,47630,71547,84761,106048,128444,149285
514722887824589,1,1,1,1,1,1,2,1,0,28793,51212,73863,89584,109773,132348,153194
514722897925481,1,1,1,1,1,1,2,1,0,29278,51163,73961,89774,109592,132029,153715
514722908026097,1,1,1,1,1,1,1,1,0,30630,35595,36210,188001,190815,190072,190732
514722918127439,1,1,1,1,1,1,1,1,0,28544,51885,73948,89987,109763,132632,154083
514722928227399,1,1,1,1,1,1,1,1,0,31882,51574,74769,89939,110578,132951,154356
514722938329471,1,1,1,1,1,1,1,1,0,28498,51304,74101,89670,110278,132653,153176
514722948430589,1,1,1,1,1,1,1,1,0,27868,50925,73477,89676,109583,132360,153014
514722958531802,1,1,1,1,1,1,1,1,0,28505,50886,73729,89919,109618,131988,152896
514722968632181,1,1,1,1,1,1,1,1,0,28492,51749,73977,90334,109816,132897,152890
514722978733584,1,1,1,1,1,1,1,1,0,28847,50957,74121,90014,110019,132377,152978
514722988834321,1,1,1,1,1,1,1,1,0,28601,51437,74021,89968,110252,132233,153623
514722998937170,1,1,2,1,1,1,1,1,0,27007,50044,73259,87725,108663,132194,152459
514723009036821,1,2,1,2,1,1,1,1,0,28226,50937,73983,89110,110476,131740,153663
514723019137577,1,1,1,1,1,1,1,1,0,30261,52357,75657,87803,61823,131850,153585
514723029238745,1,1,1,1,1,1,1,1,0,28030,50752,74452,89240,110791,132187,153327
514723039339069,1,1,1,1,1,1,1,1,0,29791,52636,75996,90475,110414,132232,154714
514723049439822,1,1,1,1,2,1,1,1,0,29133,56662,74153,89520,110683,132740,154708
514723059541617,1,1,1,1,1,1,1,1,0,27932,51480,74644,89656,109176,131499,153732
514723069642500,1,1,2,1,1,1,2,1,0,27678,51509,73984,90136,110124,131554,153459
514723079743525,2,1,1,1,1,1,1,1,0,28029,51424,74394,90056,110087,132383,152963
514723089844091,2,1,1,2,1,1,1,1,0,28944,51692,74440,90339,110402,132722,154083
514723099945957,1,1,2,1,1,1,1,1,0,28425,51267,73164,89322,115048,114630,115187
514723110047020,1,1,2,0,1,1,1,2,0,28192,50811,76814,89835,109370,131265,153511
514723120216662,1,1,2,1,1,2,1,1,29,34,0,4514,19268,40293,62674,84009
[...]
This is a Xen guest system, and it shows that CPU 0 usually completes first (an
offset of 0), followed by CPU 1 around 28000 nanoseconds later, and so on.
The spread of offsets is caused by the bcc Python library that initializes
the timers: it steps through the CPUs in sequence, with a small delay between
them merely from executing its own loop code.
Here's more output during a Linux kernel build:
# ./cpuunclaimed.py -J
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,CPU4,CPU5,CPU6,CPU7,OFFSET_ns_CPU0,OFFSET_ns_CPU1,OFFSET_ns_CPU2,OFFSET_ns_CPU3,OFFSET_ns_CPU4,OFFSET_ns_CPU5,OFFSET_ns_CPU6,OFFSET_ns_CPU7
514722625198188,1,1,1,1,1,1,1,2,0,28321,51655,73396,89654,111172,132803,159792
515700745758947,2,1,1,1,1,1,1,1,0,19835,34891,49397,59364,71988,87571,102769
515700755860451,2,1,1,1,1,1,1,2,0,19946,34323,49855,59844,72741,87925,102891
515700765960560,1,1,1,1,1,1,1,1,0,20805,35339,50436,59677,73557,88661,104796
515700776061744,1,1,1,1,1,1,1,1,1626,77,0,190,153452,154665,178218,154116
515700786162017,1,1,1,1,1,1,1,1,0,20497,35361,51552,59787,74451,147789,104545
515700796262811,1,1,1,1,1,1,1,2,0,20910,35657,50805,60175,73953,88492,103527
515700806364951,1,1,1,1,1,1,1,1,0,20140,35023,50074,59726,72757,88040,102421
515700816465253,1,1,1,1,1,1,2,1,0,20952,34899,50262,60048,72890,88067,103545
515700826566573,1,1,1,1,1,1,1,1,0,20898,35490,50609,59805,74060,88550,103354
515700836667480,1,1,1,1,1,1,2,1,0,20548,34760,50959,59490,73059,87820,103006
515700846768182,1,1,1,1,1,1,2,1,0,20571,35113,50777,59962,74139,88543,103192
515700856869468,1,1,2,1,1,2,2,1,0,20932,35382,50510,60106,73739,91818,103684
515700866971905,1,1,1,2,1,1,1,1,0,19780,33018,49075,58375,71949,86537,102136
515700877073459,2,1,1,1,1,1,1,1,0,20065,73966,48989,58832,71408,85714,101067
515700887172772,1,1,1,1,1,1,1,1,0,20909,34608,51493,59890,73564,88668,103454
515700897273292,1,2,1,1,1,1,1,1,0,20353,35292,50114,59773,73948,88615,103383
515700907374341,1,1,2,1,1,1,1,1,0,20816,35206,50915,60062,73878,88857,103794
515700917475331,1,1,6,1,1,2,1,1,0,20752,34931,50280,59764,73781,88329,103234
515700927576958,1,1,1,1,1,1,1,1,0,19929,34703,50181,59364,73004,88053,103127
515700937677298,1,1,2,2,1,1,1,1,0,21178,34724,50740,61193,73452,89030,103390
515700947778409,2,1,1,1,1,1,1,1,0,21059,35604,50853,60098,73919,88675,103506
515700957879196,2,1,1,1,1,1,1,1,0,21326,35939,51492,60083,74249,89474,103761
[...]
Notice the tighter range of offsets? I began executing cpuunclaimed when the
system was idle, and it initialized the CPU timers more quickly, and then I
began the Linux kernel build.
Here's some different output, this time from a physical system with 4 CPUs,
also doing a kernel build:
# ./cpuunclaimed.py -J
TIMESTAMP_ns,CPU0,CPU1,CPU2,CPU3,OFFSET_ns_CPU0,OFFSET_ns_CPU1,OFFSET_ns_CPU2,OFFSET_ns_CPU3
4429382557480,1,1,1,1,0,6011,10895,16018
4429392655042,2,1,1,1,0,8217,13661,19378
4429402757604,1,1,1,1,0,6879,12433,18000
4429412857809,1,1,1,1,0,8303,13190,18719
4429422960709,2,1,1,1,0,6095,11234,17079
4429433060391,1,1,1,2,0,6747,12480,18070
4429443161699,1,1,1,1,0,6560,12264,17945
4429453262002,1,2,1,1,0,6992,12644,18341
4429463363706,1,2,1,1,0,6211,12071,17853
4429473465571,1,1,1,1,0,5766,11495,17638
4429483566920,1,1,1,1,0,5223,11736,16358
4429493666279,1,1,1,1,0,6964,12653,18410
4429503769113,1,1,1,1,0,5161,11399,16612
4429513870744,1,1,1,1,0,5943,10583,15768
4429523969826,1,1,1,1,0,6533,12336,18189
4429534070311,1,1,1,1,0,6834,12816,18488
4429544170456,1,1,1,1,0,7284,13401,19129
4429554274467,1,2,1,1,0,5941,11160,16594
4429564372365,1,2,1,1,0,7514,13618,19190
4429574474406,1,2,1,1,0,6687,12650,18248
4429584574220,1,2,1,1,0,7912,13705,19136
[...]
If the offset range becomes too great, we can no longer tell whether some
CPUs were idle while others had queued work. The tool will detect this,
print an error message, and exit.
Some systems can power down CPUs when idle, and when they wake up again the
timed samples may resume from different offsets. If this happens, this tool
can no longer draw conclusions about when some CPUs were idle and others
had queued work, so it prints an error, and exits. Eg:
# ./cpuunclaimed.py 1
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
%CPU 0.25%, unclaimed idle 0.00%
%CPU 0.75%, unclaimed idle 0.00%
%CPU 0.25%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.12%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.25%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.12%, unclaimed idle 0.00%
%CPU 0.13%, unclaimed idle 0.00%
%CPU 0.12%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
%CPU 0.00%, unclaimed idle 0.00%
ERROR: CPU samples arrived at skewed offsets (CPUs may have powered down when idle), spanning 4328176 ns (expected < 4040404 ns). Debug with -J, and see the man page. As output may begin to be unreliable, exiting.
It's expected that this will only really occur on idle systems.
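The 4328176 ns span in this message exceeded a threshold of 4040404 ns, which
is half the expected gap between sample groups: 0.8 x (1e9 / 99 Hertz) / 2 =~
4040404 ns (the "trigger" calculation in the tool source).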
USAGE:
# ./cpuunclaimed.py -h
usage: cpuunclaimed.py [-h] [-j] [-J] [-T] [interval] [count]
Sample CPU run queues and calculate unclaimed idle CPU
positional arguments:
interval output interval, in seconds
count number of outputs
optional arguments:
-h, --help show this help message and exit
-j, --csv print sample summaries (verbose) as comma-separated values
-J, --fullcsv print sample summaries with extra fields: CPU sample
offsets
-T, --timestamp include timestamp on output
examples:
./cpuunclaimed # sample and calculate unclaimed idle CPUs,
# output every 1 second (default)
./cpuunclaimed 5 10 # print 5 second summaries, 10 times
./cpuunclaimed -T 1 # 1s summaries and timestamps
./cpuunclaimed -j # raw dump of all samples (verbose), CSV