Add profile: a CPU profiler (#620)

* Add profile: a CPU profiler
* move Perf to common class

Commit f4bf2751 authored Jul 21, 2016 by Brendan Gregg Committed by 4ast Jul 21, 2016
7 changed files
--- a/README.md
+++ b/README.md
@@ -100,6 +100,7 @@ Examples:
 - tools/[oomkill](tools/oomkill.py): Trace the out-of-memory (OOM) killer. [Examples](tools/oomkill_example.txt).
 - tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
 - tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
+- tools/[profile](tools/profile.py): Profile CPU usage by sampling stack traces at a timed interval. [Examples](tools/profile_example.txt).
 - tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
 - tools/[softirqs](tools/softirqs.py):  Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
 - tools/[solisten](tools/solisten.py): Trace TCP socket listen. [Examples](tools/solisten_example.txt).

--- a/man/man8/profile.8
+++ b/man/man8/profile.8
+.TH profile 8  "2016-07-17" "USER COMMANDS"
+.SH NAME
+profile \- Profile CPU usage by sampling stack traces. Uses Linux eBPF/bcc.
+.SH SYNOPSIS
+.B profile [\-adfh] [\-p PID] [\-U | \-K] [\-F FREQUENCY]
+.B [\-\-stack\-storage\-size COUNT] [\-S FRAMES] [duration]
+.SH DESCRIPTION
+This is a CPU profiler. It works by taking samples of stack traces at timed
+intervals. It will help you understand and quantify CPU usage: which code is
+executing, and by how much, including both user-level and kernel code.
+
+By default this samples at 49 Hertz (samples per second), across all CPUs.
+This frequency can be tuned using a command line option. The reason for 49, and
+not 50, is to avoid lock-step sampling.
+
+This is also an efficient profiler, as stack traces are frequency counted in
+kernel context, rather than passing each stack to user space for frequency
+counting there. Only the unique stacks and counts are passed to user space
+at the end of the profile, greatly reducing the kernel<->user transfer.
+
+Note: if another perf-based sampling session is active, the output may become
+polluted with its events. On older kernels, the output may also become
+polluted with tracing sessions (when the kprobe is used instead of the
+tracepoint). This may be filtered in a future version if it becomes a problem.
+.SH REQUIREMENTS
+CONFIG_BPF and bcc.
+
+This also requires Linux 4.6+ (BPF_MAP_TYPE_STACK_TRACE support), and the
+perf:perf_hrtimer tracepoint (currently a kernel patch). If the latter is
+unavailable, this will try to use kprobes as a fallback (of perf_misc_flags()),
+which may or may not work, depending on your kernel build. If the kprobe
+doesn't work, this tool will either error on instrumentation, or instrument
+successfully but generate no output.
+.SH OPTIONS
+.TP
+\-h
+Print usage message.
+.TP
+\-p PID
+Profile this process ID only (filtered in-kernel). Without this, all processes
+on all CPUs are profiled.
+.TP
+\-F FREQUENCY
+Frequency to sample stacks, in Hertz (default 49).
+.TP
+\-f
+Print output in folded stack format.
+.TP
+\-d
+Include an output delimiter between kernel and user stacks (either "--", or,
+in folded mode, "-").
+.TP
+\-U
+Show stacks from user space only (no kernel space stacks).
+.TP
+\-K
+Show stacks from kernel space only (no user space stacks).
+.TP
+\-\-stack\-storage\-size COUNT
+The maximum number of unique stack traces that the kernel will count (default
+2048). If the number of unique stacks sampled exceeds this, a warning will be
+printed.
+.TP
+\-S FRAMES
+A fixed number of kernel frames to skip. By default, extra registers are
+recorded so that the interrupt framework stack can be identified and excluded
+from the output. If this isn't working on your architecture, or, if you'd
+like to improve performance a tiny amount, then you can specify a fixed count
+to skip. Note for debugging that the instruction pointer (IP) is printed as the
+first frame, followed by the captured stack.
+.TP
+duration
+Duration to profile, in seconds.
+.SH EXAMPLES
+.TP
+Profile (sample) stack traces system-wide at 49 Hertz (samples per second) until Ctrl-C:
+#
+.B profile
+.TP
+Profile for 5 seconds only:
+#
+.B profile 5
+.TP
+Profile at 99 Hertz for 5 seconds only:
+#
+.B profile \-F 99 5
+.TP
+Profile PID 181 only:
+#
+.B profile \-p 181
+.TP
+Profile for 5 seconds and output in folded stack format (suitable as input for flame graphs), including a delimiter between kernel and user stacks:
+#
+.B profile \-df 5
+.TP
+Profile kernel stacks only:
+#
+.B profile \-K
+.SH DEBUGGING
+See "[unknown]" frames with bogus addresses? This can happen for different
+reasons. Your best approach is to get Linux perf to work first, and then to
+try this tool. E.g., run "perf record \-F 49 \-a \-g \-\- sleep 1; perf script"
+and check for unknown frames there.
+
+The most common reason for "[unknown]" frames is that the target software has
+not been compiled with frame pointers, and so we can't use that simple method
+for walking the stack. The fix in that case is to use software that does have
+frame pointers, e.g., gcc \-fno\-omit\-frame\-pointer, or Java's
+\-XX:+PreserveFramePointer.
+
+Another reason for "[unknown]" frames is JIT compilers, which don't use a
+traditional symbol table. The fix in that case is to populate a
+/tmp/perf-PID.map file with the symbols, which this tool should read. How you
+do this depends on the runtime (Java, Node.js).
+
+If you seem to have unrelated samples in the output, check for other
+sampling or tracing tools that may be running. The current version of this
+tool can include their events if profiling happened concurrently. Those
+samples may be filtered in a future version.
+.SH OVERHEAD
+This is an efficient profiler, as stack traces are frequency counted in
+kernel context, and only the unique stacks and their counts are passed to
+user space. Contrast this with the current "perf record -F 99 -a" method
+of profiling, which writes each sample to user space (via a ring buffer),
+and then to the file system (perf.data), which must be post-processed.
+
+This uses perf_event_open to set up a timer which is instrumented by BPF,
+and for efficiency it does not initialize the perf ring buffer, so the
+redundant perf samples are not collected.
+
+It's expected that the overhead while sampling at 49 Hertz (the default),
+across all CPUs, should be negligible. If you increase the sample rate, the
+overhead might begin to be measurable.
+.SH SOURCE
+This is from bcc.
+.IP
+https://github.com/iovisor/bcc
+.PP
+Also look in the bcc distribution for a companion _examples.txt file containing
+example usage, output, and commentary for this tool.
+.SH OS
+Linux
+.SH STABILITY
+Unstable - in development.
+.SH AUTHOR
+Brendan Gregg
+.SH SEE ALSO
+offcputime(8)
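
Review note on the man page above: the DESCRIPTION and OPTIONS sections describe frequency counting of unique stacks and a folded output format with an optional "-" delimiter between user and kernel frames. A minimal user-space sketch of that idea follows. It is an illustration only: the real tool counts stacks in kernel context, the frame names and the `fold` helper are hypothetical, and the exact line layout printed by the tool may differ.

```python
from collections import Counter

def fold(user_frames, kernel_frames, delimiter=False):
    """Join one sampled stack into a single folded line, root frame first.

    Hypothetical helper for illustration; not taken from tools/profile.py.
    """
    frames = list(user_frames)
    # In this sketch the "-" delimiter is only emitted when both halves exist.
    if delimiter and user_frames and kernel_frames:
        frames.append("-")
    frames.extend(kernel_frames)
    return ";".join(frames)

# Made-up samples: (user stack, kernel stack), each listed root first.
samples = [
    (["main", "do_work"], ["sys_read", "vfs_read"]),
    (["main", "do_work"], ["sys_read", "vfs_read"]),
    (["main", "idle_loop"], []),
]

# Frequency count: identical stacks collapse into one entry with a count,
# so only unique stacks and their counts need to be reported.
counts = Counter(fold(u, k, delimiter=True) for u, k in samples)
folded_lines = ["%s %d" % (stack, n) for stack, n in sorted(counts.items())]
for line in folded_lines:
    print(line)
```

Three samples collapse into two output lines here, which is the reduction in kernel-to-user transfer that the man page refers to.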
--- a/src/python/bcc/__init__.py
+++ b/src/python/bcc/__init__.py
@@ -27,7 +27,8 @@ basestring = (unicode if sys.version_info[0] < 3 else str)
 from .libbcc import lib, _CB_TYPE, bcc_symbol
 from .procstat import ProcStat, ProcUtils
 from .table import Table
-from .tracepoint import Perf, Tracepoint
+from .tracepoint import Tracepoint
+from .perf import Perf
 from .usyms import ProcessSymbols

 _kprobe_limit = 1000

--- a/src/python/bcc/perf.py
+++ b/src/python/bcc/perf.py
+# Copyright 2016 Sasha Goldshtein
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import ctypes as ct
+import multiprocessing
+import os
+
+class Perf(object):
+        class perf_event_attr(ct.Structure):
+                _fields_ = [
+                        ('type', ct.c_uint),
+                        ('size', ct.c_uint),
+                        ('config', ct.c_ulong),
+                        ('sample_period', ct.c_ulong),
+                        ('sample_type', ct.c_ulong),
+                        ('read_format', ct.c_ulong),
+                        ('flags', ct.c_ulong),
+                        ('wakeup_events', ct.c_uint),
+                        ('IGNORE3', ct.c_uint),
+                        ('IGNORE4', ct.c_ulong),
+                        ('IGNORE5', ct.c_ulong),
+                        ('IGNORE6', ct.c_ulong),
+                        ('IGNORE7', ct.c_uint),
+                        ('IGNORE8', ct.c_int),
+                        ('IGNORE9', ct.c_ulong),
+                        ('IGNORE10', ct.c_uint),
+                        ('IGNORE11', ct.c_uint)
+                ]
+
+        # x86_64 specific, from arch/x86/include/generated/uapi/asm/unistd_64.h
+        NR_PERF_EVENT_OPEN = 298
+
+        #
+        # Selected constants from include/uapi/linux/perf_event.h.
+        # Values copied during Linux 4.7 series.
+        #
+
+        # perf_type_id
+        PERF_TYPE_HARDWARE = 0
+        PERF_TYPE_SOFTWARE = 1
+        PERF_TYPE_TRACEPOINT = 2
+
+        # perf_event_sample_format
+        PERF_SAMPLE_RAW = 1024      # it's a u32; could also try zero args
+
+        # perf_event_attr
+        PERF_ATTR_FLAG_FREQ = 1024
+
+        # perf_event.h
+        PERF_FLAG_FD_CLOEXEC = 8
+        PERF_EVENT_IOC_SET_FILTER = 1074275334
+        PERF_EVENT_IOC_ENABLE = 9216
+
+        # fetch syscall routines
+        libc = ct.CDLL('libc.so.6', use_errno=True)
+        syscall = libc.syscall          # not declaring vararg types
+        ioctl = libc.ioctl              # not declaring vararg types
+
+        @staticmethod
+        def _open_for_cpu(cpu, attr):
+                pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
+                                   attr.pid, cpu, -1,
+                                   Perf.PERF_FLAG_FD_CLOEXEC)
+                if pfd < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+
+                if attr.type == Perf.PERF_TYPE_TRACEPOINT:
+                        if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
+                                      "common_pid == -17") < 0:
+                                errno_ = ct.get_errno()
+                                raise OSError(errno_, os.strerror(errno_))
+
+                # we don't set up the perf ring buffers, as we won't read them
+
+                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
+                        errno_ = ct.get_errno()
+                        raise OSError(errno_, os.strerror(errno_))
+
+        @staticmethod
+        def perf_event_open(tpoint_id, pid=-1, ptype=PERF_TYPE_TRACEPOINT,
+                            freq=0):
+                attr = Perf.perf_event_attr()
+                attr.config = tpoint_id
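+                # pid is not a struct field; it is stashed on the Python
+                # object so that _open_for_cpu() can pass it to the syscall
+                attr.pid = pid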
+                attr.type = ptype
+                attr.sample_type = Perf.PERF_SAMPLE_RAW
+                if freq > 0:
+                        # set up sampling
+                        attr.flags = Perf.PERF_ATTR_FLAG_FREQ   # no mmap or comm
+                        # with the freq flag set, the kernel reads this
+                        # field as sample_freq (Hz)
+                        attr.sample_period = freq
+                else:
+                        attr.sample_period = 1
+                attr.wakeup_events = 9999999                # don't wake up
+
+                for cpu in range(0, multiprocessing.cpu_count()):
+                        Perf._open_for_cpu(cpu, attr)
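
Review note on the hard-coded constants in perf.py above: the magic numbers can be cross-checked by rebuilding them from their definitions. The sketch below assumes the generic Linux `_IOC` bit layout used on x86 (some other architectures encode the direction bits differently) and 64-bit pointers. The `ioc` helper and the lowercase names are illustrative and not part of bcc.

```python
def ioc(direction, ioc_type, nr, size):
    """Encode an ioctl request number with the generic Linux _IOC layout:
    direction in bits 30-31, size in bits 16-29, type in bits 8-15, nr in 0-7.
    """
    return (direction << 30) | (size << 16) | (ord(ioc_type) << 8) | nr

IOC_NONE, IOC_WRITE = 0, 1
POINTER_SIZE = 8  # assumption: sizeof(char *) on a 64-bit kernel ABI

# perf ioctls use '$' as their type character.
perf_event_ioc_enable = ioc(IOC_NONE, '$', 0, 0)                  # _IO('$', 0)
perf_event_ioc_set_filter = ioc(IOC_WRITE, '$', 6, POINTER_SIZE)  # _IOW('$', 6, char *)

# Single-bit values.
perf_sample_raw = 1 << 10       # PERF_SAMPLE_RAW
perf_attr_flag_freq = 1 << 10   # 'freq' is bit 10 of the attr flags bitfield
perf_flag_fd_cloexec = 1 << 3   # PERF_FLAG_FD_CLOEXEC

print(perf_event_ioc_enable, perf_event_ioc_set_filter)
print(perf_sample_raw, perf_attr_flag_freq, perf_flag_fd_cloexec)
```

The results agree with the values in the class (9216, 1074275334, 1024, 1024, 8). The pointer-size assumption also means PERF_EVENT_IOC_SET_FILTER, like NR_PERF_EVENT_OPEN, is tied to a 64-bit build.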
--- a/src/python/bcc/tracepoint.py
+++ b/src/python/bcc/tracepoint.py
@@ -17,65 +17,6 @@ import multiprocessing
 import os
 import re

-class Perf(object):
-        class perf_event_attr(ct.Structure):
-                _fields_ = [
-                        ('type', ct.c_uint),
-                        ('size', ct.c_uint),
-                        ('config', ct.c_ulong),
-                        ('sample_period', ct.c_ulong),
-                        ('sample_type', ct.c_ulong),
-                        ('IGNORE1', ct.c_ulong),
-                        ('IGNORE2', ct.c_ulong),
-                        ('wakeup_events', ct.c_uint),
-                        ('IGNORE3', ct.c_uint),
-                        ('IGNORE4', ct.c_ulong),
-                        ('IGNORE5', ct.c_ulong),
-                        ('IGNORE6', ct.c_ulong),
-                        ('IGNORE7', ct.c_uint),
-                        ('IGNORE8', ct.c_int),
-                        ('IGNORE9', ct.c_ulong),
-                        ('IGNORE10', ct.c_uint),
-                        ('IGNORE11', ct.c_uint)
-                ]
-
-        NR_PERF_EVENT_OPEN = 298
-        PERF_TYPE_TRACEPOINT = 2
-        PERF_SAMPLE_RAW = 1024
-        PERF_FLAG_FD_CLOEXEC = 8
-        PERF_EVENT_IOC_SET_FILTER = 1074275334
-        PERF_EVENT_IOC_ENABLE = 9216
-
-        libc = ct.CDLL('libc.so.6', use_errno=True)
-        syscall = libc.syscall          # not declaring vararg types
-        ioctl = libc.ioctl              # not declaring vararg types
-
-        @staticmethod
-        def _open_for_cpu(cpu, attr):
-                pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
-                                   -1, cpu, -1, Perf.PERF_FLAG_FD_CLOEXEC)
-                if pfd < 0:
-                        errno_ = ct.get_errno()
-                        raise OSError(errno_, os.strerror(errno_))
-                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
-                              "common_pid == -17") < 0:
-                        errno_ = ct.get_errno()
-                        raise OSError(errno_, os.strerror(errno_))
-                if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
-                        errno_ = ct.get_errno()
-                        raise OSError(errno_, os.strerror(errno_))
-
-        @staticmethod
-        def perf_event_open(tpoint_id):
-                attr = Perf.perf_event_attr()
-                attr.config = tpoint_id
-                attr.type = Perf.PERF_TYPE_TRACEPOINT
-                attr.sample_type = Perf.PERF_SAMPLE_RAW
-                attr.sample_period = 1
-                attr.wakeup_events = 1
-                for cpu in range(0, multiprocessing.cpu_count()):
-                        Perf._open_for_cpu(cpu, attr)
-
 class Tracepoint(object):
        enabled_tracepoints = []
        trace_root = "/sys/kernel/debug/tracing"
@@ -172,7 +113,7 @@ struct %s {
                if tp_id == -1:
                        raise ValueError("no such tracepoint found: %s:%s" %
                                         (category, event))
-                Perf.perf_event_open(tp_id)
+                Perf.perf_event_open(tp_id, ptype=Perf.PERF_TYPE_TRACEPOINT)
                new_tp = Tracepoint(category, event, tp_id)
                cls.enabled_tracepoints.append(new_tp)
                return new_tp
@@ -199,4 +140,3 @@ struct %s {
                if cls._any_tracepoints_enabled():
                        bpf.attach_kprobe(event="tracing_generic_entry_update",
                                          fn_name="__trace_entry_update")
-
--- a/tools/profile.py
+++ b/tools/profile.py
--- a/tools/profile_example.txt
+++ b/tools/profile_example.txt