Commit 715f7e6e authored by Brendan Gregg's avatar Brendan Gregg Committed by 4ast

Update profile.py to use new perf support (#776)

* profile.py to use new perf support

* Minor adjustments to llcstat docs
parent be294db8
...@@ -100,6 +100,7 @@ Examples: ...@@ -100,6 +100,7 @@ Examples:
- tools/[hardirqs](tools/hardirqs.py): Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt). - tools/[hardirqs](tools/hardirqs.py): Measure hard IRQ (hard interrupt) event time. [Examples](tools/hardirqs_example.txt).
- tools/[killsnoop](tools/killsnoop.py): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt). - tools/[killsnoop](tools/killsnoop.py): Trace signals issued by the kill() syscall. [Examples](tools/killsnoop_example.txt).
- tools/[slabratetop](tools/slabratetop.py): Kernel SLAB/SLUB memory cache allocation rate top. [Examples](tools/slabratetop_example.txt). - tools/[slabratetop](tools/slabratetop.py): Kernel SLAB/SLUB memory cache allocation rate top. [Examples](tools/slabratetop_example.txt).
- tools/[llcstat](tools/llcstat.py): Summarize CPU cache references and misses by process. [Examples](tools/llcstat_example.txt).
- tools/[mdflush](tools/mdflush.py): Trace md flush events. [Examples](tools/mdflush_example.txt). - tools/[mdflush](tools/mdflush.py): Trace md flush events. [Examples](tools/mdflush_example.txt).
- tools/[mysqld_qslower](tools/mysqld_qslower.py): Trace MySQL server queries slower than a threshold. [Examples](tools/mysqld_qslower_example.txt). - tools/[mysqld_qslower](tools/mysqld_qslower.py): Trace MySQL server queries slower than a threshold. [Examples](tools/mysqld_qslower_example.txt).
- tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_example.txt). - tools/[memleak](tools/memleak.py): Display outstanding memory allocations to find memory leaks. [Examples](tools/memleak_example.txt).
......
.TH llcstat 8 "2015-08-18" "USER COMMANDS" .TH llcstat 8 "2015-08-18" "USER COMMANDS"
.SH NAME .SH NAME
llcstat \- Trace cache references and cache misses. Uses Linux eBPF/bcc. llcstat \- Summarize CPU cache references and misses by process. Uses Linux eBPF/bcc.
.SH SYNOPSIS .SH SYNOPSIS
.B llcstat [\-h] [\-c SAMPLE_PERIOD] [duration] .B llcstat [\-h] [\-c SAMPLE_PERIOD] [duration]
.SH DESCRIPTION .SH DESCRIPTION
llcstat traces cache references and cache misses system-side, and summarizes llcstat instruments CPU cache references and cache misses system-side, and
them by PID and CPU. These events have different meanings on different summarizes them by PID and CPU. These events have different meanings on
architecture. For x86-64, they mean misses and references to LLC. different architecture. For x86-64, they mean misses and references to LLC.
This can be useful to locate and debug performance issues This can be useful to locate and debug performance issues
caused by cache hit rate. caused by cache hit rate.
......
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
profile \- Profile CPU usage by sampling stack traces. Uses Linux eBPF/bcc. profile \- Profile CPU usage by sampling stack traces. Uses Linux eBPF/bcc.
.SH SYNOPSIS .SH SYNOPSIS
.B profile [\-adfh] [\-p PID] [\-U | \-k] [\-F FREQUENCY] .B profile [\-adfh] [\-p PID] [\-U | \-k] [\-F FREQUENCY]
.B [\-\-stack\-storage\-size COUNT] [\-S FRAMES] [duration] .B [\-\-stack\-storage\-size COUNT] [duration]
.SH DESCRIPTION .SH DESCRIPTION
This is a CPU profiler. It works by taking samples of stack traces at timed This is a CPU profiler. It works by taking samples of stack traces at timed
intervals. It will help you understand and quantify CPU usage: which code is intervals. It will help you understand and quantify CPU usage: which code is
...@@ -17,17 +17,11 @@ This is also an efficient profiler, as stack traces are frequency counted in ...@@ -17,17 +17,11 @@ This is also an efficient profiler, as stack traces are frequency counted in
kernel context, rather than passing each stack to user space for frequency kernel context, rather than passing each stack to user space for frequency
counting there. Only the unique stacks and counts are passed to user space counting there. Only the unique stacks and counts are passed to user space
at the end of the profile, greatly reducing the kernel<->user transfer. at the end of the profile, greatly reducing the kernel<->user transfer.
Note: if another perf-based sampling or tracing session is active, the output
may become polluted with their events. This will be fixed for Linux 4.9.
.SH REQUIREMENTS .SH REQUIREMENTS
CONFIG_BPF and bcc. CONFIG_BPF and bcc.
This also requires Linux 4.6+ (BPF_MAP_TYPE_STACK_TRACE support), and the This also requires Linux 4.9+ (BPF_PROG_TYPE_PERF_EVENT support). See tools/old
perf_misc_flags() function symbol to exist. The latter may or may not for an older version that may work on Linux 4.6 - 4.8.
exist depending on your kernel build, and if it doesn't exist, this tool
will not work. Linux 4.9 provides a proper solution to this (this tool will
be updated).
.SH OPTIONS .SH OPTIONS
.TP .TP
\-h \-h
...@@ -57,14 +51,6 @@ Show stacks from kernel space only (no user space stacks). ...@@ -57,14 +51,6 @@ Show stacks from kernel space only (no user space stacks).
The maximum number of unique stack traces that the kernel will count (default The maximum number of unique stack traces that the kernel will count (default
2048). If the sampled count exceeds this, a warning will be printed. 2048). If the sampled count exceeds this, a warning will be printed.
.TP .TP
\-S FRAMES
A fixed number of kernel frames to skip. By default, extra registers are
recorded so that the interrupt framework stack can be identified and excluded
from the output. If this isn't working on your architecture, or, if you'd
like to improve performance a tiny amount, then you can specify a fixed count
to skip. Note for debugging that the IP address is printed as the first frame,
followed by the captured stack.
.TP
duration duration
Duration to trace, in seconds. Duration to trace, in seconds.
.SH EXAMPLES .SH EXAMPLES
......
Demonstrations of llcstat. Demonstrations of llcstat.
llcstat traces cache reference and cache miss events system-wide, and summarizes llcstat traces cache reference and cache miss events system-wide, and summarizes
them by PID and CPU. them by PID and CPU.
These events, defined in uapi/linux/perf_event.h, have different meanings on These events, defined in uapi/linux/perf_event.h, have different meanings on
different architecture. For x86-64, they mean misses and references to LLC. different architecture. For x86-64, they mean misses and references to LLC.
...@@ -25,6 +27,18 @@ Total References: 518920000 Total Misses: 90265000 Hit Rate: 82.61% ...@@ -25,6 +27,18 @@ Total References: 518920000 Total Misses: 90265000 Hit Rate: 82.61%
This shows each PID's cache hit rate during the 20 seconds run period. This shows each PID's cache hit rate during the 20 seconds run period.
A count of 5000 was used in this example, which means that one in every 5,000
events will trigger an in-kernel counter to be incremented. This is refactored
on the output, which is why it is always in multiples of 5,000.
We don't instrument every single event since the overhead would be prohibitive,
nor do we need to: this is a type of sampling profiler. Because of this, the
processes that trigger the 5,000'th cache reference or misses can happen to
some degree by chance. Overall it should make sense. But for low counts,
you might find a case where -- by chance -- a process has been tallied with
more misses than references, which would seem impossible.
USAGE message: USAGE message:
# ./llcstat.py --help # ./llcstat.py --help
......
This diff is collapsed.
This diff is collapsed.
...@@ -13,31 +13,22 @@ ...@@ -13,31 +13,22 @@
# and for efficiency it does not initialize the perf ring buffer, so the # and for efficiency it does not initialize the perf ring buffer, so the
# redundant perf samples are not collected. # redundant perf samples are not collected.
# #
# Kernel stacks are post-process in user-land to skip the interrupt framework # REQUIRES: Linux 4.9+ (BPF_PROG_TYPE_PERF_EVENT support). Under tools/old is
# frames. You can improve efficiency a little by specifying the exact number # a version of this tool that may work on Linux 4.6 - 4.8.
# of frames to skip with -s, provided you know what that is. If you get -s
# wrong, note that the first line is the IP, and then the (skipped) stack.
#
# Note: if another perf-based sampling session is active, the output may become
# polluted with their events. On older kernels, the ouptut may also become
# polluted with tracing sessions (when the kprobe is used instead of the
# tracepoint). If this becomes a problem, logic can be added to filter events.
#
# REQUIRES: Linux 4.6+ (BPF_MAP_TYPE_STACK_TRACE support), and the
# perf_misc_flags() function symbol to exist. The latter may or may not
# exist depending on your kernel build. Linux 4.9 provides a proper solution
# to this (this tool will be updated).
# #
# Copyright 2016 Netflix, Inc. # Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License") # Licensed under the Apache License, Version 2.0 (the "License")
# #
# THANKS: Sasha Goldshtein, Andrew Birchall, and Evgeny Vereshchagin, who wrote # THANKS: Alexei Starovoitov, who added proper BPF profiling support to Linux;
# much of the code here, borrowed from tracepoint.py and offcputime.py. # Sasha Goldshtein, Andrew Birchall, and Evgeny Vereshchagin, who wrote much
# of the code here, borrowed from tracepoint.py and offcputime.py; and
# Teng Qin, who added perf support in bcc.
# #
# 15-Jul-2016 Brendan Gregg Created this. # 15-Jul-2016 Brendan Gregg Created this.
# 20-Oct-2016 " " Switched to use the new 4.9 support.
from __future__ import print_function from __future__ import print_function
from bcc import BPF, Perf from bcc import BPF, PerfType, PerfSWConfig
from sys import stderr from sys import stderr
from time import sleep from time import sleep
import argparse import argparse
...@@ -77,7 +68,6 @@ examples = """examples: ...@@ -77,7 +68,6 @@ examples = """examples:
./profile -p 185 # only profile threads for PID 185 ./profile -p 185 # only profile threads for PID 185
./profile -U # only show user space stacks (no kernel) ./profile -U # only show user space stacks (no kernel)
./profile -K # only show kernel space stacks (no user) ./profile -K # only show kernel space stacks (no user)
./profile -S 11 # always skip 11 frames of kernel stack
""" """
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Profile CPU stack traces at a timed interval", description="Profile CPU stack traces at a timed interval",
...@@ -104,15 +94,12 @@ parser.add_argument("--stack-storage-size", default=2048, ...@@ -104,15 +94,12 @@ parser.add_argument("--stack-storage-size", default=2048,
type=positive_nonzero_int, type=positive_nonzero_int,
help="the number of unique stack traces that can be stored and " help="the number of unique stack traces that can be stored and "
"displayed (default 2048)") "displayed (default 2048)")
parser.add_argument("-S", "--kernel-skip", type=positive_int, default=0,
help="skip this many kernel frames (default 3)")
parser.add_argument("duration", nargs="?", default=99999999, parser.add_argument("duration", nargs="?", default=99999999,
type=positive_nonzero_int, type=positive_nonzero_int,
help="duration of trace, in seconds") help="duration of trace, in seconds")
# option logic # option logic
args = parser.parse_args() args = parser.parse_args()
skip = args.kernel_skip
pid = int(args.pid) if args.pid is not None else -1 pid = int(args.pid) if args.pid is not None else -1
duration = int(args.duration) duration = int(args.duration)
debug = 0 debug = 0
...@@ -127,6 +114,7 @@ need_delimiter = args.delimited and not (args.kernel_stacks_only or ...@@ -127,6 +114,7 @@ need_delimiter = args.delimited and not (args.kernel_stacks_only or
# define BPF program # define BPF program
bpf_text = """ bpf_text = """
#include <uapi/linux/ptrace.h> #include <uapi/linux/ptrace.h>
#include <uapi/linux/bpf_perf_event.h>
#include <linux/sched.h> #include <linux/sched.h>
struct key_t { struct key_t {
...@@ -143,7 +131,7 @@ BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE) ...@@ -143,7 +131,7 @@ BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)
// This code gets a bit complex. Probably not suitable for casual hacking. // This code gets a bit complex. Probably not suitable for casual hacking.
PERF_TRACE_EVENT { int do_perf_event(struct bpf_perf_event_data *ctx) {
u32 pid = bpf_get_current_pid_tgid(); u32 pid = bpf_get_current_pid_tgid();
if (!(THREAD_FILTER)) if (!(THREAD_FILTER))
return 0; return 0;
...@@ -160,7 +148,7 @@ PERF_TRACE_EVENT { ...@@ -160,7 +148,7 @@ PERF_TRACE_EVENT {
if (key.kernel_stack_id >= 0) { if (key.kernel_stack_id >= 0) {
// populate extras to fix the kernel stack // populate extras to fix the kernel stack
struct pt_regs regs = {}; struct pt_regs regs = {};
bpf_probe_read(&regs, sizeof(regs), (void *)REGS_LOCATION); bpf_probe_read(&regs, sizeof(regs), (void *)&ctx->regs);
u64 ip = PT_REGS_IP(&regs); u64 ip = PT_REGS_IP(&regs);
// if ip isn't sane, leave key ips as zero for later checking // if ip isn't sane, leave key ips as zero for later checking
...@@ -170,17 +158,6 @@ PERF_TRACE_EVENT { ...@@ -170,17 +158,6 @@ PERF_TRACE_EVENT {
if (ip > PAGE_OFFSET) { if (ip > PAGE_OFFSET) {
#endif #endif
key.kernel_ip = ip; key.kernel_ip = ip;
if (DO_KERNEL_RIP) {
/*
* User didn't specify a skip value (-s), so we will figure
* out how many interrupt framework frames to skip by recording
* the kernel rip, then later scanning for it on the stack.
* This is likely x86_64 specific; can use -s as a workaround
* until this supports your architecture.
*/
bpf_probe_read(&key.kernel_ret_ip, sizeof(key.kernel_ret_ip),
(void *)(regs.bp + 8));
}
} }
} }
...@@ -206,10 +183,11 @@ bpf_text = bpf_text.replace('THREAD_FILTER', thread_filter) ...@@ -206,10 +183,11 @@ bpf_text = bpf_text.replace('THREAD_FILTER', thread_filter)
bpf_text = bpf_text.replace('STACK_STORAGE_SIZE', str(args.stack_storage_size)) bpf_text = bpf_text.replace('STACK_STORAGE_SIZE', str(args.stack_storage_size))
# handle stack args # handle stack args
kernel_stack_get = "stack_traces.get_stackid(args, " \ kernel_stack_get = \
"%d | BPF_F_REUSE_STACKID)" % skip "stack_traces.get_stackid(&ctx->regs, 0 | BPF_F_REUSE_STACKID)"
user_stack_get = \ user_stack_get = \
"stack_traces.get_stackid(args, BPF_F_REUSE_STACKID | BPF_F_USER_STACK)" "stack_traces.get_stackid(&ctx->regs, 0 | BPF_F_REUSE_STACKID | " \
"BPF_F_USER_STACK)"
stack_context = "" stack_context = ""
if args.user_stacks_only: if args.user_stacks_only:
stack_context = "user" stack_context = "user"
...@@ -221,12 +199,6 @@ else: ...@@ -221,12 +199,6 @@ else:
stack_context = "user + kernel" stack_context = "user + kernel"
bpf_text = bpf_text.replace('USER_STACK_GET', user_stack_get) bpf_text = bpf_text.replace('USER_STACK_GET', user_stack_get)
bpf_text = bpf_text.replace('KERNEL_STACK_GET', kernel_stack_get) bpf_text = bpf_text.replace('KERNEL_STACK_GET', kernel_stack_get)
if skip:
# don't record the rip, as we won't use it
bpf_text = bpf_text.replace('DO_KERNEL_RIP', '0')
else:
# rip is used to skip interrupt infrastructure frames
bpf_text = bpf_text.replace('DO_KERNEL_RIP', '1')
# header # header
if not args.folded: if not args.folded:
...@@ -237,41 +209,19 @@ if not args.folded: ...@@ -237,41 +209,19 @@ if not args.folded:
else: else:
print("... Hit Ctrl-C to end.") print("... Hit Ctrl-C to end.")
# kprobe perf_misc_flags()
bpf_text = bpf_text.replace('PERF_TRACE_EVENT',
'int kprobe__perf_misc_flags(struct pt_regs *args)')
bpf_text = bpf_text.replace('REGS_LOCATION', 'PT_REGS_PARM1(args)')
if debug: if debug:
print(bpf_text) print(bpf_text)
# initialize BPF # initialize BPF & perf_events
try: b = BPF(text=bpf_text)
b = BPF(text=bpf_text) b.attach_perf_event(ev_type=PerfType.SOFTWARE,
except: ev_config=PerfSWConfig.CPU_CLOCK, fn_name="do_perf_event",
print("BPF initialization failed. perf_misc_flags() may be inlined in " + sample_period=0, sample_freq=args.frequency)
"your kernel build.\nThis tool will be updated in the future to " +
"support Linux 4.9, which has reliable profiling support. Exiting.")
exit()
# signal handler # signal handler
def signal_ignore(signal, frame): def signal_ignore(signal, frame):
print() print()
#
# Setup perf_events
#
# use perf_events to sample
try:
Perf.perf_event_open(0, pid=-1, ptype=Perf.PERF_TYPE_SOFTWARE,
freq=args.frequency)
except:
print("ERROR: initializing perf_events for sampling.\n"
"To debug this, try running the following command:\n"
" perf record -F 49 -e cpu-clock %s -- sleep 1\n"
"If that also doesn't work, fix it first." % perf_filter, file=stderr)
exit(0)
# #
# Output Report # Output Report
# #
...@@ -317,19 +267,9 @@ for k, v in sorted(counts.items(), key=lambda counts: counts[1].value): ...@@ -317,19 +267,9 @@ for k, v in sorted(counts.items(), key=lambda counts: counts[1].value):
# fix kernel stack # fix kernel stack
kernel_stack = [] kernel_stack = []
if k.kernel_stack_id >= 0: if k.kernel_stack_id >= 0:
if skip:
# fixed skip
for addr in kernel_tmp:
kernel_stack.append(addr)
kernel_stack = kernel_stack[skip:]
else:
# skip the interrupt framework stack by searching for our RIP
skipping = 1
for addr in kernel_tmp: for addr in kernel_tmp:
if k.kernel_ret_ip == addr:
skipping = 0
if not skipping:
kernel_stack.append(addr) kernel_stack.append(addr)
# the later IP checking
if k.kernel_ip: if k.kernel_ip:
kernel_stack.insert(0, k.kernel_ip) kernel_stack.insert(0, k.kernel_ip)
......
...@@ -702,53 +702,11 @@ WARNING: 8 stack traces could not be displayed. Consider increasing --stack-stor ...@@ -702,53 +702,11 @@ WARNING: 8 stack traces could not be displayed. Consider increasing --stack-stor
Run ./profile -h to see the default. Run ./profile -h to see the default.
There is a -S option to skip kernel frames. You probably don't need to mess
with this. Here's why it exists: consider the following kernel stack trace,
and IP:
ffffffff81174e78 perf_swevent_hrtimer
ffffffff810e6984 __hrtimer_run_queues
ffffffff810e70f8 hrtimer_interrupt
ffffffff81022c69 xen_timer_interrupt
ffffffff810d2942 handle_irq_event_percpu
ffffffff810d62da handle_percpu_irq
ffffffff810d1f52 generic_handle_irq
ffffffff814a5137 evtchn_2l_handle_events
ffffffff814a2853 __xen_evtchn_do_upcall
ffffffff814a4740 xen_evtchn_do_upcall
ffffffff817cd50c xen_hvm_callback_vector
ffffffff8103663e default_idle
ffffffff81036dbf arch_cpu_idle
ffffffff810bb8ea default_idle_call
ffffffff810bbb97 cpu_startup_entry
ffffffff8104df85 start_secondary
IP: ffffffff8105eb66 native_safe_halt
This is the idle thread. The first function is native_safe_halt(), and its
parent is default_idle(). But what you see there is really what we are
profiling. All that stuff above default_idle()? Interrupt framework stack.
So we have to exclude those interrupt frames. I do this by fetching the ret IP
from the kernel stack, and then scanning for it in user-level: in this case
it would be default_idle(). Ok.
If this doesn't work on your architecture (and your kernel stacks are a
single line, the IP), then you might consider setting a fixed skip count,
which avoids this ret IP logic. For the above stack, I'd set "-S 11", and
it would slice off those 11 interrupt frames nicely. It also does this in
kernel context for efficiency.
So how do you figure out what number to use? 11? 14? 5? Well.. Try "-S 1",
and then see how much higher you need to set it. Remember on the real
profile output that the IP line is printed on top of the sliced stack.
USAGE message: USAGE message:
# ./profile -h # ./profile -h
usage: profile [-h] [-p PID] [-U | -K] [-F FREQUENCY] [-d] [-a] [-f] usage: profile [-h] [-p PID] [-U | -K] [-F FREQUENCY] [-d] [-a] [-f]
[--stack-storage-size STACK_STORAGE_SIZE] [-S KERNEL_SKIP] [--stack-storage-size STACK_STORAGE_SIZE]
[duration] [duration]
Profile CPU stack traces at a timed interval Profile CPU stack traces at a timed interval
...@@ -774,8 +732,6 @@ optional arguments: ...@@ -774,8 +732,6 @@ optional arguments:
--stack-storage-size STACK_STORAGE_SIZE --stack-storage-size STACK_STORAGE_SIZE
the number of unique stack traces that can be stored the number of unique stack traces that can be stored
and displayed (default 2048) and displayed (default 2048)
-S KERNEL_SKIP, --kernel-skip KERNEL_SKIP
skip this many kernel frames (default 3)
examples: examples:
./profile # profile stack traces at 49 Hertz until Ctrl-C ./profile # profile stack traces at 49 Hertz until Ctrl-C
...@@ -785,4 +741,3 @@ examples: ...@@ -785,4 +741,3 @@ examples:
./profile -p 185 # only profile threads for PID 185 ./profile -p 185 # only profile threads for PID 185
./profile -U # only show user space stacks (no kernel) ./profile -U # only show user space stacks (no kernel)
./profile -K # only show kernel space stacks (no user) ./profile -K # only show kernel space stacks (no user)
./profile -S 11 # always skip 11 frames of kernel stack
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment