Commit 3286c063 authored by Brendan Gregg's avatar Brendan Gregg Committed by GitHub

Merge branch 'master' into strcmp

parents c8f752f7 797c3ec1
......@@ -109,6 +109,7 @@ Examples:
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[profile](tools/profile.py): Profile CPU usage by sampling stack traces at a timed interval. [Examples](tools/profile_example.txt).
- tools/[reset-trace](tools/reset-trace.sh): Reset the state of tracing. Maintenance tool only. [Examples](tools/reset-trace_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[solisten](tools/solisten.py): Trace TCP socket listen. [Examples](tools/solisten_example.txt).
......@@ -120,6 +121,7 @@ Examples:
- tools/[tcpaccept](tools/tcpaccept.py): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
- tools/[tcpconnect](tools/tcpconnect.py): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
- tools/[tcpconnlat](tools/tcpconnlat.py): Trace TCP active connection latency (connect()). [Examples](tools/tcpconnlat_example.txt).
- tools/[tcplife](tools/tcplife.py): Trace TCP sessions and summarize lifespan. [Examples](tools/tcplife_example.txt).
- tools/[tcpretrans](tools/tcpretrans.py): Trace TCP retransmits and TLPs. [Examples](tools/tcpretrans_example.txt).
- tools/[tcptop](tools/tcptop.py): Summarize TCP send/recv throughput by host. Top for TCP. [Examples](tools/tcptop_example.txt).
- tools/[tplist](tools/tplist.py): Display kernel tracepoints or USDT probes and their formats. [Examples](tools/tplist_example.txt).
......
.TH reset-trace 8 "2016-10-18" "USER COMMANDS"
.SH NAME
reset-trace \- reset the state of tracing.
.SH SYNOPSIS
.B reset-trace [\-F] [\-h] [\-q] [\-v]
.SH DESCRIPTION
You will probably never need this tool. If you kill \-9 a bcc tool (plus other
signals, like SIGTERM), or if a bcc tool crashes, then kernel tracing can be
left in a semi-enabled state. It's not as bad as it sounds: there may just be
overhead for writing to ring buffers that are never read. This tool can be
used to clean up the tracing state, and reset and disable active tracing.
Make sure no other tracing sessions are active. This tool might stop them from
functioning (perhaps ungracefully).
This specifically clears the state in at least the following files in
/sys/kernel/debug/tracing: kprobe_events, uprobe_events, trace_pipe.
Other tracing facilities (ftrace) are checked, and if not in an expected state,
a note is printed. All tracing files can be reset with \-F for force, but this
will interfere with any other running tracing sessions (eg, ftrace).
.SH REQUIREMENTS
/sys/kernel/debug mounted as debugfs
.SH OPTIONS
.TP
\-F
Force. Will reset all tracing facilities, including those not used by bcc
(ftrace). You shouldn't need to use this.
.TP
\-h
USAGE message.
.TP
\-q
Quiet. No output while working.
.TP
\-v
Verbose: print what it is doing.
.SH EXAMPLES
.TP
Reset the state of tracing:
#
.B reset-trace
.TP
Verbose:
#
.B reset-trace \-v
.TP
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.TH tcplife 8 "2016-10-19" "USER COMMANDS"
.SH NAME
tcplife \- Trace TCP sessions and summarize lifespan. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B tcplife [\-h] [\-T] [\-t] [\-w] [\-s] [\-p PID] [\-D PORTS] [\-L PORTS]
.SH DESCRIPTION
This tool traces TCP sessions that open and close while tracing, and prints
a line of output to summarize each one. This includes the IP addresses, ports,
duration, and throughput for the session. This is useful for workload
characterisation and flow accounting: identifying what connections are
happening, with the bytes transferred.
This tool works by using kernel dynamic tracing, and will need to be updated
if the kernel implementation changes. Only TCP state changes are traced, so
it is expected that the overhead of this tool is much lower than typical
send/receive tracing.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bcc.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-s
Comma separated values output (parseable).
.TP
\-t
Include a timestamp column (seconds).
.TP
\-T
Include a time column (HH:MM:SS).
.TP
\-w
Wide column output (fits IPv6 addresses).
.TP
\-p PID
Trace this process ID only (filtered in-kernel).
.TP
\-L PORTS
Comma-separated list of local ports to trace (filtered in-kernel).
.TP
\-D PORTS
Comma-separated list of destination ports to trace (filtered in-kernel).
.SH EXAMPLES
.TP
Trace all TCP sessions, and summarize lifespan and throughput:
#
.B tcplife
.TP
Include a timestamp column, and wide column output:
#
.B tcplife \-tw
.TP
Trace PID 181 only:
#
.B tcplife \-p 181
.TP
Trace connections to local ports 80 and 81 only:
#
.B tcplife \-L 80,81
.TP
Trace connections to remote port 80 only:
#
.B tcplife \-D 80
.SH FIELDS
.TP
TIME
Time of the call, in HH:MM:SS format.
.TP
TIME(s)
Time of the call, in seconds.
.TP
PID
Process ID
.TP
COMM
Process name
.TP
IP
IP address family (4 or 6)
.TP
LADDR
Local IP address.
.TP
DADDR
Remote IP address.
.TP
LPORT
Local port.
.TP
DPORT
Destination port.
.TP
TX_KB
Total transmitted Kbytes.
.TP
RX_KB
Total received Kbytes.
.TP
MS
Lifespan of the session, in milliseconds.
.SH OVERHEAD
This traces the kernel TCP set state function, which should be called much
less often than send/receive tracing, and therefore have lower overhead. The
overhead of the tool is relative to the rate of new TCP sessions: if this is
high, over 10,000 per second, then there may be noticable overhead just to
print out 10k lines of formatted output per second.
You can find out the rate of new TCP sessions using "sar \-n TCP 1", and
adding the active/s and passive/s columns.
As always, test and understand this tools overhead for your types of
workloads before production use.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion _examples.txt file containing
example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
tcpaccept(8), tcpconnect(8), tcptop(8)
......@@ -27,6 +27,12 @@ import traceback
debug = False
def verify_limit(num):
probe_limit = 1000
if num > probe_limit:
raise Exception("maximum of %d probes allowed, attempted %d" %
(probe_limit, num))
class Probe(object):
def __init__(self, pattern, use_regex=False, pid=None):
"""Init a new probe.
......@@ -117,7 +123,9 @@ class Probe(object):
self.usdt = None
text = ""
if self.type == "p" and not self.library:
for function in BPF.get_kprobe_functions(self.pattern):
functions = BPF.get_kprobe_functions(self.pattern)
verify_limit(len(functions))
for function in functions:
text += self._add_function(template, function)
elif self.type == "p" and self.library:
# uprobes are tricky because the same function may have multiple
......@@ -127,25 +135,33 @@ class Probe(object):
# map to an address that we've already seen. Also ignore functions
# that may repeat multiple times with different addresses.
addresses, functions = (set(), set())
for function, address in BPF.get_user_functions_and_addresses(
self.library, self.pattern):
functions_and_addresses = BPF.get_user_functions_and_addresses(
self.library, self.pattern)
verify_limit(len(functions_and_addresses))
for function, address in functions_and_addresses:
if address in addresses or function in functions:
continue
addresses.add(address)
functions.add(function)
text += self._add_function(template, function)
elif self.type == "t":
for tracepoint in BPF.get_tracepoints(self.pattern):
tracepoints = BPF.get_tracepoints(self.pattern)
verify_limit(len(tracepoints))
for tracepoint in tracepoints:
text += self._add_function(template, tracepoint)
elif self.type == "u":
self.usdt = USDT(path=self.library, pid=self.pid)
matches = []
for probe in self.usdt.enumerate_probes():
if not self.pid and (probe.bin_path != self.library):
continue
if re.match(self.pattern, probe.name):
new_func = "trace_count_%d" % self.matched
text += self._add_function(template, probe.name)
self.usdt.enable_probe(probe.name, new_func)
matches.append(probe.name)
verify_limit(len(matches))
for match in matches:
new_func = "trace_count_%d" % self.matched
text += self._add_function(template, match)
self.usdt.enable_probe(match, new_func)
if debug:
print(self.usdt.get_text())
return text
......
#!/bin/bash
#
# reset-trace - reset state of tracing, disabling all tracing.
# Written for Linux.
#
# If a bcc tool crashed and you suspect tracing is partially enabled, you
# can use this tool to reset the state of tracing, disabling anything still
# enabled. Only use this tool in the case of error, and, consider filing a
# bcc ticket so we can fix the error.
#
# bcc-used tracing facilities are reset. Other tracing facilities (ftrace) are
# checked, and if not in an expected state, a note is printed. All tracing
# files can be reset with -F for force, but this will interfere with any other
# running tracing sessions (eg, ftrace).
#
# USAGE: ./reset-trace [-Fhqv]
#
# REQUIREMENTS: debugfs mounted on /sys/kernel/debug
#
# COPYRIGHT: Copyright (c) 2016 Brendan Gregg.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 20-Jul-2014 Brendan Gregg Created this.
# 18-Oct-2016 " " Updated for bcc use.
tracing=/sys/kernel/debug/tracing
opt_force=0; opt_verbose=0; opt_quiet=0
function usage {
cat <<-END >&2
USAGE: reset-trace [-Fhqv]
-F # force: reset all tracing files
-v # verbose: print details while working
-h # this usage message
-q # quiet: no output
eg,
reset-trace # disable semi-enabled tracing
END
exit
}
function die {
echo >&2 "$@"
exit 1
}
function vecho {
(( ! opt_verbose )) && return
echo "$@"
}
function writefile {
file=$1
write=$2
if [[ ! -w $file ]]; then
echo >&2 "WARNING: file $file not writable/exists. Skipping."
return
fi
vecho "Checking $PWD/$file"
contents=$(grep -v '^#' $file)
if [[ "$contents" != "$expected" ]]; then
(( ! opt_quiet )) && echo "Needed to reset $PWD/$file"
vecho "$file, before (line enumerated):"
(( opt_verbose )) && cat -nv $file
cmd="echo $write > $file"
if ! eval "$cmd"; then
echo >&2 "WARNING: command failed \"$cmd\"." \
"bcc still running? Continuing."
fi
vecho "$file, after (line enumerated):"
(( opt_verbose )) && cat -nv $file
vecho
fi
}
# only write when force is used
function checkfile {
file=$1
write=$2
expected=$3
if [[ ! -e $file ]]; then
echo >&2 "WARNING: file $file doesn't exist. Skipping."
return
fi
if (( opt_force )); then
writefile $file $write
return
fi
(( opt_quiet )) && return
vecho "Checking $PWD/$file"
contents=$(grep -v '^#' $file)
if [[ "$contents" != "$expected" ]]; then
echo "Noticed unrelated tracing file $PWD/$file isn't set as" \
"expected. Not reseting (-F to force, -v for verbose)."
vecho "Contents of $file is (line enumerated):"
(( opt_verbose )) && cat -nv $file
vecho "Expected \"$expected\"."
fi
}
### process options
while getopts Fhqv opt
do
case $opt in
F) opt_force=1 ;;
q) opt_quiet=1 ;;
v) opt_verbose=1 ;;
h|?) usage ;;
esac
done
shift $(( $OPTIND - 1 ))
### reset tracing state
vecho "Reseting tracing state..."
vecho
cd $tracing || die "ERROR: accessing tracing. Root user? /sys/kernel/debug?"
# files bcc uses
writefile kprobe_events "" ""
writefile uprobe_events "" ""
writefile trace "" "" # clears trace_pipe
# non-bcc files
checkfile current_tracer nop nop
checkfile set_ftrace_filter "" ""
checkfile set_graph_function "" ""
checkfile set_ftrace_pid "" "no pid"
checkfile events/enable 0 0
checkfile tracing_thresh 0 0
checkfile tracing_on 1 1
vecho
vecho "Done."
Demonstrations of reset-trace, for Linux bcc/BPF.
You will probably never need this tool. If you kill -9 a bcc tool (plus other
signals, like SIGTERM), or if a bcc tool crashes, then kernel tracing can be
left in a semi-enabled state. It's not as bad as it sounds: there may just be
overhead for writing to ring buffers that are never read. This tool can be
used to clean up the tracing state, and reset and disable active tracing.
WARNING: Make sure no other tracing sessions are active, as it will likely
stop them from functioning (perhaps ungracefully).
This specifically clears the state in at least the following files in
/sys/kernel/debug/tracing: kprobe_events, uprobe_events, trace_pipe.
Other tracing facilities (ftrace) are checked, and if not in an expected state,
a note is printed. All tracing files can be reset with -F for force, but this
will interfere with any other running tracing sessions (eg, ftrace).
Here's an example:
# ./reset-trace.sh
#
That's it.
You can use -v to see what it does:
# ./reset-trace.sh -v
Reseting tracing state...
Checking /sys/kernel/debug/tracing/kprobe_events
Checking /sys/kernel/debug/tracing/uprobe_events
Checking /sys/kernel/debug/tracing/trace
Checking /sys/kernel/debug/tracing/current_tracer
Checking /sys/kernel/debug/tracing/set_ftrace_filter
Checking /sys/kernel/debug/tracing/set_graph_function
Checking /sys/kernel/debug/tracing/set_ftrace_pid
Checking /sys/kernel/debug/tracing/events/enable
Checking /sys/kernel/debug/tracing/tracing_thresh
Checking /sys/kernel/debug/tracing/tracing_on
Done.
In this example, no resetting was necessary.
Here's an example of actually needing it:
# ./funccount 'bash:r*'
Tracing 317 functions for "bash:r*"... Hit Ctrl-C to end.
^C
FUNC COUNT
rl_free_undo_list 1
rl_deprep_terminal 1
readline_internal_teardown 1
rl_on_new_line 1
rl_crlf 1
rl_clear_signals 1
rl_prep_terminal 1
rl_reset_line_state 1
rl_initialize 1
rl_newline 1
readline_internal_setup 1
rl_set_screen_size 1
readline 1
rl_set_signals 1
rl_expand_prompt 1
replace_history_data 1
rl_set_prompt 1
rl_add_undo 1
rl_insert_text 2
rl_insert 2
rl_redisplay 3
rl_read_key 3
rl_getc 3
readline_internal_char 3
restore_parser_state 6
reap_dead_jobs 6
reset_parser 6
restore_input_line_state 6
realloc 7
read_octal 10
read_tty_modified 13
run_exit_trap 13
redirection_expand 13
restore_pipestatus_array 18
reader_loop 20
run_return_trap 21
remember_args 25
reset_signal_handlers 30
remove_quoted_escapes 60
run_unwind_frame 102
reset_terminating_signals 125
restore_original_signals 139
reset_internal_getopt 405
run_debug_trap 719
read_command 940
remove_quoted_nulls 1830
run_pending_traps 3207
^C
^C
^C
I've traced 317 functions using funccount, and when I hit Ctrl-C, funccount is
not exiting (it can normally take many seconds, but this really looks stuck):
# pidstat 1
Linux 4.9.0-rc1-virtual (bgregg-xenial-bpf-i-xxx) 10/18/2016 _x86_64_ (8 CPU)
10:00:33 PM UID PID %usr %system %guest %CPU CPU Command
10:00:34 PM 60004 3277 0.00 0.98 0.00 0.98 0 redis-server
10:00:34 PM 0 27980 87.25 10.78 0.00 98.04 3 funccount.py
10:00:34 PM 0 29965 0.00 0.98 0.00 0.98 6 pidstat
10:00:34 PM UID PID %usr %system %guest %CPU CPU Command
10:00:35 PM 65534 3276 0.00 1.00 0.00 1.00 2 multilog
10:00:35 PM 0 27980 77.00 23.00 0.00 100.00 3 funccount.py
10:00:35 PM 0 29965 0.00 1.00 0.00 1.00 6 pidstat
10:00:35 PM 60004 29990 0.00 1.00 0.00 1.00 6 catalina.sh
funccount looks a lot like it's in an infinite loop (I can use a stack-sampling
profiler to confirm). This is a known bug (#665) and may be fixed by the time
you read this. But right now it's a good example of needing reset-trace.
I'll send a SIGTERM, before resorting to a SIGKILL:
# kill 27980
Terminated
Ok, so the process is now gone, but it did leave tracing in a semi-enabled
state. Using reset-trace:
# ./reset-trace.sh -v
Reseting tracing state...
Checking /sys/kernel/debug/tracing/kprobe_events
Checking /sys/kernel/debug/tracing/uprobe_events
Needed to reset /sys/kernel/debug/tracing/uprobe_events
uprobe_events, before (line enumerated):
1 p:uprobes/p__bin_bash_0xa2540 /bin/bash:0x00000000000a2540
2 p:uprobes/p__bin_bash_0x21220 /bin/bash:0x0000000000021220
3 p:uprobes/p__bin_bash_0x78530 /bin/bash:0x0000000000078530
4 p:uprobes/p__bin_bash_0xa3840 /bin/bash:0x00000000000a3840
5 p:uprobes/p__bin_bash_0x9c550 /bin/bash:0x000000000009c550
6 p:uprobes/p__bin_bash_0x5e360 /bin/bash:0x000000000005e360
7 p:uprobes/p__bin_bash_0xb2630 /bin/bash:0x00000000000b2630
8 p:uprobes/p__bin_bash_0xb1e70 /bin/bash:0x00000000000b1e70
9 p:uprobes/p__bin_bash_0xb2540 /bin/bash:0x00000000000b2540
10 p:uprobes/p__bin_bash_0xb16e0 /bin/bash:0x00000000000b16e0
[...]
312 p:uprobes/p__bin_bash_0xa80b0 /bin/bash:0x00000000000a80b0
313 p:uprobes/p__bin_bash_0x9e280 /bin/bash:0x000000000009e280
314 p:uprobes/p__bin_bash_0x9e100 /bin/bash:0x000000000009e100
315 p:uprobes/p__bin_bash_0xb2bd0 /bin/bash:0x00000000000b2bd0
316 p:uprobes/p__bin_bash_0x9d9c0 /bin/bash:0x000000000009d9c0
317 p:uprobes/p__bin_bash_0x4a930 /bin/bash:0x000000000004a930
uprobe_events, after (line enumerated):
Checking /sys/kernel/debug/tracing/trace
Checking /sys/kernel/debug/tracing/current_tracer
Checking /sys/kernel/debug/tracing/set_ftrace_filter
Checking /sys/kernel/debug/tracing/set_graph_function
Checking /sys/kernel/debug/tracing/set_ftrace_pid
Checking /sys/kernel/debug/tracing/events/enable
Checking /sys/kernel/debug/tracing/tracing_thresh
Checking /sys/kernel/debug/tracing/tracing_on
Done.
Now looks clean. I did truncate the output here: there were a few hundred lines
from uprobe_events.
Here's the same situation, but without the verbose option:
# ./reset-trace.sh
Needed to reset /sys/kernel/debug/tracing/uprobe_events
#
And again with quiet:
# ./reset-trace.sh -q
#
Here is an example of reset-trace detecting an unrelated tracing session:
# ./reset-trace.sh
Noticed unrelated tracing file /sys/kernel/debug/tracing/set_ftrace_filter isn't set as expected. Not reseting (-F to force, -v for verbose).
And verbose:
# ./reset-trace.sh -v
Reseting tracing state...
Checking /sys/kernel/debug/tracing/kprobe_events
Checking /sys/kernel/debug/tracing/uprobe_events
Checking /sys/kernel/debug/tracing/trace
Checking /sys/kernel/debug/tracing/current_tracer
Checking /sys/kernel/debug/tracing/set_ftrace_filter
Noticed unrelated tracing file /sys/kernel/debug/tracing/set_ftrace_filter isn't set as expected. Not reseting (-F to force, -v for verbose).
Contents of set_ftrace_filter is (line enumerated):
1 tcp_send_mss
2 tcp_sendpage
3 tcp_sendmsg
4 tcp_send_dupack
5 tcp_send_challenge_ack.isra.53
6 tcp_send_rcvq
7 tcp_send_ack
8 tcp_send_loss_probe
9 tcp_send_fin
10 tcp_send_active_reset
11 tcp_send_synack
12 tcp_send_delayed_ack
13 tcp_send_window_probe
14 tcp_send_probe0
Expected "".
Checking /sys/kernel/debug/tracing/set_graph_function
Checking /sys/kernel/debug/tracing/set_ftrace_pid
Checking /sys/kernel/debug/tracing/events/enable
Checking /sys/kernel/debug/tracing/tracing_thresh
Checking /sys/kernel/debug/tracing/tracing_on
Done.
So this file is not currently used by bcc, but it may be useful to know that
it's not in the default state -- something is either using it or has left it
enabled. These files can be reset with -F, but that may break other tools that
are currently using them.
Use -h to print the USAGE message:
# ./reset-trace.sh -h
USAGE: reset-trace [-Fhqv]
-F # force: reset all tracing files
-v # verbose: print details while working
-h # this usage message
-q # quiet: no output
eg,
reset-trace # disable semi-enabled tracing
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# tcplife Trace the lifespan of TCP sessions and summarize.
# For Linux, uses BCC, BPF. Embedded C.
#
# USAGE: tcplife [-h] [-C] [-S] [-p PID] [interval [count]]
#
# This uses dynamic tracing of kernel functions, and will need to be updated
# to match kernel changes.
#
# While throughput counters are emitted, they are fetched in a low-overhead
# manner: reading members of the tcp_info struct on TCP close. ie, we do not
# trace send/receive.
#
# Copyright 2016 Netflix, Inc.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# IDEA: Julia Evans
#
# 18-Oct-2016 Brendan Gregg Created this.
from __future__ import print_function
from bcc import BPF
import argparse
from socket import inet_ntop, ntohs, AF_INET, AF_INET6
from struct import pack
import ctypes as ct
from time import strftime
# arguments
examples = """examples:
./tcplife # trace all TCP connect()s
./tcplife -t # include time column (HH:MM:SS)
./tcplife -w # wider colums (fit IPv6)
./tcplife -stT # csv output, with times & timestamps
./tcplife -p 181 # only trace PID 181
./tcplife -L 80 # only trace local port 80
./tcplife -L 80,81 # only trace local ports 80 and 81
./tcplife -D 80 # only trace remote port 80
"""
parser = argparse.ArgumentParser(
description="Trace the lifespan of TCP sessions and summarize",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
parser.add_argument("-T", "--time", action="store_true",
help="include time column on output (HH:MM:SS)")
parser.add_argument("-t", "--timestamp", action="store_true",
help="include timestamp on output (seconds)")
parser.add_argument("-w", "--wide", action="store_true",
help="wide column output (fits IPv6 addresses)")
parser.add_argument("-s", "--csv", action="store_true",
help="comma seperated values output")
parser.add_argument("-p", "--pid",
help="trace this PID only")
parser.add_argument("-L", "--localport",
help="comma-separated list of local ports to trace.")
parser.add_argument("-D", "--remoteport",
help="comma-separated list of remote ports to trace.")
args = parser.parse_args()
debug = 0
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#define KBUILD_MODNAME "foo"
#include <linux/tcp.h>
#include <net/sock.h>
#include <bcc/proto.h>
BPF_HASH(birth, struct sock *, u64);
// separate data structs for ipv4 and ipv6
struct ipv4_data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
u64 pid;
u64 saddr;
u64 daddr;
u64 ports;
u64 rx_b;
u64 tx_b;
u64 span_us;
char task[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(ipv4_events);
struct ipv6_data_t {
u64 ts_us;
u64 pid;
unsigned __int128 saddr;
unsigned __int128 daddr;
u64 ports;
u64 rx_b;
u64 tx_b;
u64 span_us;
char task[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(ipv6_events);
struct id_t {
u32 pid;
char task[TASK_COMM_LEN];
};
BPF_HASH(whoami, struct sock *, struct id_t);
int kprobe__tcp_set_state(struct pt_regs *ctx, struct sock *sk, int state)
{
u32 pid = bpf_get_current_pid_tgid() >> 32;
// lport is either used in a filter here, or later
u16 lport = sk->__sk_common.skc_num;
FILTER_LPORT
// dport is either used in a filter here, or later
u16 dport = sk->__sk_common.skc_dport;
FILTER_DPORT
/*
* This tool includes PID and comm context. It's best effort, and may
* be wrong in some situations. It currently works like this:
* - active connections: cache PID & comm on TCP_SYN_SENT
* - passive connections: read PID & comm on TCP_LAST_ACK
*/
// record PID & comm on SYN_SENT
if (state == TCP_SYN_SENT) {
FILTER_PID
struct id_t me = {.pid = pid};
bpf_get_current_comm(&me.task, sizeof(me.task));
whoami.update(&sk, &me);
}
// capture birth time
if (state < TCP_FIN_WAIT1) {
// matching just ESTABLISHED may be sufficient, provided no code-path
// sets ESTABLISHED without a tcp_set_state() call. Until we know
// that for sure, match all early states to increase chances a
// timestamp is set.
u64 ts = bpf_ktime_get_ns();
birth.update(&sk, &ts);
return 0;
}
// fetch possible cached data
struct id_t *mep;
mep = whoami.lookup(&sk);
// passive connection closing; wait until TCP_LAST_ACK
if (mep == 0 && state != TCP_LAST_ACK)
return 0;
if (state == TCP_LAST_ACK) {
FILTER_PID
}
// calculate lifespan
u64 *tsp, delta_us;
tsp = birth.lookup(&sk);
if (tsp == 0) {
return 0; // missed create
}
delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
// get throughput stats. see tcp_get_info().
u64 rx_b = 0, tx_b = 0, sport = 0;
struct tcp_sock *tp = (struct tcp_sock *)sk;
rx_b = tp->bytes_received;
tx_b = tp->bytes_acked;
u16 family = sk->__sk_common.skc_family;
if (family == AF_INET) {
struct ipv4_data_t data4 = {.span_us = delta_us,
.rx_b = rx_b, .tx_b = tx_b};
data4.ts_us = bpf_ktime_get_ns() / 1000;
data4.saddr = sk->__sk_common.skc_rcv_saddr;
data4.daddr = sk->__sk_common.skc_daddr;
// a workaround until data4 compiles with separate lport/dport
data4.ports = ntohs(dport) + ((0ULL + lport) << 32);
if (mep == 0) {
data4.pid = pid;
bpf_get_current_comm(&data4.task, sizeof(data4.task));
bpf_trace_printk("state %d 0pid %d\\n", state, pid);
} else {
data4.pid = mep->pid;
bpf_trace_printk("state %d mpid %d\\n", state, mep->pid);
bpf_probe_read(&data4.task, sizeof(data4.task), (void *)mep->task);
}
ipv4_events.perf_submit(ctx, &data4, sizeof(data4));
} else /* 6 */ {
struct ipv6_data_t data6 = {.span_us = delta_us,
.rx_b = rx_b, .tx_b = tx_b};
data6.ts_us = bpf_ktime_get_ns() / 1000;
bpf_probe_read(&data6.saddr, sizeof(data6.saddr),
sk->__sk_common.skc_v6_rcv_saddr.in6_u.u6_addr32);
bpf_probe_read(&data6.daddr, sizeof(data6.daddr),
sk->__sk_common.skc_v6_daddr.in6_u.u6_addr32);
// a workaround until data6 compiles with separate lport/dport
data6.ports = ntohs(dport) + ((0ULL + lport) << 32);
if (mep == 0) {
data6.pid = pid;
bpf_get_current_comm(&data6.task, sizeof(data6.task));
} else {
data6.pid = mep->pid;
bpf_probe_read(&data6.task, sizeof(data6.task), (void *)mep->task);
}
ipv6_events.perf_submit(ctx, &data6, sizeof(data6));
}
birth.delete(&sk);
if (mep != 0)
whoami.delete(&sk);
return 0;
}
"""
# code substitutions
if args.pid:
bpf_text = bpf_text.replace('FILTER_PID',
'if (pid != %s) { return 0; }' % args.pid)
if args.remoteport:
dports = [int(dport) for dport in args.remoteport.split(',')]
dports_if = ' && '.join(['dport != %d' % ntohs(dport) for dport in dports])
bpf_text = bpf_text.replace('FILTER_DPORT',
'if (%s) { birth.delete(&sk); return 0; }' % dports_if)
if args.localport:
lports = [int(lport) for lport in args.localport.split(',')]
lports_if = ' && '.join(['lport != %d' % lport for lport in lports])
bpf_text = bpf_text.replace('FILTER_LPORT',
'if (%s) { birth.delete(&sk); return 0; }' % lports_if)
bpf_text = bpf_text.replace('FILTER_PID', '')
bpf_text = bpf_text.replace('FILTER_DPORT', '')
bpf_text = bpf_text.replace('FILTER_LPORT', '')
if debug:
print(bpf_text)
# event data
TASK_COMM_LEN = 16 # linux/sched.h
class Data_ipv4(ct.Structure):
_fields_ = [
("ts_us", ct.c_ulonglong),
("pid", ct.c_ulonglong),
("saddr", ct.c_ulonglong),
("daddr", ct.c_ulonglong),
("ports", ct.c_ulonglong),
("rx_b", ct.c_ulonglong),
("tx_b", ct.c_ulonglong),
("span_us", ct.c_ulonglong),
("task", ct.c_char * TASK_COMM_LEN)
]
class Data_ipv6(ct.Structure):
_fields_ = [
("ts_us", ct.c_ulonglong),
("pid", ct.c_ulonglong),
("saddr", (ct.c_ulonglong * 2)),
("daddr", (ct.c_ulonglong * 2)),
("ports", ct.c_ulonglong),
("rx_b", ct.c_ulonglong),
("tx_b", ct.c_ulonglong),
("span_us", ct.c_ulonglong),
("task", ct.c_char * TASK_COMM_LEN)
]
#
# Setup output formats
#
# Don't change the default output (next 2 lines): this fits in 80 chars. I
# know it doesn't have NS or UIDs etc. I know. If you really, really, really
# need to add columns, columns that solve real actual problems, I'd start by
# adding an extended mode (-x) to included those columns.
#
header_string = "%-5s %-10.10s %s%-15s %-5s %-15s %-5s %5s %5s %s"
format_string = "%-5d %-10.10s %s%-15s %-5d %-15s %-5d %5d %5d %.2f"
if args.wide:
header_string = "%-5s %-16.16s %-2s %-26s %-5s %-26s %-5s %6s %6s %s"
format_string = "%-5d %-16.16s %-2s %-26s %-5s %-26s %-5d %6d %6d %.2f"
if args.csv:
header_string = "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s"
format_string = "%d,%s,%s,%s,%s,%s,%d,%d,%d,%.2f"
# process event
def print_ipv4_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data_ipv4)).contents
global start_ts
if args.time:
if args.csv:
print("%s," % strftime("%H:%M:%S"), end="")
else:
print("%-8s " % strftime("%H:%M:%S"), end="")
if args.timestamp:
if start_ts == 0:
start_ts = event.ts_us
delta_s = (float(event.ts_us) - start_ts) / 1000000
if args.csv:
print("%.6f," % delta_s, end="")
else:
print("%-9.6f " % delta_s, end="")
print(format_string % (event.pid, event.task,
"4" if args.wide or args.csv else "",
inet_ntop(AF_INET, pack("I", event.saddr)), event.ports >> 32,
inet_ntop(AF_INET, pack("I", event.daddr)), event.ports & 0xffffffff,
event.tx_b / 1024, event.rx_b / 1024, float(event.span_us) / 1000))
def print_ipv6_event(cpu, data, size):
event = ct.cast(data, ct.POINTER(Data_ipv6)).contents
global start_ts
if args.time:
if args.csv:
print("%s," % strftime("%H:%M:%S"), end="")
else:
print("%-8s " % strftime("%H:%M:%S"), end="")
if args.timestamp:
if start_ts == 0:
start_ts = event.ts_us
delta_s = (float(event.ts_us) - start_ts) / 1000000
if args.csv:
print("%.6f," % delta_s, end="")
else:
print("%-9.6f " % delta_s, end="")
print(format_string % (event.pid, event.task,
"6" if args.wide or args.csv else "",
inet_ntop(AF_INET6, event.saddr), event.ports >> 32,
inet_ntop(AF_INET6, event.daddr), event.ports & 0xffffffff,
event.tx_b / 1024, event.rx_b / 1024, float(event.span_us) / 1000))
# initialize BPF
b = BPF(text=bpf_text)
# header
if args.time:
if args.csv:
print("%s," % ("TIME"), end="")
else:
print("%-8s " % ("TIME"), end="")
if args.timestamp:
if args.csv:
print("%s," % ("TIME(s)"), end="")
else:
print("%-9s " % ("TIME(s)"), end="")
print(header_string % ("PID", "COMM",
"IP" if args.wide or args.csv else "", "LADDR",
"LPORT", "RADDR", "RPORT", "TX_KB", "RX_KB", "MS"))
start_ts = 0
# read events
b["ipv4_events"].open_perf_buffer(print_ipv4_event)
b["ipv6_events"].open_perf_buffer(print_ipv6_event)
while 1:
b.kprobe_poll()
Demonstrations of tcplife, the Linux BPF/bcc version.
tcplife summarizes TCP sessions that open and close while tracing. For example:
# ./tcplife
PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
22597 recordProg 127.0.0.1 46644 127.0.0.1 28527 0 0 0.23
3277 redis-serv 127.0.0.1 28527 127.0.0.1 46644 0 0 0.28
22598 curl 100.66.3.172 61620 52.205.89.26 80 0 1 91.79
22604 curl 100.66.3.172 44400 52.204.43.121 80 0 1 121.38
22624 recordProg 127.0.0.1 46648 127.0.0.1 28527 0 0 0.22
3277 redis-serv 127.0.0.1 28527 127.0.0.1 46648 0 0 0.27
22647 recordProg 127.0.0.1 46650 127.0.0.1 28527 0 0 0.21
3277 redis-serv 127.0.0.1 28527 127.0.0.1 46650 0 0 0.26
[...]
This caught a program, "recordProg" making a few short-lived TCP connections
to "redis-serv", lasting about 0.25 milliseconds each connection. A couple of
"curl" sessions were also traced, connecting to port 80, and lasting 91 and 121
milliseconds.
This tool is useful for workload characterisation and flow accounting:
identifying what connections are happening, with the bytes transferred.
Process names are truncated to 10 characters. By using the wide option, -w,
the column width becomes 16 characters. The IP address columns are also wider
to fit IPv6 addresses:
# ./tcplife -w
PID COMM IP LADDR LPORT RADDR RPORT TX_KB RX_KB MS
26315 recordProgramSt 4 127.0.0.1 44188 127.0.0.1 28527 0 0 0.21
3277 redis-server 4 127.0.0.1 28527 127.0.0.1 44188 0 0 0.26
26320 ssh 6 fe80::8a3:9dff:fed5:6b19 22440 fe80::8a3:9dff:fed5:6b19 22 1 1 457.52
26321 sshd 6 fe80::8a3:9dff:fed5:6b19 22 fe80::8a3:9dff:fed5:6b19 22440 1 1 458.69
26341 recordProgramSt 4 127.0.0.1 44192 127.0.0.1 28527 0 0 0.27
3277 redis-server 4 127.0.0.1 28527 127.0.0.1 44192 0 0 0.32
In this example, I uploaded a 10 Mbyte file to the server, and then downloaded
it again, using scp:
# ./tcplife
PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
7715 recordProg 127.0.0.1 50894 127.0.0.1 28527 0 0 0.25
3277 redis-serv 127.0.0.1 28527 127.0.0.1 50894 0 0 0.30
7619 sshd 100.66.3.172 22 100.127.64.230 63033 5 10255 3066.79
7770 recordProg 127.0.0.1 50896 127.0.0.1 28527 0 0 0.20
3277 redis-serv 127.0.0.1 28527 127.0.0.1 50896 0 0 0.24
7793 recordProg 127.0.0.1 50898 127.0.0.1 28527 0 0 0.23
3277 redis-serv 127.0.0.1 28527 127.0.0.1 50898 0 0 0.27
7847 recordProg 127.0.0.1 50900 127.0.0.1 28527 0 0 0.24
3277 redis-serv 127.0.0.1 28527 127.0.0.1 50900 0 0 0.29
7870 recordProg 127.0.0.1 50902 127.0.0.1 28527 0 0 0.29
3277 redis-serv 127.0.0.1 28527 127.0.0.1 50902 0 0 0.30
7798 sshd 100.66.3.172 22 100.127.64.230 64925 10265 6 2176.15
[...]
You can see the 10 Mbytes received by sshd, and then later transmitted. Looks
like receive was slower (3.07 seconds) than transmit (2.18 seconds).
Timestamps can be added with -t:
# ./tcplife -t
TIME(s) PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
0.000000 5973 recordProg 127.0.0.1 47986 127.0.0.1 28527 0 0 0.25
0.000059 3277 redis-serv 127.0.0.1 28527 127.0.0.1 47986 0 0 0.29
1.022454 5996 recordProg 127.0.0.1 47988 127.0.0.1 28527 0 0 0.23
1.022513 3277 redis-serv 127.0.0.1 28527 127.0.0.1 47988 0 0 0.27
2.044868 6019 recordProg 127.0.0.1 47990 127.0.0.1 28527 0 0 0.24
2.044924 3277 redis-serv 127.0.0.1 28527 127.0.0.1 47990 0 0 0.28
3.069136 6042 recordProg 127.0.0.1 47992 127.0.0.1 28527 0 0 0.22
3.069204 3277 redis-serv 127.0.0.1 28527 127.0.0.1 47992 0 0 0.28
This shows that the recordProg process was connecting once per second.
There's also a -T for HH:MM:SS formatted times.
There's a comma separated values mode, -s. Here it is with both -t and -T
timestamps:
# ./tcplife -stT
TIME,TIME(s),PID,COMM,IP,LADDR,LPORT,RADDR,RPORT,TX_KB,RX_KB,MS
23:39:38,0.000000,7335,recordProgramSt,4,127.0.0.1,48098,127.0.0.1,28527,0,0,0.26
23:39:38,0.000064,3277,redis-server,4,127.0.0.1,28527,127.0.0.1,48098,0,0,0.32
23:39:39,1.025078,7358,recordProgramSt,4,127.0.0.1,48100,127.0.0.1,28527,0,0,0.25
23:39:39,1.025141,3277,redis-server,4,127.0.0.1,28527,127.0.0.1,48100,0,0,0.30
23:39:41,2.040949,7381,recordProgramSt,4,127.0.0.1,48102,127.0.0.1,28527,0,0,0.24
23:39:41,2.041011,3277,redis-server,4,127.0.0.1,28527,127.0.0.1,48102,0,0,0.29
23:39:42,3.067848,7404,recordProgramSt,4,127.0.0.1,48104,127.0.0.1,28527,0,0,0.30
23:39:42,3.067914,3277,redis-server,4,127.0.0.1,28527,127.0.0.1,48104,0,0,0.35
[...]
There are options for filtering on local and remote ports. Here is filtering
on local ports 22 and 80:
# ./tcplife.py -L 22,80
PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
8301 sshd 100.66.3.172 22 100.127.64.230 58671 3 3 1448.52
[...]
USAGE:
# ./tcplife.py -h
usage: tcplife.py [-h] [-T] [-t] [-w] [-s] [-p PID] [-L LOCALPORT]
[-D REMOTEPORT]
Trace the lifespan of TCP sessions and summarize
optional arguments:
-h, --help show this help message and exit
-T, --time include time column on output (HH:MM:SS)
-t, --timestamp include timestamp on output (seconds)
-w, --wide wide column output (fits IPv6 addresses)
-s, --csv comma seperated values output
-p PID, --pid PID trace this PID only
-L LOCALPORT, --localport LOCALPORT
comma-separated list of local ports to trace.
-D REMOTEPORT, --remoteport REMOTEPORT
comma-separated list of remote ports to trace.
examples:
./tcplife # trace all TCP connect()s
./tcplife -t # include time column (HH:MM:SS)
./tcplife -w # wider colums (fit IPv6)
./tcplife -stT # csv output, with times & timestamps
./tcplife -p 181 # only trace PID 181
./tcplife -L 80 # only trace local port 80
./tcplife -L 80,81 # only trace local ports 80 and 81
./tcplife -D 80 # only trace remote port 80
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment