Commit c5b677d3 authored by Brendan Gregg, committed by GitHub

Merge pull request #267 from dalehamel/tcp-tools

bpftrace adaptations of several iovisor/bcc tcp*.py tools
parents e1960e9a 25b776ba
......@@ -159,6 +159,10 @@ bpftrace contains various tools, which also serve as examples of programming in
- tools/[statsnoop.bt](tools/statsnoop.bt): Trace stat() syscalls for general debugging. [Examples](tools/statsnoop_example.txt).
- tools/[syncsnoop.bt](tools/syncsnoop.bt): Trace sync() variety of syscalls. [Examples](tools/syncsnoop_example.txt).
- tools/[syscount.bt](tools/syscount.bt): Count system calls. [Examples](tools/syscount_example.txt).
- tools/[tcpaccept.bt](tools/tcpaccept.bt): Trace TCP passive connections (accept()). [Examples](tools/tcpaccept_example.txt).
- tools/[tcpconnect.bt](tools/tcpconnect.bt): Trace TCP active connections (connect()). [Examples](tools/tcpconnect_example.txt).
- tools/[tcpdrop.bt](tools/tcpdrop.bt): Trace kernel-based TCP packet drops with details. [Examples](tools/tcpdrop_example.txt).
- tools/[tcpretrans.bt](tools/tcpretrans.bt): Trace TCP retransmits. [Examples](tools/tcpretrans_example.txt).
- tools/[vfscount.bt](tools/vfscount.bt): Count VFS calls. [Examples](tools/vfscount_example.txt).
- tools/[vfsstat.bt](tools/vfsstat.bt): Count some VFS calls, with per-second summaries. [Examples](tools/vfsstat_example.txt).
- tools/[writeback.bt](tools/writeback.bt): Trace file system writeback events with details. [Examples](tools/writeback_example.txt).
......
.TH tcpaccept 8 "2018-11-24" "USER COMMANDS"
.SH NAME
tcpaccept.bt \- Trace TCP passive connections (accept()). Uses bpftrace/eBPF
.SH SYNOPSIS
.B tcpaccept.bt
.SH DESCRIPTION
This tool traces passive TCP connections (eg, via an accept() syscall;
connect() are active connections). This can be useful for general
troubleshooting to see what new connections the local server is accepting.
This uses dynamic tracing of the kernel inet_csk_accept() socket function (from
tcp_prot.accept), and will need to be modified to match kernel changes.
This tool only traces successful TCP accept()s. Connection attempts to closed
ports will not be shown (those can be traced via other functions).
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bpftrace.
.SH EXAMPLES
.TP
Trace all passive TCP connections (accept()s):
#
.B tcpaccept.bt
.TP
.SH FIELDS
.TP
TIME
Time of the call, in HH:MM:SS format.
.TP
PID
Process ID
.TP
COMM
Process name
.TP
RADDR
Remote IP address.
.TP
RPORT
Remote port.
.TP
LADDR
Local IP address.
.TP
LPORT
Local port
.TP
BL
Current accept backlog vs maximum backlog
.SH OVERHEAD
This traces the kernel inet_csk_accept function and prints output for each event.
The rate of this depends on your server application. If it is a web or proxy server
accepting many tens of thousands of connections per second, then the overhead
of this tool may be measurable (although, still a lot better than tracing
every packet). If it is less than a thousand a second, then the overhead is
expected to be negligible. Test and understand this overhead before use.
.SH SOURCE
This is from bpftrace
.IP
https://github.com/iovisor/bpftrace
.PP
Also look in the bpftrace distribution for a companion _examples.txt file
containing example usage, output, and commentary for this tool.
This is a bpftrace version of the bcc tool of the same name. The bcc tool
may provide more options and customizations.
.IP
https://github.com/iovisor/bcc
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg, adapted for bpftrace by Dale Hamel
.SH SEE ALSO
tcpconnect(8), funccount(8), tcpdump(8)
.TH tcpconnect 8 "2018-11-24" "USER COMMANDS"
.SH NAME
tcpconnect.bt \- Trace TCP active connections (connect()). Uses Linux bpftrace/eBPF
.SH SYNOPSIS
.B tcpconnect.bt
.SH DESCRIPTION
This tool traces active TCP connections (eg, via a connect() syscall;
accept() are passive connections). This can be useful for general
troubleshooting to see what connections are initiated by the local server.
All connection attempts are traced, even if they ultimately fail.
This works by tracing the kernel tcp_connect() function using dynamic tracing,
and will need updating to match any changes to this function. This bpftrace
version is limited to IPv4 addresses.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bpftrace.
.SH EXAMPLES
.TP
Trace all active TCP connections:
#
.B tcpconnect.bt
.TP
.SH FIELDS
.TP
TIME
Time of the call, in HH:MM:SS format.
.TP
PID
Process ID
.TP
COMM
Process name
.TP
SADDR
Source IP address.
.TP
SPORT
Source port.
.TP
DADDR
Destination IP address.
.TP
DPORT
Destination port
.SH OVERHEAD
This traces the kernel tcp_connect() function and prints output for each
event. As the rate of this is generally expected to be low (< 1000/s), the
overhead is also expected to be negligible. If you have an application that
is calling a high rate of connect()s, such as a proxy server, then test and
understand this overhead before use.
.SH SOURCE
This is from bpftrace
.IP
https://github.com/iovisor/bpftrace
.PP
Also look in the bpftrace distribution for a companion _examples.txt file
containing example usage, output, and commentary for this tool.
This is a bpftrace version of the bcc tool of the same name. The bcc tool
may provide more options and customizations.
.IP
https://github.com/iovisor/bcc
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg, adapted for bpftrace by Dale Hamel
.SH SEE ALSO
tcpaccept(8), funccount(8), tcpdump(8)
.TH tcpdrop 8 "2018-11-24" "USER COMMANDS"
.SH NAME
tcpdrop.bt \- Trace kernel-based TCP packet drops with details. Uses Linux bpftrace/eBPF
.SH SYNOPSIS
.B tcpdrop.bt
.SH DESCRIPTION
This tool traces TCP packets or segments that were dropped by the kernel, and
shows details from the IP and TCP headers, the socket state, and the
kernel stack trace. This is useful for debugging cases of high kernel drops,
which can cause timer-based retransmits and performance issues.
This tool works using dynamic tracing of the tcp_drop() kernel function,
which requires a recent kernel version.
This tool is limited to IPv4 and cannot show TCP flags, as bpftrace currently
cannot parse socket buffers in the way that bcc can.
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bpftrace.
.SH EXAMPLES
.TP
Trace all tcp drops:
#
.B tcpdrop.bt
.TP
.SH FIELDS
.TP
TIME
Time of the call, in HH:MM:SS format.
.TP
PID
Process ID that was on-CPU during the drop. This may be unrelated, as drops
can occur on the receive interrupt and be unrelated to the PID that was
interrupted.
.TP
COMM
Process name
.TP
SADDR
Source IP address.
.TP
SPORT
Source TCP port.
.TP
DADDR
Destination IP address.
.TP
DPORT
Destination TCP port.
.TP
STATE
TCP session state ("ESTABLISHED", etc).
.SH OVERHEAD
This traces the kernel tcp_drop() function, which should be low frequency,
and therefore the overhead of this tool should be negligible.
As always, test and understand this tool's overhead for your types of
workloads before production use.
.SH SOURCE
This is from bpftrace
.IP
https://github.com/iovisor/bpftrace
.PP
Also look in the bpftrace distribution for a companion _examples.txt file
containing example usage, output, and commentary for this tool.
This is a bpftrace version of the bcc tool of the same name. The bcc tool
may provide more options and customizations.
.IP
https://github.com/iovisor/bcc
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg, adapted for bpftrace by Dale Hamel
.SH SEE ALSO
tcplife(8), tcpaccept(8), tcpconnect(8), tcptop(8)
.TH tcpretrans 8 "2018-11-24" "USER COMMANDS"
.SH NAME
tcpretrans.bt \- Trace TCP retransmits. Uses Linux bpftrace/eBPF
.SH SYNOPSIS
.B tcpretrans.bt
.SH DESCRIPTION
This traces TCP retransmits, showing address, port, and TCP state information,
and sometimes the PID (although usually not, since retransmits are usually
sent by the kernel on timeouts). To keep overhead very low, only
the TCP retransmit function is traced. This does not trace every packet
(like tcpdump(8) or a packet sniffer).
This uses dynamic tracing of the kernel tcp_retransmit_skb() function, and
will need to be updated to match kernel changes to this function. This
bpftrace version is limited to IPv4 and does not track tail loss probes (TLPs).
Since this uses BPF, only the root user can use this tool.
.SH REQUIREMENTS
CONFIG_BPF and bpftrace.
.SH EXAMPLES
.TP
Trace TCP retransmits:
#
.B tcpretrans.bt
.TP
.SH FIELDS
.TP
TIME
Time of the call, in HH:MM:SS format.
.TP
PID
Process ID that was on-CPU. This is less useful than it might sound, as it
may usually be 0, for the kernel, for timer-based retransmits.
.TP
LADDR
Local IP address.
.TP
LPORT
Local port.
.TP
RADDR
Remote IP address.
.TP
RPORT
Remote port.
.TP
STATE
TCP session state.
.SH OVERHEAD
Should be negligible: TCP retransmit events should be low (<1000/s), and the
low overhead this tool adds to each event should make the cost negligible.
.SH SOURCE
This is from bpftrace
.IP
https://github.com/iovisor/bpftrace
.PP
Also look in the bpftrace distribution for a companion _examples.txt file
containing example usage, output, and commentary for this tool.
This is a bpftrace version of the bcc tool of the same name. The bcc tool
may provide more options and customizations.
.IP
https://github.com/iovisor/bcc
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg, adapted for bpftrace by Dale Hamel
.SH SEE ALSO
tcpconnect(8), tcpaccept(8)
#!/usr/bin/env bpftrace
/*
* tcpaccept.bt Trace TCP accept()s
* For Linux, uses bpftrace and eBPF.
*
* USAGE: tcpaccept.bt
*
* This is a bpftrace version of the bcc tool of the same name.
*
* This uses dynamic tracing of the kernel inet_csk_accept() socket function
* (from tcp_prot.accept), and will need to be modified to match kernel changes.
* Copyright (c) 2018 Dale Hamel.
* Licensed under the Apache License, Version 2.0 (the "License")
* 23-Nov-2018 Dale Hamel created this.
*/
#include <net/sock.h>
BEGIN
{
printf("Tracing tcp accepts. Hit Ctrl-C to end.\n");
printf("%-8s %-6s %-14s ", "TIME", "PID", "COMM");
printf("%-14s %-5s %-14s %-5s %s\n", "RADDR", "RPORT", "LADDR", "LPORT", "BL");
}
kretprobe:inet_csk_accept
{
$sk = ((sock *) retval);
$inet_family = $sk->__sk_common.skc_family;
$af_inet = 2;
if ($inet_family == $af_inet) {
$daddr = $sk->__sk_common.skc_daddr;
$saddr = $sk->__sk_common.skc_rcv_saddr;
$lport = $sk->__sk_common.skc_num;
$dport = $sk->__sk_common.skc_dport;
$qlen = $sk->sk_ack_backlog;
$qmax = $sk->sk_max_ack_backlog;
// Destination port is big endian, it must be flipped
$dport = ($dport >> 8) | (($dport << 8) & 0x00FF00);
time("%H:%M:%S ");
printf("%-6d %-14s ", pid, comm);
printf("%-14s %-5d %-14s %-5d ", ntop($af_inet, $daddr), $dport, ntop($af_inet, $saddr), $lport);
printf("%d/%d\n", $qlen, $qmax);
}
}
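The byte swap applied to $dport in the script above can be checked in user space. The following is a minimal Python sketch, not part of this commit, assuming a little-endian host such as x86_64; the helper name swap16 is hypothetical.

```python
import struct


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value, mirroring the $dport expression."""
    return (value >> 8) | ((value << 8) & 0xFF00)


# Port 8080 is stored in network (big-endian) order as the bytes 0x1F 0x90.
# Reading those two bytes as a little-endian integer yields 0x901F.
raw = struct.unpack("<H", struct.pack(">H", 8080))[0]

print(hex(raw))     # 0x901f
print(swap16(raw))  # 8080
```

The 0xFF00 mask keeps only the low byte after it has been shifted up, so the result stays within 16 bits.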
Demonstrations of tcpaccept, the Linux bpftrace/eBPF version.
This tool traces the kernel function accepting TCP socket connections (eg, a
passive connection via accept(); not connect()). Some example output (IP
addresses changed to protect the innocent):
# ./tcpaccept.bt
Tracing tcp accepts. Hit Ctrl-C to end.
TIME PID COMM RADDR RPORT LADDR LPORT BL
00:34:19 3949061 nginx 10.228.22.228 44226 10.229.20.169 8080 0/128
00:34:19 3951399 ruby 127.0.0.1 52422 127.0.0.1 8000 0/128
00:34:19 3949062 nginx 10.228.23.128 35408 10.229.20.169 8080 0/128
This output shows three IPv4 connections: two to "nginx" processes (PIDs
3949061 and 3949062) listening on port 8080, and one to a "ruby" process
(PID 3951399) listening on port 8000. The remote address and port are also
printed, and the accept queue current size as well as maximum size are shown.
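For post-processing, the columns above can be split into typed fields. The following is a minimal Python sketch, not part of this commit; the names AcceptEvent and parse_tcpaccept_line are hypothetical, and it assumes that COMM contains no whitespace.

```python
from typing import NamedTuple


class AcceptEvent(NamedTuple):
    time: str
    pid: int
    comm: str
    raddr: str
    rport: int
    laddr: str
    lport: int
    backlog: int
    max_backlog: int


def parse_tcpaccept_line(line: str) -> AcceptEvent:
    """Split one whitespace-separated output row into typed fields."""
    time, pid, comm, raddr, rport, laddr, lport, bl = line.split()
    # The BL column is "current/maximum" accept backlog.
    backlog, max_backlog = bl.split("/")
    return AcceptEvent(time, int(pid), comm, raddr, int(rport),
                       laddr, int(lport), int(backlog), int(max_backlog))


event = parse_tcpaccept_line(
    "00:34:19 3949061 nginx 10.228.22.228 44226 10.229.20.169 8080 0/128"
)
print(event.comm, event.lport, f"{event.backlog}/{event.max_backlog}")
```

A backlog value that approaches max_backlog would suggest the application is not calling accept() fast enough.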
The overhead of this tool should be negligible, since it is only tracing the
kernel function performing accept. It is not tracing every packet and then
filtering.
This tool only traces successful TCP accept()s. Connection attempts to closed
ports will not be shown (those can be traced via other functions).
There is another version of this tool in bcc: https://github.com/iovisor/bcc
USAGE message:
# ./tcpaccept.bt
#!/usr/bin/env bpftrace
/*
* tcpconnect.bt Trace TCP connect()s.
* For Linux, uses bpftrace and eBPF.
*
* USAGE: tcpconnect.bt
*
* This is a bpftrace version of the bcc tool of the same name.
* It is limited to ipv4 addresses.
*
* All connection attempts are traced, even if they ultimately fail.
*
* This uses dynamic tracing of kernel functions, and will need to be updated
* to match kernel changes.
*
* Copyright (c) 2018 Dale Hamel.
* Licensed under the Apache License, Version 2.0 (the "License")
*
* 23-Nov-2018 Dale Hamel created this.
*/
#include <net/sock.h>
BEGIN
{
printf("Tracing tcp connections. Hit Ctrl-C to end.\n");
printf("%-8s %-8s %-16s ", "TIME", "PID", "COMM");
printf("%-14s %-6s %-14s %-6s\n", "SADDR", "SPORT", "DADDR", "DPORT");
}
kprobe:tcp_connect
{
$sk = ((sock *) arg0);
$inet_family = $sk->__sk_common.skc_family;
$af_inet = 2;
if ($inet_family == $af_inet) {
$daddr = $sk->__sk_common.skc_daddr;
$saddr = $sk->__sk_common.skc_rcv_saddr;
$lport = $sk->__sk_common.skc_num;
$dport = $sk->__sk_common.skc_dport;
// Destination port is big endian, it must be flipped
$dport = ($dport >> 8) | (($dport << 8) & 0x00FF00);
time("%H:%M:%S ");
printf("%-8d %-16s ", pid, comm);
printf("%-14s %-6d %-14s %-6d\n", ntop($af_inet, $saddr), $lport, ntop($af_inet, $daddr), $dport);
}
}
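The address formatting that the script delegates to ntop() can be approximated in user space. The following is a minimal Python sketch, not part of this commit; the helper name ntop_ipv4 is hypothetical, and it assumes the 32-bit value was read on a little-endian host from memory holding the address in network byte order.

```python
import socket
import struct


def ntop_ipv4(addr: int) -> str:
    """Format a 32-bit value whose in-memory bytes are a network-order IPv4 address."""
    # Packing as little-endian restores the original byte sequence in memory.
    return socket.inet_ntoa(struct.pack("<I", addr))


print(ntop_ipv4(0x0100007F))  # 127.0.0.1
print(ntop_ipv4(0x5214E50A))  # 10.229.20.82
```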
Demonstrations of tcpconnect, the Linux bpftrace/eBPF version.
This tool traces the kernel function performing active TCP connections
(eg, via a connect() syscall; accept() are passive connections). Some example
output (IP addresses changed to protect the innocent):
# ./tcpconnect.bt
TIME PID COMM SADDR SPORT DADDR DPORT
00:36:45 1798396 agent 127.0.0.1 5001 10.229.20.82 56114
00:36:45 1798396 curl 127.0.0.1 10255 10.229.20.82 56606
00:36:45 3949059 nginx 127.0.0.1 8000 127.0.0.1 37780
This output shows three connections, one from an "agent" process, one from
"curl", and one from "nginx". The output shows the source address, source
port, destination address, and destination port. This traces attempted
connections: these may have failed.
The overhead of this tool should be negligible, since it is only tracing the
kernel functions performing connect. It is not tracing every packet and then
filtering.
USAGE message:
# ./tcpconnect.bt
#!/usr/bin/env bpftrace
/*
* tcpdrop.bt Trace TCP kernel-dropped packets/segments.
* For Linux, uses bpftrace and eBPF.
*
* USAGE: tcpdrop.bt
*
* This is a bpftrace version of the bcc tool of the same name.
* It is limited to ipv4 addresses, and cannot show tcp flags.
*
* This provides information such as packet details, socket state, and kernel
* stack trace for packets/segments that were dropped via tcp_drop().
* Copyright (c) 2018 Dale Hamel.
* Licensed under the Apache License, Version 2.0 (the "License")
* 23-Nov-2018 Dale Hamel created this.
*/
#include <net/sock.h>
BEGIN
{
printf("Tracing tcp drops. Hit Ctrl-C to end.\n");
printf("%-8s %-8s %-16s %-21s %-21s %-8s\n", "TIME", "PID", "COMM", "SADDR:SPORT", "DADDR:DPORT", "STATE");
}
kprobe:tcp_drop
{
$sk = ((sock *) arg0);
$inet_family = $sk->__sk_common.skc_family;
$af_inet = 2;
if ($inet_family == $af_inet) {
$daddr = $sk->__sk_common.skc_daddr;
$saddr = $sk->__sk_common.skc_rcv_saddr;
$lport = $sk->__sk_common.skc_num;
$dport = $sk->__sk_common.skc_dport;
// Destination port is big endian, it must be flipped
$dport = ($dport >> 8) | (($dport << 8) & 0x00FF00);
$state = $sk->__sk_common.skc_state;
// See https://github.com/torvalds/linux/blob/master/include/net/tcp_states.h
$statestr = "";
$statestr = $state == 1 ? "ESTABLISHED" : $statestr;
$statestr = $state == 2 ? "SYN_SENT" : $statestr;
$statestr = $state == 3 ? "SYN_RECV" : $statestr;
$statestr = $state == 4 ? "FIN_WAIT1" : $statestr;
$statestr = $state == 5 ? "FIN_WAIT2" : $statestr;
$statestr = $state == 6 ? "TIME_WAIT" : $statestr;
$statestr = $state == 7 ? "CLOSE" : $statestr;
$statestr = $state == 8 ? "CLOSE_WAIT" : $statestr;
$statestr = $state == 9 ? "LAST_ACK" : $statestr;
$statestr = $state == 10 ? "LISTEN" : $statestr;
$statestr = $state == 11 ? "CLOSING" : $statestr;
$statestr = $state == 12 ? "NEW_SYN_RECV" : $statestr;
time("%H:%M:%S ");
printf("%-8d %-16s ", pid, comm);
printf("%14s:%-6d %14s:%-6d %-10s\n", ntop($af_inet, $daddr), $dport, ntop($af_inet, $saddr), $lport, $statestr);
printf("%s\n", stack);
}
}
Demonstrations of tcpdrop, the Linux bpftrace/eBPF version.
tcpdrop prints details of TCP packets or segments that were dropped by the
kernel, including the kernel stack trace that led to the drop:
# ./tcpdrop.bt
TIME PID COMM SADDR:SPORT DADDR:DPORT STATE
00:39:21 0 swapper/2 10.231.244.31:3306 10.229.20.82:50552 ESTABLISHE
tcp_drop+0x1
tcp_v4_do_rcv+0x135
tcp_v4_rcv+0x9c7
ip_local_deliver_finish+0x62
ip_local_deliver+0x6f
ip_rcv_finish+0x129
ip_rcv+0x28f
__netif_receive_skb_core+0x432
__netif_receive_skb+0x18
netif_receive_skb_internal+0x37
napi_gro_receive+0xc5
ena_clean_rx_irq+0x3c3
ena_io_poll+0x33f
net_rx_action+0x140
__softirqentry_text_start+0xdf
irq_exit+0xb6
do_IRQ+0x82
ret_from_intr+0x0
native_safe_halt+0x6
default_idle+0x20
arch_cpu_idle+0x15
default_idle_call+0x23
do_idle+0x17f
cpu_startup_entry+0x73
rest_init+0xae
start_kernel+0x4dc
x86_64_start_reservations+0x24
x86_64_start_kernel+0x74
secondary_startup_64+0xa5
[...]
The last column shows the state of the TCP session.
This tool is useful for debugging high rates of drops, which can cause the
remote end to do timer-based retransmits, hurting performance.
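The STATE column is derived from the numeric socket state through a chain of comparisons in the script. The same mapping can be expressed as a lookup table. The following is a minimal Python sketch, not part of this commit; the names TCP_STATES and tcp_state_name are hypothetical, and the numbers are the ones listed in tcpdrop.bt.

```python
# Numeric TCP states and their names, as enumerated in tcpdrop.bt.
TCP_STATES = {
    1: "ESTABLISHED",
    2: "SYN_SENT",
    3: "SYN_RECV",
    4: "FIN_WAIT1",
    5: "FIN_WAIT2",
    6: "TIME_WAIT",
    7: "CLOSE",
    8: "CLOSE_WAIT",
    9: "LAST_ACK",
    10: "LISTEN",
    11: "CLOSING",
    12: "NEW_SYN_RECV",
}


def tcp_state_name(state: int) -> str:
    """Return the state name, or an empty string for unknown values as the script does."""
    return TCP_STATES.get(state, "")


print(tcp_state_name(1))   # ESTABLISHED
print(tcp_state_name(12))  # NEW_SYN_RECV
```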
USAGE:
# ./tcpdrop.bt
#!/usr/bin/env bpftrace
/*
* tcpretrans.bt Trace TCP retransmits
* For Linux, uses bpftrace and eBPF.
*
* USAGE: tcpretrans.bt
*
* This is a bpftrace version of the bcc tool of the same name.
* It is limited to ipv4 addresses, and doesn't support tracking TLPs.
*
* This uses dynamic tracing of kernel functions, and will need to be updated
* to match kernel changes.
*
* Copyright (c) 2018 Dale Hamel.
* Licensed under the Apache License, Version 2.0 (the "License")
*
* 23-Nov-2018 Dale Hamel created this.
*/
#include <net/sock.h>
BEGIN
{
printf("Tracing tcp retransmits. Hit Ctrl-C to end.\n");
printf("%-8s %-8s %20s %21s %6s\n", "TIME", "PID", "LADDR:LPORT", "RADDR:RPORT", "STATE");
}
kprobe:tcp_retransmit_skb
{
$sk = ((sock *) arg0);
$inet_family = $sk->__sk_common.skc_family;
$af_inet = 2;
if ($inet_family == $af_inet) {
$daddr = $sk->__sk_common.skc_daddr;
$saddr = $sk->__sk_common.skc_rcv_saddr;
$lport = $sk->__sk_common.skc_num;
$dport = $sk->__sk_common.skc_dport;
// Destination port is big endian, it must be flipped
$dport = ($dport >> 8) | (($dport << 8) & 0x00FF00);
$state = $sk->__sk_common.skc_state;
// See https://github.com/torvalds/linux/blob/master/include/net/tcp_states.h
$statestr = "";
$statestr = $state == 1 ? "ESTABLISHED" : $statestr;
$statestr = $state == 2 ? "SYN_SENT" : $statestr;
$statestr = $state == 3 ? "SYN_RECV" : $statestr;
$statestr = $state == 4 ? "FIN_WAIT1" : $statestr;
$statestr = $state == 5 ? "FIN_WAIT2" : $statestr;
$statestr = $state == 6 ? "TIME_WAIT" : $statestr;
$statestr = $state == 7 ? "CLOSE" : $statestr;
$statestr = $state == 8 ? "CLOSE_WAIT" : $statestr;
$statestr = $state == 9 ? "LAST_ACK" : $statestr;
$statestr = $state == 10 ? "LISTEN" : $statestr;
$statestr = $state == 11 ? "CLOSING" : $statestr;
$statestr = $state == 12 ? "NEW_SYN_RECV" : $statestr;
time("%H:%M:%S ");
printf("%-8d %14s:%-6d %14s:%-6d %6s\n", pid, ntop($af_inet, $saddr), $lport, ntop($af_inet, $daddr), $dport, $statestr);
}
}
Demonstrations of tcpretrans, the Linux bpftrace/eBPF version.
This tool traces the kernel TCP retransmit function to show details of these
retransmits. For example:
# ./tcpretrans.bt
TIME PID LADDR:LPORT RADDR:RPORT STATE
00:43:54 0 10.229.20.82:46654 153.2.224.76:443 SYN_SENT
00:43:55 0 10.232.0.49:57678 10.229.20.99:24231 SYN_SENT
00:43:57 100 10.229.20.175:54224 10.201.76.122:443 ESTABLISHED
[...]
This output shows three TCP retransmits. The first two were for IPv4
connections in the "SYN_SENT" state: one from 10.229.20.82 port 46654 to
153.2.224.76 port 443, and one from 10.232.0.49 port 57678 to 10.229.20.99
port 24231. The third was for a connection in the "ESTABLISHED" state. The
on-CPU PID at the time of the retransmit is printed, in the first two cases 0
(the kernel, which will be the case most of the time).
Retransmits are usually a sign of poor network health, and this tool is
useful for their investigation. Unlike using tcpdump, this tool has very
low overhead, as it only traces the retransmit function. It also prints
additional kernel details: the state of the TCP session at the time of the
retransmit.
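Since this tool prints one line per event, retransmits can be tallied per remote endpoint after the fact to spot paths that may be dropping packets. The following is a minimal Python sketch, not part of this commit; the function name count_by_remote is hypothetical, and it assumes the five-column layout shown above.

```python
from collections import Counter
from typing import Iterable


def count_by_remote(lines: Iterable[str]) -> Counter:
    """Count retransmit events per RADDR:RPORT column."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        # Expect: TIME PID LADDR:LPORT RADDR:RPORT STATE; skip header and other text.
        if len(fields) != 5 or fields[0] == "TIME":
            continue
        counts[fields[3]] += 1
    return counts


sample = [
    "TIME PID LADDR:LPORT RADDR:RPORT STATE",
    "00:43:54 0 10.229.20.82:46654 153.2.224.76:443 SYN_SENT",
    "00:43:55 0 10.232.0.49:57678 10.229.20.99:24231 SYN_SENT",
    "00:43:57 100 10.229.20.175:54224 10.201.76.122:443 ESTABLISHED",
    "[...]",
]
for remote, n in count_by_remote(sample).most_common():
    print(remote, n)
```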
USAGE message:
# ./tcpretrans.bt