- 25 Jun, 2020 18 commits
-
-
Tobias Klauser authored
Define attach_type_name in common.c instead of main.h so it is only defined once. This leads to a slight decrease in the binary size of bpftool. Before: text data bss dec hex filename 399024 11168 1573160 1983352 1e4378 bpftool After: text data bss dec hex filename 398256 10880 1573160 1982296 1e3f58 bpftool Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200624143154.13145-1-tklauser@distanz.ch
-
Tobias Klauser authored
Define prog_type_name in prog.c instead of main.h so it is only defined once. This leads to a slight decrease in the binary size of bpftool. Before: text data bss dec hex filename 401032 11936 1573160 1986128 1e4e50 bpftool After: text data bss dec hex filename 399024 11168 1573160 1983352 1e4378 bpftool Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200624143124.12914-1-tklauser@distanz.ch
-
Alexei Starovoitov authored
Yonghong Song says: ==================== bpf iterator implments traversal of kernel data structures and these data structures are passed to a bpf program for processing. This gives great flexibility for users to examine kernel data structure without using e.g. /proc/net which has limited and fixed format. Commit 138d0be3 ("net: bpf: Add netlink and ipv6_route bpf_iter targets") implemented bpf iterators for netlink and ipv6_route. This patch set intends to implement bpf iterators for tcp and udp. Currently, /proc/net/tcp is used to print tcp4 stats and /proc/net/tcp6 is used to print tcp6 stats. /proc/net/udp[6] have similar usage model. In contrast, only one tcp iterator is implemented and it is bpf program resposibility to filter based on socket family. The same is for udp. This will avoid another unnecessary traversal pass if users want to check both tcp4 and tcp6. Several helpers are also implemented in this patch bpf_skc_to_{tcp, tcp6, tcp_timewait, tcp_request, udp6}_sock The argument for these helpers is not a fixed btf_id. For example, bpf_skc_to_tcp(struct sock_common *), or bpf_skc_to_tcp(struct sock *), or bpf_skc_to_tcp(struct inet_sock *), ... are all valid. At runtime, the helper will check whether pointer cast is legal or not. Please see Patch #5 for details. Since btf_id's for both arguments and return value are known at build time, the btf_id's are pre-computed once vmlinux btf becomes valid. Jiri's "adding d_path helper" patch set https://lore.kernel.org/bpf/20200616100512.2168860-1-jolsa@kernel.org/T/ provides a way to pre-compute btf id during vmlinux build time. This can be applied here as well. A followup patch can convert to build time btf id computation after Jiri's patch landed. Changelogs: v4 -> v5: - fix bpf_skc_to_udp6_sock helper as besides sk_protocol, sk_family, sk_type == SOCK_DGRAM is also needed to differentiate from SOCK_RAW (Eric) v3 -> v4: - fix bpf_skc_to_{tcp_timewait, tcp_request}_sock helper implementation as just checking sk->sk_state is not enough (Martin) - fix a few kernel test robot reported failures - move bpf_tracing_net.h from libbpf to selftests (Andrii) - remove __weak attribute from selftests CONFIG_HZ variables (Andrii) v2 -> v3: - change sock_cast*/SOCK_CAST* names to btf_sock* names for generality (Martin) - change gpl_license to false (Martin) - fix helper to cast to tcp timewait/request socket. (Martin) v1 -> v2: - guard init_sock_cast_types() defination properly with CONFIG_NET (Martin) - reuse the btf_ids, computed for new helper argument, for return values (Martin) - using BTF_TYPE_EMIT to express intent of btf type generation (Andrii) - abstract out common net macros into bpf_tracing_net.h (Andrii) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
Added tcp{4,6} and udp{4,6} bpf programs into test_progs selftest so that they at least can load successfully. $ ./test_progs -n 3 ... #3/7 tcp4:OK #3/8 tcp6:OK #3/9 udp4:OK #3/10 udp6:OK ... #3 bpf_iter:OK Summary: 1/16 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230823.3989372-1-yhs@fb.com
-
Yonghong Song authored
On my VM, I got identical results between /proc/net/udp[6] and the udp{4,6} bpf iterator. For udp6: $ cat /sys/fs/bpf/p1 sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 1405: 000080FE00000000FF7CC4D0D9EFE4FE:0222 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 193 0 19183 2 0000000029eab111 0 $ cat /proc/net/udp6 sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 1405: 000080FE00000000FF7CC4D0D9EFE4FE:0222 00000000000000000000000000000000:0000 07 00000000:00000000 00:00000000 00000000 193 0 19183 2 0000000029eab111 0 For udp4: $ cat /sys/fs/bpf/p4 sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 2007: 00000000:1F90 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 72540 2 000000004ede477a 0 $ cat /proc/net/udp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 2007: 00000000:1F90 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 72540 2 000000004ede477a 0 Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230822.3989299-1-yhs@fb.com
-
Yonghong Song authored
In my VM, I got identical result compared to /proc/net/{tcp,tcp6}. For tcp6: $ cat /proc/net/tcp6 sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode 0: 00000000000000000000000000000000:0016 00000000000000000000000000000000:0000 0A 00000000:00000000 00:00000001 00000000 0 0 17955 1 000000003eb3102e 100 0 0 10 0 $ cat /sys/fs/bpf/p1 sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode 0: 00000000000000000000000000000000:0016 00000000000000000000000000000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 17955 1 000000003eb3102e 100 0 0 10 0 For tcp: $ cat /proc/net/tcp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode 0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 2666 1 000000007152e43f 100 0 0 10 0 $ cat /sys/fs/bpf/p2 sl local_address remote_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode 1: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 2666 1 000000007152e43f 100 0 0 10 0 Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230820.3989165-1-yhs@fb.com
-
Yonghong Song authored
These newly added macros will be used in subsequent bpf iterator tcp{4,6} and udp{4,6} programs. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230819.3989050-1-yhs@fb.com
-
Yonghong Song authored
Refactor bpf_iter_ipv6_route.c and bpf_iter_netlink.c so net macros, originally from various include/linux header files, are moved to a new header file bpf_tracing_net.h. The goal is to improve reuse so networking tracing programs do not need to copy these macros every time they use them. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230817.3988962-1-yhs@fb.com
-
Yonghong Song authored
Commit b9f4c01f ("selftest/bpf: Make bpf_iter selftest compilable against old vmlinux.h") and Commit dda18a5c ("selftests/bpf: Convert bpf_iter_test_kern{3, 4}.c to define own bpf_iter_meta") redefined newly introduced types in bpf programs so the bpf program can still compile properly with old kernels although loading may fail. Since this patch set introduced new types and the same workaround is needed, so let us move the workaround to a separate header file so they do not clutter bpf programs. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230816.3988656-1-yhs@fb.com
-
Yonghong Song authored
The helper is used in tracing programs to cast a socket pointer to a udp6_sock pointer. The return value could be NULL if the casting is illegal. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200623230815.3988481-1-yhs@fb.com
-
Yonghong Song authored
The bpf iterator for udp is implemented. Both udp4 and udp6 sockets will be traversed. It is up to bpf program to filter for udp4 or udp6 only, or both families of sockets. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230813.3988404-1-yhs@fb.com
-
Yonghong Song authored
Similar to tcp_iter_state, a new field bpf_seq_afinfo is added to udp_iter_state to provide bpf udp iterator afinfo. This does not change /proc/net/{udp, udp6} behavior. But it enables bpf iterator to avoid get afinfo from PDE_DATA and iterate through all udp and udp6 sockets in one pass. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230812.3988347-1-yhs@fb.com
-
Yonghong Song authored
Three more helpers are added to cast a sock_common pointer to an tcp_sock, tcp_timewait_sock or a tcp_request_sock for tracing programs. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230811.3988277-1-yhs@fb.com
-
Yonghong Song authored
The helper is used in tracing programs to cast a socket pointer to a tcp6_sock pointer. The return value could be NULL if the casting is illegal. A new helper return type RET_PTR_TO_BTF_ID_OR_NULL is added so the verifier is able to deduce proper return types for the helper. Different from the previous BTF_ID based helpers, the bpf_skc_to_tcp6_sock() argument can be several possible btf_ids. More specifically, all possible socket data structures with sock_common appearing in the first in the memory layout. This patch only added socket types related to tcp and udp. All possible argument btf_id and return value btf_id for helper bpf_skc_to_tcp6_sock() are pre-calculcated and cached. In the future, it is even possible to precompute these btf_id's at kernel build time. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230809.3988195-1-yhs@fb.com
-
Yonghong Song authored
/proc/net/tcp{4,6} uses jiffies for various computations. Let us add bpf_jiffies64() helper to tracing program so bpf_iter and other programs can use it. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230808.3988073-1-yhs@fb.com
-
Yonghong Song authored
'X' tells kernel to print hex with upper case letters. /proc/net/tcp{4,6} seq_file show() used this, and supports it in bpf_seq_printf() helper too. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230807.3988014-1-yhs@fb.com
-
Yonghong Song authored
The bpf iterator for tcp is implemented. Both tcp4 and tcp6 sockets will be traversed. It is up to bpf program to filter for tcp4 or tcp6 only, or both families of sockets. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230805.3987959-1-yhs@fb.com
-
Yonghong Song authored
A new field bpf_seq_afinfo is added to tcp_iter_state to provide bpf tcp iterator afinfo. There are two reasons on why we did this. First, the current way to get afinfo from PDE_DATA does not work for bpf iterator as its seq_file inode does not conform to /proc/net/{tcp,tcp6} inode structures. More specifically, anonymous bpf iterator will use an anonymous inode which is shared in the system and we cannot change inode private data structure at all. Second, bpf iterator for tcp/tcp6 wants to traverse all tcp and tcp6 sockets in one pass and bpf program can control whether they want to skip one sk_family or not. Having a different afinfo with family AF_UNSPEC make it easier to understand in the code. This patch does not change /proc/net/{tcp,tcp6} behavior as the bpf_seq_afinfo will be NULL for these two proc files. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200623230804.3987829-1-yhs@fb.com
-
- 24 Jun, 2020 7 commits
-
-
Dmitry Yakunin authored
This patch adds support of SO_KEEPALIVE flag and TCP related options to bpf_setsockopt() routine. This is helpful if we want to enable or tune TCP keepalive for applications which don't do it in the userspace code. v3: - update kernel-doc in uapi (Nikita Vetoshkin <nekto0n@yandex-team.ru>) v4: - update kernel-doc in tools too (Alexei Starovoitov) - add test to selftests (Alexei Starovoitov) Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200620153052.9439-3-zeil@yandex-team.ru
-
Dmitry Yakunin authored
This is preparation for usage in bpf_setsockopt. v2: - remove redundant EXPORT_SYMBOL (Alexei Starovoitov) Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200620153052.9439-2-zeil@yandex-team.ru
-
Dmitry Yakunin authored
This is preparation for usage in bpf_setsockopt. Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200620153052.9439-1-zeil@yandex-team.ru
-
Alexei Starovoitov authored
./test_progs-no_alu32 -t get_stack_raw_tp fails due to: 52: (85) call bpf_get_stack#67 53: (bf) r8 = r0 54: (bf) r1 = r8 55: (67) r1 <<= 32 56: (c7) r1 s>>= 32 ; if (usize < 0) 57: (c5) if r1 s< 0x0 goto pc+26 R0=inv(id=0,smax_value=800) R1_w=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8_w=inv(id=0,smax_value=800) R9=inv800 ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0); 58: (1f) r9 -= r8 ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0); 59: (bf) r2 = r7 60: (0f) r2 += r1 regs=1 stack=0 before 52: (85) call bpf_get_stack#67 ; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0); 61: (bf) r1 = r6 62: (bf) r3 = r9 63: (b7) r4 = 0 64: (85) call bpf_get_stack#67 R0=inv(id=0,smax_value=800) R1_w=ctx(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=800,var_off=(0x0; 0x3ff),s32_max_value=1023,u32_max_value=1023) R3_w=inv(id=0,umax_value=9223372036854776608) R3 unbounded memory access, use 'var &= const' or 'if (var < const)' In the C code: usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK); if (usize < 0) return 0; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0); if (ksize < 0) return 0; We used to have problem with pointer arith in R2. Now it's a problem with two integers in R3. 'if (usize < 0)' is comparing R1 and makes it [0,800], but R8 stays [-inf,800]. Both registers represent the same 'usize' variable. Then R9 -= R8 is doing 800 - [-inf, 800] so the result of "max_len - usize" looks unbounded to the verifier while it's obvious in C code that "max_len - usize" should be [0, 800]. To workaround the problem convert ksize and usize variables from int to long. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Prevent loading/parsing vmlinux BTF twice in some cases: for CO-RE relocations and for BTF-aware hooks (tp_btf, fentry/fexit, etc). Fixes: a6ed02ca ("libbpf: Load btf_vmlinux only once per object.") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200624043805.1794620-1-andriin@fb.com
-
Colin Ian King authored
There is a spelling mistake in a pr_warn message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200623084207.149253-1-colin.king@canonical.com
-
Quentin Monnet authored
Building bpftool yields the following complaint: pids.c: In function 'emit_obj_refs_json': pids.c:175:80: warning: declaration of 'json_wtr' shadows a global declaration [-Wshadow] 175 | void emit_obj_refs_json(struct obj_refs_table *table, __u32 id, json_writer_t *json_wtr) | ~~~~~~~~~~~~~~~^~~~~~~~ In file included from pids.c:11: main.h:141:23: note: shadowed declaration is here 141 | extern json_writer_t *json_wtr; | ^~~~~~~~ Let's rename the variable. v2: - Rename the variable instead of calling the global json_wtr directly. Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200623213600.16643-1-quentin@isovalent.com
-
- 23 Jun, 2020 15 commits
-
-
Alexei Starovoitov authored
-
Tobias Klauser authored
Currently, if the clang-bpf-co-re feature is not available, the build fails with e.g. CC prog.o prog.c:1462:10: fatal error: profiler.skel.h: No such file or directory 1462 | #include "profiler.skel.h" | ^~~~~~~~~~~~~~~~~ This is due to the fact that the BPFTOOL_WITHOUT_SKELETONS macro is not defined, despite BUILD_BPF_SKELS not being set. Fix this by correctly evaluating $(BUILD_BPF_SKELS) when deciding on whether to add -DBPFTOOL_WITHOUT_SKELETONS to CFLAGS. Fixes: 05aca6da ("tools/bpftool: Generalize BPF skeleton support and generate vmlinux.h") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200623103710.10370-1-tklauser@distanz.ch
-
John Fastabend authored
Extend original variable-length tests with a case to catch a common existing pattern of testing for < 0 for errors. Note because verifier also tracks upper bounds and we know it can not be greater than MAX_LEN here we can skip upper bound check. In ALU64 enabled compilation converting from long->int return types in probe helpers results in extra instruction pattern, <<= 32, s >>= 32. The trade-off is the non-ALU64 case works. If you really care about every extra insn (XDP case?) then you probably should be using original int type. In addition adding a sext insn to bpf might help the verifier in the general case to avoid these types of tricks. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200623032224.4020118-3-andriin@fb.com
-
Andrii Nakryiko authored
Add selftest that validates variable-length data reading and concatentation with one big shared data array. This is a common pattern in production use for monitoring and tracing applications, that potentially can read a lot of data, but overall read much less. Such pattern allows to determine precisely what amount of data needs to be sent over perfbuf/ringbuf and maximize efficiency. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200623032224.4020118-2-andriin@fb.com
-
Andrii Nakryiko authored
Switch most of BPF helper definitions from returning int to long. These definitions are coming from comments in BPF UAPI header and are used to generate bpf_helper_defs.h (under libbpf) to be later included and used from BPF programs. In actual in-kernel implementation, all the helpers are defined as returning u64, but due to some historical reasons, most of them are actually defined as returning int in UAPI (usually, to return 0 on success, and negative value on error). This actually causes Clang to quite often generate sub-optimal code, because compiler believes that return value is 32-bit, and in a lot of cases has to be up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values, before they can be used further in BPF code. Besides just "polluting" the code, these 32-bit shifts quite often cause problems for cases in which return value matters. This is especially the case for the family of bpf_probe_read_str() functions. There are few other similar helpers (e.g., bpf_read_branch_records()), in which return value is used by BPF program logic to record variable-length data and process it. For such cases, BPF program logic carefully manages offsets within some array or map to read variable-length data. For such uses, it's crucial for BPF verifier to track possible range of register values to prove that all the accesses happen within given memory bounds. Those extraneous zero-extending bit shifts, inserted by Clang (and quite often interleaved with other code, which makes the issues even more challenging and sometimes requires employing extra per-variable compiler barriers), throws off verifier logic and makes it mark registers as having unknown variable offset. We'll study this pattern a bit later below. Another common pattern is to check return of BPF helper for non-zero state to detect error conditions and attempt alternative actions in such case. Even in this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode quite often leads to sub-optimal and unnecessary extra code. We'll look at this pattern as well. Clang's BPF target supports two modes of code generation: ALU32, in which it is capable of using lower 32-bit parts of registers, and no-ALU32, in which only full 64-bit registers are being used. ALU32 mode somewhat mitigates the above described problems, but not in all cases. This patch switches all the cases in which BPF helpers return 0 or negative error from returning int to returning long. It is shown below that such change in definition leads to equivalent or better code. No-ALU32 mode benefits more, but ALU32 mode doesn't degrade or still gets improved code generation. Another class of cases switched from int to long are bpf_probe_read_str()-like helpers, which encode successful case as non-negative values, while still returning negative value for errors. In all of such cases, correctness is preserved due to two's complement encoding of negative values and the fact that all helpers return values with 32-bit absolute value. Two's complement ensures that for negative values higher 32 bits are all ones and when truncated, leave valid negative 32-bit value with the same value. Non-negative values have upper 32 bits set to zero and similarly preserve value when high 32 bits are truncated. This means that just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't require any extra shifts). To minimize the chances of regressions, two code patterns were investigated, as mentioned above. For both patterns, BPF assembly was analyzed in ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and new 64-bit long return type. Case 1. Variable-length data reading and concatenation. This is quite ubiquitous pattern in tracing/monitoring applications, reading data like process's environment variables, file path, etc. In such case, many pieces of string-like variable-length data are read into a single big buffer, and at the end of the process, only a part of array containing actual data is sent to user-space for further processing. This case is tested in test_varlen.c selftest (in the next patch). Code flow is roughly as follows: void *payload = &sample->payload; u64 len; len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1); if (len <= MAX_SZ1) { payload += len; sample->len1 = len; } len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2); if (len <= MAX_SZ2) { payload += len; sample->len2 = len; } /* and so on */ sample->total_len = payload - &sample->payload; /* send over, e.g., perf buffer */ There could be two variations with slightly different code generated: when len is 64-bit integer and when it is 32-bit integer. Both variations were analysed. BPF assembly instructions between two successive invocations of bpf_probe_read_kernel_str() were used to check code regressions. Results are below, followed by short analysis. Left side is using helpers with int return type, the right one is after the switch to long. ALU32 + INT ALU32 + LONG =========== ============ 64-BIT (13 insns): 64-BIT (10 insns): ------------------------------------ ------------------------------------ 17: call 115 17: call 115 18: if w0 > 256 goto +9 <LBB0_4> 18: if r0 > 256 goto +6 <LBB0_4> 19: w1 = w0 19: r1 = 0 ll 20: r1 <<= 32 21: *(u64 *)(r1 + 0) = r0 21: r1 s>>= 32 22: r6 = 0 ll 22: r2 = 0 ll 24: r6 += r0 24: *(u64 *)(r2 + 0) = r1 00000000000000c8 <LBB0_4>: 25: r6 = 0 ll 25: r1 = r6 27: r6 += r1 26: w2 = 256 00000000000000e0 <LBB0_4>: 27: r3 = 0 ll 28: r1 = r6 29: call 115 29: w2 = 256 30: r3 = 0 ll 32: call 115 32-BIT (11 insns): 32-BIT (12 insns): ------------------------------------ ------------------------------------ 17: call 115 17: call 115 18: if w0 > 256 goto +7 <LBB1_4> 18: if w0 > 256 goto +8 <LBB1_4> 19: r1 = 0 ll 19: r1 = 0 ll 21: *(u32 *)(r1 + 0) = r0 21: *(u32 *)(r1 + 0) = r0 22: w1 = w0 22: r0 <<= 32 23: r6 = 0 ll 23: r0 >>= 32 25: r6 += r1 24: r6 = 0 ll 00000000000000d0 <LBB1_4>: 26: r6 += r0 26: r1 = r6 00000000000000d8 <LBB1_4>: 27: w2 = 256 27: r1 = r6 28: r3 = 0 ll 28: w2 = 256 30: call 115 29: r3 = 0 ll 31: call 115 In ALU32 mode, the variant using 64-bit length variable clearly wins and avoids unnecessary zero-extension bit shifts. In practice, this is even more important and good, because BPF code won't need to do extra checks to "prove" that payload/len are within good bounds. 32-bit len is one instruction longer. Clang decided to do 64-to-32 casting with two bit shifts, instead of equivalent `w1 = w0` assignment. The former uses extra register. The latter might potentially lose some range information, but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256] after check at 18:, and shifting 32 bits left/right keeps that range intact. We should probably look into Clang's logic and see why it chooses bitshifts over sub-register assignments for this. NO-ALU32 + INT NO-ALU32 + LONG ============== =============== 64-BIT (14 insns): 64-BIT (10 insns): ------------------------------------ ------------------------------------ 17: call 115 17: call 115 18: r0 <<= 32 18: if r0 > 256 goto +6 <LBB0_4> 19: r1 = r0 19: r1 = 0 ll 20: r1 >>= 32 21: *(u64 *)(r1 + 0) = r0 21: if r1 > 256 goto +7 <LBB0_4> 22: r6 = 0 ll 22: r0 s>>= 32 24: r6 += r0 23: r1 = 0 ll 00000000000000c8 <LBB0_4>: 25: *(u64 *)(r1 + 0) = r0 25: r1 = r6 26: r6 = 0 ll 26: r2 = 256 28: r6 += r0 27: r3 = 0 ll 00000000000000e8 <LBB0_4>: 29: call 115 29: r1 = r6 30: r2 = 256 31: r3 = 0 ll 33: call 115 32-BIT (13 insns): 32-BIT (13 insns): ------------------------------------ ------------------------------------ 17: call 115 17: call 115 18: r1 = r0 18: r1 = r0 19: r1 <<= 32 19: r1 <<= 32 20: r1 >>= 32 20: r1 >>= 32 21: if r1 > 256 goto +6 <LBB1_4> 21: if r1 > 256 goto +6 <LBB1_4> 22: r2 = 0 ll 22: r2 = 0 ll 24: *(u32 *)(r2 + 0) = r0 24: *(u32 *)(r2 + 0) = r0 25: r6 = 0 ll 25: r6 = 0 ll 27: r6 += r1 27: r6 += r1 00000000000000e0 <LBB1_4>: 00000000000000e0 <LBB1_4>: 28: r1 = r6 28: r1 = r6 29: r2 = 256 29: r2 = 256 30: r3 = 0 ll 30: r3 = 0 ll 32: call 115 32: call 115 In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much superior code, as expected, eliminating unnecessary bit shifts. For 32-bit len, code is identical. So overall, only ALU-32 32-bit len case is more-or-less equivalent and the difference stems from internal Clang decision, rather than compiler lacking enough information about types. Case 2. Let's look at the simpler case of checking return result of BPF helper for errors. The code is very simple: long bla; if (bpf_probe_read_kenerl(&bla, sizeof(bla), 0)) return 1; else return 0; ALU32 + CHECK (9 insns) ALU32 + CHECK (9 insns) ==================================== ==================================== 0: r1 = r10 0: r1 = r10 1: r1 += -8 1: r1 += -8 2: w2 = 8 2: w2 = 8 3: r3 = 0 3: r3 = 0 4: call 113 4: call 113 5: w1 = w0 5: r1 = r0 6: w0 = 1 6: w0 = 1 7: if w1 != 0 goto +1 <LBB2_2> 7: if r1 != 0 goto +1 <LBB2_2> 8: w0 = 0 8: w0 = 0 0000000000000048 <LBB2_2>: 0000000000000048 <LBB2_2>: 9: exit 9: exit Almost identical code, the only difference is the use of full register assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit architectures, new BPF assembly might be slightly less optimal, in theory. But one can argue that's not a big issue, given that use of full registers is still prevalent (e.g., for parameter passing). NO-ALU32 + CHECK (11 insns) NO-ALU32 + CHECK (9 insns) ==================================== ==================================== 0: r1 = r10 0: r1 = r10 1: r1 += -8 1: r1 += -8 2: r2 = 8 2: r2 = 8 3: r3 = 0 3: r3 = 0 4: call 113 4: call 113 5: r1 = r0 5: r1 = r0 6: r1 <<= 32 6: r0 = 1 7: r1 >>= 32 7: if r1 != 0 goto +1 <LBB2_2> 8: r0 = 1 8: r0 = 0 9: if r1 != 0 goto +1 <LBB2_2> 0000000000000048 <LBB2_2>: 10: r0 = 0 9: exit 0000000000000058 <LBB2_2>: 11: exit NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit shifts. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200623032224.4020118-1-andriin@fb.com
-
Alexei Starovoitov authored
Andrii Nakryiko says: ==================== This patch set implements libbpf support for a second kind of special externs, kernel symbols, in addition to existing Kconfig externs. Right now, only untyped (const void) externs are supported, which, in C language, allow only to take their address. In the future, with kernel BTF getting type info about its own global and per-cpu variables, libbpf will extend this support with BTF type info, which will allow to also directly access variable's contents and follow its internal pointers, similarly to how it's possible today in fentry/fexit programs. As a first practical use of this functionality, bpftool gained ability to show PIDs of processes that have open file descriptors for BPF map/program/link/BTF object. It relies on iter/task_file BPF iterator program to extract this information efficiently. There was a bunch of bpftool refactoring (especially Makefile) necessary to generalize bpftool's internal BPF program use. This includes generalization of BPF skeletons support, addition of a vmlinux.h generation, extracting and building minimal subset of bpftool for bootstrapping. v2->v3: - fix sec_btf_id check (Hao); v1->v2: - docs fixes (Quentin); - dual GPL/BSD license for pid_inter.bpf.c (Quentin); - NULL-init kcfg_data (Hao Luo); rfc->v1: - show pids, if supported by kernel, always (Alexei); - switched iter output to binary to support showing process names; - update man pages; - fix few minor bugs in libbpf w.r.t. extern iteration. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add statements about bpftool being able to discover process info, holding reference to BPF map, prog, link, or BTF. Show example output as well. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-10-andriin@fb.com
-
Andrii Nakryiko authored
Add bpf_iter-based way to find all the processes that hold open FDs against BPF object (map, prog, link, btf). bpftool always attempts to discover this, but will silently give up if kernel doesn't yet support bpf_iter BPF programs. Process name and PID are emitted for each process (task group). Sample output for each of 4 BPF objects: $ sudo ./bpftool prog show 2694: cgroup_device tag 8c42dee26e8cd4c2 gpl loaded_at 2020-06-16T15:34:32-0700 uid 0 xlated 648B jited 409B memlock 4096B pids systemd(1) 2907: cgroup_skb name egress tag 9ad187367cf2b9e8 gpl loaded_at 2020-06-16T18:06:54-0700 uid 0 xlated 48B jited 59B memlock 4096B map_ids 2436 btf_id 1202 pids test_progs(2238417), test_progs(22384459) $ sudo ./bpftool map show 2436: array name test_cgr.bss flags 0x400 key 4B value 8B max_entries 1 memlock 8192B btf_id 1202 pids test_progs(2238417), test_progs(22384459) 2445: array name pid_iter.rodata flags 0x480 key 4B value 4B max_entries 1 memlock 8192B btf_id 1214 frozen pids bpftool(2239612) $ sudo ./bpftool link show 61: cgroup prog 2908 cgroup_id 375301 attach_type egress pids test_progs(2238417), test_progs(22384459) 62: cgroup prog 2908 cgroup_id 375344 attach_type egress pids test_progs(2238417), test_progs(22384459) $ sudo ./bpftool btf show 1202: size 1527B prog_ids 2908,2907 map_ids 2436 pids test_progs(2238417), test_progs(22384459) 1242: size 34684B pids bpftool(2258892) Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-9-andriin@fb.com
-
Andrii Nakryiko authored
Wrap source argument of BPF_CORE_READ family of macros into parentheses to allow uses like this: BPF_CORE_READ((struct cast_struct *)src, a, b, c); Fixes: 7db3822a ("libbpf: Add BPF_CORE_READ/BPF_CORE_READ_INTO helpers") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200619231703.738941-8-andriin@fb.com
-
Andrii Nakryiko authored
Adapt Makefile to support BPF skeleton generation beyond single profiler.bpf.c case. Also add vmlinux.h generation and switch profiler.bpf.c to use it. clang-bpf-global-var feature is extended and renamed to clang-bpf-co-re to check for support of preserve_access_index attribute, which, together with BTF for global variables, is the minimum requirement for modern BPF programs. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-7-andriin@fb.com
-
Andrii Nakryiko authored
Build minimal "bootstrap mode" bpftool to enable skeleton (and, later, vmlinux.h generation), instead of building almost complete, but slightly different (w/o skeletons, etc) bpftool to bootstrap complete bpftool build. Current approach doesn't scale well (engineering-wise) when adding more BPF programs to bpftool and other complicated functionality, as it requires constant adjusting of the code to work in both bootstrapped mode and normal mode. So it's better to build only minimal bpftool version that supports only BPF skeleton code generation and BTF-to-C conversion. Thankfully, this is quite easy to accomplish due to internal modularity of bpftool commands. This will also allow to keep adding new functionality to bpftool in general, without the need to care about bootstrap mode for those new parts of bpftool. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-6-andriin@fb.com
-
Andrii Nakryiko authored
Move functions that parse map and prog by id/tag/name/etc outside of map.c/prog.c, respectively. These functions are used outside of those files and are generic enough to be in common. This also makes heavy-weight map.c and prog.c more decoupled from the rest of bpftool files and facilitates more lightweight bootstrap bpftool variant. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-5-andriin@fb.com
-
Andrii Nakryiko authored
Validate libbpf is able to handle weak and strong kernel symbol externs in BPF code correctly. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-4-andriin@fb.com
-
Andrii Nakryiko authored
Add support for another (in addition to existing Kconfig) special kind of externs in BPF code, kernel symbol externs. Such externs allow BPF code to "know" kernel symbol address and either use it for comparisons with kernel data structures (e.g., struct file's f_op pointer, to distinguish different kinds of file), or, with the help of bpf_probe_user_kernel(), to follow pointers and read data from global variables. Kernel symbol addresses are found through /proc/kallsyms, which should be present in the system. Currently, such kernel symbol variables are typeless: they have to be defined as `extern const void <symbol>` and the only operation you can do (in C code) with them is to take its address. Such extern should reside in a special section '.ksyms'. bpf_helpers.h header provides __ksym macro for this. Strong vs weak semantics stays the same as with Kconfig externs. If symbol is not found in /proc/kallsyms, this will be a failure for strong (non-weak) extern, but will be defaulted to 0 for weak externs. If the same symbol is defined multiple times in /proc/kallsyms, then it will be error if any of the associated addresses differs. In that case, address is ambiguous, so libbpf falls on the side of caution, rather than confusing user with randomly chosen address. In the future, once kernel is extended with variables BTF information, such ksym externs will be supported in a typed version, which will allow BPF program to read variable's contents directly, similarly to how it's done for fentry/fexit input arguments. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-3-andriin@fb.com
-
Andrii Nakryiko authored
Switch existing Kconfig externs to be just one of few possible kinds of more generic externs. This refactoring is in preparation for ksymbol extern support, added in the follow up patch. There are no functional changes intended. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/bpf/20200619231703.738941-2-andriin@fb.com
-