- 15 May, 2023 2 commits
-
-
Andrii Nakryiko authored
Subsequent instruction index (subseq_idx) is an index of an instruction
that was verified/executed by the verifier after the currently processed
instruction. It is maintained during precision backtracking processing
and is used to detect various subprog calling conditions.

This patch fixes the bug with incorrectly resetting subseq_idx to -1
when going from child state to parent state during backtracking. If we
don't maintain correct subseq_idx we can misidentify subprog calls,
leading to precision tracking bugs. One such case was triggered by the
test_global_funcs/global_func9 test where a global subprog call happened
to be the very last instruction in parent state, leading to
subseq_idx == -1, triggering WARN_ONCE:

  [   36.045754] verifier backtracking bug
  [   36.045764] WARNING: CPU: 13 PID: 2073 at kernel/bpf/verifier.c:3503 __mark_chain_precision+0xcc6/0xde0
  [   36.046819] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) irqbypass(E) i2c_piix4(E) serio_raw(E) i2c_core(E) crc32c_intel)
  [   36.048040] CPU: 13 PID: 2073 Comm: test_progs Tainted: G        W  OE      6.3.0-07976-g4d585f48-dirty #972
  [   36.048783] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  [   36.049648] RIP: 0010:__mark_chain_precision+0xcc6/0xde0
  [   36.050038] Code: 3d 82 c6 05 bb 35 32 02 01 e8 66 21 ec ff 0f 0b b8 f2 ff ff ff e9 30 f5 ff ff 48 c7 c7 f3 61 3d 82 4c 89 0c 24 e8 4a 21 ec ff <0f> 0b 4c0

With the fix, precision tracking across multiple states works correctly
now:

  mark_precise: frame0: last_idx 45 first_idx 38 subseq_idx -1
  mark_precise: frame0: regs=r8 stack= before 44: (61) r7 = *(u32 *)(r10 -4)
  mark_precise: frame0: regs=r8 stack= before 43: (85) call pc+41
  mark_precise: frame0: regs=r8 stack= before 42: (07) r1 += -48
  mark_precise: frame0: regs=r8 stack= before 41: (bf) r1 = r10
  mark_precise: frame0: regs=r8 stack= before 40: (63) *(u32 *)(r10 -48) = r1
  mark_precise: frame0: regs=r8 stack= before 39: (b4) w1 = 0
  mark_precise: frame0: regs=r8 stack= before 38: (85) call pc+38
  mark_precise: frame0: parent state regs=r8 stack=: R0_w=scalar() R1_w=map_value(off=4,ks=4,vs=8,imm=0) R6=1 R7_w=scalar() R8_r=P0 R10=fpm
  mark_precise: frame0: last_idx 36 first_idx 28 subseq_idx 38
  mark_precise: frame0: regs=r8 stack= before 36: (18) r1 = 0xffff888104f2ed14
  mark_precise: frame0: regs=r8 stack= before 35: (85) call pc+33
  mark_precise: frame0: regs=r8 stack= before 33: (18) r1 = 0xffff888104f2ed10
  mark_precise: frame0: regs=r8 stack= before 32: (85) call pc+36
  mark_precise: frame0: regs=r8 stack= before 31: (07) r1 += -4
  mark_precise: frame0: regs=r8 stack= before 30: (bf) r1 = r10
  mark_precise: frame0: regs=r8 stack= before 29: (63) *(u32 *)(r10 -4) = r7
  mark_precise: frame0: regs=r8 stack= before 28: (4c) w7 |= w0
  mark_precise: frame0: parent state regs=r8 stack=: R0_rw=scalar() R6=1 R7_rw=scalar() R8_rw=P0 R10=fp0 fp-48_r=mmmmmmmm
  mark_precise: frame0: last_idx 27 first_idx 16 subseq_idx 28
  mark_precise: frame0: regs=r8 stack= before 27: (85) call pc+31
  mark_precise: frame0: regs=r8 stack= before 26: (b7) r1 = 0
  mark_precise: frame0: regs=r8 stack= before 25: (b7) r8 = 0

Note how subseq_idx starts out as -1, then is preserved as 38 and then
28 as we go up the parent state chain.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Fixes: fde2a388 ("bpf: support precision propagation in the presence of subprogs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230515180710.1535018-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
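[Editor's note: a minimal C sketch of the fix described above, assuming the backtracking loop shape in __mark_chain_precision(); the exact surrounding code is illustrative:]

  /* When popping from a child state to its parent during precision
   * backtracking, subseq_idx must become the first instruction index
   * of the state we just finished, not be reset to -1. */
  for (;;) {
          /* ... backtrack instructions in [first_idx, last_idx] ... */
          if (!st->parent)
                  break;
          subseq_idx = first_idx;   /* previously left at -1 (the bug) */
          st = st->parent;
          last_idx = st->last_insn_idx;
          first_idx = st->first_insn_idx;
  }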
-
Dave Marchevsky authored
For kfuncs like bpf_obj_drop and bpf_refcount_acquire - which take
user-defined types as input - the verifier needs to track the specific
type passed in when checking a particular kfunc call. This requires
tracking a (btf, btf_id) tuple. In commit 7c50b1cb ("bpf: Add
bpf_refcount_acquire kfunc") I added an anonymous union with inner
structs named after the specific kfuncs tracking this information, with
the goal of making it more obvious which kfunc this data was being
tracked / expected to be tracked on behalf of.

In a recent series adding a new user of this tuple, Alexei mentioned
that he didn't like this union usage as it doesn't really help with
readability or bug-proofing ([0]). In an offline convo we agreed to
have the tuple be fields (arg_btf, arg_btf_id), with comments in the
bpf_kfunc_call_arg_meta definition enumerating the uses of the fields
by kfunc-specific handling logic. Such a pattern is used by struct
bpf_reg_state without trouble.

Accordingly, this patch removes the anonymous union in favor of arg_btf
and arg_btf_id fields and a comment enumerating their current uses. The
patch also removes struct btf_and_id, which was only being used by the
removed union's inner structs. This is a mechanical change; existing
linked_list and rbtree tests will validate that the correct (btf,
btf_id) are being passed.

[0]: https://lore.kernel.org/bpf/20230505021707.vlyiwy57vwxglbka@dhcp-172-26-102-232.dhcp.thefacebook.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230510213047.1633612-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
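[Editor's note: a sketch of the resulting fields, assuming the layout the commit describes for struct bpf_kfunc_call_arg_meta; the surrounding members are elided:]

  struct bpf_kfunc_call_arg_meta {
          /* ... */
          /* arg_btf and arg_btf_id are used by kfunc-specific handling,
           * generally to pass info about user-defined local kptr types
           * to later verification logic:
           *   bpf_obj_drop         - record type of local kptr being dropped
           *   bpf_refcount_acquire - record type of refcounted local kptr
           */
          struct btf *arg_btf;
          u32 arg_btf_id;
  };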
-
- 14 May, 2023 1 commit
-
-
Martin KaFai Lau authored
Stanislav Fomichev says:

====================
optval larger than PAGE_SIZE leads to EFAULT if the BPF program isn't
careful enough. This is often overlooked and might break completely
unrelated socket options. Instead of EFAULT, let's ignore BPF program
buffer changes. See the first patch for more info.

In addition, clearly document this corner case and reset optlen in our
selftests (in case somebody copy-pastes from them).

v6:
- no changes; resending due to screwing up v5 series with the unrelated patch

v5:
- goto in the selftest (Martin)
- set IP_TOS to zero to avoid endianness complications (Martin)

v4:
- ignore retval as well when optlen > PAGE_SIZE (Martin)

v3:
- don't hard-code PAGE_SIZE (Martin)
- reset orig_optlen in getsockopt when kernel part succeeds (Martin)
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
- 13 May, 2023 4 commits
-
-
Stanislav Fomichev authored
And add examples for how to correctly handle large optlens. This is
less relevant now that we don't EFAULT anymore, but it's still the
correct thing to do.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-5-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Stanislav Fomichev authored
Even though it's not relevant in selftests, people might still
copy-paste from them. So let's take care of optlen > 4096 cases
explicitly.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-4-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Stanislav Fomichev authored
Instead of assuming EFAULT, let's assume the BPF program's output is
ignored. Remove the "getsockopt: deny arbitrary ctx->retval" test
because it was actually testing optlen; we have a separate set of tests
for retval.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Stanislav Fomichev authored
With the way the hooks are implemented right now, we have a special
condition: optval larger than PAGE_SIZE will expose only the first 4k
into BPF; any modifications to the optval are ignored. If the BPF
program doesn't handle this condition by resetting optlen to 0, the
userspace will get EFAULT.

The intention of the EFAULT was to make it apparent to developers that
the program is doing something wrong. However, this inadvertently might
affect production workloads with BPF programs that are not too careful
(i.e., returning EFAULT for perfectly valid setsockopt/getsockopt
calls). Let's try to minimize the chance of a BPF program screwing up
userspace by ignoring the output of those BPF programs (instead of
returning EFAULT to the userspace). pr_info_once() those cases to dmesg
to help with figuring out what's going wrong.

Fixes: 0d01da6a ("bpf: implement getsockopt and setsockopt hooks")
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
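[Editor's note: a sketch of the defensive pattern this series documents for sockopt programs, shown for a getsockopt hook; the 4096 literal stands in for the kernel's PAGE_SIZE and the function name is illustrative:]

  SEC("cgroup/getsockopt")
  int getsockopt_large_optval(struct bpf_sockopt *ctx)
  {
          /* An optval larger than a page is only partially exposed to
           * BPF. Reset optlen to 0 so the kernel-produced value passes
           * through to userspace untouched, instead of risking a
           * truncated rewrite. */
          if (ctx->optlen > 4096 /* PAGE_SIZE, assumed */) {
                  ctx->optlen = 0;
                  return 1;
          }

          /* ... inspect/modify small optvals as usual ... */
          return 1;
  }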
-
- 12 May, 2023 3 commits
-
-
Andrii Nakryiko authored
It seems like __builtin_offsetof() doesn't preserve CO-RE field
relocations properly. So if the offsetof() macro is defined through
__builtin_offsetof(), CO-RE-enabled BPF code using container_of() will
be subtly and silently broken. To avoid this problem, redefine
offsetof() and container_of() in the form that works with CO-RE
relocations more reliably.

Fixes: 5fbc2208 ("tools/libpf: Add offsetof/container_of macro in bpf_helpers.h")
Reported-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230509065502.2306180-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
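[Editor's note: a sketch of the CO-RE-friendly redefinitions, assuming the classic null-pointer offsetof idiom replaces the compiler builtin, as the commit describes:]

  /* offsetof() computed without __builtin_offsetof(), so that CO-RE
   * field relocations on MEMBER survive. */
  #ifndef offsetof
  #define offsetof(TYPE, MEMBER)  ((unsigned long)&((TYPE *)0)->MEMBER)
  #endif

  #ifndef container_of
  #define container_of(ptr, type, member)                         \
          ({                                                      \
                  void *__mptr = (void *)(ptr);                   \
                  ((type *)(__mptr - offsetof(type, member)));    \
          })
  #endif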
-
Martin KaFai Lau authored
KCSAN reported a data-race when accessing node->ref. Although node->ref
does not have to be accurate, take this chance to use the more common
READ_ONCE() and WRITE_ONCE() pattern instead of data_race(). There is
an existing bpf_lru_node_is_ref() and bpf_lru_node_set_ref(). This
patch also adds bpf_lru_node_clear_ref() to do the
WRITE_ONCE(node->ref, 0) as well.

  ==================================================================
  BUG: KCSAN: data-race in __bpf_lru_list_rotate / __htab_lru_percpu_map_update_elem

  write to 0xffff888137038deb of 1 bytes by task 11240 on cpu 1:
   __bpf_lru_node_move kernel/bpf/bpf_lru_list.c:113 [inline]
   __bpf_lru_list_rotate_active kernel/bpf/bpf_lru_list.c:149 [inline]
   __bpf_lru_list_rotate+0x1bf/0x750 kernel/bpf/bpf_lru_list.c:240
   bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:329 [inline]
   bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
   bpf_lru_pop_free+0x638/0xe20 kernel/bpf/bpf_lru_list.c:499
   prealloc_lru_pop kernel/bpf/hashtab.c:290 [inline]
   __htab_lru_percpu_map_update_elem+0xe7/0x820 kernel/bpf/hashtab.c:1316
   bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
   bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
   generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
   bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
   __sys_bpf+0x338/0x810
   __do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
   __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

  read to 0xffff888137038deb of 1 bytes by task 11241 on cpu 0:
   bpf_lru_node_set_ref kernel/bpf/bpf_lru_list.h:70 [inline]
   __htab_lru_percpu_map_update_elem+0x2f1/0x820 kernel/bpf/hashtab.c:1332
   bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
   bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
   generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
   bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
   __sys_bpf+0x338/0x810
   __do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
   __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

  value changed: 0x01 -> 0x00

  Reported by Kernel Concurrency Sanitizer on:
  CPU: 0 PID: 11241 Comm: syz-executor.3 Not tainted 6.3.0-rc7-syzkaller-00136-g6a66fdd2 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023
  ==================================================================

Reported-by: syzbot+ebe648a84e8784763f82@syzkaller.appspotmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230511043748.1384166-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
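[Editor's note: a sketch of the accessors after this change, using the helper names from the commit message; exact signatures are assumptions:]

  static inline bool bpf_lru_node_is_ref(const struct bpf_lru_node *node)
  {
          return READ_ONCE(node->ref);
  }

  static inline void bpf_lru_node_set_ref(struct bpf_lru_node *node)
  {
          WRITE_ONCE(node->ref, 1);
  }

  /* New helper: pairs with the WRITE_ONCE() pattern above. */
  static inline void bpf_lru_node_clear_ref(struct bpf_lru_node *node)
  {
          WRITE_ONCE(node->ref, 0);
  }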
-
Alan Maguire authored
v1.25 of pahole supports filtering out functions with multiple
inconsistent function prototypes or optimized-out parameters from the
BTF representation. These present problems because there is no
additional info in BTF saying which inconsistent prototype matches
which function instance to help guide attachment, and functions with
optimized-out parameters can lead to incorrect assumptions about
register contents.

So for now, filter out such functions while adding BTF representations
for functions that have "."-suffixes (foo.isra.0) but no optimized-out
parameters. This patch assumes that the below-linked changes land in
pahole for v1.25. Issues with pahole filtering being too aggressive in
removing functions appear to be resolved now, but CI and further
testing will confirm.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230510130241.1696561-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 06 May, 2023 9 commits
-
-
Alexei Starovoitov authored
Daniel Rosenberg says:

====================
These patches relax a few verifier requirements around dynptrs.

Patches 1-3 are unchanged from v2, apart from rebasing.
Patch 4 is the same as in v1, see
https://lore.kernel.org/bpf/CA+PiJmST4WUH061KaxJ4kRL=fqy3X6+Wgb2E2rrLT5OYjUzxfQ@mail.gmail.com/
Patch 5 adds a test for the change in Patch 4.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Rosenberg authored
This ensures that buffers retrieved from dynptr_data are allowed to be
passed in to helpers that take mem, like bpf_strncmp.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-6-drosen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Rosenberg authored
This allows using memory retrieved from dynptrs with helper functions
that accept ARG_PTR_TO_MEM. For instance, results from bpf_dynptr_data
can be passed along to bpf_strncmp.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-5-drosen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Rosenberg authored
This ensures we still reject invalid memory accesses in buffers that
are marked optional.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-4-drosen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Rosenberg authored
bpf_dynptr_slice(_rw) no longer requires a buffer for verification. If
the buffer is needed, but not present, the function will return NULL.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-3-drosen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Daniel Rosenberg authored
bpf_dynptr_slice(_rw) uses a user-provided buffer if it can not provide
a pointer to a block of contiguous memory. This buffer is unused in the
case of local dynptrs, and may be unused in other cases as well. There
is no need to require the buffer, as the kfunc can just return NULL if
it was needed and not provided.

This adds another kfunc annotation, __opt, which combines with __sz and
__szk to allow the buffer associated with the size to be NULL. If the
buffer is NULL, the verifier does not check that the buffer is of
sufficient size.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-2-drosen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
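[Editor's note: a sketch of what the annotated kfunc signatures look like with __opt combined with __szk, per the convention this patch describes; treat the exact prototypes as illustrative:]

  /* buffer__opt may be NULL; if so, the verifier skips the
   * buffer__szk size check and the kfunc returns NULL when a
   * bounce buffer would have been required. */
  void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset,
                         void *buffer__opt, u32 buffer__szk);
  void *bpf_dynptr_slice_rw(const struct bpf_dynptr *ptr, u32 offset,
                            void *buffer__opt, u32 buffer__szk);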
-
Alexei Starovoitov authored
Feng zhou says:

====================
When tracing sched-related functions such as enqueue_task_fair, it is
necessary to check whether a specified task, rather than the current
task, is within a given cgroup.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Feng Zhou authored
test_progs: Tests the new kfunc bpf_task_under_cgroup(). The BPF
program saves the new task's pid within a given cgroup to remote_pid,
which is convenient for the user-mode program to verify the test
correctness. The user-mode program creates its own mount namespace,
mounts the cgroupsv2 hierarchy in there, calls the fork syscall, then
checks whether remote_pid and local_pid are unequal.

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230506031545.35991-3-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Feng Zhou authored
Add a kfunc that's similar to bpf_current_task_under_cgroup; the
difference is that it operates on a designated task. When hooking
sched-related functions, it is sometimes necessary to check a specified
task instead of the current task.

Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230506031545.35991-2-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
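[Editor's note: a hedged usage sketch checking a freshly created task (not the current one) against a cgroup; the attach point is illustrative, and cgrp_id/remote_pid are assumed globals set up by user space:]

  SEC("tp_btf/task_newtask")
  int BPF_PROG(on_newtask, struct task_struct *task, u64 clone_flags)
  {
          struct cgroup *cgrp;

          cgrp = bpf_cgroup_from_id(cgrp_id);
          if (!cgrp)
                  return 0;

          /* Check the *new* task, not the task running this program. */
          if (bpf_task_under_cgroup(task, cgrp))
                  remote_pid = task->pid;

          bpf_cgroup_release(cgrp);
          return 0;
  }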
-
- 05 May, 2023 14 commits
-
-
Pengcheng Yang authored
Using sizeof(nv) or strlen(nv)+1 is correct.

Fixes: c890063e ("bpf: sample BPF_SOCKET_OPS_BASE_RTT program")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Link: https://lore.kernel.org/r/1683276658-2860-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
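[Editor's note: a hedged illustration of the correct length calculation when setting TCP_CONGESTION from a sockops program; the buffer name and surrounding code are assumptions, not the sample's actual source:]

  char nv[] = "nv";  /* an array, so sizeof() counts the NUL too */

  /* Pass the string length including its NUL terminator:
   * sizeof(nv) == strlen(nv) + 1 == 3 here; either is a correct optlen. */
  rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION, nv, sizeof(nv));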
-
Will Hawkins authored
Correct a few typographical errors and fix some mistakes in examples.

Signed-off-by: Will Hawkins <hawkinsw@obs.cr>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230428023015.1698072-2-hawkinsw@obs.cr
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Andrii Nakryiko says:

====================
As more and more real-world BPF programs become more complex and
increasingly use subprograms (both static and global), scalar precision
tracking and its (previously weak) support for BPF subprograms (and
callbacks as a special case of that) is becoming more and more of an
issue and limitation. Couple that with increasing reliance on state
equivalence (BPF open-coded iterators have a hard requirement for state
equivalence to converge and successfully validate loops), and it
becomes pretty critical to address this limitation and make precision
tracking universally supported for BPF programs of any complexity and
composition.

This patch set teaches the BPF verifier to support SCALAR precision
backpropagation across multiple frames (for subprogram calls and
callback simulations) and addresses most practical situations (SCALAR
stack loads/stores using registers other than r10 being the last
remaining limitation, though thankfully rarely used in practice).

Main logic is explained in detail in patch #8. The rest are preliminary
preparations, refactorings, clean ups, and fixes. See respective
patches for details. Patch #8 has also a veristat comparison of results
for selftests, Cilium, and some of Meta production BPF programs before
and after these changes.

v2->v3:
- drop bitcnt and ifs from bt_xxx() helpers (Alexei);

v1->v2:
- addressed review feedback from Alexei, adjusted commit messages,
  comments, added verbose(), WARN_ONCE(), etc;
- re-ran all the tests and veristat on selftests, cilium, and
  meta-internal code: no new changes and no kernel warnings.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Now that precision propagation is fully supported in the presence of
subprogs, there is no need to work around the iter test. Revert the
original workaround.

This reverts be7dbd27 ("selftests/bpf: avoid mark_all_scalars_precise()
trigger in one of iter tests").

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-11-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add a bunch of tests validating verifier's precision backpropagation
logic in the presence of subprog calls and/or callback-calling
helpers/kfuncs. We validate the following conditions:
  - subprog_result_precise: static subprog r0 result precision handling;
  - global_subprog_result_precise: global subprog r0 precision
    shortcutting, similar to BPF helper handling;
  - callback_result_precise: similarly r0 marking precise for
    callback-calling helpers;
  - parent_callee_saved_reg_precise, parent_callee_saved_reg_precise_global:
    propagation of precision for callee-saved registers bypassing
    static/global subprogs;
  - parent_callee_saved_reg_precise_with_callback: same as above, but in
    the presence of callback-calling helper;
  - parent_stack_slot_precise, parent_stack_slot_precise_global: similar
    to above, but instead propagating precision of stack slot (spilled
    SCALAR reg);
  - parent_stack_slot_precise_with_callback: same as above, but in the
    presence of callback-calling helper;
  - subprog_arg_precise: propagation of precision of static subprog's
    input argument back to caller;
  - subprog_spill_into_parent_stack_slot_precise: negative test
    validating that verifier currently can't support backtracking of
    stack access with a non-r10 register; we validate that we fall back
    to forcing precision for all SCALARs.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-10-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add support for precision backtracking in the presence of subprogram
frames in jump history. This means supporting a few different kinds of
subprogram invocation situations, all requiring slightly different
handling in precision backtracking logic:
  - static subprogram calls;
  - global subprogram calls;
  - callback-calling helpers/kfuncs.

For each of those we need to handle a few precision propagation cases:
  - what to do with precision of subprog returns (r0);
  - what to do with precision of input arguments;
  - for all of them, callee-saved registers in the caller function
    should be propagated ignoring the subprog/callback part of jump
    history.

N.B. Async callback-calling helpers (currently only
bpf_timer_set_callback()) are transparent to all this because they set
up a separate async callback environment and thus the callback's
history is not shared with the main program's history. So as far as all
the changes in this commit go, such a helper is just a regular helper.

Let's look at all these situations in more detail. Let's start with a
static subprogram being called, using an excerpt of a simple main
program and its static subprog, indenting the subprog's frame slightly
to make everything clear.

  frame 0                          frame 1                  precision set
  =======                          =======                  =============

   9: r6 = 456;
  10: r1 = 123;                                             fr0: r6
  11: call pc+10;                                           fr0: r1, r6
                                   22: r0 = r1;             fr0: r6; fr1: r1
                                   23: exit                 fr0: r6; fr1: r0
  12: r1 = <map_pointer>                                    fr0: r0, r6
  13: r1 += r0;                                             fr0: r0, r6
  14: r1 += r6;                                             fr0: r6
  15: exit

As can be seen above, the main function is passing 123 as a single
argument to an identity (`return x;`) subprog. The returned value is
used to adjust a map pointer offset, which forces r0 to be marked as
precise. Then instruction #14 does the same for callee-saved r6, which
will have to be backtracked all the way to instruction #9. For brevity,
precision sets for instructions #13 and #14 are combined in the diagram
above.

First, for subprog calls, r0 returned from subprog (in frame 0) has to
go into subprog's frame 1, and should be cleared from frame 0. So we go
back into subprog's frame knowing we need to mark r0 precise. We then
see that insn #22 sets r0 from r1, so now we care about marking r1
precise. When we pop up from subprog's frame back into caller at
insn #11 we keep r1, as it's an argument-passing register, so we
eventually find `10: r1 = 123;` and satisfy the precision propagation
chain for insn #13.

This example demonstrates two sets of rules:
  - r0 returned after a subprog call has to be moved into subprog's r0
    set;
  - *static* subprog arguments (r1-r5) are moved back to the caller's
    precision set.

Let's look at what happens with callee-saved precision propagation.
Insn #14 marks r6 as precise. When we get into subprog's frame, we keep
r6 in frame 0's precision set *only*. Subprog itself has its own set of
independent r6-r10 registers and is not affected. When we eventually
make our way out of subprog frame we keep r6 in the precision set until
we reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is
perhaps the simplest aspect: it always stays in its original frame.

That's pretty much all we have to do to support precision propagation
across *static subprog* invocation. Let's look at what happens when we
have a global subprog invocation.

  frame 0                          frame 1                  precision set
  =======                          =======                  =============

   9: r6 = 456;
  10: r1 = 123;                                             fr0: r6
  11: call pc+10; # global subprog                          fr0: r6
  12: r1 = <map_pointer>                                    fr0: r0, r6
  13: r1 += r0;                                             fr0: r0, r6
  14: r1 += r6;                                             fr0: r6
  15: exit

Starting from insn #13, r0 has to be precise. We backtrack all the way
to insn #11 (call pc+10) and see that the subprog is global, so it was
already validated in isolation. As opposed to a static subprog, a
global subprog always returns an unknown scalar r0, so that satisfies
precision propagation and we drop r0 from the precision set. We are
done for insn #13.

Now for insn #14. r6 is in the precision set, and we backtrack to
`call pc+10;`. Here we need to recognize that this is effectively both
exit and entry to the global subprog, which means we stay in the
caller's frame. So we carry on with r6 still in the precision set,
until we satisfy it at insn #9. The only hard part with global subprogs
is just knowing when it's a global func.

Lastly, callback-calling helpers and kfuncs do simulate subprog calls,
so jump history will have subprog instructions in between the caller
program's instructions, but the rules of propagating r0 and r1-r5
differ, because we don't actually directly call the callback. We
actually call the helper/kfunc, which at runtime will call the subprog,
so the only difference from normal helper/kfunc handling is that we
need to make sure to skip the callback-simulating part of jump history.

Let's look at an example to make this clearer.

  frame 0                          frame 1                  precision set
  =======                          =======                  =============

   8: r6 = 456;
   9: r1 = 123;                                             fr0: r6
  10: r2 = &callback;                                       fr0: r6
  11: call bpf_loop;                                        fr0: r6
                                   22: r0 = r1;             fr0: r6; fr1:
                                   23: exit                 fr0: r6; fr1:
  12: r1 = <map_pointer>                                    fr0: r0, r6
  13: r1 += r0;                                             fr0: r0, r6
  14: r1 += r6;                                             fr0: r6
  15: exit

Again, insn #13 forces r0 to be precise. As soon as we get to
`23: exit` we see that this isn't actually a static subprog call (it's
a `call bpf_loop;` helper call instead). So we clear r0 from the
precision set.

For a callee-saved register, there is no difference: it stays in
frame 0's precision set, we go through insns #22 and #23, ignoring them
until we get back to caller frame 0, eventually satisfying precision
backtrack logic at insn #8 (`r6 = 456;`).

Assuming the callback needed to set r0 as precise at insn #23, we'd
backtrack to insn #22, switching from r0 to r1, and then at the point
when we pop back to frame 0 at insn #11, we'll clear r1-r5 from the
precision set, as we don't really do a subprog call directly, so there
is no input argument precision propagation.

That's pretty much it. With these changes, it seems like the only
still-unsupported situation for precision backpropagation is the case
when a program is accessing stack through registers other than r10.
This is still left as an unsupported (though rare) case for now.

As for results. For selftests, a few positive changes for bigger
programs, cls_redirect in its dynptr variant benefitting the most:

  [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0'
  File                                      Program        Insns (A)  Insns (B)  Insns     (DIFF)
  ----------------------------------------  -------------  ---------  ---------  ----------------
  pyperf600_bpf_loop.bpf.linked1.o          on_event            2060       2002      -58 (-2.82%)
  test_cls_redirect_dynptr.bpf.linked1.o    cls_redirect       15660       2914  -12746 (-81.39%)
  test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect       61620      59088    -2532 (-4.11%)
  xdp_synproxy_kern.bpf.linked1.o           syncookie_tc      109980      86278  -23702 (-21.55%)
  xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp      97716      85147  -12569 (-12.86%)

Cilium programs don't really regress. They don't use subprogs and are
mostly unaffected, but some other fixes and improvements could have
changed something. This doesn't appear to be the case:

  [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0'
  File           Program                         Insns (A)  Insns (B)  Insns (DIFF)
  -------------  ------------------------------  ---------  ---------  ------------
  bpf_host.o     tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
  bpf_lxc.o      tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
  bpf_overlay.o  tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
  bpf_xdp.o      tail_handle_nat_fwd_ipv6            12475      12504  +29 (+0.23%)
  bpf_xdp.o      tail_nodeport_nat_ingress_ipv6       6363       6371   +8 (+0.13%)

Looking at (somewhat anonymized) Meta production programs, we see
mostly insignificant variation in number of instructions, with one
program (syar_bind6_protect6) benefitting the most at -17%.

  [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0'
  Program                   Insns (A)  Insns (B)  Insns     (DIFF)
  ------------------------  ---------  ---------  ----------------
  on_request_context_event        597        585      -12 (-2.01%)
  read_async_py_stack           43789      43657     -132 (-0.30%)
  read_sync_py_stack            35041      37599    +2558 (+7.30%)
  rrm_usdt                        946        940       -6 (-0.63%)
  sysarmor_inet6_bind           28863      28249     -614 (-2.13%)
  sysarmor_inet_bind            28845      28240     -605 (-2.10%)
  syar_bind4_protect4          154145     147640    -6505 (-4.22%)
  syar_bind6_protect6          165242     137088  -28154 (-17.04%)
  syar_task_exit_setgid         21289      19720    -1569 (-7.37%)
  syar_task_exit_setuid         21290      19721    -1569 (-7.37%)
  do_uprobe                     19967      19413     -554 (-2.77%)
  tw_twfw_ingress              215877     204833   -11044 (-5.12%)
  tw_twfw_tc_in                215877     204833   -11044 (-5.12%)

But checking duration (wall clock) differences, that is the actual time
taken by the verifier to validate programs, we see sometimes dramatic
improvements, all the way to about 16x:

  [vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20
  Program                      Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
  ---------------------------  -----------------  -----------------  --------------------
  tw_twfw_ingress                        4488374             272836    -4215538 (-93.92%)
  tw_twfw_tc_in                          4339111             268175    -4070936 (-93.82%)
  tw_twfw_egress                         3521816             270751    -3251065 (-92.31%)
  tw_twfw_tc_eg                          3472878             284294    -3188584 (-91.81%)
  balancer_ingress                        343119             291391      -51728 (-15.08%)
  syar_bind6_protect6                      78992              64782      -14210 (-17.99%)
  ttls_tc_ingress                          11739               8176       -3563 (-30.35%)
  kprobe__security_inode_link              13864              11341       -2523 (-18.20%)
  read_sync_py_stack                       21927              19442       -2485 (-11.33%)
  read_async_py_stack                      30444              28136       -2308 (-7.58%)
  syar_task_exit_setuid                    10256               8440       -1816 (-17.71%)

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
When precision backtracking bails out due to some unsupported sequence
of instructions (e.g., stack access through a register other than r10),
we need to mark all SCALAR registers as precise to be safe. Currently,
though, we mark SCALARs precise only starting from the state in which
we detected the unsupported condition, which could be one of the parent
states of the actual current state. This will leave some registers
potentially not marked as precise, even though they should be. So make
sure we start marking scalars as precise from the current state
(env->cur_state).

Further, we don't currently detect the situation when we end up with
some stack slots marked as needing precision, but we ran out of
available states to find the instructions that populate those stack
slots. This is akin to the `i >= func->allocated_stack / BPF_REG_SIZE`
check and should be handled similarly by falling back to marking all
SCALARs precise. Add this check when we run out of states.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Fix propagate_precision() logic to perform propagation of all necessary
registers and stack slots across all active frames *in one batch step*.
Doing this for each register/slot in each individual frame is wasteful,
but the main problem is that backtracking of instructions in any frame
except the deepest one just doesn't work. This is due to backtracking
logic relying on jump history, and available jump history always starts
(or ends, depending how you view it) in the current frame. So, if prog A
(frame #0) called subprog B (frame #1) and we need to propagate
precision of, say, register R6 (callee-saved) within frame #0, we
actually don't even know where the jump history that corresponds to
prog A even starts. We'd need to skip the subprog part of jump history
first to be able to do this.

Luckily, with struct backtrack_state and __mark_chain_precision()
handling bitmask tracking/propagation across all active frames at the
same time (added in the previous patch), propagate_precision() can be
both fixed and sped up by setting all the necessary bits across all
frames and then performing one __mark_chain_precision() pass. This
makes it unnecessary to skip subprog parts of jump history.

We also improve logging along the way, to clearly specify which
registers' and slots' precision markings are propagated within which
frame. Each frame will have a dedicated line, and all registers and
stack slots from that frame will be reported in a format similar to
precision backtrack regs/stack logging. E.g.:

  frame 1: propagating r1,r2,r3,fp-8,fp-16
  frame 0: propagating r3,r9,fp-120

Fixes: 529409ea ("bpf: propagate precision across all frames, not just the last one")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Teach __mark_chain_precision logic to maintain register/stack masks
across all active frames when going from child state to parent state.
Currently this should be mostly a no-op, as precision backtracking
usually bails out when encountering subprog entry/exit.

It's not very apparent from the diff due to increased indentation, but
the logic remains the same, except everything is done on a specific
`fr` frame index. Calls to bt_clear_reg() and bt_clear_slot() are
replaced with frame-specific bt_clear_frame_reg() and
bt_clear_frame_slot(), where the frame index is passed explicitly,
instead of using the current frame number.

We also adjust logging to emit the affected frame number. And we also
add better logging of human-readable register and stack slot masks,
similar to the previous patch.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add a helper to format register and stack masks in a more
human-readable format. Adjust logging a bit during backtrack
propagation, and especially during forcing-precision fallback logic, to
make it clearer what's going on (with log_level=2, of course), and also
start reporting the affected frame depth. This is in preparation for
having more than one active frame later, when precision propagation
between subprog calls is added.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add struct backtrack_state and a straightforward API around it to keep
track of register and stack masks used and maintained during the
precision backtracking process. Having this logic separate keeps the
high-level backtracking algorithm cleaner, and it also sets us up to
cleanly keep track of register and stack masks per frame, allowing
(with some further logic adjustments) precision backpropagation across
multiple frames (i.e., subprog calls).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
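[Editor's note: a sketch of the per-frame mask bookkeeping, assuming field and helper names matching the commit description; exact shapes are illustrative:]

  struct backtrack_state {
          struct bpf_verifier_env *env;
          u32 frame;                          /* frame being backtracked */
          u32 reg_masks[MAX_CALL_FRAMES];     /* regs needing precision, per frame */
          u64 stack_masks[MAX_CALL_FRAMES];   /* stack slots needing precision */
  };

  static inline void bt_set_frame_reg(struct backtrack_state *bt,
                                      u32 frame, u32 reg)
  {
          bt->reg_masks[frame] |= 1 << reg;
  }

  static inline void bt_clear_frame_reg(struct backtrack_state *bt,
                                        u32 frame, u32 reg)
  {
          bt->reg_masks[frame] &= ~(1 << reg);
  }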
-
Andrii Nakryiko authored
When handling instructions that read register slots, mark relevant
stack slots as scratched so that the verifier log would contain those
slots' states, in addition to currently emitted registers with stack
slot offsets.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Sometimes during debugging it's important that a BPF program is loaded
with the BPF_F_TEST_STATE_FREQ flag set to force the verifier to do
frequent state checkpointing. Teach veristat to do this when the
-t ("test state") flag is specified.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
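[Editor's note: a hypothetical invocation; the object file name and stat selection are illustrative, only the -t flag comes from the commit:]

  # load programs with BPF_F_TEST_STATE_FREQ set, collecting the usual stats
  $ veristat -t -e file,prog,insns my_progs.bpf.o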
-
Kenjiro Nakayama authored
To make the comments about the arc and riscv arches in bpf_tracing.h
accurate, this patch fixes the comment about arc and adds a comment for
riscv.

Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230504035443.427927-1-nakayamakenjiro@gmail.com
-
- 02 May, 2023 2 commits
-
-
Kui-Feng Lee authored
Only print the warning message if the user is writing to
"/proc/sys/kernel/unprivileged_bpf_disabled". The kernel may print an
annoying warning when you merely read
"/proc/sys/kernel/unprivileged_bpf_disabled", saying:

  WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks
  possible via Spectre v2 BHB attacks!

However, this message is only meaningful when the feature is actually
being disabled or enabled, i.e., on a write.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230502181418.308479-1-kuifeng@meta.com
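[Editor's note: an illustrative shell session on a kernel with this change; the sysctl value shown is an example:]

  # reading the sysctl no longer triggers the warning
  $ cat /proc/sys/kernel/unprivileged_bpf_disabled
  2
  # only an actual write, which changes the setting, may warn (requires root)
  $ echo 0 > /proc/sys/kernel/unprivileged_bpf_disabled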
-
Yonghong Song authored
In one of our internal tests, we found a case where:
  - the uapi struct bpf_tcp_sock is in vmlinux.h, where vmlinux.h is
    not generated from the testing kernel
  - struct bpf_tcp_sock is not in the vmlinux BTF

The above combination caused a bpf load failure, as the following
memory access

  struct bpf_tcp_sock *tcp_sock = ...;
  ... tcp_sock->snd_cwnd ...

needs a CO-RE relocation, but the relocation cannot be resolved since
the kernel BTF does not have the corresponding type. Similar to other
previous cases (nf_conn___init, tcp6_sock, mptcp_sock, etc.), add the
type to vmlinux BTF with the BTF_TYPE_EMIT macro.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230502180543.1832140-1-yhs@fb.com
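[Editor's note: the emission itself is a one-liner; a sketch assuming placement analogous to the earlier tcp6_sock/mptcp_sock cases:]

  /* Force a vmlinux BTF entry for the uapi type so CO-RE relocations
   * against it can be resolved; BTF_TYPE_EMIT keeps a type reference
   * without generating any code. */
  BTF_TYPE_EMIT(struct bpf_tcp_sock);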
-
- 01 May, 2023 4 commits
-
-
Andrii Nakryiko authored
Stephen Veiss says:

====================
BPF selftests have ALLOWLIST and DENYLIST files, used to control which
tests are run in CI. These files are currently parsed by a shell
script. [1]

This patchset allows those files to be specified directly on the
test_progs command line (eg, as -a @ALLOWLIST).

This also fixes a bug in the existing test filter code causing
unnecessary duplicate top-level test filter entries to be created.

[1] https://github.com/kernel-patches/vmtest/blob/57feb460047b69f891cf4afe3cc860794a2ced17/ci/vmtest/run_selftests.sh#L21-L27

---
v2:
- error handling style changes per reviewer comments
- fdopen return value checking in test_parse_test_list_file

v1: https://lore.kernel.org/bpf/20230425225401.1075796-1-sveiss@meta.com/
====================

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
-
Stephen Veiss authored
Improve test selection logic when using -a/-b/-d/-t options. The list
of tests to include or exclude can now be read from a file, specified
as @<filename>. The file contains one name (or wildcard pattern) per
line, and comments beginning with # are ignored. These options can be
passed multiple times to read more than one file.

Signed-off-by: Stephen Veiss <sveiss@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230427225333.3506052-3-sveiss@meta.com
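[Editor's note: a hypothetical allowlist file and invocation; the file name and test names are illustrative:]

  # ALLOWLIST: one test name or wildcard pattern per line
  # lines starting with '#' are comments
  map_kptr
  verif_scale*

  # pass it (and a denylist) on the test_progs command line:
  $ ./test_progs -a @ALLOWLIST -d @DENYLIST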
-
Stephen Veiss authored
Split the logic to insert new tests into test filter sets out from
parse_test_list. Fix the subtest insertion logic to reuse an existing
top-level test filter, which prevents the creation of duplicate
top-level test filters, each with a single subtest.

Signed-off-by: Stephen Veiss <sveiss@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230427225333.3506052-2-sveiss@meta.com
-
Martin KaFai Lau authored
The btf_dump/struct_data selftest is failing with:

  [...]
  test_btf_dump_struct_data:FAIL:unexpected return value dumping fs_context unexpected unexpected return value dumping fs_context: actual -7 != expected 264
  [...]

The reason is in btf_dump_type_data_check_overflow(). It does not use
BTF_MEMBER_BITFIELD_SIZE from the struct's member (btf_member).
Instead, it is using the enum size, which is 4. It had been working
until the recent commit 4e04143c ("fs_context: drop the unused
lsm_flags member") removed an integer member, which also removed the
4-byte padding at the end of fs_context. Missing this 4-byte padding
exposed the bug. In particular, when btf_dump_type_data_check_overflow()
reaches the member 'phase', -E2BIG is returned.

The fix is to pass bit_sz to btf_dump_type_data_check_overflow(), which
then performs a different size check when bit_sz is not zero.

The current fs_context:

  [3600] ENUM 'fs_context_purpose' encoding=UNSIGNED size=4 vlen=3
          'FS_CONTEXT_FOR_MOUNT' val=0
          'FS_CONTEXT_FOR_SUBMOUNT' val=1
          'FS_CONTEXT_FOR_RECONFIGURE' val=2
  [3601] ENUM 'fs_context_phase' encoding=UNSIGNED size=4 vlen=7
          'FS_CONTEXT_CREATE_PARAMS' val=0
          'FS_CONTEXT_CREATING' val=1
          'FS_CONTEXT_AWAITING_MOUNT' val=2
          'FS_CONTEXT_AWAITING_RECONF' val=3
          'FS_CONTEXT_RECONF_PARAMS' val=4
          'FS_CONTEXT_RECONFIGURING' val=5
          'FS_CONTEXT_FAILED' val=6
  [3602] STRUCT 'fs_context' size=264 vlen=21
          'ops' type_id=3603 bits_offset=0
          'uapi_mutex' type_id=235 bits_offset=64
          'fs_type' type_id=872 bits_offset=1216
          'fs_private' type_id=21 bits_offset=1280
          'sget_key' type_id=21 bits_offset=1344
          'root' type_id=781 bits_offset=1408
          'user_ns' type_id=251 bits_offset=1472
          'net_ns' type_id=984 bits_offset=1536
          'cred' type_id=1785 bits_offset=1600
          'log' type_id=3621 bits_offset=1664
          'source' type_id=42 bits_offset=1792
          'security' type_id=21 bits_offset=1856
          's_fs_info' type_id=21 bits_offset=1920
          'sb_flags' type_id=20 bits_offset=1984
          'sb_flags_mask' type_id=20 bits_offset=2016
          's_iflags' type_id=20 bits_offset=2048
          'purpose' type_id=3600 bits_offset=2080 bitfield_size=8
          'phase' type_id=3601 bits_offset=2088 bitfield_size=8
          'need_free' type_id=67 bits_offset=2096 bitfield_size=1
          'global' type_id=67 bits_offset=2097 bitfield_size=1
          'oldapi' type_id=67 bits_offset=2098 bitfield_size=1

Fixes: 920d16af ("libbpf: BTF dumper support for typed data")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230428013638.1581263-1-martin.lau@linux.dev
-
- 28 Apr, 2023 1 commit
-
-
Martin KaFai Lau authored
It is reported that fexit_sleep never returns on aarch64, and the
remaining tests cannot start. Put this test into DENYLIST.aarch64 for
now so that other tests can continue to run in the CI.

Acked-by: Manu Bretelle <chantr4@gmail.com>
Reported-by: Manu Bretelle <chantra@meta.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-