- 16 Mar, 2023 5 commits
-
-
Martin KaFai Lau authored
In __start_server, it leaks a fd when setsockopt(SO_REUSEPORT) fails. This patch fixes it. Fixes: eed92afd ("bpf: selftest: Test batching and bpf_(get|set)sockopt in bpf tcp iter") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230316000726.1016773-2-martin.lau@linux.dev
-
Martin KaFai Lau authored
In tcp_hdr_options test, it ensures the received tcp hdr option and the sk local storage have the expected values. It uses memcmp to check that. Testing the memcmp result with ASSERT_OK is confusing because ASSERT_OK will print out the errno which is not set. This patch uses ASSERT_EQ to check for 0 instead. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230316000726.1016773-1-martin.lau@linux.dev
-
Alexei Starovoitov authored
Viktor Malik says: ==================== I noticed that the verifier behaves incorrectly when attaching to fentry of multiple functions of the same name located in different modules (or in vmlinux). The reason for this is that if the target program is not specified, the verifier will search kallsyms for the trampoline address to attach to. The entire kallsyms is always searched, not respecting the module in which the function to attach to is located. As Yonghong correctly pointed out, there is yet another issue - the trampoline acquires the module reference in register_fentry which means that if the module is unloaded between the place where the address is found in the verifier and register_fentry, it is possible that another module is loaded to the same address in the meantime, which may lead to errors. This patch fixes the above issues by extracting the module name from the BTF of the attachment target (which must be specified) and by doing the search in kallsyms of the correct module. At the same time, the module reference is acquired right after the address is found and only released right before the program itself is unloaded. --- Changes in v10: - added the new test to DENYLIST.aarch64 (suggested by Andrii) - renamed the test source file to match the test name Changes in v9: - two small changes suggested by Jiri Olsa and Jiri's ack Changes in v8: - added module_put to error paths in bpf_check_attach_target after the module reference is acquired Changes in v7: - refactored the module reference manipulation (comments by Jiri Olsa) - cleaned up the test (comments by Andrii Nakryiko) Changes in v6: - storing the module reference inside bpf_prog_aux instead of bpf_trampoline and releasing it when the program is unloaded (suggested by Jiri Olsa) Changes in v5: - fixed acquiring and releasing of module references by trampolines to prevent modules being unloaded between address lookup and trampoline allocation Changes in v4: - reworked module kallsyms lookup approach using existing functions, verifier now calls btf_try_get_module to retrieve the module and find_kallsyms_symbol_value to get the symbol address (suggested by Alexei) - included Jiri Olsa's comments - improved description of the new test and added it as a comment into the test source Changes in v3: - added trivial implementation for kallsyms_lookup_name_in_module() for !CONFIG_MODULES (noticed by test robot, fix suggested by Hao Luo) Changes in v2: - introduced and used more space-efficient kallsyms lookup function, suggested by Jiri Olsa - included Hao Luo's comments ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Viktor Malik authored
Adds a new test that tries to attach a program to fentry of two functions of the same name, one located in vmlinux and the other in bpf_testmod. To avoid conflicts with existing tests, a new function "bpf_fentry_shadow_test" was created both in vmlinux and in bpf_testmod. The previous commit fixed a bug which caused this test to fail. The verifier would always use the vmlinux function's address as the target trampoline address, hence trying to create two trampolines for a single address, which is forbidden. The test (similarly to other fentry/fexit tests) is not working on arm64 at the moment. Signed-off-by: Viktor Malik <vmalik@redhat.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/5fe2f364190b6f79b085066ed7c5989c5bc475fa.1678432753.git.vmalik@redhat.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Viktor Malik authored
This resolves two problems with attachment of fentry/fexit/fmod_ret/lsm to functions located in modules: 1. The verifier tries to find the address to attach to in kallsyms. This is always done by searching the entire kallsyms, not respecting the module in which the function is located. Such approach causes an incorrect attachment address to be computed if the function to attach to is shadowed by a function of the same name located earlier in kallsyms. 2. If the address to attach to is located in a module, the module reference is only acquired in register_fentry. If the module is unloaded between the place where the address is found (bpf_check_attach_target in the verifier) and register_fentry, it is possible that another module is loaded to the same address which may lead to potential errors. Since the attachment must contain the BTF of the program to attach to, we extract the module from it and search for the function address in the correct module (resolving problem no. 1). Then, the module reference is taken directly in bpf_check_attach_target and stored in the bpf program (in bpf_prog_aux). The reference is only released when the program is unloaded (resolving problem no. 2). Signed-off-by: Viktor Malik <vmalik@redhat.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/3f6a9d8ae850532b5ef864ef16327b0f7a669063.1678432753.git.vmalik@redhat.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 14 Mar, 2023 14 commits
-
-
Tejun Heo authored
The commit 332ea1f6 ("bpf: Add bpf_cgroup_from_id() kfunc") added bpf_cgroup_from_id() which calls current_cgns_cgroup_dfl() through cgroup_get_from_id(). However, BPF programs may be attached to a point where current->nsproxy has already been cleared to NULL by exit_task_namespace() and calling bpf_cgroup_from_id() would cause an oops. Just return the system-wide root if nsproxy has been cleared. This allows all cgroups to be looked up after the task passed through exit_task_namespace(), which semantically makes sense. Given that the only way to get this behavior is through BPF programs, it seems safe but let's see what others think. Fixes: 332ea1f6 ("bpf: Add bpf_cgroup_from_id() kfunc") Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/ZBDuVWiFj2jiz3i8@slm.duckdns.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
LLVM commit https://reviews.llvm.org/D143726 introduced hoistMinMax optimization that transformed (i < VIRTIO_MAX_SGS) && (i < out_sgs) into i < MIN(VIRTIO_MAX_SGS, out_sgs) and caused the verifier to stop recognizing such loop as bounded. Which resulted in the following test failure: libbpf: prog 'trace_virtqueue_add_sgs': BPF program load failed: Bad address libbpf: prog 'trace_virtqueue_add_sgs': -- BEGIN PROG LOAD LOG -- The sequence of 8193 jumps is too complex. verification time 789206 usec stack depth 56 processed 156446 insns (limit 1000000) max_states_per_insn 7 total_states 1746 peak_states 1701 mark_read 12 -- END PROG LOAD LOG -- libbpf: prog 'trace_virtqueue_add_sgs': failed to load: -14 libbpf: failed to load object 'loop6.bpf.o' Workaround the verifier limitation for now with inline asm that prevents this particular optimization. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Alexander Lobakin says: ==================== Yeah, I still remember that "Who needs cpumap nowadays" (c), but anyway. __xdp_build_skb_from_frame() missed the moment when the networking stack became able to recycle skb pages backed by a page_pool. This was making e.g. cpumap redirect even less effective than simple %XDP_PASS. veth was also affected in some scenarios. A lot of drivers use skb_mark_for_recycle() already, it's been almost two years and seems like there are no issues in using it in the generic code too. {__,}xdp_release_frame() can be then removed as it losts its last user. Page Pool becomes then zero-alloc (or almost) in the abovementioned cases, too. Other memory type models (who needs them at this point) have no changes. Some numbers on 1 Xeon Platinum core bombed with 27 Mpps of 64-byte IPv6 UDP, iavf w/XDP[0] (CONFIG_PAGE_POOL_STATS is enabled): Plain %XDP_PASS on baseline, Page Pool driver: src cpu Rx drops dst cpu Rx 2.1 Mpps N/A 2.1 Mpps cpumap redirect (cross-core, w/o leaving its NUMA node) on baseline: 6.8 Mpps 5.0 Mpps 1.8 Mpps cpumap redirect with skb PP recycling: 7.9 Mpps 5.7 Mpps 2.2 Mpps +22% (from cpumap redir on baseline) [0] https://github.com/alobakin/linux/commits/iavf-xdp ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexander Lobakin authored
__xdp_build_skb_from_frame() was the last user of {__,}xdp_release_frame(), which detaches pages from the page_pool. All the consumers now recycle Page Pool skbs and page, except mlx5, stmmac and tsnep drivers, which use page_pool_release_page() directly (might change one day). It's safe to assume this functionality is not needed anymore and can be removed (in favor of recycling). Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-5-aleksander.lobakin@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexander Lobakin authored
__xdp_build_skb_from_frame() state(d): /* Until page_pool get SKB return path, release DMA here */ Page Pool got skb pages recycling in April 2021, but missed this function. xdp_release_frame() is relevant only for Page Pool backed frames and it detaches the page from the corresponding page_pool in order to make it freeable via page_frag_free(). It can instead just mark the output skb as eligible for recycling if the frame is backed by a pp. No change for other memory model types (the same condition check as before). cpumap redirect and veth on Page Pool drivers now become zero-alloc (or almost). Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-4-aleksander.lobakin@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexander Lobakin authored
skb_mark_for_recycle() is guarded with CONFIG_PAGE_POOL, this creates unneeded complication when using it in the generic code. For now, it's only used in the drivers always selecting Page Pool, so this works. Move the guards so that preprocessor will cut out only the operation itself and the function will still be a noop on !PAGE_POOL systems, but available there as well. No functional changes. Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303020342.Wi2PRFFH-lkp@intel.comSigned-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-3-aleksander.lobakin@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexander Lobakin authored
Currently, the test relies on that only dropped ("xmitted") frames will be recycled and if a frame became an skb, it will be freed later by the stack and never come back to its page_pool. So, it easily gets broken by trying to recycle skbs[0]: test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec test_xdp_do_redirect:FAIL:pkt_count_zero unexpected pkt_count_zero: actual 9936 != expected 2 test_xdp_do_redirect:PASS:pkt_count_tc 0 nsec That huge mismatch happened because after the TC ingress hook zeroes the magic, the page gets recycled when skb is freed, not returned to the MM layer. "Live frames" mode initializes only new pages and keeps the recycled ones as is by design, so they appear with zeroed magic on the Rx path again. Expand the possible magic values from two: 0 (was "xmitted"/dropped or did hit the TC hook) and 0x42 (hit the input XDP prog) to three: the new one will mark frames hit the TC hook, so that they will elide both @pkt_count_zero and @pkt_count_xdp. They can then be recycled to their page_pool or returned to the page allocator, this won't affect the counters anyhow. Just make sure to mark them as "input" (0x42) when they appear on the Rx path again. Also make an enum from those magics, so that they will be always visible and can be changed in just one place anytime. This also eases adding any new marks later on. Link: https://github.com/kernel-patches/bpf/actions/runs/4386538411/jobs/7681081789Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230313215553.1045175-2-aleksander.lobakin@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Alexei Starovoitov says: ==================== From: Alexei Starovoitov <ast@kernel.org> Allow code like: bpf_strncmp(task->comm, 16, "foo"); ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Alexei Starovoitov authored
Add various tests to check helper access into ptr_to_btf_id. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230313235845.61029-4-alexei.starovoitov@gmail.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Alexei Starovoitov authored
The verifier rejects the code: bpf_strncmp(task->comm, 16, "my_task"); with the message: 16: (85) call bpf_strncmp#182 R1 type=trusted_ptr_ expected=fp, pkt, pkt_meta, map_key, map_value, mem, ringbuf_mem, buf Teach the verifier that such access pattern is safe. Do not allow untrusted and legacy ptr_to_btf_id to be passed into helpers. Reported-by: David Vernet <void@manifault.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230313235845.61029-3-alexei.starovoitov@gmail.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Alexei Starovoitov authored
bpf_strncmp() doesn't write into its first argument. Make sure that the verifier knows about it. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230313235845.61029-2-alexei.starovoitov@gmail.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Dave Thaler authored
Improve clarity by adding an example of a signed comparison instruction Signed-off-by: Dave Thaler <dthaler@microsoft.com> Acked-by: David Vernet <void@manifault.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230310233814.4641-1-dthaler1968@googlemail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Ross Zwisler authored
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing Many tests in the bpf selftest code still refer to this older debugfs path, so let's update them to avoid confusion. Signed-off-by: Ross Zwisler <zwisler@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230313205628.1058720-3-zwisler@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Ross Zwisler authored
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing Many comments and samples in the bpf code still refer to this older debugfs path, so let's update them to avoid confusion. There are a few spots where the bpf code explicitly checks both tracefs and debugfs (tools/bpf/bpftool/tracelog.c and tools/lib/api/fs/fs.c) and I've left those alone so that the tools can continue to work with both paths. Signed-off-by: Ross Zwisler <zwisler@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230313205628.1058720-2-zwisler@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 13 Mar, 2023 3 commits
-
-
Dave Marchevsky authored
When a local kptr is stashed in a map and freed when the map goes away, currently an error like the below appears: [ 39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875 [ 39.196549] caller is bpf_mem_free+0x56/0xc0 [ 39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G O 6.2.0-13016-g22df776a #4477 [ 39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 39.198949] Workqueue: events_unbound bpf_map_free_deferred [ 39.199470] Call Trace: [ 39.199703] <TASK> [ 39.199911] dump_stack_lvl+0x60/0x70 [ 39.200267] check_preemption_disabled+0xbf/0xe0 [ 39.200704] bpf_mem_free+0x56/0xc0 [ 39.201032] ? bpf_obj_new_impl+0xa0/0xa0 [ 39.201430] bpf_obj_free_fields+0x1cd/0x200 [ 39.201838] array_map_free+0xad/0x220 [ 39.202193] ? finish_task_switch+0xe5/0x3c0 [ 39.202614] bpf_map_free_deferred+0xea/0x210 [ 39.203006] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.203460] process_one_work+0x64f/0xbe0 [ 39.203822] ? pwq_dec_nr_in_flight+0x110/0x110 [ 39.204264] ? do_raw_spin_lock+0x107/0x1c0 [ 39.204662] ? lockdep_hardirqs_on_prepare+0xe/0x220 [ 39.205107] worker_thread+0x74/0x7a0 [ 39.205451] ? process_one_work+0xbe0/0xbe0 [ 39.205818] kthread+0x171/0x1a0 [ 39.206111] ? kthread_complete_and_exit+0x20/0x20 [ 39.206552] ret_from_fork+0x1f/0x30 [ 39.206886] </TASK> This happens because the call to __bpf_obj_drop_impl I added in the patch adding support for stashing local kptrs doesn't disable migration. Prior to that patch, __bpf_obj_drop_impl logic only ran when called by a BPF progarm, whereas now it can be called from map free path, so it's necessary to explicitly disable migration. Also, refactor a bit to just call __bpf_obj_drop_impl directly instead of bothering w/ dtor union and setting pointer-to-obj_drop. Fixes: c8e18754 ("bpf: Support __kptr to local kptrs") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
David Vernet authored
In commit 3fbd7ee2 ("tasks: Add a count of task RCU users"), a count on the number of RCU users was added to struct task_struct. This was done so as to enable the removal of task_rcu_dereference(), and allow tasks to be protected by RCU even after exiting and being removed from the runqueue. In this commit, the 'refcount_t rcu_users' field that keeps track of this refcount was put into a union co-located with 'struct rcu_head rcu', so as to avoid taking up any extra space in task_struct. This was possible to do safely, because the field was only ever decremented by a static set of specific callers, and then never incremented again. While this restriction of there only being a small, static set of users of this field has worked fine, it prevents us from leveraging the field to use RCU to protect tasks in other contexts. During tracing, for example, it would be useful to be able to collect some tasks that performed a certain operation, put them in a map, and then periodically summarize who they are, which cgroup they're in, how much CPU time they've utilized, etc. While this can currently be done with 'usage', it becomes tricky when a task is already in a map, or if a reference should only be taken if a task is valid and will not soon be reaped. Ideally, we could do something like pass a reference to a map value, and then try to acquire a reference to the task in an RCU read region by using refcount_inc_not_zero(). Similarly, in sched_ext, schedulers are using integer pids to remember tasks, and then looking them up with find_task_by_pid_ns(). This is slow, error prone, and adds complexity. It would be more convenient and performant if BPF schedulers could instead store tasks directly in maps, and then leverage RCU to ensure they can be safely accessed with low overhead. Finally, overloading fields like this is error prone. Someone that wants to use 'rcu_users' could easily overlook the fact that once the rcu callback is scheduled, the refcount will go back to being nonzero, thus precluding the use of refcount_inc_not_zero(). Furthermore, as described below, it's possible to extract the fields of the union without changing the size of task_struct. There are several possible ways to enable this: 1. The lightest touch approach is likely the one proposed in this patch, which is to simply extract 'rcu_users' and 'rcu' from the union, so that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users' refcount. If we have a trusted task pointer, this would allow us to use refcnt_inc_not_zero() inside of an RCU region to determine if we can safely acquire a reference to the task and store it in a map. As mentioned below, this can be done without changing the size of task_struct, by moving the location of the union to another location that has padding gaps we can fill in. 2. Removing 'refcount_t rcu_users', and instead having the entire task be freed in an rcu callback. This is likely the most sound overall design, though it changes the behavioral semantics exposed to callers, who currently expect that a task that's successfully looked up in e.g. the pid_list with find_task_by_pid_ns(), can always have a 'usage' reference acquired on them, as it's guaranteed to be > 0 until after the next gp. In order for this approach to work, we'd have to audit all callers. This approach also slightly changes behavior observed by user space by not invoking trace_sched_process_free() until the whole task_struct is actually being freed, rather than just after it's exited. It also may change timings, as memory will be freed in an RCU callback rather than immediately when the final 'usage' refcount drops to 0. This also is arguably a benefit, as it provides more predictable performance to callers who are refcounting tasks. 3. There may be other solutions as well that don't require changing the layout of task_struct. For example, we could possibly do something complex from the BPF side, such as listen for task exit and remove a task from a map when the task is exiting. This would likely require significant custom handling for task_struct in the verifier, so a more generalizable solution is likely warranted. As mentioned above, this patch proposes the lightest-touch approach which allows callers elsewhere in the kernel to use 'rcu_users' to ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from the union. There is no size change in task_struct with this patch. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: David Vernet <void@manifault.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Fix wrong order of frame index vs register/slot index in precision propagation verbose (level 2) output. It's wrong and very confusing as is. Fixes: 529409ea ("bpf: propagate precision across all frames, not just the last one") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 11 Mar, 2023 4 commits
-
-
Alexei Starovoitov authored
Dave Marchevsky says: ==================== Local kptrs are kptrs allocated via bpf_obj_new with a type specified in program BTF. A BPF program which creates a local kptr has exclusive control of the lifetime of the kptr, and, prior to terminating, must: * free the kptr via bpf_obj_drop * If the kptr is a {list,rbtree} node, add the node to a {list, rbtree}, thereby passing control of the lifetime to the collection This series adds a third option: * stash the kptr in a map value using bpf_kptr_xchg As indicated by the use of "stash" to describe this behavior, the intended use of this feature is temporary storage of local kptrs. For example, a sched_ext ([0]) scheduler may want to create an rbtree node for each new cgroup on cgroup init, but to add that node to the rbtree as part of a separate program which runs on enqueue. Stashing the node in a map_value allows its lifetime to outlive the execution of the cgroup_init program. Behavior: There is no semantic difference between adding a kptr to a graph collection and "stashing" it in a map. In both cases exclusive ownership of the kptr's lifetime is passed to some containing data structure, which is responsible for bpf_obj_drop'ing it when the container goes away. Since graph collections also expect exclusive ownership of the nodes they contain, graph nodes cannot be both stashed in a map_value and contained by their corresponding collection. Implementation: Two observations simplify the verifier changes for this feature. First, kptrs ("referenced kptrs" until a recent renaming) require registration of a dtor function as part of their acquire/release semantics, so that a referenced kptr which is placed in a map_value is properly released when the map goes away. We want this exact behavior for local kptrs, but with bpf_obj_drop as the dtor instead of a per-btf_id dtor. The second observation is that, in terms of identification, "referenced kptr" and "local kptr" already don't interfere with one another. Consider the following example: struct node_data { long key; long data; struct bpf_rb_node node; }; struct map_value { struct node_data __kptr *node; }; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(key, int); __type(value, struct map_value); __uint(max_entries, 1); } some_nodes SEC(".maps"); struct map_value *mapval; struct node_data *res; int key = 0; res = bpf_obj_new(typeof(*res)); if (!res) { /* err handling */ } mapval = bpf_map_lookup_elem(&some_nodes, &key); if (!mapval) { /* err handling */ } res = bpf_kptr_xchg(&mapval->node, res); if (res) bpf_obj_drop(res); The __kptr tag identifies map_value's node as a referenced kptr, while the PTR_TO_BTF_ID which bpf_obj_new returns - a type in some non-vmlinux, non-module BTF - identifies res as a local kptr. Type tag on the pointer indicates referenced kptr, while the type of the pointee indicates local kptr. So using existing facilities we can tell the verifier about a "referenced kptr" pointer to a "local kptr" pointee. When kptr_xchg'ing a kptr into a map_value, the verifier can recognize local kptr types and treat them like referenced kptrs with a properly-typed bpf_obj_drop as a dtor. Other implementation notes: * We don't need to do anything special to enforce "graph nodes cannot be both stashed in a map_value and contained by their corresponding collection" * bpf_kptr_xchg both returns and takes as input a (possibly-null) owning reference. It does not accept non-owning references as input by virtue of requiring a ref_obj_id. By definition, if a program has an owning ref to a node, the node isn't in a collection, so it's safe to pass ownership via bpf_kptr_xchg. Summary of patches: * Patch 1 modifies BTF plumbing to support using bpf_obj_drop as a dtor * Patch 2 adds verifier plumbing to support MEM_ALLOC-flagged param for bpf_kptr_xchg * Patch 3 adds selftests exercising the new behavior Changelog: v1 -> v2: https://lore.kernel.org/bpf/20230309180111.1618459-1-davemarchevsky@fb.com/ Patch #s used below refer to the patch's position in v1 unless otherwise specified. Patches 1-3 were applied and are not included in v2. Rebase onto latest bpf-next: "libbpf: Revert poisoning of strlcpy" Patch 4: "bpf: Support __kptr to local kptrs" * Remove !btf_is_kernel(btf) check, WARN_ON_ONCE instead (Alexei) Patch 6: "selftests/bpf: Add local kptr stashing test" * Add test which stashes 2 nodes and later unstashes one of them using a separate BPF program (Alexei) * Fix incorrect runner subtest name for original test (was "rbtree_add_nodes") ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
Add a new selftest, local_kptr_stash, which uses bpf_kptr_xchg to stash a bpf_obj_new-allocated object in a map. Test the following scenarios: * Stash two rb_nodes in an arraymap, don't unstash them, rely on map free to destruct them * Stash two rb_nodes in an arraymap, unstash the second one in a separate program, rely on map free to destruct first Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-4-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
The previous patch added necessary plumbing for verifier and runtime to know what to do with non-kernel PTR_TO_BTF_IDs in map values, but didn't provide any way to get such local kptrs into a map value. This patch modifies verifier handling of bpf_kptr_xchg to allow MEM_ALLOC kptr types. check_reg_type is modified accept MEM_ALLOC-flagged input to bpf_kptr_xchg despite such types not being in btf_ptr_types. This could have been done with a MAYBE_MEM_ALLOC equivalent to MAYBE_NULL, but bpf_kptr_xchg is the only helper that I can forsee using MAYBE_MEM_ALLOC, so keep it special-cased for now. The verifier tags bpf_kptr_xchg retval MEM_ALLOC if and only if the BTF associated with the retval is not kernel BTF. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-3-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module BTF - it must have been allocated by bpf_obj_new and therefore must be free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new. This patch adds support for treating __kptr-tagged pointers to "local kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr acquire / release semantics. Consider the following example: struct node_data { long key; long data; struct bpf_rb_node node; }; struct map_value { struct node_data __kptr *node; }; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(key, int); __type(value, struct map_value); __uint(max_entries, 1); } some_nodes SEC(".maps"); If struct node_data had a matching definition in kernel BTF, the verifier would expect a destructor for the type to be registered. Since struct node_data does not match any type in kernel BTF, the verifier knows that there is no kfunc that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can only come from bpf_obj_new. So instead of searching for a registered dtor, a bpf_obj_drop dtor can be assumed. This allows the runtime to properly destruct such kptrs in bpf_obj_free_fields, which enables maps to clean up map_vals w/ such kptrs when going away. Implementation notes: * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr. Before this patch, the variable would only ever point to vmlinux or module BTFs, but now it can point to some program BTF for local kptr type. It's later used to populate the (btf, btf_id) pair in kptr btf field. * It's necessary to btf_get the program BTF when populating btf_field for local kptr. btf_record_free later does a btf_put. * Behavior for non-local referenced kptrs is not modified, as bpf_find_btf_id helper only searches vmlinux and module BTFs for matching BTF type. If such a type is found, btf_field_kptr's btf will pass btf_is_kernel check, and the associated release function is some one-argument dtor. If btf_is_kernel check fails, associated release function is two-arg bpf_obj_drop_impl. Before this patch only btf_field_kptr's w/ kernel or module BTFs were created. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 10 Mar, 2023 14 commits
-
-
Dave Thaler authored
Add brief text about existence of helper functions, with details to go in separate psABI text. Note that text about runtime functions (kfuncs) is part of a separate patch, not this one. Signed-off-by: Dave Thaler <dthaler@microsoft.com> Link: https://lore.kernel.org/r/20230308205303.1308-1-dthaler1968@googlemail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
btf_record_find's 3rd parameter can be multiple enum btf_field_type's masked together. The function is called with BPF_KPTR in two places in verifier.c, so it works with masked values already. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230309180111.1618459-4-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
This enum was added and used in commit aa3496ac ("bpf: Refactor kptr_off_tab into btf_record"). Later refactoring in commit db559117 ("bpf: Consolidate spin_lock, timer management into btf_record") resulted in the enum values no longer being used anywhere. Let's remove them. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230309180111.1618459-3-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Dave Marchevsky authored
kernel_type_name was introduced in commit 9e15db66 ("bpf: Implement accurate raw_tp context access via BTF") with type signature: const char *kernel_type_name(u32 id) At that time the function used global btf_vmlinux BTF for all id lookups. Later, in commit 22dc4a0f ("bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier"), the type signature was changed to: static const char *kernel_type_name(const struct btf* btf, u32 id) With the btf parameter used for lookups instead of global btf_vmlinux. The helper will function as expected for type name lookup using non-kernel BTFs, and will be used for such in further patches in the series. Let's rename it to avoid incorrect assumptions that might arise when seeing the current name. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230309180111.1618459-2-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch tests how many kmallocs is needed to create and free a batch of UDP sockets and each socket has a 64bytes bpf storage. It also measures how fast the UDP sockets can be created. The result is from my qemu setup. Before bpf_mem_cache_alloc/free: ./bench -p 1 local-storage-create Setting up benchmark 'local-storage-create'... Benchmark 'local-storage-create' started. Iter 0 ( 73.193us): creates 213.552k/s (213.552k/prod), 3.09 kmallocs/create Iter 1 (-20.724us): creates 211.908k/s (211.908k/prod), 3.09 kmallocs/create Iter 2 ( 9.280us): creates 212.574k/s (212.574k/prod), 3.12 kmallocs/create Iter 3 ( 11.039us): creates 213.209k/s (213.209k/prod), 3.12 kmallocs/create Iter 4 (-11.411us): creates 213.351k/s (213.351k/prod), 3.12 kmallocs/create Iter 5 ( -7.915us): creates 214.754k/s (214.754k/prod), 3.12 kmallocs/create Iter 6 ( 11.317us): creates 210.942k/s (210.942k/prod), 3.12 kmallocs/create Summary: creates 212.789 ± 1.310k/s (212.789k/prod), 3.12 kmallocs/create After bpf_mem_cache_alloc/free: ./bench -p 1 local-storage-create Setting up benchmark 'local-storage-create'... Benchmark 'local-storage-create' started. Iter 0 ( 68.265us): creates 243.984k/s (243.984k/prod), 1.04 kmallocs/create Iter 1 ( 30.357us): creates 238.424k/s (238.424k/prod), 1.04 kmallocs/create Iter 2 (-18.712us): creates 232.963k/s (232.963k/prod), 1.04 kmallocs/create Iter 3 (-15.885us): creates 238.879k/s (238.879k/prod), 1.04 kmallocs/create Iter 4 ( 5.590us): creates 237.490k/s (237.490k/prod), 1.04 kmallocs/create Iter 5 ( 8.577us): creates 237.521k/s (237.521k/prod), 1.04 kmallocs/create Iter 6 ( -6.263us): creates 238.508k/s (238.508k/prod), 1.04 kmallocs/create Summary: creates 237.298 ± 2.198k/s (237.298k/prod), 1.04 kmallocs/create Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-18-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch tweats the socket_bind bpf prog to test the local_storage->smap == NULL case in the bpf_local_storage_free() code path. The idea is to create the local_storage with the sk_storage_map's selem first. Then add the sk_storage_map2's selem and then delete the earlier sk_storeage_map's selem. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-17-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch migrates the CHECK macro to ASSERT macro. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-16-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch refactors local_storage freeing logic into bpf_local_storage_free(). It is a preparation work for a later patch that uses bpf_mem_cache_alloc/free. The other kfree(local_storage) cases are also changed to bpf_local_storage_free(..., reuse_now = true). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-12-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
The existing bpf_local_storage_free_rcu is renamed to bpf_local_storage_free_trace_rcu. A new bpf_local_storage_rcu callback is added to do the kfree instead of using kfree_rcu. It is a preparation work for a later patch using bpf_mem_cache_alloc/free. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-11-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch refactors the selem freeing logic into bpf_selem_free(). It is a preparation work for a later patch using bpf_mem_cache_alloc/free. The other kfree(selem) cases are also changed to bpf_selem_free(..., reuse_now = true). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-10-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Add bpf_selem_free_rcu() callback to do the kfree() instead of using kfree_rcu. It is a preparation work for using bpf_mem_cache_alloc/free in a later patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-9-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch removes the bpf_selem_free_fields*_rcu. The bpf_obj_free_fields() can be done before the call_rcu_trasks_trace() and kfree_rcu(). It is needed when a later patch uses bpf_mem_cache_alloc/free. In bpf hashtab, bpf_obj_free_fields() is also called before calling bpf_mem_cache_free. The discussion can be found in https://lore.kernel.org/bpf/f67021ee-21d9-bfae-6134-4ca542fab843@linux.dev/Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-8-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch re-purpose the use_trace_rcu to mean if the freed memory can be reused immediately or not. The use_trace_rcu is renamed to reuse_now. Other than the boolean test is reversed, it should be a no-op. The following explains the reason for the rename and how it will be used in a later patch. In a later patch, bpf_mem_cache_alloc/free will be used in the bpf_local_storage. The bpf mem allocator will reuse the freed memory immediately. Some of the free paths in bpf_local_storage does not support memory to be reused immediately. These paths are the "delete" elem cases from the bpf_*_storage_delete() helper and the map_delete_elem() syscall. Note that "delete" elem before the owner's (sk/task/cgrp/inode) lifetime ended is not the common usage for the local storage. The common free path, bpf_local_storage_destroy(), can reuse the memory immediately. This common path means the storage stays with its owner until the owner is destroyed. The above mentioned "delete" elem paths that cannot reuse immediately always has the 'use_trace_rcu == true'. The cases that is safe for immediate reuse always have 'use_trace_rcu == false'. Instead of adding another arg in a later patch, this patch re-purpose this arg to reuse_now and have the test logic reversed. In a later patch, 'reuse_now == true' will free to the bpf_mem_cache_free() where the memory can be reused immediately. 'reuse_now == false' will go through the call_rcu_tasks_trace(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-7-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch remembers which smap triggers the allocation of a 'struct bpf_local_storage' object. The local_storage is allocated during the very first selem added to the owner. The smap pointer is needed when using the bpf_mem_cache_free in a later patch because it needs to free to the correct smap's bpf_mem_alloc object. When a selem is being removed, it needs to check if it is the selem that triggers the creation of the local_storage. If it is, the local_storage->smap pointer will be reset to NULL. This NULL reset is done under the local_storage->lock in bpf_selem_unlink_storage_nolock() when a selem is being removed. Also note that the local_storage may not go away even local_storage->smap is NULL because there may be other selem still stored in the local_storage. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230308065936.1550103-6-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-