- 06 Apr, 2024 4 commits
-
-
Andrii Nakryiko authored
Andrea Righi says: ==================== libbpf: API to partially consume items from ringbuffer Introduce ring__consume_n() and ring_buffer__consume_n() API to partially consume items from one (or more) ringbuffer(s). This can be useful, for example, to consume just a single item or when we need to copy multiple items to a limited user-space buffer from the ringbuffer callback. Practical example (where this API can be used): https://github.com/sched-ext/scx/blob/b7c06b9ed9f72cad83c31e39e9c4e2cfd8683a55/rust/scx_rustland_core/src/bpf.rs#L217 See also: https://lore.kernel.org/lkml/20240310154726.734289-1-andrea.righi@canonical.com/T/#u v4: - open a new 1.5.0 cycle v3: - rename ring__consume_max() -> ring__consume_n() and ring_buffer__consume_max() -> ring_buffer__consume_n() - add new API to a new 1.5.0 cycle - fixed minor nits / comments v2: - introduce a new API instead of changing the callback's retcode behavior ==================== Link: https://lore.kernel.org/r/20240406092005.92399-1-andrea.righi@canonical.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
-
Andrea Righi authored
Introduce a new API to consume items from a ring buffer, limited to a specified amount, and return to the caller the actual number of items consumed. Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/lkml/20240310154726.734289-1-andrea.righi@canonical.com/T Link: https://lore.kernel.org/bpf/20240406092005.92399-4-andrea.righi@canonical.com
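As a rough usage sketch of the new APIs (assuming libbpf 1.5.0+, an existing ring_buffer manager, and that ring index 0 is the ring of interest; not code from the patch itself):
```
/* Sketch only: consume a bounded number of records instead of draining
 * everything greedily. Error handling is minimal.
 */
#include <stdio.h>
#include <bpf/libbpf.h>

static void drain_some(struct ring_buffer *rb)
{
	/* first ringbuf map registered with ring_buffer__new()/add() */
	struct ring *r = ring_buffer__ring(rb, 0);
	int n;

	/* consume at most 32 records from this single ring */
	n = ring__consume_n(r, 32);
	if (n < 0)
		fprintf(stderr, "ring__consume_n failed: %d\n", n);

	/* or consume at most 64 records across all registered rings */
	n = ring_buffer__consume_n(rb, 64);
	if (n < 0)
		fprintf(stderr, "ring_buffer__consume_n failed: %d\n", n);
}
```
Both calls return the number of records actually consumed (possibly fewer than requested), or a negative error code.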
-
Andrea Righi authored
In some cases, instead of always consuming all items from ring buffers in a greedy way, we may want to consume up to a certain amount of items, for example when we need to copy items from the BPF ring buffer to a limited user buffer. This change allows to set an upper limit to the amount of items consumed from one or more ring buffers. Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240406092005.92399-3-andrea.righi@canonical.com
-
Andrea Righi authored
Bump libbpf.map to v1.5.0 to start a new libbpf version cycle. Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240406092005.92399-2-andrea.righi@canonical.com
-
- 05 Apr, 2024 12 commits
-
-
Andrii Nakryiko authored
David Vernet says: ==================== bpf: Allow invoking kfuncs from BPF_PROG_TYPE_SYSCALL progs Currently, a set of core BPF kfuncs (e.g. bpf_task_*, bpf_cgroup_*, bpf_cpumask_*, etc) cannot be invoked from BPF_PROG_TYPE_SYSCALL programs. The whitelist approach taken for enabling kfuncs makes sense: it is not safe to call these kfuncs from every program type. For example, it may not be safe to call bpf_task_acquire() in an fentry to free_task(). BPF_PROG_TYPE_SYSCALL, on the other hand, is a perfectly safe program type from which to invoke these kfuncs, as it's a very controlled environment, and we should never be able to run into any of the typical problems such as recursive invocations, acquiring references on freeing kptrs, etc. Being able to invoke these kfuncs would be useful, as BPF_PROG_TYPE_SYSCALL can be invoked with BPF_PROG_RUN, and would therefore enable user space programs to synchronously call into BPF to manipulate these kptrs. --- v1: https://lore.kernel.org/all/20240404010308.334604-1-void@manifault.com/ v1 -> v2: - Create new verifier_kfunc_prog_types testcase meant to specifically validate calling core kfuncs from various program types. Remove the macros and testcases that had been added to the task, cgrp, and cpumask kfunc testcases (Andrii and Yonghong) ==================== Link: https://lore.kernel.org/r/20240405143041.632519-1-void@manifault.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
-
David Vernet authored
Now that we can call some kfuncs from BPF_PROG_TYPE_SYSCALL progs, let's add some selftests that verify as much. As a bonus, let's also verify that we can't call the progs from raw tracepoints. To do this, we add a new selftest suite called verifier_kfunc_prog_types. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240405143041.632519-3-void@manifault.com
-
David Vernet authored
Currently, a set of core BPF kfuncs (e.g. bpf_task_*, bpf_cgroup_*, bpf_cpumask_*, etc) cannot be invoked from BPF_PROG_TYPE_SYSCALL programs. The whitelist approach taken for enabling kfuncs makes sense: it is not safe to call these kfuncs from every program type. For example, it may not be safe to call bpf_task_acquire() in an fentry to free_task(). BPF_PROG_TYPE_SYSCALL, on the other hand, is a perfectly safe program type from which to invoke these kfuncs, as it's a very controlled environment, and we should never be able to run into any of the typical problems such as recursive invocations, acquiring references on freeing kptrs, etc. Being able to invoke these kfuncs would be useful, as BPF_PROG_TYPE_SYSCALL can be invoked with BPF_PROG_RUN, and would therefore enable user space programs to synchronously call into BPF to manipulate these kptrs. This patch therefore enables invoking the aforementioned core kfuncs from BPF_PROG_TYPE_SYSCALL progs. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240405143041.632519-2-void@manifault.com
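For illustration, a minimal sketch of a BPF_PROG_TYPE_SYSCALL program calling core task kfuncs, which this change makes loadable (program and section names are illustrative; bpf_task_from_pid()/bpf_task_release() are existing kfuncs, not part of this patch):
```
// SPDX-License-Identifier: GPL-2.0
/* Sketch only: acquire and release a task kptr from a syscall program.
 * Such a program is run synchronously from user space via BPF_PROG_RUN.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
void bpf_task_release(struct task_struct *p) __ksym;

SEC("syscall")
int touch_task(void *ctx)
{
	struct task_struct *p;

	p = bpf_task_from_pid(1);	/* returns a referenced kptr or NULL */
	if (!p)
		return 0;
	bpf_task_release(p);		/* release the acquired reference */
	return 0;
}

char _license[] SEC("license") = "GPL";
```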
-
Dave Thaler authored
This patch addresses a number of editorial nits including spelling, punctuation, grammar, and wording consistency issues in instruction-set.rst. Signed-off-by: Dave Thaler <dthaler1968@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20240405155245.3618-1-dthaler1968@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kui-Feng Lee authored
The verifier in the kernel ensures that the struct_ops operators behave correctly by checking that they access parameters and context appropriately. The verifier will approve a program as long as it correctly accesses the context/parameters, regardless of its function signature. In contrast, libbpf should not verify the signature of function pointers and functions, to enable flexibility in loading various implementations of an operator even if the signature of the function pointer does not match those in the implementations or the kernel. With this flexibility, user space applications can adapt to different kernel versions by loading a specific implementation of an operator based on feature detection. This is a follow-up to commit c911fc61 ("libbpf: Skip zeroed or null fields if not found in the kernel type.") Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240404232342.991414-1-thinker.li@gmail.com
-
Alexei Starovoitov authored
Philo Lu says: ==================== bpf: allow bpf_for_each_map_elem() helper with different input maps Currently, taking different maps within a single bpf_for_each_map_elem call is not allowed. For example, the following code cannot pass the verifier (with error "tail_call abusing map_ptr"):
```
static void test_by_pid(int pid)
{
	if (pid <= 100)
		bpf_for_each_map_elem(&map1, map_elem_cb, NULL, 0);
	else
		bpf_for_each_map_elem(&map2, map_elem_cb, NULL, 0);
}
```
This is because during bpf_for_each_map_elem verification, bpf_insn_aux_data->map_ptr_state is expected to be a map_ptr (instead of a poison state), which is then needed by set_map_elem_callback_state. However, as there are two different map_ptr inputs, map_ptr_state is marked as BPF_MAP_PTR_POISON, and thus the second map_ptr would be lost. BPF_MAP_PTR_POISON is also needed by bpf_for_each_map_elem to skip the retpoline optimization in do_misc_fixups(). Therefore, map_ptr_state and map_ptr are both needed for bpf_for_each_map_elem. This patchset solves it by turning bpf_insn_aux_data->map_ptr_state into a new struct, storing the poison/unpriv state and the map pointer together without additional memory overhead. Then bpf_for_each_map_elem works well with different input maps. It also makes the map_ptr_state logic clearer. A test case is added to selftests, which would fail to load without this patchset. Changelogs -> v1: - PATCH 1/3: - make the commit log clearer - change poison and unpriv to bool in struct bpf_map_ptr_state, also the return value in bpf_map_ptr_poisoned() and bpf_map_ptr_unpriv() - PATCH 2/3: - change the comments in set_map_elem_callback_state() - PATCH 3/3: - remove the "skipping the last element" logic during map updating - change if() to ASSERT_OK() Please review, thanks. ==================== Link: https://lore.kernel.org/r/20240405025536.18113-1-lulie@linux.alibaba.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Philo Lu authored
A test is added for bpf_for_each_map_elem() with either an arraymap or a hashmap. $ tools/testing/selftests/bpf/test_progs -t for_each #93/1 for_each/hash_map:OK #93/2 for_each/array_map:OK #93/3 for_each/write_map_key:OK #93/4 for_each/multi_maps:OK #93 for_each:OK Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Philo Lu <lulie@linux.alibaba.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240405025536.18113-4-lulie@linux.alibaba.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Philo Lu authored
Taking different maps within a single bpf_for_each_map_elem call was not allowed before, because starting with the second map, bpf_insn_aux_data->map_ptr_state would be marked as *poison*. In fact both map_ptr and state are needed to support this use case: map_ptr is used by set_map_elem_callback_state() while the poison state is needed to determine whether to use a direct call. Signed-off-by: Philo Lu <lulie@linux.alibaba.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240405025536.18113-3-lulie@linux.alibaba.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Philo Lu authored
Currently, bpf_insn_aux_data->map_ptr_state is used to store either map_ptr or its poison state (i.e., BPF_MAP_PTR_POISON). Thus BPF_MAP_PTR_POISON must be checked before reading map_ptr. In certain cases, we may need a valid map_ptr even in the case of a poison state. This will be explained in the next patch with the bpf_for_each_map_elem() helper. This patch changes map_ptr_state into a new struct including both the map pointer and its state (poison/unpriv). It's in the same union as struct bpf_loop_inline_state, so there is no extra memory overhead. Besides, the macros BPF_MAP_PTR_UNPRIV/BPF_MAP_PTR_POISON/BPF_MAP_PTR are no longer needed. This patch does not change any existing functionality. Signed-off-by: Philo Lu <lulie@linux.alibaba.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240405025536.18113-2-lulie@linux.alibaba.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
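Based on the description above, the new bookkeeping looks roughly like this (a sketch, not a verbatim copy of the patch):
```
/* Sketch: map pointer and its poison/unpriv flags stored together,
 * replacing the old tagged-pointer encoding behind the
 * BPF_MAP_PTR_UNPRIV/BPF_MAP_PTR_POISON/BPF_MAP_PTR macros.
 */
struct bpf_map_ptr_state {
	struct bpf_map *map_ptr;
	bool poison;
	bool unpriv;
};
```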
-
Arnd Bergmann authored
The newly added code to handle bpf_get_branch_snapshot fails to link when CONFIG_PERF_EVENTS is disabled: aarch64-linux-ld: kernel/bpf/verifier.o: in function `do_misc_fixups': verifier.c:(.text+0x1090c): undefined reference to `__SCK__perf_snapshot_branch_stack' Add a build-time check for that Kconfig symbol around the code to remove the link time dependency. Fixes: 314a5362 ("bpf: inline bpf_get_branch_snapshot() helper") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20240405142637.577046-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
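One way such a build-time guard can look, inside the helper-inlining logic of do_misc_fixups() (a sketch, not the exact hunk from the fix):
```
/* Sketch only: skip the bpf_get_branch_snapshot() inlining path when perf
 * events support is compiled out, so __SCK__perf_snapshot_branch_stack is
 * never referenced at link time.
 */
if (IS_ENABLED(CONFIG_PERF_EVENTS) &&
    insn->imm == BPF_FUNC_get_branch_snapshot) {
	/* ... emit the inlined instruction sequence ... */
}
```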
-
Andrii Nakryiko authored
Add selftests validating that BPF verifier handles precision marking for SCALAR registers derived from r10 (fp) register correctly. Given `r0 = (s8)r10;` syntax is not supported by older Clang compilers, use the raw BPF instruction syntax to maximize compatibility. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240404214536.3551295-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
r10 is a special register that is not under BPF program's control and is always effectively precise. The rest of the precision logic assumes that only r0-r9 SCALAR registers are marked as precise, so prevent r10 from being marked precise. This can happen due to the signed cast instruction, which allows doing something like `r0 = (s8)r10;`, which later, if r0 needs to be precise, would lead to an attempt to mark r10 as precise. Prevent this with an extra check during instruction backtracking. Fixes: 8100928c ("bpf: Support new sign-extension mov insns") Reported-by: syzbot+148110ee7cf72f39f33e@syzkaller.appspotmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240404214536.3551295-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 04 Apr, 2024 9 commits
-
-
Anton Protopopov authored
The struct bpf_fib_lookup is supposed to be of size 64. A recent commit 59b418c7 ("bpf: Add a check for struct bpf_fib_lookup size") added a static assertion to check this property so that future changes to the structure will not accidentally break this assumption. As it immediately turned out, on some 32-bit arm systems, when AEABI=n, the total size of the structure was equal to 68, see [1]. This happened because the bpf_fib_lookup structure contains a union of two 16-bit fields: union { __u16 tot_len; __u16 mtu_result; }; which was supposed to compile to a 16-bit-aligned 16-bit field. On the aforementioned setups it was instead both aligned and padded to 32-bits. Declare this inner union as __attribute__((packed, aligned(2))) such that it always is of size 2 and is aligned to 16 bits. [1] https://lore.kernel.org/all/CA+G9fYtsoP51f-oP_Sp5MOq-Ffv8La2RztNpwvE6+R1VtFiLrw@mail.gmail.com/#t Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Fixes: e1850ea9 ("bpf: bpf_fib_lookup return MTU value as output when looked up") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240403123303.1452184-1-aspsk@isovalent.com
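The fix amounts to annotating the inner union, roughly as follows:
```
/* Per the description above: keep the union 2 bytes in size and 16-bit
 * aligned even on 32-bit arm with AEABI=n, so sizeof(struct bpf_fib_lookup)
 * stays 64.
 */
union {
	__u16	tot_len;
	__u16	mtu_result;
} __attribute__((packed, aligned(2)));
```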
-
Sahil Siddiq authored
When pinning programs/objects under PATH (eg: during "bpftool prog loadall") the bpffs is mounted on the parent dir of PATH in the following situations: - the given dir exists but it is not bpffs. - the given dir doesn't exist and the parent dir is not bpffs. Mounting on the parent dir can also have the unintentional side-effect of hiding other files located under the parent dir. If the given dir exists but is not bpffs, then the bpffs should be mounted on the given dir and not its parent dir. Similarly, if the given dir doesn't exist and its parent dir is not bpffs, then the given dir should be created and the bpffs should be mounted on this new dir. Fixes: 2a36c26f ("bpftool: Support bpffs mountpoint as pin path for prog loadall") Signed-off-by: Sahil Siddiq <icegambit91@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/2da44d24-74ae-a564-1764-afccf395eeec@isovalent.com/T/#t Link: https://lore.kernel.org/bpf/20240404192219.52373-1-icegambit91@gmail.com Closes: https://github.com/libbpf/bpftool/issues/100 Changes since v1: - Split "mount_bpffs_for_pin" into two functions. This is done to improve maintainability and readability. Changes since v2: - mount_bpffs_for_pin: rename to "create_and_mount_bpffs_dir". - mount_bpffs_given_file: rename to "mount_bpffs_given_file". - create_and_mount_bpffs_dir: - introduce "dir_exists" boolean. - remove new dir if "mnt_fs" fails. - improve error handling and error messages. Changes since v3: - Rectify function name. - Improve error messages and formatting. - mount_bpffs_for_file: - Check if dir exists before block_mount check. Changes since v4: - Use strdup instead of strcpy. - create_and_mount_bpffs_dir: - Use S_IRWXU instead of 0700. - Improve error handling and formatting.
-
Alexei Starovoitov authored
Andrii Nakryiko says: ==================== Inline bpf_get_branch_snapshot() BPF helper Implement inlining of bpf_get_branch_snapshot() BPF helper using a generic BPF assembly approach. This allows reducing LBR record usage right before LBR records are captured from inside the BPF program. See v1 cover letter ([0]) for some visual examples. I dropped them from v2 because there are multiple independent changes landing and being reviewed, all of which remove different parts of LBR record waste, so presenting the final state of LBR "waste" gets more complicated until all of the pieces land. [0] https://lore.kernel.org/bpf/20240321180501.734779-1-andrii@kernel.org/ v2->v3: - fix BPF_MUL instruction definition; v1->v2: - inlining of bpf_get_smp_processor_id() split out into a separate patch set implementing internal per-CPU BPF instruction; - add efficient divide-by-24 through multiplication logic, and leave comments to explain the idea behind it; this way inlined version of bpf_get_branch_snapshot() has no compromises compared to non-inlined version of the helper (Alexei). ==================== Link: https://lore.kernel.org/r/20240404002640.1774210-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Inline bpf_get_branch_snapshot() helper using architecture-agnostic inline BPF code which calls directly into underlying callback of perf_snapshot_branch_stack static call. This callback is set early during kernel initialization and is never updated or reset, so it's ok to fetch actual implementation using static_call_query() and call directly into it. This change eliminates a full function call and saves one LBR entry in PERF_SAMPLE_BRANCH_ANY LBR mode. Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240404002640.1774210-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
perf_snapshot_branch_stack is set up in an architecture-agnostic way, so there is no reason for the BPF subsystem to keep track of which architectures support LBR and which do not. E.g., it looks like ARM64 might soon get support for BRBE ([0]), which (with proper integration) should be possible to utilize using this BPF helper. The perf_snapshot_branch_stack static call will point to __static_call_return0() by default, which just returns zero, which will lead to -ENOENT, as expected. So no need to guard anything here. [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240404002640.1774210-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Puranjay Mohan authored
LLVM generates the bpf_addr_space_cast instruction while translating pointers between the native (zero) address space and __attribute__((address_space(N))). The addr_space=1 is reserved as the bpf_arena address space. rY = addr_space_cast(rX, 0, 1) is processed by the verifier and converted to a normal 32-bit move: wX = wY. rY = addr_space_cast(rX, 1, 0) has to be converted by the JIT. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20240404114203.105970-3-puranjay12@gmail.com
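For context, the BPF C pattern that makes LLVM emit these casts looks roughly like this (a sketch modeled on the arena selftest headers; the __arena macro name is conventional, not mandated):
```
/* Sketch: arena pointers live in a non-zero address space; LLVM emits
 * addr_space_cast instructions whenever such pointers are converted to or
 * from ordinary (address space 0) pointers.
 */
#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
#define __arena __attribute__((address_space(1)))
#else
#define __arena
#endif

int __arena *shared_counter;	/* backed by a BPF_MAP_TYPE_ARENA map */
```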
-
Puranjay Mohan authored
Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions. They are similar to PROBE_MEM instructions with the following differences: - PROBE_MEM32 supports store. - PROBE_MEM32 relies on the verifier to clear the upper 32 bits of the src/dst register - PROBE_MEM32 adds the 64-bit kern_vm_start address (which is stored in S7 in the prologue). Due to the way bpf_arena is constructed, such an S7 + reg + off16 access is guaranteed to be within the arena virtual range, so no address check is needed at run-time. - S11 is a free callee-saved register, so it is used to store kern_vm_start - PROBE_MEM32 allows STX and ST. If they fault, the store is a nop. When LDX faults, the destination register is zeroed. To support these on riscv, we do tmp = S7 + src/dst reg and then use tmp2 as the new src/dst register. This allows us to reuse most of the code for normal [LDX | STX | ST]. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Pu Lehui <pulehui@huawei.com> Reviewed-by: Pu Lehui <pulehui@huawei.com> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20240404114203.105970-2-puranjay12@gmail.com
-
Alexei Starovoitov authored
It turned out that bpf prog callback addresses, bpf prog addresses used in bpf_trampoline, and 64-bit addresses in other cases can be represented as sign-extended 32-bit values. According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339 "Skylake has 0.64c throughput for mov r64, imm64, vs. 0.25 for mov r32, imm32." So use the shorter encoding and faster instruction when possible. Special care is needed in jit_subprogs(), since the bpf_pseudo_func() instruction cannot change its size during the last step of JIT. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/CAADnVQKFfpY-QZBrOU2CG8v2du8Lgyb7MNVmOZVK_yTyOdNbBA@mail.gmail.com Link: https://lore.kernel.org/bpf/20240401233800.42737-1-alexei.starovoitov@gmail.com
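The core condition boils down to a check like the following (a self-contained sketch, not the JIT's actual helper):
```
#include <stdbool.h>
#include <stdint.h>

/* Sketch: true if a 64-bit address round-trips through a sign-extended
 * 32-bit immediate, i.e. the shorter/faster mov encoding described above
 * is sufficient; otherwise the full "mov r64, imm64" is required.
 */
static bool fits_sign_extended_imm32(uint64_t addr)
{
	return (int64_t)addr == (int64_t)(int32_t)addr;
}
```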
-
Andrii Nakryiko authored
On non-SMP systems, there is no "per-CPU" data, it's just global data. So in such case just don't do this_cpu_off-based per-CPU address adjustment. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404040951.d4CUx5S6-lkp@intel.com/ Fixes: 7bdbf744 ("bpf: add special internal-only MOV instruction to resolve per-CPU addrs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240404034726.2766740-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 03 Apr, 2024 15 commits
-
-
Alexei Starovoitov authored
Andrii Nakryiko says: ==================== Add internal-only BPF per-CPU instruction Add a new BPF instruction for resolving per-CPU memory addresses. The new instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_X, with insns->off set to BPF_ADDR_PERCPU (== -1). It resolves the provided per-CPU offset to an absolute address where per-CPU data resides for "this" CPU. This patch set implements support for it in x86-64 BPF JIT only. Using the new instruction, we also implement inlining for three cases: - bpf_get_smp_processor_id(), which allows avoiding an unnecessary trivial function call, saving a bit of performance and also not polluting LBR records with unnecessary function call/return records; - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its performance on par with implementing per-CPU data structures using global variables in BPF (which is an awesome improvement, see benchmarks below); - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the same for non-PERCPU HASH map; this still saves a bit of overhead. To validate performance benefits, I hacked together a tiny benchmark doing only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY (arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To establish a baseline, I also implemented logic similar to PERCPU_ARRAY based on a global variable array, using bpf_get_smp_processor_id() to index the array for the current CPU (glob-arr-inc benchmark below). BEFORE ====== glob-arr-inc : 163.685 ± 0.092M/s arr-inc : 138.096 ± 0.160M/s hash-inc : 66.855 ± 0.123M/s AFTER ===== glob-arr-inc : 173.921 ± 0.039M/s (+6%) arr-inc : 170.729 ± 0.210M/s (+23.7%) hash-inc : 68.673 ± 0.070M/s (+2.7%) As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id(). But what's really important is that the arr-inc benchmark basically catches up with glob-arr-inc, resulting in +23.7% improvement. This means that in practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is critical (e.g., high-frequency stats collection, which is often a practical use for PERCPU_ARRAY today). v1->v2: - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei); - dropped the direct per-CPU memory read instruction, it can always be added back, if necessary; - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei); - switched all per-cpu addr casts to (unsigned long) to avoid sparse warnings. ==================== Link: https://lore.kernel.org/r/20240402021307.1012571-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
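For reference, the glob-arr-inc baseline described above reduces to something like this BPF program (a sketch with illustrative names, sizes, and attach point; not the actual benchmark code):
```
// SPDX-License-Identifier: GPL-2.0
/* Sketch: one counter slot per possible CPU in a plain global array,
 * indexed with bpf_get_smp_processor_id().
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define MAX_CPUS 256	/* illustrative upper bound */

__u64 counters[MAX_CPUS];

SEC("tp/syscalls/sys_enter_getpgid")
int glob_arr_inc(void *ctx)
{
	__u32 cpu = bpf_get_smp_processor_id();

	if (cpu < MAX_CPUS)
		counters[cpu]++;
	return 0;
}

char _license[] SEC("license") = "GPL";
```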
-
Andrii Nakryiko authored
Using the new per-CPU BPF instruction, partially inline the bpf_map_lookup_elem() helper for the per-CPU hashmap BPF map. Just like for the normal HASH map, we still generate a call into __htab_map_lookup_elem(), but after that we resolve the per-CPU element address using the new instruction, saving on extra function calls. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20240402021307.1012571-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Using the new per-CPU BPF instruction, implement inlining of the per-CPU ARRAY map lookup helper, if BPF JIT support is present. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20240402021307.1012571-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
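The BPF-side pattern that benefits from this inlining is the usual PERCPU_ARRAY lookup, the arr-inc counterpart of the sketch above (again with illustrative names):
```
// SPDX-License-Identifier: GPL-2.0
/* Sketch: with this change the bpf_map_lookup_elem() call below is inlined
 * by the verifier/JIT instead of going through a helper call.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} percpu_counter SEC(".maps");

SEC("tp/syscalls/sys_enter_getpgid")
int arr_inc(void *ctx)
{
	__u32 key = 0;
	__u64 *val;

	val = bpf_map_lookup_elem(&percpu_counter, &key);
	if (val)
		(*val)++;
	return 0;
}

char _license[] SEC("license") = "GPL";
```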
-
Andrii Nakryiko authored
If BPF JIT supports per-CPU MOV instruction, inline bpf_get_smp_processor_id() to eliminate unnecessary function calls. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20240402021307.1012571-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add a new BPF instruction for resolving absolute addresses of per-CPU data from their per-CPU offsets. This instruction is internal-only and users are not allowed to use it directly. It will only be used for internal inlining optimizations for now, between the BPF verifier and BPF JITs. We use a special BPF_MOV | BPF_ALU64 | BPF_X form with the insn->off field set to BPF_ADDR_PERCPU = -1. I used a negative offset value to distinguish it from the positive ones used by user-exposed instructions. Such an instruction performs a resolution of a per-CPU offset stored in a register to a valid kernel address which can be dereferenced. It is useful in any use case where the absolute address of per-CPU data has to be resolved (e.g., in inlining bpf_map_lookup_elem()). The BPF disassembler is also taught to recognize it to support dumping final BPF assembly code (non-JIT'ed version). Add an arch-specific way for BPF JITs to mark support for this instruction. This patch also adds support for this instruction in the x86-64 BPF JIT. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20240402021307.1012571-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
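Expressed with the uapi instruction fields, the special form described above looks roughly like this (a sketch; the kernel defines BPF_ADDR_PERCPU internally):
```
#include <linux/bpf.h>

/* Sketch: BPF_ALU64 | BPF_MOV | BPF_X with off = -1 marks the internal
 * "resolve per-CPU address" form; dst gets the absolute address for the
 * per-CPU offset currently held in src.
 */
#define BPF_ADDR_PERCPU	(-1)

static const struct bpf_insn percpu_resolve = {
	.code	 = BPF_ALU64 | BPF_MOV | BPF_X,
	.dst_reg = BPF_REG_0,
	.src_reg = BPF_REG_0,
	.off	 = BPF_ADDR_PERCPU,
	.imm	 = 0,
};
```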
-
Justin Stitt authored
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. bpf sym names get looked up and compared/cleaned with various string apis. This suggests they need to be NUL-terminated (strncpy() suggests this but does not guarantee it).
| static int compare_symbol_name(const char *name, char *namebuf)
| {
| 	cleanup_symbol_name(namebuf);
| 	return strcmp(name, namebuf);
| }
| static void cleanup_symbol_name(char *s)
| {
| 	...
| 	res = strstr(s, ".llvm.");
| 	...
| }
Use strscpy() as this method guarantees NUL-termination on the destination buffer. This patch also replaces two uses of strncpy() used in log.c. These are simple replacements as postfix has been zero-initialized on the stack and has source arguments with a size less than the destination's size. Note that this patch uses the new 2-argument version of strscpy introduced in commit e6584c39 ("string: Allow 2-argument strscpy()"). Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Link: https://lore.kernel.org/bpf/20240402-strncpy-kernel-bpf-core-c-v1-1-7cb07a426e78@google.com
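For reference, the 2-argument form mentioned above infers the destination size from the array type (a kernel-context fragment; buffer and variable names are illustrative):
```
/* Sketch: strscpy() always NUL-terminates and returns the number of bytes
 * copied, or -E2BIG if the source was truncated, unlike strncpy().
 */
char namebuf[KSYM_NAME_LEN];

strscpy(namebuf, sym_name);			/* 2-arg form: size taken from the array */
strscpy(namebuf, sym_name, sizeof(namebuf));	/* classic 3-arg form */
```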
-
Tushar Vyavahare authored
Introduce a test case to evaluate AF_XDP's robustness by pushing hardware and software ring sizes to their limits. This test ensures AF_XDP's reliability amidst potential producer/consumer throttling due to maximum ring utilization. The testing strategy includes: 1. Configuring rings to their maximum allowable sizes. 2. Executing a series of tests across diverse batch sizes to assess the system's behavior under different configurations. Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240402114529.545475-8-tushar.vyavahare@intel.com
-
Tushar Vyavahare authored
Add a new test case that stresses AF_XDP and the driver by configuring small hardware and software ring sizes. This verifies that AF_XDP continues to function properly even with insufficient ring space that could lead to frequent producer/consumer throttling. The test procedure involves: 1. Set the minimum possible ring configuration (tx 64 and rx 128). 2. Run tests with various batch sizes (1 and 63) to validate the system's behavior under different configurations. Update the Makefile to include network_helpers.o in the build process for xskxceiver. Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240402114529.545475-7-tushar.vyavahare@intel.com
-
Tushar Vyavahare authored
selftests/xsk: Introduce set_ring_size function with a retry mechanism for handling AF_XDP socket closures Introduce a new function, set_ring_size(), to manage asynchronous AF_XDP socket closure. Retry set_hw_ring_size up to SOCK_RECONF_CTR times if it fails due to an active AF_XDP socket. Return an error immediately for non-EBUSY errors. This enhances robustness against asynchronous AF_XDP socket closures during ring size changes. Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240402114529.545475-6-tushar.vyavahare@intel.com
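The retry logic described above is roughly the following (a sketch; constants, signatures, and the sleep interval are assumptions rather than the selftest's exact code, and set_hw_ring_size() is the helper sketched under the next commit entry):
```
#include <errno.h>
#include <unistd.h>

#define SOCK_RECONF_CTR 10	/* illustrative retry budget */

/* assumed helper, see the sketch under the next entry */
int set_hw_ring_size(const char *ifname, unsigned int tx, unsigned int rx);

/* Sketch: retry while the previous AF_XDP socket is still being torn down
 * asynchronously (EBUSY); fail immediately on any other error.
 */
static int set_ring_size(const char *ifname, unsigned int tx, unsigned int rx)
{
	int ret, ctr = 0;

	do {
		ret = set_hw_ring_size(ifname, tx, rx);
		if (!ret)
			return 0;
		if (ret != -EBUSY)
			return ret;
		usleep(1000);	/* give the old socket time to close */
	} while (++ctr < SOCK_RECONF_CTR);

	return ret;
}
```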
-
Tushar Vyavahare authored
Introduce a new function called set_hw_ring_size that allows for the dynamic configuration of the ring size within the interface. Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240402114529.545475-5-tushar.vyavahare@intel.com
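Under the hood, this kind of helper is typically a SIOCETHTOOL ioctl with ETHTOOL_SRINGPARAM; the following sketch is written under that assumption and is not the selftest's verbatim code:
```
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Sketch: set the NIC's TX/RX descriptor ring sizes via the legacy ethtool
 * ioctl interface. Error handling is minimal.
 */
static int set_hw_ring_size(const char *ifname, unsigned int tx, unsigned int rx)
{
	struct ethtool_ringparam ring = { .cmd = ETHTOOL_SRINGPARAM };
	struct ifreq ifr = {};
	int fd, ret;

	ring.tx_pending = tx;
	ring.rx_pending = rx;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -errno;

	snprintf(ifr.ifr_name, IFNAMSIZ, "%s", ifname);
	ifr.ifr_data = (void *)&ring;

	ret = ioctl(fd, SIOCETHTOOL, &ifr);
	ret = ret < 0 ? -errno : 0;
	close(fd);
	return ret;
}
```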
-
Tushar Vyavahare authored
Introduce a new function called get_hw_size that retrieves both the current and maximum size of the interface and stores this information in the 'ethtool_ringparam' structure. Remove ethtool_channels struct from xdp_hw_metadata.c due to redefinition error. Remove unused linux/if.h include from flow_dissector BPF test to address CI pipeline failure. Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240402114529.545475-4-tushar.vyavahare@intel.com
-
Tushar Vyavahare authored
Convert the constant BATCH_SIZE into a variable named batch_size to allow dynamic modification at runtime. This is required for the forthcoming changes to support testing different hardware ring sizes. While running these tests, a bug was identified when the batch size is roughly the same as the NIC ring size. This has now been addressed by Maciej's fix in commit 913eda2b ("i40e: xsk: remove count_mask"). Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240402114529.545475-3-tushar.vyavahare@intel.com
-
Tushar Vyavahare authored
This commit duplicates the ethtool.h file from the include/uapi/linux directory in the kernel source to the tools/include/uapi/linux directory. This action ensures that the ethtool.h file used in the tools directory is in sync with the kernel's version, maintaining consistency across the codebase. There are some checkpatch warnings in this file that could be cleaned up, but I preferred to move it over as-is for now to avoid disrupting the code. Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20240402114529.545475-2-tushar.vyavahare@intel.com
-
Alexei Starovoitov authored
Puranjay Mohan says: ==================== bpf,arm64: Add support for BPF Arena Changes in V4 (V3: https://lore.kernel.org/bpf/20240323103057.26499-1-puranjay12@gmail.com/): - Use more descriptive variable names. - Use insn_is_cast_user() helper. Changes in V3 (V2: https://lore.kernel.org/bpf/20240321153102.103832-1-puranjay12@gmail.com/): - Optimize bpf_addr_space_cast as suggested by Xu Kuohai Changes in V2 (V1: https://lore.kernel.org/bpf/20240314150003.123020-1-puranjay12@gmail.com/): - Fix build warnings by using 5 in place of 32 as DONT_CLEAR marker. R5 is not mapped to any BPF register so it can safely be used here. This series adds the support for PROBE_MEM32 and bpf_addr_space_cast instructions to the ARM64 BPF JIT. These two instructions allow the enablement of BPF Arena. All arena-related selftests are passing. [root@ip-172-31-6-62 bpf]# ./test_progs -a "*arena*" #3/1 arena_htab/arena_htab_llvm:OK #3/2 arena_htab/arena_htab_asm:OK #3 arena_htab:OK #4/1 arena_list/arena_list_1:OK #4/2 arena_list/arena_list_1000:OK #4 arena_list:OK #434/1 verifier_arena/basic_alloc1:OK #434/2 verifier_arena/basic_alloc2:OK #434/3 verifier_arena/basic_alloc3:OK #434/4 verifier_arena/iter_maps1:OK #434/5 verifier_arena/iter_maps2:OK #434/6 verifier_arena/iter_maps3:OK #434 verifier_arena:OK Summary: 3/10 PASSED, 0 SKIPPED, 0 FAILED This will need the patch [1] that introduced the insn_is_cast_user() helper to build. The verifier_arena selftest could fail in the CI because the following commit [2] is missing from bpf-next: [1] https://lore.kernel.org/bpf/20240324183226.29674-1-puranjay12@gmail.com/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=fa3550dca8f02ec312727653a94115ef3ab68445 Here is a CI run with all dependencies added: https://github.com/kernel-patches/bpf/pull/6641 ==================== Link: https://lore.kernel.org/r/20240325150716.4387-1-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Puranjay Mohan authored
LLVM generates the bpf_addr_space_cast instruction while translating pointers between the native (zero) address space and __attribute__((address_space(N))). The addr_space=1 is reserved as the bpf_arena address space. rY = addr_space_cast(rX, 0, 1) is processed by the verifier and converted to a normal 32-bit move: wX = wY. rY = addr_space_cast(rX, 1, 0) : used to convert a bpf arena pointer to a pointer in the userspace vma. This has to be converted by the JIT. Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240325150716.4387-3-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-