Commits · f91717007217d975aa975ddabd91ae1a107b9bff · Kirill Smelkov / linux

04 Apr, 2024 9 commits

bpf: Pack struct bpf_fib_lookup · f9171700

Anton Protopopov authored Apr 03, 2024

The struct bpf_fib_lookup is supposed to be of size 64. A recent commit
59b418c7 ("bpf: Add a check for struct bpf_fib_lookup size") added
a static assertion to check this property so that future changes to the
structure will not accidentally break this assumption.

As it immediately turned out, on some 32-bit arm systems, when AEABI=n,
the total size of the structure was equal to 68, see [1]. This happened
because the bpf_fib_lookup structure contains a union of two 16-bit
fields:

    union {
            __u16 tot_len;
            __u16 mtu_result;
    };

which was supposed to compile to a 16-bit-aligned 16-bit field. On the
aforementioned setups it was instead both aligned and padded to 32-bits.

Declare this inner union as __attribute__((packed, aligned(2))) such
that it always is of size 2 and is aligned to 16 bits.

  [1] https://lore.kernel.org/all/CA+G9fYtsoP51f-oP_Sp5MOq-Ffv8La2RztNpwvE6+R1VtFiLrw@mail.gmail.com/#tReported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: e1850ea9 ("bpf: bpf_fib_lookup return MTU value as output when looked up")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240403123303.1452184-1-aspsk@isovalent.com

f9171700

bpftool: Mount bpffs on provided dir instead of parent dir · 478a535a

Sahil Siddiq authored Apr 05, 2024

When pinning programs/objects under PATH (eg: during "bpftool prog
loadall") the bpffs is mounted on the parent dir of PATH in the
following situations:
- the given dir exists but it is not bpffs.
- the given dir doesn't exist and the parent dir is not bpffs.

Mounting on the parent dir can also have the unintentional side-
effect of hiding other files located under the parent dir.

If the given dir exists but is not bpffs, then the bpffs should
be mounted on the given dir and not its parent dir.

Similarly, if the given dir doesn't exist and its parent dir is not
bpffs, then the given dir should be created and the bpffs should be
mounted on this new dir.

Fixes: 2a36c26f ("bpftool: Support bpffs mountpoint as pin path for prog loadall")
Signed-off-by: Sahil Siddiq <icegambit91@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/2da44d24-74ae-a564-1764-afccf395eeec@isovalent.com/T/#t
Link: https://lore.kernel.org/bpf/20240404192219.52373-1-icegambit91@gmail.com

Closes: https://github.com/libbpf/bpftool/issues/100

Changes since v1:
 - Split "mount_bpffs_for_pin" into two functions.
   This is done to improve maintainability and readability.

Changes since v2:
- mount_bpffs_for_pin: rename to "create_and_mount_bpffs_dir".
- mount_bpffs_given_file: rename to "mount_bpffs_given_file".
- create_and_mount_bpffs_dir:
  - introduce "dir_exists" boolean.
  - remove new dir if "mnt_fs" fails.
- improve error handling and error messages.

Changes since v3:
- Rectify function name.
- Improve error messages and formatting.
- mount_bpffs_for_file:
  - Check if dir exists before block_mount check.

Changes since v4:
- Use strdup instead of strcpy.
- create_and_mount_bpffs_dir:
  - Use S_IRWXU instead of 0700.
- Improve error handling and formatting.

478a535a

Merge branch 'inline-bpf_get_branch_snapshot-bpf-helper' · d82c045f

Alexei Starovoitov authored Apr 04, 2024

Andrii Nakryiko says:

====================
Inline bpf_get_branch_snapshot() BPF helper

Implement inlining of bpf_get_branch_snapshot() BPF helper using generic BPF
assembly approach. This allows to reduce LBR record usage right before LBR
records are captured from inside BPF program.

See v1 cover letter ([0]) for some visual examples. I dropped them from v2
because there are multiple independent changes landing and being reviewed, all
of which remove different parts of LBR record waste, so presenting final state
of LBR "waste" gets more complicated until all of the pieces land.

  [0] https://lore.kernel.org/bpf/20240321180501.734779-1-andrii@kernel.org/

v2->v3:
  - fix BPF_MUL instruction definition;
v1->v2:
  - inlining of bpf_get_smp_processor_id() split out into a separate patch set
    implementing internal per-CPU BPF instruction;
  - add efficient divide-by-24 through multiplication logic, and leave
    comments to explain the idea behind it; this way inlined version of
    bpf_get_branch_snapshot() has no compromises compared to non-inlined
    version of the helper  (Alexei).
====================

Link: https://lore.kernel.org/r/20240404002640.1774210-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d82c045f

bpf: inline bpf_get_branch_snapshot() helper · 314a5362

Andrii Nakryiko authored Apr 03, 2024

Inline bpf_get_branch_snapshot() helper using architecture-agnostic
inline BPF code which calls directly into underlying callback of
perf_snapshot_branch_stack static call. This callback is set early
during kernel initialization and is never updated or reset, so it's ok
to fetch actual implementation using static_call_query() and call
directly into it.

This change eliminates a full function call and saves one LBR entry
in PERF_SAMPLE_BRANCH_ANY LBR mode.
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-3-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

314a5362

bpf: make bpf_get_branch_snapshot() architecture-agnostic · 5e6a3c1e

Andrii Nakryiko authored Apr 03, 2024

perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
there is no reason for BPF subsystem to keep track of which
architectures do support LBR or not. E.g., it looks like ARM64 might soon
get support for BRBE ([0]), which (with proper integration) should be
possible to utilize using this BPF helper.

perf_snapshot_branch_stack static call will point to
__static_call_return0() by default, which just returns zero, which will
lead to -ENOENT, as expected. So no need to guard anything here.

  [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240404002640.1774210-2-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

5e6a3c1e

bpf, riscv: Implement bpf_addr_space_cast instruction · 21ab0b6d

Puranjay Mohan authored Apr 04, 2024

LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))). The addr_space=0 is reserved as
bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY

rY = addr_space_cast(rX, 1, 0) has to be converted by JIT.
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20240404114203.105970-3-puranjay12@gmail.com

21ab0b6d

bpf, riscv: Implement PROBE_MEM32 pseudo instructions · 633a6e01

Puranjay Mohan authored Apr 04, 2024

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW]
instructions. They are similar to PROBE_MEM instructions with the
following differences:

- PROBE_MEM32 supports store.
- PROBE_MEM32 relies on the verifier to clear upper 32-bit of the
  src/dst register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in S7
  in the prologue). Due to bpf_arena constructions such S7 + reg +
  off16 access is guaranteed to be within arena virtual range, so no
  address check at run-time.
- S11 is a free callee-saved register, so it is used to store kern_vm_start
- PROBE_MEM32 allows STX and ST. If they fault the store is a nop. When
  LDX faults the destination register is zeroed.

To support these on riscv, we do tmp = S7 + src/dst reg and then use
tmp2 as the new src/dst register. This allows us to reuse most of the
code for normal [LDX | STX | ST].
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/bpf/20240404114203.105970-2-puranjay12@gmail.com

633a6e01

bpf: Optimize emit_mov_imm64(). · af682b76

Alexei Starovoitov authored Apr 01, 2024

Turned out that bpf prog callback addresses, bpf prog addresses
used in bpf_trampoline, and in other cases the 64-bit address
can be represented as sign extended 32-bit value.

According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
"Skylake has 0.64c throughput for mov r64, imm64, vs. 0.25 for mov r32, imm32."
So use shorter encoding and faster instruction when possible.

Special care is needed in jit_subprogs(), since bpf_pseudo_func()
instruction cannot change its size during the last step of JIT.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/CAADnVQKFfpY-QZBrOU2CG8v2du8Lgyb7MNVmOZVK_yTyOdNbBA@mail.gmail.com
Link: https://lore.kernel.org/bpf/20240401233800.42737-1-alexei.starovoitov@gmail.com

af682b76

bpf: handle CONFIG_SMP=n configuration in x86 BPF JIT · 1e9e0b85

Andrii Nakryiko authored Apr 03, 2024

On non-SMP systems, there is no "per-CPU" data, it's just global data.
So in such case just don't do this_cpu_off-based per-CPU address adjustment.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404040951.d4CUx5S6-lkp@intel.com/
Fixes: 7bdbf744 ("bpf: add special internal-only MOV instruction to resolve per-CPU addrs")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240404034726.2766740-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

1e9e0b85

03 Apr, 2024 16 commits

Merge branch 'add-internal-only-bpf-per-cpu-instruction' · 519e1de9

Alexei Starovoitov authored Apr 03, 2024

Andrii Nakryiko says:

====================
Add internal-only BPF per-CPU instruction

Add a new BPF instruction for resolving per-CPU memory addresses.

New instruction is a special form of BPF_ALU64 | BPF_MOV | BPF_X, with
insns->off set to BPF_ADDR_PERCPU (== -1). It resolves provided per-CPU offset
to an absolute address where per-CPU data resides for "this" CPU.

This patch set implements support for it in x86-64 BPF JIT only.

Using the new instruction, we also implement inlining for three cases:
  - bpf_get_smp_processor_id(), which allows to avoid unnecessary trivial
    function call, saving a bit of performance and also not polluting LBR
    records with unnecessary function call/return records;
  - PERCPU_ARRAY's bpf_map_lookup_elem() is completely inlined, bringing its
    performance to implementing per-CPU data structures using global variables
    in BPF (which is an awesome improvement, see benchmarks below);
  - PERCPU_HASH's bpf_map_lookup_elem() is partially inlined, just like the
    same for non-PERCPU HASH map; this still saves a bit of overhead.

To validate performance benefits, I hacked together a tiny benchmark doing
only bpf_map_lookup_elem() and incrementing the value by 1 for PERCPU_ARRAY
(arr-inc benchmark below) and PERCPU_HASH (hash-inc benchmark below) maps. To
establish a baseline, I also implemented logic similar to PERCPU_ARRAY based
on global variable array using bpf_get_smp_processor_id() to index array for
current CPU (glob-arr-inc benchmark below).

BEFORE
======
glob-arr-inc   :  163.685 ± 0.092M/s
arr-inc        :  138.096 ± 0.160M/s
hash-inc       :   66.855 ± 0.123M/s

AFTER
=====
glob-arr-inc   :  173.921 ± 0.039M/s (+6%)
arr-inc        :  170.729 ± 0.210M/s (+23.7%)
hash-inc       :   68.673 ± 0.070M/s (+2.7%)

As can be seen, PERCPU_HASH gets a modest +2.7% improvement, while global
array-based gets a nice +6% due to inlining of bpf_get_smp_processor_id().

But what's really important is that arr-inc benchmark basically catches up
with glob-arr-inc, resulting in +23.7% improvement. This means that in
practice it won't be necessary to avoid PERCPU_ARRAY anymore if performance is
critical (e.g., high-frequent stats collection, which is often a practical use
for PERCPU_ARRAY today).

v1->v2:
  - use BPF_ALU64 | BPF_MOV instruction instead of LDX (Alexei);
  - dropped the direct per-CPU memory read instruction, it can always be added
    back, if necessary;
  - guarded bpf_get_smp_processor_id() behind x86-64 check (Alexei);
  - switched all per-cpu addr casts to (unsigned long) to avoid sparse
    warnings.
====================

Link: https://lore.kernel.org/r/20240402021307.1012571-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

519e1de9

bpf: inline bpf_map_lookup_elem() helper for PERCPU_HASH map · 0b56e637

Andrii Nakryiko authored Apr 01, 2024

Using new per-CPU BPF instruction, partially inline
bpf_map_lookup_elem() helper for per-CPU hashmap BPF map. Just like for
normal HASH map, we still generate a call into __htab_map_lookup_elem(),
but after that we resolve per-CPU element address using a new
instruction, saving on extra functions calls.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-5-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

0b56e637

bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps · db69718b

Andrii Nakryiko authored Apr 01, 2024

Using new per-CPU BPF instruction implement inlining for per-CPU ARRAY
map lookup helper, if BPF JIT support is present.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-4-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

db69718b

bpf: inline bpf_get_smp_processor_id() helper · 1ae69210

Andrii Nakryiko authored Apr 01, 2024

If BPF JIT supports per-CPU MOV instruction, inline bpf_get_smp_processor_id()
to eliminate unnecessary function calls.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-3-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

1ae69210

bpf: add special internal-only MOV instruction to resolve per-CPU addrs · 7bdbf744

Andrii Nakryiko authored Apr 01, 2024

Add a new BPF instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only and
users are not allowed to use them directly. They will only be used for
internal inlining optimizations for now between BPF verifier and BPF JITs.

We use a special BPF_MOV | BPF_ALU64 | BPF_X form with insn->off field
set to BPF_ADDR_PERCPU = -1. I used negative offset value to distinguish
them from positive ones used by user-exposed instructions.

Such instruction performs a resolution of a per-CPU offset stored in
a register to a valid kernel address which can be dereferenced. It is
useful in any use case where absolute address of a per-CPU data has to
be resolved (e.g., in inlining bpf_map_lookup_elem()).

BPF disassembler is also taught to recognize them to support dumping
final BPF assembly code (non-JIT'ed version).

Add arch-specific way for BPF JITs to mark support for this instructions.

This patch also adds support for these instructions in x86-64 BPF JIT.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240402021307.1012571-2-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

7bdbf744

bpf: Replace deprecated strncpy with strscpy · 2e114248

Justin Stitt authored Apr 02, 2024

strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

bpf sym names get looked up and compared/cleaned with various string
apis. This suggests they need to be NUL-terminated (strncpy() suggests
this but does not guarantee it).

|	static int compare_symbol_name(const char *name, char *namebuf)
|	{
|		cleanup_symbol_name(namebuf);
|		return strcmp(name, namebuf);
|	}

|	static void cleanup_symbol_name(char *s)
|	{
|		...
|		res = strstr(s, ".llvm.");
|		...
|	}

Use strscpy() as this method guarantees NUL-termination on the
destination buffer.

This patch also replaces two uses of strncpy() used in log.c. These are
simple replacements as postfix has been zero-initialized on the stack
and has source arguments with a size less than the destination's size.

Note that this patch uses the new 2-argument version of strscpy
introduced in commit e6584c39 ("string: Allow 2-argument strscpy()").
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/bpf/20240402-strncpy-kernel-bpf-core-c-v1-1-7cb07a426e78@google.com

2e114248

selftests/xsk: Add new test case for AF_XDP under max ring sizes · c53908b2

Tushar Vyavahare authored Apr 02, 2024

Introduce a test case to evaluate AF_XDP's robustness by pushing hardware
and software ring sizes to their limits. This test ensures AF_XDP's
reliability amidst potential producer/consumer throttling due to maximum
ring utilization. The testing strategy includes:

1. Configuring rings to their maximum allowable sizes.
2. Executing a series of tests across diverse batch sizes to assess
system's behavior under different configurations.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-8-tushar.vyavahare@intel.com

c53908b2

selftests/xsk: Test AF_XDP functionality under minimal ring configurations · c4f96053

Tushar Vyavahare authored Apr 02, 2024

Add a new test case that stresses AF_XDP and the driver by configuring
small hardware and software ring sizes. This verifies that AF_XDP continues
to function properly even with insufficient ring space that could lead
to frequent producer/consumer throttling. The test procedure involves:

1. Set the minimum possible ring configuration(tx 64 and rx 128).
2. Run tests with various batch sizes(1 and 63) to validate the system's
behavior under different configurations.

Update Makefile to include network_helpers.o in the build process for
xskxceiver.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-7-tushar.vyavahare@intel.com

c4f96053

selftests/xsk: Introduce set_ring_size function with a retry mechanism for... · 776021e0

Tushar Vyavahare authored Apr 02, 2024

selftests/xsk: Introduce set_ring_size function with a retry mechanism for handling AF_XDP socket closures

Introduce a new function, set_ring_size(), to manage asynchronous AF_XDP
socket closure. Retry set_hw_ring_size up to SOCK_RECONF_CTR times if it
fails due to an active AF_XDP socket. Return an error immediately for
non-EBUSY errors. This enhances robustness against asynchronous AF_XDP
socket closures during ring size changes.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-6-tushar.vyavahare@intel.com

776021e0

selftests/bpf: Implement set_hw_ring_size function to configure interface ring size · bee3a7b0

Tushar Vyavahare authored Apr 02, 2024

Introduce a new function called set_hw_ring_size that allows for the
dynamic configuration of the ring size within the interface.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-5-tushar.vyavahare@intel.com

bee3a7b0

selftests/bpf: Implement get_hw_ring_size function to retrieve current and max interface size · 90a695c3

Tushar Vyavahare authored Apr 02, 2024

Introduce a new function called get_hw_size that retrieves both the
current and maximum size of the interface and stores this information
in the 'ethtool_ringparam' structure.

Remove ethtool_channels struct from xdp_hw_metadata.c due to redefinition
error. Remove unused linux/if.h include from flow_dissector BPF test to
address CI pipeline failure.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-4-tushar.vyavahare@intel.com

90a695c3

selftests/xsk: Make batch size variable · c3bd0150

Tushar Vyavahare authored Apr 02, 2024

Convert the constant BATCH_SIZE into a variable named batch_size to allow
dynamic modification at runtime. This is required for the forthcoming
changes to support testing different hardware ring sizes.

While running these tests, a bug was identified when the batch size is
roughly the same as the NIC ring size. This has now been addressed by
Maciej's fix in commit 913eda2b ("i40e: xsk: remove count_mask").
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-3-tushar.vyavahare@intel.com

c3bd0150

tools: Add ethtool.h header to tooling infra · 7effe3fd

Tushar Vyavahare authored Apr 02, 2024

This commit duplicates the ethtool.h file from the include/uapi/linux
directory in the kernel source to the tools/include/uapi/linux directory.

This action ensures that the ethtool.h file used in the tools directory
is in sync with the kernel's version, maintaining consistency across the
codebase.

There are some checkpatch warnings in this file that could be cleaned up,
but I preferred to move it over as-is for now to avoid disrupting the code.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-2-tushar.vyavahare@intel.com

7effe3fd

Merge branch 'bpf-arm64-add-support-for-bpf-arena' · 49b73fa6

Alexei Starovoitov authored Apr 02, 2024

Puranjay Mohan says:

====================
bpf,arm64: Add support for BPF Arena

Changes in V4
V3: https://lore.kernel.org/bpf/20240323103057.26499-1-puranjay12@gmail.com/
- Use more descriptive variable names.
- Use insn_is_cast_user() helper.

Changes in V3
V2: https://lore.kernel.org/bpf/20240321153102.103832-1-puranjay12@gmail.com/
- Optimize bpf_addr_space_cast as suggested by Xu Kuohai

Changes in V2
V1: https://lore.kernel.org/bpf/20240314150003.123020-1-puranjay12@gmail.com/
- Fix build warnings by using 5 in place of 32 as DONT_CLEAR marker.
  R5 is not mapped to any BPF register so it can safely be used here.

This series adds the support for PROBE_MEM32 and bpf_addr_space_cast
instructions to the ARM64 BPF JIT. These two instructions allow the
enablement of BPF Arena.

All arena related selftests are passing.

  [root@ip-172-31-6-62 bpf]# ./test_progs -a "*arena*"
  #3/1     arena_htab/arena_htab_llvm:OK
  #3/2     arena_htab/arena_htab_asm:OK
  #3       arena_htab:OK
  #4/1     arena_list/arena_list_1:OK
  #4/2     arena_list/arena_list_1000:OK
  #4       arena_list:OK
  #434/1   verifier_arena/basic_alloc1:OK
  #434/2   verifier_arena/basic_alloc2:OK
  #434/3   verifier_arena/basic_alloc3:OK
  #434/4   verifier_arena/iter_maps1:OK
  #434/5   verifier_arena/iter_maps2:OK
  #434/6   verifier_arena/iter_maps3:OK
  #434     verifier_arena:OK
  Summary: 3/10 PASSED, 0 SKIPPED, 0 FAILED

This will need the patch [1] that introduced insn_is_cast_user() helper to
build.

The verifier_arena selftest could fail in the CI because the following
commit[2] is missing from bpf-next:

[1] https://lore.kernel.org/bpf/20240324183226.29674-1-puranjay12@gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=fa3550dca8f02ec312727653a94115ef3ab68445

Here is a CI run with all dependencies added: https://github.com/kernel-patches/bpf/pull/6641
====================

Link: https://lore.kernel.org/r/20240325150716.4387-1-puranjay12@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

49b73fa6

bpf: Add arm64 JIT support for bpf_addr_space_cast instruction. · 4dd31243

Puranjay Mohan authored Mar 25, 2024

LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))). The addr_space=0 is reserved as
bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY.

rY = addr_space_cast(rX, 1, 0) : used to convert a bpf arena pointer to
a pointer in the userspace vma. This has to be converted by the JIT.
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-3-puranjay12@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

4dd31243

bpf: Add arm64 JIT support for PROBE_MEM32 pseudo instructions. · 339af577

Puranjay Mohan authored Mar 25, 2024

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW]
instructions.  They are similar to PROBE_MEM instructions with the
following differences:
- PROBE_MEM32 supports store.
- PROBE_MEM32 relies on the verifier to clear upper 32-bit of the
  src/dst register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in R28
  in the prologue). Due to bpf_arena constructions such R28 + reg +
  off16 access is guaranteed to be within arena virtual range, so no
  address check at run-time.
- PROBE_MEM32 allows STX and ST. If they fault the store is a nop. When
  LDX faults the destination register is zeroed.

To support these on arm64, we do tmp2 = R28 + src/dst reg and then use
tmp2 as the new src/dst register. This allows us to reuse most of the
code for normal [LDX | STX | ST].
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-2-puranjay12@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

339af577

02 Apr, 2024 11 commits

selftests/bpf: Add pid limit for mptcpify prog · c07b4bcd

Geliang Tang authored Apr 02, 2024

In order to prevent mptcpify prog from affecting the running results
of other BPF tests, a pid limit was added to restrict it from only
modifying its own program.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/8987e2938e15e8ec390b85b5dcbee704751359dc.1712054986.git.tanggeliang@kylinos.cn

c07b4bcd

libbpf: Use local bpf_helpers.h include · 15ea39ad

Tobias Böhm authored Apr 02, 2024

Commit 20d59ee5 ("libbpf: add bpf_core_cast() macro") added a
bpf_helpers include in bpf_core_read.h as a system include. Usually, the
includes are local, though, like in bpf_tracing.h. This commit adjusts
the include to be local as well.
Signed-off-by: Tobias Böhm <tobias@aibor.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/q5d5bgc6vty2fmaazd5e73efd6f5bhiru2le6fxn43vkw45bls@fhlw2s5ootdb

15ea39ad

bpf: Improve program stats run-time calculation · ce09cbdd

Jose Fernandez authored Apr 01, 2024

This patch improves the run-time calculation for program stats by
capturing the duration as soon as possible after the program returns.

Previously, the duration included u64_stats_t operations. While the
instrumentation overhead is part of the total time spent when stats are
enabled, distinguishing between the program's native execution time and
the time spent due to instrumentation is crucial for accurate
performance analysis.

By making this change, the patch facilitates more precise optimization
of BPF programs, enabling users to understand their performance in
environments without stats enabled.

I used a virtualized environment to measure the run-time over one minute
for a basic raw_tracepoint/sys_enter program, which just increments a
local counter. Although the virtualization introduced some performance
degradation that could affect the results, I observed approximately a
16% decrease in average run-time reported by stats with this change
(310 -> 260 nsec).
Signed-off-by: Jose Fernandez <josef@netflix.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402034010.25060-1-josef@netflix.com

ce09cbdd

selftests/bpf: Skip test when perf_event_open returns EOPNOTSUPP · c186ed12

Pu Lehui authored Apr 02, 2024

When testing send_signal and stacktrace_build_id_nmi using the riscv sbi
pmu driver without the sscofpmf extension or the riscv legacy pmu driver,
then failures as follows are encountered:

    test_send_signal_common:FAIL:perf_event_open unexpected perf_event_open: actual -1 < expected 0
    #272/3   send_signal/send_signal_nmi:FAIL

    test_stacktrace_build_id_nmi:FAIL:perf_event_open err -1 errno 95
    #304     stacktrace_build_id_nmi:FAIL

The reason is that the above pmu driver or hardware does not support
sampling events, that is, PERF_PMU_CAP_NO_INTERRUPT is set to pmu
capabilities, and then perf_event_open returns EOPNOTSUPP. Since
PERF_PMU_CAP_NO_INTERRUPT is not only set in the riscv-related pmu driver,
it is better to skip testing when this capability is set.
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402073029.1299085-1-pulehui@huaweicloud.com

c186ed12

bpftool: Use __typeof__() instead of typeof() in BPF skeleton · 2a24e248

Andrii Nakryiko authored Apr 01, 2024

When generated BPF skeleton header is included in C++ code base, some
compiler setups will emit warning about using language extensions due to
typeof() usage, resulting in something like:

  error: extension used [-Werror,-Wlanguage-extension-token]
  obj->struct_ops.empty_tcp_ca = (typeof(obj->struct_ops.empty_tcp_ca))
                                  ^

It looks like __typeof__() is a preferred way to do typeof() with better
C++ compatibility behavior, so switch to that. With __typeof__() we get
no such warning.

Fixes: c2a0257c ("bpftool: Cast pointers for shadow types explicitly.")
Fixes: 00389c58 ("bpftool: Add support for subskeletons")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Quentin Monnet <qmo@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240401170713.2081368-1-andrii@kernel.org

2a24e248

selftests/bpf: Using llvm may_goto inline asm for cond_break macro · 965c6167

Yonghong Song authored Apr 01, 2024

Currently, cond_break macro uses bytes to encode the may_goto insn.
Patch [1] in llvm implemented may_goto insn in BPF backend.
Replace byte-level encoding with llvm inline asm for better usability.
Using llvm may_goto insn is controlled by macro __BPF_FEATURE_MAY_GOTO.

  [1] https://github.com/llvm/llvm-project/commit/0e0bfacff71859d1f9212205f8f873d47029d3fbSigned-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402025446.3215182-1-yonghong.song@linux.dev

965c6167

bpf: Add a verbose message if map limit is reached · 9dc182c5

Anton Protopopov authored Apr 02, 2024

When more than 64 maps are used by a program and its subprograms the
verifier returns -E2BIG. Add a verbose message which highlights the
source of the error and also print the actual limit.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402073347.195920-1-aspsk@isovalent.com

9dc182c5

bpf: Fix typo in uapi doc comments · ca4ddc26

David Lechner authored Mar 29, 2024

In a few places in the bpf uapi headers, EOPNOTSUPP is missing a "P" in
the doc comments. This adds the missing "P".
Signed-off-by: David Lechner <dlechner@baylibre.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240329152900.398260-2-dlechner@baylibre.com

ca4ddc26

bpftool: Clean-up typos, punctuation, list formatting in docs · a70f5d84

Rameez Rehman authored Mar 31, 2024

Improve the formatting of the attach flags for cgroup programs in the
relevant man page, and fix typos ("can be on of", "an userspace inet
socket") when introducing that list. Also fix a couple of other trivial
issues in docs.

[ Quentin: Fixed trival issues in bpftool-gen.rst and bpftool-iter.rst ]
Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-4-qmo@kernel.org

a70f5d84

bpftool: Remove useless emphasis on command description in man pages · ea379b3c

Rameez Rehman authored Mar 31, 2024

As it turns out, the terms in definition lists in the rST file are
already rendered with bold-ish formatting when generating the man pages;
all double-star sequences we have in the commands for the command
description are unnecessary, and can be removed to make the
documentation easier to read.

The rST files were automatically processed with:

    sed -i '/DESCRIPTION/,/OPTIONS/ { /^\*/ s/\*\*//g }' b*.rst
Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-3-qmo@kernel.org

ea379b3c

bpftool: Use simpler indentation in source rST for documentation · f7b68543

Rameez Rehman authored Mar 31, 2024

The rST manual pages for bpftool would use a mix of tabs and spaces for
indentation. While this is the norm in C code, this is rather unusual
for rST documents, and over time we've seen many contributors use a
wrong level of indentation for documentation update.

Let's fix bpftool's indentation in docs once and for all:

- Let's use spaces, that are more common in rST files.
- Remove one level of indentation for the synopsis, the command
  description, and the "see also" section. As a result, all sections
  start with the same indentation level in the generated man page.
- Rewrap the paragraphs after the changes.

There is no content change in this patch, only indentation and
rewrapping changes. The wrapping in the generated source files for the
manual pages is changed, but the pages displayed with "man" remain the
same, apart from the adjusted indentation level on relevant sections.

[ Quentin: rebased on bpf-next, removed indent level for command
  description and options, updated synopsis, command summary, and "see
  also" sections. ]
Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-2-qmo@kernel.org

f7b68543

30 Mar, 2024 1 commit

selftests/bpf: make multi-uprobe tests work in RELEASE=1 mode · 623bdd58

Andrii Nakryiko authored Mar 29, 2024

When BPF selftests are built in RELEASE=1 mode with -O2 optimization
level, uprobe_multi binary, called from multi-uprobe tests is optimized
to the point that all the thousands of target uprobe_multi_func_XXX
functions are eliminated, breaking tests.

So ensure they are preserved by using weak attribute.

But, actually, compiling uprobe_multi binary with -O2 takes a really
long time, and is quite useless (it's not a benchmark). So in addition
to ensuring that uprobe_multi_func_XXX functions are preserved, opt-out
of -O2 explicitly in Makefile and stick to -O0. This saves a lot of
compilation time.

With -O2, just recompiling uprobe_multi:

  $ touch uprobe_multi.c
  $ time make RELEASE=1 -j90
  make RELEASE=1 -j90  291.66s user 2.54s system 99% cpu 4:55.52 total

With -O0:
  $ touch uprobe_multi.c
  $ time make RELEASE=1 -j90
  make RELEASE=1 -j90  22.40s user 1.91s system 99% cpu 24.355 total

5 minutes vs (still slow, but...) 24 seconds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240329190410.4191353-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

623bdd58

29 Mar, 2024 3 commits

bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie. · 59f2f841

Alexei Starovoitov authored Mar 29, 2024

syzbot reported the following lock sequence:
cpu 2:
  grabs timer_base lock
    spins on bpf_lpm lock

cpu 1:
  grab rcu krcp lock
    spins on timer_base lock

cpu 0:
  grab bpf_lpm lock
    spins on rcu krcp lock

bpf_lpm lock can be the same.
timer_base lock can also be the same due to timer migration.
but rcu krcp lock is always per-cpu, so it cannot be the same lock.
Hence it's a false positive.
To avoid lockdep complaining move kfree_rcu() after spin_unlock.

Reported-by: syzbot+1fa663a2100308ab6eab@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240329171439.37813-1-alexei.starovoitov@gmail.com

59f2f841

Merge branch 'Use start_server and connect_fd_to_fd' · 201874fc

Martin KaFai Lau authored Mar 28, 2024

Geliang Tang says:

====================
Simplify bpf_tcp_ca test by using connect_fd_to_fd and start_server
helpers.

v4:
 - Matt reminded me that I shouldn't send a square-to patch to BPF (thanks),
   so I update them into two patches in v4.

v3:
 - split v2 as two patches as Daniel suggested.
 - The patch "selftests/bpf: Use start_server in bpf_tcp_ca" is merged
   by Daniel (thanks), but I forgot to drop 'settimeo(lfd, 0)' in it, so
   I send a squash-to patch to fix this.
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

201874fc

selftests/bpf: Drop settimeo in do_test · 42667092

Geliang Tang authored Mar 26, 2024

settimeo is invoked in start_server() and in connect_fd_to_fd() already,
no need to invoke settimeo(lfd, 0) and settimeo(fd, 0) in do_test()
anymore. This patch drops them.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/dbc3613bee3b1c78f95ac9ff468bf47c92f106ea.1711447102.git.tanggeliang@kylinos.cnSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

42667092