Commits · ee9fd0ac3017c4313be91a220a9ac4c99dde7ad4 · Kirill Smelkov / linux

12 May, 2023 2 commits

bpf: Address KCSAN report on bpf_lru_list · ee9fd0ac

Martin KaFai Lau authored May 10, 2023

KCSAN reported a data-race when accessing node->ref.
Although node->ref does not have to be accurate,
take this chance to use a more common READ_ONCE() and WRITE_ONCE()
pattern instead of data_race().

There is an existing bpf_lru_node_is_ref() and bpf_lru_node_set_ref().
This patch also adds bpf_lru_node_clear_ref() to do the
WRITE_ONCE(node->ref, 0) also.

==================================================================
BUG: KCSAN: data-race in __bpf_lru_list_rotate / __htab_lru_percpu_map_update_elem

write to 0xffff888137038deb of 1 bytes by task 11240 on cpu 1:
__bpf_lru_node_move kernel/bpf/bpf_lru_list.c:113 [inline]
__bpf_lru_list_rotate_active kernel/bpf/bpf_lru_list.c:149 [inline]
__bpf_lru_list_rotate+0x1bf/0x750 kernel/bpf/bpf_lru_list.c:240
bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:329 [inline]
bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
bpf_lru_pop_free+0x638/0xe20 kernel/bpf/bpf_lru_list.c:499
prealloc_lru_pop kernel/bpf/hashtab.c:290 [inline]
__htab_lru_percpu_map_update_elem+0xe7/0x820 kernel/bpf/hashtab.c:1316
bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
__sys_bpf+0x338/0x810
__do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

read to 0xffff888137038deb of 1 bytes by task 11241 on cpu 0:
bpf_lru_node_set_ref kernel/bpf/bpf_lru_list.h:70 [inline]
__htab_lru_percpu_map_update_elem+0x2f1/0x820 kernel/bpf/hashtab.c:1332
bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
__sys_bpf+0x338/0x810
__do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

value changed: 0x01 -> 0x00

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 11241 Comm: syz-executor.3 Not tainted 6.3.0-rc7-syzkaller-00136-g6a66fdd2 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023
==================================================================

Reported-by: syzbot+ebe648a84e8784763f82@syzkaller.appspotmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230511043748.1384166-1-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

ee9fd0ac

bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25 · 7b99f759

Alan Maguire authored May 10, 2023

v1.25 of pahole supports filtering out functions with multiple inconsistent
function prototypes or optimized-out parameters from the BTF representation.
These present problems because there is no additional info in BTF saying which
inconsistent prototype matches which function instance to help guide attachment,
and functions with optimized-out parameters can lead to incorrect assumptions
about register contents.

So for now, filter out such functions while adding BTF representations for
functions that have "."-suffixes (foo.isra.0) but not optimized-out parameters.
This patch assumes that below linked changes land in pahole for v1.25.

Issues with pahole filtering being too aggressive in removing functions
appear to be resolved now, but CI and further testing will confirm.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230510130241.1696561-1-alan.maguire@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

7b99f759

06 May, 2023 9 commits

Merge branch 'Dynptr Verifier Adjustments' · 6e61c5fa

Alexei Starovoitov authored May 06, 2023

Daniel Rosenberg says:

====================
These patches relax a few verifier requirements around dynptrs.
Patches 1-3 are unchanged from v2, apart from rebasing
Patch 4 is the same as in v1, see
https://lore.kernel.org/bpf/CA+PiJmST4WUH061KaxJ4kRL=fqy3X6+Wgb2E2rrLT5OYjUzxfQ@mail.gmail.com/
Patch 5 adds a test for the change in Patch 4
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

6e61c5fa

selftests/bpf: Accept mem from dynptr in helper funcs · 798e48fc

Daniel Rosenberg authored May 05, 2023

This ensures that buffers retrieved from dynptr_data are allowed to be
passed in to helpers that take mem, like bpf_strncmp
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-6-drosen@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

798e48fc

bpf: verifier: Accept dynptr mem as mem in helpers · 2012c867

Daniel Rosenberg authored May 05, 2023

This allows using memory retrieved from dynptrs with helper functions
that accept ARG_PTR_TO_MEM. For instance, results from bpf_dynptr_data
can be passed along to bpf_strncmp.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-5-drosen@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

2012c867

selftests/bpf: Check overflow in optional buffer · 3881fdfe

Daniel Rosenberg authored May 05, 2023

This ensures we still reject invalid memory accesses in buffers that are
marked optional.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-4-drosen@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

3881fdfe

selftests/bpf: Test allowing NULL buffer in dynptr slice · 1ce33b6c

Daniel Rosenberg authored May 05, 2023

bpf_dynptr_slice(_rw) no longer requires a buffer for verification. If the
buffer is needed, but not present, the function will return NULL.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-3-drosen@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

1ce33b6c

bpf: Allow NULL buffers in bpf_dynptr_slice(_rw) · 3bda08b6

Daniel Rosenberg authored May 05, 2023

bpf_dynptr_slice(_rw) uses a user provided buffer if it can not provide
a pointer to a block of contiguous memory. This buffer is unused in the
case of local dynptrs, and may be unused in other cases as well. There
is no need to require the buffer, as the kfunc can just return NULL if
it was needed and not provided.

This adds another kfunc annotation, __opt, which combines with __sz and
__szk to allow the buffer associated with the size to be NULL. If the
buffer is NULL, the verifier does not check that the buffer is of
sufficient size.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20230506013134.2492210-2-drosen@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

3bda08b6

Merge branch 'Introduce a new kfunc of bpf_task_under_cgroup' · e04ddf17

Alexei Starovoitov authored May 06, 2023

Feng zhou says:

====================
Trace sched related functions, such as enqueue_task_fair, it is necessary to
specify a task instead of the current task which within a given cgroup.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e04ddf17

selftests/bpf: Add testcase for bpf_task_under_cgroup · 49e0263a

Feng Zhou authored May 06, 2023

test_progs:
Tests new kfunc bpf_task_under_cgroup().

The bpf program saves the new task's pid within a given cgroup to
the remote_pid, which is convenient for the user-mode program to
verify the test correctness.

The user-mode program creates its own mount namespace, and mounts the
cgroupsv2 hierarchy in there, call the fork syscall, then check if
remote_pid and local_pid are unequal.
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230506031545.35991-3-zhoufeng.zf@bytedance.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

49e0263a

bpf: Add bpf_task_under_cgroup() kfunc · b5ad4cdc

Feng Zhou authored May 06, 2023

Add a kfunc that's similar to the bpf_current_task_under_cgroup.
The difference is that it is a designated task.

When hook sched related functions, sometimes it is necessary to
specify a task instead of the current task.
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230506031545.35991-2-zhoufeng.zf@bytedance.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

b5ad4cdc

05 May, 2023 14 commits

samples/bpf: Fix buffer overflow in tcp_basertt · f4dea968

Pengcheng Yang authored May 05, 2023

Using sizeof(nv) or strlen(nv)+1 is correct.

Fixes: c890063e ("bpf: sample BPF_SOCKET_OPS_BASE_RTT program")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Link: https://lore.kernel.org/r/1683276658-2860-1-git-send-email-yangpc@wangsu.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

f4dea968

bpf, docs: Update llvm_relocs.rst with typo fixes · 69535186

Will Hawkins authored Apr 27, 2023

Correct a few typographical errors and fix some mistakes in examples.
Signed-off-by: Will Hawkins <hawkinsw@obs.cr>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230428023015.1698072-2-hawkinsw@obs.crSigned-off-by: Alexei Starovoitov <ast@kernel.org>

69535186

Merge branch 'Add precision propagation for subprogs and callbacks' · fbc0b025

Alexei Starovoitov authored May 04, 2023

Andrii Nakryiko says:

====================
As more and more real-world BPF programs become more complex
and increasingly use subprograms (both static and global), scalar precision
tracking and its (previously weak) support for BPF subprograms (and callbacks
as a special case of that) is becoming more and more of an issue and
limitation. Couple that with increasing reliance on state equivalence (BPF
open-coded iterators have a hard requirement for state equivalence to converge
and successfully validate loops), and it becomes pretty critical to address
this limitation and make precision tracking universally supported for BPF
programs of any complexity and composition.

This patch set teaches BPF verifier to support SCALAR precision
backpropagation across multiple frames (for subprogram calls and callback
simulations) and addresses most practical situations (SCALAR stack
loads/stores using registers other than r10 being the last remaining
limitation, though thankfully rarely used in practice).

Main logic is explained in details in patch #8. The rest are preliminary
preparations, refactorings, clean ups, and fixes. See respective patches for
details.

Patch #8 has also veristat comparison of results for selftests, Cilium, and
some of Meta production BPF programs before and after these changes.

v2->v3:
  - drop bitcnt and ifs from bt_xxx() helpers (Alexei);
v1->v2:
  - addressed review feedback form Alexei, adjusted commit messages, comments,
    added verbose(), WARN_ONCE(), etc;
  - re-ran all the tests and veristat on selftests, cilium, and meta-internal
    code: no new changes and no kernel warnings.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

fbc0b025

selftests/bpf: revert iter test subprog precision workaround · c91ab90c

Andrii Nakryiko authored May 04, 2023

Now that precision propagation is supported fully in the presence of
subprogs, there is no need to work around iter test. Revert original
workaround.

This reverts be7dbd27 ("selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests").
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-11-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

c91ab90c

selftests/bpf: add precision propagation tests in the presence of subprogs · 3ef3d217

Andrii Nakryiko authored May 04, 2023

Add a bunch of tests validating verifier's precision backpropagation
logic in the presence of subprog calls and/or callback-calling
helpers/kfuncs.

We validate the following conditions:
  - subprog_result_precise: static subprog r0 result precision handling;
  - global_subprog_result_precise: global subprog r0 precision
    shortcutting, similar to BPF helper handling;
  - callback_result_precise: similarly r0 marking precise for
    callback-calling helpers;
  - parent_callee_saved_reg_precise, parent_callee_saved_reg_precise_global:
    propagation of precision for callee-saved registers bypassing
    static/global subprogs;
  - parent_callee_saved_reg_precise_with_callback: same as above, but in
    the presence of callback-calling helper;
  - parent_stack_slot_precise, parent_stack_slot_precise_global:
    similar to above, but instead propagating precision of stack slot
    (spilled SCALAR reg);
  - parent_stack_slot_precise_with_callback: same as above, but in the
    presence of callback-calling helper;
  - subprog_arg_precise: propagation of precision of static subprog's
    input argument back to caller;
  - subprog_spill_into_parent_stack_slot_precise: negative test
    validating that verifier currently can't support backtracking of stack
    access with non-r10 register, we validate that we fallback to
    forcing precision for all SCALARs.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-10-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

3ef3d217

bpf: support precision propagation in the presence of subprogs · fde2a388

Andrii Nakryiko authored May 04, 2023

Add support precision backtracking in the presence of subprogram frames in
jump history.

This means supporting a few different kinds of subprogram invocation
situations, all requiring a slightly different handling in precision
backtracking handling logic:
  - static subprogram calls;
  - global subprogram calls;
  - callback-calling helpers/kfuncs.

For each of those we need to handle a few precision propagation cases:
  - what to do with precision of subprog returns (r0);
  - what to do with precision of input arguments;
  - for all of them callee-saved registers in caller function should be
    propagated ignoring subprog/callback part of jump history.

N.B. Async callback-calling helpers (currently only
bpf_timer_set_callback()) are transparent to all this because they set
a separate async callback environment and thus callback's history is not
shared with main program's history. So as far as all the changes in this
commit goes, such helper is just a regular helper.

Let's look at all these situation in more details. Let's start with
static subprogram being called, using an exxerpt of a simple main
program and its static subprog, indenting subprog's frame slightly to
make everything clear.

frame 0				frame 1			precision set
=======				=======			=============

 9: r6 = 456;
10: r1 = 123;						fr0: r6
11: call pc+10;						fr0: r1, r6
				22: r0 = r1;		fr0: r6;     fr1: r1
				23: exit		fr0: r6;     fr1: r0
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6
15: exit

As can be seen above main function is passing 123 as single argument to
an identity (`return x;`) subprog. Returned value is used to adjust map
pointer offset, which forces r0 to be marked as precise. Then
instruction #14 does the same for callee-saved r6, which will have to be
backtracked all the way to instruction #9. For brevity, precision sets
for instruction #13 and #14 are combined in the diagram above.

First, for subprog calls, r0 returned from subprog (in frame 0) has to
go into subprog's frame 1, and should be cleared from frame 0. So we go
back into subprog's frame knowing we need to mark r0 precise. We then
see that insn #22 sets r0 from r1, so now we care about marking r1
precise.  When we pop up from subprog's frame back into caller at
insn #11 we keep r1, as it's an argument-passing register, so we eventually
find `10: r1 = 123;` and satify precision propagation chain for insn #13.

This example demonstrates two sets of rules:
  - r0 returned after subprog call has to be moved into subprog's r0 set;
  - *static* subprog arguments (r1-r5) are moved back to caller precision set.

Let's look at what happens with callee-saved precision propagation. Insn #14
mark r6 as precise. When we get into subprog's frame, we keep r6 in
frame 0's precision set *only*. Subprog itself has its own set of
independent r6-r10 registers and is not affected. When we eventually
made our way out of subprog frame we keep r6 in precision set until we
reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is
perhaps the simplest aspect, it always stays in its original frame.

That's pretty much all we have to do to support precision propagation
across *static subprog* invocation.

Let's look at what happens when we have global subprog invocation.

frame 0				frame 1			precision set
=======				=======			=============

 9: r6 = 456;
10: r1 = 123;						fr0: r6
11: call pc+10; # global subprog			fr0: r6
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6;
15: exit

Starting from insn #13, r0 has to be precise. We backtrack all the way
to insn #11 (call pc+10) and see that subprog is global, so was already
validated in isolation. As opposed to static subprog, global subprog
always returns unknown scalar r0, so that satisfies precision
propagation and we drop r0 from precision set. We are done for insns #13.

Now for insn #14. r6 is in precision set, we backtrack to `call pc+10;`.
Here we need to recognize that this is effectively both exit and entry
to global subprog, which means we stay in caller's frame. So we carry on
with r6 still in precision set, until we satisfy it at insn #9. The only
hard part with global subprogs is just knowing when it's a global func.

Lastly, callback-calling helpers and kfuncs do simulate subprog calls,
so jump history will have subprog instructions in between caller
program's instructions, but the rules of propagating r0 and r1-r5
differ, because we don't actually directly call callback. We actually
call helper/kfunc, which at runtime will call subprog, so the only
difference between normal helper/kfunc handling is that we need to make
sure to skip callback simulatinog part of jump history.
Let's look at an example to make this clearer.

frame 0				frame 1			precision set
=======				=======			=============

 8: r6 = 456;
 9: r1 = 123;						fr0: r6
10: r2 = &callback;					fr0: r6
11: call bpf_loop;					fr0: r6
				22: r0 = r1;		fr0: r6      fr1:
				23: exit		fr0: r6      fr1:
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6;
15: exit

Again, insn #13 forces r0 to be precise. As soon as we get to `23: exit`
we see that this isn't actually a static subprog call (it's `call
bpf_loop;` helper call instead). So we clear r0 from precision set.

For callee-saved register, there is no difference: it stays in frame 0's
precision set, we go through insn #22 and #23, ignoring them until we
get back to caller frame 0, eventually satisfying precision backtrack
logic at insn #8 (`r6 = 456;`).

Assuming callback needed to set r0 as precise at insn #23, we'd
backtrack to insn #22, switching from r0 to r1, and then at the point
when we pop back to frame 0 at insn #11, we'll clear r1-r5 from
precision set, as we don't really do a subprog call directly, so there
is no input argument precision propagation.

That's pretty much it. With these changes, it seems like the only still
unsupported situation for precision backpropagation is the case when
program is accessing stack through registers other than r10. This is
still left as unsupported (though rare) case for now.

As for results. For selftests, few positive changes for bigger programs,
cls_redirect in dynptr variant benefitting the most:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0'
File                                      Program        Insns (A)  Insns (B)  Insns     (DIFF)
----------------------------------------  -------------  ---------  ---------  ----------------
pyperf600_bpf_loop.bpf.linked1.o          on_event            2060       2002      -58 (-2.82%)
test_cls_redirect_dynptr.bpf.linked1.o    cls_redirect       15660       2914  -12746 (-81.39%)
test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect       61620      59088    -2532 (-4.11%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_tc      109980      86278  -23702 (-21.55%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp      97716      85147  -12569 (-12.86%)

Cilium progress don't really regress. They don't use subprogs and are
mostly unaffected, but some other fixes and improvements could have
changed something. This doesn't appear to be the case:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0'
File           Program                         Insns (A)  Insns (B)  Insns (DIFF)
-------------  ------------------------------  ---------  ---------  ------------
bpf_host.o     tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_xdp.o      tail_handle_nat_fwd_ipv6            12475      12504  +29 (+0.23%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv6       6363       6371   +8 (+0.13%)

Looking at (somewhat anonymized) Meta production programs, we see mostly
insignificant variation in number of instructions, with one program
(syar_bind6_protect6) benefitting the most at -17%.

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0'
Program                   Insns (A)  Insns (B)  Insns     (DIFF)
------------------------  ---------  ---------  ----------------
on_request_context_event        597        585      -12 (-2.01%)
read_async_py_stack           43789      43657     -132 (-0.30%)
read_sync_py_stack            35041      37599    +2558 (+7.30%)
rrm_usdt                        946        940       -6 (-0.63%)
sysarmor_inet6_bind           28863      28249     -614 (-2.13%)
sysarmor_inet_bind            28845      28240     -605 (-2.10%)
syar_bind4_protect4          154145     147640    -6505 (-4.22%)
syar_bind6_protect6          165242     137088  -28154 (-17.04%)
syar_task_exit_setgid         21289      19720    -1569 (-7.37%)
syar_task_exit_setuid         21290      19721    -1569 (-7.37%)
do_uprobe                     19967      19413     -554 (-2.77%)
tw_twfw_ingress              215877     204833   -11044 (-5.12%)
tw_twfw_tc_in                215877     204833   -11044 (-5.12%)

But checking duration (wall clock) differences, that is the actual time taken
by verifier to validate programs, we see a sometimes dramatic improvements, all
the way to about 16x improvements:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20
Program                                   Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
----------------------------------------  -----------------  -----------------  --------------------
tw_twfw_ingress                                     4488374             272836    -4215538 (-93.92%)
tw_twfw_tc_in                                       4339111             268175    -4070936 (-93.82%)
tw_twfw_egress                                      3521816             270751    -3251065 (-92.31%)
tw_twfw_tc_eg                                       3472878             284294    -3188584 (-91.81%)
balancer_ingress                                     343119             291391      -51728 (-15.08%)
syar_bind6_protect6                                   78992              64782      -14210 (-17.99%)
ttls_tc_ingress                                       11739               8176       -3563 (-30.35%)
kprobe__security_inode_link                           13864              11341       -2523 (-18.20%)
read_sync_py_stack                                    21927              19442       -2485 (-11.33%)
read_async_py_stack                                   30444              28136        -2308 (-7.58%)
syar_task_exit_setuid                                 10256               8440       -1816 (-17.71%)
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-9-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

fde2a388

bpf: fix mark_all_scalars_precise use in mark_chain_precision · c50c0b57

Andrii Nakryiko authored May 04, 2023

When precision backtracking bails out due to some unsupported sequence
of instructions (e.g., stack access through register other than r10), we
need to mark all SCALAR registers as precise to be safe. Currently,
though, we mark SCALARs precise only starting from the state we detected
unsupported condition, which could be one of the parent states of the
actual current state. This will leave some registers potentially not
marked as precise, even though they should. So make sure we start
marking scalars as precise from current state (env->cur_state).

Further, we don't currently detect a situation when we end up with some
stack slots marked as needing precision, but we ran out of available
states to find the instructions that populate those stack slots. This is
akin the `i >= func->allocated_stack / BPF_REG_SIZE` check and should be
handled similarly by falling back to marking all SCALARs precise. Add
this check when we run out of states.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-8-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

c50c0b57

bpf: fix propagate_precision() logic for inner frames · f655badf

Andrii Nakryiko authored May 04, 2023

Fix propagate_precision() logic to perform propagation of all necessary
registers and stack slots across all active frames *in one batch step*.

Doing this for each register/slot in each individual frame is wasteful,
but the main problem is that backtracking of instruction in any frame
except the deepest one just doesn't work. This is due to backtracking
logic relying on jump history, and available jump history always starts
(or ends, depending how you view it) in current frame. So, if
prog A (frame #0) called subprog B (frame #1) and we need to propagate
precision of, say, register R6 (callee-saved) within frame #0, we
actually don't even know where jump history that corresponds to prog
A even starts. We'd need to skip subprog part of jump history first to
be able to do this.

Luckily, with struct backtrack_state and __mark_chain_precision()
handling bitmasks tracking/propagation across all active frames at the
same time (added in previous patch), propagate_precision() can be both
fixed and sped up by setting all the necessary bits across all frames
and then performing one __mark_chain_precision() pass. This makes it
unnecessary to skip subprog parts of jump history.

We also improve logging along the way, to clearly specify which
registers' and slots' precision markings are propagated within which
frame. Each frame will have dedicated line and all registers and stack
slots from that frame will be reported in format similar to precision
backtrack regs/stack logging. E.g.:

frame 1: propagating r1,r2,r3,fp-8,fp-16
frame 0: propagating r3,r9,fp-120

Fixes: 529409ea ("bpf: propagate precision across all frames, not just the last one")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-7-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

f655badf

bpf: maintain bitmasks across all active frames in __mark_chain_precision · 1ef22b68

Andrii Nakryiko authored May 04, 2023

Teach __mark_chain_precision logic to maintain register/stack masks
across all active frames when going from child state to parent state.
Currently this should be mostly no-op, as precision backtracking usually
bails out when encountering subprog entry/exit.

It's not very apparent from the diff due to increased indentation, but
the logic remains the same, except everything is done on specific `fr`
frame index. Calls to bt_clear_reg() and bt_clear_slot() are replaced
with frame-specific bt_clear_frame_reg() and bt_clear_frame_slot(),
where frame index is passed explicitly, instead of using current frame
number.

We also adjust logging to emit affected frame number. And we also add
better logging of human-readable register and stack slot masks, similar
to previous patch.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-6-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

1ef22b68

bpf: improve precision backtrack logging · d9439c21

Andrii Nakryiko authored May 04, 2023

Add helper to format register and stack masks in more human-readable
format. Adjust logging a bit during backtrack propagation and especially
during forcing precision fallback logic to make it clearer what's going
on (with log_level=2, of course), and also start reporting affected
frame depth. This is in preparation for having more than one active
frame later when precision propagation between subprog calls is added.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-5-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d9439c21

bpf: encapsulate precision backtracking bookkeeping · 407958a0

Andrii Nakryiko authored May 04, 2023

Add struct backtrack_state and straightforward API around it to keep
track of register and stack masks used and maintained during precision
backtracking process. Having this logic separately allow to keep
high-level backtracking algorithm cleaner, but also it sets us up to
cleanly keep track of register and stack masks per frame, allowing (with
some further logic adjustments) to perform precision backpropagation
across multiple frames (i.e., subprog calls).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-4-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

407958a0

bpf: mark relevant stack slots scratched for register read instructions · e0bf4622

Andrii Nakryiko authored May 04, 2023

When handling instructions that read register slots, mark relevant stack
slots as scratched so that verifier log would contain those slots' states, in
addition to currently emitted registers with stack slot offsets.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-3-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

e0bf4622

veristat: add -t flag for adding BPF_F_TEST_STATE_FREQ program flag · 5956f301

Andrii Nakryiko authored May 04, 2023

Sometimes during debugging it's important that BPF program is loaded
with BPF_F_TEST_STATE_FREQ flag set to force verifier to do frequent
state checkpointing. Teach veristat to do this when -t ("test state")
flag is specified.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-2-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

5956f301

libbpf: Fix comment about arc and riscv arch in bpf_tracing.h · 7866fc6a

Kenjiro Nakayama authored May 04, 2023

To make comments about arc and riscv arch in bpf_tracing.h accurate,
this patch fixes the comment about arc and adds the comment for riscv.
Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230504035443.427927-1-nakayamakenjiro@gmail.com

7866fc6a

02 May, 2023 2 commits

bpf: Print a warning only if writing to unprivileged_bpf_disabled. · fedf9920

Kui-Feng Lee authored May 02, 2023

Only print the warning message if you are writing to
"/proc/sys/kernel/unprivileged_bpf_disabled".

The kernel may print an annoying warning when you read
"/proc/sys/kernel/unprivileged_bpf_disabled" saying

  WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible
  via Spectre v2 BHB attacks!

However, this message is only meaningful when the feature is
disabled or enabled.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230502181418.308479-1-kuifeng@meta.com

fedf9920

bpf: Emit struct bpf_tcp_sock type in vmlinux BTF · bf6882ae

Yonghong Song authored May 02, 2023

In one of our internal testing, we found a case where
  - uapi struct bpf_tcp_sock is in vmlinux.h where vmlinux.h is not
    generated from the testing kernel
  - struct bpf_tcp_sock is not in vmlinux BTF

The above combination caused bpf load failure as the following
memory access
  struct bpf_tcp_sock *tcp_sock = ...;
  ... tcp_sock->snd_cwnd ...
needs CORE relocation but the relocation cannot be resolved since
the kernel BTF does not have corresponding type.

Similar to other previous cases (nf_conn___init, tcp6_sock, mctcp_sock, etc.),
add the type to vmlinux BTF with BTF_EMIT_TYPE macro.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230502180543.1832140-1-yhs@fb.com

bf6882ae

01 May, 2023 4 commits

Merge branch 'selftests/bpf: test_progs can read test lists from file' · 577c34b3

Andrii Nakryiko authored May 01, 2023

Stephen Veiss says:

====================

BPF selftests have ALLOWLIST and DENYLIST files, used to control which
tests are run in CI. These files are currently parsed by a shell
script. [1]

This patchset allows those files to be specified directly on the
test_progs command line (eg, as -a @ALLOWLIST).

This also fixes a bug in the existing test filter code causing
unnecessary duplicate top-level test filter entries to be created.

[1] https://github.com/kernel-patches/vmtest/blob/57feb460047b69f891cf4afe3cc860794a2ced17/ci/vmtest/run_selftests.sh#L21-L27
---

v2:
 - error handling style changes per reviewer comments
 - fdopen return value checking in test_parse_test_list_file

v1:
 https://lore.kernel.org/bpf/20230425225401.1075796-1-sveiss@meta.com/
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

577c34b3

selftests/bpf: Test_progs can read test lists from file · 64276f01

Stephen Veiss authored Apr 27, 2023

Improve test selection logic when using -a/-b/-d/-t options.
The list of tests to include or exclude can now be read from a file,
specified as @<filename>.

The file contains one name (or wildcard pattern) per line, and
comments beginning with # are ignored.

These options can be passed multiple times to read more than one file.
Signed-off-by: Stephen Veiss <sveiss@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230427225333.3506052-3-sveiss@meta.com

64276f01

selftests/bpf: Extract insert_test from parse_test_list · 0a5c0de8

Stephen Veiss authored Apr 27, 2023

Split the logic to insert new tests into test filter sets out from
parse_test_list.

Fix the subtest insertion logic to reuse an existing top-level test
filter, which prevents the creation of duplicate top-level test filters
each with a single subtest.
Signed-off-by: Stephen Veiss <sveiss@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230427225333.3506052-2-sveiss@meta.com

0a5c0de8

libbpf: btf_dump_type_data_check_overflow needs to consider BTF_MEMBER_BITFIELD_SIZE · c39028b3

Martin KaFai Lau authored Apr 27, 2023

The btf_dump/struct_data selftest is failing with:

  [...]
  test_btf_dump_struct_data:FAIL:unexpected return value dumping fs_context unexpected unexpected return value dumping fs_context: actual -7 != expected 264
  [...]

The reason is in btf_dump_type_data_check_overflow(). It does not use
BTF_MEMBER_BITFIELD_SIZE from the struct's member (btf_member). Instead,
it is using the enum size which is 4. It had been working till the recent
commit 4e04143c ("fs_context: drop the unused lsm_flags member")
removed an integer member which also removed the 4 bytes padding at the
end of the fs_context. Missing this 4 bytes padding exposed this bug. In
particular, when btf_dump_type_data_check_overflow() reaches the member
'phase', -E2BIG is returned.

The fix is to pass bit_sz to btf_dump_type_data_check_overflow(). In
btf_dump_type_data_check_overflow(), it does a different size check when
bit_sz is not zero.

The current fs_context:

[3600] ENUM 'fs_context_purpose' encoding=UNSIGNED size=4 vlen=3
	'FS_CONTEXT_FOR_MOUNT' val=0
	'FS_CONTEXT_FOR_SUBMOUNT' val=1
	'FS_CONTEXT_FOR_RECONFIGURE' val=2
[3601] ENUM 'fs_context_phase' encoding=UNSIGNED size=4 vlen=7
	'FS_CONTEXT_CREATE_PARAMS' val=0
	'FS_CONTEXT_CREATING' val=1
	'FS_CONTEXT_AWAITING_MOUNT' val=2
	'FS_CONTEXT_AWAITING_RECONF' val=3
	'FS_CONTEXT_RECONF_PARAMS' val=4
	'FS_CONTEXT_RECONFIGURING' val=5
	'FS_CONTEXT_FAILED' val=6
[3602] STRUCT 'fs_context' size=264 vlen=21
	'ops' type_id=3603 bits_offset=0
	'uapi_mutex' type_id=235 bits_offset=64
	'fs_type' type_id=872 bits_offset=1216
	'fs_private' type_id=21 bits_offset=1280
	'sget_key' type_id=21 bits_offset=1344
	'root' type_id=781 bits_offset=1408
	'user_ns' type_id=251 bits_offset=1472
	'net_ns' type_id=984 bits_offset=1536
	'cred' type_id=1785 bits_offset=1600
	'log' type_id=3621 bits_offset=1664
	'source' type_id=42 bits_offset=1792
	'security' type_id=21 bits_offset=1856
	's_fs_info' type_id=21 bits_offset=1920
	'sb_flags' type_id=20 bits_offset=1984
	'sb_flags_mask' type_id=20 bits_offset=2016
	's_iflags' type_id=20 bits_offset=2048
	'purpose' type_id=3600 bits_offset=2080 bitfield_size=8
	'phase' type_id=3601 bits_offset=2088 bitfield_size=8
	'need_free' type_id=67 bits_offset=2096 bitfield_size=1
	'global' type_id=67 bits_offset=2097 bitfield_size=1
	'oldapi' type_id=67 bits_offset=2098 bitfield_size=1

Fixes: 920d16af ("libbpf: BTF dumper support for typed data")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230428013638.1581263-1-martin.lau@linux.dev

c39028b3

28 Apr, 2023 1 commit

selftests/bpf: Add fexit_sleep to DENYLIST.aarch64 · 31f4f810

Martin KaFai Lau authored Apr 27, 2023

It is reported that the fexit_sleep never returns in aarch64.
The remaining tests cannot start. Put this test into DENYLIST.aarch64
for now so that other tests can continue to run in the CI.
Acked-by: Manu Bretelle <chantr4@gmail.com>
Reported-by: Manu Bretelle <chantra@meta.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

31f4f810

27 Apr, 2023 8 commits

selftests/bpf: Fix selftest test_global_funcs/global_func1 failure with latest clang · f1f5553d

Yonghong Song authored Apr 25, 2023

The selftest test_global_funcs/global_func1 failed with the latest clang17.
The reason is due to upstream ArgumentPromotionPass ([1]),
which may manipulate static function parameters and cause inlining
although the funciton is marked as noinline.

The original code:
  static __attribute__ ((noinline))
  int f0(int var, struct __sk_buff *skb)
  {
        return skb->len;
  }

  __attribute__ ((noinline))
  int f1(struct __sk_buff *skb)
  {
	...
        return f0(0, skb) + skb->len;
  }

  ...

  SEC("tc")
  __failure __msg("combined stack size of 4 calls is 544")
  int global_func1(struct __sk_buff *skb)
  {
        return f0(1, skb) + f1(skb) + f2(2, skb) + f3(3, skb, 4);
  }

After ArgumentPromotionPass, the code is translated to
  static __attribute__ ((noinline))
  int f0(int var, int skb_len)
  {
        return skb_len;
  }

  __attribute__ ((noinline))
  int f1(struct __sk_buff *skb)
  {
	...
        return f0(0, skb->len) + skb->len;
  }

  ...

  SEC("tc")
  __failure __msg("combined stack size of 4 calls is 544")
  int global_func1(struct __sk_buff *skb)
  {
        return f0(1, skb->len) + f1(skb) + f2(2, skb) + f3(3, skb, 4);
  }

And later llvm InstCombine phase recognized that f0()
simplify returns the value of the second argument and removed f0()
completely and the final code looks like:
  __attribute__ ((noinline))
  int f1(struct __sk_buff *skb)
  {
	...
        return skb->len + skb->len;
  }

  ...

  SEC("tc")
  __failure __msg("combined stack size of 4 calls is 544")
  int global_func1(struct __sk_buff *skb)
  {
        return skb->len + f1(skb) + f2(2, skb) + f3(3, skb, 4);
  }

If f0() is not inlined, the verification will fail with stack size
544 for a particular callchain. With f0() inlined, the maximum
stack size is 512 which is in the limit.

Let us add a `asm volatile ("")` in f0() to prevent ArgumentPromotionPass
from hoisting the code to its caller, and this fixed the test failure.

  [1] https://reviews.llvm.org/D148269Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230425174744.1758515-1-yhs@fb.com

f1f5553d

bpftool: Dump map id instead of value for map_of_maps types · bf06c939

Xueming Feng authored Apr 27, 2023

When using `bpftool map dump` with map_of_maps, it is usually
more convenient to show the inner map id instead of raw value.

We are changing the plain print behavior to show inner_map_id
instead of hex value, this would help with quick look up of
inner map with `bpftool map dump id <inner_map_id>`.
To avoid disrupting scripted behavior, we will add a new
`inner_map_id` field to json output instead of replacing value.

plain print:
```
$ bpftool map dump id 138

Without Patch:
key:
fc 00 00 00 00 00 00 00  00 00 00 00 00 00 00 05
27 16 06 00
value:
8b 00 00 00
Found 1 element

With Patch:
key:
fc 00 00 00 00 00 00 00  00 00 00 00 00 00 00 05
27 16 06 00
inner_map_id:
139
Found 1 element
```

json print:
```
$ bpftool -p map dump id 567

Without Patch:
[{
        "key": ["0xc0","0x00","0x02","0x05","0x27","0x16","0x06","0x00"
        ],
        "value": ["0x38","0x02","0x00","0x00"
        ]
    }
]

With Patch:
[{
        "key": ["0xc0","0x00","0x02","0x05","0x27","0x16","0x06","0x00"
        ],
        "value": ["0x38","0x02","0x00","0x00"
        ],
        "inner_map_id": 568
    }
]
```
Signed-off-by: Xueming Feng <kuro@kuroa.me>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230427120313.43574-1-kuro@kuroa.meSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

bf06c939

xsk: Use pool->dma_pages to check for DMA · 6ec7be9a

Kal Conley authored Apr 23, 2023

Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
active DMA mapping. pool->dma_pages needs to be read anyway to access
the map so this compiles to more efficient code.
Signed-off-by: Kal Conley <kal.conley@dectris.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20230423180157.93559-1-kal.conley@dectris.com

6ec7be9a

selftests/bpf: Update the aarch64 tests deny list · a4644119

Florent Revest authored Apr 27, 2023

Now that ftrace supports direct call on arm64, BPF tracing programs work
on that architecture. This fixes the vast majority of BPF selftests
except for:

- multi_kprobe programs which require fprobe, not available on arm64 yet
- tracing_struct which requires trampoline support to access struct args

This patch updates the list of BPF selftests which are known to fail so
the BPF CI can validate the tests which pass now.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230427143207.635263-1-revest@chromium.org

a4644119

selftests/bpf: xdp_hw_metadata track more timestamps · bb323478

Jesper Dangaard Brouer authored Apr 18, 2023

To correlate the hardware RX timestamp with something, add tracking of
two software timestamps both clock source CLOCK_TAI (see description in
man clock_gettime(2)).

XDP metadata is extended with xdp_timestamp for capturing when XDP
received the packet. Populated with BPF helper bpf_ktime_get_tai_ns(). I
could not find a BPF helper for getting CLOCK_REALTIME, which would have
been preferred. In userspace when AF_XDP sees the packet another
software timestamp is recorded via clock_gettime() also clock source
CLOCK_TAI.

Example output shortly after loading igc driver:

  poll: 1 (0) skip=1 fail=0 redir=2
  xsk_ring_cons__peek: 1
  0x12557a8: rx_desc[1]->addr=100000000009000 addr=9100 comp_addr=9000
  rx_hash: 0x82A96531 with RSS type:0x1
  rx_timestamp:  1681740540304898909 (sec:1681740540.3049)
  XDP RX-time:   1681740577304958316 (sec:1681740577.3050) delta sec:37.0001 (37000059.407 usec)
  AF_XDP time:   1681740577305051315 (sec:1681740577.3051) delta sec:0.0001 (92.999 usec)
  0x12557a8: complete idx=9 addr=9000

The first observation is that the 37 sec difference between RX HW vs XDP
timestamps, which indicate hardware is likely clock source
CLOCK_REALTIME, because (as of this writing) CLOCK_TAI is initialised
with a 37 sec offset.

The 93 usec (microsec) difference between XDP vs AF_XDP userspace is the
userspace wakeup time. On this hardware it was caused by CPU idle sleep
states, which can be reduced by tuning /dev/cpu_dma_latency.

View current requested/allowed latency bound via:
  hexdump --format '"%d\n"' /dev/cpu_dma_latency

More explanation of the output and how this can be used to identify
clock drift for the HW clock can be seen here[1]:

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/hints/xdp_hints_kfuncs02_driver_igc.orgSigned-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Yoong Siang <yoong.siang.song@intel.com>
Link: https://lore.kernel.org/bpf/168182466298.616355.2544377890818617459.stgit@firesoul

bb323478

igc: Add XDP hints kfuncs for RX timestamp · d6772667

Jesper Dangaard Brouer authored Apr 18, 2023

The NIC hardware RX timestamping mechanism adds an optional tailored
header before the MAC header containing packet reception time. Optional
depending on RX descriptor TSIP status bit (IGC_RXDADV_STAT_TSIP). In
case this bit is set driver does offset adjustments to packet data start
and extracts the timestamp.

The timestamp need to be extracted before invoking the XDP bpf_prog,
because this area just before the packet is also accessible by XDP via
data_meta context pointer (and helper bpf_xdp_adjust_meta). Thus, an XDP
bpf_prog can potentially overwrite this and corrupt data that we want to
extract with the new kfunc for reading the timestamp.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Yoong Siang <yoong.siang.song@intel.com>
Link: https://lore.kernel.org/bpf/168182465791.616355.2583922957423587914.stgit@firesoul

d6772667

igc: Add XDP hints kfuncs for RX hash · 8416814f

Jesper Dangaard Brouer authored Apr 18, 2023

This implements XDP hints kfunc for RX-hash (xmo_rx_hash).
The HW rss hash type is handled via mapping table.

This igc driver (default config) does L3 hashing for UDP packets
(excludes UDP src/dest ports in hash calc).  Meaning RSS hash type is
L3 based.  Tested that the igc_rss_type_num for UDP is either
IGC_RSS_TYPE_HASH_IPV4 or IGC_RSS_TYPE_HASH_IPV6.

This patch also updates AF_XDP zero-copy function igc_clean_rx_irq_zc()
to use the xdp_buff wrapper struct igc_xdp_buff.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Yoong Siang <yoong.siang.song@intel.com>
Link: https://lore.kernel.org/bpf/168182465285.616355.2701740913376314790.stgit@firesoul

8416814f

igc: Add igc_xdp_buff wrapper for xdp_buff in driver · 73b7123d

Jesper Dangaard Brouer authored Apr 18, 2023

Driver specific metadata data for XDP-hints kfuncs are propagated via tail
extending the struct xdp_buff with a locally scoped driver struct.

Zero-Copy AF_XDP/XSK does similar tricks via struct xdp_buff_xsk. This
xdp_buff_xsk struct contains a CB area (24 bytes) that can be used for
extending the locally scoped driver into. The XSK_CHECK_PRIV_TYPE define
catch size violations build time.

The changes needed for AF_XDP zero-copy in igc_clean_rx_irq_zc()
is done in next patch, because the member rx_desc isn't available
at this point.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Yoong Siang <yoong.siang.song@intel.com>
Link: https://lore.kernel.org/bpf/168182464779.616355.3761989884165609387.stgit@firesoul

73b7123d