- 06 Mar, 2024 9 commits
-
-
Eduard Zingerman authored
E.g. allow the following struct_ops definitions: struct bpf_testmod_ops___v1 { int (*test)(void); }; struct bpf_testmod_ops___v2 { int (*test)(void); }; SEC(".struct_ops.link") struct bpf_testmod_ops___v1 a = { .test = ... } SEC(".struct_ops.link") struct bpf_testmod_ops___v2 b = { .test = ... } Where both bpf_testmod_ops__v1 and bpf_testmod_ops__v2 would be resolved as 'struct bpf_testmod_ops' from kernel BTF. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: David Vernet <void@manifault.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240306104529.6453-2-eddyz87@gmail.com
-
Andrii Nakryiko authored
Alexei Starovoitov says: ==================== bpf: Introduce may_goto and cond_break From: Alexei Starovoitov <ast@kernel.org> v5 -> v6: - Rename BPF_JMA to BPF_JCOND - Addressed Andrii's review comments v4 -> v5: - rewrote patch 1 to avoid fake may_goto_reg and use 'u32 may_goto_cnt' instead. This way may_goto handling is similar to bpf_loop() processing. - fixed bug in patch 2 that RANGE_WITHIN should not use rold->type == NOT_INIT as a safe signal. - patch 3 fixed negative offset computation in cond_break macro - using bpf_arena and cond_break recompiled lib/glob.c as bpf prog and it works! It will be added as a selftest to arena series. v3 -> v4: - fix drained issue reported by John. may_goto insn could be implemented with sticky state (once reaches 0 it stays 0), but the verifier shouldn't assume that. It has to explore both branches. Arguably drained iterator state shouldn't be there at all. bpf_iter_css_next() is not sticky. Can be fixed, but auditing all iterators for stickiness. That's an orthogonal discussion. - explained JMA name reasons in patch 1 - fixed test_progs-no_alu32 issue and added another test v2 -> v3: Major change - drop bpf_can_loop() kfunc and introduce may_goto instruction instead kfunc is a function call while may_goto doesn't consume any registers and LLVM can produce much better code due to less register pressure. - instead of counting from zero to BPF_MAX_LOOPS start from it instead and break out of the loop when count reaches zero - use may_goto instruction in cond_break macro - recognize that 'exact' state comparison doesn't need to be truly exact. regsafe() should ignore precision and liveness marks, but range_within logic is safe to use while evaluating open coded iterators. ==================== Link: https://lore.kernel.org/r/20240306031929.42666-1-alexei.starovoitov@gmail.comSigned-off-by: Andrii Nakryiko <andrii@kernel.org>
-
Alexei Starovoitov authored
Add tests for may_goto instruction via cond_break macro. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Tested-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240306031929.42666-5-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
Use may_goto instruction to implement cond_break macro. Ideally the macro should be written as: asm volatile goto(".byte 0xe5; .byte 0; .short %l[l_break] ... .long 0; but LLVM doesn't recognize fixup of 2 byte PC relative yet. Hence use asm volatile goto(".byte 0xe5; .byte 0; .long %l[l_break] ... .short 0; that produces correct asm on little endian. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Tested-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240306031929.42666-4-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
When open code iterators, bpf_loop or may_goto are used the following two states are equivalent and safe to prune the search: cur state: fp-8_w=scalar(id=3,smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf)) old state: fp-8_rw=scalar(id=2,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf)) In other words "exact" state match should ignore liveness and precision marks, since open coded iterator logic didn't complete their propagation, reg_old->type == NOT_INIT && reg_cur->type != NOT_INIT is also not safe to prune while looping, but range_within logic that applies to scalars, ptr_to_mem, map_value, pkt_ptr is safe to rely on. Avoid doing such comparison when regular infinite loop detection logic is used, otherwise bounded loop logic will declare such "infinite loop" as false positive. Such example is in progs/verifier_loops1.c not_an_inifinite_loop(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Tested-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240306031929.42666-3-alexei.starovoitov@gmail.com
-
Andrii Nakryiko authored
Alexei Starovoitov says: ==================== mm: Enforce ioremap address space and introduce sparse vm_area From: Alexei Starovoitov <ast@kernel.org> v3 -> v4 - dropped VM_XEN patch for now. It will be in the follow up. - fixed constant as pointed out by Mike v2 -> v3 - added Christoph's reviewed-by to patch 1 - cap commit log lines to 75 chars - factored out common checks in patch 3 into helper - made vm_area_unmap_pages() return void There are various users of kernel virtual address space: vmalloc, vmap, ioremap, xen. - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag and these areas are treated differently by KASAN. - the areas created by vmap() function should be tagged with VM_MAP (as majority of the users do). - ioremap areas are tagged with VM_IOREMAP and vm area start is aligned to size of the area unlike vmalloc/vmap. - there is also xen usage that is marked as VM_IOREMAP, but it doesn't call ioremap_page_range() unlike all other VM_IOREMAP users. To clean this up a bit, enforce that ioremap_page_range() checks the range and VM_IOREMAP flag. In addition BPF would like to reserve regions of kernel virtual address space and populate it lazily, similar to xen use cases. For that reason, introduce VM_SPARSE flag and vm_area_[un]map_pages() helpers to populate this sparse area. In the end the /proc/vmallocinfo will show "vmalloc" "vmap" "ioremap" "sparse" categories for different kinds of address regions. ioremap, sparse will return zero when dumped through /proc/kcore ==================== Link: https://lore.kernel.org/r/20240305030516.41519-1-alexei.starovoitov@gmail.comSigned-off-by: Andrii Nakryiko <andrii@kernel.org>
-
Alexei Starovoitov authored
Introduce may_goto instruction that from the verifier pov is similar to open coded iterators bpf_for()/bpf_repeat() and bpf_loop() helper, but it doesn't iterate any objects. In assembly 'may_goto' is a nop most of the time until bpf runtime has to terminate the program for whatever reason. In the current implementation may_goto has a hidden counter, but other mechanisms can be used. For programs written in C the later patch introduces 'cond_break' macro that combines 'may_goto' with 'break' statement and has similar semantics: cond_break is a nop until bpf runtime has to break out of this loop. It can be used in any normal "for" or "while" loop, like for (i = zero; i < cnt; cond_break, i++) { The verifier recognizes that may_goto is used in the program, reserves additional 8 bytes of stack, initializes them in subprog prologue, and replaces may_goto instruction with: aux_reg = *(u64 *)(fp - 40) if aux_reg == 0 goto pc+off aux_reg -= 1 *(u64 *)(fp - 40) = aux_reg may_goto instruction can be used by LLVM to implement __builtin_memcpy, __builtin_strcmp. may_goto is not a full substitute for bpf_for() macro. bpf_for() doesn't have induction variable that verifiers sees, so 'i' in bpf_for(i, 0, 100) is seen as imprecise and bounded. But when the code is written as: for (i = 0; i < 100; cond_break, i++) the verifier see 'i' as precise constant zero, hence cond_break (aka may_goto) doesn't help to converge the loop. A static or global variable can be used as a workaround: static int zero = 0; for (i = zero; i < 100; cond_break, i++) // works! may_goto works well with arena pointers that don't need to be bounds checked on access. Load/store from arena returns imprecise unbounded scalar and loops with may_goto pass the verifier. Reserve new opcode BPF_JMP | BPF_JCOND for may_goto insn. JCOND stands for conditional pseudo jump. Since goto_or_nop insn was proposed, it may use the same opcode. may_goto vs goto_or_nop can be distinguished by src_reg: code = BPF_JMP | BPF_JCOND src_reg = 0 - may_goto src_reg = 1 - goto_or_nop Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Tested-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240306031929.42666-2-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
vmap/vmalloc APIs are used to map a set of pages into contiguous kernel virtual space. get_vm_area() with appropriate flag is used to request an area of kernel address range. It's used for vmalloc, vmap, ioremap, xen use cases. - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag. - the areas created by vmap() function should be tagged with VM_MAP. - ioremap areas are tagged with VM_IOREMAP. BPF would like to extend the vmap API to implement a lazily-populated sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag and vm_area_map_pages(area, start_addr, count, pages) API to map a set of pages within a given area. It has the same sanity checks as vmap() does. It also checks that get_vm_area() was created with VM_SPARSE flag which identifies such areas in /proc/vmallocinfo and returns zero pages on read through /proc/kcore. The next commits will introduce bpf_arena which is a sparsely populated shared memory region between bpf program and user space process. It will map privately-managed pages into a sparse vm area with the following steps: // request virtual memory region during bpf prog verification area = get_vm_area(area_size, VM_SPARSE); // on demand vm_area_map_pages(area, kaddr, kend, pages); vm_area_unmap_pages(area, kaddr, kend); // after bpf program is detached and unloaded free_vm_area(area); Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/bpf/20240305030516.41519-3-alexei.starovoitov@gmail.com
-
Alexei Starovoitov authored
There are various users of get_vm_area() + ioremap_page_range() APIs. Enforce that get_vm_area() was requested as VM_IOREMAP type and range passed to ioremap_page_range() matches created vm_area to avoid accidentally ioremap-ing into wrong address range. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20240305030516.41519-2-alexei.starovoitov@gmail.com
-
- 04 Mar, 2024 8 commits
-
-
Martin KaFai Lau authored
Kui-Feng Lee says: ==================== The BPF struct_ops previously only allowed for one page to be used for the trampolines of all links in a map. However, we have recently run out of space due to the large number of BPF program links. By allocating additional pages when we exhaust an existing page, we can accommodate more links in a single map. The variable st_map->image has been changed to st_map->image_pages, and its type has been changed to an array of pointers to buffers of PAGE_SIZE. Additional pages are allocated when all existing pages are exhausted. The test case loads a struct_ops maps having 40 programs. Their trampolines takes about 6.6k+ bytes over 1.5 pages on x86. --- Major differences from v3: - Refactor buffer allocations to bpf_struct_ops_tramp_buf_alloc() and bpf_struct_ops_tramp_buf_free(). Major differences from v2: - Move image buffer allocation to bpf_struct_ops_prepare_trampoline(). Major differences from v1: - Always free pages if failing to update. - Allocate 8 pages at most. v3: https://lore.kernel.org/all/20240224030302.1500343-1-thinker.li@gmail.com/ v2: https://lore.kernel.org/all/20240221225911.757861-1-thinker.li@gmail.com/ v1: https://lore.kernel.org/all/20240216182828.201727-1-thinker.li@gmail.com/ ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Kui-Feng Lee authored
Create and load a struct_ops map with a large number of struct_ops programs to generate trampolines taking a size over multiple pages. The map includes 40 programs. Their trampolines takes 6.6k+, more than 1.5 pages, on x86. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240224223418.526631-4-thinker.li@gmail.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Kui-Feng Lee authored
The BPF struct_ops previously only allowed one page of trampolines. Each function pointer of a struct_ops is implemented by a struct_ops bpf program. Each struct_ops bpf program requires a trampoline. The following selftest patch shows each page can hold a little more than 20 trampolines. While one page is more than enough for the tcp-cc usecase, the sched_ext use case shows that one page is not always enough and hits the one page limit. This patch overcomes the one page limit by allocating another page when needed and it is limited to a total of MAX_IMAGE_PAGES (8) pages which is more than enough for reasonable usages. The variable st_map->image has been changed to st_map->image_pages, and its type has been changed to an array of pointers to pages. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240224223418.526631-3-thinker.li@gmail.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Kui-Feng Lee authored
Perform all validations when updating values of struct_ops maps. Doing validation in st_ops->reg() and st_ops->update() is not necessary anymore. However, tcp_register_congestion_control() has been called in various places. It still needs to do validations. Cc: netdev@vger.kernel.org Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240224223418.526631-2-thinker.li@gmail.comSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Song Yoong Siang authored
In current ping-pong design, xdp_hw_metadata will wait until the packet transmission completely done, then only start to receive the next packet. The current sleep interval is 10ms, which is unnecessary large. Typically, a NIC does not need such a long time to transmit a packet. Furthermore, during this 10ms sleep time, the app is unable to receive incoming packets. Therefore, this commit reduce sleep interval to 10us, so that xdp_hw_metadata is able to support periodic packets with shorter interval. 10us * 500 = 5ms should be enough for packet transmission and status retrieval. Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20240303083225.1184165-2-yoong.siang.song@intel.com
-
Andrii Nakryiko authored
Settle on three "flavors" of uprobe/uretprobe, installed on different kinds of instruction: nop, push, and ret. All three are testing different internal code paths emulating or single-stepping instructions, so are interesting to compare and benchmark separately. To ensure `push rbp` instruction we ensure that uprobe_target_push() is not a leaf function by calling (global __weak) noop function and returning something afterwards (if we don't do that, compiler will just do a tail call optimization). Also, we need to make sure that compiler isn't skipping frame pointer generation, so let's add `-fno-omit-frame-pointers` to Makefile. Just to give an idea of where we currently stand in terms of relative performance of different uprobe/uretprobe cases vs a cheap syscall (getpgid()) baseline, here are results from my local machine: $ benchs/run_bench_uprobes.sh base : 1.561 ± 0.020M/s uprobe-nop : 0.947 ± 0.007M/s uprobe-push : 0.951 ± 0.004M/s uprobe-ret : 0.443 ± 0.007M/s uretprobe-nop : 0.471 ± 0.013M/s uretprobe-push : 0.483 ± 0.004M/s uretprobe-ret : 0.306 ± 0.007M/s Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240301214551.1686095-1-andrii@kernel.org
-
Chen Shen authored
In the function btf__load_vmlinux_btf, the debug message incorrectly refers to 'path' instead of 'sysfs_btf_path'. Signed-off-by: Chen Shen <peterchenshen@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240302062218.3587-1-peterchenshen@gmail.com
-
Dave Thaler authored
There could be other legacy conformance groups in the future, so use a more descriptive name. The status of the conformance group in the IANA registry is what designates it as legacy, not the name of the group. Signed-off-by: Dave Thaler <dthaler1968@gmail.com> Link: https://lore.kernel.org/r/20240302012229.16452-1-dthaler1968@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 03 Mar, 2024 2 commits
-
-
Dave Thaler authored
In preparation for publication as an IETF RFC, the WG chairs asked me to convert the document to use IETF packet format for field layout, so this patch attempts to make it consistent with other IETF documents. Some fields that are not byte aligned were previously inconsistent in how values were defined. Some were defined as the value of the byte containing the field (like 0x20 for a field holding the high four bits of the byte), and others were defined as the value of the field itself (like 0x2). This PR makes them be consistent in using just the values of the field itself, which is IETF convention. As a result, some of the defines that used BPF_* would no longer match the value in the spec, and so this patch also drops the BPF_* prefix to avoid confusion with the defines that are the full-byte equivalent values. For consistency, BPF_* is then dropped from other fields too. BPF_<foo> is thus the Linux implementation-specific define for <foo> as it appears in the BPF ISA specification. The syntax BPF_ADD | BPF_X | BPF_ALU only worked for full-byte values so the convention {ADD, X, ALU} is proposed for referring to field values instead. Also replace the redundant "LSB bits" with "least significant bits". A preview of what the resulting Internet Draft would look like can be seen at: https://htmlpreview.github.io/?https://raw.githubusercontent.com/dthaler/ebp f-docs-1/format/draft-ietf-bpf-isa.html v1->v2: Fix sphinx issue as recommended by David Vernet Signed-off-by: Dave Thaler <dthaler1968@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20240301222337.15931-1-dthaler1968@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski authored
Daniel Borkmann says: ==================== pull-request: bpf-next 2024-02-29 We've added 119 non-merge commits during the last 32 day(s) which contain a total of 150 files changed, 3589 insertions(+), 995 deletions(-). The main changes are: 1) Extend the BPF verifier to enable static subprog calls in spin lock critical sections, from Kumar Kartikeya Dwivedi. 2) Fix confusing and incorrect inference of PTR_TO_CTX argument type in BPF global subprogs, from Andrii Nakryiko. 3) Larger batch of riscv BPF JIT improvements and enabling inlining of the bpf_kptr_xchg() for RV64, from Pu Lehui. 4) Allow skeleton users to change the values of the fields in struct_ops maps at runtime, from Kui-Feng Lee. 5) Extend the verifier's capabilities of tracking scalars when they are spilled to stack, especially when the spill or fill is narrowing, from Maxim Mikityanskiy & Eduard Zingerman. 6) Various BPF selftest improvements to fix errors under gcc BPF backend, from Jose E. Marchesi. 7) Avoid module loading failure when the module trying to register a struct_ops has its BTF section stripped, from Geliang Tang. 8) Annotate all kfuncs in .BTF_ids section which eventually allows for automatic kfunc prototype generation from bpftool, from Daniel Xu. 9) Several updates to the instruction-set.rst IETF standardization document, from Dave Thaler. 10) Shrink the size of struct bpf_map resp. bpf_array, from Alexei Starovoitov. 11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer, from Benjamin Tissoires. 12) Fix bpftool to be more portable to musl libc by using POSIX's basename(), from Arnaldo Carvalho de Melo. 13) Add libbpf support to gcc in CORE macro definitions, from Cupertino Miranda. 14) Remove a duplicate type check in perf_event_bpf_event, from Florian Lehner. 15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them with notrace correctly, from Yonghong Song. 16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible array to fix build warnings, from Kees Cook. 17) Fix resolve_btfids cross-compilation to non host-native endianness, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits) selftests/bpf: Test if shadow types work correctly. bpftool: Add an example for struct_ops map and shadow type. bpftool: Generated shadow variables for struct_ops maps. libbpf: Convert st_ops->data to shadow type. libbpf: Set btf_value_type_id of struct bpf_map for struct_ops. bpf: Replace bpf_lpm_trie_key 0-length array with flexible array bpf, arm64: use bpf_prog_pack for memory management arm64: patching: implement text_poke API bpf, arm64: support exceptions arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT bpf: add is_async_callback_calling_insn() helper bpf: introduce in_sleepable() helper bpf: allow more maps in sleepable bpf programs selftests/bpf: Test case for lacking CFI stub functions. bpf: Check cfi_stubs before registering a struct_ops type. bpf: Clarify batch lookup/lookup_and_delete semantics bpf, docs: specify which BPF_ABS and BPF_IND fields were zero bpf, docs: Fix typos in instruction-set.rst selftests/bpf: update tcp_custom_syncookie to use scalar packet offset bpf: Shrink size of struct bpf_map/bpf_array. ... ==================== Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 01 Mar, 2024 21 commits
-
-
David S. Miller authored
Eric Dumazet says: ==================== inet: no longer use RTNL to protect inet_dump_ifaddr() This series convert inet so that a dump of addresses (ip -4 addr) no longer requires RTNL. ==================== Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
1) inet_dump_ifaddr() can can run under RCU protection instead of RTNL. 2) properly return 0 at the end of a dump, avoiding an an extra recvmsg() system call. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
In the following patch, inet_base_seq() will no longer be called with RTNL held. Add READ_ONCE()/WRITE_ONCE() annotations in dev_base_seq_inc() and inet_base_seq(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
ifa->ifa_flags can be read locklessly. Add appropriate READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
ifa->ifa_preferred_lft can be read locklessly. Add appropriate READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
ifa->ifa_valid_lft can be read locklessly. Add appropriate READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
ifa->ifa_tstamp can be read locklessly. Add appropriate READ_ONCE()/WRITE_ONCE() annotations. Do the same for ifa->ifa_cstamp to prepare upcoming changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
David Wei says: ==================== netdevsim: link and forward skbs between ports This patchset adds the ability to link two netdevsim ports together and forward skbs between them, similar to veth. The goal is to use netdevsim for testing features e.g. zero copy Rx using io_uring. This feature was tested locally on QEMU, and a selftest is included. I ran netdev selftests CI style and all tests but the following passed: - gro.sh - l2tp.sh - ip_local_port_range.sh gro.sh fails because virtme-ng mounts as read-only and it tries to write to log.txt. This issue was reported to virtme-ng upstream. l2tp.sh and ip_local_port_range.sh both fail for me on net-next/main as well. --- v13->v14: - implement ndo_get_iflink() - fix returning 0 if peer is already linked during linking or not linked during unlinking - bump dropped counter if nsim_ipsec_tx() fails and generally reorder nsim_start_xmit() - fix overflowing lines and indentations v12->v13: - wait for socat listening port to be ready before sending data in selftest v11->v12: - fix leaked netns refs - fix rtnetlink.sh kci_test_ipsec_offload() selftest v10->v11: - add udevadm settle after creating netdevsims in selftest v9->v10: - fix not freeing skb when not there is no peer - prevent possible id clashes in selftest - cleanup selftest on error paths v8->v9: - switch to getting netns using fd rather than id - prevent linking a netdevsim to itself - update tests v7->v8: - fix not dereferencing RCU ptr using rcu_dereference() - remove unused variables in selftest v6->v7: - change link syntax to netnsid:ifidx - replace dev_get_by_index() with __dev_get_by_index() - check for NULL peer when linking - add a sysfs attribute for unlinking - only update Tx stats if not dropped - update selftest v5->v6: - reworked to link two netdevsims using sysfs attribute on the bus device instead of debugfs due to deadlock possibility if a netdevsim is removed during linking - removed unnecessary patch maintaining a list of probed nsim_devs - updated selftest v4->v5: - reduce nsim_dev_list_lock critical section - fixed missing mutex unlock during unwind ladder - rework nsim_dev_peer_write synchronization to take devlink lock as well as rtnl_lock - return err msgs to user during linking if port doesn't exist or linking to self - update tx stats outside of RCU lock v3->v4: - maintain a mutex protected list of probed nsim_devs instead of using nsim_bus_dev - fixed synchronization issues by taking rtnl_lock - track tx_dropped skbs v2->v3: - take lock when traversing nsim_bus_dev_list - take device ref when getting a nsim_bus_dev - return 0 if nsim_dev_peer_read cannot find the port - address code formatting - do not hard code values in selftests - add Makefile for selftests v1->v2: - renamed debugfs file from "link" to "peer" - replaced strstep() with sscanf() for consistency - increased char[] buf sz to 22 for copying id + port from user - added err msg w/ expected fmt when linking as a hint to user - prevent linking port to itself - protect peer ptr using RCU ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Wei authored
I cleared IFF_NOARP flag from netdevsim dev->flags in order to support skb forwarding. This breaks the rtnetlink.sh selftest kci_test_ipsec_offload() test because ipsec does not connect to peers it cannot transmit to. Fix the issue by adding a neigh entry manually. ipsec_offload test now successfully pass. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Wei authored
Connect two netdevsim ports in different namespaces together, then send packets between them using socat. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Wei authored
Add an implementation for ndo_get_iflink() in netdevsim that shows the ifindex of the linked peer, if any. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Wei authored
Forward skbs sent from one netdevsim port to its connected netdevsim port using dev_forward_skb, in a spirit similar to veth. Add a tx_dropped variable to struct netdevsim, tracking the number of skbs that could not be forwarded using dev_forward_skb(). The xmit() function accessing the peer ptr is protected by an RCU read critical section. The rcu_read_lock() is functionally redundant as since v5.0 all softirqs are implicitly RCU read critical sections; but it is useful for human readers. If another CPU is concurrently in nsim_destroy(), then it will first set the peer ptr to NULL. This does not affect any existing readers that dereferenced a non-NULL peer. Then, in unregister_netdevice(), there is a synchronize_rcu() before the netdev is actually unregistered and freed. This ensures that any readers i.e. xmit() that got a non-NULL peer will complete before the netdev is freed. Any readers after the RCU_INIT_POINTER() but before synchronize_rcu() will dereference NULL, making it safe. The codepath to nsim_destroy() and nsim_create() takes both the newly added nsim_dev_list_lock and rtnl_lock. This makes it safe with concurrent calls to linking two netdevsims together. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Wei authored
Add two netdevsim bus attribute to sysfs: /sys/bus/netdevsim/link_device /sys/bus/netdevsim/unlink_device Writing "A M B N" to link_device will link netdevsim M in netnsid A with netdevsim N in netnsid B. Writing "A M" to unlink_device will unlink netdevsim M in netnsid A from its peer, if any. rtnl_lock is taken to ensure nothing changes during the linking. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Maciek Machnikowski <maciek@machnikowski.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jakub Kicinski says: ==================== selftests: kselftest_harness: support using xfail When running selftests for our subsystem in our CI we'd like all tests to pass. Currently some tests use SKIP for cases they expect to fail, because the kselftest_harness limits the return codes to pass/fail/skip. XFAIL which would be a great match here cannot be used. Remove the no_print handling and use vfork() to run the test in a different process than the setup. This way we don't need to pass "failing step" via the exit code. Further clean up the exit codes so that we can use all KSFT_* values. Rewrite the result printing to make handling XFAIL/XPASS easier. Support tests declaring combinations of fixture + variant they expect to fail. Merge plan is to put it on top of -rc6 and merge into net-next. That way others should be able to pull the patches without any networking changes. v4: - rebase on top of Mickael's vfork() changes v3: https://lore.kernel.org/all/20240220192235.2953484-1-kuba@kernel.org/ - combine multiple series - change to "list of expected failures" rather than SKIP()-like handling v2: https://lore.kernel.org/all/20240216002619.1999225-1-kuba@kernel.org/ - fix alignment follow up RFC: https://lore.kernel.org/all/20240216004122.2004689-1-kuba@kernel.org/ v1: https://lore.kernel.org/all/20240213154416.422739-1-kuba@kernel.org/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
SCTP does not support IP_LOCAL_PORT_RANGE and we know it, so use XFAIL instead of SKIP. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Currently some tests report skip for things they expect to fail e.g. when given combination of parameters is known to be unsupported. This is confusing because in an ideal test environment and fully featured kernel no tests should be skipped. Selftest summary line already includes xfail and xpass counters, e.g.: Totals: pass:725 fail:0 xfail:0 xpass:0 skip:0 error:0 but there's no way to use it from within the harness. Add a new per-fixture+variant combination list of test cases we expect to fail. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Switch to printing KTAP line for PASS / FAIL with ksft_test_result_code(), this gives us the ability to report diagnostic messages. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
According to the spec we should always print a # if we add a diagnostic message. Having the caller pass in the new line as part of diagnostic message makes handling this a bit counter-intuitive, so append the new line in the helper. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Jakub points out that for parsers it's rather useful to always have the test name on the result line. Currently if we SKIP (or soon XFAIL or XPASS), we will print: ok 17 # SKIP SCTP doesn't support IP_BIND_ADDRESS_NO_PORT ^ no test name Always print the test name. KTAP format seems to allow or even call for it, per: https://docs.kernel.org/dev-tools/ktap.htmlSuggested-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/all/87jzn6lnou.fsf@cloudflare.com/Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
For generic test harness code it's more useful to deal with exit codes directly, rather than having to switch on them and call the right ksft_test_result_*() helper. Add such function to kselftest.h. Note that "directive" and "diagnostic" are what ktap docs call those parts of the message. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
We always use skip in combination with exit_code being 0 (KSFT_PASS). This are basic KSFT / KTAP semantics. Store the right KSFT_* code in exit_code directly. This makes it easier to support tests reporting other extended KSFT_* codes like XFAIL / XPASS. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-