- 25 Jan, 2021 6 commits
-
-
Björn Töpel authored
Silence three checkpatch style warnings. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210122154725.22140-4-bjorn.topel@gmail.com
-
Björn Töpel authored
The enums undef and bidi are not used. Remove them. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210122154725.22140-3-bjorn.topel@gmail.com
-
Björn Töpel authored
Instead of passing void * all over the place, let us pass the actual type (ifobject) and remove the void-ptr-to-type-ptr casting. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210122154725.22140-2-bjorn.topel@gmail.com
-
Björn Töpel authored
Add detection for kernel version, and adapt the BPF program based on kernel support. This way, users will get the best possible performance from the BPF program. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210122105351.11751-4-bjorn.topel@gmail.com
-
Björn Töpel authored
Fold xp_assign_dev and __xp_assign_dev. The former directly calls the latter. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210122105351.11751-3-bjorn.topel@gmail.com
-
Björn Töpel authored
The explicit_free parameter of the __xsk_rcv() function was used to mark whether the call was via the generic XDP or the native XDP path. Instead of cluttering the code with if-statements and "true/false" parameters that are hard to understand, simply move the explicit free to __xsk_map_redirect(), which is always called from the native XDP path. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210122105351.11751-2-bjorn.topel@gmail.com
-
- 22 Jan, 2021 3 commits
-
-
Hangbin Liu authored
This patch adds an XDP program on egress to show that we can modify the packet on egress. In this sample we set the packet's source MAC to the egress device's MAC address. The xdp_prog is attached when the -X option is supplied. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20210122025007.2968381-1-liuhangbin@gmail.com
-
Tobias Klauser authored
s/bounts/bounds/ Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210121174324.24127-1-tklauser@distanz.ch
-
Tiezhu Yang authored
The current LLVM and Clang build procedure in samples/bpf/README.rst is out of date. See below that the links are not accessible any more. $ git clone http://llvm.org/git/llvm.git Cloning into 'llvm'... fatal: unable to access 'http://llvm.org/git/llvm.git/': Maximum (20) redirects followed $ git clone --depth 1 http://llvm.org/git/clang.git Cloning into 'clang'... fatal: unable to access 'http://llvm.org/git/clang.git/': Maximum (20) redirects followed The LLVM community has adopted new ways to build the compiler. There are different ways to build LLVM and Clang, the Clang Getting Started page [1] has one way. As Yonghong said, it is better to copy the build procedure in Documentation/bpf/bpf_devel_QA.rst to keep consistent. I verified the procedure and it is proved to be feasible, so we should update README.rst to reflect the reality. At the same time, update the related comment in Makefile. Additionally, as Fangrui said, the dir llvm-project/llvm/build/install is not used, BUILD_SHARED_LIBS=OFF is the default option [2], so also change Documentation/bpf/bpf_devel_QA.rst together. At last, we recommend that developers who want the fastest incremental builds use the Ninja build system [1], you can find it in your system's package manager, usually the package is ninja or ninja-build [3], so add ninja to build dependencies suggested by Nathan. [1] https://clang.llvm.org/get_started.html [2] https://www.llvm.org/docs/CMake.html [3] https://github.com/ninja-build/ninja/wiki/Pre-built-Ninja-packages Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Cc: Fangrui Song <maskray@google.com> Link: https://lore.kernel.org/bpf/1611279584-26047-1-git-send-email-yangtiezhu@loongson.cn
-
- 21 Jan, 2021 4 commits
-
-
Junlin Yang authored
Change 'exeeds' to 'exceeds'. Signed-off-by: Junlin Yang <yangjunlin@yulong.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210121122309.1501-1-angkery@163.com
-
Jiri Olsa authored
For very large ELF objects (with many sections), we could get special value SHN_XINDEX (65535) for elf object's string table index - e_shstrndx. Call elf_getshdrstrndx to get the proper string table index, instead of reading it directly from ELF header. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210121202203.9346-4-jolsa@kernel.org
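The lookup that the commit above delegates to elf_getshdrstrndx() can be sketched by hand. This is only an illustration of the ELF rule, not the libbpf or libelf code: when e_shstrndx holds SHN_XINDEX, the real string table index is stored in the sh_link field of the section header at index 0. The function name resolve_shstrndx is hypothetical.

```c
#include <elf.h>
#include <stddef.h>

/*
 * Hypothetical helper: resolve the section-name string table index.
 * ehdr  - the ELF header of the object
 * shdr0 - the section header at index 0 (may be NULL if unavailable)
 * out   - receives the resolved index
 * Returns 0 on success, -1 if the index cannot be resolved.
 */
static int resolve_shstrndx(const Elf64_Ehdr *ehdr, const Elf64_Shdr *shdr0,
                            size_t *out)
{
    if (!ehdr || !out)
        return -1;

    /* Common case: the index fits in the 16-bit header field. */
    if (ehdr->e_shstrndx != SHN_XINDEX) {
        *out = ehdr->e_shstrndx;
        return 0;
    }

    /* Escape value: the real index lives in section header 0. */
    if (!shdr0)
        return -1;
    *out = shdr0->sh_link;
    return 0;
}
```

Reading e_shstrndx directly only covers the first branch, which is why objects with very many sections need the extra step.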
-
Brendan Jackman authored
Alexei pointed out [1] that this wording is pretty confusing. Here's an attempt to be more explicit and clear. [1] https://lore.kernel.org/bpf/CAADnVQJVvwoZsE1K+6qRxzF7+6CvZNzygnoBW9tZNWJELk5c=Q@mail.gmail.com/ Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210120133946.2107897-3-jackmanb@google.com
-
Brendan Jackman authored
This fixes up the markup to fix a warning, be more consistent with use of monospace, and use the correct .rst syntax for <em> (* instead of _). Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/bpf/20210120133946.2107897-2-jackmanb@google.com
-
- 20 Jan, 2021 27 commits
-
-
Alexei Starovoitov authored
Stanislav Fomichev says: ==================== First patch adds custom getsockopt for TCP_ZEROCOPY_RECEIVE to remove kmalloc and lock_sock overhead from the data path. Second patch removes kzalloc/kfree from getsockopt for the common cases. Third patch switches cgroup_bpf_enabled to be per-attach to add overhead only for the cgroup attach types used on the system. No visible user-side changes. v9: - include linux/tcp.h instead of netinet/tcp.h in sockopt_sk.c - note that v9 depends on the commit 4be34f3d ("bpf: Don't leak memory in bpf getsockopt when optlen == 0") from bpf tree v8: - add bpi.h to tools/include/uapi in the same patch (Martin KaFai Lau) - kmalloc instead of kzalloc when exporting buffer (Martin KaFai Lau) - note that v8 depends on the commit 4be34f3d ("bpf: Don't leak memory in bpf getsockopt when optlen == 0") from bpf tree v7: - add comment about buffer contents for retval != 0 (Martin KaFai Lau) - export tcp.h into tools/include/uapi (Martin KaFai Lau) - note that v7 depends on the commit 4be34f3d ("bpf: Don't leak memory in bpf getsockopt when optlen == 0") from bpf tree v6: - avoid indirect cost for new bpf_bypass_getsockopt (Eric Dumazet) v5: - reorder patches to reduce the churn (Martin KaFai Lau) v4: - update performance numbers - bypass_bpf_getsockopt (Martin KaFai Lau) v3: - remove extra newline, add comment about sizeof tcp_zerocopy_receive (Martin KaFai Lau) - add another patch to remove lock_sock overhead from TCP_ZEROCOPY_RECEIVE; technically, this makes patch #1 obsolete, but I'd still prefer to keep it to help with other socket options v2: - perf numbers for getsockopt kmalloc reduction (Song Liu) - (sk) in BPF_CGROUP_PRE_CONNECT_ENABLED (Song Liu) - 128 -> 64 buffer size, BUILD_BUG_ON (Martin KaFai Lau) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Stanislav Fomichev authored
When we attach any cgroup hook, the rest (even if unused/unattached) start to contribute small overhead. In particular, the one we want to avoid is __cgroup_bpf_run_filter_skb which does two redirections to get to the cgroup and pushes/pulls skb. Let's split cgroup_bpf_enabled to be per-attach to make sure only used attach types trigger. I've dropped some existing high-level cgroup_bpf_enabled in some places because BPF_PROG_CGROUP_XXX_RUN macros usually have another cgroup_bpf_enabled check. I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type] has to be constant and known at compile time. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com
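The per-attach-type gating described above can be pictured with a small userspace sketch. It uses plain counters where the kernel would use its own enablement mechanism, and every name here (the enum values, hook_attach, hook_detach, hook_enabled) is hypothetical rather than taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical attach types; the kernel has many more. */
enum attach_type {
    ATTACH_INET_INGRESS,
    ATTACH_INET_EGRESS,
    ATTACH_GETSOCKOPT,
    MAX_ATTACH_TYPE
};

/* One counter per attach type instead of a single global "enabled" flag. */
static int attach_refcnt[MAX_ATTACH_TYPE];

/* Fast-path check: only the queried attach type matters. */
static bool hook_enabled(enum attach_type t)
{
    return attach_refcnt[t] > 0;
}

static void hook_attach(enum attach_type t)
{
    attach_refcnt[t]++;
}

static void hook_detach(enum attach_type t)
{
    if (attach_refcnt[t] > 0)
        attach_refcnt[t]--;
}
```

With a single global flag, attaching a getsockopt program would also switch on the skb filter path; with one entry per type, the skb path stays off until something is attached to it.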
-
Stanislav Fomichev authored
When we attach a bpf program to cgroup/getsockopt any other getsockopt() syscall starts incurring kzalloc/kfree cost. Let's add a small buffer on the stack and use it for small (majority) {s,g}etsockopt values. The buffer is small enough to fit into the cache line and cover the majority of simple options (most of them are 4 byte ints). It seems natural to do the same for setsockopt, but it's a bit more involved when the BPF program modifies the data (where we have to kmalloc). The assumption is that for the majority of setsockopt calls (which are doing pure BPF options or apply policy) this will bring some benefit as well. Without this patch (we remove about 1% __kmalloc): 3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt | --3.30%--__cgroup_bpf_run_filter_getsockopt | --0.81%--__kmalloc Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-3-sdf@google.com
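The small-buffer idea in the commit above can be sketched in userspace C. This is an illustration only, with malloc/free standing in for kmalloc/kfree; the struct and function names are hypothetical, and the 64-byte size is an assumption borrowed from the "128 -> 64 buffer size" note in the series cover letter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_OPT_BUF 64 /* assumed size, see the note above */

/* Hypothetical context: an inline buffer plus a pointer to the active one. */
struct opt_ctx {
    unsigned char small[SMALL_OPT_BUF];
    unsigned char *buf;
    size_t len;
};

/* Use the inline buffer when the value fits, otherwise fall back to the heap. */
static int opt_ctx_init(struct opt_ctx *ctx, const void *src, size_t len)
{
    if (len <= sizeof(ctx->small)) {
        ctx->buf = ctx->small;
    } else {
        ctx->buf = malloc(len);
        if (!ctx->buf)
            return -1;
    }
    ctx->len = len;
    if (len)
        memcpy(ctx->buf, src, len);
    return 0;
}

static bool opt_ctx_on_heap(const struct opt_ctx *ctx)
{
    return ctx->buf != ctx->small;
}

/* Free only what was actually allocated. */
static void opt_ctx_release(struct opt_ctx *ctx)
{
    if (opt_ctx_on_heap(ctx))
        free(ctx->buf);
    ctx->buf = NULL;
    ctx->len = 0;
}
```

A 4-byte integer option never touches the allocator in this sketch, which is the case the commit message says covers most simple options.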
-
Stanislav Fomichev authored
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE. We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom call in do_tcp_getsockopt using the on-stack data. This removes 3% overhead for locking/unlocking the socket. Without this patch: 3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt | --3.30%--__cgroup_bpf_run_filter_getsockopt | --0.81%--__kmalloc With the patch applied: 0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern Note, exporting uapi/tcp.h requires removing netinet/tcp.h from test_progs.h because those headers have conflicting definitions. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
-
Yonghong Song authored
llvm patch https://reviews.llvm.org/D84002 permitted to emit empty rodata datasec if the elf .rodata section contains read-only data from local variables. These local variables will be not emitted as BTF_KIND_VARs since llvm converted these local variables as static variables with private linkage without debuginfo types. Such an empty rodata datasec will make skeleton code generation easy since for skeleton a rodata struct will be generated if there is a .rodata elf section. The existence of a rodata btf datasec is also consistent with the existence of a rodata map created by libbpf. The btf with such an empty rodata datasec will fail in the kernel though as kernel will reject a datasec with zero vlen and zero size. For example, for the below code, int sys_enter(void *ctx) { int fmt[6] = {1, 2, 3, 4, 5, 6}; int dst[6]; bpf_probe_read(dst, sizeof(dst), fmt); return 0; } We got the below btf (bpftool btf dump ./test.o): [1] PTR '(anon)' type_id=0 [2] FUNC_PROTO '(anon)' ret_type_id=3 vlen=1 'ctx' type_id=1 [3] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED [4] FUNC 'sys_enter' type_id=2 linkage=global [5] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED [6] ARRAY '(anon)' type_id=5 index_type_id=7 nr_elems=4 [7] INT '__ARRAY_SIZE_TYPE__' size=4 bits_offset=0 nr_bits=32 encoding=(none) [8] VAR '_license' type_id=6, linkage=global-alloc [9] DATASEC '.rodata' size=0 vlen=0 [10] DATASEC 'license' size=0 vlen=1 type_id=8 offset=0 size=4 When loading the ./test.o to the kernel with bpftool, we see the following error: libbpf: Error loading BTF: Invalid argument(22) libbpf: magic: 0xeb9f ... [6] ARRAY (anon) type_id=5 index_type_id=7 nr_elems=4 [7] INT __ARRAY_SIZE_TYPE__ size=4 bits_offset=0 nr_bits=32 encoding=(none) [8] VAR _license type_id=6 linkage=1 [9] DATASEC .rodata size=24 vlen=0 vlen == 0 libbpf: Error loading .BTF into kernel: -22. BTF is optional, ignoring. 
Basically, libbpf changed .rodata datasec size to 24 since elf .rodata section size is 24. The kernel then rejected the BTF since vlen = 0. Note that the above kernel verifier failure can be worked around with changing local variable "fmt" to a static or global, optionally const, variable. This patch permits a datasec with vlen = 0 in kernel. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210119153519.3901963-1-yhs@fb.com
-
Alexei Starovoitov authored
Qais Yousef says: ==================== Changes in v3: * Fix not returning error value correctly in trigger_module_test_write() (Yonghong) * Add Yonghong acked-by to patch 1. Changes in v2: * Fix compilation error. (Andrii) * Make the new test use write() instead of read() (Andrii) Add some missing glue logic to teach bpf about bare tracepoints - tracepoints without any trace event associated with them. Bare tracepoints are declared with DECLARE_TRACE(). Full tracepoints are declared with TRACE_EVENT(). BPF can attach to these tracepoints as RAW_TRACEPOINT() only, as there are no events in tracefs created with them. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Qais Yousef authored
Reuse module_attach infrastructure to add a new bare tracepoint to check we can attach to it as a raw tracepoint. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210119122237.2426878-3-qais.yousef@arm.com
-
Alexei Starovoitov authored
Gary Lin says: ==================== This patch series implements jump padding to x64 jit to cover some corner cases that used to consume more than 20 jit passes and caused failure. v4: - Add the detailed comments about the possible padding bytes - Add the second test case which triggers jmp_cond padding and imm32 nop jmp padding. - Add the new test case as another subprog v3: - Copy the instructions of prologue separately or the size calculation of the first BPF instruction would include the prologue. - Replace WARN_ONCE() with pr_err() and EFAULT - Use MAX_PASSES in the for loop condition check - Remove the "padded" flag from x64_jit_data. For the extra pass of subprogs, padding is always enabled since it won't hurt the images that converge without padding. v2: - Simplify the sample code in the commit description and provide the jit code - Check the expected padding bytes with WARN_ONCE - Move the 'padded' flag to 'struct x64_jit_data' - Remove the EXPECTED_FAIL flag from bpf_fill_maxinsns11() in test_bpf - Add 2 verifier tests ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Qais Yousef authored
Some subsystems only have bare tracepoints (a tracepoint with no associated trace event) to avoid the problem of trace events being an ABI that can't be changed. From a BPF perspective, bare tracepoints are what it calls RAW_TRACEPOINT(). Since bpf assumed there's a 1:1 mapping, it relied on hooking into the DEFINE_EVENT() macro to create the bpf mapping of the tracepoints. Since bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had no knowledge about their existence. By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints. Enabling that comes with the contract that changes to raw tracepoints don't constitute a regression if they break existing bpf programs. We need the ability to continue to morph and modify these raw tracepoints without worrying about any ABI. Update Documentation/bpf/bpf_design_QA.rst to document this contract. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210119122237.2426878-2-qais.yousef@arm.com
-
Gary Lin authored
There are 3 tests added into verifier's jit tests to trigger x64 jit jump padding. The first test can be represented as the following assembly code: 1: bpf_call bpf_get_prandom_u32 2: if r0 == 1 goto pc+128 3: if r0 == 2 goto pc+128 ... 129: if r0 == 128 goto pc+128 130: goto pc+128 131: goto pc+127 ... 256: goto pc+2 257: goto pc+1 258: r0 = 1 259: ret We first store a random number to r0 and add the corresponding conditional jumps (2~129) to make verifier believe that those jump instructions from 130 to 257 are reachable. When the program is sent to x64 jit, it starts to optimize out the NOP jumps backwards from 257. Since there are 128 such jumps, the program easily reaches 15 passes and triggers jump padding. Here is the x64 jit code of the first test: 0: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] 5: 66 90 xchg ax,ax 7: 55 push rbp 8: 48 89 e5 mov rbp,rsp b: e8 4c 90 75 e3 call 0xffffffffe375905c 10: 48 83 f8 01 cmp rax,0x1 14: 0f 84 fe 04 00 00 je 0x518 1a: 48 83 f8 02 cmp rax,0x2 1e: 0f 84 f9 04 00 00 je 0x51d ... f6: 48 83 f8 18 cmp rax,0x18 fa: 0f 84 8b 04 00 00 je 0x58b 100: 48 83 f8 19 cmp rax,0x19 104: 0f 84 86 04 00 00 je 0x590 10a: 48 83 f8 1a cmp rax,0x1a 10e: 0f 84 81 04 00 00 je 0x595 ... 500: 0f 84 83 01 00 00 je 0x689 506: 48 81 f8 80 00 00 00 cmp rax,0x80 50d: 0f 84 76 01 00 00 je 0x689 513: e9 71 01 00 00 jmp 0x689 518: e9 6c 01 00 00 jmp 0x689 ... 5fe: e9 86 00 00 00 jmp 0x689 603: e9 81 00 00 00 jmp 0x689 608: 0f 1f 00 nop DWORD PTR [rax] 60b: eb 7c jmp 0x689 60d: eb 7a jmp 0x689 ... 683: eb 04 jmp 0x689 685: eb 02 jmp 0x689 687: 66 90 xchg ax,ax 689: b8 01 00 00 00 mov eax,0x1 68e: c9 leave 68f: c3 ret As expected, a 3 bytes NOPs is inserted at 608 due to the transition from imm32 jmp to imm8 jmp. A 2 bytes NOPs is also inserted at 687 to replace a NOP jump. The second test case is tricky. Here is the assembly code: 1: bpf_call bpf_get_prandom_u32 2: if r0 == 1 goto pc+2048 3: if r0 == 2 goto pc+2048 ... 
2049: if r0 == 2048 goto pc+2048 2050: goto pc+2048 2051: goto pc+16 2052: goto pc+15 ... 2064: goto pc+3 2065: goto pc+2 2066: goto pc+1 ... [repeat "goto pc+16".."goto pc+1" 127 times] ... 4099: r0 = 2 4100: ret There are 4 major parts of the program. 1) 1~2049: Those are instructions to make 2050~4098 reachable. Some of them also could generate the padding for jmp_cond. 2) 2050: This is the target instruction for the imm32 nop jmp padding. 3) 2051~4098: The repeated "goto 1~16" instructions are designed to be consumed by the nop jmp optimization. In the end, those instrucitons become 128 continuous 0 offset jmp and are optimized out in 1 pass, and this make insn 2050 an imm32 nop jmp in the next pass, so that we can trigger the 5 bytes padding. 4) 4099~4100: Those are the instructions to end the program. The x64 jit code is like this: 0: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] 5: 66 90 xchg ax,ax 7: 55 push rbp 8: 48 89 e5 mov rbp,rsp b: e8 bc 7b d5 d3 call 0xffffffffd3d57bcc 10: 48 83 f8 01 cmp rax,0x1 14: 0f 84 7e 66 00 00 je 0x6698 1a: 48 83 f8 02 cmp rax,0x2 1e: 0f 84 74 66 00 00 je 0x6698 24: 48 83 f8 03 cmp rax,0x3 28: 0f 84 6a 66 00 00 je 0x6698 2e: 48 83 f8 04 cmp rax,0x4 32: 0f 84 60 66 00 00 je 0x6698 38: 48 83 f8 05 cmp rax,0x5 3c: 0f 84 56 66 00 00 je 0x6698 42: 48 83 f8 06 cmp rax,0x6 46: 0f 84 4c 66 00 00 je 0x6698 ... 
666c: 48 81 f8 fe 07 00 00 cmp rax,0x7fe 6673: 0f 1f 40 00 nop DWORD PTR [rax+0x0] 6677: 74 1f je 0x6698 6679: 48 81 f8 ff 07 00 00 cmp rax,0x7ff 6680: 0f 1f 40 00 nop DWORD PTR [rax+0x0] 6684: 74 12 je 0x6698 6686: 48 81 f8 00 08 00 00 cmp rax,0x800 668d: 0f 1f 40 00 nop DWORD PTR [rax+0x0] 6691: 74 05 je 0x6698 6693: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] 6698: b8 02 00 00 00 mov eax,0x2 669d: c9 leave 669e: c3 ret Since insn 2051~4098 are optimized out right before the padding pass, there are several conditional jumps from the first part are replaced with imm8 jmp_cond, and this triggers the 4 bytes padding, for example at 6673, 6680, and 668d. On the other hand, Insn 2050 is replaced with the 5 bytes nops at 6693. The third test is to invoke the first and second tests as subprogs to test bpf2bpf. Per the system log, there was one more jit happened with only one pass and the same jit code was produced. v4: - Add the second test case which triggers jmp_cond padding and imm32 nop jmp padding. - Add the new test case as another subprog Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210119102501.511-4-glin@suse.com
-
Gary Lin authored
With NOPs padding, x64 jit now can handle the jump cases like bpf_fill_maxinsns11(). Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210119102501.511-3-glin@suse.com
-
Gary Lin authored
The x64 bpf jit expects bpf images to converge within the given passes, but it could fail to do so with some corner cases. For example: l0: ja 40 l1: ja 40 [... repeated ja 40 ] l39: ja 40 l40: ret #0 This bpf program contains 40 "ja 40" instructions which are effectively NOPs and designed to be replaced with valid code dynamically. Ideally, bpf jit should optimize those "ja 40" instructions out when translating the bpf instructions into x64 machine code. However, do_jit() can only remove one "ja 40" for offset==0 on each pass, so it requires at least 40 runs to eliminate those JMPs and exceeds the current limit of passes (20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON is set even though it's legit as a classic socket filter. To make bpf images more likely to converge within 20 passes, this commit pads some instructions with NOPs in the last 5 passes: 1. conditional jumps A possible size variance comes from the adoption of imm8 JMP. If the offset is imm8, we calculate the size difference of this BPF instruction between the previous and the current pass and fill the gap with NOPs. To avoid the recalculation of jump offset, those NOPs are inserted before the JMP code, so we have to subtract the 2 bytes of imm8 JMP when calculating the NOP number. 2. BPF_JA There are two conditions for BPF_JA. a.) nop jumps If this instruction is not optimized out in the previous pass, instead of removing it, we insert the equivalent size of NOPs. b.) label jumps Similar to conditional jumps, we prepend NOPs right before the JMP code. To make the code concise, emit_nops() is modified to use the signed len and return the number of inserted NOPs. For bpf-to-bpf, we always enable padding for the extra pass since there is only one extra run and the jump padding doesn't affect the images that converge without padding.
After applying this patch, the corner case was loaded with the following jit code: flen=45 proglen=77 pass=17 image=ffffffffc03367d4 from=jump pid=10097 JIT code: 00000000: 0f 1f 44 00 00 55 48 89 e5 53 41 55 31 c0 45 31 JIT code: 00000010: ed 48 89 fb eb 30 eb 2e eb 2c eb 2a eb 28 eb 26 JIT code: 00000020: eb 24 eb 22 eb 20 eb 1e eb 1c eb 1a eb 18 eb 16 JIT code: 00000030: eb 14 eb 12 eb 10 eb 0e eb 0c eb 0a eb 08 eb 06 JIT code: 00000040: eb 04 eb 02 66 90 31 c0 41 5d 5b c9 c3 0: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] 5: 55 push rbp 6: 48 89 e5 mov rbp,rsp 9: 53 push rbx a: 41 55 push r13 c: 31 c0 xor eax,eax e: 45 31 ed xor r13d,r13d 11: 48 89 fb mov rbx,rdi 14: eb 30 jmp 0x46 16: eb 2e jmp 0x46 ... 3e: eb 06 jmp 0x46 40: eb 04 jmp 0x46 42: eb 02 jmp 0x46 44: 66 90 xchg ax,ax 46: 31 c0 xor eax,eax 48: 41 5d pop r13 4a: 5b pop rbx 4b: c9 leave 4c: c3 ret At the 16th pass, 15 jumps were already optimized out, and one jump was replaced with NOPs at 44 and the image converged at the 17th pass. v4: - Add the detailed comments about the possible padding bytes v3: - Copy the instructions of prologue separately or the size calculation of the first BPF instruction would include the prologue. - Replace WARN_ONCE() with pr_err() and EFAULT - Use MAX_PASSES in the for loop condition check - Remove the "padded" flag from x64_jit_data. For the extra pass of subprogs, padding is always enabled since it won't hurt the images that converge without padding. v2: - Simplify the sample code in the description and provide the jit code - Check the expected padding bytes with WARN_ONCE - Move the 'padded' flag to 'struct x64_jit_data' Signed-off-by: Gary Lin <glin@suse.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210119102501.511-2-glin@suse.com
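The convergence problem described in this commit can be reproduced with a toy model. The sketch below is not do_jit(): it assumes every jump is a 2-byte imm8 jump, that all n jumps target the instruction right after the last one, and that each pass computes offsets from the previous pass's addresses. The function name passes_to_converge and the constants are hypothetical.

```c
#define MAX_INSNS 128
#define JMP_SIZE  2   /* assumed size of an imm8 jmp in this toy model */

/*
 * Toy model of the pass loop: returns the number of passes until the
 * total length stops changing, or -1 for an unsupported n.
 */
static int passes_to_converge(int n)
{
    int addrs[MAX_INSNS]; /* addrs[i] = end offset of insn i, previous pass */
    int i, pass, proglen, oldproglen;

    if (n <= 0 || n > MAX_INSNS)
        return -1;

    /* Initial estimate: every jump is kept. */
    for (i = 0; i < n; i++)
        addrs[i] = (i + 1) * JMP_SIZE;
    oldproglen = n * JMP_SIZE;

    for (pass = 1; ; pass++) {
        proglen = 0;
        for (i = 0; i < n; i++) {
            /* Distance from the end of insn i to the jump target. */
            int jmp_offset = addrs[n - 1] - addrs[i];

            /* A zero-offset jump is a NOP and is dropped. */
            if (jmp_offset != 0)
                proglen += JMP_SIZE;
            addrs[i] = proglen;
        }
        if (proglen == oldproglen)
            return pass;
        oldproglen = proglen;
    }
}
```

In this model only the last surviving jump sees a zero offset in any given pass, so 40 jumps take 41 passes (40 removals plus one pass that confirms the length is stable), well past a limit of 20. Replacing a late removal with same-sized NOPs, as the commit does in the final passes, keeps the addresses fixed so the length stops shrinking.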
-
Lukas Bulwahn authored
Commit 91c960b0 ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm") modified the BPF documentation, but missed some ReST markup. Hence, make htmldocs warns on Documentation/networking/filter.rst:1053: WARNING: Inline emphasis start-string without end-string. Add some minimal markup to address this warning. Fixes: 91c960b0 ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Brendan Jackman <jackmanb@google.com> Link: https://lore.kernel.org/bpf/20210118080004.6367-1-lukas.bulwahn@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Björn Töpel authored
Brendan Jackman added extended atomic operations to the BPF instruction set in commit 7064a734 ("Merge branch 'Atomics for eBPF'"), which introduces the BPF_ATOMIC_OP macro. However, that macro was missing for the BPF samples. Fix that by adding it into bpf_insn.h. Fixes: 91c960b0 ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm") Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://lore.kernel.org/bpf/20210118091753.107572-1-bjorn.topel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Lorenzo Bianconi authored
Introduce xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Unlike __xdp_build_skb_from_frame, xdp_build_skb_from_frame allocates the skb object. Rely on xdp_build_skb_from_frame in veth driver. Add the missing XDP metadata support in veth_xdp_rcv_one(). Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Lorenzo Bianconi authored
Introduce __xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in cpumap code. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Carlos Neira authored
Currently tests for bpf_get_ns_current_pid_tgid() are outside test_progs. This change folds test cases into test_progs. Changes from v11: - Fixed test failure not being detected. - Removed EXIT(3) call as it will stop test_progs execution. Signed-off-by: Carlos Neira <cneirabustos@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210114141033.GA17348@localhost Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski authored
Conflicts: drivers/net/can/dev.c commit 03f16c50 ("can: dev: can_restart: fix use after free bug") commit 3e77f70e ("can: dev: move driver related infrastructure into separate subdir") Code move. drivers/net/dsa/b53/b53_common.c commit 8e4052c3 ("net: dsa: b53: fix an off by one in checking "vlan->vid"") commit b7a9e0da ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects") Field rename. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds authored
Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.11-rc5, including fixes from bpf, wireless, and can trees. Current release - regressions: - nfc: nci: fix the wrong NCI_CORE_INIT parameters Current release - new code bugs: - bpf: allow empty module BTFs Previous releases - regressions: - bpf: fix signed_{sub,add32}_overflows type handling - tcp: do not mess with cloned skbs in tcp_add_backlog() - bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach - bpf: don't leak memory in bpf getsockopt when optlen == 0 - tcp: fix potential use-after-free due to double kfree() - mac80211: fix encryption issues with WEP - devlink: use right genl user_ptr when handling port param get/set - ipv6: set multicast flag on the multicast route - tcp: fix TCP_USER_TIMEOUT with zero window Previous releases - always broken: - bpf: local storage helpers should check nullness of owner ptr passed - mac80211: fix incorrect strlen of .write in debugfs - cls_flower: call nla_ok() before nla_next() - skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too" * tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits) net: systemport: free dev before on error path net: usb: cdc_ncm: don't spew notifications net: mscc: ocelot: Fix multicast to the CPU port tcp: Fix potential use-after-free due to double kfree() bpf: Fix signed_{sub,add32}_overflows type handling can: peak_usb: fix use after free bugs can: vxcan: vxcan_xmit: fix use after free bug can: dev: can_restart: fix use after free bug tcp: fix TCP socket rehash stats mis-accounting net: dsa: b53: fix an off by one in checking "vlan->vid" tcp: do not mess with cloned skbs in tcp_add_backlog() selftests: net: fib_tests: remove duplicate log test net: nfc: nci: fix the wrong NCI_CORE_INIT parameters sh_eth: Fix power down vs. 
is_opened flag ordering net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled netfilter: rpfilter: mask ecn bits before fib lookup udp: mask TOS bits in udp_v4_early_demux() xsk: Clear pool even for inactive queues bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback sh_eth: Make PHY access aware of Runtime PM to fix reboot crash ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Linus Torvalds authored
Pull xen fix from Juergen Gross: "A fix for build failure showing up in some configurations" * tag 'for-linus-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: fix 'nopvspin' build error
-
Tianjia Zhang authored
On the following call path, `sig->pkey_algo` is not assigned in asymmetric_key_verify_signature(), which causes a runtime crash in public_key_verify_signature(). keyctl_pkey_verify asymmetric_key_verify_signature verify_signature public_key_verify_signature This patch simply checks for this situation and fixes the crash caused by the NULL pointer. Fixes: 21552563 ("X.509: support OSCCA SM2-with-SM3 certificate verification") Reported-by: Tobias Markus <tobias@markus-regensburg.de> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: João Fonseca <jpedrofonseca@ua.pt> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Takashi Iwai authored
After the recent work to convert readpages aops to readahead, the NULL checks of readpages aops in cachefiles_read_or_alloc_page() may trigger falsely. Worse, the check is an ASSERT() call, so it panics. Drop the superfluous NULL checks to fix this regression.

[DH: Note that cachefiles never actually used readpages, so this check was never actually necessary]

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208883
BugLink: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245
Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jakub Kicinski authored
Merge tag 'linux-can-fixes-for-5.11-20210120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
linux-can-fixes-for-5.11-20210120

All three patches are by Vincent Mailhol and fix a potential use after free bug in the CAN device infrastructure, the vxcan driver, and the peak_usb driver. In the TX path, the skb is read from after it has been passed to the networking stack with netif_rx_ni().

* tag 'linux-can-fixes-for-5.11-20210120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: peak_usb: fix use after free bugs
  can: vxcan: vxcan_xmit: fix use after free bug
  can: dev: can_restart: fix use after free bug
====================

Link: https://lore.kernel.org/r/20210120125202.2187358-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
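The bug class named here (reading a buffer after ownership has been handed to another layer) and the usual remedy can be sketched in plain C. Everything below is hypothetical: `struct pkt_sketch` stands in for an skb, `hand_off_sketch()` stands in for a call such as netif_rx_ni(), and the counters are invented for the example. The sketch only shows the ordering: copy what is needed first, then give the buffer away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical packet buffer; stands in for an skb in this sketch. */
struct pkt_sketch {
	size_t len;
};

/* Hypothetical counters, analogous to per-device statistics. */
struct stats_sketch {
	size_t bytes;
	size_t packets;
};

/*
 * Stand-in for handing the buffer to the stack: the callee takes ownership
 * and may free the buffer at any time, so the caller must not touch it again.
 */
static void hand_off_sketch(struct pkt_sketch *pkt)
{
	free(pkt);
}

/* Read everything needed from the buffer first, then give it away. */
static void deliver_and_account_sketch(struct pkt_sketch *pkt,
				       struct stats_sketch *st)
{
	size_t len;

	assert(pkt != NULL && st != NULL);
	len = pkt->len;		/* cache before ownership is transferred */
	hand_off_sketch(pkt);	/* pkt is invalid after this call */

	st->packets++;
	st->bytes += len;	/* uses the cached copy, not pkt->len */
}
```

Swapping the two statements so that `pkt->len` is read after `hand_off_sketch()` would reproduce the use after free.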
-
Pan Bian authored
On the error path, the code should jump to the error handling label to free the allocated memory rather than return directly.

Fixes: 31bc72d9 ("net: systemport: fetch and use clock resources")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210120044423.1704-1-bianpan2016@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
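A minimal sketch of the goto-based cleanup pattern this message refers to, in userspace C. The names (`probe_sketch`, `struct dev_sketch`, the allocation counter) are hypothetical and are not taken from the systemport driver; the counter exists only so that a leak would be observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device context; not the systemport driver's structure. */
struct dev_sketch {
	int clk_ok;
};

/* Counts live allocations so a leak is observable in this sketch. */
static int live_allocs_sketch;

static struct dev_sketch *alloc_dev_sketch(void)
{
	struct dev_sketch *dev = calloc(1, sizeof(*dev));

	if (dev)
		live_allocs_sketch++;
	return dev;
}

static void free_dev_sketch(struct dev_sketch *dev)
{
	free(dev);
	live_allocs_sketch--;
}

/*
 * Probe-style setup: every failure after the allocation jumps to the
 * cleanup label instead of returning directly, so the device is freed.
 */
static int probe_sketch(int clk_available, struct dev_sketch **out)
{
	struct dev_sketch *dev;
	int ret;

	assert(out != NULL);
	dev = alloc_dev_sketch();
	if (!dev)
		return -ENOMEM;	/* nothing allocated yet, direct return is fine */

	if (!clk_available) {
		ret = -ENODEV;
		goto err_free_dev;	/* a plain "return ret" here would leak dev */
	}

	dev->clk_ok = 1;
	*out = dev;
	return 0;

err_free_dev:
	free_dev_sketch(dev);
	return ret;
}
```

Calling `probe_sketch(0, &dev)` returns `-ENODEV` and leaves `live_allocs_sketch` at zero, which is the property a direct return would break.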
-
Grant Grundler authored
RTL8156 sends notifications about every 32ms. Only display/log notifications when something changes.

This issue has been reported by others:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1832472
https://lkml.org/lkml/2020/8/27/1083

...
[785962.779840] usb 1-1: new high-speed USB device number 5 using xhci_hcd
[785962.929944] usb 1-1: New USB device found, idVendor=0bda, idProduct=8156, bcdDevice=30.00
[785962.929949] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[785962.929952] usb 1-1: Product: USB 10/100/1G/2.5G LAN
[785962.929954] usb 1-1: Manufacturer: Realtek
[785962.929956] usb 1-1: SerialNumber: 000000001
[785962.991755] usbcore: registered new interface driver cdc_ether
[785963.017068] cdc_ncm 1-1:2.0: MAC-Address: 00:24:27:88:08:15
[785963.017072] cdc_ncm 1-1:2.0: setting rx_max = 16384
[785963.017169] cdc_ncm 1-1:2.0: setting tx_max = 16384
[785963.017682] cdc_ncm 1-1:2.0 usb0: register 'cdc_ncm' at usb-0000:00:14.0-1, CDC NCM, 00:24:27:88:08:15
[785963.019211] usbcore: registered new interface driver cdc_ncm
[785963.023856] usbcore: registered new interface driver cdc_wdm
[785963.025461] usbcore: registered new interface driver cdc_mbim
[785963.038824] cdc_ncm 1-1:2.0 enx002427880815: renamed from usb0
[785963.089586] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
[785963.121673] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
[785963.153682] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
...

This is about 2KB per second and will overwrite all contents of a 1MB dmesg buffer in under 10 minutes rendering them useless for debugging many kernel problems.

This is also an extra 180 MB/day in /var/logs (or 1GB per week) rendering the majority of those logs useless too.

When the link is up (expected state), spew amount is >2x higher:

...
[786139.600992] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
[786139.632997] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
[786139.665097] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
[786139.697100] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
[786139.729094] cdc_ncm 2-1:2.0 enx002427880815: network connection: connected
[786139.761108] cdc_ncm 2-1:2.0 enx002427880815: 2500 mbit/s downlink 2500 mbit/s uplink
...

Chrome OS cannot support RTL8156 until this is fixed.

Signed-off-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Hayes Wang <hayeswang@realtek.com>
Link: https://lore.kernel.org/r/20210120011208.3768105-1-grundler@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
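The "only log when something changes" idea can be sketched as a small piece of remembered state compared against each incoming notification. This is an illustrative userspace sketch, not the cdc_ncm code: `struct link_state_sketch`, `notify_sketch()` and the printed format are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-device state remembered between notifications. */
struct link_state_sketch {
	bool known;		/* false until the first notification arrives */
	bool connected;
	unsigned int rx_mbit;
	unsigned int tx_mbit;
};

/*
 * Handle one periodic notification. Returns true (and prints) only when
 * the reported state differs from the last one that was seen.
 */
static bool notify_sketch(struct link_state_sketch *last, bool connected,
			  unsigned int rx_mbit, unsigned int tx_mbit)
{
	assert(last != NULL);

	if (last->known && last->connected == connected &&
	    last->rx_mbit == rx_mbit && last->tx_mbit == tx_mbit)
		return false;	/* nothing changed: stay quiet */

	last->known = true;
	last->connected = connected;
	last->rx_mbit = rx_mbit;
	last->tx_mbit = tx_mbit;

	printf("network connection: %sconnected, %u mbit/s downlink %u mbit/s uplink\n",
	       connected ? "" : "dis", rx_mbit, tx_mbit);
	return true;
}
```

A device that repeats the same notification every 32ms then produces one log line per actual state change rather than one per notification.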
-
Alban Bedel authored
Multicast entries in the MAC table use the high bits of the MAC address to encode the ports that should get the packets. But this port mask does not work for the CPU port; to receive these packets on the CPU port, the MAC_CPU_COPY flag must be set.

Because of this, IPv6 was effectively not working, since neighbor solicitations were never received. This was not apparent before commit 9403c158 (net: mscc: ocelot: support IPv4, IPv6 and plain Ethernet mdb entries) as the IPv6 entries were broken, so all incoming IPv6 multicast was then treated as unknown and flooded on all ports.

To fix this problem, rework ocelot_mact_learn() to set the MAC_CPU_COPY flag when a multicast entry that targets the CPU port is added. For this we have to read back the ports encoded in the pseudo MAC address by the caller. It is not a very nice design, but it avoids changing the callers and should make backporting easier.

Signed-off-by: Alban Bedel <alban.bedel@aerq.com>
Fixes: 9403c158 ("net: mscc: ocelot: support IPv4, IPv6 and plain Ethernet mdb entries")
Link: https://lore.kernel.org/r/20210119140638.203374-1-alban.bedel@aerq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
When receiving an ACK with a valid SYN cookie, cookie_v4_check() allocates struct request_sock and then can allocate inet_rsk(req)->ireq_opt. After that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full socket into ehash and sets ireq_opt to NULL. Otherwise, tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full socket.

The commit 01770a16 ("tcp: fix race condition when creating child sockets from syncookies") added a new path, in which more than one core creates a full socket for the same SYN cookie. Currently, the core which loses the race frees the full socket without resetting inet_opt, so both sock_put() and reqsk_put() call kfree() on the same memory:

  sock_put
    sk_free
      __sk_free
        sk_destruct
          __sk_destruct
            sk->sk_destruct/inet_sock_destruct
              kfree(rcu_dereference_protected(inet->inet_opt, 1));

  reqsk_put
    reqsk_free
      __reqsk_free
        req->rsk_ops->destructor/tcp_v4_reqsk_destructor
          kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));

Calling kmalloc() between the two kfree() calls can lead to use-after-free, so this patch fixes it by setting inet_opt to NULL before sock_put().

As a side note, this kind of issue does not happen for IPv6. This is because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which correspond to ireq_opt in IPv4.

Fixes: 01770a16 ("tcp: fix race condition when creating child sockets from syncookies")
CC: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
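The ownership problem described above (two objects holding the same option pointer, each freeing it in its destructor) can be sketched in userspace C. The structures and function names below are hypothetical stand-ins, not kernel code; a counter records how many times the shared buffer is freed so the effect of clearing the pointer is observable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the request sock and the full sock. */
struct req_sketch {
	char *opt;	/* plays the role of ireq_opt */
};

struct sock_sketch {
	char *opt;	/* plays the role of inet_opt */
};

/* Counts frees of the option buffer so a double free would be visible. */
static int opt_frees_sketch;

static void free_opt_sketch(char *opt)
{
	if (opt) {
		opt_frees_sketch++;
		free(opt);
	}
}

/* Destructor of the full socket: frees whatever option pointer it holds. */
static void sock_put_sketch(struct sock_sketch *sk)
{
	free_opt_sketch(sk->opt);
	free(sk);
}

/* Destructor of the request: frees its own option pointer. */
static void req_put_sketch(struct req_sketch *req)
{
	free_opt_sketch(req->opt);
	free(req);
}

/*
 * Path taken by the core that lost the race: the full socket shares the
 * option buffer with the request, so the socket's pointer is cleared
 * before the socket is dropped. Only the request then frees the buffer.
 */
static void drop_loser_sketch(struct sock_sketch *sk, struct req_sketch *req)
{
	assert(sk != NULL && req != NULL);
	sk->opt = NULL;		/* without this, both puts free the same buffer */
	sock_put_sketch(sk);
	req_put_sketch(req);
}
```

After `drop_loser_sketch()` the counter reads 1; removing the `sk->opt = NULL` assignment would make both destructors free the same allocation.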
-