Commits · 01bdc58e94b46b88d4921f46f423bdeb8b137f28 · Kirill Smelkov / linux

06 Oct, 2021 21 commits

Johan Almbladh authored Oct 05, 2021

This patch enables the new eBPF JITs for 32-bit and 64-bit MIPS. It also
disables the old cBPF JIT to so cBPF programs are converted to use the
new JIT.

Workarounds for R4000 CPU errata are not implemented by the JIT, so the
JIT is disabled if any of those workarounds are configured.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211005165408.2305108-7-johan.almbladh@anyfinetworks.com

01bdc58e

mips, bpf: Add JIT workarounds for CPU errata · 72570224

Johan Almbladh authored Oct 05, 2021

This patch adds workarounds for the following CPU errata to the MIPS
eBPF JIT, if enabled in the kernel configuration.

  - R10000 ll/sc weak ordering
  - Loongson-3 ll/sc weak ordering
  - Loongson-2F jump hang

The Loongson-2F nop errata is implemented in uasm, which the JIT uses,
so no additional mitigations are needed for that.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Link: https://lore.kernel.org/bpf/20211005165408.2305108-6-johan.almbladh@anyfinetworks.com

72570224

mips, bpf: Add new eBPF JIT for 64-bit MIPS · fbc802de

Johan Almbladh authored Oct 05, 2021

This is an implementation on of an eBPF JIT for 64-bit MIPS III-V and
MIPS64r1-r6. It uses the same framework introduced by the 32-bit JIT.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211005165408.2305108-5-johan.almbladh@anyfinetworks.com

fbc802de

mips, bpf: Add eBPF JIT for 32-bit MIPS · eb63cfcd

Johan Almbladh authored Oct 05, 2021

This is an implementation of an eBPF JIT for 32-bit MIPS I-V and MIPS32.
The implementation supports all 32-bit and 64-bit ALU and JMP operations,
including the recently-added atomics. 64-bit div/mod and 64-bit atomics
are implemented using function calls to math64 and atomic64 functions,
respectively. All 32-bit operations are implemented natively by the JIT,
except if the CPU lacks ll/sc instructions.

Register mapping
================
All 64-bit eBPF registers are mapped to native 32-bit MIPS register pairs,
and does not use any stack scratch space for register swapping. This means
that all eBPF register data is kept in CPU registers all the time, and
this simplifies the register management a lot. It also reduces the JIT's
pressure on temporary registers since we do not have to move data around.

Native register pairs are ordered according to CPU endiannes, following
the O32 calling convention for passing 64-bit arguments and return values.
The eBPF return value, arguments and callee-saved registers are mapped to
their native MIPS equivalents.

Since the 32 highest bits in the eBPF FP (frame pointer) register are
always zero, only one general-purpose register is actually needed for the
mapping. The MIPS fp register is used for this purpose. The high bits are
mapped to MIPS register r0. This saves us one CPU register, which is much
needed for temporaries, while still allowing us to treat the R10 (FP)
register just like any other eBPF register in the JIT.

The MIPS gp (global pointer) and at (assembler temporary) registers are
used as internal temporary registers for constant blinding. CPU registers
t6-t9 are used internally by the JIT when constructing more complex 64-bit
operations. This is precisely what is needed - two registers to store an
operand value, and two more as scratch registers when performing the
operation.

The register mapping is shown below.

    R0 - $v1, $v0   return value
    R1 - $a1, $a0   argument 1, passed in registers
    R2 - $a3, $a2   argument 2, passed in registers
    R3 - $t1, $t0   argument 3, passed on stack
    R4 - $t3, $t2   argument 4, passed on stack
    R5 - $t4, $t3   argument 5, passed on stack
    R6 - $s1, $s0   callee-saved
    R7 - $s3, $s2   callee-saved
    R8 - $s5, $s4   callee-saved
    R9 - $s7, $s6   callee-saved
    FP - $r0, $fp   32-bit frame pointer
    AX - $gp, $at   constant-blinding
         $t6 - $t9  unallocated, JIT temporaries

Jump offsets
============
The JIT tries to map all conditional JMP operations to MIPS conditional
PC-relative branches. The MIPS branch offset field is 18 bits, in bytes,
which is equivalent to the eBPF 16-bit instruction offset. However, since
the JIT may emit more than one CPU instruction per eBPF instruction, the
field width may overflow. If that happens, the JIT converts the long
conditional jump to a short PC-relative branch with the condition
inverted, jumping over a long unconditional absolute jmp (j).

This conversion will change the instruction offset mapping used for jumps,
and may in turn result in more branch offset overflows. The JIT therefore
dry-runs the translation until no more branches are converted and the
offsets do not change anymore. There is an upper bound on this of course,
and if the JIT hits that limit, the last two iterations are run with all
branches being converted.

Tail call count
===============
The current tail call count is stored in the 16-byte area of the caller's
stack frame that is reserved for the callee in the o32 ABI. The value is
initialized in the prologue, and propagated to the tail-callee by skipping
the initialization instructions when emitting the tail call.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211005165408.2305108-4-johan.almbladh@anyfinetworks.com

eb63cfcd

mips, uasm: Add workaround for Loongson-2F nop CPU errata · f7c036c1

Johan Almbladh authored Oct 05, 2021

This patch implements a workaround for the Loongson-2F nop in generated,
code, if the existing option CONFIG_CPU_NOP_WORKAROUND is set. Before,
the binutils option -mfix-loongson2f-nop was enabled, but no workaround
was done when emitting MIPS code. Now, the nop pseudo instruction is
emitted as "or ax,ax,zero" instead of the default "sll zero,zero,0". This
is consistent with the workaround implemented by binutils.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Link: https://sourceware.org/legacy-ml/binutils/2009-11/msg00387.html
Link: https://lore.kernel.org/bpf/20211005165408.2305108-3-johan.almbladh@anyfinetworks.com

f7c036c1

mips, uasm: Enable muhu opcode for MIPS R6 · e737547e

Tony Ambardar authored Oct 05, 2021

Enable the 'muhu' instruction, complementing the existing 'mulu', needed
to implement a MIPS32 BPF JIT.

Also fix a typo in the existing definition of 'dmulu'.
Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211005165408.2305108-2-johan.almbladh@anyfinetworks.com

e737547e

selftests/bpf: Test new btf__add_btf() API · 9d057872

Andrii Nakryiko authored Oct 05, 2021

Add a test that validates that btf__add_btf() API is correctly copying
all the types from the source BTF into destination BTF object and
adjusts type IDs and string offsets properly.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211006051107.17921-4-andrii@kernel.org

9d057872

selftests/bpf: Refactor btf_write selftest to reuse BTF generation logic · c65eb808

Andrii Nakryiko authored Oct 05, 2021

Next patch will need to reuse BTF generation logic, which tests every
supported BTF kind, for testing btf__add_btf() APIs. So restructure
existing selftests and make it as a single subtest that uses bulk
VALIDATE_RAW_BTF() macro for raw BTF dump checking.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211006051107.17921-3-andrii@kernel.org

c65eb808

libbpf: Add API that copies all BTF types from one BTF object to another · 7ca61121

Andrii Nakryiko authored Oct 05, 2021

Add a bulk copying api, btf__add_btf(), that speeds up and simplifies
appending entire contents of one BTF object to another one, taking care
of copying BTF type data, adjusting resulting BTF type IDs according to
their new locations in the destination BTF object, as well as copying
and deduplicating all the referenced strings and updating all the string
offsets in new BTF types as appropriate.

This API is intended to be used from tools that are generating and
otherwise manipulating BTFs generically, such as pahole. In pahole's
case, this API is useful for speeding up parallelized BTF encoding, as
it allows pahole to offload all the intricacies of BTF type copying to
libbpf and handle the parallelization aspects of the process.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: https://lore.kernel.org/bpf/20211006051107.17921-2-andrii@kernel.org

7ca61121

bpf, x64: Save bytes for DIV by reducing reg copies · 57a610f1

Jie Meng authored Oct 01, 2021

Instead of unconditionally performing push/pop on %rax/%rdx in case of
division/modulo, we can save a few bytes in case of destination register
being either BPF r0 (%rax) or r3 (%rdx) since the result is written in
there anyway.

Also, we do not need to copy the source to %r11 unless the source is either
%rax, %rdx or an immediate.

For example, before the patch:

  22:   push   %rax
  23:   push   %rdx
  24:   mov    %rsi,%r11
  27:   xor    %edx,%edx
  29:   div    %r11
  2c:   mov    %rax,%r11
  2f:   pop    %rdx
  30:   pop    %rax
  31:   mov    %r11,%rax

After:

  22:   push   %rdx
  23:   xor    %edx,%edx
  25:   div    %rsi
  28:   pop    %rdx
Signed-off-by: Jie Meng <jmeng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211002035626.2041910-1-jmeng@fb.com

57a610f1

bpf: Avoid retpoline for bpf_for_each_map_elem · 0640c77c

Andrey Ignatov authored Oct 05, 2021

Similarly to 09772d92 ("bpf: avoid retpoline for
lookup/update/delete calls on maps") and 84430d42 ("bpf, verifier:
avoid retpoline for map push/pop/peek operation") avoid indirect call
while calling bpf_for_each_map_elem.

Before (a program fragment):

  ; if (rules_map) {
   142: (15) if r4 == 0x0 goto pc+8
   143: (bf) r3 = r10
  ; bpf_for_each_map_elem(rules_map, process_each_rule, &ctx, 0);
   144: (07) r3 += -24
   145: (bf) r1 = r4
   146: (18) r2 = subprog[+5]
   148: (b7) r4 = 0
   149: (85) call bpf_for_each_map_elem#143680  <-- indirect call via
                                                    helper

After (same program fragment):

   ; if (rules_map) {
    142: (15) if r4 == 0x0 goto pc+8
    143: (bf) r3 = r10
   ; bpf_for_each_map_elem(rules_map, process_each_rule, &ctx, 0);
    144: (07) r3 += -24
    145: (bf) r1 = r4
    146: (18) r2 = subprog[+5]
    148: (b7) r4 = 0
    149: (85) call bpf_for_each_array_elem#170336  <-- direct call

On a benchmark that calls bpf_for_each_map_elem() once and does many
other things (mostly checking fields in skb) with CONFIG_RETPOLINE=y it
makes program faster.

Before:

  ============================================================================
  Benchmark.cpp                                              time/iter iters/s
  ============================================================================
  IngressMatchByRemoteEndpoint                                80.78ns 12.38M
  IngressMatchByRemoteIP                                      80.66ns 12.40M
  IngressMatchByRemotePort                                    80.87ns 12.37M

After:

  ============================================================================
  Benchmark.cpp                                              time/iter iters/s
  ============================================================================
  IngressMatchByRemoteEndpoint                                73.49ns 13.61M
  IngressMatchByRemoteIP                                      71.48ns 13.99M
  IngressMatchByRemotePort                                    70.39ns 14.21M
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211006001838.75607-1-rdna@fb.com

0640c77c

Merge branch 'Support kernel module function calls from eBPF' · 32a16f6b

Alexei Starovoitov authored Oct 05, 2021

Kumar Kartikeya says:

====================

This set enables kernel module function calls, and also modifies verifier logic
to permit invalid kernel function calls as long as they are pruned as part of
dead code elimination. This is done to provide better runtime portability for
BPF objects, which can conditionally disable parts of code that are pruned later
by the verifier (e.g. const volatile vars, kconfig options). libbpf
modifications are made along with kernel changes to support module function
calls.

It also converts TCP congestion control objects to use the module kfunc support
instead of relying on IS_BUILTIN ifdef.

Changelog:
----------
v6 -> v7
v6: https://lore.kernel.org/bpf/20210930062948.1843919-1-memxor@gmail.com

 * Let __bpf_check_kfunc_call take kfunc_btf_id_list instead of generating
   callbacks (Andrii)
 * Rename it to bpf_check_mod_kfunc_call to reflect usage
 * Remove OOM checks (Alexei)
 * Remove resolve_btfids invocation for bpf_testmod (Andrii)
 * Move fd_array_cnt initialization near fd_array alloc (Andrii)
 * Rename helper to btf_find_by_name_kind and pass start_id (Andrii)
 * memset when data is NULL in add_data (Alexei)
 * Fix other nits

v5 -> v6
v5: https://lore.kernel.org/bpf/20210927145941.1383001-1-memxor@gmail.com

 * Rework gen_loader relocation emits
   * Only emit bpf_btf_find_by_name_kind call when required (Alexei)
   * Refactor code to emit ksym var and func relo into separate helpers, this
     will be easier to add future weak/typeless ksym support to (for my followup)
   * Count references for both ksym var and funcs, and avoid calling helpers
     unless required for both of them. This also means we share fds between
     ksym vars for the module BTFs. Also be careful with this when closing
     BTF fd so that we only close one instance of the fd for each ksym

v4 -> v5
v4: https://lore.kernel.org/bpf/20210920141526.3940002-1-memxor@gmail.com

 * Address comments from Alexei
   * Use reserved fd_array area in loader map instead of creating a new map
   * Drop selftest testing the 256 kfunc limit, however selftest testing reuse
     of BTF fd for same kfunc in gen_loader and libbpf is kept
 * Address comments from Andrii
   * Make --no-fail the default for resolve_btfids, i.e. only fail if we find
     BTF section and cannot process it
   * Use obj->btf_modules array to store index in the fd_array, so that we don't
     have to do any searching to reuse the index, instead only set it the first
     time a module BTF's fd is used
   * Make find_ksym_btf_id to return struct module_btf * in last parameter
   * Improve logging when index becomes bigger than INT16_MAX
   * Add btf__find_by_name_kind_own internal helper to only start searching for
     kfunc ID in module BTF, since find_ksym_btf_id already checks vmlinux BTF
     before iterating over module BTFs.
   * Fix various other nits
 * Fixes for failing selftests on BPF CI
 * Rearrange/cleanup selftests
   * Avoid testing kfunc limit (Alexei)
   * Do test gen_loader and libbpf BTF fd index dedup with 256 calls
   * Move invalid kfunc failure test to verifier selftest
   * Minimize duplication
 * Use consistent bpf_<type>_check_kfunc_call naming for module kfunc callback
 * Since we try to add fd using add_data while we can, cherry pick Alexei's
   patch from CO-RE RFC series to align gen_loader data.

v3 -> v4
v3: https://lore.kernel.org/bpf/20210915050943.679062-1-memxor@gmail.com

 * Address comments from Alexei
   * Drop MAX_BPF_STACK change, instead move map_fd and BTF fd to BPF array map
     and pass fd_array using BPF_PSEUDO_MAP_IDX_VALUE
 * Address comments from Andrii
   * Fix selftest to store to variable for observing function call instead of
     printk and polluting CI logs
 * Drop use of raw_tp for testing, instead reuse classifier based prog_test_run
 * Drop index + 1 based insn->off convention for kfunc module calls
 * Expand selftests to cover more corner cases
 * Misc cleanups

v2 -> v3
v2: https://lore.kernel.org/bpf/20210914123750.460750-1-memxor@gmail.com

 * Fix issues pointed out by Kernel Test Robot
 * Fix find_kfunc_desc to also take offset into consideration when comparing

RFC v1 -> v2
v1: https://lore.kernel.org/bpf/20210830173424.1385796-1-memxor@gmail.com

 * Address comments from Alexei
   * Reuse fd_array instead of introducing kfunc_btf_fds array
   * Take btf and module reference as needed, instead of preloading
   * Add BTF_KIND_FUNC relocation support to gen_loader infrastructure
 * Address comments from Andrii
   * Drop hashmap in libbpf for finding index of existing BTF in fd_array
   * Preserve invalid kfunc calls only when the symbol is weak
 * Adjust verifier selftests
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

32a16f6b

bpf: selftests: Add selftests for module kfunc support · c48e51c8

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This adds selftests that tests the success and failure path for modules
kfuncs (in presence of invalid kfunc calls) for both libbpf and
gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can
add module BTF ID set from bpf_testmod.

This also introduces a couple of test cases to verifier selftests for
validating whether we get an error or not depending on if invalid kfunc
call remains after elimination of unreachable instructions.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com

c48e51c8

libbpf: Update gen_loader to emit BTF_KIND_FUNC relocations · 18f4fccb

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This change updates the BPF syscall loader to relocate BTF_KIND_FUNC
relocations, with support for weak kfunc relocations. The general idea
is to move map_fds to loader map, and also use the data for storing
kfunc BTF fds. Since both reuse the fd_array parameter, they need to be
kept together.

For map_fds, we reserve MAX_USED_MAPS slots in a region, and for kfunc,
we reserve MAX_KFUNC_DESCS. This is done so that insn->off has more
chances of being <= INT16_MAX than treating data map as a sparse array
and adding fd as needed.

When the MAX_KFUNC_DESCS limit is reached, we fall back to the sparse
array model, so that as long as it does remain <= INT16_MAX, we pass an
index relative to the start of fd_array.

We store all ksyms in an array where we try to avoid calling the
bpf_btf_find_by_name_kind helper, and also reuse the BTF fd that was
already stored. This also speeds up the loading process compared to
emitting calls in all cases, in later tests.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-9-memxor@gmail.com

18f4fccb

libbpf: Resolve invalid weak kfunc calls with imm = 0, off = 0 · 466b2e13

Kumar Kartikeya Dwivedi authored Oct 02, 2021

Preserve these calls as it allows verifier to succeed in loading the
program if they are determined to be unreachable after dead code
elimination during program load. If not, the verifier will fail at
runtime. This is done for ext->is_weak symbols similar to the case for
variable ksyms.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-8-memxor@gmail.com

466b2e13

libbpf: Support kernel module function calls · 9dbe6015

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This patch adds libbpf support for kernel module function call support.
The fd_array parameter is used during BPF program load to pass module
BTFs referenced by the program. insn->off is set to index into this
array, but starts from 1, because insn->off as 0 is reserved for
btf_vmlinux.

We try to use existing insn->off for a module, since the kernel limits
the maximum distinct module BTFs for kfuncs to 256, and also because
index must never exceed the maximum allowed value that can fit in
insn->off (INT16_MAX). In the future, if kernel interprets signed offset
as unsigned for kfunc calls, this limit can be increased to UINT16_MAX.

Also introduce a btf__find_by_name_kind_own helper to start searching
from module BTF's start id when we know that the BTF ID is not present
in vmlinux BTF (in find_ksym_btf_id).
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-7-memxor@gmail.com

9dbe6015

bpf: Enable TCP congestion control kfunc from modules · 0e32dfc8

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementation set up
their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not
dependent on modules are looked up from the wrapper function.

This lifts the restriction for them to be compiled as built in objects,
and can be loaded as modules if required. Also modify Makefile.modfinal
to call resolve_btfids for each module.

Note that since kernel kfunc_ids never overlap with module kfunc_ids, we
only match the owner for module btf id sets.

See following commits for background on use of:

 CONFIG_X86 ifdef:
 569c484f (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86)

 CONFIG_DYNAMIC_FTRACE ifdef:
 7aae231a (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE)
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com

0e32dfc8

tools: Allow specifying base BTF file in resolve_btfids · f614f2c7

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This commit allows specifying the base BTF for resolving btf id
lists/sets during link time in the resolve_btfids tool. The base BTF is
set to NULL if no path is passed. This allows resolving BTF ids for
module kernel objects.

Also, drop the --no-fail option, as it is only used in case .BTF_ids
section is not present, instead make no-fail the default mode. The long
option name is same as that of pahole.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-5-memxor@gmail.com

f614f2c7

bpf: btf: Introduce helpers for dynamic BTF set registration · 14f267d9

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This adds helpers for registering btf_id_set from modules and the
bpf_check_mod_kfunc_call callback that can be used to look them up.

With in kernel sets, the way this is supposed to work is, in kernel
callback looks up within the in-kernel kfunc whitelist, and then defers
to the dynamic BTF set lookup if it doesn't find the BTF id. If there is
no in-kernel BTF id set, this callback can be used directly.

Also fix includes for btf.h and bpfptr.h so that they can included in
isolation. This is in preparation for their usage in tcp_bbr, tcp_cubic
and tcp_dctcp modules in the next patch.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-4-memxor@gmail.com

14f267d9

bpf: Be conservative while processing invalid kfunc calls · a5d82727

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This patch also modifies the BPF verifier to only return error for
invalid kfunc calls specially marked by userspace (with insn->imm == 0,
insn->off == 0) after the verifier has eliminated dead instructions.
This can be handled in the fixup stage, and skip processing during add
and check stages.

If such an invalid call is dropped, the fixup stage will not encounter
insn->imm as 0, otherwise it bails out and returns an error.

This will be exposed as weak ksym support in libbpf in later patches.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-3-memxor@gmail.com

a5d82727

bpf: Introduce BPF support for kernel module function calls · 2357672c

Kumar Kartikeya Dwivedi authored Oct 02, 2021

This change adds support on the kernel side to allow for BPF programs to
call kernel module functions. Userspace will prepare an array of module
BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
In the kernel, the module BTFs are placed in the auxilliary struct for
bpf_prog, and loaded as needed.

The verifier then uses insn->off to index into the fd_array. insn->off
0 is reserved for vmlinux BTF (for backwards compat), so userspace must
use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
sorted based on offset in an array, and each offset corresponds to one
descriptor, with a max limit up to 256 such module BTFs.

We also change existing kfunc_tab to distinguish each element based on
imm, off pair as each such call will now be distinct.

Another change is to check_kfunc_call callback, which now include a
struct module * pointer, this is to be used in later patch such that the
kfunc_id and module pointer are matched for dynamically registered BTF
sets from loadable modules, so that same kfunc_id in two modules doesn't
lead to check_kfunc_call succeeding. For the duration of the
check_kfunc_call, the reference to struct module exists, as it returns
the pointer stored in kfunc_btf_tab.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com

2357672c

05 Oct, 2021 19 commits

Merge tag 'for-net-next-2021-10-01' of... · d0f1c248

Jakub Kicinski authored Oct 05, 2021

Merge tag 'for-net-next-2021-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

 - Add support for MediaTek MT7922 and MT7921
 - Enable support for AOSP extention in Qualcomm WCN399x and Realtek
   8822C/8852A.
 - Add initial support for link quality and audio/codec offload.
 - Rework of sockets sendmsg to avoid locking issues.
 - Add vhci suspend/resume emulation.

====================

Link: https://lore.kernel.org/r/20211001230850.3635543-1-luiz.dentz@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d0f1c248

net: usb: use eth_hw_addr_set() for dev->addr_len cases · 49ed8dde

Jakub Kicinski authored Oct 04, 2021

Convert usb drivers from memcpy(... dev->addr_len)
to eth_hw_addr_set():

  @@
  expression dev, np;
  @@
  - memcpy(dev->dev_addr, np, dev->addr_len)
  + eth_hw_addr_set(dev, np)

Manually checked these are either usbnet or pure etherdevs.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

49ed8dde

ethernet: use eth_hw_addr_set() for dev->addr_len cases · a05e4c0a

Jakub Kicinski authored Oct 04, 2021

Convert all Ethernet drivers from memcpy(... dev->addr_len)
to eth_hw_addr_set():

  @@
  expression dev, np;
  @@
  - memcpy(dev->dev_addr, np, dev->addr_len)
  + eth_hw_addr_set(dev, np)

In theory addr_len may not be ETH_ALEN, but we don't expect
non-Ethernet devices to live under this directory, and only
the following cases of setting addr_len exist:
 - cxgb4 for mgmt device,
and the drivers which set it to ETH_ALEN: s2io, mlx4, vxge.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a05e4c0a

Merge branch 'mlx4-const-dev_addr' · 5e8fba84

David S. Miller authored Oct 05, 2021

Jakub Kicinski says:

====================
mlx4: prep for constant dev->dev_addr

This patch converts mlx4 for dev->dev_addr being const. It converts
to use of common helpers but also removes some seemingly unnecessary
idiosyncrasies.

Please review.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5e8fba84

mlx4: constify args for const dev_addr · ebb1fdb5

Jakub Kicinski authored Oct 04, 2021

netdev->dev_addr will become const soon. Make sure all
functions which pass it around mark appropriate args
as const.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ebb1fdb5

mlx4: remove custom dev_addr clearing · e04ffd12

Jakub Kicinski authored Oct 04, 2021

mlx4_en_u64_to_mac() takes the dev->dev_addr pointer and writes
to it byte by byte. It also clears the two bytes _after_ ETH_ALEN
which seems unnecessary. dev->addr_len is set to ETH_ALEN just
before the call.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e04ffd12

mlx4: replace mlx4_u64_to_mac() with u64_to_ether_addr() · 1bb96a07

Jakub Kicinski authored Oct 04, 2021

mlx4_u64_to_mac() predates the common helper but doesn't
make the argument constant.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1bb96a07

mlx4: replace mlx4_mac_to_u64() with ether_addr_to_u64() · ded6e16b

Jakub Kicinski authored Oct 04, 2021

mlx4_mac_to_u64() predates and opencodes ether_addr_to_u64().
It doesn't make the argument constant so it'll be problematic
when dev->dev_addr becomes a const. Convert to the generic helper.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ded6e16b

netlink: remove netlink_broadcast_filtered · 549017aa

Florian Westphal authored Oct 05, 2021

No users in tree since commit a3498436 ("netns: restrict uevents"),
so remove this functionality.

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

549017aa

Merge tag 'mlx5-updates-2021-10-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 95bf387e

David S. Miller authored Oct 05, 2021

Saeed Mahameed says:

====================
mlx5-updates-2021-10-04

Misc updates for mlx5 driver

1) Add TX max rate support for MQPRIO channel mode
2) Trivial TC action and modify header refactoring
3) TC support for accept action in fdb offloads
4) Allow single IRQ for PCI functions

5) Bridge offload: Pop PVID VLAN header on egress miss

Vlad Buslov says:
=================

With current architecture of mlx5 bridge offload it is possible for a
packet to match in ingress table by source MAC (resulting VLAN header push
in case of port with configured PVID) and then miss in egress table when
destination MAC is not in FDB. Due to the lack of hardware learning in
NICs, this, in turn, results packet going to software data path with PVID
VLAN already added by hardware. This doesn't break software bridge since it
accepts either untagged packets or packets with any provisioned VLAN on
ports with PVID, but can break ingress TC, if affected part of Ethernet
header is matched by classifier.

Improve compatibility with software TC by restoring the packet header on
egress miss. Effectively, this change implements atomicity of mlx5 bridge
offload implementation - packet is either modified and redirected to
destination port or appears unmodified in software.

=================

=================
Signed-off-by: David S. Miller <davem@davemloft.net>

95bf387e

net: bgmac: support MDIO described in DT · 45c9d966

Rafał Miłecki authored Oct 02, 2021

Check ethernet controller DT node for "mdio" subnode and use it with
of_mdiobus_register() when present. That allows specifying MDIO and its
PHY devices in a standard DT based way.

This is required for BCM53573 SoC support. That family is sometimes
called Northstar (by marketing?) but is quite different from it. It uses
different CPU(s) and many different hw blocks.

One of shared blocks in BCM53573 is Ethernet controller. Switch however
is not SRAB accessible (as it Northstar) but is MDIO attached.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

45c9d966

net: bgmac: improve handling PHY · b5375509

Rafał Miłecki authored Oct 02, 2021

1. Use info from DT if available

It allows describing for example a fixed link. It's more accurate than
just guessing there may be one (depending on a chipset).

2. Verify PHY ID before trying to connect PHY

PHY addr 0x1e (30) is special in Broadcom routers and means a switch
connected as MDIO devices instead of a real PHY. Don't try connecting to
it.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

b5375509

ethernet: ehea: add missing cast · ceca777d

Jakub Kicinski authored Oct 04, 2021

We need to cast the pointer, unlike memcpy() eth_hw_addr_set()
does not take void *. The driver already casts &port->mac_addr
to u8 * in other places.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: a96d317f ("ethernet: use eth_hw_addr_set()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ceca777d

sparc: Fix typo. · fb8ece51

David S. Miller authored Oct 05, 2021

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

fb8ece51

net/mlx5: Enable single IRQ for PCI Function · f891b7cd

Shay Drory authored Aug 01, 2021

Prior to this patch the driver requires two IRQs to function properly,
one required IRQ for control and at least one required IRQ for IO.

This requirement can be relaxed to one as the driver now allows
sharing of IRQs, so control and IO EQs can share the same irq.

This is needed for high scale amount of VFs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

f891b7cd

net/mlx5: Shift control IRQ to the last index · 3663ad34

Shay Drory authored Aug 19, 2021

Control IRQ is the first IRQ vector. This complicates handling of
completion irqs as we need to offset them by one.
in the next patch, there are scenarios where completion and control EQs
will share the same irq. for example: functions with single IRQ. To ease
such scenarios, we shift control IRQ to the end of the irq array.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

3663ad34

net/mlx5: Bridge, pop VLAN on egress table miss · 575baa92

Vlad Buslov authored Sep 08, 2021

Create lowest priority flow group in egress table with single rule that
matches on special reg_c1 value that is set on ingress VLAN push with
single action that pops VLAN. The flow destination is skip table that is
used to skip any further processing of packet in FDB bridge priority.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

575baa92

net/mlx5: Bridge, mark reg_c1 when pushing VLAN · 5249001d

Vlad Buslov authored Sep 01, 2021

On ingress VLAN push also assign value 0x7FE to reg_c1 tunnel id+opts
bits (tunnel id 0, which is not a valid tunnel id, and option 0x7FE which
was reserved by one of previous patches in the series). In following patch
the reg value is matched on egress miss to restore the packet to its
original state by removing the VLAN before passing it to the software data
path.
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

5249001d

net/mlx5: Bridge, extract VLAN pop code to dedicated functions · 64fc4b35

Vlad Buslov authored Sep 02, 2021

Following patches in series need to pop VLAN when packet misses on egress.
To reuse existing bridge VLAN pop handling code, extract it to dedicated
helpers mlx5_esw_bridge_pkt_reformat_vlan_pop_supported() and
mlx5_esw_bridge_pkt_reformat_vlan_pop_create().
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

64fc4b35