Commits · 08eea4f312c09e1c203ad3e7840978fe3240888e · Kirill Smelkov / linux

27 Aug, 2019 5 commits

Merge branch 'bpf-precision-tracking-tests' · 08eea4f3

Daniel Borkmann authored Aug 28, 2019

Alexei Starovoitov says:

====================
Add few additional tests for precision tracking in the verifier.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

08eea4f3

selftests/bpf: add precision tracking test · 411cdb45

Alexei Starovoitov authored Aug 22, 2019

Copy-paste of existing test
"calls: cross frame pruning - liveness propagation"
but ran with different parentage chain heuristic
which stresses different path in precision tracking logic.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

411cdb45

selftests/bpf: verifier precise tests · e8c13c4d

Alexei Starovoitov authored Aug 22, 2019

Use BPF_F_TEST_STATE_FREQ flag to check that precision
tracking works as expected by comparing every step it takes.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

e8c13c4d

tools/bpf: sync bpf.h · 0fc2e0b8

Alexei Starovoitov authored Aug 22, 2019

sync bpf.h from kernel/ to tools/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

0fc2e0b8

bpf: introduce verifier internal test flag · 10d274e8

Alexei Starovoitov authored Aug 22, 2019

Introduce BPF_F_TEST_STATE_FREQ flag to stress test parentage chain
and state pruning.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

10d274e8

21 Aug, 2019 10 commits

tools: bpftool: add "bpftool map freeze" subcommand · 0bb52b0d

Quentin Monnet authored Aug 21, 2019

Add a new subcommand to freeze maps from user space.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

0bb52b0d

tools: bpftool: show frozen status for maps · c354ff2e

Quentin Monnet authored Aug 21, 2019

When listing maps, read their "frozen" status from procfs, and tell if
maps are frozen.

As commit log for map freezing command mentions that the feature might
be extended with flags (e.g. for write-only instead of read-only) in the
future, use an integer and not a boolean for JSON output.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

c354ff2e

bpf: sync bpf.h to tools/ · 1f8919b1

Peter Wu authored Aug 21, 2019

Fix a 'struct pt_reg' typo and clarify when bpf_trace_printk discards
lines. Affects documentation only.
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

1f8919b1

bpf: clarify when bpf_trace_printk discards lines · 55c33dfb

Peter Wu authored Aug 21, 2019

I opened /sys/kernel/tracing/trace once and kept reading from it.
bpf_trace_printk somehow did not seem to work, no entries were appended
to that trace file. It turns out that tracing is disabled when that file
is open. Save the next person some time and document this.

The trace file is described in Documentation/trace/ftrace.rst, however
the implication "tracing is disabled" did not immediate translate to
"bpf_trace_printk silently discards entries".
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

55c33dfb

bpf: fix 'struct pt_reg' typo in documentation · 8050a395

Peter Wu authored Aug 21, 2019

There is no 'struct pt_reg'.
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

8050a395

bpf: clarify description for CONFIG_BPF_EVENTS · 5cbd22c1

Peter Wu authored Aug 21, 2019

PERF_EVENT_IOC_SET_BPF supports uprobes since v4.3, and tracepoints
since v4.7 via commit 04a22fae ("tracing, perf: Implement BPF
programs attached to uprobes"), and commit 98b5c2c6 ("perf, bpf:
allow bpf programs attach to tracepoints") respectively.
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

5cbd22c1

btf: do not use CONFIG_OUTPUT_FORMAT · fdf37037

Ilya Leoshkevich authored Aug 20, 2019

Building s390 kernel with CONFIG_DEBUG_INFO_BTF fails, because
CONFIG_OUTPUT_FORMAT is not defined. As a matter of fact, this variable
appears to be x86-only, so other arches might be affected as well.

Fix by obtaining this value from objdump output, just like it's already
done for bin_arch. The exact objdump invocation is "inspired" by
arch/powerpc/boot/wrapper.

Also, use LANG=C for the existing bin_arch objdump invocation to avoid
potential build issues on systems with non-English locale.

Fixes: 341dfcf8 ("btf: expose BTF info through sysfs")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

fdf37037

samples: bpf: syscall_nrs: use mmap2 if defined · bb4b5c08

Ivan Khoronzhuk authored Aug 15, 2019

For arm32 xdp sockets mmap2 is preferred, so use it if it's defined.
Declaration of __NR_mmap can be skipped and it breaks build.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bb4b5c08

xdp: xdp_umem: replace kmap on vmap for umem map · 624676e7

Ivan Khoronzhuk authored Aug 15, 2019

For 64-bit there is no reason to use vmap/vunmap, so use page_address
as it was initially. For 32 bits, in some apps, like in samples
xdpsock_user.c when number of pgs in use is quite big, the kmap
memory can be not enough, despite on this, kmap looks like is
deprecated in such cases as it can block and should be used rather
for dynamic mm.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

624676e7

libbpf: use LFS (_FILE_OFFSET_BITS) instead of direct mmap2 syscall · 71dd77fd

Ivan Khoronzhuk authored Aug 15, 2019

Drop __NR_mmap2 fork in flavor of LFS, that is _FILE_OFFSET_BITS=64
(glibc & bionic) / LARGEFILE64_SOURCE (for musl) decision. It allows
mmap() to use 64bit offset that is passed to mmap2 syscall. As result
pgoff is not truncated and no need to use direct access to mmap2 for
32 bits systems.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

71dd77fd

20 Aug, 2019 10 commits

Merge branch 'btf_get_next_id' · 51746f94

Alexei Starovoitov authored Aug 20, 2019

Quentin Monnet says:

====================
This set adds a new command BPF_BTF_GET_NEXT_ID to the bpf() system call,
adds the relevant API function in libbpf, and uses it in bpftool to list
all BTF objects loaded on the system (and to dump the ids of maps and
programs associated with them, if any).

The main motivation of listing BTF objects is introspection and debugging
purposes. By getting BPF program and map information, it should already be
possible to list all BTF objects associated to at least one map or one
program. But there may be unattached BTF objects, held by a file descriptor
from a user space process only, and we may want to list them too.

As a side note, it also turned useful for examining the BTF objects
attached to offloaded programs, which would not show in program information
because the BTF id is not copied when retrieving such info. A fix is in
progress on that side.

v2:
- Rebase patch with new libbpf function on top of Andrii's changes
  regarding libbpf versioning.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

51746f94

tools: bpftool: implement "bpftool btf show|list" · 4d374ba0

Quentin Monnet authored Aug 20, 2019

Add a "btf list" (alias: "btf show") subcommand to bpftool in order to
dump all BTF objects loaded on a system.

When running the command, hash tables are built in bpftool to retrieve
all the associations between BTF objects and BPF maps and programs. This
allows for printing all such associations when listing the BTF objects.

The command is added at the top of the subcommands for "bpftool btf", so
that typing only "bpftool btf" also comes down to listing the programs.
We could not have this with the previous command ("dump"), which
required a BTF object id, so it should not break any previous behaviour.
This also makes the "btf" command behaviour consistent with "prog" or
"map".

Bash completion is updated to use "bpftool btf" instead of "bpftool
prog" to list the BTF ids, as it looks more consistent.

Example output (plain):

    # bpftool btf show
    9: size 2989B  prog_ids 21  map_ids 15
    17: size 2847B  prog_ids 36  map_ids 30,29,28
    26: size 2847B
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

4d374ba0

libbpf: add bpf_btf_get_next_id() to cycle through BTF objects · 09d7c2e3

Quentin Monnet authored Aug 20, 2019

Add an API function taking a BTF object id and providing the id of the
next BTF object in the kernel. This can be used to list all BTF objects
loaded on the system.

v2:
- Rebase on top of Andrii's changes regarding libbpf versioning.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

09d7c2e3

libbpf: refactor bpf_*_get_next_id() functions · a6e130c4

Quentin Monnet authored Aug 20, 2019

In preparation for the introduction of a similar function for retrieving
the id of the next BTF object, consolidate the code from
bpf_prog_get_next_id() and bpf_map_get_next_id() in libbpf.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

a6e130c4

tools: bpf: synchronise BPF UAPI header with tools · d2648e1e

Quentin Monnet authored Aug 20, 2019

Synchronise the bpf.h header under tools, to report the addition of the
new BPF_BTF_GET_NEXT_ID syscall command for bpf().
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d2648e1e

bpf: add new BPF_BTF_GET_NEXT_ID syscall command · 1b9ed84e

Quentin Monnet authored Aug 20, 2019

Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used
to cycle through all BTF objects loaded on the system.

The motivation is to be able to inspect (list) all BTF objects presents
on the system.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

1b9ed84e

test_bpf: Fix a new clang warning about xor-ing two numbers · 09845436

Nathan Chancellor authored Aug 18, 2019

r369217 in clang added a new warning about potential misuse of the xor
operator as an exponentiation operator:

../lib/test_bpf.c:870:13: warning: result of '10 ^ 300' is 294; did you
mean '1e300'? [-Wxor-used-as-pow]
                { { 4, 10 ^ 300 }, { 20, 10 ^ 300 } },
                       ~~~^~~~~
                       1e300
../lib/test_bpf.c:870:13: note: replace expression with '0xA ^ 300' to
silence this warning
../lib/test_bpf.c:870:31: warning: result of '10 ^ 300' is 294; did you
mean '1e300'? [-Wxor-used-as-pow]
                { { 4, 10 ^ 300 }, { 20, 10 ^ 300 } },
                                         ~~~^~~~~
                                         1e300
../lib/test_bpf.c:870:31: note: replace expression with '0xA ^ 300' to
silence this warning

The commit link for this new warning has some good logic behind wanting
to add it but this instance appears to be a false positive. Adopt its
suggestion to silence the warning but not change the code. According to
the differential review link in the clang commit, GCC may eventually
adopt this warning as well.

Link: https://github.com/ClangBuiltLinux/linux/issues/643
Link: https://github.com/llvm/llvm-project/commit/920890e26812f808a74c60ebc14cc636dac661c1Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

09845436

bpf: add include guard to tnum.h · 69ecfdaa

Masahiro Yamada authored Aug 20, 2019

Add a header include guard just in case.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

69ecfdaa

bpf: add BTF ids in procfs for file descriptors to BTF objects · 3481e64b

Quentin Monnet authored Aug 20, 2019

Implement the show_fdinfo hook for BTF FDs file operations, and make it
print the id of the BTF object. This allows for a quick retrieval of the
BTF id from its FD; or it can help understanding what type of object
(BTF) the file descriptor points to.

v2:
- Do not expose data_size, only btf_id, in FD info.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

3481e64b

bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc() · ede6bc88

YueHaibing authored Aug 20, 2019

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

ede6bc88

17 Aug, 2019 15 commits

Merge branch 'bpf-af-xdp-xskmap-improvements' · 1f726723

Daniel Borkmann authored Aug 17, 2019

Björn Töpel says:

====================
This series (v5 and counting) add two improvements for the XSKMAP,
used by AF_XDP sockets.

1. Automatic cleanup when an AF_XDP socket goes out of scope/is
   released. Instead of require that the user manually clears the
   "released" state socket from the map, this is done
   automatically. Each socket tracks which maps it resides in, and
   remove itself from those maps at relase. A notable implementation
   change, is that the sockets references the map, instead of the map
   referencing the sockets. Which implies that when the XSKMAP is
   freed, it is by definition cleared of sockets.

2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert,
   which this patch addresses.

v1->v2: Fixed deadlock and broken cleanup. (Daniel)
v2->v3: Rebased onto bpf-next
v3->v4: {READ, WRITE}_ONCE consistency. (Daniel)
        Socket release/map update race. (Daniel)
v4->v5: Avoid use-after-free on XSKMAP self-assignment [1]. (Daniel)
        Removed redundant assignment in xsk_map_update_elem().
        Variable name consistency; Use map_entry everywhere.

[1] https://lore.kernel.org/bpf/20190802081154.30962-1-bjorn.topel@gmail.com/T/#mc68439e97bc07fa301dad9fc4850ed5aa392f385
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

1f726723

xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP · 36cc3435

Björn Töpel authored Aug 15, 2019

The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

36cc3435

xsk: remove AF_XDP socket from map when the socket is released · 0402acd6

Björn Töpel authored Aug 15, 2019

When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.
Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

0402acd6

Merge branch 'bpf-sk-storage-clone' · 8e46c353

Daniel Borkmann authored Aug 17, 2019

Stanislav Fomichev says:

====================
Currently there is no way to propagate sk storage from the listener
socket to a newly accepted one. Consider the following use case:

        fd = socket();
        setsockopt(fd, SOL_IP, IP_TOS,...);
        /* ^^^ setsockopt BPF program triggers here and saves something
         * into sk storage of the listener.
         */
        listen(fd, ...);
        while (client = accept(fd)) {
                /* At this point all association between listener
                 * socket and newly accepted one is gone. New
                 * socket will not have any sk storage attached.
                 */
        }

Let's add new BPF_F_CLONE flag that can be specified when creating
a socket storage map. This new flag indicates that map contents
should be cloned when the socket is cloned.

v4:
* drop 'goto err' in bpf_sk_storage_clone (Yonghong Song)
* add comment about race with bpf_sk_storage_map_free to the
  bpf_sk_storage_clone side as well (Daniel Borkmann)

v3:
* make sure BPF_F_NO_PREALLOC is always present when creating
  a map (Martin KaFai Lau)
* don't call bpf_sk_storage_free explicitly, rely on
  sk_free_unlock_clone to do the cleanup (Martin KaFai Lau)

v2:
* remove spinlocks around selem_link_map/sk (Martin KaFai Lau)
* BPF_F_CLONE on a map, not selem (Martin KaFai Lau)
* hold a map while cloning (Martin KaFai Lau)
* use BTF maps in selftests (Yonghong Song)
* do proper cleanup selftests; don't call close(-1) (Yonghong Song)
* export bpf_map_inc_not_zero
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

8e46c353

selftests/bpf: add sockopt clone/inheritance test · c3bbf176

Stanislav Fomichev authored Aug 14, 2019

Add a test that calls setsockopt on the listener socket which triggers
BPF program. This BPF program writes to the sk storage and sets
clone flag. Make sure that sk storage is cloned for a newly
accepted connection.

We have two cloned maps in the tests to make sure we hit both cases
in bpf_sk_storage_clone: first element (sk_storage_alloc) and
non-first element(s) (selem_link_map).

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

c3bbf176

bpf: sync bpf.h to tools/ · 9e819ffc

Stanislav Fomichev authored Aug 14, 2019

Sync new sk storage clone flag.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

9e819ffc

bpf: support cloning sk storage on accept() · 8f51dfc7

Stanislav Fomichev authored Aug 14, 2019

Add new helper bpf_sk_storage_clone which optionally clones sk storage
and call it from sk_clone_lock.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

8f51dfc7

bpf: export bpf_map_inc_not_zero · b0e4701c

Stanislav Fomichev authored Aug 14, 2019

Rename existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
indicate that it's caller's responsibility to do proper locking.
Create and export bpf_map_inc_not_zero wrapper that properly
locks map_idr_lock. Will be used in the next commit to
hold a map while cloning a socket.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

b0e4701c

selftests/bpf: fix race in test_tcp_rtt test · fae55527

Petar Penkov authored Aug 16, 2019

There is a race in this test between receiving the ACK for the
single-byte packet sent in the test, and reading the values from the
map.

This patch fixes this by having the client wait until there are no more
unacknowledged packets.

Before:
for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
< trimmed error messages >
993

After:
for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
10000

Fixes: b5587398 ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

fae55527

libbpf: relicense bpf_helpers.h and bpf_endian.h · 929ffa6e

Andrii Nakryiko authored Aug 15, 2019

bpf_helpers.h and bpf_endian.h contain useful macros and BPF helper
definitions essential to almost every BPF program. Which makes them
useful not just for selftests. To be able to expose them as part of
libbpf, though, we need them to be dual-licensed as LGPL-2.1 OR
BSD-2-Clause. This patch updates licensing of those two files.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hechao Li <hechaol@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Adam Barth <arb@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Adrian Ratiu <adrian.ratiu@collabora.com>
Acked-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Teng Qin <palmtenor@gmail.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Michal Rostecki <mrostecki@opensuse.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

929ffa6e

net: Don't call XDP_SETUP_PROG when nothing is changed · c14a9f63

Maxim Mikityanskiy authored Aug 14, 2019

Don't uninstall an XDP program when none is installed, and don't install
an XDP program that has the same ID as the one already installed.

dev_change_xdp_fd doesn't perform any checks in case it uninstalls an
XDP program. It means that the driver's ndo_bpf can be called with
XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This
case happens if the user runs `ip link set eth0 xdp off` when there is
no XDP program attached.

The symmetrical case is possible when the user tries to set the program
that is already set.

The drivers typically perform some heavy operations on XDP_SETUP_PROG,
so they all have to handle these cases internally to return early if
they happen. This patch puts this check into the kernel code, so that
all drivers will benefit from it.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

c14a9f63

Merge branch 'bpf-af-xdp-wakeup' · c8186c80

Daniel Borkmann authored Aug 17, 2019

Magnus Karlsson says:

====================
This patch set adds support for a new flag called need_wakeup in the
AF_XDP Tx and fill rings. When this flag is set by the driver, it
means that the application has to explicitly wake up the kernel Rx
(for the bit in the fill ring) or kernel Tx (for bit in the Tx ring)
processing by issuing a syscall. Poll() can wake up both and sendto()
will wake up Tx processing only.

The main reason for introducing this new flag is to be able to
efficiently support the case when application and driver is executing
on the same core. Previously, the driver was just busy-spinning on the
fill ring if it ran out of buffers in the HW and there were none to
get from the fill ring. This approach works when the application and
driver is running on different cores as the application can replenish
the fill ring while the driver is busy-spinning. Though, this is a
lousy approach if both of them are running on the same core as the
probability of the fill ring getting more entries when the driver is
busy-spinning is zero. With this new feature the driver now sets the
need_wakeup flag and returns to the application. The application can
then replenish the fill queue and then explicitly wake up the Rx
processing in the kernel using the syscall poll(). For Tx, the flag is
only set to one if the driver has no outstanding Tx completion
interrupts. If it has some, the flag is zero as it will be woken up by
a completion interrupt anyway. This flag can also be used in other
situations where the driver needs to be woken up explicitly.

As a nice side effect, this new flag also improves the Tx performance
of the case where application and driver are running on two different
cores as it reduces the number of syscalls to the kernel. The kernel
tells user space if it needs to be woken up by a syscall, and this
eliminates many of the syscalls. The Rx performance of the 2-core case
is on the other hand slightly worse, since there is a need to use a
syscall now to wake up the driver, instead of the driver
busy-spinning. It does waste less CPU cycles though, which might lead
to better overall system performance.

This new flag needs some simple driver support. If the driver does not
support it, the Rx flag is always zero and the Tx flag is always
one. This makes any application relying on this feature default to the
old behavior of not requiring any syscalls in the Rx path and always
having to call sendto() in the Tx path.

For backwards compatibility reasons, this feature has to be explicitly
turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
that you always turn it on as it has a large positive performance
impact for the one core case and does not degrade 2 core performance
and actually improves it for Tx heavy workloads.

Here are some performance numbers measured on my local,
non-performance optimized development system. That is why you are
seeing numbers lower than the ones from Björn and Jesper. 64 byte
packets at 40Gbit/s line rate. All results in Mpps. Cores == 1 means
that both application and driver is executing on the same core. Cores
== 2 that they are on different cores.

                              Applications
need_wakeup  cores    txpush    rxdrop      l2fwd
---------------------------------------------------------------
     n         1       0.07      0.06        0.03
     y         1       21.6      8.2         6.5
     n         2       32.3      11.7        8.7
     y         2       33.1      11.7        8.7

Overall, the need_wakeup flag provides the same or better performance
in all the micro-benchmarks. The reduction of sendto() calls in txpush
is large. Only a few per second is needed. For l2fwd, the drop is 50%
for the 1 core case and more than 99.9% for the 2 core case. Do not
know why I am not seeing the same drop for the 1 core case yet.

The name and inspiration of the flag has been taken from io_uring by
Jens Axboe. Details about this feature in io_uring can be found in
http://kernel.dk/io_uring.pdf, section 8.3. It also addresses most of
the denial of service and sendto() concerns raised by Maxim
Mikityanskiy in https://www.spinics.net/lists/netdev/msg554657.html.

The typical Tx part of an application will have to change from:

ret = sendto(fd,....)

to:

if (xsk_ring_prod__needs_wakeup(&xsk->tx))
       ret = sendto(fd,....)

and th Rx part from:

rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
if (!rcvd)
       return;

to:

rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
if (!rcvd) {
       if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
              ret = poll(fd,.....);
       return;
}

v3 -> v4:
* Maxim found a possible race in the Tx part of the driver. The
  setting of the flag needs to happen before the sending, otherwise it
  might trigger this race. Fixed in ixgbe and i40e driver.
* Mellanox support contributed by Maxim
* Removed the XSK_DRV_CAN_SLEEP flag as it was not used
  anymore. Thanks to Sridhar for discovering this.
* For consistency the feature is now always called need_wakeup. There
  were some places where it was referred to as might_sleep, but they
  have been removed. Thanks to Sridhar for spotting.
* Fixed some typos in the commit messages

v2 -> v3:
* Converted the Mellanox driver to the new ndo in patch 1 as pointed
  out by Maxim
* Fixed the compatibility code of XDP_MMAP_OFFSETS so it now works.

v1 -> v2:
* Fixed bisectability problem pointed out by Jakub
* Added missing initiliztion of the Tx need_wakeup flag to 1

This patch has been applied against commit b753c5a7 ("Merge branch 'r8152-RX-improve'")

Structure of the patch set:

Patch 1: Replaces the ndo_xsk_async_xmit with ndo_xsk_wakeup to
         support waking up both Rx and Tx processing
Patch 2: Implements the need_wakeup functionality in common code
Patch 3-4: Add need_wakeup support to the i40e and ixgbe drivers
Patch 5: Add need_wakeup support to libbpf
Patch 6: Add need_wakeup support to the xdpsock sample application
Patch 7-8: Add need_wakeup support to the Mellanox mlx5 driver
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

c8186c80

net/mlx5e: Add AF_XDP need_wakeup support · a7bd4018

Maxim Mikityanskiy authored Aug 14, 2019

This commit adds support for the new need_wakeup feature of AF_XDP. The
applications can opt-in by using the XDP_USE_NEED_WAKEUP bind() flag.
When this feature is enabled, some behavior changes:

RX side: If the Fill Ring is empty, instead of busy-polling, set the
flag to tell the application to kick the driver when it refills the Fill
Ring.

TX side: If there are pending completions or packets queued for
transmission, set the flag to tell the application that it can skip the
sendto() syscall and save time.

The performance testing was performed on a machine with the following
configuration:

- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link

The results with retpoline disabled:

       | without need_wakeup  | with need_wakeup     |
       |----------------------|----------------------|
       | one core | two cores | one core | two cores |
-------|----------|-----------|----------|-----------|
txonly | 20.1     | 33.5      | 29.0     | 34.2      |
rxdrop | 0.065    | 14.1      | 12.0     | 14.1      |
l2fwd  | 0.032    | 7.3       | 6.6      | 7.2       |

"One core" means the application and NAPI run on the same core. "Two
cores" means they are pinned to different cores.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

a7bd4018

net/mlx5e: Move the SW XSK code from NAPI poll to a separate function · 871aa189

Maxim Mikityanskiy authored Aug 14, 2019

Two XSK tasks are performed during NAPI polling, that are not bound to
hardware interrupts: TXing packets and polling for frames in the Fill
Ring. They are special in a way that the hardware doesn't know about
these tasks, so it doesn't trigger interrupts if there is still some
work to be done, it's our driver's responsibility to ensure NAPI will be
rescheduled if needed.

Create a new function to handle these tasks and move the corresponding
code from mlx5e_napi_poll to the new function to improve modularity and
prepare for the changes in the following patch.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

871aa189

samples/bpf: add use of need_wakeup flag in xdpsock · 46738f73

Magnus Karlsson authored Aug 14, 2019

This commit adds using the need_wakeup flag to the xdpsock sample
application. It is turned on by default as we think it is a feature
that seems to always produce a performance benefit, if the application
has been written taking advantage of it. It can be turned off in the
sample app by using the '-m' command line option.

The txpush and l2fwd sub applications have also been updated to
support poll() with multiple sockets.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

46738f73