Commits · 4519efa6f8ea343e43ade21b0189b0b295439202 · Kirill Smelkov / linux

20 Apr, 2019 1 commit

libbpf: fix BPF_LOG_BUF_SIZE off-by-one error · 4519efa6

McCabe, Robert J authored Apr 19, 2019

The BPF_PROG_LOAD condition for kernel version <= 5.1 is

   log->len_total > UINT_MAX >> 8 /* (16 * 1024 * 1024) - 1 */
Signed-off-by: McCabe, Robert J <robert.mccabe@rockwellcollins.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

4519efa6

18 Apr, 2019 11 commits

bpf: move BPF_PROG_TYPE_FLOW_DISSECTOR documentation to a new common place · 80695946

Stanislav Fomichev authored Apr 18, 2019

In commit da703149 ("bpf: Document BPF_PROG_TYPE_CGROUP_SYSCTL")
Andrey proposes to put per-prog type docs under Documentation/bpf/

Let's move flow dissector documentation there as well.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

80695946

bpf: Increase MAX_NR_MAPS to 17 in test_verifier.c · 849f257f

Martin KaFai Lau authored Apr 18, 2019

map_fds[16] is the last one index-ed by fixup_map_array_small.
Hence, the MAX_NR_MAPS should be 17 instead.

Fixes: fb2abb73 ("bpf, selftest: test {rd, wr}only flags and direct value access")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

849f257f

selftests/bpf: fix compile errors due to unsync linux/in6.h and netinet/in.h · 5de35e3a

Wang YanQing authored Apr 19, 2019

I meet below compile errors:
"
In file included from test_tcpnotify_kern.c:12:
/usr/include/netinet/in.h:101:5: error: expected identifier
    IPPROTO_HOPOPTS = 0,   /* IPv6 Hop-by-Hop options.  */
    ^
/usr/include/linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS'
                                ^
In file included from test_tcpnotify_kern.c:12:
/usr/include/netinet/in.h:103:5: error: expected identifier
    IPPROTO_ROUTING = 43,  /* IPv6 routing header.  */
    ^
/usr/include/linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING'
                                ^
In file included from test_tcpnotify_kern.c:12:
/usr/include/netinet/in.h:105:5: error: expected identifier
    IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header.  */
    ^
/usr/include/linux/in6.h:133:26: note: expanded from macro 'IPPROTO_FRAGMENT'
"
The same compile errors are reported for test_tcpbpf_kern.c too.

My environment:
lsb_release -a:
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.6 LTS
Release:        16.04
Codename:       xenial

dpkg -l | grep libc-dev:
ii  libc-dev-bin              2.23-0ubuntu11           amd64        GNU C Library: Development binaries
ii  linux-libc-dev:amd64      4.4.0-145.171            amd64        Linux Kernel Headers for development.

The reason is linux/in6.h and netinet/in.h aren't synchronous about how to
handle the same definitions, IPPROTO_HOPOPTS, etc.

This patch fixes the compile errors by moving <netinet/in.h> to before the
<linux/*.h>.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

5de35e3a

libbpf: remove compile time warning from libbpf_util.h · 79b1b30e

Magnus Karlsson authored Apr 18, 2019

Having a helpful compile time warning in libbpf_util.h is not a good
idea since all warnings are treated as errors. Change this into a
comment in the code instead.

Fixes: b7e3a280 ("libbpf: remove dependency on barrier.h in xsk.h")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

79b1b30e

bpf: Document BPF_PROG_TYPE_CGROUP_SYSCTL · da703149

Andrey Ignatov authored Apr 17, 2019

Add documentation for BPF_PROG_TYPE_CGROUP_SYSCTL, including general
info, attach type, context, return code, helpers, example and usage
considerations.

A separate file prog_cgroup_sysctl.rst is added to Documentation/bpf/.

In the future more program types can be documented in their own
prog_<name>.rst files.

Another way to place program type specific documentation would be to
group program types somehow (e.g. cgroup.rst for all cgroup-bpf
programs), but it may not scale well since some program types may belong
to different groups, e.g. BPF_PROG_TYPE_CGROUP_SKB can be documented
together with either cgroup-bpf programs or programs that access skb.

The new file is added to the index and verified by `make htmldocs` /
sanity-check by lynx.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

da703149

selftests/bpf: fix a compilation error · ba02de1a

Yonghong Song authored Apr 17, 2019

I hit the following compilation error with gcc 4.8.5.

  prog_tests/flow_dissector.c: In function ‘test_flow_dissector’:
  prog_tests/flow_dissector.c:155:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
    for (int i = 0; i < ARRAY_SIZE(tests); i++) {
    ^
  prog_tests/flow_dissector.c:155:2: note: use option -std=c99 or -std=gnu99 to compile your code

Let us fix the issue by avoiding this particular c99 feature.

Fixes: a5cb3346 ("selftests/bpf: make flow dissector tests more extensible")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ba02de1a

Merge branch 'bulk-cpumap-redirect' · 193d0002

Alexei Starovoitov authored Apr 17, 2019

Jesper Dangaard Brouer says:

====================
This patchset utilize a number of different kernel bulk APIs for optimizing
the performance for the XDP cpumap redirect feature.

Benchmark details are available here:
 https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org

Performance measurements can be considered micro benchmarks, as they measure
dropping packets at different stages in the network stack.
Summary based on above:

Baseline benchmarks
- baseline-redirect: UdpNoPorts: 3,180,074
- baseline-redirect: iptables-raw drop: 6,193,534

Patch1: bpf: cpumap use ptr_ring_consume_batched
- redirect: UdpNoPorts: 3,327,729
- redirect: iptables-raw drop: 6,321,540

Patch2: net: core: introduce build_skb_around
- redirect: UdpNoPorts: 3,221,303
- redirect: iptables-raw drop: 6,320,066

Patch3: bpf: cpumap do bulk allocation of SKBs
- redirect: UdpNoPorts: 3,290,563
- redirect: iptables-raw drop: 6,650,112

Patch4: bpf: cpumap memory prefetchw optimizations for struct page
- redirect: UdpNoPorts: 3,520,250
- redirect: iptables-raw drop: 7,649,604

In this V2 submission I have chosen drop the SKB-list patch using
netif_receive_skb_list() as it was not showing a performance improvement for
these micro benchmarks.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

193d0002

bpf: cpumap memory prefetchw optimizations for struct page · 86d23145

Jesper Dangaard Brouer authored Apr 12, 2019

A lot of the performance gain comes from this patch.

While analysing performance overhead it was found that the largest CPU
stalls were caused when touching the struct page area. It is first read with
a READ_ONCE from build_skb_around via page_is_pfmemalloc(), and when freed
written by page_frag_free() call.

Measurements show that the prefetchw (W) variant operation is needed to
achieve the performance gain. We believe this optimization it two fold,
first the W-variant saves one step in the cache-coherency protocol, and
second it helps us to avoid the non-temporal prefetch HW optimizations and
bring this into all cache-levels. It might be worth investigating if
prefetch into L2 will have the same benefit.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

86d23145

bpf: cpumap do bulk allocation of SKBs · 8f0504a9

Jesper Dangaard Brouer authored Apr 12, 2019

As cpumap now batch consume xdp_frame's from the ptr_ring, it knows how many
SKBs it need to allocate. Thus, lets bulk allocate these SKBs via
kmem_cache_alloc_bulk() API, and use the previously introduced function
build_skb_around().

Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the
memory for us. This does clear a larger area than needed, but my micro
benchmarks on Intel CPUs show that this is slightly faster due to being a
cacheline aligned area is cleared for the SKBs. (For SLUB allocator, there
is a future optimization potential, because SKBs will with high probability
originate from same page. If we can find/identify continuous memory areas
then the Intel CPU memset rep stos will have a real performance gain.)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

8f0504a9

net: core: introduce build_skb_around · ba0509b6

Jesper Dangaard Brouer authored Apr 12, 2019

The function build_skb() also have the responsibility to allocate and clear
the SKB structure. Introduce a new function build_skb_around(), that moves
the responsibility of allocation and clearing to the caller. This allows
caller to use kmem_cache (slab/slub) bulk allocation API.

Next patch use this function combined with kmem_cache_alloc_bulk.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ba0509b6

bpf: cpumap use ptr_ring_consume_batched · 77361825

Jesper Dangaard Brouer authored Apr 12, 2019

Move ptr_ring dequeue outside loop, that allocate SKBs and calls network
stack, as these operations that can take some time. The ptr_ring is a
communication channel between CPUs, where we want to reduce/limit any
cacheline bouncing.

Do a concentrated bulk dequeue via ptr_ring_consume_batched, to shorten the
period and times the remote cacheline in ptr_ring is read

Batch size 8 is both to (1) limit BH-disable period, and (2) consume one
cacheline on 64-bit archs. After reducing the BH-disable section further
then we can consider changing this, while still thinking about L1 cacheline
size being active.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

77361825

17 Apr, 2019 13 commits

Merge branch 'af_xdp-smp_mb-fixes' · 00967e84

Alexei Starovoitov authored Apr 16, 2019

Magnus Karlsson says:

====================
This patch set fixes one bug and removes two dependencies on Linux
kernel headers from the XDP socket code in libbpf. A number of people
have pointed out that these two dependencies make it hard to build the
XDP socket part of libbpf without any kernel header dependencies. The
two removed dependecies are:

* Remove the usage of likely and unlikely (compiler.h) in xsk.h. It
  has been reported that the use of these actually decreases the
  performance of the ring access code due to an increase in
  instruction cache misses, so let us just remove these.

* Remove the dependency on barrier.h as it brings in a lot of kernel
  headers. As the XDP socket code only uses two simple functions from
  it, we can reimplement these. As a bonus, the new implementation is
  faster as it uses the same barrier primitives as the kernel does
  when the same code is compiled there. Without this patch, the user
  land code uses lfence and sfence on x86, which are unnecessarily
  harsh/thorough.

In the process of removing these dependencies a missing barrier
function for at least PPC64 was discovered. For a full explanation on
the missing barrier, please refer to patch 1. So the patch set now
starts with two patches fixing this. I have also added a patch at the
end removing this full memory barrier for x86 only, as it is not
needed there.

Structure of the patch set:
Patch 1-2: Adds the missing barrier function in kernel and user space.
Patch 3-4: Removes the dependencies
Patch 5: Optimizes the added barrier from patch 2 so that it does not
         do unnecessary work on x86.

v2 -> v3:
* Added missing memory barrier in ring code
* Added an explanation on the three barriers we use in the code
* Moved barrier functions from xsk.h to libbpf_util.h
* Added comment on why we have these functions in libbpf_util.h
* Added a new barrier function in user space that makes it possible to
  remove the full memory barrier on x86.

v1 -> v2:
* Added comment about validity of ARM 32-bit barriers.
  Only armv7 and above.

/Magnus
====================
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

00967e84

libbpf: optimize barrier for XDP socket rings · 2c5935f1

Magnus Karlsson authored Apr 16, 2019

The full memory barrier in the XDP socket rings on the consumer side
between the load of the data and the store of the consumer ring is
there to protect the store from being executed before the load of the
data. If this was allowed to happen, the producer might overwrite the
data field with a new entry before the consumer got the chance to read
it.

On x86, stores are guaranteed not to be reordered with older loads, so
it does not need a full memory barrier here. A compile time barrier
would be enough. This patch introdcues a new primitive in
libbpf_util.h that implements a new barrier type (libbpf_smp_rwmb)
hindering stores to be reordered with older loads. It is then used in
the XDP socket ring access code in libbpf to improve performance.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2c5935f1

libbpf: remove dependency on barrier.h in xsk.h · b7e3a280

Magnus Karlsson authored Apr 16, 2019

The use of smp_rmb() and smp_wmb() creates a Linux header dependency
on barrier.h that is unnecessary in most parts. This patch implements
the two small defines that are needed from barrier.h. As a bonus, the
new implementations are faster than the default ones as they default
to sfence and lfence for x86, while we only need a compiler barrier in
our case. Just as it is when the same ring access code is compiled in
the kernel.

Fixes: 1cad0788 ("libbpf: add support for using AF_XDP sockets")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

b7e3a280

libbpf: remove likely/unlikely in xsk.h · a06d7296

Magnus Karlsson authored Apr 16, 2019

This patch removes the use of likely and unlikely in xsk.h since they
create a dependency on Linux headers as reported by several
users. There have also been reports that the use of these decreases
performance as the compiler puts the code on two different cache lines
instead of on a single one. All in all, I think we are better off
without them.

Fixes: 1cad0788 ("libbpf: add support for using AF_XDP sockets")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

a06d7296

libbpf: fix XDP socket ring buffer memory ordering · d5e63fdd

Magnus Karlsson authored Apr 16, 2019

The ring buffer code of XDP sockets is missing a memory barrier on the
consumer side between the load of the data and the write that signals
that it is ok for the producer to put new data into the buffer. On
architectures that does not guarantee that stores are not reordered
with older loads, the producer might put data into the ring before the
consumer had the chance to read it. As IA does guarantee this
ordering, it would only need a compiler barrier here, but there are no
primitives in barrier.h for this specific case (hinder writes to be ordered
before older reads) so I had to add a smp_mb() here which will
translate into a run-time synch operation on IA.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d5e63fdd

xsk: fix XDP socket ring buffer memory ordering · f63666de

Magnus Karlsson authored Apr 16, 2019

The ring buffer code of XDP sockets is missing a memory barrier on the
consumer side between the load of the data and the write that signals
that it is ok for the producer to put new data into the buffer. On
architectures that does not guarantee that stores are not reordered
with older loads, the producer might put data into the ring before the
consumer had the chance to read it. As IA does guarantee this
ordering, it would only need a compiler barrier here, but there are no
primitives in Linux for this specific case (hinder writes to be ordered
before older reads) so I had to add a smp_mb() here which will
translate into a run-time synch operation on IA.

Added a longish comment in the code explaining what each barrier in
the ring implementation accomplishes and what would happen if we
removed one of them.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

f63666de

tools/bpftool: show btf_id in map listing · d1b7725d

Prashant Bhole authored Apr 17, 2019

Let's print btf id of map similar to the way we are printing it
for programs.

Sample output:
user@test# bpftool map -f
61: lpm_trie  flags 0x1
	key 20B  value 8B  max_entries 1  memlock 4096B
133: array  name test_btf_id  flags 0x0
	key 4B  value 4B  max_entries 4  memlock 4096B
	pinned /sys/fs/bpf/test100
	btf_id 174
170: array  name test_btf_id  flags 0x0
	key 4B  value 4B  max_entries 4  memlock 4096B
	btf_id 240
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d1b7725d

tools/bpftool: re-organize newline printing for map listing · d459b59e

Prashant Bhole authored Apr 17, 2019

Let's move the final newline printing in show_map_close_plain() at
the end of the function because it looks correct and consistent with
prog.c. Also let's do related changes for the line which prints
pinned file name.
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d459b59e

bpftool: Support sysctl hook · f25377ee

Andrey Ignatov authored Apr 16, 2019

Add support for recently added BPF_PROG_TYPE_CGROUP_SYSCTL program type
and BPF_CGROUP_SYSCTL attach type.

Example of bpftool output with sysctl program from selftests:

  # bpftool p load ./test_sysctl_prog.o /mnt/bpf/sysctl_prog type cgroup/sysctl
  # bpftool p l
  9: cgroup_sysctl  name sysctl_tcp_mem  tag 0dd05f81a8d0d52e  gpl
          loaded_at 2019-04-16T12:57:27-0700  uid 0
          xlated 1008B  jited 623B  memlock 4096B
  # bpftool c a /mnt/cgroup2/bla sysctl id 9
  # bpftool c t
  CgroupPath
  ID       AttachType      AttachFlags     Name
  /mnt/cgroup2/bla
      9        sysctl                          sysctl_tcp_mem
  # bpftool c d /mnt/cgroup2/bla sysctl id 9
  # bpftool c t
  CgroupPath
  ID       AttachType      AttachFlags     Name
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

f25377ee

libbpf: fix printf formatter for ptrdiff_t argument · e1d1dc46

Andrii Nakryiko authored Apr 16, 2019

Using %ld for printing out value of ptrdiff_t type is not portable
between 32-bit and 64-bit archs. This is causing compilation errors for
libbpf on 32-bit platform (discovered as part of an effort to integrate
libbpf into systemd ([0])). Proper formatter is %td, which is used in
this patch.

v2->v1:
  - add Reported-by
  - provide more context on how this issue was discovered

[0] https://github.com/systemd/systemd/pull/12151Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e1d1dc46

bpf: use BPF_CAST_CALL for casting bpf call · 0d306c31

Prashant Bhole authored Apr 16, 2019

verifier.c uses BPF_CAST_CALL for casting bpf call except at one
place in jit_subprogs(). Let's use the macro for consistency.
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

0d306c31

bpf: allow clearing all sock_ops callback flags · 725721a6

Viet Hoang Tran authored Apr 15, 2019

The helper function bpf_sock_ops_cb_flags_set() can be used to both
set and clear the sock_ops callback flags. However, its current
behavior is not consistent. BPF program may clear a flag if more than
one were set, or replace a flag with another one, but cannot clear all
flags.

This patch also updates the documentation to clarify the ability to
clear flags of this helper function.
Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

725721a6

selftests: bpf: add VRF test cases to lwt_ip_encap test. · 809041e7

Peter Oskolkov authored Apr 03, 2019

This patch adds tests validating that VRF and BPF-LWT
encap work together well, as requested by David Ahern.
Signed-off-by: Peter Oskolkov <posk@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

809041e7

16 Apr, 2019 15 commits

bpf: add map helper functions push, pop, peek in more BPF programs · 02a8c817

Alban Crequy authored Apr 14, 2019

commit f1a2e44a ("bpf: add queue and stack maps") introduced new BPF
helper functions:
- BPF_FUNC_map_push_elem
- BPF_FUNC_map_pop_elem
- BPF_FUNC_map_peek_elem

but they were made available only for network BPF programs. This patch
makes them available for tracepoint, cgroup and lirc programs.
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

02a8c817

selftests/bpf: make flow dissector tests more extensible · a5cb3346

Stanislav Fomichev authored Apr 12, 2019

Rewrite selftest to iterate over an array with input packet and
expected flow_keys. This should make it easier to extend this test
with additional cases without too much boilerplate.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

a5cb3346

selftests/bpf: two scale tests · 08de198c

Alexei Starovoitov authored Apr 12, 2019

Add two tests to check that sequence of 1024 jumps is verifiable.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

08de198c

bpftool: Improve handling of ENOSPC on reuseport_array map dumps · 3da6e7e4

Benjamin Poirier authored Apr 15, 2019

avoids outputting a series of
	value:
	No space left on device

The value itself is not wrong but bpf_fd_reuseport_array_lookup_elem() can
only return it if the map was created with value_size = 8. There's nothing
bpftool can do about it. Instead of repeating this error for every key in
the map, print an explanatory warning and a specialized error.

example before:
key: 00 00 00 00
value:
No space left on device
key: 01 00 00 00
value:
No space left on device
key: 02 00 00 00
value:
No space left on device
Found 0 elements

example after:
Warning: cannot read values from reuseport_sockarray map with value_size != 8
key: 00 00 00 00  value: <cannot read>
key: 01 00 00 00  value: <cannot read>
key: 02 00 00 00  value: <cannot read>
Found 0 elements
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

3da6e7e4

bpftool: Use print_entry_error() in case of ENOENT when dumping · 0478c3bf

Benjamin Poirier authored Apr 15, 2019

Commit bf598a8f ("bpftool: Improve handling of ENOENT on map dumps")
used print_entry_plain() in case of ENOENT. However, that commit introduces
dead code. Per-cpu maps are zero-filled. When reading them, it's all or
nothing. There will never be a case where some cpus have an entry and
others don't.

The truth is that ENOENT is an error case. Use print_entry_error() to
output the desired message. That function's "value" parameter is also
renamed to indicate that we never use it for an actual map value.

The output format is unchanged.
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

0478c3bf

tools: bpftool: add a note on program statistics in man page · 25df480d

Quentin Monnet authored Apr 12, 2019

Linux kernel now supports statistics for BPF programs, and bpftool is
able to dump them. However, these statistics are not enabled by default,
and administrators may not know how to access them.

Add a paragraph in bpftool documentation, under the description of the
"bpftool prog show" command, to explain that such statistics are
available and that their collection is controlled via a dedicated sysctl
knob.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

25df480d

tools: bpftool: fix short option name for printing version in man pages · 88b3eed8

Quentin Monnet authored Apr 12, 2019

Manual pages would tell that option "-v" (lower case) would print the
version number for bpftool. This is wrong: the short name of the option
is "-V" (upper case). Fix the documentation accordingly.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

88b3eed8

tools: bpftool: fix man page documentation for "pinmaps" keyword · 9a487883

Quentin Monnet authored Apr 12, 2019

The "pinmaps" keyword is present in the man page, in the verbose
description of the "bpftool prog load" command. However, it is missing
from the summary of available commands at the beginning of the file. Add
it there as well.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

9a487883

tools: bpftool: reset errno for "bpftool cgroup tree" · 39c9f106

Quentin Monnet authored Apr 12, 2019

When trying to dump the tree of all cgroups under a given root node,
bpftool attempts to query programs of all available attach types. Some
of those attach types do not support queries, therefore several of the
calls are actually expected to fail.

Those calls set errno to EINVAL, which has no consequence for dumping
the rest of the tree. It does have consequences however if errno is
inspected at a later time. For example, bpftool batch mode relies on
errno to determine whether a command has succeeded, and whether it
should carry on with the next command. Setting errno to EINVAL when
everything worked as expected would therefore make such command fail:

        # echo 'cgroup tree \n net show' | \
                bpftool batch file -

To improve this, reset errno when its value is EINVAL after attempting
to show programs for all existing attach types in do_show_tree_fn().
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

39c9f106

tools: bpftool: remove blank line after btf_id when listing programs · 031ebc1a

Quentin Monnet authored Apr 12, 2019

Commit 569b0c77 ("tools/bpftool: show btf id in program information")
made bpftool print an empty line after each program entry when listing
the BPF programs loaded on the system (plain output). This is especially
confusing when some programs have an associated BTF id, and others
don't. Let's remove the blank line.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

031ebc1a

bpf: reserve flags in bpf_skb_net_shrink · 43537b8e

Willem de Bruijn authored Apr 12, 2019

The ENCAP flags in bpf_skb_adjust_room are ignored on decap with
bpf_skb_net_shrink. Reserve these bits for future use.

Fixes: 868d5235 ("bpf: add bpf_skb_adjust_room encap flags")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

43537b8e

bpf: fix whitespace for ENCAP_L2 defines in bpf.h · bfb35c27

Alan Maguire authored Apr 12, 2019

replace tab after #define with space in line with rest of definitions
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bfb35c27

selftests/bpf: bring back (void *) cast to set_ipv4_csum in test_tc_tunnel · bcbccad6

Stanislav Fomichev authored Apr 11, 2019

It was removed in commit 166b5a7f ("selftests_bpf: extend
test_tc_tunnel for UDP encap") without any explanation.

Otherwise I see:
progs/test_tc_tunnel.c:160:17: warning: taking address of packed member 'ip' of class or structure
      'v4hdr' may result in an unaligned pointer value [-Waddress-of-packed-member]
        set_ipv4_csum(&h_outer.ip);
                       ^~~~~~~~~~
1 warning generated.

Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Fixes: 166b5a7f ("selftests_bpf: extend test_tc_tunnel for UDP encap")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bcbccad6

selftests/btf: add VAR and DATASEC case for dedup tests · efb2ddc4

Andrii Nakryiko authored Apr 15, 2019

Add test case verifying that dedup happens (INTs are deduped in this
case) and VAR/DATASEC types are not deduped, but have their referenced
type IDs adjusted correctly.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Yonghong Song <yhs@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

efb2ddc4

btf: add support for VAR and DATASEC in btf_dedup() · 189cf5a4

Andrii Nakryiko authored Apr 15, 2019

This patch adds support for VAR and DATASEC in btf_dedup(). VAR/DATASEC
are never deduplicated, but they need to be processed anyway as types
they refer to might need to be remapped due to deduplication and
compaction.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Yonghong Song <yhs@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

189cf5a4