Commits · ffc8b144d5d056dd0ab8d995c7345cdd6a589fc7 · Kirill Smelkov / linux

31 May, 2019 8 commits

bpf: add memlock precharge check for cgroup_local_storage · ffc8b144

Roman Gushchin authored May 29, 2019

Cgroup local storage maps lack the memlock precharge check,
which is performed before the memory allocation for
most other bpf map types.

Let's add it in order to unify all map types.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ffc8b144

Merge branch 'propagate-cn-to-tcp' · 576240cf

Alexei Starovoitov authored May 31, 2019

Lawrence Brakmo says:

====================
This patchset adds support for propagating congestion notifications (cn)
to TCP from cgroup inet skb egress BPF programs.

Current cgroup skb BPF programs cannot trigger TCP congestion window
reductions, even when they drop a packet. This patch-set adds support
for cgroup skb BPF programs to send congestion notifications in the
return value when the packets are TCP packets. Rather than the
current 1 for keeping the packet and 0 for dropping it, they can
now return:
    NET_XMIT_SUCCESS    (0)    - continue with packet output
    NET_XMIT_DROP       (1)    - drop packet and do cn
    NET_XMIT_CN         (2)    - continue with packet output and do cn
    -EPERM                     - drop packet

Finally, HBM programs are modified to collect and return more
statistics.

There has been some discussion regarding the best place to manage
bandwidths. Some believe this should be done in the qdisc where it can
also be managed with a BPF program. We believe there are advantages
for doing it with a BPF program in the cgroup/skb callback. For example,
it reduces overheads in the cases where there is on primary workload and
one or more secondary workloads, where each workload is running on its
own cgroupv2. In this scenario, we only need to throttle the secondary
workloads and there is no overhead for the primary workload since there
will be no BPF program attached to its cgroup.

Regardless, we agree that this mechanism should not penalize those that
are not using it. We tested this by doing 1 byte req/reply RPCs over
loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs.
Each test was repeated 50 times with a 1 minute delay between each set
of 10. We then calculated the average RPCs/sec over the 50 tests. We
compare upstream with upstream + patchset and no BPF program as well
as upstream + patchset and a BPF program that just returns ALLOW_PKT.
Here are the results:

upstream                           80937 RPCs/sec
upstream + patches, no BPF program 80894 RPCs/sec
upstream + patches, BPF program    80634 RPCs/sec

These numbers indicate that there is no penalty for these patches

The use of congestion notifications improves the performance of HBM when
using Cubic. Without congestion notifications, Cubic will not decrease its
cwnd and HBM will need to drop a large percentage of the packets.

The following results are obtained for rate limits of 1Gbps,
between two servers using netperf, and only one flow. We also show how
reducing the max delayed ACK timer can improve the performance when
using Cubic.

Command used was:
  ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                   -s=<server running netserver>
  where:
     <rate>   is 1000
     --no_cn  specifies no cwr notifications
     dctcp    uses dctcp

                       Cubic                    DCTCP
Lim, DA      Mbps cwnd cred drops  Mbps cwnd cred drops
--------     ---- ---- ---- -----  ---- ---- ---- -----
  1G, 40       35  462 -320 67%     995    1 -212  0.05%
  1G, 40,cn   736    9  -78  0.07   995    1 -212  0.05
  1G,  5,cn   941    2 -189  0.13   995    1 -212  0.05

Notes:
  --no_cn has no effect with DCTCP
  Lim = rate limit
  DA = maximum delay ack timer
  cred = credit in packets
  drops = % packets dropped

v1->v2: Insures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3
        New egress values apply to all protocols, not just TCP
        Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers
        Removed changes to __tcp_transmit_skb (patch 5), no longer needed
        Removed sample use of EDT
v2->v3: Removed the probe timer related changes
v3->v4: Replaced preempt_enable_no_resched() by preempt_enable()
        in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro
====================
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

576240cf

bpf: Add more stats to HBM · d58c6f72

brakmo authored May 28, 2019

Adds more stats to HBM, including average cwnd and rtt of all TCP
flows, percents of packets that are ecn ce marked and distribution
of return values.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d58c6f72

bpf: Add cn support to hbm_out_kern.c · ffd81558

brakmo authored May 28, 2019

Update hbm_out_kern.c to support returning cn notifications.
Also updates relevant files to allow disabling cn notifications.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ffd81558

bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls · 956fe219

brakmo authored May 28, 2019

Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications from the BPF programs.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

956fe219

bpf: Update __cgroup_bpf_run_filter_skb with cn · e7a3160d

brakmo authored May 28, 2019

For egress packets, __cgroup_bpf_fun_filter_skb() will now call
BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY()
in order to propagate congestion notifications (cn) requests to TCP
callers.

For egress packets, this function can return:
   NET_XMIT_SUCCESS    (0)    - continue with packet output
   NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
   NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                to call cwr
   -EPERM                     - drop packet

For ingress packets, this function will return -EPERM if any attached
program was found and if it returned != 1 during execution. Otherwise 0
is returned.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e7a3160d

bpf: cgroup inet skb programs can return 0 to 3 · 5cf1e914

brakmo authored May 28, 2019

Allows cgroup inet skb programs to return values in the range [0, 3].
The second bit is used to deterine if congestion occurred and higher
level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr()

The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
at load time if it uses the new return values (i.e. 2 or 3).

The expected_attach_type is currently not enforced for
BPF_PROG_TYPE_CGROUP_SKB.  e.g Meaning the current bpf_prog with
expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to
BPF_CGROUP_INET_INGRESS.  Blindly enforcing expected_attach_type will
break backward compatibility.

This patch adds a enforce_expected_attach_type bit to only
enforce the expected_attach_type when it uses the new
return value.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

5cf1e914

bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY · 1f52f6c0

brakmo authored May 28, 2019

Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
__cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
request cwr for TCP packets.

Current cgroup skb programs can only return 0 or 1 (0 to drop the
packet. This macro changes the behavior so the low order bit
indicates whether the packet should be dropped (0) or not (1)
and the next bit is used for congestion notification (cn).

Hence, new allowed return values of CGROUP EGRESS BPF programs are:
  0: drop packet
  1: keep packet
  2: drop packet and call cwr
  3: keep packet and call cwr

This macro then converts it to one of NET_XMIT values or -EPERM
that has the effect of dropping the packet with no cn.
  0: NET_XMIT_SUCCESS  skb should be transmitted (no cn)
  1: NET_XMIT_DROP     skb should be dropped and cwr called
  2: NET_XMIT_CN       skb should be transmitted and cwr called
  3: -EPERM            skb should be dropped (no cn)

Note that when more than one BPF program is called, the packet is
dropped if at least one of programs requests it be dropped, and
there is cn if at least one program returns cn.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

1f52f6c0

29 May, 2019 15 commits

libbpf: reduce unnecessary line wrapping · 399dc65e

Andrii Nakryiko authored May 29, 2019

There are a bunch of lines of code or comments that are unnecessary
wrapped into multi-lines. Fix that without violating any code
guidelines.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

399dc65e

libbpf: typo and formatting fixes · 76e1022b

Andrii Nakryiko authored May 29, 2019

A bunch of typo and formatting fixes.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

76e1022b

libbpf: simplify two pieces of logic · 7e8c328c

Andrii Nakryiko authored May 29, 2019

Extra check for type is unnecessary in first case.

Extra zeroing is unnecessary, as snprintf guarantees that it will
zero-terminate string.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

7e8c328c

libbpf: use negative fd to specify missing BTF · fba01a06

Andrii Nakryiko authored May 29, 2019

0 is a valid FD, so it's better to initialize it to -1, as is done in
other places. Also, technically, BTF type ID 0 is valid (it's a VOID
type), so it's more reliable to check btf_fd, instead of
btf_key_type_id, to determine if there is any BTF associated with a map.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

fba01a06

libbpf: fix error code returned on corrupted ELF · f102154d

Andrii Nakryiko authored May 29, 2019

All of libbpf errors are negative, except this one. Fix it.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

f102154d

libbpf: check map name retrieved from ELF · c51829bb

Andrii Nakryiko authored May 29, 2019

Validate there was no error retrieving symbol name corresponding to
a BPF map.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

c51829bb

libbpf: simplify endianness check · 12ef5634

Andrii Nakryiko authored May 29, 2019

Rewrite endianness check to use "more canonical" way, using
compiler-defined macros, similar to few other places in libbpf. It also
is more obvious and shorter.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

12ef5634

libbpf: preserve errno before calling into user callback · be5c5d4e

Andrii Nakryiko authored May 29, 2019

pr_warning ultimately may call into user-provided callback function,
which can clobber errno value, so we need to save it before that.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

be5c5d4e

libbpf: fix detection of corrupted BPF instructions section · 8ca990ce

Andrii Nakryiko authored May 29, 2019

Ensure that size of a section w/ BPF instruction is exactly a multiple
of BPF instruction size.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

8ca990ce

libbpf: prevent overwriting of log_level in bpf_object__load_progs() · 501b125a

Quentin Monnet authored May 29, 2019

There are two functions in libbpf that support passing a log_level
parameter for the verifier for loading programs:
bpf_object__load_xattr() and bpf_prog_load_xattr(). Both accept an
attribute object containing the log_level, and apply it to the programs
to load.

It turns out that to effectively load the programs, the latter function
eventually relies on the former. This was not taken into account when
adding support for log_level in bpf_object__load_xattr(), and the
log_level passed to bpf_prog_load_xattr() later gets overwritten with a
zero value, thus disabling verifier logs for the program in all cases:

bpf_prog_load_xattr()             // prog->log_level = attr1->log_level;
-> bpf_object__load()             // attr2->log_level = 0;
   -> bpf_object__load_xattr()    // <pass prog and attr2>
      -> bpf_object__load_progs() // prog->log_level = attr2->log_level;

Fix this by OR-ing the log_level in bpf_object__load_progs(), instead of
overwriting it.

v2: Fix commit log description (confusion on function names in v1).

Fixes: 60276f98 ("libbpf: add bpf_object__load_xattr() API function to pass log_level")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

501b125a

bpf: tracing: properly use bpf_prog_array api · e672db03

Stanislav Fomichev authored May 28, 2019

Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

e672db03

bpf: cgroup: properly use bpf_prog_array api · dbcc1ba2

Stanislav Fomichev authored May 28, 2019

Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.

We also don't need __rcu annotations on cgroup_bpf.inactive since
it's not read/updated concurrently.

v4:
* drop cgroup_rcu_xyz wrappers and use rcu APIs directly; presumably
  should be more clear to understand which mutex/refcount protects
  each particular place

v3:
* amend cgroup_rcu_dereference to include percpu_ref_is_dying;
  cgroup_bpf is now reference counted and we don't hold cgroup_mutex
  anymore in cgroup_bpf_release

v2:
* replace xchg with rcu_swap_protected

Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

dbcc1ba2

bpf: media: properly use bpf_prog_array api · 02205d2e

Stanislav Fomichev authored May 28, 2019

Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.

Cc: linux-media@vger.kernel.org
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sean Young <sean@mess.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

02205d2e

bpf: remove __rcu annotations from bpf_prog_array · 54e9c9d4

Stanislav Fomichev authored May 28, 2019

Drop __rcu annotations and rcu read sections from bpf_prog_array
helper functions. They are not needed since all existing callers
call those helpers from the rcu update side while holding a mutex.
This guarantees that use-after-free could not happen.

In the next patches I'll fix the callers with missing
rcu_dereference_protected to make sparse/lockdep happy, the proper
way to use these helpers is:

	struct bpf_prog_array __rcu *progs = ...;
	struct bpf_prog_array *p;

	mutex_lock(&mtx);
	p = rcu_dereference_protected(progs, lockdep_is_held(&mtx));
	bpf_prog_array_length(p);
	bpf_prog_array_copy_to_user(p, ...);
	bpf_prog_array_delete_safe(p, ...);
	bpf_prog_array_copy_info(p, ...);
	bpf_prog_array_copy(p, ...);
	bpf_prog_array_free(p);
	mutex_unlock(&mtx);

No functional changes! rcu_dereference_protected with lockdep_is_held
should catch any cases where we update prog array without a mutex
(I've looked at existing call sites and I think we hold a mutex
everywhere).

Motivation is to fix sparse warnings:
kernel/bpf/core.c:1803:9: warning: incorrect type in argument 1 (different address spaces)
kernel/bpf/core.c:1803:9:    expected struct callback_head *head
kernel/bpf/core.c:1803:9:    got struct callback_head [noderef] <asn:4> *
kernel/bpf/core.c:1877:44: warning: incorrect type in initializer (different address spaces)
kernel/bpf/core.c:1877:44:    expected struct bpf_prog_array_item *item
kernel/bpf/core.c:1877:44:    got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1901:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1901:26:    expected struct bpf_prog_array_item *existing
kernel/bpf/core.c:1901:26:    got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1935:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1935:26:    expected struct bpf_prog_array_item *[assigned] existing
kernel/bpf/core.c:1935:26:    got struct bpf_prog_array_item [noderef] <asn:4> *

v2:
* remove comment about potential race; that can't happen
  because all callers are in rcu-update section

Cc: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

54e9c9d4

selftests/bpf: fix compilation error for flow_dissector.c · fe937ea1

Alan Maguire authored May 29, 2019

When building the tools/testing/selftest/bpf subdirectory,
(running both a local directory "make" and a
"make -C tools/testing/selftests/bpf") I keep hitting the
following compilation error:

prog_tests/flow_dissector.c: In function ‘create_tap’:
prog_tests/flow_dissector.c:150:38: error: ‘IFF_NAPI’ undeclared (first
use in this function)
   .ifr_flags = IFF_TAP | IFF_NO_PI | IFF_NAPI | IFF_NAPI_FRAGS,
                                      ^
prog_tests/flow_dissector.c:150:38: note: each undeclared identifier is
reported only once for each function it appears in
prog_tests/flow_dissector.c:150:49: error: ‘IFF_NAPI_FRAGS’ undeclared

Adding include/uapi/linux/if_tun.h to tools/include/uapi/linux
resolves the problem and ensures the compilation of the file
does not depend on having up-to-date kernel headers locally.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

fe937ea1

28 May, 2019 15 commits

Merge branch 'cgroup-auto-detach' · d0a3a4b2

Alexei Starovoitov authored May 28, 2019

Roman Gushchin says:

====================
This patchset implements a cgroup bpf auto-detachment functionality:
bpf programs are detached as soon as possible after removal of the
cgroup, without waiting for the release of all associated resources.

Patches 2 and 3 are required to implement a corresponding kselftest
in patch 4.

v5:
  1) rebase

v4:
  1) release cgroup bpf data using a workqueue
  2) add test_cgroup_attach to .gitignore

v3:
  1) some minor changes and typo fixes

v2:
  1) removed a bogus check in patch 4
  2) moved buf[len] = 0 in patch 2
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d0a3a4b2

selftests/bpf: add auto-detach test · d5506591

Roman Gushchin authored May 25, 2019

Add a kselftest to cover bpf auto-detachment functionality.
The test creates a cgroup, associates some resources with it,
attaches a couple of bpf programs and deletes the cgroup.

Then it checks that bpf programs are going away in 5 seconds.

Expected output:
  $ ./test_cgroup_attach
  #override:PASS
  #multi:PASS
  #autodetach:PASS
  test_cgroup_attach:PASS

On a kernel without auto-detaching:
  $ ./test_cgroup_attach
  #override:PASS
  #multi:PASS
  #autodetach:FAIL
  test_cgroup_attach:FAIL
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d5506591

selftests/bpf: enable all available cgroup v2 controllers · 596092ef

Roman Gushchin authored May 25, 2019

Enable all available cgroup v2 controllers when setting up
the environment for the bpf kselftests. It's required to properly test
the bpf prog auto-detach feature. Also it will generally increase
the code coverage.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

596092ef

selftests/bpf: convert test_cgrp2_attach2 example into kselftest · ba0c0cc0

Roman Gushchin authored May 25, 2019

Convert test_cgrp2_attach2 example into a proper test_cgroup_attach
kselftest. It's better because we do run kselftest on a constant
basis, so there are better chances to spot a potential regression.

Also make it slightly less verbose to conform kselftests output style.

Output example:
  $ ./test_cgroup_attach
  #override:PASS
  #multi:PASS
  test_cgroup_attach:PASS
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ba0c0cc0

bpf: decouple the lifetime of cgroup_bpf from cgroup itself · 4bfc0bb2

Roman Gushchin authored May 25, 2019

Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.

A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.

Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.

The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

4bfc0bb2

samples/bpf: fix a couple of style issues in bpf_load · 37b54aed

Daniel T. Lee authored May 23, 2019

This commit fixes a few style problems in samples/bpf/bpf_load.c:

 - Magic string use of 'DEBUGFS'
 - Useless zero initialization of a global variable
 - Minor style fix with whitespace
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

37b54aed

selftests/bpf: fail test_tunnel.sh if subtests fail · 486d3f22

Stanislav Fomichev authored May 24, 2019

Right now test_tunnel.sh always exits with success even if some
of the subtests fail. Since the output is very verbose, it's
hard to spot the issues with subtests. Let's fail the script
if any subtest fails.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

486d3f22

Merge branch 'bpf-bpftool-dbg-output' · 463910a5

Daniel Borkmann authored May 28, 2019

Quentin Monnet says:

====================
This series adds an option to bpftool to make it print additional
information via libbpf and the kernel verifier when attempting to load
programs.

A new API function is added to libbpf in order to pass the log_level
from bpftool with the bpf_object__* part of the API.

Options for a finer control over the log levels to use for libbpf and
the verifier could be added in the future, if desired.

v3:
- Fix and clarify commit logs.

v2:
- Do not add distinct options for libbpf and verifier logs, just keep the
  one that sets all log levels to their maximum. Rename the option.
- Do not offer a way to pick desired log levels. The choice is "use the
  option to print all logs" or "stick with the defaults".
- Do not export BPF_LOG_* flags to user header.
- Update all man pages (most bpftool operations use libbpf and may print
  libbpf logs). Verifier logs are only used when attempting to load
  programs for now, so bpftool-prog.rst and bpftool.rst remain the only
  pages updated in that regard.

Previous discussion available at:

https://lore.kernel.org/bpf/20190523105426.3938-1-quentin.monnet@netronome.com/
https://lore.kernel.org/bpf/20190429095227.9745-1-quentin.monnet@netronome.com/
====================
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

463910a5

tools: bpftool: make -d option print debug output from verifier · 55d77807

Quentin Monnet authored May 24, 2019

The "-d" option is used to require all logs available for bpftool. So
far it meant telling libbpf to print even debug-level information. But
there is another source of info that can be made more verbose: when we
attemt to load programs with bpftool, we can pass a log_level parameter
to the verifier in order to control the amount of information that is
printed to the console.

Reuse the "-d" option to print all information the verifier can tell. At
this time, this means logs related to BPF_LOG_LEVEL1, BPF_LOG_LEVEL2 and
BPF_LOG_STATS. As mentioned in the discussion on the first version of
this set, these macros are internal to the kernel
(include/linux/bpf_verifier.h) and are not meant to be part of the
stable user API, therefore we simply use the related constants to print
whatever we can at this time, without trying to tell users what is
log_level1 or what is statistics.

Verifier logs are only used when loading programs for now (In the
future: for loading BTF objects with bpftool? Although libbpf does not
currently offer to print verifier info at debug level if no error
occurred when loading BTF objects), so bpftool.rst and bpftool-prog.rst
are the only man pages to get the update.

v3:
- Add details on log level and BTF loading at the end of commit log.

v2:
- Remove the possibility to select the log levels to use (v1 offered a
  combination of "log_level1", "log_level2" and "stats").
- The macros from kernel header bpf_verifier.h are not used (and
  therefore not moved to UAPI header).
- In v1 this was a distinct option, but is now merged in the only "-d"
  switch to activate libbpf and verifier debug-level logs all at the
  same time.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

55d77807

libbpf: add bpf_object__load_xattr() API function to pass log_level · 60276f98

Quentin Monnet authored May 24, 2019

libbpf was recently made aware of the log_level attribute for programs,
used to specify the level of information expected to be dumped by the
verifier. Function bpf_prog_load_xattr() got support for this log_level
parameter.

But some applications using libbpf rely on another function to load
programs, bpf_object__load(), which does accept any parameter for log
level. Create an API function based on bpf_object__load(), but accepting
an "attr" object as a parameter. Then add a log_level field to that
object, so that applications calling the new bpf_object__load_xattr()
can pick the desired log level.

v3:
- Rewrite commit log.

v2:
- We are in a new cycle, bump libbpf extraversion number.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

60276f98

tools: bpftool: add -d option to get debug output from libbpf · 775bc8ad

Quentin Monnet authored May 24, 2019

libbpf has three levels of priority for output messages: warn, info,
debug. By default, debug output is not printed to the console.

Add a new "--debug" (short name: "-d") option to bpftool to print libbpf
logs for all three levels.

Internally, we simply use the function provided by libbpf to replace the
default printing function by one that prints logs regardless of their
level.

v2:
- Remove the possibility to select the log-levels to use (v1 offered a
  combination of "warn", "info" and "debug").
- Rename option and offer a short name: -d|--debug.
- Add option description to all bpftool manual pages (instead of
  bpftool-prog.rst only), as all commands use libbpf.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

775bc8ad

libbpf: fix warning that PTR_ERR_OR_ZERO can be used · d98363b5

Hariprasad Kelam authored May 25, 2019

Fix below warning reported by coccicheck:

/tools/lib/bpf/libbpf.c:3461:1-3: WARNING: PTR_ERR_OR_ZERO can be used
Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

d98363b5

bpf: style fix in while(!feof()) loop · 92bd6820

Chang-Hsien Tsai authored May 26, 2019

Use fgets() as the while loop condition.
Signed-off-by: Chang-Hsien Tsai <luke.tw@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

92bd6820

bpf: check signal validity in nmi for bpf_send_signal() helper · e1afb702

Yonghong Song authored May 25, 2019

Commit 8b401f9e ("bpf: implement bpf_send_signal() helper")
introduced bpf_send_signal() helper. If the context is nmi,
the sending signal work needs to be deferred to irq_work.
If the signal is invalid, the error will appear in irq_work
and it won't be propagated to user.

This patch did an early check in the helper itself to notify
user invalid signal, as suggested by Daniel.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

e1afb702

bpftool: auto-complete BTF IDs for btf dump · 0d97dacc

Andrii Nakryiko authored May 25, 2019

Auto-complete BTF IDs for `btf dump id` sub-command. List of possible BTF
IDs is scavenged from loaded BPF programs that have associated BTFs, as
there is currently no API in libbpf to fetch list of all BTFs in the
system.
Suggested-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

0d97dacc

25 May, 2019 2 commits

samples: bpf: add ibumad sample to .gitignore · d9a6f413

Matteo Croce authored May 24, 2019

This commit adds ibumad to .gitignore which is
currently ommited from the ignore file.
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d9a6f413

Merge branch 'optimize-zext' · 198ae936

Alexei Starovoitov authored May 24, 2019

Jiong Wang says:

====================
v9:
  - Split patch 5 in v8.
    make bpf uapi header file sync a separate patch. (Alexei)

v8:
  - For stack slot read, mark them as REG_LIVE_READ64. (Alexei)
  - Change DEF_NOT_SUBREG from -1 to 0. (Alexei)
  - Rebased on top of latest bpf-next.

v7:
  - Drop the first patch in v6, the one adding 32-bit return value and
    argument type. (Alexei)
  - Rename bpf_jit_hardware_zext to bpf_jit_needs_zext. (Alexei)
  - Use mov32 with imm == 1 to indicate it is zext. (Alexei)
  - JIT back-ends peephole next insn to optimize out unnecessary zext
    inserted by verifier. (Alexei)
  - Testing:
    + patch set tested (bpf selftest) on x64 host with llvm 9.0
      no regression observed no both JIT and interpreter modes.
    + patch set tested (bpf selftest) on x32 host.
      By Yanqing Wang, thanks!
      no regression observed on both JIT and interpreter modes.
    + patch set tested (bpf selftest) on RV64 host with llvm 9.0,
      By Björn Töpel, thanks!
      no regression observed before and after this set with JIT_ALWAYS_ON.
      test_progs_32 also enabled as LLVM 9.0 is used by Björn.
    + cross compiled the other affected targets, arm, PowerPC, SPARC, S390.

v6:
  - Fixed s390 kbuild test robot error. (kbuild)
  - Make comment style in backends patches more consistent.

v5:
  - Adjusted several test_verifier helpers to make them works on hosts
    w and w/o hardware zext. (Naveen)
  - Make sure zext flag not set when verifier by-passed, for example,
    libtest_bpf.ko. (Naveen)
  - Conservatively mark bpf main return value as 64-bit. (Alexei)
  - Make sure read flag is either READ64 or READ32, not the mix of both.
    (Alexei)
  - Merged patch 1 and 2 in v4. (Alexei)
  - Fixed kbuild test robot warning on NFP. (kbuild)
  - Proposed new BPF_ZEXT insn to have optimal code-gen for various JIT
    back-ends.
  - Conservately set zext flags for patched-insn.
  - Fixed return value zext for helper function calls.
  - Also adjusted test_verifier scalability unit test to avoid triggerring
    too many insn patch which will hang computer.
  - re-tested on x86 host with llvm 9.0, no regression on test_verifier,
    test_progs, test_progs_32.
  - re-tested offload target (nfp), no regression on local testsuite.

v4:
  - added the two missing fixes which addresses two Jakub's reviewes in v3.
  - rebase on top of bpf-next.

v3:
  - remove redundant check in "propagate_liveness_reg". (Jakub)
  - add extra check in "mark_reg_read" to prune more search. (Jakub)
  - re-implemented "prog_flags" passing mechanism, removed use of
    global switch inside libbpf.
  - enabled high 32-bit randomization beyond "test_verifier" and
    "test_progs". Now it should have been enabled for all possible
    tests. Re-run all tests, haven't noticed regression.
  - remove RFC tag.

v2:
  - rebased on top of bpf-next master.
  - added comments for what is sub-register def index. (Edward, Alexei)
  - removed patch 1 which turns bit mask from enum to macro. (Alexei)
  - removed sysctl/bpf_jit_32bit_opt. (Alexei)
  - merged sub-register def insn index into reg state. (Alexei)
  - change test methodology (Alexei):
      + instead of simple unit tests on x86_64 for which this optimization
        doesn't enabled due to there is hardware support, poison high
        32-bit for whose def identified as safe to do so. this could let
        the correctness of this patch set checked when daily bpf selftest
        ran which delivers very stressful test on host machine like x86_64.
      + hi32 poisoning is gated by a new BPF_F_TEST_RND_HI32 prog flags.
      + BPF_F_TEST_RND_HI32 is enabled for all tests of "test_progs" and
        "test_verifier", the latter needs minor tweak on two unit tests,
        please see the patch for the change.
      + introduced a new global variable "libbpf_test_mode" into libbpf.
        once it is set to true, it will set BPF_F_TEST_RND_HI32 for all the
        later PROG_LOAD syscall, the goal is to easy the enable of hi32
        poison on exsiting testsuite.
        we could also introduce new APIs, for example "bpf_prog_test_load",
        then use -Dbpf_prog_load=bpf_prog_test_load to migrate tests under
        test_progs, but there are several load APIs, and such new API need
        some change on struture like "struct bpf_prog_load_attr".
      + removed old unit tests. it is based on insn scan and requires quite
        a few test_verifier generic code change. given hi32 randomization
        could offer good test coverage, the unit tests doesn't add much
        extra test value.
  - enhanced register width check ("is_reg64") when record sub-register
    write, now, it returns more accurate width.
  - Re-run all tests under "test_progs" and "test_verifier" on x86_64, no
    regression. Fixed a couple of bugs exposed:
      1. ctx field size transformation was not taken into account.
      2. insn patch could cause lost of original aux data which is
         important for ctx field conversion.
      3. return value for propagate_liveness was wrong and caused
         regression on processed insn number.
      4. helper call arg wasn't handled properly that path prune may cause
         64-bit read info in pruned path lost.
  - Re-run Cilium bpf prog for processed-insn-number benchmarking, no
    regression.

v1:
  - Fixed the missing handling on callee-saved for bpf-to-bpf call,
    sub-register defs therefore moved to frame state. (Jakub Kicinski)
  - Removed redundant "cross_reg". (Jakub Kicinski)
  - Various coding styles & grammar fixes. (Jakub Kicinski, Quentin Monnet)

eBPF ISA specification requires high 32-bit cleared when low 32-bit
sub-register is written. This applies to destination register of ALU32 etc.
JIT back-ends must guarantee this semantic when doing code-gen. x86_64 and
AArch64 ISA has the same semantics, so the corresponding JIT back-end
doesn't need to do extra work.

However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches
(PowerPC, SPARC etc) need to do explicit zero extension to meet this
requirement, otherwise code like the following will fail.

  u64_value = (u64) u32_value
  ... other uses of u64_value

This is because compiler could exploit the semantic described above and
save those zero extensions for extending u32_value to u64_value, these JIT
back-ends are expected to guarantee this through inserting extra zero
extensions which however could be a significant increase on the code size.
Some benchmarks show there could be ~40% sub-register writes out of total
insns, meaning at least ~40% extra code-gen.

One observation is these extra zero extensions are not always necessary.
Take above code snippet for example, it is possible u32_value will never be
casted into a u64, the value of high 32-bit of u32_value then could be
ignored and extra zero extension could be eliminated.

This patch implements this idea, insns defining sub-registers will be
marked when the high 32-bit of the defined sub-register matters. For
those unmarked insns, it is safe to eliminate high 32-bit clearnace for
them.

Algo
====
We could use insn scan based static analysis to tell whether one
sub-register def doesn't need zero extension. However, using such static
analysis, we must do conservative assumption at branching point where
multiple uses could be introduced. So, for any sub-register def that is
active at branching point, we need to mark it as needing zero extension.
This could introducing quite a few false alarms, for example ~25% on
Cilium bpf_lxc.

It will be far better to use dynamic data-flow tracing which verifier
fortunately already has and could be easily extend to serve the purpose of
this patch set.

 - Split read flags into READ32 and READ64.

 - Record index of insn that does sub-register write. Keep the index inside
   reg state and update it during verifier insn walking.

 - A full register read on a sub-register marks its definition insn as
   needing zero extension on dst register.

   A new sub-register write overrides the old one.

 - When propagating read64 during path pruning, also mark any insn defining
   a sub-register that is read in the pruned path as full-register.

Benchmark
=========
 - I estimate the JITed image could be 10% ~ 30% smaller on these affected
   arches (nfp, arm, x32, risv, ppc, sparc, s390), depending on the prog.

 - For Cilium bpf_lxc, there is ~11500 insns in the compiled binary (use
   latest LLVM snapshot, and with -mcpu=v3 -mattr=+alu32 enabled), 4460 of
   them has sub-register writes (~40%). Calculated by:

    cat dump | grep -P "\tw" | wc -l       (ALU32)
    cat dump | grep -P "r.*=.*u32" | wc -l (READ_W)
    cat dump | grep -P "r.*=.*u16" | wc -l (READ_H)
    cat dump | grep -P "r.*=.*u8" | wc -l  (READ_B)

   After this patch set enabled, > 25% of those 4460 could be identified as
   doesn't needing zero extension on the destination, and the percentage
   could go further up to more than 50% with some follow up optimizations
   based on the infrastructure offered by this set. This leads to
   significant save on JITed image.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

198ae936