Commits · eff14fcd032bc1b403c1716f6823b3c72c58096a · Kirill Smelkov / linux

07 Jan, 2022 4 commits

Merge branch 'net: bpf: handle return value of post_bind{4,6} and add selftests for it' · eff14fcd

Alexei Starovoitov authored Jan 06, 2022

Menglong Dong says:

====================

From: Menglong Dong <imagedong@tencent.com>

The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:

        err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
        if (err) {
                inet->inet_saddr = inet->inet_rcv_saddr = 0;
                goto out_release_sock;
        }

Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.

To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().

Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.

The second patch use C99 initializers in test_sock.c

The third patch is the selftests for this modification.

Changes since v4:
- use C99 initializers in test_sock.c before adding the test case

Changes since v3:
- add the third patch which use C99 initializers in test_sock.c

Changes since v2:
- NULL check for sk->sk_prot->put_port

Changes since v1:
- introduce 'put_port' field for 'struct proto'
- add selftests for it
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

eff14fcd

bpf: selftests: Add bind retry for post_bind{4, 6} · f7342481

Menglong Dong authored Jan 06, 2022

With previous patch, kernel is able to 'put_port' after sys_bind()
fails. Add the test for that case: rebind another port after
sys_bind() fails. If the bind success, it means previous bind
operation is already undoed.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-4-imagedong@tencent.com

f7342481

bpf: selftests: Use C99 initializers in test_sock.c · 6fd92c7f

Menglong Dong authored Jan 06, 2022

Use C99 initializers for the initialization of 'tests' in test_sock.c.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-3-imagedong@tencent.com

6fd92c7f

net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() · 91a760b2

Menglong Dong authored Jan 06, 2022

The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:

	err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
	if (err) {
		inet->inet_saddr = inet->inet_rcv_saddr = 0;
		goto out_release_sock;
	}

Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.

To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().

Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com

91a760b2

06 Jan, 2022 18 commits

bpf/selftests: Test bpf_d_path on rdonly_mem. · 44bab87d

Hao Luo authored Jan 06, 2022

The second parameter of bpf_d_path() can only accept writable
memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not
be passed into bpf_d_path for modification. This patch adds
a selftest to verify this behavior.
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220106205525.2116218-1-haoluo@google.com

44bab87d

libbpf: Add documentation for bpf_map batch operations · e59618f0

Grant Seltzer authored Jan 06, 2022

This adds documention for:

- bpf_map_delete_batch()
- bpf_map_lookup_batch()
- bpf_map_lookup_and_delete_batch()
- bpf_map_update_batch()

This also updates the public API for the `keys` parameter
of `bpf_map_delete_batch()`, and both the
`keys` and `values` parameters of `bpf_map_update_batch()`
to be constants.
Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220106201304.112675-1-grantseltzer@gmail.com

e59618f0

selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 · 70bc7933

Andrii Nakryiko authored Jan 06, 2022

PT_REGS*() macro on some architectures force-cast struct pt_regs to
other types (user_pt_regs, etc) and might drop volatile modifiers, if any.
Volatile isn't really required as pt_regs value isn't supposed to change
during the BPF program run, so this is correct behavior.

But progs/loop3.c relies on that volatile modifier to ensure that loop
is preserved. Fix loop3.c by declaring i and sum variables as volatile
instead. It preserves the loop and makes the test pass on all
architectures (including s390x which is currently broken).

Fixes: 3cc31d79 ("libbpf: Normalize PT_REGS_xxx() macro definitions")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220106205156.955373-1-andrii@kernel.org

70bc7933

xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames · 1372d34c

Toke Høiland-Jørgensen authored Jan 03, 2022

Add an xdp_do_redirect_frame() variant which supports pre-computed
xdp_frame structures. This will be used in bpf_prog_run() to avoid having
to write to the xdp_frame structure when the XDP program doesn't modify the
frame boundaries.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103150812.87914-6-toke@redhat.com

1372d34c

xdp: Move conversion to xdp_frame out of map functions · d53ad5d8

Toke Høiland-Jørgensen authored Jan 03, 2022

All map redirect functions except XSK maps convert xdp_buff to xdp_frame
before enqueueing it. So move this conversion of out the map functions
and into xdp_do_redirect(). This removes a bit of duplicated code, but more
importantly it makes it possible to support caller-allocated xdp_frame
structures, which will be added in a subsequent commit.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103150812.87914-5-toke@redhat.com

d53ad5d8

page_pool: Store the XDP mem id · 64693ec7

Toke Høiland-Jørgensen authored Jan 03, 2022

Store the XDP mem ID inside the page_pool struct so it can be retrieved
later for use in bpf_prog_run().
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20220103150812.87914-4-toke@redhat.com

64693ec7

page_pool: Add callback to init pages when they are allocated · 35b2e549

Toke Høiland-Jørgensen authored Jan 03, 2022

Add a new callback function to page_pool that, if set, will be called every
time a new page is allocated. This will be used from bpf_test_run() to
initialise the page data with the data provided by userspace when running
XDP programs with redirect turned on.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20220103150812.87914-3-toke@redhat.com

35b2e549

xdp: Allow registering memory model without rxq reference · 4a48ef70

Toke Høiland-Jørgensen authored Jan 03, 2022

The functions that register an XDP memory model take a struct xdp_rxq as
parameter, but the RXQ is not actually used for anything other than pulling
out the struct xdp_mem_info that it embeds. So refactor the register
functions and export variants that just take a pointer to the xdp_mem_info.

This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a
page_pool instance that is not connected to any network device.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103150812.87914-2-toke@redhat.com

4a48ef70

Merge branch 'samples/bpf: xdpsock app enhancements' · 640a171c

Alexei Starovoitov authored Jan 05, 2022

Ong Boon says:

====================

First of all, sorry for taking more time to get back to this series and
thanks to all valuble feedback in series-1 at [1] from Jesper and Song
Liu.

Since then I have looked into what Jesper suggested in [2] and worked on
revising the patch series into several patches for ease of review:

v1->v2:
1/7: [No change]. Add VLAN tag (ID & Priority) to the generated Tx-Only
     frames.

2/7: [No change]. Add DMAC and SMAC setting to the generated Tx-Only
     frames. If parameters are not set, previous DMAC and SMAC are used.

3/7: [New]. Add support for selecting different CLOCK for clock_gettime()
     used in get_nsecs.

4/7: [New]. This is a total rework from series-1 3/4-patch [3]. It uses
     clock_nanosleep() suggested by Jesper. In addition, added statistic
     for Tx schedule variance under application stat (-a|--app-stats).
     Make the cyclic Tx operation and --poll mode to be mutually-
     exclusive. Still, the ability to specify TX cycle time and used
     together with batch size and packet count remain the same.

5/7: [New]. Add the support for TX process schedule policy and priority
     setting. By default, SCHED_OTHER policy is used. This too is matching
     the schedule policy setting in [2].

6/7: [Change]. This is update from series-1 4/4-patch [4]. Added TX clean
     process time-out in 1s granularity with configurable retries count
     (-O|--retries).

7/7: [New]. Added timestamp for TX packet following pktgen_hdr format
     matching the implementation in [2]. However, the sequence ID remains
     the same as it is instead of process schedule diff in [2].

To summarize on what program options have been added with v2 series
using an example below:-

 DMAC (-G)                 = fa:8d:f1:e2:0b:e8
 SMAC (-H)                 = ce:17:07:17:3e:3a

 VLAN tagged (-V)
 VLAN ID (-J)              = 12
 VLAN Pri (-K)             = 3

 Tx Queue (-q)             = 3
 Cycle Time in us (-T)     = 1000
 Batch (-b)                = 2
 Packet Count              = 6
 Tx schedule policy (-W)   = FIFO
 Tx schedule priority (-U) = 50
 Clock selection (-w)      = REALTIME

 Tx timeout retries(-O)    = 5
 Tx timestamp (-y)
 Cyclic Tx schedule stat (-a)

Note: xdpsock sets UDP dest-port and src-port to 0x1000 as default.

 Sending Board
 =============
 $ xdpsock -i eth0 -t -N -z -H ce:17:07:17:3e:3a -G fa:8d:f1:e2:0b:e8 \
   -V -J 12 -K 3 -q 3 \
   -T 1000 -b 2 -C 6 -W FIFO -U 50 -w REALTIME \
   -O 5 -y -a

  sock0@eth0:3 txonly xdp-drv
                    pps            pkts           0.00
 rx                 0              0
 tx                 0              6

                    calls/s        count
 rx empty polls     0              0
 fill fail polls    0              0
 copy tx sendtos    0              0
 tx wakeup sendtos  0              5
 opt polls          0              0

                    period     min        ave        max        cycle
 Cyclic TX          1000000    31033      32009      33397      3

 Receiving Board
 ===============
 $ tcpdump -nei eth0 udp port 0x1000 -vv -Q in -X \
    --time-stamp-precision nano
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:46:40.520111580 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
    10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
        0x0000:  4500 002c 0000 0000 4011 527e 0a0a 0a10  E..,....@.R~....
        0x0010:  0a0a 0a20 1000 1000 0018 e997 be9b e955  ...............U
        0x0020:  0000 0000 61cd 2ba1 0006 987c            ....a.+....|
03:46:40.520112163 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
    10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
        0x0000:  4500 002c 0000 0000 4011 527e 0a0a 0a10  E..,....@.R~....
        0x0010:  0a0a 0a20 1000 1000 0018 e996 be9b e955  ...............U
        0x0020:  0000 0001 61cd 2ba1 0006 987c            ....a.+....|
03:46:40.521066860 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
    10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
        0x0000:  4500 002c 0000 0000 4011 527e 0a0a 0a10  E..,....@.R~....
        0x0010:  0a0a 0a20 1000 1000 0018 e5af be9b e955  ...............U
        0x0020:  0000 0002 61cd 2ba1 0006 9c62            ....a.+....b
03:46:40.521067012 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
    10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
        0x0000:  4500 002c 0000 0000 4011 527e 0a0a 0a10  E..,....@.R~....
        0x0010:  0a0a 0a20 1000 1000 0018 e5ae be9b e955  ...............U
        0x0020:  0000 0003 61cd 2ba1 0006 9c62            ....a.+....b
03:46:40.522061935 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
    10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
        0x0000:  4500 002c 0000 0000 4011 527e 0a0a 0a10  E..,....@.R~....
        0x0010:  0a0a 0a20 1000 1000 0018 e1c5 be9b e955  ...............U
        0x0020:  0000 0004 61cd 2ba1 0006 a04a            ....a.+....J
03:46:40.522062173 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
    10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
        0x0000:  4500 002c 0000 0000 4011 527e 0a0a 0a10  E..,....@.R~....
        0x0010:  0a0a 0a20 1000 1000 0018 e1c4 be9b e955  ...............U
        0x0020:  0000 0005 61cd 2ba1 0006 a04a            ....a.+....J

I have tested the above with both tagged and untagged packet format and
based on the timestamp in tcpdump found that the timing of the batch
cyclic transmission is correct.

Appreciate if community can give the patch series v2 a try and point out
any gap.

Thanks
Boon Leong

[1] https://patchwork.kernel.org/project/netdevbpf/cover/20211124091821.3916046-1-boon.leong.ong@intel.com/
[2] https://github.com/netoptimizer/network-testing/blob/master/src/udp_pacer.c
[3] https://patchwork.kernel.org/project/netdevbpf/patch/20211124091821.3916046-4-boon.leong.ong@intel.com/
[4] https://patchwork.kernel.org/project/netdevbpf/patch/20211124091821.3916046-5-boon.leong.ong@intel.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

640a171c

samples/bpf: xdpsock: Add timestamp for Tx-only operation · eb68db45

Ong Boon Leong authored Dec 30, 2021

It may be useful to add timestamp for Tx packets for continuous or cyclic
transmit operation. The timestamp and sequence ID of a Tx packet are
stored according to pktgen header format. To enable per-packet timestamp,
use -y|--tstamp option. If timestamp is off, pktgen header is not
included in the UDP payload. This means receiving side can use the magic
number for pktgen for differentiation.

The implementation supports both VLAN tagged and untagged option. By
default, the minimum packet size is set at 64B. However, if VLAN tagged
is on (-V), the minimum packet size is increased to 66B just so to fit
the pktgen_hdr size.

Added hex_dump() into the code path just for future cross-checking.
As before, simply change to "#define DEBUG_HEXDUMP 1" to inspect the
accuracy of TX packet.
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-8-boon.leong.ong@intel.com

eb68db45

samples/bpf: xdpsock: Add time-out for cleaning Tx · 8121e789

Ong Boon Leong authored Dec 30, 2021

When user sets tx-pkt-count and in case where there are invalid Tx frame,
the complete_tx_only_all() process polls indefinitely. So, this patch
adds a time-out mechanism into the process so that the application
can terminate automatically after it retries 3*polling interval duration.

v1->v2:
 Thanks to Jesper's and Song Liu's suggestion.
 - clean-up git message to remove polling log
 - make the Tx time-out retries configurable with 1s granularity
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-7-boon.leong.ong@intel.com

8121e789

samples/bpf: xdpsock: Add sched policy and priority support · fa24d0b1

Ong Boon Leong authored Dec 30, 2021

By default, TX schedule policy is SCHED_OTHER (round-robin time-sharing).
To improve TX cyclic scheduling, we add SCHED_FIFO policy and its priority
by using -W FIFO or --policy=FIFO and -U <PRIO> or --schpri=<PRIO>.

A) From xdpsock --app-stats, for SCHED_OTHER policy:
   $ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a

                      period     min        ave        max        cycle
   Cyclic TX          1000000    53507      75334      712642     6250

B) For SCHED_FIFO policy and schpri=50:
   $ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a -W FIFO -U 50

                      period     min        ave        max        cycle
   Cyclic TX          1000000    3699       24859      54397      6250
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-6-boon.leong.ong@intel.com

fa24d0b1

samples/bpf: xdpsock: Add cyclic TX operation capability · fa0d27a1

Ong Boon Leong authored Dec 30, 2021

Tx cycle time is in micro-seconds unit. By combining the batch size (-b M)
and Tx cycle time (-T|--tx-cycle N), xdpsock now can transmit batch-size of
packets every N-us periodically. Cyclic TX operation is not applicable if
--poll mode is used.

To transmit 16 packets every 1ms cycle time for total of 100000 packets
silently:
 $ xdpsock -i eth0 -T -N -z -T 1000 -b 16 -C 100000

To print cyclic TX schedule variance stats, use --app-stats|-a:
 $ xdpsock -i eth0 -T -N -z -T 1000 -b 16 -C 100000 -a

 sock0@eth0:0 txonly xdp-drv
                   pps            pkts           0.00
rx                 0              0
tx                 0              100000

                   calls/s        count
rx empty polls     0              0
fill fail polls    0              0
copy tx sendtos    0              0
tx wakeup sendtos  0              6254
opt polls          0              0

                   period     min        ave        max        cycle
Cyclic TX          1000000    53507      75334      712642     6250
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-5-boon.leong.ong@intel.com

fa0d27a1

samples/bpf: xdpsock: Add clockid selection support · 5a388254

Ong Boon Leong authored Dec 30, 2021

User specifies the clock selection by using -w CLOCK or --clock=CLOCK
where CLOCK=[REALTIME, TAI, BOOTTIME, MONOTONIC].

The default CLOCK selection is MONOTONIC.

The implementation of clock selection parsing is borrowed from
iproute2/tc/q_taprio.c
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-4-boon.leong.ong@intel.com

5a388254

samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation · 6440a6c2

Ong Boon Leong authored Dec 30, 2021

To set Dest MAC address (-G|--tx-dmac) only:
 $ xdpsock -i eth0 -t -N -z -G aa:bb:cc:dd:ee:ff

To set Source MAC address (-H|--tx-smac) only:
 $ xdpsock -i eth0 -t -N -z -H 11:22:33:44:55:66

To set both Dest and Source MAC address:
 $ xdpsock -i eth0 -t -N -z -G aa:bb:cc:dd:ee:ff \
   -H 11:22:33:44:55:66

The default Dest and Source MAC address remain the same as before.
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20211230035447.523177-3-boon.leong.ong@intel.com

6440a6c2

samples/bpf: xdpsock: Add VLAN support for Tx-only operation · 2741a049

Ong Boon Leong authored Dec 30, 2021

In multi-queue environment testing, the support for VLAN-tag based
steering is useful. So, this patch adds the capability to add
VLAN tag (VLAN ID and Priority) to the generated Tx frame.

To set the VLAN ID=10 and Priority=2 for Tx only through TxQ=3:
 $ xdpsock -i eth0 -t -N -z -q 3 -V -J 10 -K 2

If VLAN ID (-J) and Priority (-K) is set, it default to
  VLAN ID = 1
  VLAN Priority = 0.

For example, VLAN-tagged Tx only, xdp copy mode through TxQ=1:
 $ xdpsock -i eth0 -t -N -c -q 1 -V
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211230035447.523177-2-boon.leong.ong@intel.com

2741a049

libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API · 5f608264

Christy Lee authored Jan 04, 2022

API created with simplistic assumptions about BPF map definitions.
It hasn’t worked for a while, deprecate it in preparation for
libbpf 1.0.

  [0] Closes: https://github.com/libbpf/libbpf/issues/302Signed-off-by: Christy Lee <christylee@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220105003120.2222673-1-christylee@fb.com

5f608264

libbpf 1.0: Deprecate bpf_map__is_offload_neutral() · 9855c131

Christy Lee authored Jan 04, 2022

Deprecate bpf_map__is_offload_neutral(). It’s most probably broken
already. PERF_EVENT_ARRAY isn’t the only map that’s not suitable
for hardware offloading. Applications can directly check map type
instead.

  [0] Closes: https://github.com/libbpf/libbpf/issues/306Signed-off-by: Christy Lee <christylee@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220105000601.2090044-1-christylee@fb.com

9855c131

05 Jan, 2022 18 commits

libbpf: Support repeated legacy kprobes on same function · 51a33c60

Qiang Wang authored Dec 27, 2021

If repeated legacy kprobes on same function in one process,
libbpf will register using the same probe name and got -EBUSY
error. So append index to the probe name format to fix this
problem.
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Qiang Wang <wangqiang.wq.frank@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211227130713.66933-2-wangqiang.wq.frank@bytedance.com

51a33c60

libbpf: Use probe_name for legacy kprobe · 71cff670

Qiang Wang authored Dec 27, 2021

Fix a bug in commit 46ed5fc3, which wrongly used the
func_name instead of probe_name to register legacy kprobe.

Fixes: 46ed5fc3 ("libbpf: Refactor and simplify legacy kprobe code")
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Qiang Wang <wangqiang.wq.frank@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Reviewed-by: Hengqi Chen <hengqi.chen@gmail.com>
Link: https://lore.kernel.org/bpf/20211227130713.66933-1-wangqiang.wq.frank@bytedance.com

71cff670

libbpf: Deprecate bpf_perf_event_read_simple() API · 7218c28c

Christy Lee authored Dec 29, 2021

With perf_buffer__poll() and perf_buffer__consume() APIs available,
there is no reason to expose bpf_perf_event_read_simple() API to
users. If users need custom perf buffer, they could re-implement
the function.

Mark bpf_perf_event_read_simple() and move the logic to a new
static function so it can still be called by other functions in the
same file.

  [0] Closes: https://github.com/libbpf/libbpf/issues/310Signed-off-by: Christy Lee <christylee@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211229204156.13569-1-christylee@fb.com

7218c28c

bpf: Add SO_RCVBUF/SO_SNDBUF in _bpf_getsockopt(). · 28479934

Kuniyuki Iwashima authored Jan 04, 2022

This patch exposes SO_RCVBUF/SO_SNDBUF through bpf_getsockopt().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220104013153.97906-3-kuniyu@amazon.co.jp

28479934

bpf: Fix SO_RCVBUF/SO_SNDBUF handling in _bpf_setsockopt(). · 04c350b1

Kuniyuki Iwashima authored Jan 04, 2022

The commit 4057765f ("sock: consistent handling of extreme
SO_SNDBUF/SO_RCVBUF values") added a change to prevent underflow
in setsockopt() around SO_SNDBUF/SO_RCVBUF.

This patch adds the same change to _bpf_setsockopt().

Fixes: 4057765f ("sock: consistent handling of extreme SO_SNDBUF/SO_RCVBUF values")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220104013153.97906-2-kuniyu@amazon.co.jp

04c350b1

bpf: Fix verifier support for validation of async callbacks · a5bebc4f

Kris Van Hees authored Jan 05, 2022

Commit bfc6bb74 ("bpf: Implement verifier support for validation of async callbacks.")
added support for BPF_FUNC_timer_set_callback to
the __check_func_call() function.  The test in __check_func_call() is
flaweed because it can mis-interpret a regular BPF-to-BPF pseudo-call
as a BPF_FUNC_timer_set_callback callback call.

Consider the conditional in the code:

	if (insn->code == (BPF_JMP | BPF_CALL) &&
	    insn->imm == BPF_FUNC_timer_set_callback) {

The BPF_FUNC_timer_set_callback has value 170.  This means that if you
have a BPF program that contains a pseudo-call with an instruction delta
of 170, this conditional will be found to be true by the verifier, and
it will interpret the pseudo-call as a callback.  This leads to a mess
with the verification of the program because it makes the wrong
assumptions about the nature of this call.

Solution: include an explicit check to ensure that insn->src_reg == 0.
This ensures that calls cannot be mis-interpreted as an async callback
call.

Fixes: bfc6bb74 ("bpf: Implement verifier support for validation of async callbacks.")
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220105210150.GH1559@oracle.com

a5bebc4f

bpf, docs: Fully document the JMP mode modifiers · 58d8a3fc

Christoph Hellwig authored Jan 03, 2022

Add a description for all the modifiers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103183556.41040-7-hch@lst.de

58d8a3fc

bpf, docs: Fully document the JMP opcodes · 9e533e22

Christoph Hellwig authored Jan 03, 2022

Add pseudo-code to document all the different BPF_JMP / BPF_JMP64
opcodes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103183556.41040-6-hch@lst.de

9e533e22

bpf, docs: Fully document the ALU opcodes · 03c517ee

Christoph Hellwig authored Jan 03, 2022

Add pseudo-code to document all the different BPF_ALU / BPF_ALU64
opcodes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103183556.41040-5-hch@lst.de

03c517ee

bpf, docs: Document the opcode classes · 894cda55

Christoph Hellwig authored Jan 03, 2022

Add a description for each opcode class.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103183556.41040-4-hch@lst.de

894cda55

bpf, docs: Add subsections for ALU and JMP instructions · be3193cd

Christoph Hellwig authored Jan 03, 2022

Add a little more stucture to the ALU/JMP documentation with sections and
improve the example text.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103183556.41040-3-hch@lst.de

be3193cd

bpf, docs: Add a setion to explain the basic instruction encoding · 62e46838

Christoph Hellwig authored Jan 03, 2022

The eBPF instruction set document does not currently document the basic
instruction encoding.  Add a section to do that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103183556.41040-2-hch@lst.de

62e46838

bpf, selftests: Add verifier test for mem_or_null register with offset. · ca796fe6

Daniel Borkmann authored Jan 05, 2022

Add a new test case with mem_or_null typed register with off > 0 to ensure
it gets rejected by the verifier:

  # ./test_verifier 1011
  #1009/u check with invalid reg offset 0 OK
  #1009/p check with invalid reg offset 0 OK
  Summary: 2 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ca796fe6

bpf: Don't promote bogus looking registers after null check. · e60b0d12

Daniel Borkmann authored Jan 05, 2022

If we ever get to a point again where we convert a bogus looking <ptr>_or_null
typed register containing a non-zero fixed or variable offset, then lets not
reset these bounds to zero since they are not and also don't promote the register
to a <ptr> type, but instead leave it as <ptr>_or_null. Converting to a unknown
register could be an avenue as well, but then if we run into this case it would
allow to leak a kernel pointer this way.

Fixes: f1174f77 ("bpf/verifier: rework value tracking")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e60b0d12

bpf, sockmap: Fix double bpf_prog_put on error case in map_link · 218d747a

John Fastabend authored Jan 04, 2022

sock_map_link() is called to update a sockmap entry with a sk. But, if the
sock_map_init_proto() call fails then we return an error to the map_update
op against the sockmap. In the error path though we need to cleanup psock
and dec the refcnt on any programs associated with the map, because we
refcnt them early in the update process to ensure they are pinned for the
psock. (This avoids a race where user deletes programs while also updating
the map with new socks.)

In current code we do the prog refcnt dec explicitely by calling
bpf_prog_put() when the program was found in the map. But, after commit
'38207a5e' in this error path we've already done the prog to psock
assignment so the programs have a reference from the psock as well. This
then causes the psock tear down logic, invoked by sk_psock_put() in the
error path, to similarly call bpf_prog_put on the programs there.

To be explicit this logic does the prog->psock assignment:

  if (msg_*)
    psock_set_prog(...)

Then the error path under the out_progs label does a similar check and
dec with:

  if (msg_*)
     bpf_prog_put(...)

And the teardown logic sk_psock_put() does ...

  psock_set_prog(msg_*, NULL)

... triggering another bpf_prog_put(...). Then KASAN gives us this splat,
found by syzbot because we've created an inbalance between bpf_prog_inc and
bpf_prog_put calling put twice on the program.

  BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline]
  BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] kernel/bpf/syscall.c:1829
  BUG: KASAN: vmalloc-out-of-bounds in bpf_prog_put+0x8c/0x4f0 kernel/bpf/syscall.c:1829 kernel/bpf/syscall.c:1829
  Read of size 8 at addr ffffc90000e76038 by task syz-executor020/3641

To fix clean up error path so it doesn't try to do the bpf_prog_put in the
error path once progs are assigned then it relies on the normal psock
tear down logic to do complete cleanup.

For completness we also cover the case whereh sk_psock_init_strp() fails,
but this is not expected because it indicates an incorrect socket type
and should be caught earlier.

Fixes: 38207a5e ("bpf, sockmap: Attach map progs to psock early for feature probes")
Reported-by: syzbot+bb73e71cf4b8fd376a4f@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220104214645.290900-1-john.fastabend@gmail.com

218d747a

bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser() · 5b2c5540

John Fastabend authored Jan 04, 2022

Applications can be confused slightly because we do not always return the
same error code as expected, e.g. what the TCP stack normally returns. For
example on a sock err sk->sk_err instead of returning the sock_error we
return EAGAIN. This usually means the application will 'try again'
instead of aborting immediately. Another example, when a shutdown event
is received we should immediately abort instead of waiting for data when
the user provides a timeout.

These tend to not be fatal, applications usually recover, but introduces
bogus errors to the user or introduces unexpected latency. Before
'c5d2177a' we fell back to the TCP stack when no data was available
so we managed to catch many of the cases here, although with the extra
latency cost of calling tcp_msg_wait_data() first.

To fix lets duplicate the error handling in TCP stack into tcp_bpf so
that we get the same error codes.

These were found in our CI tests that run applications against sockmap
and do longer lived testing, at least compared to test_sockmap that
does short-lived ping/pong tests, and in some of our test clusters
we deploy.

Its non-trivial to do these in a shorter form CI tests that would be
appropriate for BPF selftests, but we are looking into it so we can
ensure this keeps working going forward. As a preview one idea is to
pull in the packetdrill testing which catches some of this.

Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com

5b2c5540

bpf, arm64: Use emit_addr_mov_i64() for BPF_PSEUDO_FUNC · e4a41c2c

Hou Tao authored Dec 31, 2021

The following error is reported when running "./test_progs -t for_each"
under arm64:

  bpf_jit: multi-func JIT bug 58 != 56
  [...]
  JIT doesn't support bpf-to-bpf calls

The root cause is the size of BPF_PSEUDO_FUNC instruction increases
from 2 to 3 after the address of called bpf-function is settled and
there are two bpf-to-bpf calls in test_pkt_access. The generated
instructions are shown below:

  0x48:  21 00 C0 D2    movz x1, #0x1, lsl #32
  0x4c:  21 00 80 F2    movk x1, #0x1

  0x48:  E1 3F C0 92    movn x1, #0x1ff, lsl #32
  0x4c:  41 FE A2 F2    movk x1, #0x17f2, lsl #16
  0x50:  81 70 9F F2    movk x1, #0xfb84

Fixing it by using emit_addr_mov_i64() for BPF_PSEUDO_FUNC, so
the size of jited image will not change.

Fixes: 69c087ba ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211231151018.3781550-1-houtao1@huawei.com

e4a41c2c

bpf/selftests: Fix namespace mount setup in tc_redirect · 5e22dd18

Jiri Olsa authored Jan 04, 2022

The tc_redirect umounts /sys in the new namespace, which can be
mounted as shared and cause global umount. The lazy umount also
takes down mounted trees under /sys like debugfs, which won't be
available after sysfs mounts again and could cause fails in other
tests.

  # cat /proc/self/mountinfo | grep debugfs
  34 23 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime shared:14 - debugfs debugfs rw
  # cat /proc/self/mountinfo | grep sysfs
  23 86 0:22 / /sys rw,nosuid,nodev,noexec,relatime shared:2 - sysfs sysfs rw
  # mount | grep debugfs
  debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)

  # ./test_progs -t tc_redirect
  #164 tc_redirect:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

  # mount | grep debugfs
  # cat /proc/self/mountinfo | grep debugfs
  # cat /proc/self/mountinfo | grep sysfs
  25 86 0:22 / /sys rw,relatime shared:2 - sysfs sysfs rw

Making the sysfs private under the new namespace so the umount won't
trigger the global sysfs umount.
Reported-by: Hangbin Liu <haliu@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jussi Maki <joamaki@gmail.com>
Link: https://lore.kernel.org/bpf/20220104121030.138216-1-jolsa@kernel.org

5e22dd18