Commits · 9c3de619e13ee6693ec5ac74f50b7aa89056a70e · Kirill Smelkov / linux

12 Feb, 2022 1 commit

libbpf: Use dynamically allocated buffer when receiving netlink messages · 9c3de619

Toke Høiland-Jørgensen authored Feb 12, 2022

When receiving netlink messages, libbpf was using a statically allocated
stack buffer of 4k bytes. This happened to work fine on systems with a 4k
page size, but on systems with larger page sizes it can lead to truncated
messages. The user-visible impact of this was that libbpf would insist no
XDP program was attached to some interfaces because that bit of the netlink
message got chopped off.

Fix this by switching to a dynamically allocated buffer; we borrow the
approach from iproute2 of using recvmsg() with MSG_PEEK|MSG_TRUNC to get
the actual size of the pending message before receiving it, adjusting the
buffer as necessary. While we're at it, also add retries on interrupted
system calls around the recvmsg() call.

v2:
  - Move peek logic to libbpf_netlink_recv(), don't double free on ENOMEM.

Fixes: 8bbb77b7 ("libbpf: Add various netlink helpers")
Reported-by: Zhiqian Guan <zhguan@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20220211234819.612288-1-toke@redhat.com

9c3de619

11 Feb, 2022 6 commits

libbpf: Fix libbpf.map inheritance chain for LIBBPF_0.7.0 · d130e954

Andrii Nakryiko authored Feb 11, 2022

Ensure that LIBBPF_0.7.0 inherits everything from LIBBPF_0.6.0.

Fixes: dbdd2c7f ("libbpf: Add API to get/set log_level at per-program level")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211205235.2089104-1-andrii@kernel.org

d130e954

Merge branch 'bpftool: Switch to new versioning scheme (align on libbpf's)' · 4407fa06

Andrii Nakryiko authored Feb 10, 2022

Quentin Monnet says:

====================

Hi, this set aims at updating the way bpftool versions are numbered.
Instead of copying the version from the kernel (given that the sources for
the kernel and bpftool are shipped together), align it on libbpf's version
number, with a fixed offset (6) to avoid going backwards. Please refer to
the description of the second commit for details on the motivations.

The patchset also adds the number of the version of libbpf that was used to
compile to the output of "bpftool version". Bpftool makes such a heavy
usage of libbpf that it makes sense to indicate what version was used to
build it.

v3:
- Compute bpftool's version at compile time, but from the macros exposed by
  libbpf instead of calling a shell to compute $(BPFTOOL_VERSION) in the
  Makefile.
- Drop the commit which would add a "libbpfversion" target to libbpf's
  Makefile. This is no longer necessary.
- Use libbpf's major, minor versions with jsonw_printf() to avoid
  offsetting the version string to skip the "v" prefix.
- Reword documentation change.

v2:
- Align on libbpf's version number instead of creating an independent
  versioning scheme.
- Use libbpf_version_string() to retrieve and display libbpf's version.
- Re-order patches (1 <-> 2).
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

4407fa06

bpftool: Update versioning scheme, align on libbpf's version number · 9910a74d

Quentin Monnet authored Feb 10, 2022

Since the notion of versions was introduced for bpftool, it has been
following the version number of the kernel (using the version number
corresponding to the tree in which bpftool's sources are located). The
rationale was that bpftool's features are loosely tied to BPF features
in the kernel, and that we could defer versioning to the kernel
repository itself.

But this versioning scheme is confusing today, because a bpftool binary
should be able to work with both older and newer kernels, even if some
of its recent features won't be available on older systems. Furthermore,
if bpftool is ported to other systems in the future, keeping a
Linux-based version number is not a good option.

Looking at other options, we could either have a totally independent
scheme for bpftool, or we could align it on libbpf's version number
(with an offset on the major version number, to avoid going backwards).
The latter comes with a few drawbacks:

- We may want bpftool releases in-between two libbpf versions. We can
  always append pre-release numbers to distinguish versions, although
  those won't look as "official" as something with a proper release
  number. But at the same time, having bpftool with version numbers that
  look "official" hasn't really been an issue so far.

- If no new feature lands in bpftool for some time, we may move from
  e.g. 6.7.0 to 6.8.0 when libbpf levels up and have two different
  versions which are in fact the same.

- Following libbpf's versioning scheme sounds better than kernel's, but
  ultimately it doesn't make too much sense either, because even though
  bpftool uses the lib a lot, its behaviour is not that much conditioned
  by the internal evolution of the library (or by new APIs that it may
  not use).

Having an independent versioning scheme solves the above, but at the
cost of heavier maintenance. Developers will likely forget to increase
the numbers when adding features or bug fixes, and we would take the
risk of having to send occasional "catch-up" patches just to update the
version number.

Based on these considerations, this patch aligns bpftool's version
number on libbpf's. This is not a perfect solution, but 1) it's
certainly an improvement over the current scheme, 2) the issues raised
above are all minor at the moment, and 3) we can still move to an
independent scheme in the future if we realise we need it.

Given that libbpf is currently at version 0.7.0, and bpftool, before
this patch, was at 5.16, we use an offset of 6 for the major version,
bumping bpftool to 6.7.0. Libbpf does not export its patch number;
leave bpftool's patch number at 0 for now.

It remains possible to manually override the version number by setting
BPFTOOL_VERSION when calling make.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220210104237.11649-3-quentin@isovalent.com

9910a74d

bpftool: Add libbpf's version number to "bpftool version" output · 61fce969

Quentin Monnet authored Feb 10, 2022

To help users check what version of libbpf is being used with bpftool,
print the number along with bpftool's own version number.

Output:

    $ ./bpftool version
    ./bpftool v5.16.0
    using libbpf v0.7
    features: libbfd, libbpf_strict, skeletons

    $ ./bpftool version --json --pretty
    {
        "version": "5.16.0",
        "libbpf_version": "0.7",
        "features": {
            "libbfd": true,
            "libbpf_strict": true,
            "skeletons": true
        }
    }

Note that libbpf does not expose its patch number.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220210104237.11649-2-quentin@isovalent.com

61fce969

bpf: Fix bpf_prog_pack build for ppc64_defconfig · 4cc0991a

Song Liu authored Feb 10, 2022

bpf_prog_pack causes build error with powerpc ppc64_defconfig:

kernel/bpf/core.c:830:23: error: variably modified 'bitmap' at file scope
  830 |         unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)];
      |                       ^~~~~~

This is because the marco expands as:

unsigned long bitmap[((((((1UL) << (16 + __pte_index_size)) / (1 << 6))) \
     + ((sizeof(long) * 8)) - 1) / ((sizeof(long) * 8)))];

where __pte_index_size is a global variable.

Fix it by turning bitmap into a 0-length array.

Fixes: 57631054 ("bpf: Introduce bpf_prog_pack allocator")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211024939.2962537-1-song@kernel.org

4cc0991a

selftest/bpf: Check invalid length in test_xdp_update_frags · a5a358ab

Lorenzo Bianconi authored Feb 09, 2022

Update test_xdp_update_frags adding a test for a buffer size
set to (MAX_SKB_FRAGS + 2) * PAGE_SIZE. The kernel is supposed
to return -ENOMEM.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/3e4afa0ee4976854b2f0296998fe6754a80b62e5.1644366736.git.lorenzo@kernel.org

a5a358ab

10 Feb, 2022 30 commits

Merge branch 'bpf-light-skel' · 85fbd233

Daniel Borkmann authored Feb 10, 2022

Alexei Starovoitov says:

====================
The libbpf performs a set of complex operations to load BPF programs.
With "loader program" and "CO-RE in the kernel" the loading job of
libbpf was diminished. The light skeleton became lean enough to perform
program loading and map creation tasks without libbpf.
It's now possible to tweak it further to make light skeleton usable
out of user space and out of kernel module.
This allows bpf_preload.ko to drop user-mode-driver usage,
drop host compiler dependency, allow cross compilation and simplify the
code. It's a building block toward safe and portable kernel modules.

v3->v4:
- inlined skel_prep_init_value() as direct assignment in lskel

v2->v3:
- dropped vm_mmap() and switched to bpf_loader_ctx->flags & KERNEL approach.
  It allows bpf_preload.ko to be built-in.
  The kernel is able to load bpf progs before init process starts.
- added comments (Yonghong's review)
- added error checks in lskel (Andrii's review)
- added Acks in all but 2nd patch.

v1->v2:
- removed redundant anon struct and added comments (Andrii's reivew)
- added Yonghong's ack
- fixed build warning when JIT is off
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

85fbd233

bpf: Convert bpf_preload.ko to use light skeleton. · cb80ddc6

Alexei Starovoitov authored Feb 09, 2022

The main change is a move of the single line
  #include "iterators.lskel.h"
from iterators/iterators.c to bpf_preload_kern.c.
Which means that generated light skeleton can be used from user space or
user mode driver like iterators.c or from the kernel module or the kernel itself.
The direct use of light skeleton from the kernel module simplifies the code,
since UMD is no longer necessary. The libbpf.a required user space and UMD. The
CO-RE in the kernel and generated "loader bpf program" used by the light
skeleton are capable to perform complex loading operations traditionally
provided by libbpf. In addition UMD approach was launching UMD process
every time bpffs has to be mounted. With light skeleton in the kernel
the bpf_preload kernel module loads bpf iterators once and pins them
multiple times into different bpffs mounts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-6-alexei.starovoitov@gmail.com

cb80ddc6

bpf: Update iterators.lskel.h. · d7beb3d6

Alexei Starovoitov authored Feb 09, 2022

Light skeleton and skel_internal.h have changed.
Update iterators.lskel.h.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-5-alexei.starovoitov@gmail.com

d7beb3d6

bpftool: Generalize light skeleton generation. · 28d743f6

Alexei Starovoitov authored Feb 09, 2022

Generealize light skeleton by hiding mmap details in skel_internal.h
In this form generated lskel.h is usable both by user space and by the kernel.

Note that previously #include <bpf/bpf.h> was in *.lskel.h file.
To avoid #ifdef-s in a generated lskel.h the include of bpf.h is moved
to skel_internal.h, but skel_internal.h is also used by gen_loader.c
which is part of libbpf. Therefore skel_internal.h does #include "bpf.h"
in case of user space, so gen_loader.c and lskel.h have necessary definitions.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-4-alexei.starovoitov@gmail.com

28d743f6

libbpf: Prepare light skeleton for the kernel. · 6fe65f1b

Alexei Starovoitov authored Feb 09, 2022

Prepare light skeleton to be used in the kernel module and in the user space.
The look and feel of lskel.h is mostly the same with the difference that for
user space the skel->rodata is the same pointer before and after skel_load
operation, while in the kernel the skel->rodata after skel_open and the
skel->rodata after skel_load are different pointers.
Typical usage of skeleton remains the same for kernel and user space:
skel = my_bpf__open();
skel->rodata->my_global_var = init_val;
err = my_bpf__load(skel);
err = my_bpf__attach(skel);
// access skel->rodata->my_global_var;
// access skel->bss->another_var;
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-3-alexei.starovoitov@gmail.com

6fe65f1b

bpf: Extend sys_bpf commands for bpf_syscall programs. · b1d18a75

Alexei Starovoitov authored Feb 09, 2022

bpf_sycall programs can be used directly by the kernel modules
to load programs and create maps via kernel skeleton.
. Export bpf_sys_bpf syscall wrapper to be used in kernel skeleton.
. Export bpf_map_get to be used in kernel skeleton.
. Allow prog_run cmd for bpf_syscall programs with recursion check.
. Enable link_create and raw_tp_open cmds.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220209232001.27490-2-alexei.starovoitov@gmail.com

b1d18a75

net: dsa: qca8k: fix noderef.cocci warnings · 4f5e483b

kernel test robot authored Feb 10, 2022

drivers/net/dsa/qca8k.c:422:37-43: ERROR: application of sizeof to pointer

 sizeof when applied to a pointer typed expression gives the size of
 the pointer

Generated by: scripts/coccinelle/misc/noderef.cocci

Fixes: 90386223 ("net: dsa: qca8k: add support for larger read/write size with mgmt Ethernet")
CC: Ansuel Smith <ansuelsmth@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220209221304.GA17529@d2214a582157Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4f5e483b

net/switchdev: use struct_size over open coded arithmetic · d8c28581

Minghao Chi (CGEL ZTE) authored Feb 10, 2022

Replace zero-length array with flexible-array member and make use
of the struct_size() helper in kmalloc(). For example:

struct switchdev_deferred_item {
    ...
    unsigned long data[];
};

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi (CGEL ZTE) <chi.minghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

d8c28581

ipv4: Reject again rules with high DSCP values · dc513a40

Guillaume Nault authored Feb 10, 2022

Commit 563f8e97 ("ipv4: Stop taking ECN bits into account in
fib4-rules") replaced the validation test on frh->tos. While the new
test is stricter for ECN bits, it doesn't detect the use of high order
DSCP bits. This would be fine if IPv4 could properly handle them. But
currently, most IPv4 lookups are done with the three high DSCP bits
masked. Therefore, using these bits doesn't lead to the expected
result.

Let's reject such configurations again, so that nobody starts to
use and make any assumption about how the stack handles the three high
order DSCP bits in fib4 rules.

Fixes: 563f8e97 ("ipv4: Stop taking ECN bits into account in fib4-rules")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc513a40

octeontx2-pf: Add TC feature for VFs · 4b0385bc

Subbaraya Sundeep authored Feb 10, 2022

This patch adds TC feature for VFs also. When MCAM
rules are allocated for a VF then either TC or ntuple
filters can be used. Below are the commands to use
TC feature for a VF(say lbk0):

devlink dev param set pci/0002:01:00.1 name mcam_count value 16 \
 cmode runtime
ethtool -K lbk0 hw-tc-offload on
ifconfig lbk0 up
tc qdisc add dev lbk0 ingress
tc filter add dev lbk0 parent ffff: protocol ip flower skip_sw \
 dst_mac 98:03:9b:83:aa:12 action police rate 100Mbit burst 5000

Also to modify any fields of the hardware context with
NIX_AQ_INSTOP_WRITE command then corresponding masks of those
fields must be set as per hardware. This was missing in
ingress ratelimiting context. This patch sets those masks also.
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4b0385bc

net: make net->dev_unreg_count atomic · ede6c39c

Eric Dumazet authored Feb 09, 2022

Having to acquire rtnl from netdev_run_todo() for every dismantled
device is not desirable when/if rtnl is under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ede6c39c

qed: prevent a fw assert during device shutdown · ca2d5f1f

Venkata Sudheer Kumar Bhavaraju authored Feb 09, 2022

Device firmware can assert if the device shutdown path in driver
encounters an async. events from mfw (processed in
qed_mcp_handle_events()) after qed_mcp_unload_req() returns.
A call to qed_mcp_unload_req() currently marks the device as inactive
and thus stops any new events, but there is a windows where in-flight
events might still be received by the driver.

To prevent this race condition, atomically set QED_MCP_BYPASS_PROC_BIT
in qed_mcp_unload_req() to make sure qed_mcp_handle_events() ignores all
events. Wait for any event that might already be in-process to complete
by monitoring QED_MCP_IN_PROCESSING_BIT.
Signed-off-by: Pravin Kumar Ganesh Dhende <pdhende@marvell.com>
Signed-off-by: Venkata Sudheer Kumar Bhavaraju <vbhavaraju@marvell.com>
Signed-off-by: Alok Prasad <palok@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca2d5f1f

Merge branch 'ping6-cmsg' · 57ea56b0

David S. Miller authored Feb 10, 2022

Jakub Kicinski says:

====================
net: ping6: support basic socket cmsgs

Add support for common SOL_SOCKET cmsgs in ICMPv6 sockets.
Extend the cmsg tests to cover more cmsgs and socket types.

SOL_IPV6 cmsgs to follow.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

57ea56b0

selftests: net: test standard socket cmsgs across UDP and ICMP sockets · af6ca205

Jakub Kicinski authored Feb 09, 2022

Test TIMESTAMPING and TXTIME across UDP / ICMP and IP versions.

Before ICMPv6 support:
  # ./tools/testing/selftests/net/cmsg_time.sh
    Case ICMPv6  - ts cnt returned '0', expected '2'
    Case ICMPv6  - ts0 SCHED returned '', expected 'OK'
    Case ICMPv6  - ts0 SND returned '', expected 'OK'
    Case ICMPv6  - TXTIME abs returned '', expected 'OK'
    Case ICMPv6  - TXTIME rel returned '', expected 'OK'
  FAIL - 5/36 cases failed

After:
  # ./tools/testing/selftests/net/cmsg_time.sh
  OK
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

af6ca205

selftests: net: cmsg_sender: support Tx timestamping · eb8f3116

Jakub Kicinski authored Feb 09, 2022

Support requesting Tx timestamps:

 $ ./cmsg_sender -p i -t -4 $tgt 123 -d 1000
 SCHED ts0 61us
   SND ts0 1071us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb8f3116

selftests: net: cmsg_sender: support setting SO_TXTIME · 4d397424

Jakub Kicinski authored Feb 09, 2022

Add ability to send delayed packets.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d397424

selftests: net: cmsg_so_mark: test with SO_MARK set by setsockopt · 9bbfbc92

Jakub Kicinski authored Feb 09, 2022

Test if setting SO_MARK with setsockopt works and if cmsg
takes precedence over it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9bbfbc92

selftests: net: cmsg_so_mark: test ICMP and RAW sockets · 0344488e

Jakub Kicinski authored Feb 09, 2022

Use new capabilities of cmsg_sender to test ICMP and RAW sockets,
previously only UDP was tested.

Before SO_MARK support was added to ICMPv6:

  # ./cmsg_so_mark.sh
   Case ICMP rejection returned 0, expected 1
  FAIL - 1/12 cases failed

After:

  # ./cmsg_so_mark.sh
  OK
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

0344488e

selftests: net: cmsg_sender: support icmp and raw sockets · de17e305

Jakub Kicinski authored Feb 09, 2022

Support sending fake ICMP(v6) messages and UDP via RAW sockets.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

de17e305

selftests: net: make cmsg_so_mark ready for more options · 49b78613

Jakub Kicinski authored Feb 09, 2022

Parametrize the code so that it can support UDP and ICMP
sockets in the future, and more cmsg types.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

49b78613

selftests: net: rename cmsg_so_mark · a086ee24

Jakub Kicinski authored Feb 09, 2022

Rename the file in prep for generalization.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a086ee24

net: ping6: support setting socket options via cmsg · 3ebb0b10

Jakub Kicinski authored Feb 09, 2022

Minor reordering of the code and a call to sock_cmsg_send()
gives us support for setting the common socket options via
cmsg (the usual ones - SO_MARK, SO_TIMESTAMPING_OLD, SCM_TXTIME).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ebb0b10

net: ping6: support packet timestamping · e7b06046

Jakub Kicinski authored Feb 09, 2022

Nothing prevents the user from requesting timestamping
on ping6 sockets, yet timestamps are not going to be reported.
Plumb the flags through.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e7b06046

net: ping6: remove a pr_debug() statement · 42652239

Jakub Kicinski authored Feb 09, 2022

We have ftrace and BPF today, there's no need for printing arguments
at the start of a function.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

42652239

Merge tag 'ieee802154-for-davem-2022-02-10' of... · 9557167b

David S. Miller authored Feb 10, 2022

Merge tag 'ieee802154-for-davem-2022-02-10' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next

Stefan Schmidt says:

====================
pull-request: ieee802154-next 2022-02-10

An update from ieee802154 for your *net-next* tree.

There is more ongoing in ieee802154 than usual. This will be the first pull
request for this cycle, but I expect one more. Depending on review and rework
times.

Pavel Skripkin ported the atusb driver over to the new USB api to avoid unint
problems as well as making use of the modern api without kmalloc() needs in he
driver.

Miquel Raynal landed some changes to ensure proper frame checksum checking with
hwsim, documenting our use of wake and stop_queue and eliding a magic value by
using the proper define.

David Girault documented the address struct used in ieee802154.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9557167b

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · adc27288

David S. Miller authored Feb 10, 2022

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2022-02-09

This series contains updates to ice driver only.

Brett adds support for QinQ. This begins with code refactoring and
re-organization of VLAN configuration functions to allow for
introduction of VSI VLAN ops to enable setting and calling of
respective operations based on device support of single or double
VLANs. Implementations are added for outer VLAN support.

To support QinQ, the device must be set to double VLAN mode (DVM).
In order for this to occur, the DDP package and NVM must also support
DVM. Functions to determine compatibility and properly configure the
device are added as well as setting the proper bits to advertise and
utilize the proper offloads. Support for VIRTCHNL_VF_OFFLOAD_VLAN_V2
is also included to allow for VF to negotiate and utilize this
functionality.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

adc27288

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next · 45230829

Jakub Kicinski authored Feb 09, 2022

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

1) Conntrack sets on CHECKSUM_UNNECESSARY for UDP packet with no checksum,
   from Kevin Mitchell.

2) skb->priority support for nfqueue, from Nicolas Dichtel.

3) Remove conntrack extension register API, from Florian Westphal.

4) Move nat destroy hook to nf_nat_hook instead, to remove
   nf_ct_ext_destroy(), also from Florian.

5) Wrap pptp conntrack NAT hooks into single structure, from Florian Westphal.

6) Support for tcp option set to noop for nf_tables, also from Florian.

7) Do not run x_tables comment match from packet path in nf_tables,
   from Florian Westphal.

8) Replace spinlock by cmpxchg() loop to update missed ct event,
   from Florian Westphal.

9) Wrap cttimeout hooks into single structure, from Florian.

10) Add fast nft_cmp expression for up to 16-bytes.

11) Use cb->ctx to store context in ctnetlink dump, instead of using
    cb->args[], from Florian Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: ctnetlink: use dump structure instead of raw args
  nfqueue: enable to set skb->priority
  netfilter: nft_cmp: optimize comparison for 16-bytes
  netfilter: cttimeout: use option structure
  netfilter: ecache: don't use nf_conn spinlock
  netfilter: nft_compat: suppress comment match
  netfilter: exthdr: add support for tcp option removal
  netfilter: conntrack: pptp: use single option structure
  netfilter: conntrack: remove extension register api
  netfilter: conntrack: handle ->destroy hook via nat_ops instead
  netfilter: conntrack: move extension sizes into core
  netfilter: conntrack: make all extensions 8-byte alignned
  netfilter: nfqueue: enable to get skb->priority
  netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY
====================

Link: https://lore.kernel.org/r/20220209133616.165104-1-pablo@netfilter.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

45230829

tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH. · 4f9bf2a2

Sebastian Andrzej Siewior authored Feb 09, 2022

Commit
   9652dc2e ("tcp: relax listening_hash operations")

removed the need to disable bottom half while acquiring
listening_hash.lock. There are still two callers left which disable
bottom half before the lock is acquired.

On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves is also acquired with enabled bottom halves
followed by disabling bottom halves. This is the reverse locking order.
It has been observed with inet_listen_hashbucket::lock:

local_bh_disable() + spin_lock(&ilb->lock):
  inet_listen()
    inet_csk_listen_start()
      sk->sk_prot->hash() := inet_hash()
	local_bh_disable()
	__inet_hash()
	  spin_lock(&ilb->lock);
	    acquire(&ilb->lock);

Reverse order: spin_lock(&ilb2->lock) + local_bh_disable():
  tcp_seq_next()
    listening_get_next()
      spin_lock(&ilb2->lock);
	acquire(&ilb2->lock);

  tcp4_seq_show()
    get_tcp4_sock()
      sock_i_ino()
	read_lock_bh(&sk->sk_callback_lock);
	  acquire(softirq_ctrl)	// <---- whoops
	  acquire(&sk->sk_callback_lock)

Drop local_bh_disable() around __inet_hash() which acquires
listening_hash->lock. Split inet_unhash() and acquire the
listen_hashbucket lock without disabling bottom halves; the inet_ehash
lock with disabled bottom halves.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/12d6f9879a97cd56c09fb53dee343cbb14f7f1f7.camel@gmx.de
Link: https://lkml.kernel.org/r/X9CheYjuXWc75Spa@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/r/YgQOebeZ10eNx1W6@linutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4f9bf2a2

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 1127170d

Jakub Kicinski authored Feb 09, 2022

Daniel Borkmann says:

====================
pull-request: bpf-next 2022-02-09

We've added 126 non-merge commits during the last 16 day(s) which contain
a total of 201 files changed, 4049 insertions(+), 2215 deletions(-).

The main changes are:

1) Add custom BPF allocator for JITs that pack multiple programs into a huge
   page to reduce iTLB pressure, from Song Liu.

2) Add __user tagging support in vmlinux BTF and utilize it from BPF
   verifier when generating loads, from Yonghong Song.

3) Add per-socket fast path check guarding from cgroup/BPF overhead when
   used by only some sockets, from Pavel Begunkov.

4) Continued libbpf deprecation work of APIs/features and removal of their
   usage from samples, selftests, libbpf & bpftool, from Andrii Nakryiko
   and various others.

5) Improve BPF instruction set documentation by adding byte swap
   instructions and cleaning up load/store section, from Christoph Hellwig.

6) Switch BPF preload infra to light skeleton and remove libbpf dependency
   from it, from Alexei Starovoitov.

7) Fix architecture-agnostic macros in libbpf for accessing syscall
   arguments from BPF progs for non-x86 architectures,
   from Ilya Leoshkevich.

8) Rework port members in struct bpf_sk_lookup and struct bpf_sock to be
   of 16-bit field with anonymous zero padding, from Jakub Sitnicki.

9) Add new bpf_copy_from_user_task() helper to read memory from a different
   task than current. Add ability to create sleepable BPF iterator progs,
   from Kenny Yu.

10) Implement XSK batching for ice's zero-copy driver used by AF_XDP and
    utilize TX batching API from XSK buffer pool, from Maciej Fijalkowski.

11) Generate temporary netns names for BPF selftests to avoid naming
    collisions, from Hangbin Liu.

12) Implement bpf_core_types_are_compat() with limited recursion for
    in-kernel usage, from Matteo Croce.

13) Simplify pahole version detection and finally enable CONFIG_DEBUG_INFO_DWARF5
    to be selected with CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor.

14) Misc minor fixes to libbpf and selftests from various folks.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (126 commits)
  selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup
  bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide
  libbpf: Fix compilation warning due to mismatched printf format
  selftests/bpf: Test BPF_KPROBE_SYSCALL macro
  libbpf: Add BPF_KPROBE_SYSCALL macro
  libbpf: Fix accessing the first syscall argument on s390
  libbpf: Fix accessing the first syscall argument on arm64
  libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL
  selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390
  libbpf: Fix accessing syscall arguments on riscv
  libbpf: Fix riscv register names
  libbpf: Fix accessing syscall arguments on powerpc
  selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro
  libbpf: Add PT_REGS_SYSCALL_REGS macro
  selftests/bpf: Fix an endianness issue in bpf_syscall_macro test
  bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE
  bpf: Fix leftover header->pages in sparc and powerpc code.
  libbpf: Fix signedness bug in btf_dump_array_data()
  selftests/bpf: Do not export subtest as standalone test
  bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures
  ...
====================

Link: https://lore.kernel.org/r/20220209210050.8425-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1127170d

net: drop_monitor: support drop reason · 5cad527d

Menglong Dong authored Feb 09, 2022

In the commit c504e5c2 ("net: skb: introduce kfree_skb_reason()")
drop reason is introduced to the tracepoint of kfree_skb. Therefore,
drop_monitor is able to report the drop reason to users by netlink.

The drop reasons are reported as string to users, which is exactly
the same as what we do when reporting it to ftrace.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220209060838.55513-1-imagedong@tencent.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5cad527d

09 Feb, 2022 3 commits

Merge branch 'Split bpf_sk_lookup remote_port field' · e5313968

Alexei Starovoitov authored Feb 09, 2022

Jakub Sitnicki says:

====================

Following the recent split-up of the bpf_sock dst_port field, apply the same to
technique to the bpf_sk_lookup remote_port field to make uAPI more user
friendly.

v1 -> v2:
- Remove remote_port range check and cast to be16 in TEST_RUN for sk_lookup
  (kernel test robot)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

e5313968

selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup · 2ed0dc59

Jakub Sitnicki authored Feb 09, 2022

Extend the context access tests for sk_lookup prog to cover the surprising
case of a 4-byte load from the remote_port field, where the expected value
is actually shifted by 16 bits.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220209184333.654927-3-jakub@cloudflare.com

2ed0dc59

bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide · 9a69e2b3

Jakub Sitnicki authored Feb 09, 2022

remote_port is another case of a BPF context field documented as a 32-bit
value in network byte order for which the BPF context access converter
generates a load of a zero-padded 16-bit integer in network byte order.

First such case was dst_port in bpf_sock which got addressed in commit
4421a582 ("bpf: Make dst_port field in struct bpf_sock 16-bit wide").

Loading 4-bytes from the remote_port offset and converting the value with
bpf_ntohl() leads to surprising results, as the expected value is shifted
by 16 bits.

Reduce the confusion by splitting the field in two - a 16-bit field holding
a big-endian integer, and a 16-bit zero-padding anonymous field that
follows it.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220209184333.654927-2-jakub@cloudflare.com

9a69e2b3