Commits · 6e62702feb6d474e969b52f0379de93e9729e457 · Kirill Smelkov / linux

13 May, 2024 40 commits

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 6e62702f

Jakub Kicinski authored May 13, 2024

Daniel Borkmann says:

====================
pull-request: bpf-next 2024-05-13

We've added 119 non-merge commits during the last 14 day(s) which contain
a total of 134 files changed, 9462 insertions(+), 4742 deletions(-).

The main changes are:

1) Add BPF JIT support for 32-bit ARCv2 processors, from Shahab Vahedi.

2) Add BPF range computation improvements to the verifier in particular
   around XOR and OR operators, refactoring of checks for range computation
   and relaxing MUL range computation so that src_reg can also be an unknown
   scalar, from Cupertino Miranda.

3) Add support to attach kprobe BPF programs through kprobe_multi link in
   a session mode, meaning, a BPF program is attached to both function entry
   and return, the entry program can decide if the return program gets
   executed and the entry program can share u64 cookie value with return
   program. Session mode is a common use-case for tetragon and bpftrace,
   from Jiri Olsa.

4) Fix a potential overflow in libbpf's ring__consume_n() and improve libbpf
   as well as BPF selftest's struct_ops handling, from Andrii Nakryiko.

5) Improvements to BPF selftests in context of BPF gcc backend,
   from Jose E. Marchesi & David Faust.

6) Migrate remaining BPF selftest tests from test_sock_addr.c to prog_test-
   -style in order to retire the old test, run it in BPF CI and additionally
   expand test coverage, from Jordan Rife.

7) Big batch for BPF selftest refactoring in order to remove duplicate code
   around common network helpers, from Geliang Tang.

8) Another batch of improvements to BPF selftests to retire obsolete
   bpf_tcp_helpers.h as everything is available vmlinux.h,
   from Martin KaFai Lau.

9) Fix BPF map tear-down to not walk the map twice on free when both timer
   and wq is used, from Benjamin Tissoires.

10) Fix BPF verifier assumptions about socket->sk that it can be non-NULL,
    from Alexei Starovoitov.

11) Change BTF build scripts to using --btf_features for pahole v1.26+,
    from Alan Maguire.

12) Small improvements to BPF reusing struct_size() and krealloc_array(),
    from Andy Shevchenko.

13) Fix s390 JIT to emit a barrier for BPF_FETCH instructions,
    from Ilya Leoshkevich.

14) Extend TCP ->cong_control() callback in order to feed in ack and
    flag parameters and allow write-access to tp->snd_cwnd_stamp
    from BPF program, from Miao Xu.

15) Add support for internal-only per-CPU instructions to inline
    bpf_get_smp_processor_id() helper call for arm64 and riscv64 BPF JITs,
    from Puranjay Mohan.

16) Follow-up to remove the redundant ethtool.h from tooling infrastructure,
    from Tushar Vyavahare.

17) Extend libbpf to support "module:<function>" syntax for tracing
    programs, from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
  bpf: make list_for_each_entry portable
  bpf: ignore expected GCC warning in test_global_func10.c
  bpf: disable strict aliasing in test_global_func9.c
  selftests/bpf: Free strdup memory in xdp_hw_metadata
  selftests/bpf: Fix a few tests for GCC related warnings.
  bpf: avoid gcc overflow warning in test_xdp_vlan.c
  tools: remove redundant ethtool.h from tooling infra
  selftests/bpf: Expand ATTACH_REJECT tests
  selftests/bpf: Expand getsockname and getpeername tests
  sefltests/bpf: Expand sockaddr hook deny tests
  selftests/bpf: Expand sockaddr program return value tests
  selftests/bpf: Retire test_sock_addr.(c|sh)
  selftests/bpf: Remove redundant sendmsg test cases
  selftests/bpf: Migrate ATTACH_REJECT test cases
  selftests/bpf: Migrate expected_attach_type tests
  selftests/bpf: Migrate wildcard destination rewrite test
  selftests/bpf: Migrate sendmsg6 v4 mapped address tests
  selftests/bpf: Migrate sendmsg deny test cases
  selftests/bpf: Migrate WILDCARD_IP test
  selftests/bpf: Handle SYSCALL_EPERM and SYSCALL_ENOTSUPP test cases
  ...
====================

Link: https://lore.kernel.org/r/20240513134114.17575-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6e62702f

net: pcs: lynx: no need to read LPA in lynx_pcs_get_state_2500basex() · afd29f36

Vladimir Oltean authored May 13, 2024

Nothing useful is done with the LPA variable in lynx_pcs_get_state_2500basex(),
we can just remove the read.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240513115345.2452799-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

afd29f36

Merge branch 'mlx5-misc-patches' · d20e391c

Jakub Kicinski authored May 13, 2024

Tariq Toukan says:

====================
mlx5 misc patches

This series includes patches for the mlx5 driver.

Patch 1 by Shay enables LAG with HCAs of 8 ports.

Patch 2 by Carolina optimizes the safe switch channels operation for the
TX-only changes.

Patch 3 by Parav cleans up some unused code.
====================

Link: https://lore.kernel.org/r/20240512124306.740898-1-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d20e391c

net/mlx5: Remove unused msix related exported APIs · db5944e1

Parav Pandit authored May 12, 2024

MSIX irq allocation and free APIs are no longer
in use. Hence, remove the dead code.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://lore.kernel.org/r/20240512124306.740898-4-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

db5944e1

net/mlx5e: Modifying channels number and updating TX queues · bcee0937

Carolina Jubran authored May 12, 2024

It is not appropriate for the mlx5e_num_channels_changed
function to be called solely for updating the TX queues,
even if the channels number has not been changed.

Move the code responsible for updating the TC and TX queues
from mlx5e_num_channels_changed and produce a new function
called mlx5e_update_tc_and_tx_queues. This new function should
only be called when the channels number remains unchanged.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512124306.740898-3-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bcee0937

net/mlx5: Enable 8 ports LAG · e0e6adfe

Shay Drory authored May 12, 2024

This patch adds to mlx5 drivers support for 8 ports HCAs.
Starting with ConnectX-8 HCAs with 8 ports are possible.

As most driver parts aren't affected by such configuration most driver
code is unchanged.

Specially the only affected areas are:
- Lag
- Multiport E-Switch
- Single FDB E-Switch

All of the above are already factored in generic way, and LAG and VF LAG
are tested, so all that left is to change a #define and remove checks
which are no longer needed.
However, Multiport E-Switch is not tested yet, so it is left untouched.

This patch will allow to create hardware LAG/VF LAG when all 8 ports are
added to the same bond device.

for example, In order to activate the hardware lag a user can execute
the following:

ip link add bond0 type bond
ip link set bond0 type bond miimon 100 mode 2
ip link set eth2 master bond0
ip link set eth3 master bond0
ip link set eth4 master bond0
ip link set eth5 master bond0
ip link set eth6 master bond0
ip link set eth7 master bond0
ip link set eth8 master bond0
ip link set eth9 master bond0

Where eth2, eth3, eth4, eth5, eth6, eth7, eth8 and eth9 are the PFs of
the same HCA.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512124306.740898-2-tariqt@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e0e6adfe

test: hsr: Extend the hsr_redbox.sh to have more SAN devices connected · eafbf057

Lukasz Majewski authored May 10, 2024

After this change the single SAN device (ns3eth1) is now replaced with
two SAN devices - respectively ns4eth1 and ns5eth1.

It is possible to extend this script to have more SAN devices connected
by adding them to ns3br1 bridge.
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Link: https://lore.kernel.org/r/20240510143710.3916631-1-lukma@denx.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

eafbf057

Merge branch 'net-dsa-microchip-dcb-fixes' · ef318fc2

Jakub Kicinski authored May 13, 2024

Oleksij Rempel says:

====================
net: dsa: microchip: DCB fixes

This patch series address recommendation to rename IPV to IPM to avoid
confusion with IPV name used in 802.1Qci PSFP. And restores default "PCP
only" configuration as source of priorities to avoid possible
regressions.
====================

Link: https://lore.kernel.org/r/20240510053828.2412516-1-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ef318fc2

net: dsa: microchip: dcb: set default apptrust to PCP only · 01e400f2

Oleksij Rempel authored May 10, 2024

Before DCB support, the KSZ driver had only PCP as source of packet
priority values. To avoid regressions, make PCP only as default value.
User will need enable DSCP support manually.

This patch do not affect other KSZ8 related quirks. User will still be
warned by setting not support configurations for the port 2.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240510053828.2412516-4-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

01e400f2

net: dsa: microchip: dcb: add comments for DSCP related functions · 593d6ad1

Oleksij Rempel authored May 10, 2024

All other functions are commented. Add missing comments to following
functions:
ksz_set_global_dscp_entry()
ksz_port_add_dscp_prio()
ksz_port_del_dscp_prio()
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240510053828.2412516-3-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

593d6ad1

net: dsa: microchip: dcb: rename IPV to IPM · 2ccb1ac2

Oleksij Rempel authored May 10, 2024

IPV is added and used term in 802.1Qci PSFP and merged into 802.1Q (from
802.1Q-2018) for another functions.

Even it does similar operation holding temporal priority value
internally (as it is named), because KSZ datasheet doesn't use the term
of IPV (Internal Priority Value) and avoiding any confusion later when
PSFP is in the Linux world, it is better to rename IPV to IPM (Internal
Priority Mapping).

In addition, LAN937x documentation already use IPV for 802.1Qci PSFP
related functionality.
Suggested-by: Woojung Huh <Woojung.Huh@microchip.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Woojung Huh <woojung.huh@microchip.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240510053828.2412516-2-o.rempel@pengutronix.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2ccb1ac2

l2tp: Support different protocol versions with same IP/port quadruple · 36479805

Samuel Thibault authored May 09, 2024

628bc3e5 ("l2tp: Support several sockets with same IP/port quadruple")
added support for several L2TPv2 tunnels using the same IP/port quadruple,
but if an L2TPv3 socket exists it could eat all the trafic. We thus have to
first use the version from the packet to get the proper tunnel, and only
then check that the version matches.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Reviewed-by: James Chapman <jchapman@katalix.com>
Link: https://lore.kernel.org/r/20240509205812.4063198-1-samuel.thibault@ens-lyon.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

36479805

ynl: ensure exact-len value is resolved · ec8c2574

Antonio Quartulli authored May 11, 2024

For type String and Binary we are currently usinig the exact-len
limit value as is without attempting any name resolution.
However, the spec may specify the name of a constant rather than an
actual value, which would result in using the constant name as is
and thus break the policy.

Ensure the limit value is passed to get_limit(), which will always
attempt resolving the name before printing the policy rule.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Link: https://lore.kernel.org/r/20240510232202.24051-1-a@unstable.ccSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ec8c2574

Merge branch 'add-tx-stop-wake-counters' · e5a28026

Jakub Kicinski authored May 13, 2024

Daniel Jurgens says:

====================
Add TX stop/wake counters

Several drivers provide TX stop and wake counters via ethtool stats. Add
those to the netdev queue stats, and use them in virtio_net.
====================

Link: https://lore.kernel.org/r/20240510201927.1821109-1-danielj@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e5a28026

virtio_net: Add TX stopped and wake counters · c39add9b

Daniel Jurgens authored May 10, 2024

Add a tx queue stop and wake counters, they are useful for debugging.

$ ./tools/net/ynl/cli.py --spec netlink/specs/netdev.yaml \
--dump qstats-get --json '{"scope": "queue"}'
...
 {'ifindex': 13,
  'queue-id': 0,
  'queue-type': 'tx',
  'tx-bytes': 14756682850,
  'tx-packets': 226465,
  'tx-stop': 113208,
  'tx-wake': 113208},
 {'ifindex': 13,
  'queue-id': 1,
  'queue-type': 'tx',
  'tx-bytes': 18167675008,
  'tx-packets': 278660,
  'tx-stop': 8632,
  'tx-wake': 8632}]
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240510201927.1821109-3-danielj@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c39add9b

netdev: Add queue stats for TX stop and wake · b5603510

Daniel Jurgens authored May 10, 2024

TX queue stop and wake are counted by some drivers.
Support reporting these via netdev-genl queue stats.
Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://lore.kernel.org/r/20240510201927.1821109-2-danielj@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b5603510

tcp: socket option to check for MPTCP fallback to TCP · c084ebd7

Matthieu Baerts (NGI0) authored May 09, 2024

A way for an application to know if an MPTCP connection fell back to TCP
is to use getsockopt(MPTCP_INFO) and look for errors. The issue with
this technique is that the same errors -- EOPNOTSUPP (IPv4) and
ENOPROTOOPT (IPv6) -- are returned if there was a fallback, *or* if the
kernel doesn't support this socket option. The userspace then has to
look at the kernel version to understand what the errors mean.

It is not clean, and it doesn't take into account older kernels where
the socket option has been backported. A cleaner way would be to expose
this info to the TCP socket level. In case of MPTCP socket where no
fallback happened, the socket options for the TCP level will be handled
in MPTCP code, in mptcp_getsockopt_sol_tcp(). If not, that will be in
TCP code, in do_tcp_getsockopt(). So MPTCP simply has to set the value
1, while TCP has to set 0.

If the socket option is not supported, one of these two errors will be
reported:
- EOPNOTSUPP (95 - Operation not supported) for MPTCP sockets
- ENOPROTOOPT (92 - Protocol not available) for TCP sockets, e.g. on the
socket received after an 'accept()', when the client didn't request to
use MPTCP: this socket will be a TCP one, even if the listen socket
was an MPTCP one.

With this new option, the kernel can return a clear answer to both "Is
this kernel new enough to tell me the fallback status?" and "If it is
new enough, is it currently a TCP or MPTCP socket?" questions, while not
breaking the previous method.
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240509-upstream-net-next-20240509-mptcp-tcp_is_mptcp-v1-1-f846df999202@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c084ebd7

Merge branch 'net-gro-remove-network_header-use-move-p-flush-flush_id-calculations-to-l4' · e6e43570

Jakub Kicinski authored May 13, 2024

Richard Gobert says:

====================
net: gro: remove network_header use, move p->{flush/flush_id} calculations to L4

The cb fields network_offset and inner_network_offset are used instead of
skb->network_header throughout GRO.

These fields are then leveraged in the next commit to remove flush_id state
from napi_gro_cb, and stateful code in {ipv6,inet}_gro_receive which may be
unnecessarily complicated due to encapsulation support in GRO. These fields
are checked in L4 instead.

3rd patch adds tests for different flush_id flows in GRO.
====================

Link: https://lore.kernel.org/r/20240509190819.2985-1-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e6e43570

selftests/net: add flush id selftests · bc21faef

Richard Gobert authored May 09, 2024

Added flush id selftests to test different cases where DF flag is set or
unset and id value changes in the following packets. All cases where the
packets should coalesce or should not coalesce are tested.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-4-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bc21faef

net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment · 4b0ebbca

Richard Gobert authored May 09, 2024

{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used in
all merging UDP and TCP flows.

These checks need to be done only once and only against the found p skb,
since they only affect flush and not same_flow.

This patch leverages correct network header offsets from the cb for both
outer and inner network headers - allowing these checks to be done only
once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
more declarative and contained in inet_gro_flush, thus removing the need
for flush_id in napi_gro_cb.

This results in less parsing code for non-loop flush tests for TCP and UDP
flows.

To make sure results are not within noise range - I've made netfilter drop
all TCP packets, and measured CPU performance in GRO (in this case GRO is
responsible for about 50% of the CPU utilization).

perf top while replaying 64 parallel IP/TCP streams merging in GRO:
(gro_receive_network_flush is compiled inline to tcp_gro_receive)
net-next:
        6.94% [kernel] [k] inet_gro_receive
        3.02% [kernel] [k] tcp_gro_receive

patch applied:
        4.27% [kernel] [k] tcp_gro_receive
        4.22% [kernel] [k] inet_gro_receive

perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation, in this case inet_gro_receive is top
offender in net-next)
net-next:
        10.09% [kernel] [k] inet_gro_receive
        2.08% [kernel] [k] tcp_gro_receive

patch applied:
        6.97% [kernel] [k] inet_gro_receive
        3.68% [kernel] [k] tcp_gro_receive
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-3-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4b0ebbca

net: gro: use cb instead of skb->network_header · 186b1ea7

Richard Gobert authored May 09, 2024

This patch converts references of skb->network_header to napi_gro_cb's
network_offset and inner_network_offset.
Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240509190819.2985-2-richardbgobert@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

186b1ea7

Merge branch 'ena-driver-changes-may-2024' · 9af9b891

Jakub Kicinski authored May 13, 2024

David Arinzon says:

====================
ENA driver changes May 2024

This patchset contains several misc and minor
changes to the ENA driver.
====================

Link: https://lore.kernel.org/r/20240512134637.25299-1-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9af9b891

net: ena: Change initial rx_usec interval · 1cc0a47d

David Arinzon authored May 12, 2024

For the purpose of obtaining better CPU utilization,
minimum rx moderation interval is set to 20 usec.
Signed-off-by: Osama Abboud <osamaabb@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-6-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1cc0a47d

net: ena: Changes around strscpy calls · 97776caf

David Arinzon authored May 12, 2024

strscpy copies as much of the string as possible,
meaning that the destination string will be truncated
in case of no space. As this is a non-critical error in
our case, adding a debug level print for indication.

This patch also removes a -1 which was added to ensure
enough space for NUL, but strscpy destination string is
guaranteed to be NUL-terminted, therefore, the -1 is
not needed.
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-5-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

97776caf

net: ena: Add validation for completion descriptors consistency · b37b98a3

David Arinzon authored May 12, 2024

Validate that `first` flag is set only for the first
descriptor in multi-buffer packets.
In case of an invalid descriptor, a reset will occur.
A new reset reason for RX data corruption has been added.
Signed-off-by: Shahar Itzko <itzko@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-4-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b37b98a3

net: ena: Reduce holes in ena_com structures · 48673ef4

David Arinzon authored May 12, 2024

This patch makes two changes in order to fill holes and
reduce ther overall size of the structures ena_com_dev
and ena_com_rx_ctx.
Signed-off-by: Shahar Itzko <itzko@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-3-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

48673ef4

net: ena: Add a counter for driver's reset failures · 62a261f6

David Arinzon authored May 12, 2024

This patch adds a counter to the ena_adapter struct in
order to keep track of reset failures.
The counter is incremented every time either ena_restore_device()
or ena_destroy_device() fail.
Signed-off-by: Osama Abboud <osamaabb@amazon.com>
Signed-off-by: David Arinzon <darinzon@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512134637.25299-2-darinzon@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

62a261f6

selftests: netfilter: nft_flowtable.sh: bump socat timeout to 1m · 5fcc17df

Florian Westphal authored May 11, 2024

Now that this test runs in netdev CI it looks like 10s isn't enough
for debug kernels:
  selftests: net/netfilter: nft_flowtable.sh
  2024/05/10 20:33:08 socat[12204] E write(7, 0x563feb16a000, 8192): Broken pipe
  FAIL: file mismatch for ns1 -> ns2
  -rw------- 1 root root 37345280 May 10 20:32 /tmp/tmp.Am0yEHhNqI
 ...

Looks like socat gets zapped too quickly, so increase timeout to 1m.

Could also reduce tx file size for KSFT_MACHINE_SLOW, but its preferrable
to have same test for both debug and nondebug.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240511064814.561525-1-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

5fcc17df

selftests: net: use upstream mtools · cfc2eefd

Vladimir Oltean authored May 10, 2024

Joachim kindly merged the IPv6 support in
https://github.com/troglobit/mtools/pull/2, so we can just use his
version now. A few more fixes subsequently came in for IPv6, so even
better.

Check that the deployed mtools version is 3.0 or above. Note that the
version check breaks compatibility with my fork where I didn't bump the
version, but I assume that won't be a problem.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240510112856.1262901-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

cfc2eefd

selftest: epoll_busy_poll: Fix spelling mistake "couldnt" -> "couldn't" · f37dc28a

Colin Ian King authored May 10, 2024

There is a spelling mistake in a TH_LOG message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240510084811.3299685-1-colin.i.king@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f37dc28a

net: phy: air_en8811h: reset netdev rules when LED is set manually · 87bfdbbb

Daniel Golle authored May 09, 2024

Setting LED_OFF via brightness_set should deactivate hw control, so make
sure netdev trigger rules also get cleared in that case.
This fixes unwanted restoration of the default netdev trigger rules and
matches the behaviour when using the 'netdev' trigger without any
hardware offloading.

Fixes: 71e79430 ("net: phy: air_en8811h: Add the Airoha EN8811H PHY driver")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://lore.kernel.org/r/5ed8ea615890a91fa4df59a7ae8311bbdf63cdcf.1715248281.git.daniel@makrotopia.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

87bfdbbb

Merge tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next · c85e41bf

Jakub Kicinski authored May 13, 2024

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

Patch #1 skips transaction if object type provides no .update interface.

Patch #2 skips NETDEV_CHANGENAME which is unused.

Patch #3 enables conntrack to handle Multicast Router Advertisements and
	 Multicast Router Solicitations from the Multicast Router Discovery
	 protocol (RFC4286) as untracked opposed to invalid packets.
	 From Linus Luessing.

Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of
	 dropping them, from Jason Xing.

Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0,
	 also from Jason.

Patch #6 removes reference in netfilter's sysctl documentation on pickup
	 entries which were already removed by Florian Westphal.

Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which
	 allows to evict entries from the conntrack table,
	 also from Florian.

Patches #8 to #16 updates nf_tables pipapo set backend to allocate
	 the datastructure copy on-demand from preparation phase,
	 to better deal with OOM situations where .commit step is too late
	 to fail. Series from Florian Westphal.

Patch #17 adds a selftest with packetdrill to cover conntrack TCP state
	 transitions, also from Florian.

Patch #18 use GFP_KERNEL to clone elements from control plane to avoid
	 quick atomic reserves exhaustion with large sets, reporter refers
	 to million entries magnitude.

* tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: allow clone callbacks to sleep
  selftests: netfilter: add packetdrill based conntrack tests
  netfilter: nft_set_pipapo: remove dirty flag
  netfilter: nft_set_pipapo: move cloning of match info to insert/removal path
  netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone
  netfilter: nft_set_pipapo: merge deactivate helper into caller
  netfilter: nft_set_pipapo: prepare walk function for on-demand clone
  netfilter: nft_set_pipapo: prepare destroy function for on-demand clone
  netfilter: nft_set_pipapo: make pipapo_clone helper return NULL
  netfilter: nft_set_pipapo: move prove_locking helper around
  netfilter: conntrack: remove flowtable early-drop test
  netfilter: conntrack: documentation: remove reference to non-existent sysctl
  netfilter: use NF_DROP instead of -NF_DROP
  netfilter: conntrack: dccp: try not to drop skb in conntrack
  netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery
  netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler
  netfilter: nf_tables: skip transaction if update object is not implemented
====================

Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c85e41bf

bpf: make list_for_each_entry portable · ba39486d

Jose E. Marchesi authored May 11, 2024

[Changes from V1:
- The __compat_break has been abandoned in favor of
  a more readable can_loop macro that can be used anywhere, including
  loop conditions.]

The macro list_for_each_entry is defined in bpf_arena_list.h as
follows:

  #define list_for_each_entry(pos, head, member)				\
	for (void * ___tmp = (pos = list_entry_safe((head)->first,		\
						    typeof(*(pos)), member),	\
			      (void *)0);					\
	     pos && ({ ___tmp = (void *)pos->member.next; 1; });		\
	     cond_break,							\
	     pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member))

The macro cond_break, in turn, expands to a statement expression that
contains a `break' statement.  Compound statement expressions, and the
subsequent ability of placing statements in the header of a `for'
loop, are GNU extensions.

Unfortunately, clang implements this GNU extension differently than
GCC:

- In GCC the `break' statement is bound to the containing "breakable"
  context in which the defining `for' appears.  If there is no such
  context, GCC emits a warning: break statement without enclosing `for'
  o `switch' statement.

- In clang the `break' statement is bound to the defining `for'.  If
  the defining `for' is itself inside some breakable construct, then
  clang emits a -Wgcc-compat warning.

This patch adds a new macro can_loop to bpf_experimental, that
implements the same logic than cond_break but evaluates to a boolean
expression.  The patch also changes all the current instances of usage
of cond_break withing the header of loop accordingly.

Tested in bpf-next master.
No regressions.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Link: https://lore.kernel.org/r/20240511212243.23477-1-jose.marchesi@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

ba39486d

bpf: ignore expected GCC warning in test_global_func10.c · 6a2f786e

Jose E. Marchesi authored May 11, 2024

The BPF selftest global_func10 in progs/test_global_func10.c contains:

  struct Small {
  	long x;
  };

  struct Big {
  	long x;
  	long y;
  };

  [...]

  __noinline int foo(const struct Big *big)
  {
	if (!big)
		return 0;

	return bpf_get_prandom_u32() < big->y;
  }

  [...]

  SEC("cgroup_skb/ingress")
  __failure __msg("invalid indirect access to stack")
  int global_func10(struct __sk_buff *skb)
  {
	const struct Small small = {.x = skb->len };

	return foo((struct Big *)&small) ? 1 : 0;
  }

GCC emits a "maybe uninitialized" warning for the code above, because
it knows `foo' accesses `big->y'.

Since the purpose of this selftest is to check that the verifier will
fail on this sort of invalid memory access, this patch just silences
the compiler warning.

Tested in bpf-next master.
No regressions.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240511212349.23549-1-jose.marchesi@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

6a2f786e

bpf: disable strict aliasing in test_global_func9.c · 73868988

Jose E. Marchesi authored May 11, 2024

The BPF selftest test_global_func9.c performs type punning and breaks
srict-aliasing rules.

In particular, given:

  int global_func9(struct __sk_buff *skb)
  {
	int result = 0;

	[...]
	{
		const struct C c = {.x = skb->len, .y = skb->family };

		result |= foo((const struct S *)&c);
	}
  }

When building with strict-aliasing enabled (the default) the
initialization of `c' gets optimized away in its entirely:

	[... no initialization of `c' ...]
	r1 = r10
	r1 += -40
	call	foo
	w0 |= w6

Since GCC knows that `foo' accesses s->x, we get a "maybe
uninitialized" warning.

On the other hand, when strict-aliasing is disabled GCC only optimizes
away the store to `.y':

	r1 = *(u32 *) (r6+0)
	*(u32 *) (r10+-40) = r1  ; This is .x = skb->len in `c'
	r1 = r10
	r1 += -40
	call	foo
	w0 |= w6

In this case the warning is not emitted, because s-> is initialized.

This patch disables strict aliasing in this test when building with
GCC.  clang seems to not optimize this particular code even when
strict aliasing is enabled.

Tested in bpf-next master.
Signed-off-by: Jose E. Marchesi <jose.marchesi@oracle.com>
Cc: david.faust@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240511212213.23418-1-jose.marchesi@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

73868988

selftests/bpf: Free strdup memory in xdp_hw_metadata · a3c1c955

Geliang Tang authored May 11, 2024

The strdup() function returns a pointer to a new string which is a
duplicate of the string "ifname". Memory for the new string is obtained
with malloc(), and need to be freed with free().

This patch adds this missing "free(saved_hwtstamp_ifname)" in cleanup()
to avoid a potential memory leak in xdp_hw_metadata.c.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/af9bcccb96655e82de5ce2b4510b88c9c8ed5ed0.1715417367.git.tanggeliang@kylinos.cnSigned-off-by: Alexei Starovoitov <ast@kernel.org>

a3c1c955

selftests/bpf: Fix a few tests for GCC related warnings. · 5ddafcc3

Cupertino Miranda authored May 10, 2024

This patch corrects a few warnings to allow selftests to compile for
GCC.

-- progs/cpumask_failure.c --

progs/bpf_misc.h:136:22: error: ‘cpumask’ is used uninitialized
[-Werror=uninitialized]
  136 | #define __sink(expr) asm volatile("" : "+g"(expr))
      |                      ^~~
progs/cpumask_failure.c:68:9: note: in expansion of macro ‘__sink’
   68 |         __sink(cpumask);

The macro __sink(cpumask) with the '+' contraint modifier forces the
the compiler to expect a read and write from cpumask. GCC detects
that cpumask is never initialized and reports an error.
This patch removes the spurious non required definitions of cpumask.

-- progs/dynptr_fail.c --

progs/dynptr_fail.c:1444:9: error: ‘ptr1’ may be used uninitialized
[-Werror=maybe-uninitialized]
 1444 |         bpf_dynptr_clone(&ptr1, &ptr2);

Many of the tests in the file are related to the detection of
uninitialized pointers by the verifier. GCC is able to detect possible
uninitialized values, and reports this as an error.
The patch initializes all of the previous uninitialized structs.

-- progs/test_tunnel_kern.c --

progs/test_tunnel_kern.c:590:9: error: array subscript 1 is outside
array bounds of ‘struct geneve_opt[1]’ [-Werror=array-bounds=]
  590 |         *(int *) &gopt.opt_data = bpf_htonl(0xdeadbeef);
      |         ^~~~~~~~~~~~~~~~~~~~~~~
progs/test_tunnel_kern.c:575:27: note: at offset 4 into object ‘gopt’ of
size 4
  575 |         struct geneve_opt gopt;

This tests accesses beyond the defined data for the struct geneve_opt
which contains as last field "u8 opt_data[0]" which clearly does not get
reserved space (in stack) in the function header. This pattern is
repeated in ip6geneve_set_tunnel and geneve_set_tunnel functions.
GCC is able to see this and emits a warning.
The patch introduces a local struct that allocates enough space to
safely allow the write to opt_data field.

-- progs/jeq_infer_not_null_fail.c --

progs/jeq_infer_not_null_fail.c:21:40: error: array subscript ‘struct
bpf_map[0]’ is partly outside array bounds of ‘struct <anonymous>[1]’
[-Werror=array-bounds=]
   21 |         struct bpf_map *inner_map = map->inner_map_meta;
      |                                        ^~
progs/jeq_infer_not_null_fail.c:14:3: note: object ‘m_hash’ of size 32
   14 | } m_hash SEC(".maps");

This example defines m_hash in the context of the compilation unit and
casts it to struct bpf_map which is much smaller than the size of struct
bpf_map. It errors out in GCC when it attempts to access an element that
would be defined in struct bpf_map outsize of the defined limits for
m_hash.
This patch disables the warning through a GCC pragma.

This changes were tested in bpf-next master selftests without any
regressions.
Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Cc: jose.marchesi@oracle.com
Cc: david.faust@oracle.com
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/r/20240510183850.286661-2-cupertino.miranda@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

5ddafcc3

bpf: avoid gcc overflow warning in test_xdp_vlan.c · 792a04be

David Faust authored May 08, 2024

This patch fixes an integer overflow warning raised by GCC in
xdp_prognum1 of progs/test_xdp_vlan.c:

  GCC-BPF  [test_maps] test_xdp_vlan.bpf.o
progs/test_xdp_vlan.c: In function 'xdp_prognum1':
progs/test_xdp_vlan.c:163:25: error: integer overflow in expression
 '(short int)(((__builtin_constant_p((int)vlan_hdr->h_vlan_TCI)) != 0
   ? (int)(short unsigned int)((short int)((int)vlan_hdr->h_vlan_TCI
   << 8 >> 8) << 8 | (short int)((int)vlan_hdr->h_vlan_TCI << 0 >> 8
   << 0)) & 61440 : (int)__builtin_bswap16(vlan_hdr->h_vlan_TCI)
   & 61440) << 8 >> 8) << 8' of type 'short int' results in '0' [-Werror=overflow]
  163 |                         bpf_htons((bpf_ntohs(vlan_hdr->h_vlan_TCI) & 0xf000)
      |                         ^~~~~~~~~

The problem lies with the expansion of the bpf_htons macro and the
expression passed into it.  The bpf_htons macro (and similarly the
bpf_ntohs macro) expand to a ternary operation using either
__builtin_bswap16 or ___bpf_swab16 to swap the bytes, depending on
whether the expression is constant.

For an expression, with 'value' as a u16, like:

  bpf_htons (value & 0xf000)

The entire (value & 0xf000) is 'x' in the expansion of ___bpf_swab16
and we get as one part of the expanded swab16:

  ((__u16)(value & 0xf000) << 8 >> 8 << 8

This will always evaluate to 0, which is intentional since this
subexpression deals with the byte guaranteed to be 0 by the mask.

However, GCC warns because the precise reason this always evaluates to 0
is an overflow.  Specifically, the plain 0xf000 in the expression is a
signed 32-bit integer, which causes 'value' to also be promoted to a
signed 32-bit integer, and the combination of the 8-bit left shift and
down-cast back to __u16 results in a signed overflow (really a 'warning:
overflow in conversion from int to __u16' which is propegated up through
the rest of the expression leading to the ultimate overflow warning
above), which is a valid warning despite being the intended result of
this code.

Clang does not warn on this case, likely because it performs constant
folding later in the compilation process relative to GCC.  It seems that
by the time clang does constant folding for this expression, the side of
the ternary with this overflow has already been discarded.

Fortunately, this warning is easily silenced by simply making the 0xf000
mask explicitly unsigned.  This has no impact on the result.
Signed-off-by: David Faust <david.faust@oracle.com>
Cc: jose.marchesi@oracle.com
Cc: cupertino.miranda@oracle.com
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240508193512.152759-1-david.faust@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

792a04be

tools: remove redundant ethtool.h from tooling infra · bbe91a9f

Tushar Vyavahare authored May 08, 2024

Remove the redundant ethtool.h header file from tools/include/uapi/linux.
The file is unnecessary as the system uses the kernel's
include/uapi/linux/ethtool.h directly.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240508104123.434769-1-tushar.vyavahare@intel.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

bbe91a9f

Merge branch 'retire-progs-test_sock_addr' · e9dd2290

Alexei Starovoitov authored May 12, 2024

Jordan Rife says:

====================
Retire progs/test_sock_addr.c

This patch series migrates remaining tests from bpf/test_sock_addr.c to
prog_tests/sock_addr.c and progs/verifier_sock_addr.c in order to fully
retire the old-style test program and expands test coverage to test
previously untested scenarios related to sockaddr hooks.

This is a continuation of the work started recently during the expansion
of prog_tests/sock_addr.c.

Link: https://lore.kernel.org/bpf/20240429214529.2644801-1-jrife@google.com/T/#u

=======
Patches
=======
* Patch 1 moves tests that check valid return values for recvmsg hooks
  into progs/verifier_sock_addr.c, a new addition to the verifier test
  suite.
* Patches 2-5 lay the groundwork for test migration, enabling
  prog_tests/sock_addr.c to handle more test dimensions.
* Patches 6-11 move existing tests to prog_tests/sock_addr.c.
* Patch 12 removes some redundant test cases.
* Patches 14-17 expand on existing test coverage.
====================

Link: https://lore.kernel.org/r/20240510190246.3247730-1-jrife@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

e9dd2290