Commits · 7de3c2218eed77ed4771521459a18bef938f7089 · Kirill Smelkov / linux

29 Mar, 2024 38 commits

bnxt_en: Add a timeout parameter to bnxt_hwrm_port_ts_query() · 7de3c221

Michael Chan authored Mar 25, 2024

The caller can pass this new timeout parameter to the function to
specify the firmware timeout value when requesting the TX timestamp
from the firmware. This will allow the caller to precisely control
the timeout and will be used in the next patch. In this patch, the
parameter is 0 which means to use the current default value.

Cc: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://lore.kernel.org/r/20240325222902.220712-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7de3c221

Merge branch 'fix-missing-phy-to-mac-rx-clock' · 7f9d82a0

Jakub Kicinski authored Mar 28, 2024

Romain Gantois says:

====================
Fix missing PHY-to-MAC RX clock

There is an issue with some stmmac/PHY combinations that has been reported
some time ago in a couple of different series:

Clark Wang's report:
https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/
Clément Léger's report:
https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/

Stmmac controllers require an RX clock signal from the MII bus to perform
their hardware initialization successfully. This causes issues with some
PHY/PCS devices. If these devices do not bring the clock signal up before
the MAC driver initializes its hardware, then said initialization will
fail. This can happen at probe time or when the system wakes up from a
suspended state.

This series introduces new flags for phy_device and phylink_pcs. These
flags allow MAC drivers to signal to PHY/PCS drivers that the RX clock
signal should be enabled as soon as possible, and that it should always
stay enabled.

I have included specific uses of these flags that fix the RZN1 GMAC1 stmmac
driver that I am currently working on and that is not yet upstream. I have
also included changes to the at803x PHY driver that should fix the issue
that Clark Wang was having.
====================

Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-0-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7f9d82a0

net: pcs: rzn1-miic: Init RX clock early if MAC requires it · 0f671b3b

Romain Gantois authored Mar 26, 2024

The GMAC1 controller in the RZN1 IP requires the RX MII clock signal to be
started before it initializes its own hardware, thus before it calls
phylink_start.

Implement the pcs_pre_init() callback so that the RX clock signal can be
enabled early if necessary.
Reported-by: Clément Léger <clement.leger@bootlin.com>
Link: https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-7-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0f671b3b

net: phy: qcom: at803x: Avoid hibernating if MAC requires RX clock · 30dc5873

Russell King (Oracle) authored Mar 26, 2024

Stmmac controllers connected to an at803x PHY cannot resume properly after
suspend when WoL is enabled. This happens because the MAC requires an RX
clock generated by the PHY to initialize its hardware properly. But the RX
clock is cut when the PHY suspends and isn't brought up until the MAC
driver resumes the phylink.

Prevent the at803x PHY driver from going into suspend if the attached MAC
driver always requires an RX clock signal.
Reported-by: Clark Wang <xiaoning.wang@nxp.com>
Link: https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[rgantois: commit log]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-6-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

30dc5873

net: stmmac: Signal to PHY/PCS drivers to keep RX clock on · 58329b03

Romain Gantois authored Mar 26, 2024

There is a recurring issue with stmmac controllers where the MAC fails to
initialize its hardware if an RX clock signal isn't provided on the MAC/PHY
link.

This causes issues when PHY or PCS devices either go into suspend while
cutting the RX clock or do not bring the clock signal up early enough for
the MAC to initialize successfully.

Set the mac_requires_rxc flag in the stmmac phylink config so that PHY/PCS
drivers know to keep the RX clock up at all times.
Reported-by: Clark Wang <xiaoning.wang@nxp.com>
Link: https://lore.kernel.org/all/20230202081559.3553637-1-xiaoning.wang@nxp.com/
Reported-by: Clément Léger <clement.leger@bootlin.com>
Link: https://lore.kernel.org/linux-arm-kernel/20230116103926.276869-4-clement.leger@bootlin.com/
Co-developed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-5-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

58329b03

net: stmmac: Support a generic PCS field in mac_device_info · f7bff228

Romain Gantois authored Mar 26, 2024

Global stmmac support for early initialization of PCS devices requires a
generic PCS reference that can be passed to phylink_pcs_pre_init().
Currently, the mac_device_info struct contains only one PCS field, which is
specific to the Lynx model.

As PCS models are hardware-specific, it is more appropriate to have a
generic PCS field in the mac_device_info struct.

Refactor the lynx_pcs field into a generic phylink_pcs field.
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-4-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f7bff228

net: stmmac: don't rely on lynx_pcs presence to check for a PHY · 10658e99

Maxime Chevallier authored Mar 26, 2024

When initializing attached PHYs, there are some cases where we don't expect
any PHY to be connected. The logic uses conditions based on various local
PCS configuration, but also calls into phylink_expects_phy() via
stmmac_init_phy(), which is enough to ensure we don't try to initialize a
PHY when using a Lynx PCS, as long as we have the phy_interface set to an
802.3z mode and are using inband negotiation.

Drop the lynx check, making the stmmac generic code more pcs_lynx-agnostic.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
[rgantois: commit log]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-3-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

10658e99

net: phylink: add rxc_always_on flag to phylink_pcs · dceb393a

Romain Gantois authored Mar 26, 2024

Some MAC drivers (e.g. stmmac) require a continuous receive clock signal to
be generated by a PCS that is handled by a standalone PCS driver.

Such a PCS driver does not have access to a PHY device, and thus cannot check
the PHY_F_RXC_ALWAYS_ON flag. It cannot check mac_requires_rxc in the
phylink config either, since it is a private member. Therefore, a new flag
is needed to signal to the PCS that it should keep the RX clock signal up
at all times.
Co-developed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-2-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dceb393a

net: phylink: add PHY_F_RXC_ALWAYS_ON to PHY dev flags · 21d9ba5b

Russell King (Oracle) authored Mar 26, 2024

Some MAC controllers (e.g. stmmac) require their connected PHY to
continuously provide a receive clock signal. This can cause issues in two
cases:

  1. The clock signal hasn't been started yet by the time the MAC driver
     initializes its hardware. This can make the initialization fail, as in
     the case of the rzn1 GMAC1 driver.
  2. The clock signal is cut during a power saving event. By the time the
     MAC is brought back up, the clock signal is still not active since
     phylink_start hasn't been called yet. This brings us back to case 1.

If a PHY driver reads this flag, it should ensure that the receive clock
signal is started as soon as possible, and that it isn't brought down when
the PHY goes into suspend.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[rgantois: commit log]
Signed-off-by: Romain Gantois <romain.gantois@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240326-rxc_bugfix-v6-1-24a74e5c761f@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

21d9ba5b

Merge branch 'compiler_types-add-endianness-dependent-__counted_by_-le-be' · af352c3b

Jakub Kicinski authored Mar 28, 2024

Alexander Lobakin says:

====================
compiler_types: add Endianness-dependent __counted_by_{le,be}

Some structures contain flexible arrays at the end and the counter for
them, but the counter has explicit Endianness and thus __counted_by()
can't be used directly.

To increase test coverage for potential problems without breaking
anything, introduce __counted_by_{le,be} defined depending on platform's
Endianness to either __counted_by() when applicable or noop otherwise.
The first user will be virtchnl2.h from idpf just as example with 9 flex
structures having Little Endian counters.

Maybe it would be a good idea to introduce such attributes on compiler
level if possible, but for now let's stop on what we have.
====================

Link: https://lore.kernel.org/r/20240327142241.1745989-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

af352c3b

idpf: sprinkle __counted_by{,_le}() in the virtchnl2 header · 93d24acf

Alexander Lobakin authored Mar 27, 2024

Both virtchnl2.h and its consumer idpf_virtchnl.c are very error-prone.
There are 10 structures with flexible arrays at the end, but 9 of them
have the flex member counter in Little Endian.
Make the code a bit more robust by applying __counted_by_le() to those
9. LE platforms are the main target for this driver, so they would
receive additional protection.
While we're here, add __counted_by() to virtchnl2_ptype::proto_id, as
its counter is `u8` regardless of the Endianness.
Compile test on x86_64 (LE) didn't reveal any new issues after applying
the attributes.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20240327142241.1745989-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

93d24acf

idpf: make virtchnl2.h self-contained · c00d33f1

Alexander Lobakin authored Mar 27, 2024

To ease maintenance of virtchnl2.h, which already is messy enough,
make it self-contained by adding missing if_ether.h include due to
%ETH_ALEN usage.
At the same time, virtchnl2_lan_desc.h is not used anywhere in the
file, so move this include to idpf_txrx.h to speed up C preprocessing.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20240327142241.1745989-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c00d33f1

compiler_types: add Endianness-dependent __counted_by_{le,be} · ca7e324e

Alexander Lobakin authored Mar 27, 2024

Some structures contain flexible arrays at the end and the counter for
them, but the counter has explicit Endianness and thus __counted_by()
can't be used directly.

To increase test coverage for potential problems without breaking
anything, introduce __counted_by_{le,be}() defined depending on
platform's Endianness to either __counted_by() when applicable or noop
otherwise.
Maybe it would be a good idea to introduce such attributes on compiler
level if possible, but for now let's stop on what we have.
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20240327142241.1745989-2-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ca7e324e
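
As an illustration of why a counter with explicit Endianness cannot be handed to __counted_by() on every platform, here is a small Python sketch (a model, not kernel code). It assumes a hypothetical 16-bit Little Endian counter, and the helper name counted_by_le_applies is made up for this example; it only mimics the described behavior, where __counted_by_le() resolves to __counted_by() on LE hosts and to a noop otherwise.

```python
import struct
import sys

# A flex-array counter of 3 elements, stored as a Little Endian 16-bit value
# (hypothetical field width, chosen only for this illustration).
count = 3
raw = struct.pack("<H", count)

# An LE host reading the field natively sees the real element count.
as_le_host = int.from_bytes(raw, "little")

# A BE host reading the same bytes natively sees a byte-swapped value,
# so the raw field is not a usable bound there.
as_be_host = int.from_bytes(raw, "big")


def counted_by_le_applies(byteorder=sys.byteorder):
    """Model of the described rule: the annotation is only active when the
    host byte order matches the counter's byte order."""
    return byteorder == "little"


print(as_le_host, as_be_host)
print(counted_by_le_applies("little"), counted_by_le_applies("big"))
```

On a Big Endian host the natively read value no longer matches the element count, which is why the annotation degrades to a noop there instead of feeding a wrong bound to the compiler.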

net: remove gfp_mask from napi_alloc_skb() · 6e9b0190

Jakub Kicinski authored Mar 26, 2024

__napi_alloc_skb() is napi_alloc_skb() with the added flexibility
of choosing gfp_mask. This is a NAPI function, so GFP_ATOMIC is
implied. The only practical choice the caller has is whether to
set __GFP_NOWARN. But that's a false choice, too: allocation failures
in atomic context will happen, and printing warnings in logs,
effectively for a packet drop, is both too much and very likely
non-actionable.

This leads me to a conclusion that most uses of napi_alloc_skb()
are simply misguided, and should use __GFP_NOWARN in the first
place. We also have a "standard" way of reporting allocation
failures via the queue stat API (qstats::rx-alloc-fail).

The direct motivation for this patch is that one of the drivers
used at Meta calls napi_alloc_skb() (so prior to this patch without
__GFP_NOWARN), and the resulting OOM warning is the top networking
warning in our fleet.
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240327040213.3153864-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6e9b0190

qed: Drop useless pci_params.pm_cap · 49d665b8

Bjorn Helgaas authored Mar 25, 2024

qed_init_pci() used pci_params.pm_cap only to cache the pci_dev.pm_cap.
Drop the cache and use pci_dev.pm_cap directly.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240325224931.1462051-1-helgaas@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

49d665b8

gve: Add counter adminq_get_ptype_map_cnt to stats report · 3bcbc67b

John Fraker authored Mar 25, 2024

This counter counts the number of times get_ptype_map is executed on the
admin queue, and was previously missing from the stats report.
Signed-off-by: John Fraker <jfraker@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240325223308.618671-1-jfraker@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3bcbc67b

Merge branch 'ravb-support-describing-the-mdio-bus' · c602f4ca

Jakub Kicinski authored Mar 28, 2024

Niklas Söderlund says:

====================
ravb: Support describing the MDIO bus

This series adds support to the binding and driver of the Renesas
Ethernet AVB to describe the MDIO bus. Currently the driver uses
the OF node of the device itself when registering the MDIO bus.
This forces any MDIO bus properties the MDIO core should react on
to be set on the device OF node. This is confusing and none of
the MDIO bus properties are described in the Ethernet AVB bindings.

Patch 1/2 extends the bindings with an optional mdio child-node
to the device that can be used to contain the MDIO bus settings.
While patch 2/2 changes the driver to use this node (if present)
when registering the MDIO bus.

If the new optional mdio child-node is not present, the driver
falls back to the old behavior and uses the device OF node like before.
This change is fully backward compatible with existing usage
of the bindings.
====================

Link: https://lore.kernel.org/r/20240325153451.2366083-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c602f4ca

ravb: Add support for an optional MDIO mode · 2c60c4c0

Niklas Söderlund authored Mar 25, 2024

The driver used the DT node of the device itself when registering the
MDIO bus. While this works, it creates a problem: it forces any MDIO bus
properties to also be set on the device's DT node. This mixes the
properties of two distinctly different things and is confusing.

This change adds support for an optional mdio node to be defined as a
child to the device DT node. The child node can then be used to describe
MDIO bus properties that the MDIO core can act on when registering the
bus.

If no mdio child node is found, the driver falls back to the old behavior
and registers the MDIO bus using the device DT node. This change is
backward compatible with old bindings in use.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240325153451.2366083-3-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2c60c4c0
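
The fallback described above can be modelled in a few lines of Python. This is only a sketch of the selection logic, not the driver code: device tree nodes are represented as plain dicts, and the function name select_mdio_node as well as the example property are assumptions made for illustration.

```python
def select_mdio_node(device_node):
    """Return the node to register the MDIO bus with: the optional 'mdio'
    child if it exists, otherwise the device node itself (old behavior)."""
    children = device_node.get("children", {})
    return children.get("mdio", device_node)


# Old-style description: no mdio child, bus properties sit on the device node.
old_style = {"name": "ethernet", "children": {}}

# New-style description: bus properties live in a dedicated mdio child node.
# The property shown here is only an example of an MDIO bus setting.
new_style = {
    "name": "ethernet",
    "children": {"mdio": {"name": "mdio", "reset-delay-us": 10}},
}

print(select_mdio_node(old_style)["name"])
print(select_mdio_node(new_style)["name"])
```

Because the lookup degrades to the device node when the child is absent, existing device trees keep working unchanged.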

dt-bindings: net: renesas,etheravb: Add optional MDIO bus node · a87590c4

Niklas Söderlund authored Mar 25, 2024

The Renesas Ethernet AVB bindings do not allow the MDIO bus to be
described. This has not been needed as only a single PHY is
supported and no MDIO bus properties have been needed.

Add an optional mdio node to the binding which allows the MDIO bus to be
described and allow bus properties to be set.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20240325153451.2366083-2-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a87590c4

Merge branch 'doc-netlink-specs-add-vlan-support' · fb984d17

Jakub Kicinski authored Mar 28, 2024

Hangbin Liu says:

====================
doc/netlink/specs: Add vlan support

Add vlan support in rt_link spec.
====================

Link: https://lore.kernel.org/r/20240327123130.1322921-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fb984d17

doc/netlink/specs: Add vlan attr in rt_link spec · 782c1084

Hangbin Liu authored Mar 27, 2024

With command:
 # ./tools/net/ynl/cli.py \
 --spec Documentation/netlink/specs/rt_link.yaml \
 --do getlink --json '{"ifname": "eno1.2"}' --output-json | \
 jq -C '.linkinfo'

Before:
Exception: No message format for 'vlan' in sub-message spec 'linkinfo-data-msg'

After:
 {
   "kind": "vlan",
   "data": {
     "protocol": "8021q",
     "id": 2,
     "flag": {
       "flags": [
         "reorder-hdr"
       ],
       "mask": "0xffffffff"
     },
     "egress-qos": {
       "mapping": [
         {
           "from": 1,
           "to": 2
         },
         {
           "from": 4,
           "to": 4
         }
       ]
     }
   }
 }
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240327123130.1322921-3-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

782c1084

ynl: support hex display_hint for integer · b334f5ed

Hangbin Liu authored Mar 27, 2024

Sometimes it would be convenient to read an integer as hex, for example
mask values.
Suggested-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240327123130.1322921-2-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b334f5ed
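
A minimal sketch of what a hex display hint for integers amounts to, written in Python since the ynl tooling is Python. The function name format_int and its signature are hypothetical and are not taken from the ynl sources; only the idea (render an integer as a hex string when the spec asks for it) comes from the commit message.

```python
from typing import Optional, Union


def format_int(value: int, display_hint: Optional[str] = None) -> Union[int, str]:
    """Return the integer unchanged, unless a 'hex' display hint asks for a
    hexadecimal string representation."""
    if display_hint == "hex":
        return hex(value)
    return value


# A 32-bit all-ones mask reads much better in hex than in decimal.
print(format_int(0xFFFFFFFF, "hex"))
print(format_int(0xFFFFFFFF))
```

With the hint, a mask prints as "0xffffffff" rather than as the decimal 4294967295.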

Merge branch 'selftests-fixes-for-kernel-ci' · 51cf49f6

Jakub Kicinski authored Mar 28, 2024

Petr Machata says:

====================
selftests: Fixes for kernel CI

As discussed on the bi-weekly call on Jan 30, and in mail exchanges around
the kernel CI effort, some changes are desirable in the suite of forwarding
selftests so that it works better with the CI tooling. Namely:

- The forwarding selftests use a configuration file where names of
interfaces are defined and various variables can be overridden. There
is also forwarding.config.sample that users can use as a template to
refer to when creating the config file. What happens a fair bit is
that users either do not know about this at all, or simply forget, and
are confused by cryptic failures about interfaces that cannot be
created.

In patches #1 - #3, have lib.sh just be the single source of truth with
regards to which variables exist. That includes the topology variables
which were previously only in the sample file, and any "tweak
variables", such as what tools to use, sleep times, etc.

forwarding.config.sample then becomes just a placeholder with a couple
examples. Unless specific HW should be exercised, or specific tools
used, the defaults are usually just fine.

- Several net/forwarding/ selftests (and one net/ one) cannot be run on
veth pairs, they need an actual HW interface to run on. They are
generic in the sense that any capable HW should pass them, which is
why they have been put to net/forwarding/ as opposed to drivers/net/,
but they do not generalize to veth. The fact that these tests are in
net/forwarding/, but still complaining when run, is confusing.

In patches #4 - #6 move these tests to a new directory
drivers/net/hw.

- The following patches extend the codebase to properly handle test
results other than pass and fail.

Patch #7 is preparatory. It converts several log_test_skip to XFAIL,
so that tests do not spuriously end up returning non-0 when they
are not supposed to.

In patches #8 - #10, introduce some missing ksft constants, then support
having those constants in RET, and then finally in EXIT_STATUS.

- The traffic scheduler tests generate a large amount of network traffic
to test the behavior of the scheduler. This demands a relatively
high-performance computer. On slow machines, such as with a debugging
kernel, the test would spuriously fail.

It can still be useful to "go through the motions" though, to possibly
catch bugs in setup of the scheduler graph and passing packets around.
Thus we still want to run the tests, just with lowered demands.

To that end, in patches #11 - #12, introduce an environment variable
KSFT_MACHINE_SLOW, with obvious meaning. Tests can then make checks
more lenient, such as mark failures as XFAIL. A helper, xfail_on_slow,
is provided to mark performance-sensitive parts of the selftest.

- In patch #13, use a similar mechanism to mark a NH group stats
selftest to XFAIL HW stats tests when run on VETH pairs.

- All these changes complicate the hitherto straightforward logging and
checking logic, so in patch #14, add a selftest that checks this
functionality in lib.sh.

v1 (vs. an RFC circulated through linux-kselftest):
- Patch #9:
- Clarify intended usage by s/set_ret/ret_set_ksft_status/,
s/nret/ksft_status/
====================

Link: https://lore.kernel.org/r/cover.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

51cf49f6

selftests: forwarding: Add a test for testing lib.sh functionality · 8ff2d7ab

Petr Machata authored Mar 26, 2024

Rerunning various scenarios to make sure lib.sh changes do not impact the
observable behavior is no fun. Add a selftest at least for the bare basics
-- the mechanics of setting RET, retmsg, and EXIT_STATUS.

Since the selftest itself uses lib.sh, it would be possible to break lib.sh
in such a way that invalidates the result of the selftest. Since the metatest
only uses the bare basics (just pass/fail), hopefully such fundamental
breakages would be noticed.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/6d25cedbf2d4b83614944809a34fe023fbe8db38.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8ff2d7ab

selftests: forwarding: router_mpath_nh_lib: Don't skip, xfail on veth · 6db870bb

Petr Machata authored Mar 26, 2024

Currently, when the NH group stats tests are run on a veth topology, the
HW-stats leg of each test is SKIP'ped. But kernel networking CI interprets
skips as a sign that tooling is missing, and prompts maintainer
investigation. Lack of capability to pass a test should be expressed as
XFAIL.

Selftests that require HW should normally be put in drivers/net/hw, but
doing so for the NH counter selftests would just lead to a lot of
duplication.

So instead, introduce a helper, xfail_on_veth(), which can be used to mark
selftests that should XFAIL instead of FAILing when run on a veth topology.
On a non-veth topology, the helper does nothing.

Use the helper in the HW-stats part of router_mpath_nh_lib selftest.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/15f0ab9637aa0497f164ec30e83c1c8f53d53719.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6db870bb

selftests: forwarding: Mark performance-sensitive tests · e1039109

Petr Machata authored Mar 26, 2024

When run on a slow machine, the scheduler traffic tests can be expected to
fail, and should be reported as XFAIL in that case. Therefore run these
tests through the perf_sensitive wrapper.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/9a357f8cf34f5ececac08d43a3eb023008996035.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e1039109

selftests: forwarding: Support for performance sensitive tests · e16a8d20

Petr Machata authored Mar 26, 2024

Several tests in the suite use large amounts of traffic to e.g. cause
congestion and evaluate RED or shaper performance. These tests will not run
well on a slow machine, be it one with heavy debug kernel, or a VM, or e.g.
a single-board computer. Allow users to specify an environment variable,
KSFT_MACHINE_SLOW=yes, to indicate that the tests are being run on one such
machine.

Performance sensitive tests can then use a new helper, xfail_on_slow(), to
mark parts of the test that are sensitive to low-performance machines.
The helper can be used to just mark the whole suite, like so:

	xfail_on_slow tests_run

... or, on the other side of the granularity spectrum, to override
individual checks:

	xfail_on_slow check_err $? "Expected much, got little."
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/99a376a2d2ffdaeee7752b1910cb0c3ea5d80fbe.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e16a8d20

selftests: forwarding: Convert log_test() to recognize RET values · a923af1c

Petr Machata authored Mar 26, 2024

In a previous patch, the interpretation of RET value was changed to mean
the kselftest framework constant with the test outcome: $ksft_pass,
$ksft_xfail, etc.

Update log_test() to recognize the various possible RET values.

Then have EXIT_STATUS track the RET value of the current test. This differs
subtly from the way RET tracks the value: while for RET we want to
recognize XFAIL as a separate status, for purposes of exit code, we want
to conflate XFAIL and PASS, because they both communicate non-failure. Thus
add a new helper, ksft_exit_status_merge().

With this, log_test_skip() and log_test_xfail() can be reexpressed as thin
wrappers around log_test().
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/e5f807cb5476ab795fd14ac74da53a731a9fc432.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a923af1c
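
One possible reading of the exit-code rule above, modelled in Python rather than in the shell of lib.sh. The function names are hypothetical and the real ksft_exit_status_merge() may be structured differently; the numeric values follow the kselftest framework constants (pass 0, fail 1, xfail 2, skip 4).

```python
KSFT_PASS, KSFT_FAIL, KSFT_XFAIL, KSFT_SKIP = 0, 1, 2, 4

# Weight order for exit codes, lightest first. XFAIL is absent because it is
# folded into PASS before comparing.
EXIT_ORDER = (KSFT_PASS, KSFT_SKIP, KSFT_FAIL)


def for_exit(status):
    """Conflate XFAIL with PASS: both communicate non-failure."""
    return KSFT_PASS if status == KSFT_XFAIL else status


def exit_status_merge(exit_status, ret):
    """Fold the RET of a finished test into the running exit status,
    keeping whichever of the two is weightier."""
    return max(for_exit(exit_status), for_exit(ret), key=EXIT_ORDER.index)


# A run where one test XFAILs and another is skipped exits with SKIP.
status = KSFT_PASS
for ret in (KSFT_XFAIL, KSFT_SKIP, KSFT_PASS):
    status = exit_status_merge(status, ret)
print(status)
```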

selftests: forwarding: Have RET track kselftest framework constants · 596c8819

Petr Machata authored Mar 26, 2024

The variable RET keeps track of whether the test under execution has so far
failed or not. Currently it works in binary fashion: zero means everything
is fine, non-zero means something failed. log_test() then uses the value to
give a human-readable message.

In order to allow log_test() to report skips and xfails, the semantics of
RET need to be more fine-grained. Therefore have RET value be one of
kselftest framework constants: $ksft_fail, $ksft_xfail, etc.

The current logic in check_err() is such that first non-zero value of RET
trumps all those that follow. But that is not right when RET has more
fine-grained value semantics. Different outcomes have different weights.

The results of PASS and XFAIL are mostly the same: they both communicate a
test that did not go wrong. SKIP communicates lack of tooling, which the
user should go and try to fix, and as such should not be overridden by the
passes. So far, the higher-numbered statuses can be considered weightier.
But FAIL should be the weightiest.

Add a helper, ksft_status_merge(), which merges two statuses in a way that
respects the above conditions. Express it in a generic manner, because exit
status merge is subtly different, and we want to reuse the same logic.

Use the new helper when setting RET in check_err().

Re-express check_fail() in terms of check_err() to avoid duplication.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/7dfff51cc925c7a3ac879b9050a0d6a327c8d21f.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

596c8819
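
The weighting described above can be sketched as a generic merge that takes the weight order as a parameter. This is a Python model of the idea, not the shell helper itself: the names status_merge and merge_ret are assumptions, and the numeric values follow the kselftest framework constants (pass 0, fail 1, xfail 2, skip 4).

```python
KSFT_PASS, KSFT_FAIL, KSFT_XFAIL, KSFT_SKIP = 0, 1, 2, 4


def status_merge(a, b, order):
    """Return whichever of the two statuses is weightier, where 'order'
    lists the statuses from lightest to weightiest."""
    weights = {status: weight for weight, status in enumerate(order)}
    return a if weights[a] > weights[b] else b


# For RET: higher-numbered statuses outweigh lower ones, except that FAIL
# is the weightiest of all.
RET_ORDER = (KSFT_PASS, KSFT_XFAIL, KSFT_SKIP, KSFT_FAIL)


def merge_ret(ret, err):
    """Model of updating RET with the outcome of one check."""
    return status_merge(ret, err, RET_ORDER)


print(merge_ret(KSFT_SKIP, KSFT_PASS))   # a later pass does not hide a skip
print(merge_ret(KSFT_SKIP, KSFT_FAIL))   # a failure overrides everything
```

Passing the order as an argument is what lets a second caller reuse the same logic with a different ranking, as the commit message says is intended for the exit status.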

selftests: lib: Define more kselftest exit codes · 51ccf267

Petr Machata authored Mar 26, 2024

The following patches will operate with more exit codes besides
ksft_skip. Add them here.

Additionally, move a duplicated skip exit code definition from
forwarding/tc_tunnel_key.sh. Keep a similar duplicate in
forwarding/devlink_lib.sh, because even though lib.sh will have
been sourced in all cases where devlink_lib is, the inclusion is not
visible in the file itself, and relying on it would be confusing.

Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/545a03046c7aca0628a51a389a9b81949ab288ce.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

51ccf267

selftests: forwarding: Change inappropriate log_test_skip() calls · 677f3949

Petr Machata authored Mar 26, 2024

The SKIP return should be used for cases where tooling of the machine under
test is lacking. For cases where HW is lacking, the appropriate outcome is
XFAIL.

This is the case with ethtool_rmon and mlxsw_lib. For these, introduce a
new helper, log_test_xfail().

Do the same for router_mpath_nh_lib. Note that it will be fixed using a
more reusable way in a following patch.

For the two resource_scale selftests, the log should simply not be written,
because there is no problem.

Cc: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/3d668d8fb6fa0d9eeb47ce6d9e54114348c7c179.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

677f3949

selftests: forwarding: Ditch skip_on_veth() · 0c499a35

Petr Machata authored Mar 26, 2024

Since the selftests that are not supposed to run on veth pairs are now in
their own dedicated directory, the skip_on_veth logic can go away. Drop it
from the selftests, and from lib.sh.

Cc: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/63b470e10d65270571ee7de709b31672ce314872.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0c499a35

selftests: forwarding: Move several selftests · 40d269c0

Petr Machata authored Mar 26, 2024

The tests in net/forwarding are generally expected to be HW-independent.
There are however several tests that, while not depending on any HW in
particular, nevertheless depend on being used on HW interfaces. Placing
these selftests in net/forwarding is confusing, because the selftest will
just report it can't be run on veth pairs. At the same time, placing them
in a particular driver's selftests subdirectory would be wrong.

Instead, add a new directory, drivers/net/hw, where these generic but HW
independent selftests should be placed. Move over several such tests
including one helper library.

Since typically these tests will not be expected to run, omit the directory
drivers/net/hw from the TARGETS list in selftests/Makefile. Retain a
Makefile in the new directory itself, so that a user can make -C into that
directory and act on those tests explicitly.

Cc: Roger Quadros <rogerq@kernel.org>
Cc: Tobias Waldekranz <tobias@waldekranz.com>
Cc: Danielle Ratson <danieller@nvidia.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Cc: Johannes Nixdorf <jnixdorf-oss@avm.de>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/e11dae1f62703059e9fc2240004288ac7cc15756.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

40d269c0

selftests: forwarding: ipip_lib: Do not import lib.sh · 0faa565b

Petr Machata authored Mar 26, 2024

This library is always sourced in the context where lib.sh has already been
sourced as well. Therefore drop the explicit sourcing and expect the client
to already have done it. This will simplify moving some of the clients to a
different directory.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/a4da5e9cd42a34cbace917a048ca71081719d6ac.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0faa565b

selftests: forwarding: README: Document customization · 0cb86287

Petr Machata authored Mar 26, 2024

That any sort of customization is possible at all, let alone how it should
be done, is currently not at all clear. Document the whats and hows in
README.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/e819623af6aaeea49e9dc36cecd95694fad73bb8.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0cb86287

selftests: forwarding.config.sample: Move overrides to lib.sh · fd36fd26

Petr Machata authored Mar 26, 2024

forwarding.config.sample, net/lib.sh and net/forwarding/lib.sh contain
definitions and redefinitions of some of the same variables. The overlap
between net/forwarding/lib.sh and forwarding.config.sample is especially
large. This duplication is a potential source of confusion and problems.

It would be overall less error prone if each variable were defined in one
place only. In this patch set, that place is the library itself. Therefore
move all comments from forwarding.config.sample to net/forwarding/lib.sh.

Also move over a definition of TC_FLAG, which was missing from lib.sh
entirely.

Additionally, add to lib.sh a default definition of the topology variables.
The logic behind this is that forgetting to specify forwarding.config was a
frequent source of frustration for the selftest users. But really, most of
the time the default veth based topology is just fine. We considered just
sourcing forwarding.config.sample instead if forwarding.config is not
available, but this is a cleaner solution.

That means the syntax of the forwarding.config.sample override has to
change to an array assignment, so that the whole variable is overwritten,
not just individual keys, which could leave the value of some keys
unchanged. Do the same in lib.sh for any cut'n'pasters out there.
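
The difference between overriding individual keys and assigning the whole array can be sketched as follows. The array names and interface names here are hypothetical, chosen only for illustration, and the sketch assumes bash 4 or later for associative arrays.

```shell
#!/bin/bash
# Hypothetical library default: a four-port veth topology.
declare -A DEMO_NETIFS=( [p1]=veth0 [p2]=veth1 [p3]=veth2 [p4]=veth3 )

# Per-key override: only p1 and p2 change, while p3 and p4 silently keep
# the library defaults, which is the pitfall the commit message describes.
declare -A per_key=( [p1]=veth0 [p2]=veth1 [p3]=veth2 [p4]=veth3 )
per_key[p1]=swp1
per_key[p2]=swp2

# Whole-array override: the previous contents are discarded, so only the
# keys listed here remain.
DEMO_NETIFS=( [p1]=swp1 [p2]=swp2 )
```

After the whole-array assignment, DEMO_NETIFS holds exactly two keys, whereas per_key still carries the stale veth2 and veth3 entries.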

The config file is then given a sort of carte blanche to redefine whatever
variables it sees fit from the libraries. This is described in a comment in
the file. Only a handful of variables are left behind, to illustrate the
customization.

The fact that the variables are now missing from forwarding.config.sample,
and therefore would be missing from forwarding.config derived from that file as
well, should not change anything. This is just the sample file. Users that
keep their own forwarding.config would retain it as before.

The only observable change is the introduction of TC_FLAG to lib.sh, because
now no attempt would be made to install the filters to the HW datapath. For
veth pairs this does not change anything. For HW deployments, users presumably
have forwarding.config with this value overridden.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/b9b8a11a22821a7aa532211ff461a34f596e26bf.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fd36fd26

selftests: net: libs: Change variable fallback syntax · fa61e9ae

Petr Machata authored Mar 26, 2024

The current syntax of X=${X:=Y} first evaluates the ${X:=Y} expression,
which either uses the existing value of $X if there is one, or uses the
value of "Y" as a fallback, and assigns it to X. The expression is then
replaced with the now-current value of $X. Assigning that value to X once
more is meaningless.

So avoid the outer X=... bit, and instead express the same idea through the
do-nothing ":" built-in as : "${X:=Y}". This also cleans up the block
nicely and makes it more readable.
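
The two forms can be compared in a short sketch. The variable names below are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical variables, used only for illustration.
unset DEMO_COUNT DEMO_FLAG
DEMO_WAIT=10	# pretend the environment already provided this one

# Old form: the inner expansion already assigns DEMO_COUNT, so the outer
# assignment merely repeats it.
DEMO_COUNT=${DEMO_COUNT:=3}

# New form: ":" does nothing, but its argument is still expanded, and the
# expansion assigns the fallback only if the variable is unset or empty.
: "${DEMO_WAIT:=5}"		# already set, stays 10
: "${DEMO_FLAG:=fallback}"	# unset, becomes "fallback"
```

Both forms leave the variables with the same values; the second simply drops the redundant assignment.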

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/1890ddc58420c2c0d5ba3154c87ecc6d9faf6947.1711464583.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fa61e9ae

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 5e47fbe5

Jakub Kicinski authored Mar 28, 2024

Cross-merge networking fixes after downstream PR.

No conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5e47fbe5

28 Mar, 2024 2 commits

Merge tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 50108c35

Linus Torvalds authored Mar 28, 2024

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf, WiFi and netfilter.

  Current release - regressions:

   - ipv6: fix address dump when IPv6 is disabled on an interface

  Current release - new code bugs:

   - bpf: temporarily disable atomic operations in BPF arena

   - nexthop: fix uninitialized variable in nla_put_nh_group_stats()

  Previous releases - regressions:

   - bpf: protect against int overflow for stack access size

   - hsr: fix the promiscuous mode in offload mode

   - wifi: don't always use FW dump trig

   - tls: adjust recv return with async crypto and failed copy to
     userspace

   - tcp: properly terminate timers for kernel sockets

   - ice: fix memory corruption bug with suspend and rebuild

   - at803x: fix kernel panic with at8031_probe

   - qeth: handle deferred cc1

  Previous releases - always broken:

   - bpf: fix bug in BPF_LDX_MEMSX

   - netfilter: reject table flag and netdev basechain updates

   - inet_defrag: prevent sk release while still in use

   - wifi: pick the version of SESSION_PROTECTION_NOTIF

   - wwan: t7xx: split 64bit accesses to fix alignment issues

   - mlxbf_gige: call request_irq() after NAPI initialized

   - hns3: fix kernel crash when devlink reload during pf
     initialization"

* tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  inet: inet_defrag: prevent sk release while still in use
  Octeontx2-af: fix pause frame configuration in GMP mode
  net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips
  net: bcmasp: Remove phy_{suspend/resume}
  net: bcmasp: Bring up unimac after PHY link up
  net: phy: qcom: at803x: fix kernel panic with at8031_probe
  netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c
  netfilter: nf_tables: skip netdev hook unregistration if table is dormant
  netfilter: nf_tables: reject table flag and netdev basechain updates
  netfilter: nf_tables: reject destroy command to remove basechain hooks
  bpf: update BPF LSM designated reviewer list
  bpf: Protect against int overflow for stack access size
  bpf: Check bloom filter map value size
  bpf: fix warning for crash_kexec
  selftests: netdevsim: set test timeout to 10 minutes
  net: wan: framer: Add missing static inline qualifiers
  mlxbf_gige: call request_irq() after NAPI initialized
  tls: get psock ref after taking rxlock to avoid leak
  selftests: tls: add test with a partially invalid iov
  tls: adjust recv return with async crypto and failed copy to userspace
  ...

50108c35

inet: inet_defrag: prevent sk release while still in use · 18685451

Florian Westphal authored Mar 26, 2024

ip_local_out() and other functions can pass skb->sk as function argument.

If the skb is a fragment and reassembly happens before such function call
returns, the sk must not be released.

This affects skb fragments reassembled via netfilter or similar
modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.

Eric Dumazet made an initial analysis of this bug.  Quoting Eric:
  Calling ip_defrag() in output path is also implying skb_orphan(),
  which is buggy because output path relies on sk not disappearing.

  A relevant old patch about the issue was :
  8282f274 ("inet: frag: Always orphan skbs inside ip_defrag()")

  [..]

  net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
  inet socket, not an arbitrary one.

  If we orphan the packet in ipvlan, then downstream things like FQ
  packet scheduler will not work properly.

  We need to change ip_defrag() to only use skb_orphan() when really
  needed, ie whenever frag_list is going to be used.

Eric suggested stashing sk in the fragment queue and made an initial patch.
However there is a problem with this:

If skb is refragmented again right after, ip_do_fragment() will copy
head->sk to the new fragments, and set up sock_wfree as the destructor.
IOW, we have no choice but to fix up sk_wmem accounting to reflect the
fully reassembled skb, else wmem will underflow.

This change moves the orphan down into the core, to the last possible moment.
As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
offset into the FRAG_CB, else skb->sk gets clobbered.

This allows delaying the orphaning long enough to learn if the skb has
to be queued or if the skb is completing the reasm queue.

In the former case, things work as before, skb is orphaned.  This is
safe because skb gets queued/stolen and won't continue past reasm engine.

In the latter case, we will steal the skb->sk reference, reattach it to
the head skb, and fix up wmem accounting when inet_frag inflates truesize.

Fixes: 7026b1dd ("netfilter: Pass socket pointer down through okfn().")
Diagnosed-by: Eric Dumazet <edumazet@google.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yue sun <samsun1006219@gmail.com>
Reported-by: syzbot+e5167d7144a62715044c@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240326101845.30836-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

18685451