Commits · f78dca691287813b6f101b207272786f0cb39c65 · Kirill Smelkov / linux

21 Jul, 2023 29 commits

octeontx2-pf: implement transmit schedular allocation algorithm · f78dca69

Naveen Mamindlapalli authored Jul 19, 2023

unlike strict priority, where number of classes are limited to max
8, there is no restriction on the number of dwrr child nodes unless
the count increases the max number of child nodes supported.

Hardware expects strict priority transmit schedular indexes mapped
to their priority. This patch adds defines transmit schedular allocation
algorithm such that the above requirement is honored.
Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f78dca69

Merge branch 'mlxsw-enslavement' · c6514f36

David S. Miller authored Jul 21, 2023

Petr Machata says:

====================
mlxsw: Permit enslavement to netdevices with uppers

The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.

As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.

Similarly, detaching a front panel port from a configured topology means
unoffloading of this whole topology -- VLAN uppers, next hops, etc.
Attaching the port back is then not permitted at all. If it were, it would
not result in a working configuration, because much of mlxsw is written to
react to changes in immediate configuration. There is nothing that would go
visit netdevices in the attached-to topology and offload existing routes
and VLAN memberships, for example.

In this patchset, introduce a number of replays to be invoked so that this
sort of post-hoc offload is supported. Then remove the vetoes that
disallowed enslavement of front panel ports to other netdevices with
uppers.

The patchset progresses as follows:

- In patch #1, fix an issue in the bridge driver. To my knowledge, the
  issue could not have resulted in a buggy behavior previously, and thus is
  packaged with this patchset instead of being sent separately to net.

- In patch #2, add a new helper to the switchdev code.

- In patch #3, drop mlxsw selftests that will not be relevant after this
  patchset anymore.

- Patches #4, #5, #6, #7 and #8 prepare the codebase for smoother
  introduction of the rest of the code.

- Patches #9, #10, #11, #12, #13 and #14 replay various aspects of upper
  configuration when a front panel port is introduced into a topology.
  Individual patches take care of bridge and LAG RIF memberships, switchdev
  replay, nexthop and neighbors replay, and MACVLAN offload.

- Patches #15 and #16 introduce RIFs for newly-relevant netdevices when a
  front panel port is enslaved (in which case all uppers are newly
  relevant), or, respectively, deslaved (in which case the newly-relevant
  netdevice is the one being deslaved).

- Up until this point, the introduced scaffolding was not really used,
  because mlxsw still forbids enslavement of mlxsw netdevices to uppers
  with uppers. In patch #17, this condition is finally relaxed.

A sizable selftest suite is available to test all this new code. That will
be sent in a separate patchset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c6514f36

mlxsw: spectrum: Permit enslavement to netdevices with uppers · 2c5ffe8d

Petr Machata authored Jul 19, 2023

Enslaving of front panel ports (and their uppers) to netdevices that
already have uppers is currently forbidden. In the previous patches, a
number of replays have been added. Those ensure that various bits of state,
such as next hops or switchdev objects, are offloaded when they become
relevant due to a mlxsw lower being introduced into the topology.

However the act of actually, for example, enslaving a front-panel port to
a bridge with uppers, has been vetoed so far. In this patch, remove the
vetoes and permit the operation.

mlxsw currently validates creation of "interesting" uppers. Thus creating
VLAN netdevices on top of 802.1ad bridges is forbidden if the bridge has an
mlxsw lower, but permitted in general. This validation code never gets run
when a port is introduced as a lower of an existing netdevice structure.

Thus when enslaving an mlxsw netdevice to netdevices with uppers, invoke
the PRECHANGEUPPER event handler for each netdevice above the one that the
front panel port is being enslaved to. This way the tower of netdevices
above the attachment point is validated.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c5ffe8d

mlxsw: spectrum_router: Replay IP NETDEV_UP on device deslavement · 4560cf40

Petr Machata authored Jul 19, 2023

When a netdevice is removed from a bridge or a LAG, and it has an IP
address, it should join the router and gain a RIF. Do that by replaying
address addition event on the netdevice.

When handling deslavement of LAG or its upper from a bridge device, the
replay should be done after all the lowers of the LAG have left the bridge.
Thus these scenarios are handled by passing replay_deslavement of false,
and by invoking, after the lowers have been processed, a new helper,
mlxsw_sp_netdevice_post_lag_event(), which does the per-LAG / -upper
handling, and in particular invokes the replay.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4560cf40

mlxsw: spectrum_router: Replay IP NETDEV_UP on device enslavement · 31618b22

Petr Machata authored Jul 19, 2023

Enslaving of front panel ports (and their uppers) to netdevices that
already have uppers is currently forbidden. When this is permitted, any
uppers with IP addresses need to have the NETDEV_UP inetaddr event
replayed, so that any RIFs are created.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

31618b22

mlxsw: spectrum_router: Replay neighbours when RIF is made · 8fdb09a7

Petr Machata authored Jul 19, 2023

As neighbours are created, mlxsw is involved through the netevent
notifications. When at the time there is no RIF for a given neighbour, the
notification is not acted upon. When the RIF is later created, these
outstanding neighbours are left unoffloaded and cause traffic to go through
the SW datapath.

In order to fix this issue, as a RIF is created, walk the ARP and ND tables
and find neighbours for the netdevice that represents the RIF. Then
schedule neighbour work for them, allowing them to be offloaded.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8fdb09a7

mlxsw: spectrum_router: Replay MACVLANs when RIF is made · 49c3a615

Petr Machata authored Jul 19, 2023

If IP address is added to a MACVLAN netdevice, the effect is of configuring
VRRP on the RIF for the netdevice linked to the MACVLAN. Because the
MACVLAN offload is tied to existence of a RIF at the linked netdevice,
adding a MACVLAN is currently not allowed until a RIF is present.

If this requirement stays, it will never be possible to attach a first port
into a topology that involves a MACVLAN. Thus topologies would need to be
built in a certain order, which is impractical.

Additionally, IP address removal, which leads to disappearance of the RIF
that the MACVLAN depends on, cannot be vetoed. Thus even as things stand
now it is possible to get to a state where a MACVLAN netdevice exists
without a RIF, despite having mlxsw lowers. And once the MACVLAN is
un-offloaded due to RIF getting destroyed, recreating the RIF does not
bring it back.

In this patch, accept that MACVLAN can be created out of order and support
that use case.

One option would seem to be to simply recognize MACVLAN netdevices as
"interesting", and let the existing replay mechanisms take care of the
offload. However, that does not address the necessity to reoffload MACVLAN
once a RIF is created.

Thus add a new replay hook, symmetrical to mlxsw_sp_rif_macvlan_flush(),
called mlxsw_sp_rif_macvlan_replay(), which instead of unwinding the
existing offloads, applies the configuration as if the netdevice were
created just now.

Additionally, remove all vetoes and warning messages that checked for
presence of a RIF at the linked device.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

49c3a615

mlxsw: spectrum_router: Offload ethernet nexthops when RIF is made · cfc01a92

Petr Machata authored Jul 19, 2023

As RIF is created, refresh each netxhop group tracked at the CRIF for which
the RIF was created.

Note that nothing needs to be done for IPIP nexthops. The RIF for these is
either available from the get-go, or will never be available, so no after
the fact offloading needs to be done.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cfc01a92

mlxsw: spectrum_router: Join RIFs of LAG upper VLANs · ef59713c

Petr Machata authored Jul 19, 2023

In the following patches, the requirement that ports be only enslaved to
masters without uppers, is going to be relaxed. It will therefore be
necessary to join not only RIF for the immediate LAG, as is currently the
case, but also RIFs for VLAN netdevices upper to the LAG.

In this patch, extend mlxsw_sp_netdevice_router_join_lag() to walk the
uppers of a LAG being joined, and also join any VLAN ones.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ef59713c

mlxsw: spectrum_switchdev: Replay switchdev objects on port join · ec4643ca

Petr Machata authored Jul 19, 2023

Currently it never happens that a netdevice that is already a bridge slave
would suddenly become mlxsw upper. The only case where this might be
possible as far as mlxsw is concerned, is with LAG netdevices. But if a LAG
has any upper (e.g. is enslaved), enlaving mlxsw port to that LAG is
forbidden. Thus the only way to install a LAG between a bridge and a mlxsw
port is by first enslaving the port to the LAG, and then enslaving that LAG
to a bridge. At that point there are no bridge objects (such as port VLANs)
to replay. Those are added afterwards, and notified as they are created.
This holds even for the PVID.

However in the following patches, the requirement that ports be only
enslaved to masters without uppers, is going to be relaxed. It will
therefore be necessary to replay the existing bridge objects. Without this
replay, e.g. the mlxsw bridge_port_vlan objects are not instantiated, which
causes issues later, as a lot of code relies on their presence.

To that end, add a new notifier block whose sole role is to filter out
events related to the one relevant upper, and forward those to the existing
switchdev notifier block. Pass the new notifier block to
switchdev_bridge_port_offload() when the bridge port is created.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec4643ca

mlxsw: spectrum: On port enslavement to a LAG, join upper's bridges · 987c7782

Petr Machata authored Jul 19, 2023

Currently it never happens that a netdevice that is already a bridge slave
would suddenly become mlxsw upper. The only case where this might be
possible as far as mlxsw is concerned, is with LAG netdevices. But if a LAG
already has an upper, enslaving mlxsw port to that LAG is forbidden. Thus
the only way to install a LAG between a bridge and a mlxsw port is by first
enslaving the port to the LAG, and then enslaving that LAG to a bridge.

However in the following patches, the requirement that ports be only
enslaved to masters without uppers, is going to be relaxed. It will
therefore be necessary to join bridges of LAG uppers. Without this replay,
the mlxsw bridge_port objects are not instantiated, which causes issues
later, as a lot of code relies on their presence.

Therefore in this patch, when the first mlxsw physical netdevice is
enslaved to a LAG, consider bridges upper to the LAG (both the direct
master, if any, and any bridge masters of VLAN uppers), and have the
relevant netdevices join their bridges.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

987c7782

mlxsw: spectrum: Add a replay_deslavement argument to event handlers · 1c47e65b

Petr Machata authored Jul 19, 2023

When handling deslavement of LAG or its upper from a bridge device, when
the deslaved netdevice has an IP address, it should join the router. This
should be done after all the lowers of the LAG have left the bridge. The
replay intended to cause the device to join the router therefore cannot be
invoked unconditionally in the event handlers themselves. It can be done
right away if the handler is invoked for a sole device, but when it is
invoked repeated for each LAG lower, the replay needs to be postponed
until after this processing is done.

To that end, add a boolean parameter, replay_deslavement, to
mlxsw_sp_netdevice_port_upper_event(), mlxsw_sp_netdevice_port_vlan_event()
and one helper on the call path. Have the invocations that are done for
sole netdevices pass true, and those done for LAG lowers pass false.

Nothing depends on this flag at this point, but it removes some noise from
the patch that introduces the replay itself.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c47e65b

mlxsw: spectrum: Allow event handlers to check unowned bridges · 40b7b423

Petr Machata authored Jul 19, 2023

Currently the bridge-related handlers bail out when the event is related to
a netdevice that is not an upper of one of the front-panel ports. In order
to allow enslavement of front-panel ports to bridges that already have
uppers, it will be necessary to replay CHANGEUPPER events to validate that
the configuration is offloadable. In order for the replay to be effective,
it must be possible to ignore unsupported configuration in the context of
an actual notifier event, but to still "veto" these configurations when the
validation is performed.

To that end, introduce two parameters to a number of handlers: mlxsw_sp,
because it will not be possible to deduce that from the netdevice lowers;
and process_foreign to indicate whether netdevices that are not front panel
uppers should be validated.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

40b7b423

mlxsw: spectrum: Split a helper out of mlxsw_sp_netdevice_event() · 721717fa

Petr Machata authored Jul 19, 2023

Move the meat of mlxsw_sp_netdevice_event() to a separate function that
does just the validation. This separate helper will be possible to call
later for recursive ascent when validating attachment of a front panel port
to a bridge with uppers.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

721717fa

mlxsw: spectrum_router: Extract a helper to schedule neighbour work · 96c3e45c

Petr Machata authored Jul 19, 2023

This will come in handy for neighbour replay.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

96c3e45c

mlxsw: spectrum_router: Allow address handlers to run on bridge ports · 6bbc9ca6

Petr Machata authored Jul 19, 2023

Currently the IP address event handlers bail out when the event is related
to a netdevice that is a bridge port or a member of a LAG. In order to
create a RIF when a bridged or LAG'd port is unenslaved, these event
handlers will be replayed. However, at the point in time when the
NETDEV_CHANGEUPPER event is delivered, informing of the loss of
enslavement, the port is still formally enslaved.

In order for the operation to have any effect, these handlers need an extra
parameter to indicate that the check for bridge or LAG membership should
not be done. In this patch, add an argument "nomaster" to several event
handlers.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6bbc9ca6

selftests: mlxsw: rtnetlink: Drop obsolete tests · d7eb1f17

Petr Machata authored Jul 19, 2023

Support for enslaving ports to LAGs with uppers will be added in the
following patches. Selftests to make sure it actually does the right thing
are ready and will be sent as a follow-up.

Similarly, ordering of MACVLAN creation and RIF creation will be relaxed
and it will be permitted to create a MACVLAN first.

Thus these two tests are obsolete. Drop them.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7eb1f17

net: switchdev: Add a helper to replay objects on a bridge port · f2e2857b

Petr Machata authored Jul 19, 2023

When a front panel joins a bridge via another netdevice (typically a LAG),
the driver needs to learn about the objects configured on the bridge port.
When the bridge port is offloaded by the driver for the first time, this
can be achieved by passing a notifier to switchdev_bridge_port_offload().
The notifier is then invoked for the individual objects (such as VLANs)
configured on the bridge, and can look for the interesting ones.

Calling switchdev_bridge_port_offload() when the second port joins the
bridge lower is unnecessary, but the replay is still needed. To that end,
add a new function, switchdev_bridge_port_replay(), which does only the
replay part of the _offload() function in exactly the same way as that
function.

Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Ivan Vecera <ivecera@redhat.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f2e2857b

net: bridge: br_switchdev: Tolerate -EOPNOTSUPP when replaying MDB · 989280d6

Petr Machata authored Jul 19, 2023

There are two kinds of MDB entries to be replayed: port MDB entries, and
host MDB entries. They are both replayed by br_switchdev_mdb_replay(). If
the driver supports one kind, but lacks the other, the first -EOPNOTSUPP
returned terminates the whole replay, including any further still-supported
objects in the list.

For this to cause issues, there must be MDB entries for both the host and
the port being replayed. In that case, if the driver bails out from
handling the host entry, the port entries are never replayed. However, the
replay is currently only done when a switchdev port joins a bridge. There
would be no port memberships at that point. Thus despite being erroneous,
the code does not cause observable bugs.

This is not an issue with other object kinds either, because there, each
function replays one object kind. If a driver does not support that kind,
it makes sense to bail out early. -EOPNOTSUPP is then ignored in
nbp_switchdev_sync_objs().

For MDB, suppress the -EOPNOTSUPP error code in br_switchdev_mdb_replay()
already, so that the whole list gets replayed.

The reason we need this patch is that a future patch will introduce a
replay that should be used when a front-panel port netdevice is enslaved to
a bridge lower, in particular a LAG. The LAG netdevice can already have
both host and port MDB entries. The port entries need to be replayed so
that they are offloaded on the port that joins the LAG.

Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Ivan Vecera <ivecera@redhat.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

989280d6

net: ethernet: mtk_ppe: add MTK_FOE_ENTRY_V{1,2}_SIZE macros · a5dc694e

Lorenzo Bianconi authored Jul 19, 2023

Introduce MTK_FOE_ENTRY_V{1,2}_SIZE macros in order to make more
explicit foe_entry size for different chipset revisions.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5dc694e

Merge branch 'nexthop-refactor-and-fix-nexthop-selection-for-multipath-routes' · bf837e8f

Jakub Kicinski authored Jul 20, 2023

Benjamin Poirier says:

====================
nexthop: Refactor and fix nexthop selection for multipath routes

In order to select a nexthop for multipath routes, fib_select_multipath()
is used with legacy nexthops and nexthop_select_path_hthr() is used with
nexthop objects. Those two functions perform a validity test on the
neighbor related to each nexthop but their logic is structured differently.
This causes a divergence in behavior and nexthop_select_path_hthr() may
return a nexthop that failed the neighbor validity test even if there was
one that passed.

Refactor nexthop_select_path_hthr() to make it more similar to
fib_select_multipath() and fix the problem mentioned above.

v1: https://lore.kernel.org/netdev/20230529201914.69828-1-bpoirier@nvidia.com/
====================

Link: https://lore.kernel.org/r/20230719-nh_select-v2-0-04383e89f868@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

bf837e8f

selftests: net: Add test cases for nexthop groups with invalid neighbors · c7e95bbd

Benjamin Poirier authored Jul 19, 2023

Add test cases for hash threshold (multipath) nexthop groups with invalid
neighbors. Check that a nexthop with invalid neighbor is not selected when
there is another nexthop with a valid neighbor. Check that there is no
crash when there is no nexthop with a valid neighbor.

The first test fails before the previous commit in this series.
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230719-nh_select-v2-4-04383e89f868@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c7e95bbd

nexthop: Do not return invalid nexthop object during multipath selection · 75f5f04c

Benjamin Poirier authored Jul 19, 2023

With legacy nexthops, when net.ipv4.fib_multipath_use_neigh is set,
fib_select_multipath() will never set res->nhc to a nexthop that is not
good (as per fib_good_nh()). OTOH, with nexthop objects,
nexthop_select_path_hthr() may return a nexthop that failed the
nexthop_is_good_nh() test even if there was one that passed. Refactor
nexthop_select_path_hthr() to follow a selection logic more similar to
fib_select_multipath().

The issue can be demonstrated with the following sequence of commands. The
first block shows that things work as expected with legacy nexthops. The
last sequence of `ip rou get` in the second block shows the problem case -
some routes still use the .2 nexthop.

sysctl net.ipv4.fib_multipath_use_neigh=1
ip link add dummy1 up type dummy
ip rou add 198.51.100.0/24 nexthop via 192.0.2.1 dev dummy1 onlink nexthop via 192.0.2.2 dev dummy1 onlink
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh add 192.0.2.1 dev dummy1 nud failed
echo ".1 failed:"  # results should not use .1
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh del 192.0.2.1 dev dummy1
ip neigh add 192.0.2.2 dev dummy1 nud failed
echo ".2 failed:"  # results should not use .2
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip link del dummy1

ip link add dummy1 up type dummy
ip nexthop add id 1 via 192.0.2.1 dev dummy1 onlink
ip nexthop add id 2 via 192.0.2.2 dev dummy1 onlink
ip nexthop add id 1001 group 1/2
ip rou add 198.51.100.0/24 nhid 1001
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh add 192.0.2.1 dev dummy1 nud failed
echo ".1 failed:"  # results should not use .1
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip neigh del 192.0.2.1 dev dummy1
ip neigh add 192.0.2.2 dev dummy1 nud failed
echo ".2 failed:"  # results should not use .2
for i in {10..19}; do ip -o rou get 198.51.100.$i; done
ip link del dummy1
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230719-nh_select-v2-3-04383e89f868@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

75f5f04c

nexthop: Factor out neighbor validity check · 4bb5239b

Benjamin Poirier authored Jul 19, 2023

For legacy nexthops, there is fib_good_nh() to check the neighbor validity.
In order to make the nexthop object code more similar to the legacy nexthop
code, factor out the nexthop object neighbor validity check into its own
function.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230719-nh_select-v2-2-04383e89f868@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4bb5239b

nexthop: Factor out hash threshold fdb nexthop selection · eedd47a6

Benjamin Poirier authored Jul 19, 2023

The loop in nexthop_select_path_hthr() includes code to check for neighbor
validity. Since this does not apply to fdb nexthops, simplify the loop by
moving the fdb nexthop selection to its own function.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230719-nh_select-v2-1-04383e89f868@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

eedd47a6

Merge branch 'eth-bnxt-handle-invalid-tx-completions-more-gracefully' · 022add1d

Jakub Kicinski authored Jul 20, 2023

Jakub Kicinski says:

====================
eth: bnxt: handle invalid Tx completions more gracefully

bnxt trusts the events generated by the device which may lead to kernel
crashes. These are extremely rare but they do happen. For a while
I thought crashing may be intentional, because device reporting invalid
completions should never happen, and having a core dump could be useful
if it does. But in practice I haven't found any clues in the core dumps,
and panic_on_warn exists.

Series was tested by forcing the recovery path manually. Because of
how rare the real crashes are I can't confirm it works for the actual
device errors until it's been widely deployed.

v1: https://lore.kernel.org/all/20230710205611.1198878-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20230720010440.1967136-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

022add1d

eth: bnxt: handle invalid Tx completions more gracefully · 2b56b3d9

Jakub Kicinski authored Jul 19, 2023

Invalid Tx completions should never happen (tm) but when they do
they crash the host, because driver blindly trusts that there is
a valid skb pointer on the ring.

The completions I've seen appear to be some form of FW / HW
miscalculation or staleness, they have typical (small) values
(<100), but they are most often higher than number of queued
descriptors. They usually happen after boot.

Instead of crashing, print a warning and schedule a reset.

Link: https://lore.kernel.org/r/20230720010440.1967136-4-kuba@kernel.orgReviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2b56b3d9

eth: bnxt: take the bit to set as argument of bnxt_queue_sp_work() · 9b1a00fd

Jakub Kicinski authored Jul 19, 2023

Most callers of bnxt_queue_sp_work() set a bit to indicate what work
to perform right before calling it. Pass it to the function instead.

Link: https://lore.kernel.org/r/20230720010440.1967136-3-kuba@kernel.orgReviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9b1a00fd

eth: bnxt: move and rename reset helpers · fea2993a

Jakub Kicinski authored Jul 19, 2023

Move the reset helpers, subsequent patches will need some
of them on the Tx path.

While at it rename bnxt_sched_reset(), on more recent chips
it schedules a queue reset, instead of a fuller reset.

Link: https://lore.kernel.org/r/20230720010440.1967136-2-kuba@kernel.orgReviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fea2993a

20 Jul, 2023 11 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 59be3baa

Jakub Kicinski authored Jul 20, 2023

Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

59be3baa

Merge tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 57f1f9dd

Linus Torvalds authored Jul 20, 2023

Pull networking fixes from Jakub Kicinski:
 "Including fixes from BPF, netfilter, bluetooth and CAN.

  Current release - regressions:

   - eth: r8169: multiple fixes for PCIe ASPM-related problems

   - vrf: fix RCU lockdep splat in output path

  Previous releases - regressions:

   - gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set

   - dsa: mv88e6xxx: do a final check before timing out when polling

   - nf_tables: fix sleep in atomic in nft_chain_validate

  Previous releases - always broken:

   - sched: fix undoing tcf_bind_filter() in multiple classifiers

   - bpf, arm64: fix BTI type used for freplace attached functions

   - can: gs_usb: fix time stamp counter initialization

   - nft_set_pipapo: fix improper element removal (leading to UAF)

  Misc:

   - net: support STP on bridge in non-root netns, STP prevents packet
     loops so not supporting it results in freezing systems of
     unsuspecting users, and in turn very upset noises being made

   - fix kdoc warnings

   - annotate various bits of TCP state to prevent data races"

* tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
  net: phy: prevent stale pointer dereference in phy_init()
  tcp: annotate data-races around fastopenq.max_qlen
  tcp: annotate data-races around icsk->icsk_user_timeout
  tcp: annotate data-races around tp->notsent_lowat
  tcp: annotate data-races around rskq_defer_accept
  tcp: annotate data-races around tp->linger2
  tcp: annotate data-races around icsk->icsk_syn_retries
  tcp: annotate data-races around tp->keepalive_probes
  tcp: annotate data-races around tp->keepalive_intvl
  tcp: annotate data-races around tp->keepalive_time
  tcp: annotate data-races around tp->tsoffset
  tcp: annotate data-races around tp->tcp_tx_delay
  Bluetooth: MGMT: Use correct address for memcpy()
  Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
  Bluetooth: SCO: fix sco_conn related locking and validity issues
  Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
  Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
  Bluetooth: coredump: fix building with coredump disabled
  Bluetooth: ISO: fix iso_conn related locking and validity issues
  Bluetooth: hci_event: call disconnect callback before deleting conn
  ...

57f1f9dd

Merge tag 'for-net-2023-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · 75d42b35

Jakub Kicinski authored Jul 20, 2023

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - Fix building with coredump disabled
 - Fix use-after-free in hci_remove_adv_monitor
 - Use RCU for hci_conn_params and iterate safely in hci_sync
 - Fix locking issues on ISO and SCO
 - Fix bluetooth on Intel Macbook 2014

* tag 'for-net-2023-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Use correct address for memcpy()
  Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
  Bluetooth: SCO: fix sco_conn related locking and validity issues
  Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
  Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
  Bluetooth: coredump: fix building with coredump disabled
  Bluetooth: ISO: fix iso_conn related locking and validity issues
  Bluetooth: hci_event: call disconnect callback before deleting conn
  Bluetooth: use RCU for hci_conn_params and iterate safely in hci_sync
====================

Link: https://lore.kernel.org/r/20230720190201.446469-1-luiz.dentz@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

75d42b35

Merge tag 'nf-23-07-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 9b39f758

Jakub Kicinski authored Jul 20, 2023

Florian Westphal says:

====================
Netfilter fixes for net:

The following patchset contains Netfilter fixes for net:

1. Fix spurious -EEXIST error from userspace due to
   padding holes, this was broken since 4.9 days
   when 'ignore duplicate entries on insert' feature was
   added.

2. Fix a sched-while-atomic bug, present since 5.19.

3. Properly remove elements if they lack an "end range".
   nft userspace always sets an end range attribute, even
   when its the same as the start, but the abi doesn't
   have such a restriction. Always broken since it was
   added in 5.6, all three from myself.

4 + 5: Bound chain needs to be skipped in netns release
   and on rule flush paths, from Pablo Neira.

* tag 'nf-23-07-20' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: skip bound chain on rule flush
  netfilter: nf_tables: skip bound chain in netns release path
  netfilter: nft_set_pipapo: fix improper element removal
  netfilter: nf_tables: can't schedule in nft_chain_validate
  netfilter: nf_tables: fix spurious set element insertion failure
====================

Link: https://lore.kernel.org/r/20230720165143.30208-1-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9b39f758

net: phy: prevent stale pointer dereference in phy_init() · 1c613bea

Vladimir Oltean authored Jul 20, 2023

mdio_bus_init() and phy_driver_register() both have error paths, and if
those are ever hit, ethtool will have a stale pointer to the
phy_ethtool_phy_ops stub structure, which references memory from a
module that failed to load (phylib).

It is probably hard to force an error in this code path even manually,
but the error teardown path of phy_init() should be the same as
phy_exit(), which is now simply not the case.

Fixes: 55d8f053 ("net: phy: Register ethtool PHY operations")
Link: https://lore.kernel.org/netdev/ZLaiJ4G6TaJYGJyU@shell.armlinux.org.uk/Suggested-by: Russell King (Oracle) <linux@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230720000231.1939689-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1c613bea

Merge branch 'tcp-add-missing-annotations' · 7998c0ad

Jakub Kicinski authored Jul 20, 2023

Eric Dumazet says:

====================
tcp: add missing annotations

This series was inspired by one syzbot (KCSAN) report.

do_tcp_getsockopt() does not lock the socket, we need to
annotate most of the reads there (and other places as well).

This is a first round, another series will come later.
====================

Link: https://lore.kernel.org/r/20230719212857.3943972-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

7998c0ad

tcp: annotate data-races around fastopenq.max_qlen · 70f360dd

Eric Dumazet authored Jul 19, 2023

This field can be read locklessly.

Fixes: 1536e285 ("tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-12-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

70f360dd

tcp: annotate data-races around icsk->icsk_user_timeout · 26023e91

Eric Dumazet authored Jul 19, 2023

This field can be read locklessly from do_tcp_getsockopt()

Fixes: dca43c75 ("tcp: Add TCP_USER_TIMEOUT socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-11-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

26023e91

tcp: annotate data-races around tp->notsent_lowat · 1aeb87bc

Eric Dumazet authored Jul 19, 2023

tp->notsent_lowat can be read locklessly from do_tcp_getsockopt()
and tcp_poll().

Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-10-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1aeb87bc

tcp: annotate data-races around rskq_defer_accept · ae488c74

Eric Dumazet authored Jul 19, 2023

do_tcp_getsockopt() reads rskq_defer_accept while another cpu
might change its value.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-9-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ae488c74

tcp: annotate data-races around tp->linger2 · 9df5335c

Eric Dumazet authored Jul 19, 2023

do_tcp_getsockopt() reads tp->linger2 while another cpu
might change its value.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-8-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9df5335c