Commits · c592286a527fff2c4dc6ea10c6b6d9143ca65872 · Kirill Smelkov / linux

26 Jan, 2022 28 commits

net: dpaa2-mac: use .mac_select_pcs() interface · c592286a

Russell King (Oracle) authored Jan 25, 2022

Convert dpaa2-mac to use the mac_select_pcs() interface rather than
using phylink_set_pcs(). The intention here is to unify the approach
for PCS and eventually to remove phylink_set_pcs().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

c592286a

Merge branch 'axienet-pcs-modernize' · 098db2f0

David S. Miller authored Jan 26, 2022

Russell King says:

====================
net: axienet: modernise pcs implementation

These two patches modernise the Xilinx axienet PCS implementation to
use the phylink split PCS support.

The first patch adds split PCS support and makes use of the newly
introduced mac_select_pcs() function, which is the preferred way to
conditionally attach a PCS.

The second patch cleans up the use of mdiobus_write() since we now have
bus accessors for mdio devices.

There should be no functional change to the driver.

This series was previously sent CFT on the 16th December (message ID
Ybs1cdM3KUTsq4Vx@shell.armlinux.org.uk), and feedback addressed. CFT v2
sent 4th January (message ID YdQlI8gcVwg2sR+5@shell.armlinux.org.uk).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

098db2f0

net: axienet: replace mdiobus_write() with mdiodev_write() · 03854d8a

Russell King (Oracle) authored Jan 25, 2022

Commit 0ebecb26 ("net: mdio: Add helper functions for accessing
MDIO devices") added support for mdiodev accessor operations that
neatly wrap the mdiobus accessor operations. Since we are dealing with
a mdio device here, update the driver to use mdiodev_write().
Tested-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

03854d8a

net: axienet: convert to phylink_pcs · 7a86be6a

Russell King (Oracle) authored Jan 25, 2022

Convert axienet to use the phylink_pcs layer, resulting in it no longer
being a legacy driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

7a86be6a

Merge branch 'bnxt_en-RTC' · 71f390f5

David S. Miller authored Jan 26, 2022

Michael Chan says:

====================
bnxt_en: Add RTC mode for PTP

This series adds Real Time Clock (RTC) mode for PTP timestamping.  In
RTC mode, the 64-bit time value is programmed into the NIC's PTP
hardware clock (PHC).  Prior to this, the PHC is running as a free
counter.  For example, in multi-function environment, we need to run
PTP in RTC mode.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

71f390f5

bnxt_en: Handle async event when the PHC is updated in RTC mode · 8bcf6f04

Pavan Chebbi authored Jan 25, 2022

In Multi-host environment, when the PHC is updated by one host,
an async message from firmware will be sent to other hosts.
Re-initialize the timecounter when the driver receives this
async message.

Cc: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8bcf6f04

bnxt_en: Implement .adjtime() for PTP RTC mode · e7b0afb6

Pavan Chebbi authored Jan 25, 2022

The adjusted time is set in the PHC in RTC mode.  We also need to
update the snapshots ptp->current_time and ptp->old_time when the
time is adjusted.

Cc: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e7b0afb6

bnxt_en: Add driver support to use Real Time Counter for PTP · 24ac1ecd

Pavan Chebbi authored Jan 25, 2022

Add support for RTC mode if it is supported by firmware.  In RTC
mode, the PHC is set to the 64-bit clock.  Because the legacy interface
is 48-bit, the driver still has to keep track of the upper 16 bits and
handle the rollover.

Cc: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24ac1ecd

bnxt_en: PTP: Refactor PTP initialization functions · 740c342e

Pavan Chebbi authored Jan 25, 2022

Making the ptp free and timecounter initialization code into separate
functions so that later patches can use them.

Cc: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

740c342e

bnxt_en: Update firmware interface to 1.10.2.73 · 2895c153

Michael Chan authored Jan 25, 2022

The main changes are PTP support for RTC, additional NVM error codes,
backing store v2 firmware APIs.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2895c153

Merge branch 'stmmac-PCS-modernize' · d28b159b

David S. Miller authored Jan 26, 2022

Russell King says:

====================
net: stmmac/xpcs: modernise PCS support

This series updates xpcs and stmmac for the recent changes to phylink
to better support split PCS and to get rid of private MAC validation
functions.

This series is slightly more involved than other conversions as stmmac
has already had optional proper split PCS support.

The first six patches of this series were originally posted on 16th
December for CFT, and Wong Vee Khee reported his Intel Elkhart Lake
setup was fine the first six these. However, no tested-by was given.

The patches:

1) Provide a function to query the xpcs for the interface modes that
   are supported.

2) Populates the MAC capabilities and switches stmmac_validate() to use
   phylink_get_linkmodes(). We do not use phylink_generic_validate() yet
   as (a) we do not always have the supported interfaces populated, and
   (b) the existing code does not restrict based on interface. There
   should be no functional effect from this patch.

3) Populates phylink's supported interfaces from the xpcs when the xpcs
   is configured by firmware and also the firmware configured interface
   mode. Note: this will restrict stmmac to only supporting these
   interfaces modes - stmmac maintainers need to verify that this
   behaviour is acceptable.

4) stmmac_validate() tail-calls xpcs_validate(), but we don't need it to
   now that PCS have their own validation method. Convert stmmac and
   xpcs to use this method instead.

5) xpcs sets the poll field of phylink_pcs to true, meaning xpcs
   requires its status to be polled. There is no need to also set the
   phylink_config.pcs_poll. Remove this.

6) Switch to phylink_generic_validate(). This is probably the most
   contravertial change in this patch set as this will cause the MAC to
   restrict link modes based on the interface mode. From an inspection
   of the xpcs driver, this should be safe, as XPCS only further
   restricts the link modes to a subset of these (whether that is
   correct or not is not an issue I am addressing here.) For
   implementations that do not use xpcs, this is a more open question
   and needs feedback from stmmac maintainers.

7) Convert to use mac_select_pcs() rather than phylink_set_pcs() to set
   the PCS - the intention is to eventually remove phylink_set_pcs()
   once there are no more users of this.

v2: fix signoff and temporary warning in patch 4
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d28b159b

net: stmmac: use .mac_select_pcs() interface · 72e94511

Russell King (Oracle) authored Jan 26, 2022

Convert stmmac to use the mac_select_pcs() interface rather than using
phylink_set_pcs(). The intention here is to unify the approach for PCS
and eventually to remove phylink_set_pcs().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

72e94511

net: stmmac: convert to phylink_generic_validate() · 04a0683f

Russell King (Oracle) authored Jan 26, 2022

Convert stmmac to use phylink_generic_validate() now that we have the
MAC capabilities and supported interfaces filled in, and we have the
PCS validation handled via the PCS operations.
Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

04a0683f

net: stmmac: remove phylink_config.pcs_poll usage · f4c296c9

Russell King (Oracle) authored Jan 26, 2022

Phylink will use PCS polling whenever the PCS's poll member is set, so
setting phylink_config.pcs_poll as well is redundant.
Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

f4c296c9

net: stmmac/xpcs: convert to pcs_validate() · fe70fb74

Russell King (Oracle) authored Jan 26, 2022

stmmac explicitly calls the xpcs driver to validate the ethtool
linkmodes. This is no longer necessary as phylink now supports
validation through a PCS method. Convert both drivers to use this
new mechanism.

Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe70fb74

net: stmmac: fill in supported_interfaces · d194923d

Russell King (Oracle) authored Jan 26, 2022

Fill in phylink's supported_interfaces bitmap with the PHY interface
modes which can be used to talk to the PHY.

We indicate that the PHY interface mode passed in platform data is
always supported, as this is the initial mode passed into phylink.
When there is no PCS specified, we assume that this is the only mode
that is supported - indeed, the driver appears not to support dynamic
switching of interface types at present.

When a xpcs is present, it defines the PHY interface modes that the
stmmac driver can support. Request the supported interfaces from the
xpcs driver, and pass them to phylink.
Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

d194923d

net: stmmac: convert to phylink_get_linkmodes() · 92c3807b

Russell King (Oracle) authored Jan 26, 2022

Add the MAC speed, duplex and pause capabilities to the phylink_config
structure, and switch stmmac_validate() to use phylink_get_linkmodes()
to generate the mask of supported ethtool link modes.
Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

92c3807b

net: xpcs: add support for retrieving supported interface modes · be6ec5b7

Russell King (Oracle) authored Jan 26, 2022

Add a function to the xpcs driver to retrieve the supported PHY
interface modes, which can be used by drivers to fill in phylink's
supported_interfaces mask.

We validate the interface bit index to ensure that it fits within the
bitmap as xpcs lists PHY_INTERFACE_MODE_MAX in an entry.
Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

be6ec5b7

Merge branch 'mlxsw-RJ45' · 3cade91d

David S. Miller authored Jan 26, 2022

Ido Schimmel says:

====================
mlxsw: Add RJ45 ports support

We are in the process of qualifying a new system that has RJ45 ports as
opposed to the transceiver modules (e.g., SFP, QSFP) present on all
existing systems.

This patchset adds support for these ports in mlxsw by adding a couple of
missing BaseT link modes and rejecting ethtool operations that are
specific to transceiver modules.

Patchset overview:

Patches #1-#3 are cleanups and preparations.

Patch #4 adds support for two new link modes.

Patches #5-#6 query and cache the port module's type (e.g., QSFP, RJ45)
during initialization.

Patches #7-#9 forbid ethtool operations that are invalid on RJ45 ports.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3cade91d

mlxsw: core_env: Forbid module reset on RJ45 ports · b7347cdf

Danielle Ratson authored Jan 26, 2022

Transceiver module reset through 'rst' field in PMAOS register is not
supported on RJ45 ports, so module reset should be rejected.

Therefore, before trying to access this field, validate the port module
type that was queried during initialization and return an error to user
space in case the port module type is RJ45 (twisted pair).

Output example:

 # ethtool --reset swp11 phy
 ETHTOOL_RESET 0x40
 Cannot issue ETHTOOL_RESET: Invalid argument
 $ dmesg
 mlxsw_spectrum 0000:03:00.0 swp11: Reset module is not supported on port module type
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b7347cdf

mlxsw: core_env: Forbid power mode set and get on RJ45 ports · c8f994cc

Danielle Ratson authored Jan 26, 2022

PMMP (Port Module Memory Map Properties) and MCION (Management Cable IO
and Notifications) registers are not supported on RJ45 ports, so setting
and getting power mode should be rejected.

Therefore, before trying to access those registers, validate the port
module type that was queried during initialization and return an error
to user space in case the port module type is RJ45 (twisted pair).

Set output example:

 # ethtool --set-module swp1 power-mode-policy auto
 netlink error: mlxsw_core: Power mode is not supported on port module type
 netlink error: Invalid argument

Get output example:

 $ ethtool --show-module swp11
 netlink error: mlxsw_core: Power mode is not supported on port module type
 netlink error: Invalid argument
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c8f994cc

mlxsw: core_env: Forbid getting module EEPROM on RJ45 ports · 615ebb8c

Danielle Ratson authored Jan 26, 2022

MCIA (Management Cable Info Access) register is not supported on RJ45
ports, so getting module EEPROM should be rejected.

Therefore, before trying to access this register, validate the port
module type that was queried during initialization and return an error
to user space in case the port module type is RJ45 (twisted pair).

Examples for output when trying to get EEPROM module:

Using netlink:

 # ethtool -m swp1
 netlink error: mlxsw_core: EEPROM is not equipped on port module type
 netlink error: Invalid argument

Using IOCTL:

 # ethtool -m swp1
 Cannot get module EEPROM information: Invalid argument
 $ dmesg
 mlxsw_spectrum 0000:03:00.0 swp1: EEPROM is not equipped on port module type
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

615ebb8c

mlxsw: core_env: Query and store port module's type during initialization · e62f5b0e

Danielle Ratson authored Jan 26, 2022

Query and store port module's type during initialization so that it
could be later used to determine if certain configurations are allowed
based on the type.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e62f5b0e

mlxsw: reg: Add Port Module Type Mapping register · 0d31441e

Danielle Ratson authored Jan 26, 2022

Add the Port Module Type Mapping (PMTP) register. It will be used by
subsequent patches to query port module types and forbid certain
configurations based on the port module's type.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d31441e

mlxsw: spectrum_ethtool: Add support for two new link modes · 78cf4b92

Danielle Ratson authored Jan 26, 2022

As part of a process for supporting a new system with RJ45 connectors,
100BaseT and 1000BaseT link modes need to be supported.

Add support for these two link modes by adding the two corresponding
bits in PTYS (Port Type and Speed) register.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

78cf4b92

mlxsw: Add netdev argument to mlxsw_env_get_module_info() · 5eaec6d8

Danielle Ratson authored Jan 26, 2022

The next patches will forbid querying the port module's EEPROM info when
its type is RJ45 as in this case no transceiver module can ever be
connected to the port.

Add netdev argument to mlxsw_env_get_module_info() so it could be used
to print an error to the kernel log via netdev_err().
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5eaec6d8

mlxsw: core_env: Do not pass number of modules as argument · 6af5f7b6

Ido Schimmel authored Jan 26, 2022

The number of modules can be resolved from the first argument, so do not
pass it.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6af5f7b6

mlxsw: spectrum_ethtool: Remove redundant variable · 5c759fe2

Ido Schimmel authored Jan 26, 2022

Remove the 'err' variable and simply return.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c759fe2

25 Jan, 2022 12 commits

net: Adjust sk_gso_max_size once when set · ab14f180

David Ahern authored Jan 24, 2022

sk_gso_max_size is set based on the dst dev. Both users of it
adjust the value by the same offset - (MAX_TCP_HEADER + 1). Rather
than compute the same adjusted value on each call do the adjustment
once when set.
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220125024511.27480-1-dsahern@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ab14f180

net: tulip: remove redundant assignment to variable new_csr6 · 6b0671a2

Colin Ian King authored Jan 23, 2022

Variable new_csr6 is being initialized with a value that is never
read, it is being re-assigned later on. The assignment is redundant
and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20220123183440.112495-1-colin.i.king@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6b0671a2

ipv6: gro: flush instead of assuming different flows on hop_limit mismatch · 6fc2f383

Jakub Kicinski authored Jan 24, 2022

IPv6 GRO considers packets to belong to different flows when their
hop_limit is different. This seems counter-intuitive, the flow is
the same. hop_limit may vary because of various bugs or hacks but
that doesn't mean it's okay for GRO to reorder packets.

Practical impact of this problem on overall TCP performance
is unclear, but TCP itself detects this reordering and bumps
TCPSACKReorder resulting in user complaints.

Eric warns that there may be performance regressions in setups
which do packet spraying across links with similar RTT but different
hop count. To be safe let's target -next and not treat this
as a fix. If the packet spraying is using flow label there should
be no difference in behavior as flow label is checked first.

Note that the code plays an easy to miss trick by upcasting next_hdr
to a u16 pointer and compares next_hdr and hop_limit in one go.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6fc2f383

net: mana: Use struct_size() helper in mana_gd_create_dma_region() · 10cdc794

Gustavo A. R. Silva authored Jan 24, 2022

Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows that,
in the worst scenario, could lead to heap overflows.

Also, address the following sparse warnings:
drivers/net/ethernet/microsoft/mana/gdma_main.c:677:24: warning: using sizeof on a flexible structure

Link: https://github.com/KSPP/linux/issues/174Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10cdc794

r8169: use new PM macros · 8fe6e670

Heiner Kallweit authored Jan 24, 2022

This is based on series [0] that extended the PM core. Now the compiler
can see the PM callbacks also on systems not defining CONFIG_PM.
The optimizer will remove the functions then in this case.

[0] https://lore.kernel.org/netdev/20211207002102.26414-1-paul@crapouillou.net/Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8fe6e670

Merge branch 'dsa-avoid-cross-chip-vlan-sync' · 934d0f03

David S. Miller authored Jan 25, 2022

Tobias Waldekranz says:

====================
net: dsa: Avoid cross-chip syncing of VLAN filtering

This bug has been latent in the source for quite some time, I suspect
due to the homogeneity of both typical configurations and hardware.

On singlechip systems, this would never be triggered. The only reason
I saw it on my multichip system was because not all chips had the same
number of ports, which means that the misdemeanor alien call turned
into a felony array-out-of-bounds access.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

934d0f03

net: dsa: Avoid cross-chip syncing of VLAN filtering · 108dc874

Tobias Waldekranz authored Jan 24, 2022

Changes to VLAN filtering are not applicable to cross-chip
notifications.

On a system like this:

.-----.   .-----.   .-----.
| sw1 +---+ sw2 +---+ sw3 |
'-1-2-'   '-1-2-'   '-1-2-'

Before this change, upon sw1p1 leaving a bridge, a call to
dsa_port_vlan_filtering would also be made to sw2p1 and sw3p1.

In this scenario:

.---------.   .-----.   .-----.
|   sw1   +---+ sw2 +---+ sw3 |
'-1-2-3-4-'   '-1-2-'   '-1-2-'

When sw1p4 would leave a bridge, dsa_port_vlan_filtering would be
called for sw2 and sw3 with a non-existing port - leading to array
out-of-bounds accesses and crashes on mv88e6xxx.

Fixes: d371b7c9 ("net: dsa: Unset vlan_filtering when ports leave the bridge")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

108dc874

net: dsa: Move VLAN filtering syncing out of dsa_switch_bridge_leave · 381a7301

Tobias Waldekranz authored Jan 24, 2022

Most of dsa_switch_bridge_leave was, in fact, dealing with the syncing
of VLAN filtering for switches on which that is a global
setting. Separate the two phases to prepare for the cross-chip related
bugfix in the following commit.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

381a7301

Merge branch 'netns-speedup-dismantle' · 51d555cf

David S. Miller authored Jan 25, 2022

Eric Dumazet says:

====================
netns: speedup netns dismantles

netns are dismantled by a single thread, from cleanup_net()

On hosts with many TCP sockets, and/or many cpus, this thread
is spending too many cpu cycles, and can not keep up with some
workloads.

- Removing 3*num_possible_cpus() sockets per netns, for icmp and tcp protocols.
- Iterating over all TCP sockets to remove stale timewait sockets.

This patch series removes ~50% of cleanup_net() cpu costs on
hosts with 256 cpus. It also reduces per netns memory footprint.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

51d555cf

ipv4/tcp: do not use per netns ctl sockets · 37ba017d

Eric Dumazet authored Jan 24, 2022

TCP ipv4 uses per-cpu/per-netns ctl sockets in order to send
RST and some ACK packets (on behalf of TIMEWAIT sockets).

This adds memory and cpu costs, which do not seem needed.
Now typical servers have 256 or more cores, this adds considerable
tax to netns users.

tcp sockets are used from BH context, are not receiving packets,
and do not store any persistent state but the 'struct net' pointer
in order to be able to use IPv4 output functions.

Note that I attempted a related change in the past, that had
to be hot-fixed in commit bdbbb852 ("ipv4: tcp: get rid of ugly unicast_sock")

This patch could very well surface old bugs, on layers not
taking care of sk->sk_kern_sock properly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

37ba017d

ipv6: do not use per netns icmp sockets · 6a17b961

Eric Dumazet authored Jan 24, 2022

Back in linux-2.6.25 (commit 98c6d1b2 "[NETNS]: Make icmpv6_sk per namespace.",
we added private per-cpu/per-netns ipv6 icmp sockets.

This adds memory and cpu costs, which do not seem needed.
Now typical servers have 256 or more cores, this adds considerable
tax to netns users.

icmp sockets are used from BH context, are not receiving packets,
and do not store any persistent state but the 'struct net' pointer.

icmpv6_xmit_lock() already makes sure to lock the chosen per-cpu
socket.

This patch has a considerable impact on the number of netns
that the worker thread in cleanup_net() can dismantle per second,
because ip6mr_sk_done() is no longer called, meaning we no longer
acquire the rtnl mutex, competing with other threads adding new netns.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a17b961

ipv4: do not use per netns icmp sockets · a15c89c7

Eric Dumazet authored Jan 24, 2022

Back in linux-2.6.25 (commit 4a6ad7a1 "[NETNS]: Make icmp_sk per namespace."),
we added private per-cpu/per-netns ipv4 icmp sockets.

This adds memory and cpu costs, which do not seem needed.
Now typical servers have 256 or more cores, this adds considerable
tax to netns users.

icmp sockets are used from BH context, are not receiving packets,
and do not store any persistent state but the 'struct net' pointer.

icmp_xmit_lock() already makes sure to lock the chosen per-cpu
socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a15c89c7