1. 24 Jun, 2021 26 commits
    • gve: Add DQO fields for core data structures · a4aa1f1e
      Bailey Forrest authored
      - Add new DQO datapath structures:
        - `gve_rx_buf_queue_dqo`
        - `gve_rx_compl_queue_dqo`
        - `gve_rx_buf_state_dqo`
        - `gve_tx_desc_dqo`
        - `gve_tx_pending_packet_dqo`
      
      - Incorporate these into the existing ring data structures:
        - `gve_rx_ring`
        - `gve_tx_ring`
      
      Noteworthy mentions:
      
      - `gve_rx_buf_state_dqo` represents an RX buffer which was posted to HW.
        Each RX queue has an array of these objects, and the index into the
        array is used as the buffer_id when posted to HW.
      
      - `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The
        completion_tag is the index into the array.
      
      - These two structures have links for linked lists which are represented
        by 16b indexes into a contiguous array of these structures.
        This reduces memory footprint compared to 64b pointers.
      
      - We use unions for the writeable datapath structures to reduce cache
        footprint. GQI-specific members will be renamed to match the DQO
        members in a future patch.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Add dqo descriptors · 22319818
      Bailey Forrest authored
      General description of rings and descriptors:
      
      TX ring is used for sending TX packet buffers to the NIC. It has the
      following descriptors:
      - `gve_tx_pkt_desc_dqo` - Data buffer descriptor
      - `gve_tx_tso_context_desc_dqo` - TSO context descriptor
      - `gve_tx_general_context_desc_dqo` - Generic metadata descriptor
      
      Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo`,
      which represents the logical interpretation of the metadata bytes. It's
      helpful to define this structure because the metadata bytes exist in
      multiple descriptor types (including `gve_tx_tso_context_desc_dqo`),
      and the device requires that the same field have the same value in all
      descriptors.
      
      The TX completion ring is used to receive completions from the NIC.
      Having a separate ring allows for completions to be out of order. The
      completion descriptor `gve_tx_compl_desc` has several different types,
      most important are packet and descriptor completions. Descriptor
      completions are used to notify the driver when descriptors sent on the
      TX ring are done being consumed. The descriptor completion is only used
      to signal that space is cleared in the TX ring. A packet completion will
      be received when a packet transmitted on the TX queue is done being
      transmitted.
      
      In addition there are "miss" and "reinjection" completions. The device
      implements a "flow-miss model". Most packets will simply receive a
      packet completion. The flow-miss system may choose to process a packet
      based on its contents. A TX packet which experiences a flow miss would
      receive a miss completion followed by a later reinjection completion.
      The miss completion is received when the packet starts to be processed
      by the flow-miss system and the reinjection completion is received when
      the flow-miss system completes processing the packet and sends it on the
      wire.
      
      The RX buffer ring is used to send buffers to HW via the
      `gve_rx_desc_dqo` descriptor.
      
      Received packets are put into the RX queue by the device, which
      populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors
      refer to buffers posted by the buffer queue. Received buffers may be
      returned out of order, such as when HW LRO is enabled.
      
      Important concepts:
      - "TX" and "RX buffer" queues, which send descriptors to the device, use
        MMIO doorbells to notify the device of new descriptors.
      
      - "RX" and "TX completion" queues, which receive descriptors from the
        device, use a "generation bit" to know when a descriptor was populated
        by the device. The driver initializes all bits with the "current
        generation". The device will populate received descriptors with the
        "next generation" which is inverted from the current generation. When
        the ring wraps, the current/next generation are swapped.
      
      - It's the driver's responsibility to ensure that the RX and TX
        completion queues are not overrun. This can be accomplished by
        limiting the number of descriptors posted to HW.
      
      - TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
        buffer_id. These will be returned on the TX completion and RX queues
        respectively to let the driver know which packet/buffer was completed.
      
      Bitfields are used to describe descriptor fields. This notation is more
      concise and readable than shift-and-mask. It is possible because the
      driver is restricted to little-endian platforms.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Add support for DQO RX PTYPE map · c4b87ac8
      Bailey Forrest authored
      Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the
      packet. L3 and L4 types are necessary in order to set the hash and csum
      on RX SKBs correctly.
      
      DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map
      enables the device to tell the driver how to map from PTYPE index to
      L3/L4 type.
      
      The device doesn't provide any guarantees about the range of possible
      PTYPEs, so we just use a 1024 entry array to implement a fast mapping
      structure.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: adminq: DQO specific device descriptor logic · 5ca2265e
      Bailey Forrest authored
      - In addition to TX and RX queues, DQO has TX completion and RX buffer
        queues.
        - TX completions are received when the device has completed sending a
          packet on the wire.
        - RX buffers are posted on a separate queue from the RX completions.
      - DQO descriptor rings are allowed to be smaller than PAGE_SIZE.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Introduce per netdev `enum gve_queue_format` · a5886ef4
      Bailey Forrest authored
      The currently supported queue formats are:
      - GQI_RDA - GQI with raw DMA addressing
      - GQI_QPL - GQI with queue page list
      - DQO_RDA - DQO with raw DMA addressing
      
      The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we
      remove it in favor of just checking against GQI_RDA.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Introduce a new model for device options · 8a39d3e0
      Bailey Forrest authored
      The current model uses an integer ID and a fixed size struct for the
      parameters of each device option.
      
      The new model allows the device option structs to grow in size over
      time. A driver may assume that changes to device option structs will
      always be appended.
      
      New device options will also generally have a
      `supported_features_mask` so that the driver knows which fields within a
      particular device option are enabled.
      
      `gve_device_option.feat_mask` is changed to `required_features_mask`,
      and it is a bitmask which must match the value expected by the driver.
      This gives the device the ability to break backwards compatibility with
      old drivers for certain features by blocking the old drivers from trying
      to use the feature.
      
      We maintain ABI compatibility with the old model for
      GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which
      does not support the new model.
      
      This patch introduces some new terminology:
      
      RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA
            mapped and read/updated by the device.
      
      QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped
            with the device for read/write and data is copied from/to SKBs.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Make gve_rx_slot_page_info.page_offset an absolute offset · 920fb451
      Bailey Forrest authored
      Using `page_offset` like a boolean means a page may only be split into
      two sections. With page sizes larger than 4k, this can be very wasteful.
      Future commits in this patchset use `struct gve_rx_slot_page_info` in a
      way which supports a fixed buffer size and a variable page size.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: gve_rx_copy: Move padding to an argument · 35f9b2f4
      Bailey Forrest authored
      Future use cases will have a different padding value.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Move some static functions to a common file · dbdaa675
      Bailey Forrest authored
      These functions will be shared by the GQI and DQO variants of the GVNIC
      driver in follow-up patches in this series.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Update GVE documentation to describe DQO · c6a7ed77
      Bailey Forrest authored
      DQO is a new descriptor format for our next generation virtual NIC.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • usbnet: add usbnet_event_names[] for kevent · 47889068
      Yajun Deng authored
      Modify the netdev_dbg output in usbnet_defer_kevent() from an int to a
      char *; this is more readable.
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'add-sparx5i-driver' · 67faf76d
      David S. Miller authored
      Steen Hegelund says:
      
      ====================
      Adding the Sparx5i Switch Driver
      
      This series provides the Microchip Sparx5i Switch Driver
      
      The SparX-5 Enterprise Ethernet switch family provides a rich set of
      Enterprise switching features such as advanced TCAM-based VLAN and QoS
      processing enabling delivery of differentiated services, and security
       through TCAM-based frame processing using the versatile content aware processor
      (VCAP). IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported
      with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K IPv6
      (S,G) multicast groups. L3 security features include source guard and
      reverse path forwarding (uRPF) tasks. Additional L3 features include
      VRF-Lite and IP tunnels (IP over GRE/IP).
      
      The SparX-5 switch family features a highly flexible set of Ethernet ports
      with support for 10G and 25G aggregation links, QSGMII, USGMII, and
      USXGMII.  The device integrates a powerful 1 GHz dual-core ARM® Cortex®-A53
      CPU enabling full management of the switch and advanced Enterprise
      applications.
      
      The SparX-5 switch family targets managed Layer 2 and Layer 3 equipment in
      SMB, SME, and Enterprise where high port count 1G/2.5G/5G/10G switching
      with 10G/25G aggregation links is required.
      
      The SparX-5 switch family consists of following SKUs:
      
        VSC7546 SparX-5-64 supports up to 64 Gbps of bandwidth with the following
        primary port configurations.
         - 6 × 10G
         - 16 × 2.5G + 2 × 10G
         - 24 × 1G + 4 × 10G
      
        VSC7549 SparX-5-90 supports up to 90 Gbps of bandwidth with the following
        primary port configurations.
         - 9 × 10G
         - 16 × 2.5G + 4 × 10G
         - 48 × 1G + 4 × 10G
      
        VSC7552 SparX-5-128 supports up to 128 Gbps of bandwidth with the
        following primary port configurations.
         - 12 × 10G
         - 6 × 10G + 2 × 25G
         - 16 × 2.5G + 8 × 10G
         - 48 × 1G + 8 × 10G
      
        VSC7556 SparX-5-160 supports up to 160 Gbps of bandwidth with the
        following primary port configurations.
         - 16 × 10G
         - 10 × 10G + 2 × 25G
         - 16 × 2.5G + 10 × 10G
         - 48 × 1G + 10 × 10G
      
        VSC7558 SparX-5-200 supports up to 200 Gbps of bandwidth with the
        following primary port configurations.
         - 20 × 10G
         - 8 × 25G
      
      In addition, the device supports one 10/100/1000/2500/5000 Mbps
      SGMII/SerDes node processor interface (NPI) Ethernet port.
      
      Time sensitive networking (TSN) is supported through a comprehensive set of
      features including frame preemption, cut-through, frame replication and
      elimination for reliability, enhanced scheduling: credit-based shaping,
      time-aware shaping, cyclic queuing, and forwarding, and per-stream policing
      and filtering.
      
      Together with IEEE 1588 and IEEE 802.1AS support, this guarantees
      low-latency deterministic networking for Industrial Ethernet.
      
      The Sparx5i support is developed on the PCB134 and PCB135 evaluation boards.
      
      - PCB134 main networking features:
        - 12x SFP+ front 10G module slots (connected to Sparx5i through SFI).
        - 8x SFP28 front 25G module slots (connected to Sparx5i through SFI high
          speed).
        - Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port
          (on-board VSC8211 PHY connected to Sparx5i through SGMII).
      
      - PCB135 main networking features:
        - 48x1G (10/100/1000M) RJ45 front ports using 12x VSC8514 QuadPHYs, each
          connected to VSC7558 through QSGMII.
        - 4x10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY;
          each port connects to VSC7558 through SFI.
        - 4x SFP28 25G module slots on back connected to VSC7558 through SFI high
          speed.
        - Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board
          VSC8211 PHY, which can be connected to the VSC7558 NPI port through
          SGMII using a loopback add-on PCB.
      
      This series provides support for:
        - SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G
          devices and media types.
        - Port module configuration for 10M to 25G speeds with SGMII, QSGMII,
          1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes.
        - SerDes configuration via the Sparx5i SerDes driver (see below).
        - Host mode providing register based injection and extraction.
        - Switch mode providing MAC/VLAN table learning and Layer2 switching
          offloaded to the Sparx5i switch.
        - STP state, VLAN support, host/bridge port mode, Forwarding DB, and
          configuration and statistics via ethtool.
      
      More support will be added at a later stage.
      
      The Sparx5i Chip Register Model can be browsed at this location:
      https://github.com/microchip-ung/sparx-5_reginfo
      and the datasheet is available here:
      https://ww1.microchip.com/downloads/en/DeviceDoc/SparX-5_Family_L2L3_Enterprise_10G_Ethernet_Switches_Datasheet_00003822B.pdf
      
      The series depends on the following series currently on their way
      into the kernel:
      
      - 25G Base-R phy mode
        Link: https://lore.kernel.org/r/20210611125453.313308-1-steen.hegelund@microchip.com/
      - Sparx5 Reset Driver
        Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/
      
      ChangeLog:
      v5:
          - cover letter
              - updated the description to match the latest data sheets
          - basic driver
              - added error message in case of reset controller error
              - port struct: replacing has_sfp with inband, adding pause_adv
          - host mode
              - port cleanup: unregisters netdevs and then removes phylink etc
              - checking for pause_adv when comparing port config changes
              - getting duplex and pause state in the link_up callback.
              - getting inband, autoneg and pause_adv config in the pcs_config
                callback.
          - port
              - use only the pause_adv bits when getting aneg status
              - use the inband state when updating the PCS and port config
      v4:
          - basic driver:
              Using devm_reset_control_get_optional_shared to get the reset
              control, and let the reset framework check if it is valid.
          - host mode (phylink):
              Use the PCS operations to get state and update configuration.
              Removed the setting of interface modes.  Let phylink control this.
              Using the new 5gbase-r and 25gbase-r modes.
              Using a helper function to check if one of the 3 base-r modes has
              been selected.
              Currently it will not be possible to change the interface mode by
              changing the speed (e.g via ethtool).  This will be added later.
      v3:
          - basic driver:
              - removed unneeded braces
              - release reference to ports node after use
              - use dev_err_probe to handle DEFER
              - update error value when bailing out (a few cases)
              - updated formatting of port struct and grouping of bool values
              - simplified the spx5_rmw and spx5_inst_rmw inline functions
          - host mode (netdev):
              - removed lockless flag
              - added port timer init
          - host mode (packet - manual injection):
              - updated error counters in error situations
              - implemented timer handling of watermark threshold: stop and
                restart netif queues.
              - fixed error message handling (rate limited)
              - fixed comment style error
              - used DIV_ROUND_UP macro
              - removed a debug message for open ports
      
      v2:
          - Updated bindings:
              - drop minItems for the reg property
          - Statistics implementation:
              - Reorganized statistics into ethtool groups:
                  eth-phy, eth-mac, eth-ctrl, rmon
                as defined by the IEEE 802.3 categories and RFC 2819.
              - The remaining statistics are provided by the classic ethtool
                statistics command.
          - Hostmode support:
              - Removed netdev renaming
              - Validate ethernet address in sparx5_set_mac_address()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • arm64: dts: sparx5: Add the Sparx5 switch node · d0f482bb
      Steen Hegelund authored
      This provides the configuration for the currently available evaluation
      boards PCB134 and PCB135.
      
      The series depends on the following series currently on its way
      into the kernel:
      
      - Sparx5 Reset Driver
        Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add ethtool configuration and statistics support · af4b1102
      Steen Hegelund authored
      This adds statistic counters for the network interfaces provided
      by the driver.  It also adds CPU port counters (which are not
      exposed by ethtool).
      This also adds support for configuring the network interface
      parameters via ethtool: speed, duplex, aneg etc.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add calendar bandwidth allocation support · 0a9d48ad
      Steen Hegelund authored
      This configures the Sparx5 calendars according to the bandwidth
      requested in the Device Tree nodes.
      It also checks that the total requested bandwidth is within the
      limits of the detected Sparx5 model.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add switching support · d6fce514
      Steen Hegelund authored
      This adds SwitchDev support by offloading the software bridge to
      hardware.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add vlan support · 78eab33b
      Steen Hegelund authored
      This adds Sparx5 VLAN support.
      
      Sparx5 has more VLAN features than provided here, but these will be added
      in later series. For now we only add the basic L2 features.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add mactable support · b37a1bae
      Steen Hegelund authored
      This adds the Sparx5 MAC tables: listening for MAC table updates and
      updating on request.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add port module support · 946e7fd5
      Steen Hegelund authored
      This adds configuration of the Sparx5 port module instances.
      
      Sparx5 has in total 65 logical ports (denoted D0 to D64) and 33
      physical SerDes connections (S0 to S32). The 65th port (D64) is
      permanently allocated to SerDes0 (S0). The remaining 64 ports can in
      various multiplexing scenarios be connected to the remaining 32 SerDes
      using QSGMII, USGMII, or USXGMII extenders. 32 of the ports can have a
      1:1 mapping to the 32 SerDes.
      
      Some additional ports (D65 to D69) are internal to the device and do not
      connect to port modules or SerDes macros. For example, internal ports are
      used for frame injection and extraction to the CPU queues.
      
      The 65 logical ports are split up into the following blocks.
      
      - 13 x 5G ports (D0-D11, D64)
      - 32 x 2G5 ports (D16-D47)
      - 12 x 10G ports (D12-D15, D48-D55)
      - 8 x 25G ports (D56-D63)
      
      Each logical port supports different line speeds, and depending on the
      speeds supported, different port modules (MAC+PCS) are needed. A port
      supporting 5 Gbps, 10 Gbps, or 25 Gbps as its maximum line speed will
      have a DEV5G, DEV10G, or DEV25G module to support the 5 Gbps, 10 Gbps
      (incl 5 Gbps), or 25 Gbps (including 10 Gbps and 5 Gbps) speeds. In
      addition, it will have a shadow DEV2G5 port module to support the lower
      speeds (10/100/1000/2500Mbps). When a port needs to operate at a lower
      speed, the shadow DEV2G5 is connected to its corresponding SerDes.
      
      Not all interface modes are supported in this series, but will be added at
      a later stage.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add hostmode with phylink support · f3cad261
      Steen Hegelund authored
      This patch adds netdevs and phylink support for the ports in the switch.
      It also adds register based injection and extraction for these ports.
      
      Frame DMA support for injection and extraction will be added in a later
      series.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add the basic sparx5 driver · 3cfa11ba
      Steen Hegelund authored
      This adds the Sparx5 basic SwitchDev driver framework with IO range
      mapping, switch device detection and core clock configuration.
      
      Support for ports, phylink, netdev, mactable etc. are in the following
      patches.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: sparx5: Add sparx5-switch bindings · f8c63088
      Steen Hegelund authored
      Document the Sparx5 switch device driver bindings
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdiobus: fix fwnode_mdbiobus_register() fallback case · c88c192d
      Marcin Wojtas authored
      The fallback case of fwnode_mdbiobus_register()
      (relevant for !CONFIG_FWNODE_MDIO) was defined with a wrong
      argument name, causing a compilation error. Fix that.
      Signed-off-by: Marcin Wojtas <mw@semihalf.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ip: avoid OOM kills with large UDP sends over loopback · 6d123b81
      Jakub Kicinski authored
      Dave observed a number of machines hitting OOM on the UDP send
      path. The workload seems to be sending large UDP packets over
      loopback. Since loopback has an MTU of 64k, the kernel will try to
      allocate an skb with up to 64k of head space. This has a good
      chance of failing under memory pressure. What's worse, if
      the message length is <32k the allocation may trigger the
      OOM killer.
      
      This is entirely avoidable, we can use an skb with page frags.
      
      af_unix solves a similar problem by limiting the head
      length to SKB_MAX_ALLOC. This seems like a good and simple
      approach. It means that UDP messages > 16kB will now
      use fragments if the underlying device supports SG. If extra
      allocator pressure causes regressions in real workloads,
      we can switch to trying the large allocation first and
      falling back.
      
      v4: pre-calculate all the additions to alloclen so
          we can be sure it won't go over order-2
      Reported-by: Dave Jones <dsj@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tools/testing: add a selftest for SO_NETNS_COOKIE · ae24bab2
      Lorenz Bauer authored
      Make sure that SO_NETNS_COOKIE returns a non-zero value, and
      that sockets from different namespaces have a distinct cookie
      value.
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: retrieve netns cookie via getsocketopt · e8b9eab9
      Martynas Pumputis authored
      It's getting more common to run nested container environments for
      testing cloud software. One such example is Kind [1], which runs a
      Kubernetes cluster in Docker containers on a single host. Each container
      acts as a Kubernetes node, and thus can run any Pod (aka container)
      inside the former. This approach simplifies testing a lot, as it
      eliminates complicated VM setups.
      
      Unfortunately, such a setup breaks some functionality when cgroupv2 BPF
      programs are used for load-balancing. The load-balancer BPF program
      needs to detect whether a request originates from the host netns or a
      container netns in order to allow some access, e.g. to a service via a
      loopback IP address. Typically, the programs detect this by comparing
      netns cookies with the one of the init ns via a call to
      bpf_get_netns_cookie(NULL). However, in nested environments the latter
      cannot be used given the Kubernetes node's netns is outside the init ns.
      To fix this, we need to pass the Kubernetes node netns cookie to the
      program in a different way: by extending getsockopt() with a
      SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes
      node netns can retrieve the cookie and pass it to the program instead.
      
      Thus, this is following up on Eric's commit 3d368ab8 ("net:
      initialize net->net_cookie at netns setup") to allow retrieval via
      SO_NETNS_COOKIE. This is also in line with how we retrieve the socket
      cookie via SO_COOKIE.
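      From the orchestrator's side, retrieval is a plain getsockopt() call.
      A minimal sketch (the helper name is illustrative; SO_NETNS_COOKIE is
      defined here with its Linux uapi value in case older headers lack it):

```c
/* Retrieve the current netns cookie via the new socket option. */
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_NETNS_COOKIE
#define SO_NETNS_COOKIE 71
#endif

/* Returns 0 and stores the cookie on success, or -errno on failure
 * (e.g. -ENOPROTOOPT on kernels without this patch). */
int get_netns_cookie(uint64_t *cookie)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -errno;

    socklen_t len = sizeof(*cookie);
    int ret = getsockopt(fd, SOL_SOCKET, SO_NETNS_COOKIE, cookie, &len);
    if (ret < 0)
        ret = -errno;
    close(fd);
    return ret;
}
```

      As with SO_COOKIE, the option is read-only; the kernel assigns the
      cookie at netns setup and a successful call never returns zero.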
      
        [1] https://kind.sigs.k8s.io/
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Martynas Pumputis <m@lambda.lt>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8b9eab9
  2. 23 Jun, 2021 14 commits
    • Merge branch 'devlink-rate-limit-fixes' · 35713d9b
      David S. Miller authored
      Dmytro Linkin says:
      
      ====================
      Fixes for devlink rate objects API
      
      Patch #1 fixes the refcount of a parent node not being decreased
      when a leaf object is destroyed.
      
      Patch #2 fixes an incorrect eswitch mode check.
      
      Patch #3 protects list traversal with a lock.
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35713d9b
    • devlink: Protect rate list with lock while switching modes · a3e5e579
      Dmytro Linkin authored
      The devlink eswitch set command doesn't hold devlink->lock, which
      makes a race condition possible between rate list traversal and other
      devlink rate API calls, like devlink_rate_nodes_destroy().
      Hold the devlink lock while traversing the list.
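      The pattern of the fix can be sketched in userspace with a mutex
      standing in for devlink->lock: any walk of the shared list holds the
      same lock the destroy path takes, so nodes cannot be freed mid-walk.
      The structure and function names below are illustrative, not devlink's:

```c
/* Userspace model of lock-protected list traversal vs. destruction. */
#include <pthread.h>
#include <stdlib.h>

struct rate_node {
    int rate;
    struct rate_node *next;
};

struct devlink_model {
    pthread_mutex_t lock;       /* stands in for devlink->lock */
    struct rate_node *rate_list;
};

/* Traversal holds the lock for its whole duration (the fix). */
int sum_rates(struct devlink_model *dl)
{
    int sum = 0;
    pthread_mutex_lock(&dl->lock);
    for (struct rate_node *n = dl->rate_list; n; n = n->next)
        sum += n->rate;
    pthread_mutex_unlock(&dl->lock);
    return sum;
}

/* The destroy path takes the same lock, so it serializes with walkers. */
void destroy_rate_nodes(struct devlink_model *dl)
{
    pthread_mutex_lock(&dl->lock);
    while (dl->rate_list) {
        struct rate_node *n = dl->rate_list;
        dl->rate_list = n->next;
        free(n);
    }
    pthread_mutex_unlock(&dl->lock);
}
```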
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3e5e579
    • devlink: Remove eswitch mode check for mode set call · ff99324d
      Dmytro Linkin authored
      When the eswitch is disabled, querying its current mode results in an
      error. Because of this, trying to set the eswitch switchdev mode for
      mlx5 devices fails. Hence remove this check.
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff99324d
    • devlink: Decrease refcnt of parent rate object on leaf destroy · 1321ed5e
      Dmytro Linkin authored
      Port functions, like SFs, can be deleted by the user while their leaf
      rate object has a parent node. In such a case the node refcount won't
      be decreased, which blocks the node from deletion later.
      Do a simple refcount decrease, since the driver is in its cleanup
      stage. This:
      1) assumes that the driver took proper internal parent unset action;
      2) avoids nested callback calls and deadlock.
      
      Fixes: d7555984 ("devlink: Allow setting parent node of rate objects")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1321ed5e
    • virtio_net: Use virtio_find_vqs_ctx() helper · a2f7dc00
      Xianting Tian authored
      virtio_find_vqs_ctx() is defined but currently never called;
      this is the right place to use it.
      Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2f7dc00
    • net/tls: Remove the __TLS_DEC_STATS() macro. · 10ed7ce4
      Kuniyuki Iwashima authored
      The commit d26b698d ("net/tls: add skeleton of MIB statistics")
      introduced __TLS_DEC_STATS(), but it is unused and __SNMP_DEC_STATS()
      is not defined either. Let's remove it.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10ed7ce4
    • tcp: Add stats for socket migration. · 55d444b3
      Kuniyuki Iwashima authored
      This commit adds two stats for the socket migration feature to evaluate
      its effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).
      
      If the migration fails because of the own_req race between the
      ACK-receiving and SYN+ACK-sending paths, we do not increment the
      failure stat; another CPU is then responsible for the req.
      
      Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/
      Suggested-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55d444b3
    • ibmveth: Set CHECKSUM_PARTIAL if NULL TCP CSUM. · 7525de25
      David Wilder authored
      TCP checksums on received packets may be set to NULL by the sender if
      CSO is enabled. The hypervisor flags these packets as checksum-ok and
      the skb is then flagged CHECKSUM_UNNECESSARY. If these packets are then
      forwarded, the sender will not request CSO due to the
      CHECKSUM_UNNECESSARY flag. The result is a TCP packet sent with a bad
      checksum. This change sets CHECKSUM_PARTIAL on these packets, causing
      the sender to correctly request checksum offload.
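      The receive-path decision above can be modelled as a small pure
      function; the enum values mirror the kernel's skb->ip_summed states,
      but the function name and shape are illustrative, not the driver's
      actual code (which must also fill in csum_start/csum_offset when it
      picks CHECKSUM_PARTIAL):

```c
/* Userspace model of the ibmveth rx checksum classification. */
#include <stdint.h>

enum ip_summed { CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_PARTIAL };

enum ip_summed classify_rx_csum(int hv_csum_ok, uint16_t tcp_csum_field)
{
    if (!hv_csum_ok)
        return CHECKSUM_NONE;        /* let the stack verify it */
    if (tcp_csum_field == 0)
        return CHECKSUM_PARTIAL;     /* NULL csum from a CSO sender:
                                        re-request offload if forwarded */
    return CHECKSUM_UNNECESSARY;     /* a real, verified checksum */
}
```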
      Signed-off-by: David Wilder <dwilder@us.ibm.com>
      Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Tested-by: Cristobal Forno <cforno12@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7525de25
    • Merge tag 'mlx5-net-next-2021-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · fe87797b
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-net-next-2021-06-22
      
      1) Various minor cleanups and fixes from net-next branch
      2) Optimize mlx5 feature check on tx and
         a fix to allow Vxlan with Ipsec offloads
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe87797b
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · a7b62112
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr,
         from Phil Sutter.
      
      2) Simplify TCP option sanity check for TCP packets, also from Phil.
      
      3) Add a new expression to store when the rule has been used last time.
      
      4) Pass the hook state object to log function, from Florian Westphal.
      
      5) Document the new sysctl knobs to tune the flowtable timeouts,
         from Oz Shlomo.
      
      6) Fix snprintf error check in the new nfnetlink_hook infrastructure,
         from Dan Carpenter.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a7b62112
    • selftests: icmp_redirect: support expected failures · 0a36a75c
      Andrea Righi authored
      According to a comment in commit 99513cfa ("selftest: Fixes for
      icmp_redirect test") the test "IPv6: mtu exception plus redirect" is
      expected to fail, because of a bug in the IPv6 logic that apparently
      hasn't been fixed yet.
      
      We should probably consider this failure an "expected failure";
      therefore change the script to return XFAIL for that particular test
      and also report the total number of expected failures at the end of
      the run.
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a36a75c
    • Merge branch 'lockless-qdisc-opts' · e940eb3c
      David S. Miller authored
      Yunsheng Lin says:
      
      ====================
      Some optimization for lockless qdisc
      
      Patch 1: remove unnecessary seqcount operation.
      Patch 2: implement TCQ_F_CAN_BYPASS.
      Patch 3: remove qdisc->empty.
      
      Performance data for pktgen in queue_xmit mode + dummy netdev
      with pfifo_fast:
      
       threads    unpatched           patched             delta
          1       2.60Mpps            3.21Mpps             +23%
          2       3.84Mpps            5.56Mpps             +44%
          4       5.52Mpps            5.58Mpps             +1%
          8       2.77Mpps            2.76Mpps             -0.3%
         16       2.24Mpps            2.23Mpps             -0.4%
      
      Performance for IP forward testing: 1.05Mpps increases to
      1.16Mpps, about 10% improvement.
      
      V3: Add 'Acked-by' from Jakub and 'Tested-by' from Vladimir,
          and resend based on latest net-next.
      V2: Adjust the comment and commit log according to discussion
          in V1.
      V1: Drop RFC tag, add nolock_qdisc_is_empty() and do the qdisc
          empty checking without the protection of qdisc->seqlock to
          avoid doing an unnecessary spin_trylock() for the contention case.
      RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate non-empty
              qdisc, and add patch 1 and 3.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e940eb3c
    • net: sched: remove qdisc->empty for lockless qdisc · d3e0f575
      Yunsheng Lin authored
      As the MISSED and DRAINING states are used to indicate a non-empty
      qdisc, qdisc->empty is no longer needed, so remove it.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3e0f575
    • net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc · c4fef01b
      Yunsheng Lin authored
      Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
      flag set, but queue discipline by-pass does not work for lockless
      qdisc because skb is always enqueued to qdisc even when the qdisc
      is empty, see __dev_xmit_skb().
      
      This patch calls sch_direct_xmit() to transmit the skb directly
      to the driver when the lockless qdisc is empty, which avoids the
      enqueue and dequeue operations.
      
      qdisc->empty is not a reliable indication of an empty qdisc, because
      there is a time window between enqueuing and setting qdisc->empty.
      So we use the MISSED state added in commit a90c57f2 ("net:
      sched: fix packet stuck problem for lockless qdisc"), which
      indicates there is lock contention, suggesting that it is better
      not to do the qdisc bypass in order to avoid packet reordering
      problems.
      
      In order to make the MISSED state a reliable indication of an empty
      qdisc, we need to ensure that testing and clearing of the MISSED
      state is within the protection of qdisc->seqlock; only setting the
      MISSED state can be done without the protection of qdisc->seqlock.
      A MISSED state test is added without the protection of
      qdisc->seqlock to avoid doing an unnecessary spin_trylock() in the
      contention case.
      
      As the enqueuing is not within the protection of qdisc->seqlock,
      there is still a potential data race as mentioned by Jakub [1]:
      
            thread1               thread2             thread3
      qdisc_run_begin() # true
                              qdisc_run_begin(q)
                                   set(MISSED)
      pfifo_fast_dequeue
        clear(MISSED)
        # recheck the queue
      qdisc_run_end()
                                  enqueue skb1
                                                   qdisc empty # true
                                                qdisc_run_begin() # true
                                                sch_direct_xmit() # skb2
                               qdisc_run_begin()
                                  set(MISSED)
      
      When the above happens, skb1 enqueued by thread2 is transmitted after
      skb2 is transmitted by thread3, because the MISSED state setting and
      the enqueuing are not under qdisc->seqlock. If qdisc bypass is
      disabled, skb1 has a better chance of being transmitted more quickly
      than skb2.
      
      This patch does not take care of the above data race, because we
      view it as similar to the following: even if CPU1 and CPU2 write skbs
      to two sockets both heading to the same qdisc at the same time, there
      is no guarantee which skb will hit the qdisc first, because many
      factors like interrupts/softirqs/cache misses/scheduling affect
      that.
      
      There are the below cases that need special handling:
      1. The MISSED state may be cleared before another round of dequeuing
         in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
         dequeue all skbs in one round and call __netif_schedule(), which
         might result in a non-empty qdisc without MISSED set. In order
         to avoid this, the MISSED state is set for the lockless qdisc and
         __netif_schedule() is called at the end of qdisc_run_end().
      
      2. The MISSED state also needs to be set for the lockless qdisc
         instead of calling __netif_schedule() directly when requeuing an
         skb, for a similar reason.
      
      3. For the netdev-queue-stopped case, the MISSED state needs clearing
         while the netdev queue is stopped, otherwise there may be
         unnecessary __netif_schedule() calls. So a new DRAINING state
         is added to indicate this case, which also indicates a non-empty
         qdisc.
      
      4. There is already a netif_xmit_frozen_or_stopped() check in
         dequeue_skb() and sch_direct_xmit(), both within the protection of
         qdisc->seqlock, but the same check in __dev_xmit_skb() is without
         that protection, which might make the empty indication of a
         lockless qdisc unreliable. So remove the check in
         __dev_xmit_skb(); the checks within the protection of
         qdisc->seqlock seem enough to avoid the CPU consumption problem
         for the netdev-queue-stopped case.
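      The empty/MISSED/DRAINING bookkeeping above can be sketched as a tiny
      userspace model: the qdisc is treated as empty only when neither bit
      is set, MISSED may be set without the seqlock, and testing/clearing
      happen only under it. The names mirror the kernel's, but this is a
      sketch, not the actual implementation:

```c
/* Userspace model of the lockless-qdisc empty indication. */
#include <stdatomic.h>
#include <stdbool.h>

#define QDISC_STATE_MISSED    (1u << 0)
#define QDISC_STATE_DRAINING  (1u << 1)
#define QDISC_STATE_NON_EMPTY (QDISC_STATE_MISSED | QDISC_STATE_DRAINING)

struct model_qdisc {
    atomic_uint state;
};

/* Lock-free check done in __dev_xmit_skb() before trying the bypass. */
static inline bool nolock_qdisc_is_empty(const struct model_qdisc *q)
{
    return !(atomic_load_explicit((atomic_uint *)&q->state,
                                  memory_order_acquire)
             & QDISC_STATE_NON_EMPTY);
}

/* Setting MISSED needs no seqlock... */
static inline void qdisc_set_missed(struct model_qdisc *q)
{
    atomic_fetch_or(&q->state, QDISC_STATE_MISSED);
}

/* ...but clearing it is only done while holding qdisc->seqlock. */
static inline void qdisc_clear_missed_locked(struct model_qdisc *q)
{
    atomic_fetch_and(&q->state, ~QDISC_STATE_MISSED);
}
```

      The bypass is then attempted only when nolock_qdisc_is_empty() holds,
      so a set MISSED or DRAINING bit steers skbs back through the normal
      enqueue path and avoids the reordering described above.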
      
      1. https://lkml.org/lkml/2021/5/29/215
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4fef01b