1. 24 Jun, 2021 26 commits
    • gve: Add DQO fields for core data structures · a4aa1f1e
      Bailey Forrest authored
      - Add new DQO datapath structures:
        - `gve_rx_buf_queue_dqo`
        - `gve_rx_compl_queue_dqo`
        - `gve_rx_buf_state_dqo`
        - `gve_tx_desc_dqo`
        - `gve_tx_pending_packet_dqo`
      
      - Incorporate these into the existing ring data structures:
        - `gve_rx_ring`
        - `gve_tx_ring`
      
      Noteworthy mentions:
      
      - `gve_rx_buf_state_dqo` represents an RX buffer which was posted to HW.
        Each RX queue has an array of these objects, and the index into the
        array is used as the buffer_id when posted to HW.
      
      - `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The
        completion_tag is the index into the array.
      
      - These two structures have links for linked lists which are represented
        by 16b indexes into a contiguous array of these structures.
        This reduces memory footprint compared to 64b pointers.
      
      - We use unions for the writeable datapath structures to reduce cache
        footprint. GQI-specific members will be renamed to match the DQO
        members in a future patch.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Add dqo descriptors · 22319818
      Bailey Forrest authored
      General description of rings and descriptors:
      
      TX ring is used for sending TX packet buffers to the NIC. It has the
      following descriptors:
      - `gve_tx_pkt_desc_dqo` - Data buffer descriptor
      - `gve_tx_tso_context_desc_dqo` - TSO context descriptor
      - `gve_tx_general_context_desc_dqo` - Generic metadata descriptor
      
      Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo`,
      which represents the logical interpretation of the metadata bytes. It's
      helpful to define this structure because the metadata bytes exist in
      multiple descriptor types (including `gve_tx_tso_context_desc_dqo`),
      and the device requires that the same field have the same value in all
      descriptors.
      
      The TX completion ring is used to receive completions from the NIC.
      Having a separate ring allows for completions to be out of order. The
      completion descriptor `gve_tx_compl_desc` has several different types,
      most important are packet and descriptor completions. Descriptor
      completions are used to notify the driver when descriptors sent on the
      TX ring are done being consumed. The descriptor completion is only used
      to signal that space is cleared in the TX ring. A packet completion will
      be received when a packet transmitted on the TX queue is done being
      transmitted.
      
      In addition there are "miss" and "reinjection" completions. The device
      implements a "flow-miss model". Most packets will simply receive a
      packet completion. The flow-miss system may choose to process a packet
      based on its contents. A TX packet which experiences a flow miss would
      receive a miss completion followed by a later reinjection completion.
      The miss completion is received when the packet starts to be processed
      by the flow-miss system and the reinjection completion is received when
      the flow-miss system completes processing the packet and sends it on the
      wire.
      
      The RX buffer ring is used to send buffers to HW via the
      `gve_rx_desc_dqo` descriptor.
      
      Received packets are put into the RX queue by the device, which
      populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors
      refer to buffers posted by the buffer queue. Received buffers may be
      returned out of order, such as when HW LRO is enabled.
      
      Important concepts:
      - "TX" and "RX buffer" queues, which send descriptors to the device, use
        MMIO doorbells to notify the device of new descriptors.
      
      - "RX" and "TX completion" queues, which receive descriptors from the
        device, use a "generation bit" to know when a descriptor was populated
        by the device. The driver initializes all bits with the "current
        generation". The device will populate received descriptors with the
        "next generation" which is inverted from the current generation. When
        the ring wraps, the current/next generation are swapped.
      
      - It's the driver's responsibility to ensure that the RX and TX
        completion queues are not overrun. This can be accomplished by
        limiting the number of descriptors posted to HW.
      
      - TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
        buffer_id. These will be returned on the TX completion and RX queues
        respectively to let the driver know which packet/buffer was completed.
      
      Bitfields are used to describe descriptor fields. This notation is more
      concise and readable than shift-and-mask. It is possible because the
      driver is restricted to little-endian platforms.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Add support for DQO RX PTYPE map · c4b87ac8
      Bailey Forrest authored
      Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the
      packet. L3 and L4 types are necessary in order to set the hash and csum
      on RX SKBs correctly.
      
      DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map
      enables the device to tell the driver how to map from PTYPE index to
      L3/L4 type.
      
      The device doesn't provide any guarantees about the range of possible
      PTYPEs, so we just use a 1024 entry array to implement a fast mapping
      structure.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: adminq: DQO specific device descriptor logic · 5ca2265e
      Bailey Forrest authored
      - In addition to TX and RX queues, DQO has TX completion and RX buffer
        queues.
        - TX completions are received when the device has completed sending a
          packet on the wire.
        - RX buffers are posted on a separate queue from the RX completions.
      - DQO descriptor rings are allowed to be smaller than PAGE_SIZE.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Introduce per netdev `enum gve_queue_format` · a5886ef4
      Bailey Forrest authored
      The currently supported queue formats are:
      - GQI_RDA - GQI with raw DMA addressing
      - GQI_QPL - GQI with queue page list
      - DQO_RDA - DQO with raw DMA addressing
      
      The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we
      remove it in favor of just checking against GQI_RDA.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Introduce a new model for device options · 8a39d3e0
      Bailey Forrest authored
      The current model uses an integer ID and a fixed size struct for the
      parameters of each device option.
      
      The new model allows the device option structs to grow in size over
      time. A driver may assume that changes to device option structs will
      always be appended.
      
      New device options will also generally have a
      `supported_features_mask` so that the driver knows which fields within a
      particular device option are enabled.
      
      `gve_device_option.feat_mask` is changed to `required_features_mask`,
      and it is a bitmask which must match the value expected by the driver.
      This gives the device the ability to break backwards compatibility with
      old drivers for certain features by blocking the old drivers from trying
      to use the feature.
      
      We maintain ABI compatibility with the old model for
      GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which
      does not support the new model.
      
      This patch introduces some new terminology:
      
      RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA
            mapped and read/updated by the device.
      
      QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped
            with the device for read/write and data is copied from/to SKBs.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Make gve_rx_slot_page_info.page_offset an absolute offset · 920fb451
      Bailey Forrest authored
      Using `page_offset` like a boolean means a page may only be split into
      two sections. With page sizes larger than 4k, this can be very wasteful.
      Future commits in this patchset use `struct gve_rx_slot_page_info` in a
      way which supports a fixed buffer size and a variable page size.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: gve_rx_copy: Move padding to an argument · 35f9b2f4
      Bailey Forrest authored
      Future use cases will have a different padding value.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Move some static functions to a common file · dbdaa675
      Bailey Forrest authored
      These functions will be shared by the GQI and DQO variants of the GVNIC
      driver in follow-up patches in this series.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Update GVE documentation to describe DQO · c6a7ed77
      Bailey Forrest authored
      DQO is a new descriptor format for our next generation virtual NIC.
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Catherine Sullivan <csully@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • usbnet: add usbnet_event_names[] for kevent · 47889068
      Yajun Deng authored
      Modify the netdev_dbg output in usbnet_defer_kevent() from an int to a
      char *; this is more readable.
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'add-sparx5i-driver' · 67faf76d
      David S. Miller authored
      Steen Hegelund says:
      
      ====================
      Adding the Sparx5i Switch Driver
      
      This series provides the Microchip Sparx5i Switch Driver
      
      The SparX-5 Enterprise Ethernet switch family provides a rich set of
      Enterprise switching features such as advanced TCAM-based VLAN and QoS
      processing enabling delivery of differentiated services, and security
       through TCAM-based frame processing using the versatile content aware processor
      (VCAP). IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported
      with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K IPv6
      (S,G) multicast groups. L3 security features include source guard and
      reverse path forwarding (uRPF) tasks. Additional L3 features include
      VRF-Lite and IP tunnels (IP over GRE/IP).
      
      The SparX-5 switch family features a highly flexible set of Ethernet ports
      with support for 10G and 25G aggregation links, QSGMII, USGMII, and
      USXGMII.  The device integrates a powerful 1 GHz dual-core ARM® Cortex®-A53
      CPU enabling full management of the switch and advanced Enterprise
      applications.
      
      The SparX-5 switch family targets managed Layer 2 and Layer 3 equipment in
      SMB, SME, and Enterprise where high port count 1G/2.5G/5G/10G switching
      with 10G/25G aggregation links is required.
      
      The SparX-5 switch family consists of following SKUs:
      
        VSC7546 SparX-5-64 supports up to 64 Gbps of bandwidth with the following
        primary port configurations.
         - 6 × 10G
         - 16 × 2.5G + 2 × 10G
         - 24 × 1G + 4 × 10G
      
        VSC7549 SparX-5-90 supports up to 90 Gbps of bandwidth with the following
        primary port configurations.
         - 9 × 10G
         - 16 × 2.5G + 4 × 10G
         - 48 × 1G + 4 × 10G
      
        VSC7552 SparX-5-128 supports up to 128 Gbps of bandwidth with the
        following primary port configurations.
         - 12 × 10G
         - 6 × 10G + 2 × 25G
         - 16 × 2.5G + 8 × 10G
         - 48 × 1G + 8 × 10G
      
        VSC7556 SparX-5-160 supports up to 160 Gbps of bandwidth with the
        following primary port configurations.
         - 16 × 10G
         - 10 × 10G + 2 × 25G
         - 16 × 2.5G + 10 × 10G
         - 48 × 1G + 10 × 10G
      
        VSC7558 SparX-5-200 supports up to 200 Gbps of bandwidth with the
        following primary port configurations.
         - 20 × 10G
         - 8 × 25G
      
      In addition, the device supports one 10/100/1000/2500/5000 Mbps
      SGMII/SerDes node processor interface (NPI) Ethernet port.
      
      Time sensitive networking (TSN) is supported through a comprehensive set of
      features including frame preemption, cut-through, frame replication and
      elimination for reliability, enhanced scheduling: credit-based shaping,
      time-aware shaping, cyclic queuing, and forwarding, and per-stream policing
      and filtering.
      
      Together with IEEE 1588 and IEEE 802.1AS support, this guarantees
      low-latency deterministic networking for Industrial Ethernet.
      
      The Sparx5i support is developed on the PCB134 and PCB135 evaluation boards.
      
      - PCB134 main networking features:
        - 12x SFP+ front 10G module slots (connected to Sparx5i through SFI).
        - 8x SFP28 front 25G module slots (connected to Sparx5i through SFI high
          speed).
        - Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port
          (on-board VSC8211 PHY connected to Sparx5i through SGMII).
      
      - PCB135 main networking features:
        - 48x1G (10/100/1000M) RJ45 front ports using 12x VSC8514 QuadPHYs, each
          connected to VSC7558 through QSGMII.
        - 4x10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY;
          each port connects to VSC7558 through SFI.
        - 4x SFP28 25G module slots on back connected to VSC7558 through SFI high
          speed.
        - Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board
          VSC8211 PHY, which can be connected to the VSC7558 NPI port through
          SGMII using a loopback add-on PCB.
      
      This series provides support for:
        - SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G
          devices and media types.
        - Port module configuration for 10M to 25G speeds with SGMII, QSGMII,
          1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes.
        - SerDes configuration via the Sparx5i SerDes driver (see below).
        - Host mode providing register based injection and extraction.
        - Switch mode providing MAC/VLAN table learning and Layer2 switching
          offloaded to the Sparx5i switch.
        - STP state, VLAN support, host/bridge port mode, Forwarding DB, and
          configuration and statistics via ethtool.
      
      More support will be added at a later stage.
      
      The Sparx5i Chip Register Model can be browsed at this location:
      https://github.com/microchip-ung/sparx-5_reginfo
      and the datasheet is available here:
      https://ww1.microchip.com/downloads/en/DeviceDoc/SparX-5_Family_L2L3_Enterprise_10G_Ethernet_Switches_Datasheet_00003822B.pdf
      
      The series depends on the following series currently on their way
      into the kernel:
      
      - 25G Base-R phy mode
        Link: https://lore.kernel.org/r/20210611125453.313308-1-steen.hegelund@microchip.com/
      - Sparx5 Reset Driver
        Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/
      
      ChangeLog:
      v5:
          - cover letter
              - updated the description to match the latest data sheets
          - basic driver
              - added error message in case of reset controller error
              - port struct: replacing has_sfp with inband, adding pause_adv
          - host mode
              - port cleanup: unregisters netdevs and then removes phylink etc
              - checking for pause_adv when comparing port config changes
              - getting duplex and pause state in the link_up callback.
              - getting inband, autoneg and pause_adv config in the pcs_config
                callback.
          - port
              - use only the pause_adv bits when getting aneg status
              - use the inband state when updating the PCS and port config
      v4:
          - basic driver:
              Using devm_reset_control_get_optional_shared to get the reset
              control, and let the reset framework check if it is valid.
          - host mode (phylink):
              Use the PCS operations to get state and update configuration.
              Removed the setting of interface modes.  Let phylink control this.
              Using the new 5gbase-r and 25gbase-r modes.
              Using a helper function to check if one of the 3 base-r modes has
              been selected.
              Currently it will not be possible to change the interface mode by
              changing the speed (e.g via ethtool).  This will be added later.
      v3:
          - basic driver:
              - removed unneeded braces
              - release reference to ports node after use
              - use dev_err_probe to handle DEFER
              - update error value when bailing out (a few cases)
              - updated formatting of port struct and grouping of bool values
              - simplified the spx5_rmw and spx5_inst_rmw inline functions
          - host mode (netdev):
              - removed lockless flag
              - added port timer init
          - host mode (packet - manual injection):
              - updated error counters in error situations
              - implemented timer handling of watermark threshold: stop and
                restart netif queues.
              - fixed error message handling (rate limited)
              - fixed comment style error
              - used DIV_ROUND_UP macro
              - removed a debug message for open ports
      
      v2:
          - Updated bindings:
              - drop minItems for the reg property
          - Statistics implementation:
              - Reorganized statistics into ethtool groups:
                  eth-phy, eth-mac, eth-ctrl, rmon
                as defined by the IEEE 802.3 categories and RFC 2819.
              - The remaining statistics are provided by the classic ethtool
                statistics command.
          - Hostmode support:
              - Removed netdev renaming
              - Validate ethernet address in sparx5_set_mac_address()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • arm64: dts: sparx5: Add the Sparx5 switch node · d0f482bb
      Steen Hegelund authored
      This provides the configuration for the currently available evaluation
      boards PCB134 and PCB135.
      
      The series depends on the following series currently on its way
      into the kernel:
      
      - Sparx5 Reset Driver
        Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add ethtool configuration and statistics support · af4b1102
      Steen Hegelund authored
      This adds statistic counters for the network interfaces provided
      by the driver.  It also adds CPU port counters (which are not
      exposed by ethtool).
      This also adds support for configuring the network interface
      parameters via ethtool: speed, duplex, aneg etc.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add calendar bandwidth allocation support · 0a9d48ad
      Steen Hegelund authored
      This configures the Sparx5 calendars according to the bandwidth
      requested in the Device Tree nodes.
      It also checks that the total requested bandwidth is within the
      limits of the detected Sparx5 model.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add switching support · d6fce514
      Steen Hegelund authored
      This adds SwitchDev support by offloading the software bridge to
      hardware.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add vlan support · 78eab33b
      Steen Hegelund authored
      This adds Sparx5 VLAN support.
      
      Sparx5 has more VLAN features than provided here, but these will be added
      in later series. For now we only add the basic L2 features.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add mactable support · b37a1bae
      Steen Hegelund authored
      This adds the Sparx5 MAC tables: listening for MAC table updates and
      updating on request.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add port module support · 946e7fd5
      Steen Hegelund authored
      This adds configuration of the Sparx5 port module instances.
      
      Sparx5 has in total 65 logical ports (denoted D0 to D64) and 33
      physical SerDes connections (S0 to S32). The 65th port (D64) is
      permanently allocated to SerDes0 (S0). The remaining 64 ports can in
      various multiplexing scenarios be connected to the remaining 32 SerDes
      using QSGMII, USGMII, or USXGMII extenders. 32 of the ports can have a
      1:1 mapping to the 32 SerDes.
      
      Some additional ports (D65 to D69) are internal to the device and do not
      connect to port modules or SerDes macros. For example, internal ports are
      used for frame injection and extraction to the CPU queues.
      
      The 65 logical ports are split up into the following blocks.
      
      - 13 x 5G ports (D0-D11, D64)
      - 32 x 2G5 ports (D16-D47)
      - 12 x 10G ports (D12-D15, D48-D55)
      - 8 x 25G ports (D56-D63)
      
      Each logical port supports different line speeds, and depending on the
      speeds supported, different port modules (MAC+PCS) are needed. A port
      supporting 5 Gbps, 10 Gbps, or 25 Gbps as its maximum line speed will
      have a DEV5G, DEV10G, or DEV25G module to support the 5 Gbps, 10 Gbps
      (incl 5 Gbps), or 25 Gbps (including 10 Gbps and 5 Gbps) speeds. In
      addition, it will have a shadow DEV2G5 port module to support the lower
      speeds (10/100/1000/2500Mbps). When a port needs to operate at a lower
      speed, the shadow DEV2G5 is connected to its corresponding SerDes.
      
      Not all interface modes are supported in this series, but will be added at
      a later stage.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add hostmode with phylink support · f3cad261
      Steen Hegelund authored
      This patch adds netdevs and phylink support for the ports in the switch.
      It also adds register based injection and extraction for these ports.
      
      Frame DMA support for injection and extraction will be added in a later
      series.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sparx5: add the basic sparx5 driver · 3cfa11ba
      Steen Hegelund authored
      This adds the Sparx5 basic SwitchDev driver framework with IO range
      mapping, switch device detection and core clock configuration.
      
      Support for ports, phylink, netdev, mactable etc. are in the following
      patches.
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: sparx5: Add sparx5-switch bindings · f8c63088
      Steen Hegelund authored
      Document the Sparx5 switch device driver bindings
      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
      Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mdiobus: fix fwnode_mdbiobus_register() fallback case · c88c192d
      Marcin Wojtas authored
      The fallback case of fwnode_mdbiobus_register()
      (relevant for !CONFIG_FWNODE_MDIO) was defined with a wrong
      argument name, causing a compilation error. Fix that.
      Signed-off-by: Marcin Wojtas <mw@semihalf.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ip: avoid OOM kills with large UDP sends over loopback · 6d123b81
      Jakub Kicinski authored
      Dave observed a number of machines hitting OOM on the UDP send
      path. The workload seems to be sending large UDP packets over
      loopback. Since loopback has an MTU of 64k, the kernel will try to
      allocate an skb with up to 64k of head space. This has a good
      chance of failing under memory pressure. What's worse, if
      the message length is <32k the allocation may trigger the
      OOM killer.
      
      This is entirely avoidable, we can use an skb with page frags.
      
      af_unix solves a similar problem by limiting the head
      length to SKB_MAX_ALLOC. This seems like a good and simple
      approach. It means that UDP messages > 16kB will now
      use fragments if the underlying device supports SG. If extra
      allocator pressure causes regressions in real workloads,
      we can switch to trying the large allocation first and
      falling back.
      
      v4: pre-calculate all the additions to alloclen so
          we can be sure it won't go over order-2
      Reported-by: Dave Jones <dsj@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tools/testing: add a selftest for SO_NETNS_COOKIE · ae24bab2
      Lorenz Bauer authored
      Make sure that SO_NETNS_COOKIE returns a non-zero value, and
      that sockets from different namespaces have a distinct cookie
      value.
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: retrieve netns cookie via getsocketopt · e8b9eab9
      Martynas Pumputis authored
      It's getting more common to run nested container environments for
      testing cloud software. One such example is Kind [1], which runs a
      Kubernetes cluster in Docker containers on a single host. Each container
      acts as a Kubernetes node, and thus can run any Pod (aka container)
      inside the former. This approach simplifies testing a lot, as it
      eliminates complicated VM setups.
      
      Unfortunately, such a setup breaks some functionality when cgroupv2 BPF
      programs are used for load-balancing. The load-balancer BPF program
      needs to detect whether a request originates from the host netns or a
      container netns in order to allow some access, e.g. to a service via a
      loopback IP address. Typically, the programs detect this by comparing
      netns cookies with the one of the init ns via a call to
      bpf_get_netns_cookie(NULL). However, in nested environments the latter
      cannot be used given the Kubernetes node's netns is outside the init ns.
      To fix this, we need to pass the Kubernetes node netns cookie to the
      program in a different way: by extending getsockopt() with a
      SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes
      node netns can retrieve the cookie and pass it to the program instead.
      
      Thus, this is following up on Eric's commit 3d368ab8 ("net:
      initialize net->net_cookie at netns setup") to allow retrieval via
      SO_NETNS_COOKIE. This is also in line with how we retrieve the socket
      cookie via SO_COOKIE.
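      From the orchestrator's side, retrieval is a plain getsockopt() call.
      A minimal sketch (the helper name is illustrative; SO_NETNS_COOKIE is
      defined here with its Linux uapi value in case older headers lack it):

```c
/* Retrieve the current netns cookie via the new socket option. */
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_NETNS_COOKIE
#define SO_NETNS_COOKIE 71
#endif

/* Returns 0 and stores the cookie on success, or -errno on failure
 * (e.g. -ENOPROTOOPT on kernels without this patch). */
int get_netns_cookie(uint64_t *cookie)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -errno;

    socklen_t len = sizeof(*cookie);
    int ret = getsockopt(fd, SOL_SOCKET, SO_NETNS_COOKIE, cookie, &len);
    if (ret < 0)
        ret = -errno;
    close(fd);
    return ret;
}
```

      As with SO_COOKIE, the option is read-only; the kernel assigns the
      cookie at netns setup and a successful call never returns zero.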
      
        [1] https://kind.sigs.k8s.io/
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Martynas Pumputis <m@lambda.lt>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8b9eab9
  2. 23 Jun, 2021 14 commits
    • Merge branch 'devlink-rate-limit-fixes' · 35713d9b
      David S. Miller authored
      Dmytro Linkin says:
      
      ====================
      Fixes for devlink rate objects API
      
      Patch #1 fixes the refcount of a parent node not being decreased
      when a leaf object is destroyed.
      
      Patch #2 fixes an incorrect eswitch mode check.
      
      Patch #3 protects list traversal with a lock.
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35713d9b
    • devlink: Protect rate list with lock while switching modes · a3e5e579
      Dmytro Linkin authored
      The devlink eswitch set command doesn't hold devlink->lock, which
      makes a race condition possible between rate list traversal and other
      devlink rate API calls, like devlink_rate_nodes_destroy().
      Hold the devlink lock while traversing the list.
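      The pattern of the fix can be sketched in userspace with a mutex
      standing in for devlink->lock: any walk of the shared list holds the
      same lock the destroy path takes, so nodes cannot be freed mid-walk.
      The structure and function names below are illustrative, not devlink's:

```c
/* Userspace model of lock-protected list traversal vs. destruction. */
#include <pthread.h>
#include <stdlib.h>

struct rate_node {
    int rate;
    struct rate_node *next;
};

struct devlink_model {
    pthread_mutex_t lock;       /* stands in for devlink->lock */
    struct rate_node *rate_list;
};

/* Traversal holds the lock for its whole duration (the fix). */
int sum_rates(struct devlink_model *dl)
{
    int sum = 0;
    pthread_mutex_lock(&dl->lock);
    for (struct rate_node *n = dl->rate_list; n; n = n->next)
        sum += n->rate;
    pthread_mutex_unlock(&dl->lock);
    return sum;
}

/* The destroy path takes the same lock, so it serializes with walkers. */
void destroy_rate_nodes(struct devlink_model *dl)
{
    pthread_mutex_lock(&dl->lock);
    while (dl->rate_list) {
        struct rate_node *n = dl->rate_list;
        dl->rate_list = n->next;
        free(n);
    }
    pthread_mutex_unlock(&dl->lock);
}
```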
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3e5e579
    • devlink: Remove eswitch mode check for mode set call · ff99324d
      Dmytro Linkin authored
      When the eswitch is disabled, querying its current mode results in an
      error. Because of this, trying to set the eswitch switchdev mode for
      mlx5 devices fails. Hence remove this check.
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff99324d
    • devlink: Decrease refcnt of parent rate object on leaf destroy · 1321ed5e
      Dmytro Linkin authored
      Port functions, like SFs, can be deleted by the user while their leaf
      rate object has a parent node. In such a case the node refcount won't
      be decreased, which blocks the node from deletion later.
      Do a simple refcount decrease, since the driver is in its cleanup
      stage. This:
      1) assumes that the driver took proper internal parent unset action;
      2) avoids nested callback calls and deadlock.
      
      Fixes: d7555984 ("devlink: Allow setting parent node of rate objects")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1321ed5e
    • virtio_net: Use virtio_find_vqs_ctx() helper · a2f7dc00
      Xianting Tian authored
      virtio_find_vqs_ctx() is defined but currently never called;
      this is the right place to use it.
      Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2f7dc00
    • net/tls: Remove the __TLS_DEC_STATS() macro. · 10ed7ce4
      Kuniyuki Iwashima authored
      The commit d26b698d ("net/tls: add skeleton of MIB statistics")
      introduced __TLS_DEC_STATS(), but it is unused and __SNMP_DEC_STATS()
      is not defined either. Let's remove it.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10ed7ce4
    • tcp: Add stats for socket migration. · 55d444b3
      Kuniyuki Iwashima authored
      This commit adds two stats for the socket migration feature to evaluate
      its effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).
      
      If the migration fails because of the own_req race between the
      ACK-receiving and SYN+ACK-sending paths, we do not increment the
      failure stat; another CPU is then responsible for the req.
      
      Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/
      Suggested-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55d444b3
    • ibmveth: Set CHECKSUM_PARTIAL if NULL TCP CSUM. · 7525de25
      David Wilder authored
      TCP checksums on received packets may be set to NULL by the sender if
      CSO is enabled. The hypervisor flags these packets as checksum-ok and
      the skb is then flagged CHECKSUM_UNNECESSARY. If these packets are then
      forwarded, the sender will not request CSO due to the
      CHECKSUM_UNNECESSARY flag. The result is a TCP packet sent with a bad
      checksum. This change sets CHECKSUM_PARTIAL on these packets, causing
      the sender to correctly request checksum offload.
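      The receive-path decision above can be modelled as a small pure
      function; the enum values mirror the kernel's skb->ip_summed states,
      but the function name and shape are illustrative, not the driver's
      actual code (which must also fill in csum_start/csum_offset when it
      picks CHECKSUM_PARTIAL):

```c
/* Userspace model of the ibmveth rx checksum classification. */
#include <stdint.h>

enum ip_summed { CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_PARTIAL };

enum ip_summed classify_rx_csum(int hv_csum_ok, uint16_t tcp_csum_field)
{
    if (!hv_csum_ok)
        return CHECKSUM_NONE;        /* let the stack verify it */
    if (tcp_csum_field == 0)
        return CHECKSUM_PARTIAL;     /* NULL csum from a CSO sender:
                                        re-request offload if forwarded */
    return CHECKSUM_UNNECESSARY;     /* a real, verified checksum */
}
```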
      Signed-off-by: David Wilder <dwilder@us.ibm.com>
      Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Tested-by: Cristobal Forno <cforno12@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7525de25
    • Merge tag 'mlx5-net-next-2021-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · fe87797b
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-net-next-2021-06-22
      
      1) Various minor cleanups and fixes from net-next branch
      2) Optimize mlx5 feature check on tx and
         a fix to allow Vxlan with Ipsec offloads
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe87797b
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · a7b62112
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr,
         from Phil Sutter.
      
      2) Simplify TCP option sanity check for TCP packets, also from Phil.
      
      3) Add a new expression to store when the rule has been used last time.
      
      4) Pass the hook state object to log function, from Florian Westphal.
      
      5) Document the new sysctl knobs to tune the flowtable timeouts,
         from Oz Shlomo.
      
      6) Fix snprintf error check in the new nfnetlink_hook infrastructure,
         from Dan Carpenter.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a7b62112
    • selftests: icmp_redirect: support expected failures · 0a36a75c
      Andrea Righi authored
      According to a comment in commit 99513cfa ("selftest: Fixes for
      icmp_redirect test") the test "IPv6: mtu exception plus redirect" is
      expected to fail, because of a bug in the IPv6 logic that apparently
      hasn't been fixed yet.
      
      We should probably consider this failure an "expected failure";
      therefore change the script to return XFAIL for that particular test
      and also report the total number of expected failures at the end of
      the run.
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a36a75c
    • Merge branch 'lockless-qdisc-opts' · e940eb3c
      David S. Miller authored
      Yunsheng Lin says:
      
      ====================
      Some optimization for lockless qdisc
      
      Patch 1: remove unnecessary seqcount operation.
      Patch 2: implement TCQ_F_CAN_BYPASS.
      Patch 3: remove qdisc->empty.
      
      Performance data for pktgen in queue_xmit mode + dummy netdev
      with pfifo_fast:
      
       threads    unpatched           patched             delta
          1       2.60Mpps            3.21Mpps             +23%
          2       3.84Mpps            5.56Mpps             +44%
          4       5.52Mpps            5.58Mpps             +1%
          8       2.77Mpps            2.76Mpps             -0.3%
         16       2.24Mpps            2.23Mpps             -0.4%
      
      Performance for IP forward testing: 1.05Mpps increases to
      1.16Mpps, about 10% improvement.
      
      V3: Add 'Acked-by' from Jakub and 'Tested-by' from Vladimir,
          and resend based on latest net-next.
      V2: Adjust the comment and commit log according to discussion
          in V1.
      V1: Drop RFC tag, add nolock_qdisc_is_empty() and do the qdisc
          empty checking without the protection of qdisc->seqlock to
          avoid doing an unnecessary spin_trylock() for the contention case.
      RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate non-empty
              qdisc, and add patch 1 and 3.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e940eb3c
    • net: sched: remove qdisc->empty for lockless qdisc · d3e0f575
      Yunsheng Lin authored
      As the MISSED and DRAINING states are used to indicate a non-empty
      qdisc, qdisc->empty is no longer needed, so remove it.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3e0f575
    • net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc · c4fef01b
      Yunsheng Lin authored
      Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
      flag set, but queue discipline by-pass does not work for lockless
      qdisc because skb is always enqueued to qdisc even when the qdisc
      is empty, see __dev_xmit_skb().
      
      This patch calls sch_direct_xmit() to transmit the skb directly
      to the driver when the lockless qdisc is empty, which avoids the
      enqueue and dequeue operations.
      
      qdisc->empty is not a reliable indication of an empty qdisc, because
      there is a time window between enqueuing and setting qdisc->empty.
      So we use the MISSED state added in commit a90c57f2 ("net:
      sched: fix packet stuck problem for lockless qdisc"), which
      indicates there is lock contention, suggesting that it is better
      not to do the qdisc bypass in order to avoid packet reordering
      problems.
      
      In order to make the MISSED state a reliable indication of an empty
      qdisc, we need to ensure that testing and clearing of the MISSED
      state is within the protection of qdisc->seqlock; only setting the
      MISSED state can be done without the protection of qdisc->seqlock.
      A MISSED state test is added without the protection of
      qdisc->seqlock to avoid doing an unnecessary spin_trylock() in the
      contention case.
      
      As the enqueuing is not within the protection of qdisc->seqlock,
      there is still a potential data race as mentioned by Jakub [1]:
      
            thread1               thread2             thread3
      qdisc_run_begin() # true
                              qdisc_run_begin(q)
                                   set(MISSED)
      pfifo_fast_dequeue
        clear(MISSED)
        # recheck the queue
      qdisc_run_end()
                                  enqueue skb1
                                                   qdisc empty # true
                                                qdisc_run_begin() # true
                                                sch_direct_xmit() # skb2
                               qdisc_run_begin()
                                  set(MISSED)
      
      When the above happens, skb1 enqueued by thread2 is transmitted after
      skb2 is transmitted by thread3, because the MISSED state setting and
      the enqueuing are not under qdisc->seqlock. If qdisc bypass is
      disabled, skb1 has a better chance of being transmitted more quickly
      than skb2.
      
      This patch does not take care of the above data race, because we
      view it as similar to the following: even if CPU1 and CPU2 write skbs
      to two sockets both heading to the same qdisc at the same time, there
      is no guarantee which skb will hit the qdisc first, because many
      factors like interrupts/softirqs/cache misses/scheduling affect
      that.
      
      There are the below cases that need special handling:
      1. The MISSED state may be cleared before another round of dequeuing
         in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
         dequeue all skbs in one round and call __netif_schedule(), which
         might result in a non-empty qdisc without MISSED set. In order
         to avoid this, the MISSED state is set for the lockless qdisc and
         __netif_schedule() is called at the end of qdisc_run_end().
      
      2. The MISSED state also needs to be set for the lockless qdisc
         instead of calling __netif_schedule() directly when requeuing an
         skb, for a similar reason.
      
      3. For the netdev-queue-stopped case, the MISSED state needs clearing
         while the netdev queue is stopped, otherwise there may be
         unnecessary __netif_schedule() calls. So a new DRAINING state
         is added to indicate this case, which also indicates a non-empty
         qdisc.
      
      4. There is already a netif_xmit_frozen_or_stopped() check in
         dequeue_skb() and sch_direct_xmit(), both within the protection of
         qdisc->seqlock, but the same check in __dev_xmit_skb() is without
         that protection, which might make the empty indication of a
         lockless qdisc unreliable. So remove the check in
         __dev_xmit_skb(); the checks within the protection of
         qdisc->seqlock seem enough to avoid the CPU consumption problem
         for the netdev-queue-stopped case.
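      The empty/MISSED/DRAINING bookkeeping above can be sketched as a tiny
      userspace model: the qdisc is treated as empty only when neither bit
      is set, MISSED may be set without the seqlock, and testing/clearing
      happen only under it. The names mirror the kernel's, but this is a
      sketch, not the actual implementation:

```c
/* Userspace model of the lockless-qdisc empty indication. */
#include <stdatomic.h>
#include <stdbool.h>

#define QDISC_STATE_MISSED    (1u << 0)
#define QDISC_STATE_DRAINING  (1u << 1)
#define QDISC_STATE_NON_EMPTY (QDISC_STATE_MISSED | QDISC_STATE_DRAINING)

struct model_qdisc {
    atomic_uint state;
};

/* Lock-free check done in __dev_xmit_skb() before trying the bypass. */
static inline bool nolock_qdisc_is_empty(const struct model_qdisc *q)
{
    return !(atomic_load_explicit((atomic_uint *)&q->state,
                                  memory_order_acquire)
             & QDISC_STATE_NON_EMPTY);
}

/* Setting MISSED needs no seqlock... */
static inline void qdisc_set_missed(struct model_qdisc *q)
{
    atomic_fetch_or(&q->state, QDISC_STATE_MISSED);
}

/* ...but clearing it is only done while holding qdisc->seqlock. */
static inline void qdisc_clear_missed_locked(struct model_qdisc *q)
{
    atomic_fetch_and(&q->state, ~QDISC_STATE_MISSED);
}
```

      The bypass is then attempted only when nolock_qdisc_is_empty() holds,
      so a set MISSED or DRAINING bit steers skbs back through the normal
      enqueue path and avoids the reordering described above.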
      
      1. https://lkml.org/lkml/2021/5/29/215
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4fef01b