1. 03 Oct, 2022 25 commits
    • net: fec: using page pool to manage RX buffers · 95698ff6
      Shenwei Wang authored
      This patch optimizes RX buffer management by using the page
      pool. The purpose of this change is to prepare for the upcoming
      XDP support. The current driver uses one frame per page for easy
      management.
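As a rough illustration of why pooling RX buffers helps, here is a minimal userspace sketch of a recycling buffer pool with one buffer per "page". It is not the kernel's page_pool API and not the fec driver code; every name (`toy_page_pool`, `toy_pool_alloc`, `TOY_POOL_CACHE_SIZE`, and so on) is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_CACHE_SIZE 8    /* hypothetical cache depth */
#define TOY_POOL_PAGE_SIZE  4096 /* one frame per page, as in the commit text */

/* Toy stand-in for a page pool: a small LIFO cache of free pages. */
struct toy_page_pool {
	void *cache[TOY_POOL_CACHE_SIZE];
	size_t count;           /* pages currently cached */
	unsigned long recycled; /* allocations served from the cache */
};

void toy_pool_init(struct toy_page_pool *pool)
{
	pool->count = 0;
	pool->recycled = 0;
}

/* Hand out a page: reuse a cached one if possible, else allocate. */
void *toy_pool_alloc(struct toy_page_pool *pool)
{
	if (pool->count > 0) {
		pool->recycled++;
		return pool->cache[--pool->count];
	}
	return malloc(TOY_POOL_PAGE_SIZE);
}

/* Return a page: cache it for reuse, or free it when the cache is full. */
void toy_pool_put(struct toy_page_pool *pool, void *page)
{
	if (pool->count < TOY_POOL_CACHE_SIZE)
		pool->cache[pool->count++] = page;
	else
		free(page);
}

/* Free every page still held in the cache. */
void toy_pool_destroy(struct toy_page_pool *pool)
{
	while (pool->count > 0)
		free(pool->cache[--pool->count]);
}
```

In this model, a buffer returned with `toy_pool_put()` is handed out again by the next `toy_pool_alloc()` without touching the allocator, which is the effect the RX path is after.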
      
      Added the __maybe_unused attribute to the following functions to
      avoid compiler warnings. Those functions will be removed by a separate
      patch once this page pool solution is accepted.
       - fec_enet_new_rxbdp
       - fec_enet_copybreak
      
      The following compares the results of the page pool implementation
      with the original (non page pool) implementation.
      
       --- Small packet (64 bytes) test results are almost the same
       --- regardless of the implementation
       --- on both i.MX8 and i.MX6SX platforms.
      
      shenwei@5810:~/pktgen$ iperf -c 10.81.16.245 -w 2m -i 1 -l 64
      ------------------------------------------------------------
      Client connecting to 10.81.16.245, TCP port 5001
      TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
      ------------------------------------------------------------
      [  1] local 10.81.17.20 port 39728 connected with 10.81.16.245 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  1] 0.0000-1.0000 sec  37.0 MBytes   311 Mbits/sec
      [  1] 1.0000-2.0000 sec  36.6 MBytes   307 Mbits/sec
      [  1] 2.0000-3.0000 sec  37.2 MBytes   312 Mbits/sec
      [  1] 3.0000-4.0000 sec  37.1 MBytes   312 Mbits/sec
      [  1] 4.0000-5.0000 sec  37.2 MBytes   312 Mbits/sec
      [  1] 5.0000-6.0000 sec  37.2 MBytes   312 Mbits/sec
      [  1] 6.0000-7.0000 sec  37.2 MBytes   312 Mbits/sec
      [  1] 7.0000-8.0000 sec  37.2 MBytes   312 Mbits/sec
      [  1] 0.0000-8.0943 sec   299 MBytes   310 Mbits/sec
      
       --- Page Pool implementation on i.MX8 ----
      
      shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
      ------------------------------------------------------------
      Client connecting to 10.81.16.245, TCP port 5001
      TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
      ------------------------------------------------------------
      [  1] local 10.81.17.20 port 43204 connected with 10.81.16.245 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  1] 0.0000-1.0000 sec   111 MBytes   933 Mbits/sec
      [  1] 1.0000-2.0000 sec   111 MBytes   934 Mbits/sec
      [  1] 2.0000-3.0000 sec   112 MBytes   935 Mbits/sec
      [  1] 3.0000-4.0000 sec   111 MBytes   933 Mbits/sec
      [  1] 4.0000-5.0000 sec   111 MBytes   934 Mbits/sec
      [  1] 5.0000-6.0000 sec   111 MBytes   933 Mbits/sec
      [  1] 6.0000-7.0000 sec   111 MBytes   931 Mbits/sec
      [  1] 7.0000-8.0000 sec   112 MBytes   935 Mbits/sec
      [  1] 8.0000-9.0000 sec   111 MBytes   933 Mbits/sec
      [  1] 9.0000-10.0000 sec   112 MBytes   935 Mbits/sec
      [  1] 0.0000-10.0077 sec  1.09 GBytes   933 Mbits/sec
      
       --- Non Page Pool implementation on i.MX8 ----
      
      shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
      ------------------------------------------------------------
      Client connecting to 10.81.16.245, TCP port 5001
      TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
      ------------------------------------------------------------
      [  1] local 10.81.17.20 port 49154 connected with 10.81.16.245 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  1] 0.0000-1.0000 sec   104 MBytes   868 Mbits/sec
      [  1] 1.0000-2.0000 sec   105 MBytes   878 Mbits/sec
      [  1] 2.0000-3.0000 sec   105 MBytes   881 Mbits/sec
      [  1] 3.0000-4.0000 sec   105 MBytes   879 Mbits/sec
      [  1] 4.0000-5.0000 sec   105 MBytes   878 Mbits/sec
      [  1] 5.0000-6.0000 sec   105 MBytes   878 Mbits/sec
      [  1] 6.0000-7.0000 sec   104 MBytes   875 Mbits/sec
      [  1] 7.0000-8.0000 sec   104 MBytes   875 Mbits/sec
      [  1] 8.0000-9.0000 sec   104 MBytes   873 Mbits/sec
      [  1] 9.0000-10.0000 sec   104 MBytes   875 Mbits/sec
      [  1] 0.0000-10.0073 sec  1.02 GBytes   875 Mbits/sec
      
       --- Page Pool implementation on i.MX6SX ----
      
      shenwei@5810:~/pktgen$ iperf -c 10.81.16.245 -w 2m -i 1
      ------------------------------------------------------------
      Client connecting to 10.81.16.245, TCP port 5001
      TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
      ------------------------------------------------------------
      [  1] local 10.81.17.20 port 57288 connected with 10.81.16.245 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  1] 0.0000-1.0000 sec  78.8 MBytes   661 Mbits/sec
      [  1] 1.0000-2.0000 sec  82.5 MBytes   692 Mbits/sec
      [  1] 2.0000-3.0000 sec  82.4 MBytes   691 Mbits/sec
      [  1] 3.0000-4.0000 sec  82.4 MBytes   691 Mbits/sec
      [  1] 4.0000-5.0000 sec  82.5 MBytes   692 Mbits/sec
      [  1] 5.0000-6.0000 sec  82.4 MBytes   691 Mbits/sec
      [  1] 6.0000-7.0000 sec  82.5 MBytes   692 Mbits/sec
      [  1] 7.0000-8.0000 sec  82.4 MBytes   691 Mbits/sec
      [  1] 8.0000-9.0000 sec  82.4 MBytes   691 Mbits/sec
      [  1] 9.0000-9.5506 sec  45.0 MBytes   686 Mbits/sec
      [  1] 0.0000-9.5506 sec   783 MBytes   688 Mbits/sec
      
       --- Non Page Pool implementation on i.MX6SX ----
      
      shenwei@5810:~/pktgen$ iperf -c 10.81.16.245 -w 2m -i 1
      ------------------------------------------------------------
      Client connecting to 10.81.16.245, TCP port 5001
      TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
      ------------------------------------------------------------
      [  1] local 10.81.17.20 port 36486 connected with 10.81.16.245 port 5001
      [ ID] Interval       Transfer     Bandwidth
      [  1] 0.0000-1.0000 sec  70.5 MBytes   591 Mbits/sec
      [  1] 1.0000-2.0000 sec  64.5 MBytes   541 Mbits/sec
      [  1] 2.0000-3.0000 sec  73.6 MBytes   618 Mbits/sec
      [  1] 3.0000-4.0000 sec  73.6 MBytes   618 Mbits/sec
      [  1] 4.0000-5.0000 sec  72.9 MBytes   611 Mbits/sec
      [  1] 5.0000-6.0000 sec  73.4 MBytes   616 Mbits/sec
      [  1] 6.0000-7.0000 sec  73.5 MBytes   617 Mbits/sec
      [  1] 7.0000-8.0000 sec  73.4 MBytes   616 Mbits/sec
      [  1] 8.0000-9.0000 sec  73.4 MBytes   616 Mbits/sec
      [  1] 9.0000-10.0000 sec  73.9 MBytes   620 Mbits/sec
      [  1] 0.0000-10.0174 sec   723 MBytes   605 Mbits/sec
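To put the summary lines above in relative terms, the gain can be computed from the reported averages: 933 vs 875 Mbits/sec on i.MX8 and 688 vs 605 Mbits/sec on i.MX6SX. The helper below is only an illustrative calculation (the name `toy_gain_percent` is made up); it yields roughly 6.6% for i.MX8 and roughly 13.7% for i.MX6SX.

```c
/* Relative throughput gain of `with_pool` over `without_pool`, in percent. */
double toy_gain_percent(double with_pool, double without_pool)
{
	return (with_pool - without_pool) / without_pool * 100.0;
}
```

For example, `toy_gain_percent(933.0, 875.0)` covers the i.MX8 runs and `toy_gain_percent(688.0, 605.0)` the i.MX6SX runs.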
      Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Remove DECnet leftovers from flow.h. · 9bc61c04
      Guillaume Nault authored
      DECnet was removed by commit 1202cdd6 ("Remove DECnet support from
      kernel"). Let's also remove its flow structure.
      
      Compile-tested only (allmodconfig).
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gro: add support of (hw)gro packets to gro stack · 5eddb249
      Coco Li authored
      The current GRO stack only supports incoming packets containing
      one frame/MSS.
      
      This patch changes GRO to accept packets that are already GRO.
      
      HW-GRO (aka RSC for some vendors) is very often limited in the presence
      of interleaved packets. The Linux SW GRO stack can complete the job
      and provide larger GRO packets, thus reducing the rate of ACK packets
      and CPU overhead.
      
      This also means BIG TCP can still be used, even if HW-GRO/RSC was
      able to cook ~64 KB GRO packets.
      
      v2: fix logic in tcp_gro_receive()
      
          Only support TCP for the moment (Paolo)
      Co-Developed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Coco Li <lixiaoyan@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mptcp-fastclose' · 197060c1
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: Fastclose edge cases and error handling
      
      MPTCP has existing code to use the MP_FASTCLOSE option header, which
      works like a RST for the MPTCP-level connection (regular RSTs only
      affect specific subflows in MPTCP). This series has some improvements
      for fastclose.
      
      Patch 1 aligns fastclose socket error handling with TCP RST behavior on
      TCP sockets.
      
      Patch 2 adds use of MP_FASTCLOSE in some more edge cases, like file
      descriptor close, FIN_WAIT timeout, and when the socket has unread data.
      
      Patch 3 updates the fastclose self tests.
      
      Patch 4 does not change any code, just fixes some outdated comments.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: update misleading comments. · d89e3ed7
      Paolo Abeni authored
      The MPTCP data path is quite complex and hard to understand even
      without some foggy comments referring to modified code and/or
      completely misleading from the beginning.
      
      Update a few of them to more accurately describe the current
      status.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: update and extend fastclose test-cases · 6bf41020
      Paolo Abeni authored
      After the previous patches, the MPTCP protocol can generate
      fast-closes on both ends of the connection. Rework the relevant
      test-case to carefully trigger the fast-close code-path on a
      single end at a time, while ensuring that a predictable amount
      of data is spooled on both ends.
      
      Additionally, add another test-case for the passive socket
      fast-close.
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: use fastclose on more edge scenarios · d21f8348
      Paolo Abeni authored
      Daire reported a user-space application hang-up when the
      peer is forcibly closed before the data transfer completion.
      
      The relevant application expects the peer to either
      do an application-level clean shutdown or a transport-level
      connection reset.
      
      We can accommodate such a user by extending the fastclose
      usage: at fd close time, if the msk socket has some unread
      data, and at FIN_WAIT timeout.
      
      Note that at MPTCP close time we must ensure that the TCP
      subflows will reset: set the linger socket option to a suitable
      value.
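For readers less familiar with the linger option, the following is a userspace analogue only, not the kernel change itself: on a plain TCP socket, enabling SO_LINGER with a zero timeout makes close() abort the connection with a RST instead of the normal FIN handshake. The helper name `toy_set_abortive_close` is hypothetical, and the value the patch actually uses inside the kernel is not stated in the commit message.

```c
#include <sys/socket.h>

/*
 * Request an abortive close on a TCP socket: with l_onoff = 1 and
 * l_linger = 0, close() discards unsent data and sends a RST.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */
int toy_set_abortive_close(int fd)
{
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };

	return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

A caller would invoke `toy_set_abortive_close(fd)` shortly before `close(fd)` when it wants the peer to observe a reset.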
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: propagate fastclose error · 69800e51
      Paolo Abeni authored
      When an mptcp socket is closed due to an incoming FASTCLOSE
      option, no specific sk_err is set and later syscalls will
      usually fail with EPIPE.
      
      Align the current fastclose error handling with TCP reset,
      properly setting the socket error according to the current
      msk state and propagating such error.
      
      Additionally, sendmsg() currently does not handle sk_err
      properly, always returning EPIPE.
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'RollBall-Hilink-Turris-10G-copper-SFP-support' · 7171e8a1
      David S. Miller authored
      Marek Behún says:
      
      ====================
      RollBall / Hilink / Turris 10G copper SFP support
      
      I am resurrecting my attempt to add support for RollBall / Hilink /
      Turris 10G copper SFP modules.
      
      The modules contain a Marvell 88X3310 PHY, which can communicate with
      the system via sgmii, 2500base-x, 5gbase-r, 10gbase-r or usxgmii mode.
      
      Some of the patches I've taken from Russell King's net-queue [1]
      (with some rebasing).
      
      The important changes from my previous attempts are:
      - I am including the changes needed to phylink and marvell10g driver,
        so that the 88X3310 PHY is configured to use PHY modes supported by
        the host (the PHY defaults to using 10gbase-r only on the host's side)
      - I have changed the patch that informs phylib about the interfaces
        supported by the host (patch 5 of this series): it now fills in the
        phydev->host_interfaces member only when connecting a PHY that is
        inside a SFP module. This may change in the future.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sfp: add support for multigig RollBall transceivers · 324e88cb
      Marek Behún authored
      This adds support for multigig copper SFP modules from RollBall/Hilink.
      These modules have a specific way to access clause 45 registers of the
      internal PHY.
      
      We also need to wait at least 22 seconds after deasserting TX disable
      before accessing the PHY. The code waits for 25 seconds just to be sure.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules · 09bbedac
      Marek Behún authored
      Some multigig SFPs from RollBall and Hilink do not expose functional
      MDIO access to the internal PHY of the SFP via I2C address 0x56
      (although there seems to be read-only clause 22 access on this address).
      
      Instead, the PHY of these SFPs can be accessed over I2C via the SFP
      Enhanced Digital Diagnostic Interface at I2C address 0x51. The SFP_PAGE
      has to be set to 3 and the password must be filled with 0xff bytes for
      this PHY communication to work.
      
      This extends the mdio-i2c driver to support this protocol by adding a
      special parameter to the mdio_i2c_alloc function via which this RollBall
      protocol can be selected.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sfp: create/destroy I2C mdiobus before PHY probe/after PHY release · e85b1347
      Marek Behún authored
      Instead of configuring the I2C mdiobus when the SFP driver is probed,
      create/destroy the mdiobus before the PHY is probed for/after it is
      released.
      
      This way we can tell the mdio-i2c code which protocol to use for each
      SFP transceiver.
      
      Move the code that determines the MDIO I2C protocol from
      sfp_sm_probe_for_phy() to sfp_sm_mod_probe(), where most of the SFP ID
      parsing is done. Don't allocate the I2C bus if no PHY is expected.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sfp: Add and use macros for SFP quirks definitions · 13c8adcf
      Marek Behún authored
      Add macros SFP_QUIRK(), SFP_QUIRK_M() and SFP_QUIRK_F() for defining SFP
      quirk table entries. Use them to deduplicate the code a little bit.
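A sketch of how such table-entry macros can cut duplication is shown below. Only the macro names come from the commit message; the struct layout, the macro parameter lists, the callback signatures and the table contents are assumptions made for illustration and are simplified compared with the real SFP code.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical quirk entry; the real struct differs. */
struct toy_sfp_quirk {
	const char *vendor;
	const char *part;
	void (*modes)(unsigned long *modes); /* adjusts advertised link modes */
	void (*fixup)(int *state);           /* adjusts module state */
};

/* Full entry, plus shorthands for "modes only" and "fixup only". */
#define SFP_QUIRK(_v, _p, _m, _f) \
	{ .vendor = _v, .part = _p, .modes = _m, .fixup = _f }
#define SFP_QUIRK_M(_v, _p, _m) SFP_QUIRK(_v, _p, _m, NULL)
#define SFP_QUIRK_F(_v, _p, _f) SFP_QUIRK(_v, _p, NULL, _f)

/* Placeholder callbacks, made up for this example. */
void toy_modes_extra(unsigned long *modes) { *modes |= 1UL << 0; }
void toy_fixup_slow_start(int *state) { *state = 1; }

/* Vendor and part strings are placeholders, not real module IDs. */
const struct toy_sfp_quirk toy_quirks[] = {
	SFP_QUIRK_M("VENDOR-A", "PART-1", toy_modes_extra),
	SFP_QUIRK_F("VENDOR-B", "PART-2", toy_fixup_slow_start),
};

/* Linear lookup by exact vendor and part name; NULL when no quirk applies. */
const struct toy_sfp_quirk *toy_lookup(const char *vendor, const char *part)
{
	for (size_t i = 0; i < sizeof(toy_quirks) / sizeof(toy_quirks[0]); i++) {
		if (!strcmp(toy_quirks[i].vendor, vendor) &&
		    !strcmp(toy_quirks[i].part, part))
			return &toy_quirks[i];
	}
	return NULL;
}
```

Each table row then states only the fields that matter for that module, and the unused callback is filled with NULL by the shorthand macro.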
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: allow attaching phy for SFP modules on 802.3z mode · 31eb8907
      Marek Behún authored
      Some SFPs may contain an internal PHY which may in some cases want to
      connect with the host interface in 1000base-x/2500base-x mode.
      Do not fail if such a PHY is being attached in one of these PHY
      interface modes.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Pali Rohár <pali@kernel.org>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell10g: select host interface configuration · d6d29292
      Russell King authored
      Select the host interface configuration according to the capabilities of
      the host if the host provided them. This is currently provided only when
      connecting a PHY that is inside a SFP.
      
      The PHY supports several configurations of host communication:
      - always communicate with host in 10gbase-r, even if copper speed is
        lower (rate matching mode),
      - the same as above but use xaui/rxaui instead of 10gbase-r,
      - switch host SerDes mode between 10gbase-r, 5gbase-r, 2500base-x and
        sgmii according to copper speed,
      - the same as above but use xaui/rxaui instead of 10gbase-r.
      
      This mode of host communication, called MACTYPE, is by default selected
      by strapping pins, but it can be changed in software.
      
      This adds support for selecting this mode according to which modes are
      supported by the host.
      
      This allows the kernel to:
      - support SFP modules with 88X33X0 or 88E21X0 inside them
      
      Note: we use mv3310_select_mactype() for both 88X3310 and 88X3340,
      although 88X3340 does not support XAUI. This is not a problem because
      88X3340 does not declare XAUI in its supported_interfaces, and so this
      function will never choose that MACTYPE.
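The selection idea described in this message can be sketched as a function that maps a bitmap of host-supported interface modes to one of the four configurations listed above. This is a simplified model and not the marvell10g driver code: the enum names, the bit layout and the priority order are all assumptions, and rxaui and usxgmii are left out for brevity.

```c
/* Hypothetical bit flags for host-supported interface modes. */
enum toy_iface {
	TOY_SGMII    = 1u << 0,
	TOY_10GBASER = 1u << 1,
	TOY_XAUI     = 1u << 2,
};

/* The four host communication configurations from the commit text. */
enum toy_mactype {
	TOY_MACTYPE_NONE = -1,
	TOY_MACTYPE_10GBASER_SWITCHING,
	TOY_MACTYPE_XAUI_SWITCHING,
	TOY_MACTYPE_10GBASER_RATE_MATCH,
	TOY_MACTYPE_XAUI_RATE_MATCH,
};

/* Pick a configuration from what the host says it can do. */
enum toy_mactype toy_select_mactype(unsigned int host)
{
	/* Host handles both a 10G mode and sgmii: switch SerDes mode with speed. */
	if ((host & TOY_SGMII) && (host & TOY_10GBASER))
		return TOY_MACTYPE_10GBASER_SWITCHING;
	if ((host & TOY_SGMII) && (host & TOY_XAUI))
		return TOY_MACTYPE_XAUI_SWITCHING;
	/* Host speaks only a 10G mode: keep it fixed and rate match. */
	if (host & TOY_10GBASER)
		return TOY_MACTYPE_10GBASER_RATE_MATCH;
	if (host & TOY_XAUI)
		return TOY_MACTYPE_XAUI_RATE_MATCH;
	/* Host with only lower-speed modes: switching still works below 10G. */
	if (host & TOY_SGMII)
		return TOY_MACTYPE_10GBASER_SWITCHING;
	return TOY_MACTYPE_NONE;
}
```

Under these assumptions, a host that only advertises sgmii still gets a switching configuration, so lower copper speeds remain usable even though 10G is not.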
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      [ rebased, updated, also added support for 88E21X0 ]
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell10g: Use tabs instead of spaces for indentation · 3891569b
      Marek Behún authored
      Some register definitions were defined with spaces used for indentation.
      Change them to tabs.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: pass supported host PHY interface modes to phylib for SFP's PHYs · eca68a3c
      Marek Behún authored
      Pass the supported PHY interface types to phylib if the PHY we are
      connecting is inside a SFP, so that the PHY driver can select an
      appropriate host configuration mode for its interface according to
      the host capabilities.
      
      For example the Marvell 88X3310 PHY inside RollBall SFP modules
      defaults to 10gbase-r mode on the host's side, and the marvell10g
      driver currently does not change this setting. But a host may not
      support 10gbase-r. For example Turris Omnia only supports sgmii,
      1000base-x and 2500base-x modes. The PHY can be configured to use
      those modes, but in order for the PHY driver to do that, it needs
      to know which modes are supported.
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: rename phylink_sfp_config() · e6084637
      Russell King (Oracle) authored
      phylink_sfp_config() now only deals with configuring the MAC for a
      SFP containing a PHY. Rename it to be specific.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: use phy_interface_t bitmaps for optical modules · f81fa96d
      Russell King authored
      Where a MAC provides a phy_interface_t bitmap, use these bitmaps to
      select the operating interface mode for optical SFP modules, rather
      than using the linkmode bitmaps.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sfp: augment SFP parsing with phy_interface_t bitmap · fd580c98
      Russell King authored
      We currently parse the SFP EEPROM to a bitmap of ethtool link modes,
      and then attempt to convert the link modes to a PHY interface mode.
      While this works at present, there are cases where this is sub-optimal.
      For example, where a module can operate with several different PHY
      interface modes.
      
      To start addressing this, arrange for the SFP EEPROM parsing to also
      provide a bitmap of the possible PHY interface modes.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: add ability to validate a set of interface modes · 1645f44d
      Russell King (Oracle) authored
      Rather than having the ability to validate all supported interface
      modes or a single interface mode, introduce the ability to validate
      a subset of supported modes.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      [ rebased on current net-next ]
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ip_tunnel-netlink-parms' · 3735264d
      David S. Miller authored
      Liu Jian says:
      
      ====================
      Add helper functions to parse netlink msg of ip_tunnel
      
      v1->v2: Move the implementation of the helper function to ip_tunnel_core.c
      v2->v3: Change EXPORT_SYMBOL to EXPORT_SYMBOL_GPL
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add helper function to parse netlink msg of ip_tunnel_parm · b86fca80
      Liu Jian authored
      Add ip_tunnel_netlink_parms to parse netlink msg of ip_tunnel_parm.
      Reduces duplicate code, no actual functional changes.
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add helper function to parse netlink msg of ip_tunnel_encap · 537dd2d9
      Liu Jian authored
      Add ip_tunnel_netlink_encap_parms to parse netlink msg of ip_tunnel_encap.
      Reduces duplicate code, no actual functional changes.
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next · 42e8e6d9
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      1) Refactor selftests to use an array of structs in xfrm_fill_key().
         From Gautam Menghani.
      
      2) Drop an unused argument from xfrm_policy_match.
         From Hongbin Wang.
      
      3) Support collect metadata mode for xfrm interfaces.
         From Eyal Birger.
      
      4) Add netlink extack support to xfrm.
         From Sabrina Dubroca.
      
      Please note, there is a merge conflict in:
      
      include/net/dst_metadata.h
      
      between commit:
      
      0a28bfd4 ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support")
      
      from the net-next tree and commit:
      
      5182a5d4 ("net: allow storing xfrm interface metadata in metadata_dst")
      
      from the ipsec-next tree.
      
      Can be solved as done in linux-next.
      
      Please pull or let me know if there are problems.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 02 Oct, 2022 4 commits
  3. 01 Oct, 2022 11 commits
    • Merge branch 'mlx5-xsk-updates-part3-2022-09-30' · bc37b24e
      Jakub Kicinski authored
      Saeed Mahameed says:
      
      ====================
      mlx5 xsk updates part3 2022-09-30
      
      The gist of this 4-part series is in this patchset's last patch.
      
      This series contains performance optimizations. XSK starts using the
      batching allocator, and the XSK data path gets separated from the regular
      RX, which allows dropping some branches not relevant for non-XSK use
      cases. Some minor optimizations for indirect calls and need_wakeup are
      also included.
      
      Other than that, this series adds a few features to the mlx5e
      implementation of XSK:
      
      1. XDP metadata support on XSK RQs.
      
      2. RSS contexts support for XSK RQs.
      
      3. Some other optimizations
      
      4. Last but not least, change the queuing scheme, so that XSK RQs no longer
      use higher indices, but replace the regular RQs.
      
      Maxim Says:
      ==========
      
      In the initial implementation of XSK in mlx5e, XSK RQs coexisted with
      regular RQs in the same channel. The main idea was to allow RSS to work
      the same for regular traffic, without the need to reconfigure RSS to
      exclude XSK queues.
      
      However, this scheme didn't prove to be beneficial, mainly because of
      incompatibility with other vendors. Some tools don't properly support
      using higher indices for XSK queues, some tools get confused by the
      doubled number of RQs exposed in sysfs. Some use cases are purely XSK,
      and allocating the same number of unused regular RQs is a waste of
      resources.
      
      This commit changes the queuing scheme to the standard one, where XSK
      RQs replace regular RQs on the channels where XSK sockets are open. Two
      RQs still exist in the channel to allow failsafe disable of XSK, but
      only one is exposed at a time. The next commit will achieve the desired
      memory savings by flushing the buffers when the regular RQ is unused.
      
      As a result of this transition:
      
      1. It's possible to use RSS contexts over XSK RQs.
      
      2. It's possible to dedicate all queues to XSK.
      
      3. When XSK RQs coexist with regular RQs, the admin should make sure no
      unwanted traffic goes into XSK RQs by either excluding them from RSS or
      setting up the XDP program to return XDP_PASS for non-XSK traffic.
      
      4. When using a mixed fleet of mlx5e devices and other netdevs, the same
      configuration can be applied. If the application supports the fallback
      to copy mode on unsupported drivers, it will work too.
      
      ==========
      
      Part 4 will include some final xsk optimizations and minor improvements
      
      part 1: https://lore.kernel.org/netdev/20220927203611.244301-1-saeed@kernel.org/
      part 2: https://lore.kernel.org/netdev/20220929072156.93299-1-saeed@kernel.org/
      ====================
      
      Link: https://lore.kernel.org/r/20220930162903.62262-1-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: xsk: Use queue indices starting from 0 for XSK queues · 3db4c85c
      Maxim Mikityanskiy authored
      In the initial implementation of XSK in mlx5e, XSK RQs coexisted with
      regular RQs in the same channel. The main idea was to allow RSS to work
      the same for regular traffic, without the need to reconfigure RSS to
      exclude XSK queues.
      
      However, this scheme didn't prove to be beneficial, mainly because of
      incompatibility with other vendors. Some tools don't properly support
      using higher indices for XSK queues, some tools get confused by the
      doubled number of RQs exposed in sysfs. Some use cases are purely XSK,
      and allocating the same number of unused regular RQs is a waste of
      resources.
      
      This commit changes the queuing scheme to the standard one, where XSK
      RQs replace regular RQs on the channels where XSK sockets are open. Two
      RQs still exist in the channel to allow failsafe disable of XSK, but
      only one is exposed at a time. The next commit will achieve the desired
      memory savings by flushing the buffers when the regular RQ is unused.
      
      As a result of this transition:
      
      1. It's possible to use RSS contexts over XSK RQs.
      
      2. It's possible to dedicate all queues to XSK.
      
      3. When XSK RQs coexist with regular RQs, the admin should make sure no
      unwanted traffic goes into XSK RQs by either excluding them from RSS or
      setting up the XDP program to return XDP_PASS for non-XSK traffic.
      
      4. When using a mixed fleet of mlx5e devices and other netdevs, the same
      configuration can be applied. If the application supports the fallback
      to copy mode on unsupported drivers, it will work too.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Introduce the mlx5e_flush_rq function · d9ba64de
      Maxim Mikityanskiy authored
      Add a function to flush an RQ: clean up descriptors, release pages and
      reset the RQ. This procedure is used by the recovery flow, and it will
      also be used in a following commit to free some memory when switching a
      channel to the XSK mode.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: xsk: Support XDP metadata on XSK RQs · a752b2ed
      Maxim Mikityanskiy authored
      Add support for XDP metadata on XSK RQs for cross-program
      communication. The driver no longer calls xdp_set_data_meta_invalid and
      copies the metadata to a newly allocated SKB on XDP_PASS.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Optimize RQ page deallocation · ddb7afee
      Maxim Mikityanskiy authored
      mlx5e_free_rx_mpwqe loops over all pages of a MPWQE, calling
      mlx5e_page_release for ones that are not scheduled for XDP_TX or
      XDP_REDIRECT; and mlx5e_page_release checks whether it's an XSK RQ or a
      regular one for each page/XSK frame. This check can be moved outside the
      loop to reduce the number of branches.
      
      mlx5e_free_rx_wqe loops over all fragments, calling mlx5e_page_release
      for the ones that are last in a page; and mlx5e_page_release checks
      whether it's an XSK RQ or a regular one for each fragment. Using the
      fact that XSK doesn't support multiple fragments, it can be optimized
      for both XSK and regular usages:
      
      1. Make an early check for XSK and call its deallocator directly, saving
      3 branches (loop condition, frag->last_in_page and selection of
      deallocator).
      
      2. Call the regular deallocator directly in the non-XSK case, saving a
      branch per fragment, except the first one.
      
      After the changes, mlx5e_page_release is removed, as there are no
      callers left.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Call mlx5e_page_release_dynamic directly where possible · 96d37d86
      Maxim Mikityanskiy authored
      mlx5e_page_release calls the appropriate deallocator depending on
      whether it's an XSK RQ or a regular one. Some flows that call this
      function are not compatible with XSK, so they can call the non-XSK
      deallocator directly to save a branch.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Use non-XSK page allocator in SHAMPO · 132857d9
      Maxim Mikityanskiy authored
      The SHAMPO flow is not compatible with XSK, so it can call the page pool
      allocator directly to save a branch.
      
      mlx5e_page_alloc is removed, as it's no longer used in any flow.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: xsk: Use xsk_buff_alloc_batch on striding RQ · cf544517
      Maxim Mikityanskiy authored
      XSK provides a function to allocate frames in batches for more efficient
      processing. This commit starts using this function on striding RQ and
      creates an optimized flow for XSK. A side effect is an opportunity to
      optimize the regular RX flow by dropping branching for XSK cases.
      
      Performance improvement is up to 6.4% in the aligned mode and up to 7.5%
      in the unaligned mode.
      
      Aligned mode, 2048-byte frames: 12.9 Mpps -> 13.8 Mpps
      Aligned mode, 4096-byte frames: 11.8 Mpps -> 12.5 Mpps
      Unaligned mode, 2048-byte frames: 11.9 Mpps -> 12.8 Mpps
      Unaligned mode, 3072-byte frames: 11.4 Mpps -> 12.1 Mpps
      Unaligned mode, 4096-byte frames: 11.0 Mpps -> 11.2 Mpps
      
      CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: xsk: Use xsk_buff_alloc_batch on legacy RQ · 259bbc64
      Maxim Mikityanskiy authored
      XSK provides a function to allocate frames in batches for more efficient
      processing. This commit starts using this function on legacy RQ, adding
      a special case for XSK. The new branch introduced basically replaces the
      branch that was removed from the same place a few commits before.
      
      A check is made that DMA sync is not needed, because the batching
      allocator falls back to returning one frame when DMA sync is needed, and
      this is best handled by the loop in the standard case.
      
      Performance improvement is up to 8% in the aligned mode and up to 9% in
      the unaligned mode.
      
      Aligned mode, 2048-byte frames: 12.8 Mpps -> 13.5 Mpps
      Aligned mode, 4096-byte frames: 11.5 Mpps -> 12.4 Mpps
      Unaligned mode, 2048-byte frames: 12.2 Mpps -> 13.4 Mpps
      Unaligned mode, 3072-byte frames: 11.6 Mpps -> 12.5 Mpps
      Unaligned mode, 4096-byte frames: 11.2 Mpps -> 12.2 Mpps
      
      CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: xsk: Split out WQE allocation for legacy XSK RQ · a2e5ba24
      Maxim Mikityanskiy authored
      Allocation of XSK frames on legacy RQ may be made more efficient with a
      specialized routine that relies on certain assumptions: there is only
      one fragment, and allocation units (XSK frames) are not shared among
      multiple packets. It reduces the number of branches both in the XSK code
      and in the regular RQ, because with this approach there is only a single
      check whether it's an XSK or regular RQ.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Remove the outer loop when allocating legacy RQ WQEs · 0b482232
      Maxim Mikityanskiy authored
      Legacy RQ WQEs are allocated in a loop in small batches (8 WQEs). As
      partial batches are allowed, there is no point in having a loop within a
      loop, so the outer loop is removed, and the batch size is increased up to
      the total number of WQEs to allocate, still not smaller than 8.
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>