1. 10 May, 2022 20 commits
    • net/mlx5: Lag, add debugfs to query hardware lag state · 7f46a0b7
      Mark Bloch authored
      Lag state has become very complicated with many modes, flags, types and
      port selection methods, and future work will add additional features.
      
      Add a debugfs to query the current lag state. A new directory named "lag"
      will be created under the mlx5 debugfs directory. As the driver has
      debugfs per pci function the location will be: <debugfs>/mlx5/<BDF>/lag
      
      For example:
      /sys/kernel/debug/mlx5/0000:08:00.0/lag
      
      The following files are exposed:
      
      - state: Returns "active" or "disabled". If "active" it means hardware
               lag is active.
      
      - members: Returns the BDFs of all the members of the lag object.
      
      - type: Returns the type of the lag currently configured. Valid only
      	if hardware lag is active.
      	* "roce" - Members are bare metal PFs.
      	* "switchdev" - Members are in switchdev mode.
      	* "multipath" - ECMP offloads.
      
      - port_sel_mode: Returns the egress port selection method, valid
      		 only if hardware lag is active.
      		 * "queue_affinity" - Egress port is selected by
      		   the QP/SQ affinity.
      		 * "hash" - Egress port is selected by hash done on
      		   each packet. Controlled by: xmit_hash_policy of the
      		   bond device.
      - flags: Returns flags that are specific to the lag @type. Valid only if
      	 hardware lag is active.
      	 * "shared_fdb" - "on" or "off", if "on" single FDB is used.
      
      - mapping: Returns the mapping which is used to select egress port.
      	   Valid only if hardware lag is active.
      	   If @port_sel_mode is "hash" returns the active egress ports.
      	   The hash result will select only active ports.
      	   if @port_sel_mode is "queue_affinity" returns the mapping
      	   between the configured port affinity of the QP/SQ and actual
      	   egress port. For example:
      	   * 1:1 - Mapping means if the configured affinity is port 1
      	           traffic will egress via port 1.
      	   * 1:2 - Mapping means if the configured affinity is port 1
      		   traffic will egress via port 2. This can happen
      		   if port 1 is down or in active/backup mode and port 1
      		   is backup.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, use buckets in hash mode · 352899f3
      Mark Bloch authored
      When in hardware lag, if the NIC has more than 2 ports and one port
      goes down, the traffic needs to be distributed between the remaining
      active ports.
      
      For better spread in such cases, instead of using a 1-to-1 mapping with
      only 4 slots in the hash, use many.
      
      Each port will have many slots that point to it. When a port goes down,
      go over all the slots that pointed to that port and spread them between
      the remaining active ports. Once the port comes back, restore the
      default mapping.
      
      We will have number_of_ports * MLX5_LAG_MAX_HASH_BUCKETS slots.
      Each group of MLX5_LAG_MAX_HASH_BUCKETS slots belongs to a different port.
      The native mapping is such that:
      
      port 1: The first MLX5_LAG_MAX_HASH_BUCKETS slots are: [1, 1, .., 1]
      which means if a packet is hashed into one of these slots it will hit the
      wire via port 1.
      
      port 2: The second MLX5_LAG_MAX_HASH_BUCKETS slots are: [2, 2, .., 2]
      which means if a packet is hashed into one of these slots it will hit the
      wire via port 2.
      
      and this mapping is the same for the rest of the ports.
      On a failover, let's say port 2 goes down (ports 1, 3, 4 are still up).
      The new mapping for port 2 will be:
      
      port 2: The second MLX5_LAG_MAX_HASH_BUCKETS slots are: [1, 3, 1, 4, .., 4]
      which means the mapping was changed from the native mapping to a mapping
      that consists of only the active ports.
      
      With this, if a port goes down the traffic will be split between the
      active ports randomly.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, refactor dmesg print · 24b3599e
      Mark Bloch authored
      Combine dmesg lag prints into a single function.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Support devices with more than 2 ports · 4cd14d44
      Mark Bloch authored
      Increase the define MLX5_MAX_PORTS to 4 as the driver is ready
      to support NICs with 4 ports.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, use actual number of lag ports · 7e978e77
      Mark Bloch authored
      Refactor the entire lag code to use ldev->ports instead of hard-coded
      defines (like MLX5_MAX_PORTS) for its operations.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, use hash when in roce lag on 4 ports · cdf611d1
      Mark Bloch authored
      Downstream patches will add support for lag over 4 ports.
      In that mode we will only use hash as the uplink selection method.
      Using hash instead of queue affinity (before this patch) offers key
      advantages like:
      
      - Align port selection method with the method used by the bond device
      
      - Better packets distribution where a single queue can transmit from
        multiple ports (with queue affinity a queue is bound to a single port
        regardless of the packet being sent).
      
      - In case of failover, traffic is split between multiple ports and not
        sent over a single one as in queue affinity.
      
      Going forward it was decided that queue affinity will be deprecated,
      as using hash provides a better user experience, which means that on
      4-port HCAs hash will always be used.
      
      Future work will add hash support for 2-port HCAs as well.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, support single FDB only on 2 ports · e2c45931
      Mark Bloch authored
      E-Switch currently doesn't support more than 2 E-Switch managers
      being aggregated under a single hardware lag. Have specific checks
      to disallow creating lag when the code doesn't support it.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, store number of ports inside lag object · e9d5bb51
      Mark Bloch authored
      Store the number of lag ports inside the lag object. Lag object is a single
      shared object managing the lag state of multiple mlx5 devices on the same
      physical HCA.
      
      Downstream patches will allow hardware lag to be created over devices with
      more than 2 ports.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, filter non compatible devices · bc4c2f2e
      Mark Bloch authored
      When searching for a peer lag device we can filter based on that
      device's capabilities.
      
      A downstream patch will be less strict when filtering compatible
      devices: it will remove the limitation requiring exactly
      MLX5_MAX_PORTS ports and change it to a range.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, use lag lock · ec2fa47d
      Mark Bloch authored
      Use a lag specific lock instead of depending on external locks to
      synchronise the lag creation/destruction.
      
      With this, taking E-Switch mode lock is no longer needed for syncing
      lag logic.
      
      Cleanup any dead code that is left over and don't export functions that
      aren't used outside the E-Switch core code.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, move E-Switch prerequisite check into lag code · 4202ea95
      Mark Bloch authored
      There is no need to expose an E-Switch function for something that can
      be checked with an already present API inside the lag code.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: devcom only supports 2 ports · 8a6e75e5
      Mark Bloch authored
      The devcom API is intended to be used between 2 devices only. Add this
      implied assumption into the code and check when it's not true.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Lag, expose number of lag ports · 34a30d76
      Mark Bloch authored
      Downstream patches will add support for hardware lag with
      more than 2 ports. Add a way for users to query the number of lag ports.
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Increase FW pre-init timeout for health recovery · 37ca95e6
      Gavin Li authored
      Currently, health recovery will reload driver to recover it from fatal
      errors. During the driver's load process, it would wait for FW to set the
      pre-init bit for up to 120 seconds, beyond this threshold it would abort
      the load process. In some cases, such as a FW upgrade on the DPU, this
      timeout period is insufficient, and the user has no way to recover the
      host device.
      
      To solve this issue, introduce a new FW pre-init timeout for health
      recovery, which is set to 2 hours.
      
      The timeout for devlink reload and probe will use the original one because
      they are user triggered flows, and therefore should not have a
      significantly long timeout, during which the user command would hang.
      Signed-off-by: Gavin Li <gavinl@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Add exit route when waiting for FW · 8324a02c
      Gavin Li authored
      Currently, removing a device needs to get the driver interface lock before
      doing any cleanup. If the driver is waiting in a loop for FW init, there
      is no way to cancel the wait; instead, the device cleanup waits for the
      loop to conclude and release the lock.
      
      To allow immediate response to remove device commands, check the TEARDOWN
      flag while waiting for FW init, and exit the loop if it has been set.
      Signed-off-by: Gavin Li <gavinl@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Merge branch 'nfp-support-corigine-pcie-vendor-id' · 9eab75d4
      Jakub Kicinski authored
      Simon Horman says:
      
      ====================
      nfp: support Corigine PCIE vendor ID
      
      Historically the nfp driver has supported NFP chips with Netronome's
      PCIE vendor ID. This patch extends the driver to also support NFP
      chips that have Corigine's PCIE vendor ID (0x1da8), which at this
      point are assumed to be otherwise identical from a software
      perspective.
      
      This patchset begins by cleaning up strings to make them:
      * Vendor neutral for the NFP chip
      * Relate to Corigine for the driver itself
      
      It then adds support to the driver for Corigine's PCIE vendor ID.
      ====================
      
      Link: https://lore.kernel.org/r/20220508173816.476357-1-simon.horman@corigine.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: support Corigine PCIE vendor ID · 299ba7a3
      Yu Xiao authored
      Historically the nfp driver has supported NFP chips with Netronome's
      PCIE vendor ID. This patch extends the driver to also support NFP
      chips that have Corigine's PCIE vendor ID (0x1da8), which at this
      point are assumed to be otherwise identical from a software
      perspective.
      
      Also, rename the macro definitions PCI_DEVICE_ID_NERTONEOME_NFPXXXX
      to PCI_DEVICE_ID_NFPXXXX, as they are now used in conjunction with two
      PCIE vendor IDs.
      Signed-off-by: Yu Xiao <yu.xiao@corigine.com>
      Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: vendor neutral strings for chip and Corigine in strings for driver · 34e244ea
      Yu Xiao authored
      Historically the nfp driver has supported NFP chips with Netronome's
      PCIE vendor ID. In preparation for extending the driver to also support
      NFP chips that have Corigine's PCIE vendor ID (0x1da8), make printk
      statements relating to the chip vendor neutral.
      
      An alternate approach is to set the string based on the PCI vendor ID.
      In our judgement this proved too cumbersome, so we have taken this
      simpler approach.
      
      Update strings relating to the driver to use Corigine, who have taken
      over maintenance of the driver.
      Signed-off-by: Yu Xiao <yu.xiao@corigine.com>
      Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 5bcfeb6e
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2022-05-06
      
      Marcin Szycik says:
      
      This patchset adds support for systemd defined naming scheme for port
      representors, as well as re-enables displaying PCI bus-info in ethtool.
      
      bus-info information has previously been removed from ethtool for port
      representors, as a workaround for a bug in lshw tool, where the tool would
      sometimes display wrong descriptions for port representors/PF. Now the bug
      has been fixed in lshw tool [1].
      
      Removing the workaround can be considered a regression (user might be
      running an older, unpatched version of lshw) (see [2] for discussion).
      However, calling SET_NETDEV_DEV also produces the same effect as removing
      the workaround, i.e. lshw is able to access PCI bus-info (this time not
      via ethtool, but in some other way) and the bug can occur.
      
      Adding SET_NETDEV_DEV is important, as it greatly improves netdev naming:
      port representors are named based on the PF name. Currently port
      representors are named "ethX", which might be confusing, especially when
      spawning VFs on multiple PFs. Furthermore, it's currently harder to
      determine to which PF a particular port representor belongs, as bus-info
      is not shown in ethtool.
      
      Consider the following three cases:
      
      Case 1: current code - driver workaround in place, no SET_NETDEV_DEV,
      lshw with or without fix. Port representors are not displayed because they
      don't have bus-info (the workaround), PFs are labelled correctly:
      
      $ sudo ./lshw -c net -businfo
      Bus info          Device      Class          Description
      ========================================================
      pci@0000:02:00.0  ens6f0      network        Ethernet Controller E810-XXV for SFP <-- PF
      pci@0000:02:00.1  ens6f1      network        Ethernet Controller E810-XXV for SFP
      pci@0000:02:01.0  ens6f0v0    network        Ethernet Adaptive Virtual Function <-- VF
      pci@0000:02:01.1  ens6f0v1    network        Ethernet Adaptive Virtual Function
      ...
      
      Case 2: driver workaround in place, SET_NETDEV_DEV, no lshw fix. Port
      representors have predictable names. lshw is able to get bus-info because
      of SET_NETDEV_DEV and netdevs CAN be mislabelled:
      
      $ sudo ./lshw -c net -businfo
      Bus info          Device           Class          Description
      =============================================================
      pci@0000:02:00.0  ens6f0npf0vf60   network        Ethernet Controller E810-XXV for SFP <-- mislabeled port representor
      pci@0000:02:00.1  ens6f1           network        Ethernet Controller E810-XXV for SFP
      pci@0000:02:01.0  ens6f0v0         network        Ethernet Adaptive Virtual Function
      pci@0000:02:01.1  ens6f0v1         network        Ethernet Adaptive Virtual Function
      ...
      pci@0000:02:00.0  ens6f0npf0vf26   network        Ethernet interface
      pci@0000:02:00.0  ens6f0           network        Ethernet interface <-- mislabeled PF
      pci@0000:02:00.0  ens6f0npf0vf81   network        Ethernet interface
      ...
      $ sudo ethtool -i ens6f0npf0vf60
      driver: ice
      ...
      bus-info:
      ...
      
      Output of lshw would be the same with workaround removed; it does not
      change the fact that lshw labels netdevs incorrectly, while at the same
      time it prevents ethtool from displaying potentially useful data
      (bus-info).
      
      Case 3: workaround removed, SET_NETDEV_DEV, lshw fix:
      
      $ sudo ./lshw -c net -businfo
      Bus info          Device           Class          Description
      =============================================================
      pci@0000:02:00.0  ens6f0npf0vf73   network        Ethernet Controller E810-XXV for SFP
      pci@0000:02:00.1  ens6f1           network        Ethernet Controller E810-XXV for SFP
      pci@0000:02:01.0  ens6f0v0         network        Ethernet Adaptive Virtual Function
      pci@0000:02:01.1  ens6f0v1         network        Ethernet Adaptive Virtual Function
      ...
      pci@0000:02:00.0  ens6f0npf0vf5    network        Ethernet Controller E810-XXV for SFP
      pci@0000:02:00.0  ens6f0           network        Ethernet Controller E810-XXV for SFP
      pci@0000:02:00.0  ens6f0npf0vf60   network        Ethernet Controller E810-XXV for SFP
      ...
      $ sudo ethtool -i ens6f0npf0vf73
      driver: ice
      ...
      bus-info: 0000:02:00.0
      ...
      
      In this case port representors have predictable names, netdevs are not
      mislabelled in lshw, and bus-info is shown in ethtool.
      
      [1] https://ezix.org/src/pkg/lshw/commit/9bf4e4c9c1
      [2] https://patchwork.ozlabs.org/project/intel-wired-lan/patch/20220321144731.3935-1-marcin.szycik@linux.intel.com
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
        Revert "ice: Hide bus-info in ethtool for PRs in switchdev mode"
        ice: link representors to PCI device
      ====================
      
      Link: https://lore.kernel.org/r/20220506180052.5256-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ROSE: Remove unused code and clean up some inconsistent indenting · eef0dc7e
      Jiapeng Chong authored
      Eliminate the follow smatch warning:
      
      net/rose/rose_route.c:1136 rose_node_show() warn: inconsistent
      indenting.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Link: https://lore.kernel.org/r/20220507034207.18651-1-jiapeng.chong@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 09 May, 2022 20 commits