1. 27 Oct, 2021 15 commits
    • David S. Miller's avatar
      Merge tag 'mlx5-updates-2021-10-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · c230dc86
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2021-10-26
      
      HW-GRO support in mlx5
      
      Beside the HW GRO this series includes two trivial non-mlx5 patches:
       - net: Prevent HW-GRO and LRO features operate together
       - lib: bitmap: Introduce node-aware alloc API
      
      Khalid Manaa Says:
      ==================
      This series implements the HW-GRO offload using the HW feature SHAMPO.
      
      HW-GRO: Hardware offload for the Generic Receive Offload feature.
      
      SHAMPO: Split Headers And Merge Payload Offload.
      
      This feature performs headers data split for each received packed and
      merge the payloads of the packets of the same session.
      
      There are new HW components for this feature:
      
      The headers buffer:
      – cyclic buffer where the packets headers will be located
      
      Reservation buffer:
      – capability to divide RQ WQEs to reservations, a definite size in
        granularity of 4KB, the reservation is used to define the largest segment
        that we can create by packets stitching.
      
      Each reservation will have a session and the new received packet can be merged
      to the session, terminate it, or open a new one according to the match criteria.
      
      When a new packet is received the headers will be written to the headers buffer
      and the data will be written to the reservation, in case the packet matches
      the session the data will be written continuously otherwise it will be written
      after performing an alignment.
      
      SHAMPO RQ, WQ and CQE changes:
      -----------------------------
      RQ (receive queue) new params:
      
       -shampo_no_match_alignment_granularity: the HW alignment granularity in case
        the received packet doesn't match the current session.
      
       -shampo_match_criteria_type: the type of match criteria.
      
       -reservation_timeout: the maximum time that the HW will hold the reservation.
      
       -Each RQ has SKB that represents the current opened flow.
      
      WQ (work queue) new params:
      
       -headers_mkey: mkey that represents the headers buffer, where the packets
        headers will be written by the HW.
      
       -shampo_enable: flag to verify if the WQ supports SHAMPO feature.
      
       -log_reservation_size: the log of the reservation size where the data of
        the packet will be written by the HW.
      
       -log_max_num_of_packets_per_reservation: log of the maximum number of packets
        that can be written to the same reservation.
      
       -log_headers_entry_size: log of the header entry size of the headers buffer.
      
       -log_headers_buffer_entry_num: log of the entries number of the headers buffer.
      
      CQEs (Completion queue entry) SHAMPO fields:
      
       -match: in case it is set, then the current packet matches the opened session.
      
       -flush: in case it is set, the opened session must be flushed.
      
       -header_size: the size of the packet’s headers.
      
       -header_entry_index: the entry index in the headers buffer of the received
        packet headers.
      
       -data_offset: the offset of the received packet data in the WQE.
      
      HW-GRO works as follow:
      ----------------------
      The feature can be enabled on the interface using the ethtool command by
      setting on rx-gro-hw. When the feature is on the mlx5 driver will reopen
      the RQ to support the SHAMPO feature:
      
      Will allocate the headers buffer and fill the parameters regarding the
      reservation and the match criteria.
      
      Receive packet flow:
      
      each RQ will hold SKB that represents the current GRO opened session.
      
      The driver has a new CQE handler mlx5e_handle_rx_cqe_mpwrq_shampo which will
      use the CQE SHAMPO params to extract the location of the packet’s headers
      in the headers buffer and the location of the packets data in the RQ.
      
      Also, the CQE has two flags flush and match that indicate if the current
      packet matches the current session or not and if we need to close the session.
      
      In case there is an opened session, and we receive a matched packet then the
      handler will merge the packet's payload to the current SKB, in case we receive
      no match then the handler will flush the SKB and create a new one for the new packet.
      
      In case the flash flag is set then the driver will close the session, the SKB
      will be passed to the network stack.
      
      In case the driver merges packets in the SKB, before passing the SKB to the network
      stack the driver will update the checksum of the packet’s headers.
      
      SKB build:
      ---------
      The driver will build a new SKB in the following situations:
      in case there is no current opened session.
      In case the current packet doesn’t match the current session.
      In case there is no place to add the packets data to the SKB that represents the
      current session.
      
      Otherwise, the driver will add the packet’s data to the SKB.
      
      When the driver builds a new SKB, the linear area will contain only the packet headers
      and the data will be added to the SKB fragments.
      
      In case the entry size of the headers buffer is sufficient to build the SKB
      it will be used, otherwise the driver will allocate new memory to build the SKB.
      
      ==================
      
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c230dc86
    • Maor Dickman's avatar
      net/mlx5: Lag, Make mlx5_lag_is_multipath() be static inline · 8ca9caee
      Maor Dickman authored
      Fix "no previous prototype" W=1 warnings when CONFIG_MLX5_CORE_EN is not set:
      
        drivers/net/ethernet/mellanox/mlx5/core/lag_mp.h:34:6: error: no previous prototype for ‘mlx5_lag_is_multipath’ [-Werror=missing-prototypes]
           34 | bool mlx5_lag_is_multipath(struct mlx5_core_dev *dev) { return false; }
              |      ^~~~~~~~~~~~~~~~~~~~~
      
      Fixes: 14fe2471 ("net/mlx5: Lag, change multipath and bonding to be mutually exclusive")
      Signed-off-by: default avatarMaor Dickman <maord@nvidia.com>
      8ca9caee
    • Khalid Manaa's avatar
      net/mlx5e: Prevent HW-GRO and CQE-COMPRESS features operate together · ae345299
      Khalid Manaa authored
      HW-GRO and CQE-COMPRESS are mutually exclusive, this commit adds this
      restriction.
      Signed-off-by: default avatarKhalid Manaa <khalidm@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      ae345299
    • Khalid Manaa's avatar
      net/mlx5e: Add HW-GRO offload · 83439f3c
      Khalid Manaa authored
      This commit introduces HW-GRO offload by using the SHAMPO feature
      - Add set feature handler for HW-GRO.
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Signed-off-by: default avatarKhalid Manaa <khalidm@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      83439f3c
    • Khalid Manaa's avatar
      net/mlx5e: Add HW_GRO statistics · def09e7b
      Khalid Manaa authored
      This patch adds HW_GRO counters to RX packets statistics:
       - gro_match_packets: counter of received packets with set match flag.
      
       - gro_packets: counter of received packets over the HW_GRO feature,
                      this counter is increased by one for every received
                      HW_GRO cqe.
      
       - gro_bytes: counter of received bytes over the HW_GRO feature,
                    this counter is increased by the received bytes for every
                    received HW_GRO cqe.
      
       - gro_skbs: counter of built HW_GRO skbs,
                   increased by one when we flush HW_GRO skb
                   (when we call a napi_gro_receive with hw_gro skb).
      
       - gro_large_hds: counter of received packets with large headers size,
                        in case the packet needs new SKB, the driver will allocate
                        new one and will not use the headers entry to build it.
      Signed-off-by: default avatarKhalid Manaa <khalidm@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      def09e7b
    • Khalid Manaa's avatar
      net/mlx5e: HW_GRO cqe handler implementation · 92552d3a
      Khalid Manaa authored
      this patch updates the SHAMPO CQE handler to support HW_GRO,
      
      changes in the SHAMPO CQE handler:
      - CQE match and flush fields are used to determine if to build new skb
        using the new received packet,
        or to add the received packet data to the existing RQ.hw_gro_skb,
        also this fields are used to determine when to flush the skb.
      - in the end of the function mlx5e_poll_rx_cq the RQ.hw_gro_skb is flushed.
      Signed-off-by: default avatarKhalid Manaa <khalidm@nvidia.com>
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      92552d3a
    • Ben Ben-Ishay's avatar
      net/mlx5e: Add data path for SHAMPO feature · 64509b05
      Ben Ben-Ishay authored
      The header buffer is used to store the headers of the rx packets.
      The header buffer size deduced from WorkQueue size + restriction
      of max packets per WorkQueueElement.
      This commit adds the functionality for posting/updating memory for
      the header buffer during the posting/updating of WQEs.
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      64509b05
    • Khalid Manaa's avatar
      net/mlx5e: Add handle SHAMPO cqe support · f97d5c2a
      Khalid Manaa authored
      This patch adds the new CQE SHAMPO fields:
      - flush: indicates that we must close the current session and pass the SKB
               to the network stack.
      
      - match: indicates that the current packet matches the oppened session,
               the packet will be merge into the current SKB.
      
      - header_size: the size of the packet headers that written into the headers
                     buffer.
      
      - header_entry_index: the entry index in the headers buffer.
      
      - data_offset: packets data offset in the WQE.
      
      Also new cqe handler is added to handle SHAMPO packets:
      - The new handler uses CQE SHAMPO fields to build the SKB.
        CQE's Flush and match fields are not used in this patch, packets are not
        merged in this patch.
      Signed-off-by: default avatarKhalid Manaa <khalidm@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      f97d5c2a
    • Ben Ben-Ishay's avatar
      net/mlx5e: Add control path for SHAMPO feature · e5ca8fb0
      Ben Ben-Ishay authored
      This commit introduces the control path infrastructure for SHAMPO feature.
      
      SHAMPO feature enables packet stitching by splitting packets to
      header and payload, the header is placed on a dedicated buffer
      and the payload on the RX ring, this allows stitching the data part
      of a flow together continuously in the receive buffer.
      
      SHAMPO feature is implemented as linked list striding RQ feature.
      To support packets splitting and payload stitching:
      - Enlarge the ICOSQ and the correspond CQ to support the header buffer
        memory regions.
      - Add support to create linked list striding RQ with SHAMPO feature set
        in the open_rq function.
      - Add deallocation function and corresponded calls for SHAMPO header
        buffer.
      - Add mlx5e_create_umr_klm_mkey to support KLM mkey for the header
        buffer.
      - Rename mlx5e_create_umr_mkey to mlx5e_create_umr_mtt_mkey.
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      e5ca8fb0
    • Ben Ben-Ishay's avatar
      net/mlx5e: Add support to klm_umr_wqe · d7b896ac
      Ben Ben-Ishay authored
      This commit adds the needed definitions for using the klm_umr_wqe.
      UMR stands for user-mode memory registration, is a mechanism to alter
      address translation properties of MKEY by posting WorkQueueElement
      aka WQE on send queue.
      MKEY stands for memory key, MKEY are used to describe a region in memory that
      can be later used by HW.
      KLM stands for {Key, Length, MemVa}, KLM_MKEY is indirect MKEY that enables
      to map multiple memory spaces with different sizes in unified MKEY.
      klm_umr_wqe is a UMR that use to update a KLM_MKEY.
      SHAMPO feature uses KLM_MKEY for memory registration of his header buffer.
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      d7b896ac
    • Khalid Manaa's avatar
      net/mlx5e: Rename TIR lro functions to TIR packet merge functions · eaee12f0
      Khalid Manaa authored
      This series introduces new packet merge type, therefore rename lro
      functions to packet merge to support the new merge type:
      - Generalize + rename mlx5e_build_tir_ctx_lro to
        mlx5e_build_tir_ctx_packet_merge.
      - Rename mlx5e_modify_tirs_lro to mlx5e_modify_tirs_packet_merge.
      - Rename lro bit in mlx5_ifc_modify_tir_bitmask_bits to packet_merge.
      - Rename lro_en in mlx5e_params to packet_merge_type type and combine
        packet_merge params into one struct mlx5e_packet_merge_param.
      Signed-off-by: default avatarKhalid Manaa <khalidm@nvidia.com>
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      eaee12f0
    • Ben Ben-Ishay's avatar
      net/mlx5: Add SHAMPO caps, HW bits and enumerations · 7025329d
      Ben Ben-Ishay authored
      This commit adds SHAMPO bit to hca_cap and SHAMPO capabilities structure,
      SHAMPO related HW spec hardware fields and enumerations.
      SHAMPO stands for: split headers and merge payload offload.
      SHAMPO new fields:
      WQ:
       - headers_mkey: mkey that represents the headers buffer, where the packets
         headers will be written by the HW.
      
       - shampo_enable: flag to verify if the WQ supports SHAMPO feature.
      
       - log_reservation_size: the log of the reservation size where the data of
         the packet will be written by the HW.
      
       - log_max_num_of_packets_per_reservation: log of the maximum number of
         packets that can be written to the same reservation.
      
       - log_headers_entry_size: log of the header entry size of the headers buffer.
      
       - log_headers_buffer_entry_num: log of the entries number of the headers buffer.
      
      RQ:
       - shampo_no_match_alignment_granularity: the HW alignment granularity
         in case the received packet doesn't match the current session.
      
       - shampo_match_criteria_type: the type of match criteria.
      
       - reservation_timeout: the maximum time that the HW will hold the
         reservation.
      
      mlx5_ifc_shampo_cap_bits, the capabilities of the SHAMPO feature:
       - shampo_log_max_reservation_size: the maximum allowed value of the field
         WQ.log_reservation_size.
      
       - log_reservation_size: the minimum allowed value of the field
         WQ.log_reservation_size.
      
       - shampo_min_mss_size: the minimum payload size of packet that can open
         a new session or be merged to a session.
      
       - shampo_max_log_headers_entry_size: the maximum allowed value of the field
         WQ.log_headers_entry_size
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      7025329d
    • Ben Ben-Ishay's avatar
      net/mlx5e: Rename lro_timeout to packet_merge_timeout · 50f477fe
      Ben Ben-Ishay authored
      TIR stands for transport interface receive, the TIR object is
      responsible for performing all transport related operations on
      the receive side like packet processing, demultiplexing the packets
      to different RQ's, etc.
      lro_timeout is a field in the TIR that is used to set the timeout for lro
      session, this series introduces new packet merge type, therefore rename
      lro_timeout to packet_merge_timeout for all packet merge types.
      Signed-off-by: default avatarBen Ben-Ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      50f477fe
    • Ben Ben-ishay's avatar
      net: Prevent HW-GRO and LRO features operate together · 54b2b3ec
      Ben Ben-ishay authored
      LRO and HW-GRO are mutually exclusive, this commit adds this restriction
      in netdev_fix_feature. HW-GRO is preferred, that means in case both
      HW-GRO and LRO features are requested, LRO is cleared.
      Signed-off-by: default avatarBen Ben-ishay <benishay@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      54b2b3ec
    • Tariq Toukan's avatar
      lib: bitmap: Introduce node-aware alloc API · 7529cc7f
      Tariq Toukan authored
      Expose new node-aware API for bitmap allocation:
      bitmap_alloc_node() / bitmap_zalloc_node().
      Signed-off-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      7529cc7f
  2. 26 Oct, 2021 25 commits
    • Luo Jie's avatar
      net: phy: fixed warning: Function parameter not described · 06338cef
      Luo Jie authored
      Fixed warning: Function parameter or member 'enable' not
      described in 'genphy_c45_fast_retrain'
      Signed-off-by: default avatarLuo Jie <luoj@codeaurora.org>
      Reviewed-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20211026102957.17100-1-luoj@codeaurora.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      06338cef
    • Jakub Kicinski's avatar
      net/mlx5: remove the recent devlink params · 6b367174
      Jakub Kicinski authored
      revert commit 46ae40b9 ("net/mlx5: Let user configure io_eq_size param")
      revert commit a6cb08da ("net/mlx5: Let user configure event_eq_size param")
      revert commit 55460406 ("net/mlx5: Let user configure max_macs param")
      
      The EQE parameters are applicable to more drivers, they should
      be configured via standard API, probably ethtool. Example of
      another driver needing something similar:
      
      https://lore.kernel.org/all/1633454136-14679-3-git-send-email-sbhatta@marvell.com/
      
      The last param for "max_macs" is probably fine but the documentation
      is severely lacking. The meaning and implications for changing the
      param need to be stated.
      
      Link: https://lore.kernel.org/r/20211026152939.3125950-1-kuba@kernel.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6b367174
    • David S. Miller's avatar
      Merge branch 'phy-supported-interfaces-bitmap' · 4d2af64b
      David S. Miller authored
      Russell King says:
      
      ====================
      Introduce supported interfaces bitmap
      
      This series introduces a new bitmap to allow us to indicate which
      phy_interface_t modes are supported.
      
      Currently, phylink will call ->validate with PHY_INTERFACE_MODE_NA to
      request all link mode capabilities from the MAC driver before choosing
      an interface to use. This leads in some cases to some rather hairly
      code. This can be simplified if phylink is aware of the interface modes
      that  the MAC supports, and it can instead walk those modes, calling
      ->validate for each one, and combining the results.
      
      This series merely introduces the support; there is no change of
      behaviour until MAC drivers populate their supported_interfaces bitmap.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4d2af64b
    • Russell King (Oracle)'s avatar
      net: phylink: use supported_interfaces for phylink validation · d25f3a74
      Russell King (Oracle) authored
      If the network device supplies a supported interface bitmap, we can use
      that during phylink's validation to simplify MAC drivers in two ways by
      using the supported_interfaces bitmap to:
      
      1. reject unsupported interfaces before calling into the MAC driver.
      2. generate the set of all supported link modes across all supported
         interfaces (used mainly for SFP, but also some 10G PHYs.)
      Suggested-by: default avatarSean Anderson <sean.anderson@seco.com>
      Signed-off-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d25f3a74
    • Russell King's avatar
      net: phylink: add MAC phy_interface_t bitmap · 38c310eb
      Russell King authored
      Add a phy_interface_t bitmap so the MAC driver can specifiy which PHY
      interface modes it supports.
      Signed-off-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      38c310eb
    • Russell King (Oracle)'s avatar
      net: phy: add phy_interface_t bitmap support · 8e20f591
      Russell King (Oracle) authored
      Add support for a bitmap for phy interface modes, which includes:
      - a macro to declare the interface bitmap
      - an inline helper to zero the interface bitmap
      - an inline helper to detect an empty interface bitmap
      - inline helpers to do a bitwise AND and OR operations on two interface
        bitmaps
      Signed-off-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8e20f591
    • David S. Miller's avatar
      Merge branch 'dsa-isolation-prep' · 656bcd5d
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      DSA preparations for FDB isolation between bridges
      
      This series makes 2 small changes to DSA's SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
      handler, which will make it possible to offer switch drivers a stable
      association between a FDB entry and a bridge device in a future series.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      656bcd5d
    • Vladimir Oltean's avatar
      net: dsa: stop calling dev_hold in dsa_slave_fdb_event · 425d19ce
      Vladimir Oltean authored
      Now that we guarantee that SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events have
      finished executing by the time we leave our bridge upper interface,
      we've established a stronger boundary condition for how long the
      dsa_slave_switchdev_event_work() might run.
      
      As such, it is no longer possible for DSA slave interfaces to become
      unregistered, since they are still bridge ports.
      
      So delete the unnecessary dev_hold() and dev_put().
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      425d19ce
    • Vladimir Oltean's avatar
      net: dsa: flush switchdev workqueue when leaving the bridge · d7d0d423
      Vladimir Oltean authored
      DSA is preparing to offer switch drivers an API through which they can
      associate each FDB entry with a struct net_device *bridge_dev. This can
      be used to perform FDB isolation (the FDB lookup performed on the
      ingress of a standalone, or bridged port, should not find an FDB entry
      that is present in the FDB of another bridge).
      
      In preparation of that work, DSA needs to ensure that by the time we
      call the switch .port_fdb_add and .port_fdb_del methods, the
      dp->bridge_dev pointer is still valid, i.e. the port is still a bridge
      port.
      
      This is not guaranteed because the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE API
      requires drivers that must have sleepable context to handle those events
      to schedule the deferred work themselves. DSA does this through the
      dsa_owq.
      
      It can happen that a port leaves a bridge, del_nbp() flushes the FDB on
      that port, SWITCHDEV_FDB_DEL_TO_DEVICE is notified in atomic context,
      DSA schedules its deferred work, but del_nbp() finishes unlinking the
      bridge as a master from the port before DSA's deferred work is run.
      
      Fundamentally, the port must not be unlinked from the bridge until all
      FDB deletion deferred work items have been flushed. The bridge must wait
      for the completion of these hardware accesses.
      
      An attempt has been made to address this issue centrally in switchdev by
      making SWITCHDEV_FDB_DEL_TO_DEVICE deferred (=> blocking) at the switchdev
      level, which would offer implicit synchronization with del_nbp:
      
      https://patchwork.kernel.org/project/netdevbpf/cover/20210820115746.3701811-1-vladimir.oltean@nxp.com/
      
      but it seems that any attempt to modify switchdev's behavior and make
      the events blocking there would introduce undesirable side effects in
      other switchdev consumers.
      
      The most undesirable behavior seems to be that
      switchdev_deferred_process_work() takes the rtnl_mutex itself, which
      would be worse off than having the rtnl_mutex taken individually from
      drivers which is what we have now (except DSA which has removed that
      lock since commit 0faf890f ("net: dsa: drop rtnl_lock from
      dsa_slave_switchdev_event_work")).
      
      So to offer the needed guarantee to DSA switch drivers, I have come up
      with a compromise solution that does not require switchdev rework:
      we already have a hook at the last moment in time when the bridge is
      still an upper of ours: the NETDEV_PRECHANGEUPPER handler. We can flush
      the dsa_owq manually from there, which makes all FDB deletions
      synchronous.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d7d0d423
    • Lukas Wunner's avatar
      ifb: Depend on netfilter alternatively to tc · 046178e7
      Lukas Wunner authored
      IFB originally depended on NET_CLS_ACT for traffic redirection.
      But since v4.5, that may be achieved with NFT_FWD_NETDEV as well.
      
      Fixes: 39e6dea2 ("netfilter: nf_tables: add forward expression to the netdev family")
      Signed-off-by: default avatarLukas Wunner <lukas@wunner.de>
      Cc: <stable@vger.kernel.org> # v4.5+: bcfabee1: netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
      Cc: <stable@vger.kernel.org> # v4.5+
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      046178e7
    • Jeremy Kerr's avatar
      mctp: Implement extended addressing · 99ce45d5
      Jeremy Kerr authored
      This change allows an extended address struct - struct sockaddr_mctp_ext
      - to be passed to sendmsg/recvmsg. This allows userspace to specify
      output ifindex and physical address information (for sendmsg) or receive
      the input ifindex/physaddr for incoming messages (for recvmsg). This is
      typically used by userspace for MCTP address discovery and assignment
      operations.
      
      The extended addressing facility is conditional on a new sockopt:
      MCTP_OPT_ADDR_EXT; userspace must explicitly enable addressing before
      the kernel will consume/populate the extended address data.
      
      Includes a fix for an uninitialised var:
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarJeremy Kerr <jk@codeconstruct.com.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      99ce45d5
    • Nathan Chancellor's avatar
      net: ax88796c: Remove pointless check in ax88796c_open() · 971f5c40
      Nathan Chancellor authored
      Clang warns:
      
      drivers/net/ethernet/asix/ax88796c_main.c:851:24: error: address of
      array 'ax_local->phydev->advertising' will always evaluate to 'true'
      [-Werror,-Wpointer-bool-conversion]
              if (ax_local->phydev->advertising &&
                  ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~ ~~
      
      advertising cannot be NULL here if ax_local is not NULL, which cannot
      happen due to the check in ax88796c_probe(). Remove the check.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1492Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      971f5c40
    • Nathan Chancellor's avatar
      net: ax88796c: Fix clang -Wimplicit-fallthrough in ax88796c_set_mac() · 3c554881
      Nathan Chancellor authored
      Clang warns:
      
      drivers/net/ethernet/asix/ax88796c_main.c:696:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
              case SPEED_10:
              ^
      drivers/net/ethernet/asix/ax88796c_main.c:696:2: note: insert 'break;' to avoid fall-through
              case SPEED_10:
              ^
              break;
      drivers/net/ethernet/asix/ax88796c_main.c:706:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
              case DUPLEX_HALF:
              ^
      drivers/net/ethernet/asix/ax88796c_main.c:706:2: note: insert 'break;' to avoid fall-through
              case DUPLEX_HALF:
              ^
              break;
      
      Clang is a little more pedantic than GCC, which permits implicit
      fallthroughs to cases that contain just break or return. Clang's version
      is more in line with the kernel's own stance in deprecated.rst, which
      states that all switch/case blocks must end in either break,
      fallthrough, continue, goto, or return. Add the missing breaks to fix
      the warning.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1491Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3c554881
    • Haiyang Zhang's avatar
      net: mana: Allow setting the number of queues while the NIC is down · a137c069
      Haiyang Zhang authored
      The existing code doesn't allow setting the number of queues while the
      NIC is down.
      
      Update the ethtool handler functions to support setting the number of
      queues while the NIC is at down state.
      Signed-off-by: default avatarHaiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a137c069
    • Andreas Oetken's avatar
      net: hsr: Add support for redbox supervision frames · eafaa88b
      Andreas Oetken authored
      added support for the redbox supervision frames
      as defined in the IEC-62439-3:2018.
      Signed-off-by: default avatarAndreas Oetken <andreas.oetken@siemens-energy.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eafaa88b
    • David S. Miller's avatar
      Merge branch 'tcp_stream_alloc_skb' · 3247e3ff
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: tcp_stream_alloc_skb() changes
      
      sk_stream_alloc_skb() is only used by TCP.
      
      Rename it to tcp_stream_alloc_skb() and apply small
      optimizations.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3247e3ff
    • Eric Dumazet's avatar
      tcp: remove unneeded code from tcp_stream_alloc_skb() · c4322884
      Eric Dumazet authored
      Aligning @size argument to 4 bytes is not needed.
      
      The header alignment has nothing to do with @size.
      
      It really depends on skb->head alignment and MAX_TCP_HEADER.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c4322884
    • Eric Dumazet's avatar
      tcp: use MAX_TCP_HEADER in tcp_stream_alloc_skb · 8a794df6
      Eric Dumazet authored
      Both IPv4 and IPv6 uses same reserve, no need risking
      cache line misses to fetch its value.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8a794df6
    • Eric Dumazet's avatar
      tcp: rename sk_stream_alloc_skb · f8dd3b8d
      Eric Dumazet authored
      sk_stream_alloc_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration
      to include/net/tcp.h
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8dd3b8d
    • Eric Dumazet's avatar
      net: annotate data-race in neigh_output() · d18785e2
      Eric Dumazet authored
      neigh_output() reads n->nud_state and hh->hh_len locklessly.
      
      This is fine, but we need to add annotations and document this.
      
      We evaluate skip_cache first to avoid reading these fields
      if the cache has to by bypassed.
      
      syzbot report:
      
      BUG: KCSAN: data-race in __neigh_event_send / ip_finish_output2
      
      write to 0xffff88810798a885 of 1 bytes by interrupt on cpu 1:
       __neigh_event_send+0x40d/0xac0 net/core/neighbour.c:1128
       neigh_event_send include/net/neighbour.h:444 [inline]
       neigh_resolve_output+0x104/0x410 net/core/neighbour.c:1476
       neigh_output include/net/neighbour.h:510 [inline]
       ip_finish_output2+0x80a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       secondary_startup_64_no_verify+0xb1/0xbb
      
      read to 0xffff88810798a885 of 1 bytes by interrupt on cpu 0:
       neigh_output include/net/neighbour.h:507 [inline]
       ip_finish_output2+0x79a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       rest_init+0xee/0x100 init/main.c:734
       arch_call_rest_init+0xa/0xb
       start_kernel+0x5e4/0x669 init/main.c:1142
       secondary_startup_64_no_verify+0xb1/0xbb
      
      value changed: 0x20 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d18785e2
    • David S. Miller's avatar
      Merge branch 'mlxsw-rif-mac-prefixes' · 72b93a86
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Support multiple RIF MAC prefixes
      
      Currently, mlxsw enforces that all the netdevs used as router interfaces
      (RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1).
      Otherwise, an error is returned to user space with extack. This patchset
      relaxes the limitation through the use of RIF MAC profiles.
      
      A RIF MAC profile is a hardware entity that represents a particular MAC
      prefix which multiple RIFs can reference. Therefore, the number of
      possible MAC prefixes is no longer one, but the number of profiles
      supported by the device.
      
      The ability to change the MAC of a particular netdev is useful, for
      example, for users who use the netdev to connect to an upstream provider
      that performs MAC filtering. Currently, such users are either forced to
      negotiate with the provider or change the MAC address of all other
      netdevs so that they share the same prefix.
      
      Patchset overview:
      
      Patches #1-#3 are preparations.
      
      Patch #4 adds actual support for RIF MAC profiles.
      
      Patch #5 exposes RIF MAC profiles as a devlink resource, so that user
      space has visibility into the maximum number of profiles and current
      occupancy. Useful for debugging and testing (next 3 patches).
      
      Patches #6-#8 add both scale and functional tests.
      
      Patch #9 removes tests that validated the previous limitation. It is now
      covered by patch #6 for devices that support a single profile.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      72b93a86
    • Danielle Ratson's avatar
      selftests: mlxsw: Remove deprecated test cases · c24dbf3d
      Danielle Ratson authored
      After adding the previous patches, the constraint that all the router
      interface MAC addresses have the same prefix is no longer relevant.
      
      Remove the test cases that validated that this constraint is honored.
      Signed-off-by: default avatarDanielle Ratson <danieller@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c24dbf3d
    • Danielle Ratson's avatar
      selftests: Add an occupancy test for RIF MAC profiles · 20d446db
      Danielle Ratson authored
      When all the RIF MAC profiles are in use, test that it is possible to
      change the MAC of a netdev (i.e., a RIF) when its MAC profile is not
      shared with other RIFs. Test that replacement fails when the MAC profile
      is shared.
      Signed-off-by: default avatarDanielle Ratson <danieller@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20d446db
    • Danielle Ratson's avatar
      selftests: mlxsw: Add forwarding test for RIF MAC profiles · a10b7bac
      Danielle Ratson authored
      Verify that MAC profile changes are indeed applied and that packets are
      forwarded with the correct source MAC.
      
      Output example:
      
      $ ./rif_mac_profiles.sh
      TEST: h1->h2: new mac profile                                       [ OK ]
      TEST: h2->h1: new mac profile                                       [ OK ]
      TEST: h1->h2: edit mac profile                                      [ OK ]
      TEST: h2->h1: edit mac profile                                      [ OK ]
      Signed-off-by: default avatarDanielle Ratson <danieller@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a10b7bac
    • Danielle Ratson's avatar
      selftests: mlxsw: Add a scale test for RIF MAC profiles · 152f98e7
      Danielle Ratson authored
      Query the maximum number of supported RIF MAC profiles using
      devlink-resource and verify that all available MAC profiles can be utilized
      and that an error is generated when user space tries to exceed this number.
      
      Output example in Spectrum-2:
      
      $ TESTS='rif_mac_profile' ./resource_scale.sh
      TEST: 'rif_mac_profile' 4                                           [ OK ]
      TEST: 'rif_mac_profile' overflow 5                                  [ OK ]
      Signed-off-by: default avatarDanielle Ratson <danieller@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      152f98e7