1. 20 Apr, 2022 10 commits
    • net: atlantic: Implement .ndo_xdp_xmit handler · 45638f01
      Taehee Yoo authored
      aq_xdp_xmit() is the callback function of .ndo_xdp_xmit.
      It internally calls aq_nic_xmit_xdpf() to send packets.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: atlantic: Implement xdp data plane · 26efaef7
      Taehee Yoo authored
      It supports XDP_PASS, XDP_DROP, and multi-buffer.
      
      The new function aq_nic_xmit_xdpf() is used to send packets with an
      xdp_frame, and internally it calls aq_nic_map_xdp().
      
      The AQC chip supports 32 multi-queues and 8 vectors (IRQs).
      There are two options:
      1. up to 8 cores, with 4 tx queues per core.
      2. up to 4 cores, with 8 tx queues per core.
      
      Like ixgbe, these tx queues could be reserved for use only as XDP_TX and
      XDP_REDIRECT queues; in that case, no tx_lock would be needed.
      But this patchset doesn't use that strategy, because the cost of getting
      the hardware tx queue index is too high.
      So, tx_lock is used in aq_nic_xmit_xdpf().
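      The locking trade-off above can be modeled in user space; a minimal Python sketch (TxRing and xmit_xdpf are illustrative names, not the driver's) of serializing concurrent senders on one shared ring with a per-ring tx_lock:

```python
# A user-space model of why aq_nic_xmit_xdpf() takes tx_lock: XDP transmit
# and the regular stack can target the same hardware tx ring, so descriptor
# writes must be serialized. TxRing/xmit_xdpf are illustrative names.
import threading

class TxRing:
    def __init__(self):
        self.lock = threading.Lock()       # the per-ring tx_lock
        self.descriptors = []

    def xmit_xdpf(self, frame):
        with self.lock:                    # serialize concurrent senders
            self.descriptors.append(frame)

ring = TxRing()
threads = [threading.Thread(
               target=lambda i=i: [ring.xmit_xdpf((i, j)) for j in range(1000)])
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(ring.descriptors))   # 4000: no descriptor writes were lost
```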
      
      Single core, single queue, 80% CPU utilization:
      
        30.75%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
        10.35%  [kernel]                  [k] aq_hw_read_reg <---------- here
         4.38%  [kernel]                  [k] get_page_from_freelist
      
      Single core, 8 queues, 100% CPU utilization, half the PPS:
      
        45.56%  [kernel]                  [k] aq_hw_read_reg <---------- here
        17.58%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
         4.72%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive
      
      The new function __aq_ring_xdp_clean() is an XDP RX handler; it is
      called only when XDP is attached.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: atlantic: Implement xdp control plane · 0d14657f
      Taehee Yoo authored
      aq_xdp() is the XDP setup callback function for the Atlantic driver.
      When XDP is attached or detached, the device will be restarted, because
      it uses different headroom, tailroom, and page order values.
      
      If XDP is enabled, the default page order value is switched from 0 to 2,
      because the default maximum frame size is still 2K and additional area
      is needed for headroom and tailroom.
      The total size (headroom + frame size + tailroom) is 2624 bytes.
      So, with order-0 pages, 1472 bytes would always be wasted for every frame.
      But when order-2 pages are used, each page can be used 6 times
      with the flip strategy.
      That means only about 106 bytes per frame are wasted.
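      The waste figures follow from simple arithmetic, assuming a 4 KiB base page:

```python
# Arithmetic behind the quoted waste figures, assuming a 4 KiB base page.
PAGE_SIZE = 4096
FRAME_TOTAL = 2624                 # headroom + 2K frame + tailroom

# Order-0: one frame per page, the rest of the page is wasted.
waste_order0 = PAGE_SIZE - FRAME_TOTAL
print(waste_order0)                # 1472

# Order-2: a 16384-byte page flipped across multiple frames.
page_order2 = PAGE_SIZE << 2
frames_per_page = page_order2 // FRAME_TOTAL
waste_order2 = (page_order2 - frames_per_page * FRAME_TOTAL) // frames_per_page
print(frames_per_page, waste_order2)   # 6 106
```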
      
      It also supports the XDP fragment feature.
      The MTU can be 16K if the XDP prog supports fragments.
      If not, the MTU cannot exceed 2K - ETH_HLEN - ETH_FCS.
      
      A static key is added; it will be used to call the xdp_clean
      handler in ->poll(). The data plane implementation is contained in
      the following patch.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-cross-chip-notifier-cleanup' · 8ab38ed7
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      DSA cross-chip notifier cleanups
      
      This patch set makes the following improvements:
      
      - Cross-chip notifiers pass a switch index, port index, sometimes tree
        index, all as integers. Sometimes we need to recover the struct
        dsa_port based on those integers. That recovery involves traversing a
        list. By passing directly a pointer to the struct dsa_port we can
        avoid that, and the indices passed previously can still be obtained
        from the passed struct dsa_port.
      
      - Resetting VLAN filtering on a switch has explicit code to make it run
        on a single switch, so it has no place to stay in the cross-chip
        notifier code. Move it out.
      
      - Changing the MTU on a user port affects only that single port, yet the
        code passes through the cross-chip notifier layer where all switches
        are notified. Avoid that.
      
      - Other related cosmetic changes in the MTU changing procedure.
      
      Apart from the slight improvement in performance given by
      (a) doing less work in cross-chip notifiers
      (b) emitting fewer cross-chip notifiers
      we also end up with about 100 fewer lines of code.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: don't emit targeted cross-chip notifiers for MTU change · be6ff966
      Vladimir Oltean authored
      A cross-chip notifier with "targeted_match=true" is one that matches
      only the local port of the switch that emitted it. In other words,
      passing through the cross-chip notifier layer serves no purpose.
      
      Eliminate this concept by directly calling ds->ops->port_change_mtu
      instead of emitting a targeted cross-chip notifier. This leaves the
      DSA_NOTIFIER_MTU event being emitted only for MTU updates on the CPU
      port, which need to be reflected across all DSA links as well.
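      The refactor can be pictured as replacing a broadcast-and-match loop with a direct ops call; a loose Python model (Switch and notify_mtu_targeted are hypothetical names, not the kernel's) of the before and after:

```python
# A loose model of the change: the "targeted" notifier visited every switch
# only for all but one to ignore the event, so a direct ops call does the
# same work without the traversal. Switch and method names are illustrative.

class Switch:
    def __init__(self, index):
        self.index = index
        self.mtu_calls = []

    def port_change_mtu(self, port, mtu):      # models ds->ops->port_change_mtu
        self.mtu_calls.append((port, mtu))

def notify_mtu_targeted(switches, target_sw, port, mtu):
    """Old path: broadcast, with targeted_match letting only one switch act."""
    for sw in switches:
        if sw.index == target_sw:
            sw.port_change_mtu(port, mtu)

switches = [Switch(i) for i in range(3)]
notify_mtu_targeted(switches, 1, port=2, mtu=9000)   # old: walk all switches
switches[1].port_change_mtu(2, 9000)                 # new: direct call
print(switches[1].mtu_calls)   # [(2, 9000), (2, 9000)]
```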
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: drop dsa_slave_priv from dsa_slave_change_mtu · 4715029f
      Vladimir Oltean authored
      We can get a hold of the "ds" pointer directly from "dp", no need for
      the dsa_slave_priv.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: avoid one dsa_to_port() in dsa_slave_change_mtu · cf1c39d3
      Vladimir Oltean authored
      We could retrieve the cpu_dp pointer directly from the "dp" we already
      have, no need to resort to dsa_to_port(ds, port).
      
      This change also removes the need for an "int port", so that is also
      deleted.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: use dsa_tree_for_each_user_port in dsa_slave_change_mtu · b2033a05
      Vladimir Oltean authored
      Use the more conventional iterator over user ports instead of explicitly
      ignoring them, and use the more conventional name "other_dp" instead of
      "dp_iter", for readability.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: make cross-chip notifiers more efficient for host events · 726816a1
      Vladimir Oltean authored
      To determine whether a given port should react to the port targeted by
      the notifier, dsa_port_host_vlan_match() and dsa_port_host_address_match()
      look at the positioning of the switch port currently executing the
      notifier relative to the switch port for which the notifier was emitted.
      
      To maintain stylistic compatibility with the other match functions from
      switch.c, the host address and host VLAN match functions take the
      notifier information about targeted port, switch and tree indices as
      argument. However, these functions only use that information to retrieve
      the struct dsa_port *targeted_dp, which is an invariant for the outer
      loop that calls them. So it makes more sense to calculate the targeted
      dp only once, and pass it to them as argument.
      
      Furthermore, the targeted dp is actually known at the time the call
      to dsa_port_notify() is made. It is just that we decide to only save the
      indices of the port, switch and tree in the notifier structure, just to
      retrace our steps and find the dp again using dsa_switch_find() and
      dsa_to_port().
      
      But both the above functions are relatively expensive, since they need
      to iterate through lists. It appears more straightforward to make all
      notifiers just pass the targeted dp inside their info structure, and
      have the code that needs the indices to look at info->dp->index instead
      of info->port, or info->dp->ds->index instead of info->sw_index, or
      info->dp->ds->dst->index instead of info->tree_index.
      
      For the sake of consistency, all cross-chip notifiers are converted to
      pass the "dp" directly.
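      The pointer-instead-of-indices idea can be sketched in a few lines; the classes below loosely model struct dsa_port, dsa_switch, and dsa_switch_tree, and are not the kernel structures:

```python
# Sketch of the refactor: instead of carrying tree/switch/port indices in the
# notifier info and re-finding the dsa_port via list walks, carry the port
# itself; the indices remain reachable through its back-pointers.

class Tree:                       # loosely models struct dsa_switch_tree
    def __init__(self, index):
        self.index = index

class Switch:                     # ds->dst points at the tree
    def __init__(self, index, dst):
        self.index, self.dst = index, dst

class Port:                       # dp->ds points at the switch
    def __init__(self, index, ds):
        self.index, self.ds = index, ds

dp = Port(index=3, ds=Switch(index=2, dst=Tree(index=0)))

old_info = {"port": 3, "sw_index": 2, "tree_index": 0}   # three integers
new_info = {"dp": dp}                                    # one pointer
derived = {"port": new_info["dp"].index,
           "sw_index": new_info["dp"].ds.index,
           "tree_index": new_info["dp"].ds.dst.index}
print(derived == old_info)   # True
```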
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: move reset of VLAN filtering to dsa_port_switchdev_unsync_attrs · 8e9e678e
      Vladimir Oltean authored
      In dsa_port_switchdev_unsync_attrs() there is a comment that resetting
      the VLAN filtering isn't done where it is expected. And since commit
      108dc874 ("net: dsa: Avoid cross-chip syncing of VLAN filtering"),
      there is no reason to handle this in switch.c either.
      
      Therefore, move the logic to port.c, and adapt it slightly to the data
      structures and naming conventions from there.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 19 Apr, 2022 8 commits
    • Merge branch 'rtnetlink-improve-alt_ifname-config-and-fix-dangerous-group-usage' · cc4bdef2
      Paolo Abeni authored
      Florent Fourcot says:
      
      ====================
      rtnetlink: improve ALT_IFNAME config and fix dangerous GROUP usage
      
      The first commit forbids dangerous calls when both IFNAME and GROUP
      are given, since they can introduce unexpected behaviour when IFNAME
      does not match any interface.
      
      The second patch achieves the primary goal of this patchset: fixing
      the IFLA_ALT_IFNAME attribute, since the previous code never worked
      for newlink/setlink. The ip-link command probably resolves the
      interface index beforehand, and so was not using this feature.
      
      The last two patches improve error codes in corner cases.
      
      Changes in v2:
        * Remove ifname argument in rtnl_dev_get/do_setlink
          functions (simplify code)
        * Use a boolean to avoid condition duplication in __rtnl_newlink
      
      Changes in v3:
        * Simplify rtnl_dev_get signature
      
      Changes in v4:
        * Rename link_lookup to link_specified
      
      Changes in v5:
        * Re-order patches
      ====================
      
      Link: https://lore.kernel.org/r/20220415165330.10497-1-florent.fourcot@wifirst.fr
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • rtnetlink: return EINVAL when request cannot succeed · b6177d32
      Florent Fourcot authored
      A request without an interface name, interface index, or interface
      group cannot work. We should return EINVAL.
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellink · dee04163
      Florent Fourcot authored
      If IFLA_ALT_IFNAME is set and the given interface is not found,
      we should return ENODEV and be consistent with the IFLA_IFNAME
      behaviour.
      This commit extends the feature of commit 76c9ac0e,
      "net: rtnetlink: add possibility to use alternative names as message handle".
      
      CC: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • rtnetlink: enable alt_ifname for setlink/newlink · 5ea08b52
      Florent Fourcot authored
      The buffer called "ifname" given to the rtnl_dev_get function
      is always valid when called from setlink/newlink,
      but contains only an empty string when IFLA_IFNAME is not given, so
      IFLA_ALT_IFNAME is always ignored.
      
      This patch fixes the rtnl_dev_get function by removing the ifname
      argument and moving the ifname copy into do_setlink when required.
      
      It extends the feature of commit 76c9ac0e,
      "net: rtnetlink: add possibility to use alternative names as message
      handle".
      
      CC: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • rtnetlink: return ENODEV when ifname does not exist and group is given · ef2a7c90
      Florent Fourcot authored
      When the interface does not exist and a group is given, the given
      parameters are applied to all interfaces of the given group, and the
      given IFLA_IFNAME/IFLA_ALT_IFNAME is ignored in that case.
      
      That can be dangerous, since a typo (or a deleted interface) can
      produce weird side effects for the caller:
      
      Case 1:
      
       IFLA_IFNAME=valid_interface
       IFLA_GROUP=1
       MTU=1234
      
      Case 1 will update MTU and group of the given interface "valid_interface".
      
      Case 2:
      
       IFLA_IFNAME=doesnotexist
       IFLA_GROUP=1
       MTU=1234
      
      Case 2 will update the MTU of all interfaces in group 1. IFLA_IFNAME
      is ignored in this case.
      
      This behaviour is inconsistent and dangerous. In order to fix this
      issue, we now return ENODEV when the given IFNAME does not exist.
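      The fixed lookup order can be modeled in user space; a hypothetical Python sketch of the semantics described across these patches (resolve_request and the dictionaries are illustrative, not kernel code):

```python
# A hypothetical user-space model of the lookup semantics after this series:
# a given name must resolve (ENODEV otherwise), the group is used only when
# no name is given, and an empty request is rejected (EINVAL).
ENODEV, EINVAL = 19, 22

def resolve_request(interfaces, groups, ifname=None, group=None):
    """Return ('dev', dev), ('group', members), or ('err', errno)."""
    if ifname is not None:
        dev = interfaces.get(ifname)
        if dev is None:
            return ("err", ENODEV)   # no silent fallback to the group
        return ("dev", dev)
    if group is not None:
        return ("group", groups.get(group, []))
    return ("err", EINVAL)           # no name, index, or group given

interfaces = {"valid_interface": "eth-obj"}
groups = {1: ["a", "b"]}
print(resolve_request(interfaces, groups, "valid_interface", 1))  # ('dev', 'eth-obj')
print(resolve_request(interfaces, groups, "doesnotexist", 1))     # ('err', 19)
print(resolve_request(interfaces, groups))                        # ('err', 22)
```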
      Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
      Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'net-sched-allow-user-to-select-txqueue' · 8b11c35d
      Paolo Abeni authored
      Tonghao Zhang says:
      
      ====================
      net: sched: allow user to select txqueue
      
      From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      
      Patch 1 allows the user to select the txqueue in the clsact hook.
      Patch 2 supports using skbhash to select the txqueue.
      ====================
      
      Link: https://lore.kernel.org/r/20220415164046.26636-1-xiangxia.m.yue@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: sched: support hash selecting tx queue · 38a6f086
      Tonghao Zhang authored
      This patch allows users to pick a queue_mapping range, from A to B,
      so we can load-balance packets across tx queues A to B. The range is
      an unsigned 16-bit value in decimal format.
      
      $ tc filter ... action skbedit queue_mapping skbhash A B
      
      "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
      is enhanced with a new flag, SKBEDIT_F_TXQ_SKBHASH.
      
        +----+      +----+      +----+
        | P1 |      | P2 |      | Pn |
        +----+      +----+      +----+
          |           |           |
          +-----------+-----------+
                      |
                      | clsact/skbedit
                      |      MQ
                      v
          +-----------+-----------+
          | q0        | qn        | qm
          v           v           v
        HTB/FQ       FIFO   ...  FIFO
      
      For example:
      if P1 sends packets to different Pods on another host, and we want to
      distribute those flows over queues qn - qm, then we can use skb->hash
      as the hash.
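      As described above, the hash is folded into the inclusive range [A, B]; a small Python sketch of that mapping (pick_txqueue is an illustrative name, not the kernel function):

```python
# Sketch of the mapping as described: fold skb->hash into the inclusive
# queue range [A, B], spreading flows over B - A + 1 tx queues.
def pick_txqueue(skb_hash, a, b):
    return a + skb_hash % (b - a + 1)

A, B = 2, 6
queues = {pick_txqueue(h, A, B) for h in range(1000)}
print(sorted(queues))   # [2, 3, 4, 5, 6]
```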
      
      setup commands:
      $ NETDEV=eth0
      $ ip netns add n1
      $ ip link add ipv1 link $NETDEV type ipvlan mode l2
      $ ip link set ipv1 netns n1
      $ ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up
      
      $ tc qdisc add dev $NETDEV clsact
      $ tc filter add dev $NETDEV egress protocol ip prio 1 \
              flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping skbhash 2 6
      $ tc qdisc add dev $NETDEV handle 1: root mq
      $ tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
      $ tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
      $ tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
      $ tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
      $ tc qdisc add dev $NETDEV parent 1:3 pfifo
      $ tc qdisc add dev $NETDEV parent 1:4 pfifo
      $ tc qdisc add dev $NETDEV parent 1:5 pfifo
      $ tc qdisc add dev $NETDEV parent 1:6 pfifo
      $ tc qdisc add dev $NETDEV parent 1:7 pfifo
      
      $ ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10
      
      Tx queues are picked from 2 - 6:
      $ ethtool -S $NETDEV | grep -i tx_queue_[0-9]_bytes
           tx_queue_0_bytes: 42
           tx_queue_1_bytes: 0
           tx_queue_2_bytes: 11442586444
           tx_queue_3_bytes: 7383615334
           tx_queue_4_bytes: 3981365579
           tx_queue_5_bytes: 3983235051
           tx_queue_6_bytes: 6706236461
           tx_queue_7_bytes: 42
           tx_queue_8_bytes: 0
           tx_queue_9_bytes: 0
      
      txqueues 2 - 6 are mapped to classid 1:3 - 1:7
      $ tc -s class show dev $NETDEV
      ...
      class mq 1:3 root leaf 8002:
       Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:4 root leaf 8003:
       Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:5 root leaf 8004:
       Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:6 root leaf 8005:
       Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      class mq 1:7 root leaf 8006:
       Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      ...
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Antoine Tenart <atenart@kernel.org>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: sched: use queue_mapping to pick tx queue · 2f1e85b1
      Tonghao Zhang authored
      This patch fixes an issue:
      * If we install tc filters with act_skbedit in the clsact hook,
        they don't work, because netdev_core_pick_tx() overwrites
        queue_mapping.
      
        $ tc filter ... action skbedit queue_mapping 1
      
      And this patch is useful:
      * We can use FQ + EDT to implement efficient policies. Tx queues
        are picked by XPS, the ndo_select_queue of the netdev driver, or the
        skb hash in netdev_core_pick_tx(). In fact, the netdev driver and
        skb hash are _not_ under our control. XPS uses the CPU map to select
        Tx queues, but in most cases we can't figure out which task_struct
        of a pod/container is running on a given CPU. We can use clsact
        filters to classify one pod/container's traffic to one Tx queue. Why?
      
        In a container networking environment, there are two kinds of pod/
        container/net-namespace. For one kind (e.g. P1, P2), high throughput
        is key for the applications, but to avoid running out of network
        resources, the outbound traffic of these pods is limited, using or
        sharing dedicated Tx queues assigned an HTB/TBF/FQ Qdisc. For the
        other kind of pods (e.g. Pn), low latency of data access is key, and
        the traffic is not limited; these pods use or share other dedicated
        Tx queues assigned a FIFO Qdisc. This choice provides two benefits.
        First, contention on the HTB/FQ Qdisc lock is significantly reduced,
        since fewer CPUs contend for the same queue. More importantly, Qdisc
        contention can be eliminated completely if each CPU has its own FIFO
        Qdisc for the second kind of pods.
      
        There must be a mechanism in place to support classifying traffic
        from pods/containers into different Tx queues. Note that clsact runs
        outside of the Qdisc, while a Qdisc can run a classifier to select a
        sub-queue under its lock.
      
        In general recording the decision in the skb seems a little heavy handed.
        This patch introduces a per-CPU variable, suggested by Eric.
      
        The xmit.skip_txqueue flag is first cleared in __dev_queue_xmit().
        - A tx Qdisc may install skbedit actions; then the xmit.skip_txqueue
          flag is set in qdisc->enqueue(), even though a tx queue has already
          been selected in netdev_tx_queue_mapping() or netdev_core_pick_tx().
          Clearing that flag first in __dev_queue_xmit() is useful:
        - It avoids picking the tx queue with netdev_tx_queue_mapping() in the
          next netdev in a case such as: eth0 macvlan - eth0.3 vlan - eth0
          ixgbe-phy. For example, eth0, a macvlan in a pod whose root Qdisc
          installs skbedit queue_mapping, sends packets to eth0.3, a vlan in
          the host. In the __dev_queue_xmit() of eth0.3, the flag is cleared,
          and the tx queue is not selected according to skb->queue_mapping,
          because there are no filters in the clsact or tx Qdisc of this
          netdev. The same action is taken in eth0, the ixgbe device in the
          host.
        - It avoids picking the tx queue for the next packet. If we set
          xmit.skip_txqueue in the tx Qdisc (qdisc->enqueue()), the proper way
          to clear it is to clear it in __dev_queue_xmit() when processing the
          next packets.
      
        For performance reasons, a static key is used. If the user does not
        enable NET_EGRESS in the kernel config, the patch is not compiled in.
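        The clear-then-honor handshake can be loosely modeled in user space; the sketch below is a simplification with hypothetical names, not the kernel control flow:

```python
# A loose user-space model of the xmit.skip_txqueue handshake described
# above; names and control flow are simplified, not the kernel's.
xmit = {"skip_txqueue": False}        # stands in for the per-CPU variable

def qdisc_enqueue_with_skbedit(skb, queue):
    skb["queue_mapping"] = queue      # the skbedit action picked the queue
    xmit["skip_txqueue"] = True       # tell the xmit path not to re-pick

def dev_queue_xmit(skb, dev_has_skbedit):
    # The flag is always cleared first, so a flag set by a previous netdev
    # in a stacked setup (macvlan -> vlan -> phy) cannot leak into this one.
    xmit["skip_txqueue"] = False
    if dev_has_skbedit:
        qdisc_enqueue_with_skbedit(skb, queue=1)
    if xmit["skip_txqueue"]:
        return ("mapping", skb["queue_mapping"])   # honor skbedit's choice
    return ("core_pick", 0)           # fall back to netdev_core_pick_tx()

print(dev_queue_xmit({}, dev_has_skbedit=True))    # ('mapping', 1)
print(dev_queue_xmit({}, dev_has_skbedit=False))   # ('core_pick', 0)
```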
      
        +----+      +----+      +----+
        | P1 |      | P2 |      | Pn |
        +----+      +----+      +----+
          |           |           |
          +-----------+-----------+
                      |
                      | clsact/skbedit
                      |      MQ
                      v
          +-----------+-----------+
          | q0        | q1        | qn
          v           v           v
        HTB/FQ      HTB/FQ  ...  FIFO
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Lobakin <alobakin@pm.me>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Antoine Tenart <atenart@kernel.org>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  3. 18 Apr, 2022 20 commits
    • docs: net: dsa: describe issues with checksum offload · a997157e
      Luiz Angelo Daros de Luca authored
      DSA tags placed before the IP header (categories 1 and 2) or after the
      payload (category 3) might introduce checksum offload issues.
      Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-line-card' · 2a38de06
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Introduce line card support for modular switch
      
      Jiri says:
      
      This patchset introduces support for modular switch systems and also
      introduces mlxsw support for NVIDIA Mellanox SN4800 modular switch.
      It contains 8 slots to accommodate line cards - replaceable PHY modules
      which may contain gearboxes.
      Currently supported line card:
      16X 100GbE (QSFP28)
      Other line cards that are going to be supported:
      8X 200GbE (QSFP56)
      4X 400GbE (QSFP-DD)
      There may be other types of line cards added in the future.
      
      To be consistent with the port split configuration (splitter cables),
      the line card entities are treated in a similar way. The nature of
      a line card is not "a pluggable device", but "a pluggable PHY module".
      
      A concept of "provisioning" is introduced. The user may "provision"
      a certain slot with a line card type. The driver then creates all
      instances (devlink ports, netdevices, etc.) related to this line card
      type. It does not matter whether the line card is plugged in at the
      time. The user is able to configure netdevices, devlink ports, set up
      port splitters, etc. From the perspective of the switch ASIC, everything
      is present and can be configured.
      
      The carrier of the netdevices stays down if the line card is not
      plugged in. Once the line card is inserted and activated, the carrier
      of the related netdevices reflects the physical line state, the same
      as for an ordinary fixed port.
      
      Once the user no longer wants to use the line card related instances,
      they can "unprovision" the slot. The driver then removes the
      instances.
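      The provision/activate/unprovision lifecycle described above can be sketched as a small state machine; a user-space Python model with illustrative names, not the mlxsw implementation:

```python
# A user-space model of the slot lifecycle described above: unprovisioned ->
# provisioned (instances created, card may be absent) -> active (card plugged
# and powered), and back to unprovisioned. All names are illustrative.
class Slot:
    def __init__(self, supported):
        self.supported = supported
        self.state, self.lc_type, self.instances = "unprovisioned", None, []

    def provision(self, lc_type):
        if lc_type not in self.supported:
            raise ValueError("unsupported line card type")
        self.lc_type = lc_type
        # Instances exist and are configurable even with no card plugged in.
        self.instances = ["port%d" % i for i in range(1, 17)]
        self.state = "provisioned"

    def activate(self):                  # card inserted and powered up
        if self.state == "provisioned":
            self.state = "active"

    def unprovision(self):
        self.state, self.lc_type, self.instances = "unprovisioned", None, []

slot = Slot(supported=["16x100G"])
slot.provision("16x100G")
slot.activate()
print(slot.state, len(slot.instances))   # active 16
slot.unprovision()
print(slot.state)                        # unprovisioned
```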
      
      Patches 1-4 extend the devlink driver API and UAPI in order to
      register, show, dump, provision, and activate line cards.
      Patches 5-17 implement the introduced API in mlxsw.
      The last patch adds a selftest for mlxsw line cards.
      
      Example:
      $ devlink port # No ports are listed
      $ devlink lc
      pci/0000:01:00.0:
        lc 1 state unprovisioned
          supported_types:
             16x100G
        lc 2 state unprovisioned
          supported_types:
             16x100G
        lc 3 state unprovisioned
          supported_types:
             16x100G
        lc 4 state unprovisioned
          supported_types:
             16x100G
        lc 5 state unprovisioned
          supported_types:
             16x100G
        lc 6 state unprovisioned
          supported_types:
             16x100G
        lc 7 state unprovisioned
          supported_types:
             16x100G
        lc 8 state unprovisioned
          supported_types:
             16x100G
      
      Note that the driver exposes a list of supported line card types.
      Currently there is only one: "16x100G".
      
      To provision slot #8:
      
      $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
      $ devlink lc show pci/0000:01:00.0 lc 8
      pci/0000:01:00.0:
        lc 8 state active type 16x100G
          supported_types:
             16x100G
      $ devlink port
      pci/0000:01:00.0/0: type notset flavour cpu port 0 splittable false
      pci/0000:01:00.0/53: type eth netdev enp1s0nl8p1 flavour physical lc 8 port 1 splittable true lanes 4
      pci/0000:01:00.0/54: type eth netdev enp1s0nl8p2 flavour physical lc 8 port 2 splittable true lanes 4
      pci/0000:01:00.0/55: type eth netdev enp1s0nl8p3 flavour physical lc 8 port 3 splittable true lanes 4
      pci/0000:01:00.0/56: type eth netdev enp1s0nl8p4 flavour physical lc 8 port 4 splittable true lanes 4
      pci/0000:01:00.0/57: type eth netdev enp1s0nl8p5 flavour physical lc 8 port 5 splittable true lanes 4
      pci/0000:01:00.0/58: type eth netdev enp1s0nl8p6 flavour physical lc 8 port 6 splittable true lanes 4
      pci/0000:01:00.0/59: type eth netdev enp1s0nl8p7 flavour physical lc 8 port 7 splittable true lanes 4
      pci/0000:01:00.0/60: type eth netdev enp1s0nl8p8 flavour physical lc 8 port 8 splittable true lanes 4
      pci/0000:01:00.0/61: type eth netdev enp1s0nl8p9 flavour physical lc 8 port 9 splittable true lanes 4
      pci/0000:01:00.0/62: type eth netdev enp1s0nl8p10 flavour physical lc 8 port 10 splittable true lanes 4
      pci/0000:01:00.0/63: type eth netdev enp1s0nl8p11 flavour physical lc 8 port 11 splittable true lanes 4
      pci/0000:01:00.0/64: type eth netdev enp1s0nl8p12 flavour physical lc 8 port 12 splittable true lanes 4
      pci/0000:01:00.0/125: type eth netdev enp1s0nl8p13 flavour physical lc 8 port 13 splittable true lanes 4
      pci/0000:01:00.0/126: type eth netdev enp1s0nl8p14 flavour physical lc 8 port 14 splittable true lanes 4
      pci/0000:01:00.0/127: type eth netdev enp1s0nl8p15 flavour physical lc 8 port 15 splittable true lanes 4
      pci/0000:01:00.0/128: type eth netdev enp1s0nl8p16 flavour physical lc 8 port 16 splittable true lanes 4
      
      To unprovision slot #8:
      
      $ devlink lc set pci/0000:01:00.0 lc 8 notype
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Introduce devlink line card provision/unprovision/activation tests · e1fad951
      Jiri Pirko authored
      Introduce basic line card manipulation which consists of provisioning,
      unprovisioning and activation of a line card.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Add port to linecard mapping · 6445eef0
      Jiri Pirko authored
      For each port, get the slot_index using the PMLP register. For ports
      residing on a line card, associate the port with the line card by
      setting the mapping using the devlink_port_linecard_set() helper. Use
      the line card slot index for PMTDB register queries.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jiri Pirko's avatar
      mlxsw: core: Extend driver ops by remove selected ports op · 45bf3b72
      Jiri Pirko authored
For the line card implementation, the core needs a way to remove
relevant ports manually. Extend the Spectrum driver ops with an op
that removes selected ports upon request.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      45bf3b72
    • Jiri Pirko's avatar
      mlxsw: core_linecards: Implement line card activation process · ee7a70fa
      Jiri Pirko authored
Allow processing of events generated when a line card becomes "ready"
and "active".

When a DSDSC event with the "ready" bit set is delivered, the line
card is powered up. Use the MDDC register to push the line card to the
active state. Once the FW is done with that, a DSDSC event with the
"active" bit set is delivered.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ee7a70fa
    • Jiri Pirko's avatar
      mlxsw: core_linecards: Add line card objects and implement provisioning · b217127e
      Jiri Pirko authored
Introduce objects for line cards and infrastructure around them.
Use devlink_linecard_create/destroy() to register the line card with
the devlink core. Implement provisioning ops with a list of supported
line cards.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b217127e
    • Jiri Pirko's avatar
      mlxsw: reg: Add Management Binary Code Transfer Register · 5bade5aa
      Jiri Pirko authored
The MBCT register allows transferring binary INI code from the host to
the management FW in chunks of at most 1KB.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5bade5aa
    • Jiri Pirko's avatar
      mlxsw: reg: Add Management DownStream Device Control Register · 5290a8ff
      Jiri Pirko authored
The MDDC register allows controlling downstream devices and line cards.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5290a8ff
    • Jiri Pirko's avatar
      mlxsw: reg: Add Management DownStream Device Query Register · 505f524d
      Jiri Pirko authored
The MDDQ register allows querying the DownStream device properties.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      505f524d
    • Jiri Pirko's avatar
      mlxsw: spectrum: Introduce port mapping change event processing · b0ec003e
      Jiri Pirko authored
Register the PMLPE trap and process the port mapping changes it
delivers by creating the related ports. Note that this happens after
provisioning: the INI of the line card is processed and merged by the
FW, and a PMLPE is generated for each port. Process this mapping
change.

The layout of PMLPE is the same as that of PMLP.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b0ec003e
    • Jiri Pirko's avatar
      mlxsw: Narrow the critical section of devl_lock during ports creation/removal · adc64623
      Jiri Pirko authored
No need to hold the lock for alloc and free, so narrow the critical
section. A follow-up patch is going to benefit from this by adding
more code to the functions, which will be out of the critical section
as well.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      adc64623
    • Jiri Pirko's avatar
      mlxsw: reg: Add Ports Mapping Event Configuration Register · ebf0c534
      Jiri Pirko authored
The PMECR register is used to enable/disable event triggering
in case of a local port mapping change.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ebf0c534
    • Jiri Pirko's avatar
      mlxsw: spectrum: Allocate port mapping array of structs instead of pointers · d3ad2d88
      Jiri Pirko authored
Instead of an array of pointers to port mapping structures, allocate
an array of the structures directly.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d3ad2d88
    • Jiri Pirko's avatar
      mlxsw: spectrum: Allow lane to start from non-zero index · bac62191
      Jiri Pirko authored
So far, the lane index always started from zero. That is not true for
modular systems with gearbox-equipped line cards. Loosen the check so
the lanes can start from a non-zero index.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bac62191
    • Jiri Pirko's avatar
      devlink: add port to line card relationship set · b8375859
      Jiri Pirko authored
In order to properly inform the user about the relationship between a
port and a line card, introduce a driver API to set the line card for
a port. Use this information to extend the port devlink netlink
message with the line card index, and also include the line card index
in phys_port_name, and thereby in the netdevice name.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b8375859
    • Jiri Pirko's avatar
      devlink: implement line card active state · fc9f50d5
      Jiri Pirko authored
Allow the driver to mark a line card as active. Expose this state to
userspace over the devlink netlink interface with proper
notifications. The 'active' state means that the line card was plugged
in after being provisioned.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc9f50d5
    • Jiri Pirko's avatar
      devlink: implement line card provisioning · fcdc8ce2
      Jiri Pirko authored
In order to be able to configure everything needed on a port/netdevice
of a line card without the line card being present, introduce line
card provisioning. By setting a type, the provisioning process starts
and the driver is supposed to create placeholder instances
(ports/netdevices) for that line card type.

Allow the user to query the supported line card types over the line
card get command. Then implement the netlink SET command to allow the
user to set/unset the card type.

On the driver API side, add provision/unprovision ops and an array of
supported types to be advertised. Upon a provision op call, the driver
should take care of creating the instances for the particular line
card type. Introduce provision_set/clear() functions to be called by
the driver once the provisioning/unprovisioning is done on its side;
due to the async nature of provisioning, completion is signalled via
these helpers rather than by the op's return value.
      
      Example:
      $ devlink port # No ports are listed
      $ devlink lc
      pci/0000:01:00.0:
        lc 1 state unprovisioned
          supported_types:
             16x100G
        lc 2 state unprovisioned
          supported_types:
             16x100G
        lc 3 state unprovisioned
          supported_types:
             16x100G
        lc 4 state unprovisioned
          supported_types:
             16x100G
        lc 5 state unprovisioned
          supported_types:
             16x100G
        lc 6 state unprovisioned
          supported_types:
             16x100G
        lc 7 state unprovisioned
          supported_types:
             16x100G
        lc 8 state unprovisioned
          supported_types:
             16x100G
      
      $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
      $ devlink lc show pci/0000:01:00.0 lc 8
      pci/0000:01:00.0:
        lc 8 state active type 16x100G
          supported_types:
             16x100G
      $ devlink port
      pci/0000:01:00.0/0: type notset flavour cpu port 0 splittable false
      pci/0000:01:00.0/53: type eth netdev enp1s0nl8p1 flavour physical lc 8 port 1 splittable true lanes 4
      pci/0000:01:00.0/54: type eth netdev enp1s0nl8p2 flavour physical lc 8 port 2 splittable true lanes 4
      pci/0000:01:00.0/55: type eth netdev enp1s0nl8p3 flavour physical lc 8 port 3 splittable true lanes 4
      pci/0000:01:00.0/56: type eth netdev enp1s0nl8p4 flavour physical lc 8 port 4 splittable true lanes 4
      pci/0000:01:00.0/57: type eth netdev enp1s0nl8p5 flavour physical lc 8 port 5 splittable true lanes 4
      pci/0000:01:00.0/58: type eth netdev enp1s0nl8p6 flavour physical lc 8 port 6 splittable true lanes 4
      pci/0000:01:00.0/59: type eth netdev enp1s0nl8p7 flavour physical lc 8 port 7 splittable true lanes 4
      pci/0000:01:00.0/60: type eth netdev enp1s0nl8p8 flavour physical lc 8 port 8 splittable true lanes 4
      pci/0000:01:00.0/61: type eth netdev enp1s0nl8p9 flavour physical lc 8 port 9 splittable true lanes 4
      pci/0000:01:00.0/62: type eth netdev enp1s0nl8p10 flavour physical lc 8 port 10 splittable true lanes 4
      pci/0000:01:00.0/63: type eth netdev enp1s0nl8p11 flavour physical lc 8 port 11 splittable true lanes 4
      pci/0000:01:00.0/64: type eth netdev enp1s0nl8p12 flavour physical lc 8 port 12 splittable true lanes 4
      pci/0000:01:00.0/125: type eth netdev enp1s0nl8p13 flavour physical lc 8 port 13 splittable true lanes 4
      pci/0000:01:00.0/126: type eth netdev enp1s0nl8p14 flavour physical lc 8 port 14 splittable true lanes 4
      pci/0000:01:00.0/127: type eth netdev enp1s0nl8p15 flavour physical lc 8 port 15 splittable true lanes 4
      pci/0000:01:00.0/128: type eth netdev enp1s0nl8p16 flavour physical lc 8 port 16 splittable true lanes 4
      
      $ devlink lc set pci/0000:01:00.0 lc 8 notype
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fcdc8ce2
    • Jiri Pirko's avatar
      devlink: add support to create line card and expose to user · c246f9b5
      Jiri Pirko authored
Extend the devlink API so the driver is able to create and destroy
line card instances. There can be multiple line cards per devlink
device. Expose this new type of object over the devlink netlink API to
userspace, with notifications.
      Signed-off-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c246f9b5
    • Eric Dumazet's avatar
      tcp: fix signed/unsigned comparison · 843f7740
      Eric Dumazet authored
      Kernel test robot reported:
      
      smatch warnings:
      net/ipv4/tcp_input.c:5966 tcp_rcv_established() warn: unsigned 'reason' is never less than zero.
      
      I actually had one packetdrill failing because of this bug,
      and was about to send the fix :)
      
      v2: Andreas Schwab also pointed out that @reason needs to be negated
          before we reach tcp_drop_reason()
      
      Fixes: 4b506af9 ("tcp: add two drop reasons for tcp_ack()")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Reported-by: default avatarAndreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      843f7740
  4. 17 Apr, 2022 2 commits