1. 11 Feb, 2022 21 commits
    • Merge branch 'ipv6-loopback' · c002496b
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      ipv6: remove addrconf reliance on loopback
      
      Second patch in this series removes IPv6 requirement about the netns
      loopback device being the last device being dismantled.
      
      This was needed because rt6_uncached_list_flush_dev()
      and ip6_dst_ifdown() had to switch dst dev to a known
      device (loopback).
      
      Instead of loopback, we can use the (hidden) blackhole_netdev
      which is also always there.
      
      This will allow future simplifications of netdev_run_todo()
      and other parts of the stack like default_device_exit_batch().
      
      Last two patches are optimizations for both IP families.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: add (struct uncached_list)->quarantine list · 29e5375d
      Eric Dumazet authored
      This is an optimization to keep the per-cpu lists as short as possible:
      
      Whenever rt_flush_dev() changes one rtable dst.dev
      matching the disappearing device, it can transfer the object
      to a quarantine list, waiting for a final rt_del_uncached_list().
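
      The quarantine idea can be illustrated with a small Python model. This is
      only a sketch, not the kernel code: `Route`, `UncachedList` and `BLACKHOLE`
      are hypothetical stand-ins for an rtable, the per-cpu uncached list and a
      replacement device, and locking and per-cpu handling are left out. Pointing
      flushed routes at a blackhole device is an assumption made for the example.

```python
from dataclasses import dataclass, field

BLACKHOLE = "blackhole_netdev"  # hypothetical stand-in for a replacement device


@dataclass(eq=False)  # compare routes by identity, like list nodes
class Route:
    dev: str


@dataclass
class UncachedList:
    head: list = field(default_factory=list)        # live uncached routes
    quarantine: list = field(default_factory=list)  # already-flushed routes

    def add(self, rt):
        self.head.append(rt)

    def flush_dev(self, dev):
        """Re-point routes using `dev` and move them off the main list."""
        keep = []
        for rt in self.head:
            if rt.dev == dev:
                rt.dev = BLACKHOLE
                self.quarantine.append(rt)
            else:
                keep.append(rt)
        self.head = keep

    def delete(self, rt):
        """Final removal; the route may sit on either list."""
        for lst in (self.head, self.quarantine):
            if rt in lst:
                lst.remove(rt)
                return
```

      After a flush, later walks over `head` no longer visit routes that were
      already handled, which is the point of keeping the main list short.
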
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: add (struct uncached_list)->quarantine list · ba55ef81
      Eric Dumazet authored
      This is an optimization to keep the per-cpu lists as short as possible:
      
      Whenever rt6_uncached_list_flush_dev() changes one rt6_info
      matching the disappearing device, it can transfer the object
      to a quarantine list, waiting for a final rt6_uncached_list_del().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: give an IPv6 dev to blackhole_netdev · e5f80fcf
      Eric Dumazet authored
      IPv6 addrconf notifiers want the loopback device to
      be the last device being dismantled at netns deletion.
      
      This caused many limitations and workarounds.
      
      Back in linux-5.3, Mahesh added a per host blackhole_netdev
      that can be used whenever we need to make sure objects no longer
      refer to a disappearing device.
      
      If we attach to blackhole_netdev an ip6_ptr (allocate an idev),
      then we can use this special device (which is never freed)
      in place of the loopback_dev (which can be freed).
      
      This will permit improvements in netdev_run_todo() and other parts
      of the stack where we had steps to make sure loopback_dev was
      the last device to disappear.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: get rid of net->ipv6.rt6_stats->fib_rt_uncache · 2d4feb2c
      Eric Dumazet authored
      This counter has never been visible, there is little point
      trying to maintain it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dsa: mv88e6xxx: make serdes SGMII/Fiber tx amplitude configurable · 926eae60
      Holger Brunck authored
      The mv88e6352, mv88e6240 and mv88e6176 have a serdes interface. This patch
      allows the output swing to be configured to a desired value in the
      phy-handle of the port. The value, which is peak to peak, has to be
      specified in microvolts. As the chips only support eight dedicated
      values, we return EINVAL if the value in the DTS does not match one of
      these values.
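
      The exact-match lookup described above can be sketched in Python as follows.
      This is an illustration only: the eight microvolt values in `SUPPORTED_P2P_UV`
      are placeholders, not taken from the datasheet or the driver, and mapping the
      table index to the register field value is an assumption.

```python
import errno

# Placeholder values in microvolts (assumption, NOT the real chip values).
# The index in the tuple is assumed to be the register field value.
SUPPORTED_P2P_UV = (
    100000, 200000, 300000, 400000,
    500000, 600000, 700000, 800000,
)


def p2p_to_reg(microvolts):
    """Return the register value for an exact match, else -EINVAL."""
    try:
        return SUPPORTED_P2P_UV.index(microvolts)
    except ValueError:
        # No rounding to the nearest value: a mismatch is a hard error.
        return -errno.EINVAL
```

      Rejecting values that are not in the table, rather than rounding, keeps a
      typo in the DTS from silently selecting a different amplitude.
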
      Signed-off-by: Holger Brunck <holger.brunck@hitachienergy.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: phy: Add `tx-p2p-microvolt` property binding · 066c4b6b
      Marek Behún authored
      Common PHYs and network PCSes often have the possibility to specify
      peak-to-peak voltage on the differential pair - the default voltage
      sometimes needs to be changed for a particular board.
      
      Add properties `tx-p2p-microvolt` and `tx-p2p-microvolt-names` for this
      purpose. The second property is needed to specify the mode for the
      corresponding voltage in the `tx-p2p-microvolt` property, if the voltage
      is to be used only for a specific mode. More voltage-mode pairs can be
      specified.
      
      Example usage with only one voltage (it will be used for all supported
      PHY modes, the `tx-p2p-microvolt-names` property is not needed in this
      case):
      
        tx-p2p-microvolt = <915000>;
      
      Example usage with voltages for multiple modes:
      
        tx-p2p-microvolt = <915000>, <1100000>, <1200000>;
        tx-p2p-microvolt-names = "2500base-x", "usb", "pcie";
      
      Add these properties into a separate file phy/transmit-amplitude.yaml,
      which should be referenced by any binding that uses it.
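
      How a driver might pick a voltage from these two properties can be sketched
      in Python. The function name `select_tx_p2p` is hypothetical, and returning
      `None` for a mode that is not listed (meaning "keep the hardware default") is
      an assumption of this sketch, not something the binding text states.

```python
def select_tx_p2p(voltages, names, mode):
    """Pick the peak-to-peak voltage (in microvolts) for a PHY mode.

    voltages: values of `tx-p2p-microvolt`
    names:    values of `tx-p2p-microvolt-names` (may be empty)
    mode:     the PHY mode being configured, e.g. "usb"
    """
    if not voltages:
        return None
    if not names:
        # A single voltage without names applies to all supported modes.
        return voltages[0]
    if len(names) != len(voltages):
        raise ValueError("each voltage needs exactly one mode name")
    # Voltage-mode pairs are matched by position.
    return dict(zip(names, voltages)).get(mode)
```

      With the multi-mode example above, looking up "usb" yields 1100000, while
      the single-voltage example yields 915000 for any mode.
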
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Reject routes configurations that specify dsfield (tos) · b9605161
      Guillaume Nault authored
      The ->rtm_tos option is normally used to route packets based on both
      the destination address and the DS field. However it's ignored for
      IPv6 routes. Setting ->rtm_tos for IPv6 is thus invalid as the route
      is going to work only on the destination address anyway, so it won't
      behave as specified.
      Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-cleanup' · 12a8f37f
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      More aggressive DSA cleanup
      
      This series deletes some code which is apparently not needed.
      
      I've had these patches in my tree for a while, and testing on my boards
      didn't reveal any issues.
      
      Compared to the RFC v1 series, the only change is the addition of patch 3.
      https://patchwork.kernel.org/project/netdevbpf/cover/20220107184842.550334-1-vladimir.oltean@nxp.com/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: remove lockdep class for DSA slave address list · ddb44bdc
      Vladimir Oltean authored
      Since commit 2f1e8ea7 ("net: dsa: link interfaces with the DSA
      master to get rid of lockdep warnings"), suggested by Cong Wang, the
      DSA interfaces and their master have different dev->nested_level, which
      makes netif_addr_lock() stop complaining about potentially recursive
      locking on the same lock class.
      
      So we no longer need DSA slave interfaces to have their own lockdep
      class.
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: remove lockdep class for DSA master address list · 8db2bc79
      Vladimir Oltean authored
      Since commit 2f1e8ea7 ("net: dsa: link interfaces with the DSA
      master to get rid of lockdep warnings"), suggested by Cong Wang, the
      DSA interfaces and their master have different dev->nested_level, which
      makes netif_addr_lock() stop complaining about potentially recursive
      locking on the same lock class.
      
      So we no longer need DSA masters to have their own lockdep class.
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: remove ndo_get_phys_port_name and ndo_get_port_parent_id · 45b987d5
      Vladimir Oltean authored
      There are no legacy ports, DSA registers a devlink instance with ports
      unconditionally for all switch drivers. Therefore, delete the old-style
      ndo operations used for determining bridge forwarding domains.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Tested-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'smc-optimizations' · 1ea59b5e
      David S. Miller authored
      D. Wythe says:
      
      ====================
      net/smc: Optimizing performance in short-lived scenarios
      
      This patch set aims to optimize the performance of SMC in scenarios with
      short-lived links, which is quite unsatisfactory right now.
      
      In our benchmark, we test it with the following command:
      
      ./wrk -c 10000 -t 4 -H 'Connection: Close' -d 20 http://smc-server
      
      The current performance figures look like this:
      
      Running 20s test @ http://11.213.45.6
        4 threads and 10000 connections
        4956 requests in 20.06s, 3.24MB read
        Socket errors: connect 0, read 0, write 672, timeout 0
      Requests/sec:    247.07
      Transfer/sec:    165.28KB
      
      There are many reasons for this phenomenon. This patch set does not
      solve all of them, but it alleviates the problem considerably.
      
      Patch 1/5 (Make smc_tcp_listen_work() independent):
      
      Separate smc_tcp_listen_work() from smc_listen_work() and make them
      independent of each other, so that a busy SMC handshake can no longer
      affect new incoming TCP connections. This avoids discarding a large
      number of TCP connections once the queue is overstocked, which
      undoubtedly raises the connection establishment time.
      
      Patch 2/5 (Limit SMC backlog connections):
      
      Since patch 1 has separated smc_tcp_listen_work() from
      smc_listen_work(), TCP accept has become unrestricted. This
      patch puts a limit on SMC backlog connections, following the
      implementation of TCP.
      
      Patch 3/5 (Limit SMC visits when handshake workqueue congested):
      
      Considering the complexity of the SMC handshake right now, its
      performance in scenarios with short-lived links is still quite poor,
      although this may not be the main scenario for SMC. This patch
      constrains the SMC handshake when the handshake workqueue is congested,
      which in our opinion is the sign of SMC handshakes piling up.
      
      Patch 4/5 (Dynamic control handshake limitation by socket options):
      
      This patch allows applications to dynamically control SMC handshake
      limitation. Since SMC did not support setting SMC socket options
      before, this patch also has to add support for SMC's own socket options.
      
      Patch 5/5 (Add global configure for handshake limitation by netlink):
      
      This patch provides a way to benefit from handshake limitation without
      modifying any application code, which is quite useful for most
      existing applications.
      
      After this patch set, the performance figures look like this:
      
      Running 20s test @ http://11.213.45.6
        4 threads and 10000 connections
        693253 requests in 20.10s, 452.88MB read
      Requests/sec:  34488.13
      Transfer/sec:     22.53MB
      
      That is quite a good performance improvement, about 6 to 7 times in my
      environment.
      ---
      changelog:
      v1 -> v2:
      - fix compile warning
      - fix invalid dependencies in kconfig
      v2 -> v3:
      - correct spelling mistakes
      - fix useless variable declare
      v3 -> v4:
      - make smc_tcp_ls_wq be static
      v4 -> v5:
      - add dynamic control for SMC auto fallback by socket options
      - add global configure for SMC auto fallback through netlink
      v5 -> v6:
      - move auto fallback to net namespace scope
      - remove auto fallback attribute in SMC_GEN_SYS_INFO
      - add independent attributes for auto fallback
      v6 -> v7:
      - fix wording and naming issues, rename 'auto fallback' to 'handshake
        limitation'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Add global configure for handshake limitation by netlink · f9496b7c
      D. Wythe authored
      Although we can control SMC handshake limitation through socket options,
      applications that need it must modify their code, which is quite
      troublesome for many existing applications. This patch allows the
      global default value of SMC handshake limitation to be modified through
      netlink, providing a way to constrain the handshake without modifying
      any application code.
      Suggested-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Dynamic control handshake limitation by socket options · a6a6fe27
      D. Wythe authored
      This patch adds dynamic control of SMC handshake limitation for
      every SMC socket. In a production environment, the same application
      may handle different service types, which may have different
      requirements for SMC handshake limitation.
      
      This patch uses socket options to achieve this. Since there is no socket
      option level for SMC yet, we have to implement it at the same
      time.
      
      This patch does the following:
      
      - add new socket option level: SOL_SMC.
      - add new SMC socket option: SMC_LIMIT_HS.
      - provide getter/setter for SMC socket options.
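
      From user space, the new option could be exercised roughly as in the Python
      sketch below. The names SOL_SMC and SMC_LIMIT_HS come from the list above;
      the numeric values used here (AF_SMC = 43, SOL_SMC = 286, SMC_LIMIT_HS = 1,
      SMCPROTO_SMC = 0) are assumptions that should be checked against the kernel
      headers of the target system. The helper name `set_smc_limit_hs` is
      hypothetical.

```python
import socket

# Assumed numeric values; verify against the kernel headers in use.
AF_SMC = 43
SMCPROTO_SMC = 0
SOL_SMC = 286
SMC_LIMIT_HS = 1


def set_smc_limit_hs(enable):
    """Set SMC_LIMIT_HS on a fresh SMC socket and read it back.

    Returns the value read back as a bool, or None when SMC sockets or
    the option are not available on this system.
    """
    try:
        with socket.socket(AF_SMC, socket.SOCK_STREAM, SMCPROTO_SMC) as sock:
            sock.setsockopt(SOL_SMC, SMC_LIMIT_HS, int(enable))
            return bool(sock.getsockopt(SOL_SMC, SMC_LIMIT_HS))
    except OSError:
        # No SMC support, missing privileges, or an older kernel.
        return None
```

      A real server would presumably set the option on its listening socket
      rather than on a throwaway one; the sketch only shows the setter/getter
      round trip.
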
      
      Link: https://lore.kernel.org/all/20f504f961e1a803f85d64229ad84260434203bd.1644323503.git.alibuda@linux.alibaba.com/
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Limit SMC visits when handshake workqueue congested · 48b6190a
      D. Wythe authored
      This patch provides a mechanism to constrain incoming SMC
      connections according to the pressure on the SMC handshake process.
      At present, frequent connection attempts cause incoming connections to
      be backlogged in the SMC handshake queue, which raises the connection
      establishment time. That is quite unacceptable for applications based
      on short-lived connections.
      
      There are two ways to implement this mechanism:
      
      1. Put limitation after TCP established.
      2. Put limitation before TCP established.
      
      In the first way, we need to wait for and receive the CLC messages that
      the client will potentially send, and then actively reply with a decline
      message. In a sense, that is also a sort of SMC handshake, and it affects
      the connection establishment time along the way.
      
      In the second way, the only problem is that we need to inject SMC logic
      into TCP when it is about to reply to the incoming SYN. Since we already
      do that, it no longer seems to be a problem. The advantage is obvious: few
      additional steps are required to apply the constraint.
      
      This patch uses the second way. After this patch, connections beyond the
      constraint will not be given any SMC indication, and SMC will not be
      involved in any of their subsequent processing.
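
      The decision taken when a SYN arrives can be modelled with a few lines of
      Python. This is a toy model under stated assumptions, not the kernel logic:
      `HandshakeQueue`, its `max_active` threshold and `on_syn` are hypothetical,
      and "congested" is reduced to a simple counter comparison.

```python
from dataclasses import dataclass


@dataclass
class HandshakeQueue:
    pending: int = 0     # SMC handshakes currently queued
    max_active: int = 4  # hypothetical congestion threshold

    def congested(self):
        return self.pending >= self.max_active


def on_syn(queue, limit_hs):
    """Decide, before TCP is established, how to answer an incoming SYN.

    Returns "smc" when the reply carries the SMC indication, or "tcp"
    when the connection continues as plain TCP without SMC involvement.
    """
    if limit_hs and queue.congested():
        return "tcp"
    queue.pending += 1  # an SMC handshake will be queued for this peer
    return "smc"
```

      Because the check happens before the SYN is answered, a limited connection
      never reaches the CLC exchange and adds nothing to the handshake queue.
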
      
      Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Limit backlog connections · 8270d9c2
      D. Wythe authored
      The current implementation does not handle backlog semantics. One
      potential risk is that the server will be flooded by an unlimited number
      of connections, even if the client is SMC-incapable.
      
      This patch puts a limit on backlog connections. Following the
      TCP implementation, we divide SMC connections into two categories:
      
      1. Half SMC connections, which include all connections where TCP is
      established but SMC is not.
      
      2. Full SMC connections, which include all SMC established connections.
      
      For half SMC connections, since every half SMC connection starts with TCP
      being established, we can achieve our goal by putting a limit before TCP
      is established. As in the TCP implementation, this limit is based
      not only on the half SMC connections but also on the full connections,
      so it is also a constraint on full SMC connections.
      
      For full SMC connections, although we know exactly where they start, it is
      quite hard to put a limit before that point. The easiest way is to block
      and wait before receiving the SMC confirm CLC message, but that step is
      protected by smc_server_lgr_pending, a global lock, which would apply the
      limit to the entire host instead of a single listen socket. Another way is
      to drop the full connections, but considering the cost of SMC connections,
      we prefer to keep full SMC connections.
      
      Even so, the limit on full SMC connections still exists, see commits
      about half SMC connection below.
      
      After this patch, the limits on backend connections look like this:
      
      For SMC:
      
      1. A client with SMC capability can make at most 2 * backlog full SMC
         connections, or 1 * backlog half SMC connections and 1 * backlog
         full SMC connections.
      
      2. A client without SMC capability can only make 1 * backlog half TCP
         connections and 1 * backlog full TCP connections.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Make smc_tcp_listen_work() independent · 3079e342
      D. Wythe authored
      In a multithreaded benchmark with 10K connections, the backend TCP
      connections are established very slowly, and lots of TCP connections
      stay in the SYN_SENT state.
      
      Client: smc_run wrk -c 10000 -t 4 http://server
      
      The netstat output on the server host looks like this:
          145042 times the listen queue of a socket overflowed
          145042 SYNs to LISTEN sockets dropped
      
      One reason for this issue is that smc_tcp_listen_work() shares
      the same workqueue (smc_hs_wq) with smc_listen_work(), while
      smc_listen_work() blocks waiting for the SMC connection to be
      established. Once the workqueue becomes congested, it blocks the
      accept() on the TCP listen socket.
      
      This patch creates an independent workqueue (smc_tcp_ls_wq) for
      smc_tcp_listen_work(), separating it from smc_listen_work(), which is
      quite acceptable considering that smc_tcp_listen_work() runs very fast.
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: dsa: realtek: convert to YAML schema, add MDIO · 429c83c7
      Luiz Angelo Daros de Luca authored
      Schema changes:
      
      - support for mdio-connected switches (mdio driver), recognized by
        checking the presence of property "reg"
      - new compatible strings for rtl8367s and rtl8367rb
      - "interrupt-controller" was not added as a required property. It might
        still work polling the ports when missing.
      
      Example changes:
      
      - renamed "switch_intc" to make it unique between examples
      - removed "dsa-mdio" from mdio compatible property
      - renamed phy@0 to ethernet-phy@0 (not tested with real HW)
        phy@ requires #phy-cells
      Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 5b91c5cc
      Jakub Kicinski authored
      No conflicts.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'net-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f1baf68e
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from netfilter and can.
      
        Current release - new code bugs:
      
         - sparx5: fix get_stat64 out-of-bound access and crash
      
         - smc: fix netdev ref tracker misuse
      
        Previous releases - regressions:
      
         - eth: ixgbevf: require large buffers for build_skb on 82599VF, avoid
           overflows
      
         - eth: ocelot: fix all IP traffic getting trapped to CPU with PTP
           over IP
      
         - bonding: fix rare link activation misses in 802.3ad mode
      
        Previous releases - always broken:
      
         - tcp: fix tcp sock mem accounting in zero-copy corner cases
      
         - remove the cached dst when uncloning an skb dst and its metadata,
           since we only have one ref it'd lead to an UaF
      
         - netfilter:
            - conntrack: don't refresh sctp entries in closed state
            - conntrack: re-init state for retransmitted syn-ack, avoid
              connection establishment getting stuck with strange stacks
            - ctnetlink: disable helper autoassign, avoid it getting lost
            - nft_payload: don't allow transport header access for fragments
      
         - dsa: fix use of devres for mdio throughout drivers
      
         - eth: amd-xgbe: disable interrupts during pci removal
      
         - eth: dpaa2-eth: unregister netdev before disconnecting the PHY
      
         - eth: ice: fix IPIP and SIT TSO offload"
      
      * tag 'net-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
        net: dsa: mv88e6xxx: fix use-after-free in mv88e6xxx_mdios_unregister
        net: mscc: ocelot: fix mutex lock error during ethtool stats read
        ice: Avoid RTNL lock when re-creating auxiliary device
        ice: Fix KASAN error in LAG NETDEV_UNREGISTER handler
        ice: fix IPIP and SIT TSO offload
        ice: fix an error code in ice_cfg_phy_fec()
        net: mpls: Fix GCC 12 warning
        dpaa2-eth: unregister the netdev before disconnecting from the PHY
        skbuff: cleanup double word in comment
        net: macb: Align the dma and coherent dma masks
        mptcp: netlink: process IPv6 addrs in creating listening sockets
        selftests: mptcp: add missing join check
        net: usb: qmi_wwan: Add support for Dell DW5829e
        vlan: move dev_put into vlan_dev_uninit
        vlan: introduce vlan_dev_free_egress_priority
        ax25: fix UAF bugs of net_device caused by rebinding operation
        net: dsa: fix panic when DSA master device unbinds on shutdown
        net: amd-xgbe: disable interrupts during pci removal
        tipc: rate limit warning for received illegal binding update
        net: mdio: aspeed: Add missing MODULE_DEVICE_TABLE
        ...
  2. 10 Feb, 2022 19 commits