1. 13 May, 2022 21 commits
    • sfc: siena: Fix Kconfig dependencies · f9a210c7
      Ren Zhijie authored
      If CONFIG_PTP_1588_CLOCK=m and CONFIG_SFC_SIENA=y, the siena driver will fail to link:
      
      drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_remove_channel':
      ptp.c:(.text+0xa28): undefined reference to `ptp_clock_unregister'
      drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_probe_channel':
      ptp.c:(.text+0x13a0): undefined reference to `ptp_clock_register'
      ptp.c:(.text+0x1470): undefined reference to `ptp_clock_unregister'
      drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_pps_worker':
      ptp.c:(.text+0x1d29): undefined reference to `ptp_clock_event'
      drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_siena_ptp_get_ts_info':
      ptp.c:(.text+0x301b): undefined reference to `ptp_clock_index'
      
      To fix this build error, make SFC_SIENA depend on PTP_1588_CLOCK.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Fixes: d48523cb ("sfc: Copy shared files needed for Siena (part 2)")
      Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Link: https://lore.kernel.org/r/20220513012721.140871-1-renzhijie2@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f9a210c7
    • Merge branch 'bnxt_en-next' · a65cc843
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: Updates for net-next
      
      This small patchset updates the firmware interface, adds timestamping
      support for all receive packets, and adds revised NVRAM package error
      messages for ethtool and devlink.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a65cc843
    • bnxt_en: parse and report result field when NVRAM package install fails · ab0bed4b
      Kalesh AP authored
      Instead of always returning -ENOPKG, decode the firmware error
      code further when the HWRM_NVM_INSTALL_UPDATE firmware call fails.
      Return a more suitable error code to userspace and log an error
      in dmesg.
      
      This is version 2 of the earlier patch that was reverted:
      
      02acd399 ("bnxt_en: parse result field when NVRAM package install fails")
      
      In this new version, if the call is made through devlink instead of
      ethtool, we'll also set the error message in extack.
      
      Link: https://lore.kernel.org/netdev/20220307141358.4d52462e@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
      Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab0bed4b
    • bnxt_en: Enable packet timestamping for all RX packets · 66ed81dc
      Pavan Chebbi authored
      Add driver support to enable timestamping on all RX packets
      that are received by the NIC. This capability can be requested
      by the applications using SIOCSHWTSTAMP ioctl with filter type
      HWTSTAMP_FILTER_ALL.
      
      Cc: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66ed81dc
    • bnxt_en: Configure ptp filters during bnxt open · 11862689
      Pavan Chebbi authored
      For correctness, we need to configure the packet filters for timestamping
      during bnxt_open.  This way they are always configured after firmware
      reset or chip reset.  We should not assume that the filters will always
      be retained across resets.
      
      This patch modifies the ioctl handler and always configures the PTP
      filters in the bnxt_open() path.
      
      Cc: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11862689
    • bnxt_en: Update firmware interface to 1.10.2.95 · ad04cc05
      Michael Chan authored
      The main changes are timestamp support for all RX packets and new PCIe
      statistics.
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad04cc05
    • net: axienet: Use NAPI for TX completion path · 9e2bc267
      Robert Hancock authored
      This driver was using the TX IRQ handler to perform all TX completion
      tasks. Under heavy TX network load, this can cause significant irqs-off
      latencies (found to be in the hundreds of microseconds using ftrace).
      This can cause other issues, such as overrunning serial UART FIFOs when
      using high baud rates with limited UART FIFO sizes.
      
      Switch to using a NAPI poll handler to perform the TX completion work
      to get this out of hard IRQ context and avoid the IRQ latency impact.
      A separate poll handler is used for TX and RX since they have separate
      IRQs on this controller, so that the completion work for each of them
      stays on the same CPU as the interrupt.
      
      Testing on a Xilinx MPSoC ZU9EG platform using iperf3 from a Linux PC
      through a switch at 1G link speed showed no significant change in TX or
      RX throughput, with approximately 941 Mbps before and after. Hard IRQ
      time in the TX throughput test was significantly reduced from 12% to
      below 1% on the CPU handling TX interrupts, with total hard+soft IRQ CPU
      usage dropping from about 56% down to 48%.
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e2bc267
    • net: axienet: Be more careful about updating tx_bd_tail · f0cf4000
      Robert Hancock authored
      The axienet_start_xmit function was updating the tx_bd_tail variable
      multiple times, with potential rollbacks on error or invalid
      intermediate positions, even though this variable is also used in the
      TX completion path. Use READ_ONCE where this variable is read and
      WRITE_ONCE where it is written to make this update more atomic, and
      move the write before the MMIO write to start the transfer, so it is
      protected by that implicit write barrier.
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f0cf4000
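The idea of publishing the tail only once can be sketched in userspace as follows. This is a simplified model and not the driver's code: the `READ_ONCE`/`WRITE_ONCE` macros are stand-ins for the kernel ones, and `struct tx_ring`, `free_count`, `RING_SIZE` and `tx_ring_reserve()` are hypothetical names. Only `tx_bd_tail` is taken from the commit message.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel macros (simplified; __typeof__ is a
 * GCC/Clang extension). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define RING_SIZE 8u   /* hypothetical ring size */

/* Hypothetical ring state for illustration. */
struct tx_ring {
    uint32_t tx_bd_tail;   /* next free descriptor, also read by completion */
    uint32_t free_count;   /* descriptors currently available */
};

/* Reserve n descriptors. The new tail is computed in a local variable and
 * published with a single WRITE_ONCE, so a concurrent reader never sees an
 * intermediate or rolled-back value. Returns 0 on success, -1 if full. */
int tx_ring_reserve(struct tx_ring *ring, uint32_t n)
{
    uint32_t tail = READ_ONCE(ring->tx_bd_tail);
    uint32_t i;

    if (n > ring->free_count)
        return -1;              /* error path: shared tail never touched */

    for (i = 0; i < n; i++)     /* a real driver fills descriptors here */
        tail = (tail + 1) % RING_SIZE;

    ring->free_count -= n;
    /* Single visible update; in the driver this precedes the MMIO write
     * that starts the transfer. */
    WRITE_ONCE(ring->tx_bd_tail, tail);
    return 0;
}
```

Working on a local copy means the error path needs no rollback, which is the property the commit message is after.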
    • inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH() · 4915d50e
      Eric Dumazet authored
      INET_MATCH() runs without holding a lock on the socket.
      
      We probably need to annotate most reads.
      
      This patch makes INET_MATCH() an inline function
      to ease our changes.
      
      v2:
      
      We remove the 32bit version of it, as modern compilers
      should generate the same code really, no need to
      try to be smarter.
      
      Also make 'struct net *net' the first argument.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4915d50e
    • selftests: fib_nexthops: Make the test more robust · 49bb39bd
      Amit Cohen authored
      Some of the test cases fail on rare occasions. Make the test more robust by
      increasing the timeout of ping commands to 5 seconds.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49bb39bd
    • Merge branch 'lan95xx-no-polling' · b7da9c6b
      David S. Miller authored
      Lukas Wunner says:
      
      ====================
      Polling be gone on LAN95xx
      
      Do away with link status polling on LAN95xx USB Ethernet
      and rely on interrupts instead, thereby reducing bus traffic,
      CPU overhead and improving interface bringup latency.
      
      Link to v2:
      https://lore.kernel.org/netdev/cover.1651574194.git.lukas@wunner.de/
      
      Only change since v2:
      
      * Patch [5/7]:
        * Drop call to __irq_enter_raw() which worked around a warning in
          generic_handle_domain_irq().  That warning is gone since
          792ea6a0 (queued on tip.git/irq/urgent).
          (Marc Zyngier, Thomas Gleixner)
      ====================
      b7da9c6b
    • net: phy: smsc: Cope with hot-removal in interrupt handler · 1e7b81ed
      Lukas Wunner authored
      If reading the Interrupt Source Flag register fails with -ENODEV, then
      the PHY has been hot-removed and the correct response is to bail out
      instead of throwing a WARN splat and attempting to suspend the PHY.
      The PHY should be stopped in due course anyway as the kernel
      asynchronously tears down the device.
      
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500
      Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e7b81ed
    • net: phy: smsc: Cache interrupt mask · 7e8b617e
      Lukas Wunner authored
      Cache the interrupt mask to avoid re-reading it from the PHY upon every
      interrupt.
      
      This will simplify a subsequent commit which detects hot-removal in the
      interrupt handler and bails out.
      
      Analyzing and debugging PHY transactions also becomes simpler if such
      redundant reads are avoided.
      
      Last but not least, interrupt overhead and latency are slightly improved.
      
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500
      Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e8b617e
    • usbnet: smsc95xx: Forward PHY interrupts to PHY driver to avoid polling · 1ce8b372
      Lukas Wunner authored
      Link status of SMSC LAN95xx chips is polled once per second, even though
      they're capable of signaling PHY interrupts through the MAC layer.
      
      Forward those interrupts to the PHY driver to avoid polling.  Benefits
      are reduced bus traffic, reduced CPU overhead and quicker interface
      bringup.
      
      Polling was introduced in 2016 by commit d69d1694 ("usbnet:
      smsc95xx: fix link detection for disabled autonegotiation").
      Back then, the LAN95xx driver neglected to enable the ENERGYON interrupt,
      hence couldn't detect link-up events when auto-negotiation was disabled.
      The proper solution would have been to enable the ENERGYON interrupt
      instead of polling.
      
      Since then, PHY handling was moved from the LAN95xx driver to the SMSC
      PHY driver with commit 05b35e7e ("smsc95xx: add phylib support").
      That PHY driver is capable of link detection with auto-negotiation
      disabled because it enables the ENERGYON interrupt.
      
      Note that signaling interrupts through the MAC layer not only works with
      the integrated PHY, but also with an external PHY, provided its
      interrupt pin is attached to LAN95xx's nPHY_INT pin.
      
      In the unlikely event that the interrupt pin of an external PHY is
      attached to a GPIO of the SoC (or not connected at all), the driver can
      be amended to retrieve the irq from the PHY's of_node.
      
      To forward PHY interrupts to phylib, it is not sufficient to call
      phy_mac_interrupt().  Instead, the PHY's interrupt handler needs to run
      so that PHY interrupts are cleared.  That's because according to page
      119 of the LAN950x datasheet, "The source of this interrupt is a level.
      The interrupt persists until it is cleared in the PHY."
      
      https://www.microchip.com/content/dam/mchp/documents/UNG/ProductDocuments/DataSheets/LAN950x-Data-Sheet-DS00001875D.pdf
      
      Therefore, create an IRQ domain with a single IRQ for the PHY.  In the
      future, the IRQ domain may be extended to support the 11 GPIOs on the
      LAN95xx.
      
      Normally the PHY interrupt should be masked until the PHY driver has
      cleared it.  However masking requires a (sleeping) USB transaction and
      interrupts are received in (non-sleepable) softirq context.  I decided
      not to mask the interrupt at all (by using the dummy_irq_chip's noop
      ->irq_mask() callback):  The USB interrupt endpoint is polled in 1 msec
      intervals and normally that's sufficient to wake the PHY driver's IRQ
      thread and have it clear the interrupt.  If it does take longer, worst
      thing that can happen is the IRQ thread is woken again.  No big deal.
      
      Because PHY interrupts are now perpetually enabled, there's no need to
      selectively enable them on suspend.  So remove all invocations of
      smsc95xx_enable_phy_wakeup_interrupts().
      
      In smsc95xx_resume(), move the call of phy_init_hw() before
      usbnet_resume() (which restarts the status URB) to ensure that the PHY
      is fully initialized when an interrupt is handled.
      
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500
      Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch> # from a PHY perspective
      Cc: Andre Edich <andre.edich@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ce8b372
    • usbnet: smsc95xx: Avoid link settings race on interrupt reception · 8960f878
      Lukas Wunner authored
      When a PHY interrupt is signaled, the SMSC LAN95xx driver updates the
      MAC full duplex mode and PHY flow control registers based on cached data
      in struct phy_device:
      
        smsc95xx_status()                 # raises EVENT_LINK_RESET
          usbnet_deferred_kevent()
            smsc95xx_link_reset()         # uses cached data in phydev
      
      Simultaneously, phylib polls link status once per second and updates
      that cached data:
      
        phy_state_machine()
          phy_check_link_status()
            phy_read_status()
              lan87xx_read_status()
                genphy_read_status()      # updates cached data in phydev
      
      If smsc95xx_link_reset() wins the race against genphy_read_status(),
      the registers may be updated based on stale data.
      
      E.g. if the link was previously down, phydev->duplex is set to
      DUPLEX_UNKNOWN and that's what smsc95xx_link_reset() will use, even
      though genphy_read_status() may update it to DUPLEX_FULL afterwards.
      
      PHY interrupts are currently only enabled on suspend to trigger wakeup,
      so the impact of the race is limited, but we're about to enable them
      perpetually.
      
      Avoid the race by delaying execution of smsc95xx_link_reset() until
      phy_state_machine() has done its job and calls back via
      smsc95xx_handle_link_change().
      
      Signaling EVENT_LINK_RESET on wakeup is not necessary because phylib
      picks up link status changes through polling.  So drop the declaration
      of a ->link_reset() callback.
      
      Note that the semicolon on a line by itself added in smsc95xx_status()
      is a placeholder for a function call which will be added in a subsequent
      commit.  That function call will actually handle the INT_ENP_PHY_INT_
      interrupt.
      
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500
      Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8960f878
    • usbnet: smsc95xx: Don't reset PHY behind PHY driver's back · 14021da6
      Lukas Wunner authored
      smsc95xx_reset() resets the PHY behind the PHY driver's back, which
      seems like a bad idea generally.  Remove that portion of the function.
      
      We're about to use PHY interrupts instead of polling to detect link
      changes on SMSC LAN95xx chips.  Because smsc95xx_reset() is called from
      usbnet_open(), PHY interrupt settings are lost whenever the net_device
      is brought up.
      
      There are two other callers of smsc95xx_reset(), namely smsc95xx_bind()
      and smsc95xx_reset_resume(), and both may indeed benefit from a PHY
      reset.  However they already perform one through their calls to
      phy_connect_direct() and phy_init_hw().
      
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500
      Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Martyn Welch <martyn.welch@collabora.com>
      Cc: Gabriel Hojda <ghojda@yo2urs.ro>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14021da6
    • usbnet: smsc95xx: Don't clear read-only PHY interrupt · 3108871f
      Lukas Wunner authored
      Upon receiving data from the Interrupt Endpoint, the SMSC LAN95xx driver
      attempts to clear the signaled interrupts by writing "all ones" to the
      Interrupt Status Register.
      
      However the driver only ever enables a single type of interrupt, namely
      the PHY Interrupt.  And according to page 119 of the LAN950x datasheet,
      its bit in the Interrupt Status Register is read-only.  There's no other
      way to clear it than in a separate PHY register:
      
      https://www.microchip.com/content/dam/mchp/documents/UNG/ProductDocuments/DataSheets/LAN950x-Data-Sheet-DS00001875D.pdf
      
      Consequently, writing "all ones" to the Interrupt Status Register is
      pointless and can be dropped.
      
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500
      Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3108871f
    • usbnet: Run unregister_netdev() before unbind() again · d1408f6b
      Lukas Wunner authored
      Commit 2c9d6c2b ("usbnet: run unbind() before unregister_netdev()")
      sought to fix a use-after-free on disconnect of USB Ethernet adapters.
      
      It turns out that a different fix is necessary to address the issue:
      https://lore.kernel.org/netdev/18b3541e5372bc9b9fc733d422f4e698c089077c.1650177997.git.lukas@wunner.de/
      
      So the commit was not necessary.
      
      The commit made binding and unbinding of USB Ethernet asymmetrical:
      Before, usbnet_probe() first invoked the ->bind() callback and then
      register_netdev().  usbnet_disconnect() mirrored that by first invoking
      unregister_netdev() and then ->unbind().
      
      Since the commit, the order in usbnet_disconnect() is reversed and no
      longer mirrors usbnet_probe().
      
      One consequence is that a PHY disconnected (and stopped) in ->unbind()
      is afterwards stopped once more by unregister_netdev() as it closes the
      netdev before unregistering.  That necessitates a contortion in ->stop()
      because the PHY may only be stopped if it hasn't already been
      disconnected.
      
      Reverting the commit allows making the call to phy_stop() unconditional
      in ->stop().
      
      Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500
      Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Acked-by: Oliver Neukum <oneukum@suse.com>
      Cc: Martyn Welch <martyn.welch@collabora.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1408f6b
    • net: ethernet: fix platform_no_drv_owner.cocci warning · 7b8b8222
      Yang Li authored
      Remove .owner field if calls are used which set it automatically.
      ./drivers/net/ethernet/sunplus/spl2sw_driver.c:569:3-8: No need to set
      .owner here. The core will do it.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b8b8222
    • net: page_pool: add page allocation stats for two fast page allocate path · 0f6deac3
      Jie Wang authored
      Currently, when page pool allocation stats are used to analyze an RX
      performance degradation problem, they only count pages allocated through
      page_pool_alloc_pages. However, NIC drivers such as hns3 use
      page_pool_dev_alloc_frag to allocate pages, so allocations through this
      API should also be counted.
      Signed-off-by: Jie Wang <wangjie125@huawei.com>
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f6deac3
    • net: ethernet: Use swap() instead of open coding it · a19cef45
      Jiapeng Chong authored
      Clean the following coccicheck warning:
      
      ./drivers/net/ethernet/sunplus/spl2sw_driver.c:217:27-28: WARNING
      opportunity for swap().
      
      ./drivers/net/ethernet/sunplus/spl2sw_driver.c:222:27-28: WARNING
      opportunity for swap().
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a19cef45
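For readers unfamiliar with the warning, here is a minimal userspace sketch of the pattern coccicheck flags and its replacement. The `swap()` macro below is a userspace equivalent of the kernel helper, and the two `order_*` functions are hypothetical examples, not code from spl2sw_driver.c.

```c
/* Userspace equivalent of the kernel's swap() helper; __typeof__ is a
 * GCC/Clang extension. */
#define swap(a, b) \
    do { __typeof__(a) swap_tmp_ = (a); (a) = (b); (b) = swap_tmp_; } while (0)

/* Open-coded exchange, the pattern that coccicheck reports. */
void order_open_coded(int *lo, int *hi)
{
    if (*lo > *hi) {
        int tmp = *lo;
        *lo = *hi;
        *hi = tmp;
    }
}

/* Same behavior expressed with swap(). */
void order_with_swap(int *lo, int *hi)
{
    if (*lo > *hi)
        swap(*lo, *hi);
}
```

Both functions leave the smaller value in `*lo`; the second simply removes the temporary variable from the caller's code.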
  2. 12 May, 2022 19 commits
    • Merge branch 'net-inet-retire-port-only-listening_hash' · b67fd3d9
      Jakub Kicinski authored
      Martin KaFai Lau says:
      
      ====================
      net: inet: Retire port only listening_hash
      
      This series is to retire the port only listening_hash.
      
      The listen sk is currently stored in two hash tables,
      listening_hash (hashed by port) and lhash2 (hashed by port and address).
      
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the TCP-SYN lookup fast path does not use listening_hash.
      
      The commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2")
      also moved the seq_file (/proc/net/tcp) iteration usage from
      listening_hash to lhash2.
      
      There are still a few listening_hash usages left.
      One of them is inet_reuseport_add_sock() which uses the listening_hash
      to search a listen sk during the listen() system call.  This turns
      out to be very slow on use cases that listen on many different
      VIPs at a popular port (e.g. 443).  [ On top of the slowness in
      adding to the tail in the IPv6 case ]. A later patch has a
      selftest to demonstrate this case.
      
      This series takes this chance to move all remaining listening_hash
      usages to lhash2 and then retire listening_hash.
      ====================
      
      Link: https://lore.kernel.org/r/20220512000546.188616-1-kafai@fb.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b67fd3d9
    • net: selftests: Stress reuseport listen · ec8cb4f6
      Martin KaFai Lau authored
      This patch adds a test that has 300 VIPs listening on port 443.
      Each VIP:443 will have 80 listening socks by using SO_REUSEPORT.
      Thus, it will have 24000 listening socks.
      
      Before removing the port only listening_hash, all socks will be in the
      same port 443 bucket and inet_reuseport_add_sock() spends much time
      walking through the bucket.  After removing the port only listening_hash
      and moving all usage to the port+addr lhash2, each bucket in the
      ideal case has 80 sk, which is much smaller than before.
      
      Here is the test result from a qemu:
      Before: listen 24000 socks took 210.210485362 (~210s)
       After: listen 24000 socks took 0.207173      (~210ms)
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ec8cb4f6
    • net: inet: Retire port only listening_hash · cae3873c
      Martin KaFai Lau authored
      The listen sk is currently stored in two hash tables,
      listening_hash (hashed by port) and lhash2 (hashed by port and address).
      
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the TCP-SYN lookup fast path does not use listening_hash.
      
      The commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2")
      also moved the seq_file (/proc/net/tcp) iteration usage from
      listening_hash to lhash2.
      
      There are still a few listening_hash usages left.
      One of them is inet_reuseport_add_sock() which uses the listening_hash
      to search a listen sk during the listen() system call.  This turns
      out to be very slow on use cases that listen on many different
      VIPs at a popular port (e.g. 443).  [ On top of the slowness in
      adding to the tail in the IPv6 case ].  A later patch has a
      selftest to demonstrate this case.
      
      This patch takes this chance to move all remaining listening_hash
      usages to lhash2 and then retire listening_hash.
      
      Since most changes need to be done together, it is hard to cut
      the listening_hash to lhash2 switch into small patches.  The
      changes in this patch are highlighted here for review
      purposes.
      
      1. Because of the listening_hash removal, lhash2 can use the
         sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
         This also keeps the sk_unhashed() check working as is
         once sk is no longer added to listening_hash.
      
         The union is removed from inet_listen_hashbucket because
         only nulls_head is needed.
      
      2. icsk->icsk_listen_portaddr_node and its helpers are removed.
      
      3. The current lhash2 users need to iterate with sk_nulls_node
         instead of icsk_listen_portaddr_node.
      
         One case is in the inet[6]_lhash2_lookup().
      
         Another case is the seq_file iterator in tcp_ipv4.c.
         One thing to note is sk_nulls_next() is needed
         because the old inet_lhash2_for_each_icsk_continue()
         does a "next" first before iterating.
      
      4. Move the remaining listening_hash usage to lhash2
      
         inet_reuseport_add_sock() which this series is
         trying to improve.
      
         inet_diag.c and mptcp_diag.c are the final two
         remaining use cases and are moved to lhash2 as well.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cae3873c
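Why hashing by port alone hurts in the many-VIPs case can be shown with a toy model. Everything in this sketch is an assumption made for illustration: the table size, the two hash functions and the 10.0.0.x addresses are hypothetical and unrelated to the kernel's real hashing. Only the 300 VIPs, 80 sockets per VIP and port 443 are taken from the series description.

```c
#include <stdint.h>
#include <string.h>

#define NBUCKETS 1024u   /* hypothetical table size for this toy model */
#define NVIPS    300u    /* VIPs listening on the same port */
#define NSOCKS   80u     /* SO_REUSEPORT listeners per VIP */

/* Toy hash functions; the kernel's real hashes differ. */
static uint32_t hash_port(uint16_t port)
{
    return port % NBUCKETS;
}

static uint32_t hash_port_addr(uint32_t addr, uint16_t port)
{
    return ((addr * 2654435761u) ^ port) % NBUCKETS;
}

/* Return the longest bucket chain when every VIP listens on the same port.
 * use_addr == 0 models a port-only table such as listening_hash,
 * use_addr != 0 models a port+address table such as lhash2. */
uint32_t longest_chain(int use_addr, uint16_t port)
{
    static uint32_t count[NBUCKETS];
    uint32_t max = 0, vip, s, b;

    memset(count, 0, sizeof(count));
    for (vip = 0; vip < NVIPS; vip++) {
        uint32_t addr = 0x0a000001u + vip;   /* 10.0.0.1, 10.0.0.2, ... */

        for (s = 0; s < NSOCKS; s++) {
            b = use_addr ? hash_port_addr(addr, port) : hash_port(port);
            if (++count[b] > max)
                max = count[b];
        }
    }
    return max;
}
```

In this model the port-only table puts all 24000 sockets into one chain, while the port+address table spreads the VIPs across buckets so that a lookup walks far fewer entries.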
    • net: inet: Open code inet_hash2 and inet_unhash2 · e8d00590
      Martin KaFai Lau authored
      This patch folds lhash2 related functions into __inet_hash and
      inet_unhash.  This will make the removal of the listening_hash
      in a later patch easier to review.
      
      First, this patch folds inet_hash2 into __inet_hash.
      
      For unhash, the current call sequence is like
      inet_unhash() => __inet_unhash() => inet_unhash2().
      The specific testing cases in __inet_unhash() are mostly related
      to TCP_LISTEN sk and its caller inet_unhash() already has
      the TCP_LISTEN test, so this patch folds both __inet_unhash() and
      inet_unhash2() into inet_unhash().
      
      Note that all listening_hash users also have lhash2 initialized,
      so the !h->lhash2 check is no longer needed.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e8d00590
    • net: inet: Remove count from inet_listen_hashbucket · 8ea1eebb
      Martin KaFai Lau authored
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the count is no longer used.  This patch removes it.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8ea1eebb
    • Merge branch 'make-sfc-siena-ko-specific-to-siena' · 0c1822d9
      Jakub Kicinski authored
      Martin Habets says:
      
      ====================
      Make sfc-siena.ko specific to Siena
      
      This series is a follow-up to the one titled "Move Siena into
      a separate subdirectory".
      It enhances the new sfc-siena.ko module to differentiate it from sfc.ko.
      
      	Patches
      
      Patches 1-5 create separate Kconfig options for Siena, and adjust the
      various names used for work items and directories.
      Patch 6 reinstates SRIOV functionality in sfc-siena.ko.
      
      	Testing
      
      Various build tests were done such as allyesconfig, W=1 and sparse.
      The new sfc-siena.ko and sfc.ko modules were tested on a machine with NICs
      for both modules installed.
      Inserting the updated sfc.ko and the new sfc-siena.ko modules at the same
      time works, and no work items and directories exist with the same name.
      ====================
      
      Link: https://lore.kernel.org/r/165228589518.696.7119477411428288875.stgit@palantir17.mph.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0c1822d9
    • sfc/siena: Reinstate SRIOV init/fini function calls · c3743039
      Martin Habets authored
      They were removed in the first series since they were not used for EF10.
      Put that code back for Siena, with the prototypes in siena_sriov.h
      since that file is a more applicable place for it.
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c3743039
    • Martin Habets's avatar
      sfc/siena: Make PTP and reset support specific for Siena · ef9b5770
      Martin Habets authored
      Change the clock name and work queue names to differentiate them from
      the names used in sfc.ko.
      Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ef9b5770
    • Martin Habets's avatar
      sfc/siena: Make MCDI logging support specific for Siena · 58b6b3d5
      Martin Habets authored
      Add a Siena Kconfig option and use it instead of the sfc one.
      Rename the internal variable for the 'mcdi_logging_default' module
      parameter to avoid a naming conflict with the one in sfc.ko.
      Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      58b6b3d5
    • Martin Habets's avatar
      siena: Make HWMON support specific for Siena · f62a0745
      Martin Habets authored
      Add a Siena Kconfig option and use it instead of the sfc one.
      Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f62a0745
    • Martin Habets's avatar
      siena: Make SRIOV support specific for Siena · dfb1cfbd
      Martin Habets authored
      Add a Siena Kconfig option and use it instead of the sfc one.
      Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dfb1cfbd
    • Martin Habets's avatar
      siena: Make MTD support specific for Siena · 65d4b471
      Martin Habets authored
      Add a Siena Kconfig option and use it instead of the sfc one.
      Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      65d4b471
    • Jakub Kicinski's avatar
      Merge branch 'restructure-struct-ocelot_port' · 75db72de
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Restructure struct ocelot_port
      
      This patch set represents preparation for further work. It adds an
      "index" field to struct ocelot_port, and populates it from the Felix DSA
      driver and Ocelot switchdev driver.
      
      The users of struct ocelot_port :: index are the same users as those of
      struct ocelot_port_private :: chip_port.
      ====================
      
      Link: https://lore.kernel.org/r/20220511100637.568950-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      75db72de
    • Vladimir Oltean's avatar
      net: mscc: ocelot: move ocelot_port_private :: chip_port to ocelot_port :: index · 7e708760
      Vladimir Oltean authored
      Currently the ocelot switch lib is unaware of the index of a struct
      ocelot_port, since that is kept in the encapsulating structures of outer
      drivers (struct dsa_port :: index, struct ocelot_port_private :: chip_port).
      
      With the upcoming increase in complexity associated with assigning DSA
      tag_8021q CPU ports to certain user ports, it becomes necessary for the
      switch lib to be able to retrieve the index of a certain ocelot_port.
      
      Therefore, introduce a new u8 to ocelot_port (same size as the chip_port
      used by the ocelot switchdev driver) and rework the existing code to
      populate and use it.
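
      The change described above can be pictured with a minimal sketch. It is
      not the driver code: every demo_* name is hypothetical, and the only
      details taken from the text are that the library-level port gains a
      u8 index and that the outer driver fills it in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PORTS 8

/* Library-level port: carries its own index (a u8, as in the commit text). */
struct demo_port {
	uint8_t index;
};

/* Library-level switch: owns an array of port pointers. */
struct demo_switch {
	struct demo_port *ports[DEMO_MAX_PORTS];
};

/*
 * Outer driver structure embedding the library port. Before the change,
 * the port number would be known only at this level.
 */
struct demo_port_private {
	struct demo_port port;
};

/* Driver side: populate the index once, when the port is registered. */
int demo_register_port(struct demo_switch *sw,
		       struct demo_port_private *priv, uint8_t num)
{
	if (num >= DEMO_MAX_PORTS)
		return -1;

	priv->port.index = num;
	sw->ports[num] = &priv->port;
	return 0;
}

/* Library side: the index is readable without knowing the outer type. */
int demo_port_index(const struct demo_port *port)
{
	assert(port != NULL);
	return port->index;
}
```

      With this layout, library code that is handed only a port pointer can
      still tell which port it is working on.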
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7e708760
    • Vladimir Oltean's avatar
      net: mscc: ocelot: minimize holes in struct ocelot_port · 6d0be600
      Vladimir Oltean authored
      Reorder members of struct ocelot_port to eliminate holes and reduce
      structure size. Pahole says:
      
      Before:
      
      struct ocelot_port {
              struct ocelot *            ocelot;               /*     0     8 */
              struct regmap *            target;               /*     8     8 */
              bool                       vlan_aware;           /*    16     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              const struct ocelot_bridge_vlan  * pvid_vlan;    /*    24     8 */
              unsigned int               ptp_skbs_in_flight;   /*    32     4 */
              u8                         ptp_cmd;              /*    36     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              struct sk_buff_head        tx_skbs;              /*    40    96 */
              /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
              u8                         ts_id;                /*   136     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              phy_interface_t            phy_mode;             /*   140     4 */
              bool                       is_dsa_8021q_cpu;     /*   144     1 */
              bool                       learn_ena;            /*   145     1 */
      
              /* XXX 6 bytes hole, try to pack */
      
              struct net_device *        bond;                 /*   152     8 */
              bool                       lag_tx_active;        /*   160     1 */
      
              /* XXX 1 byte hole, try to pack */
      
              u16                        mrp_ring_id;          /*   162     2 */
      
              /* XXX 4 bytes hole, try to pack */
      
              struct net_device *        bridge;               /*   168     8 */
              int                        bridge_num;           /*   176     4 */
              u8                         stp_state;            /*   180     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              int                        speed;                /*   184     4 */
      
              /* size: 192, cachelines: 3, members: 18 */
              /* sum members: 161, holes: 7, sum holes: 27 */
              /* padding: 4 */
      };
      
      After:
      
      struct ocelot_port {
              struct ocelot *            ocelot;               /*     0     8 */
              struct regmap *            target;               /*     8     8 */
              struct net_device *        bond;                 /*    16     8 */
              struct net_device *        bridge;               /*    24     8 */
              const struct ocelot_bridge_vlan  * pvid_vlan;    /*    32     8 */
              phy_interface_t            phy_mode;             /*    40     4 */
              unsigned int               ptp_skbs_in_flight;   /*    44     4 */
              struct sk_buff_head        tx_skbs;              /*    48    96 */
              /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
              u16                        mrp_ring_id;          /*   144     2 */
              u8                         ptp_cmd;              /*   146     1 */
              u8                         ts_id;                /*   147     1 */
              u8                         stp_state;            /*   148     1 */
              bool                       vlan_aware;           /*   149     1 */
              bool                       is_dsa_8021q_cpu;     /*   150     1 */
              bool                       learn_ena;            /*   151     1 */
              bool                       lag_tx_active;        /*   152     1 */
      
              /* XXX 3 bytes hole, try to pack */
      
              int                        bridge_num;           /*   156     4 */
              int                        speed;                /*   160     4 */
      
              /* size: 168, cachelines: 3, members: 18 */
              /* sum members: 161, holes: 1, sum holes: 3 */
              /* padding: 4 */
              /* last cacheline: 40 bytes */
      };
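
      The same effect can be reproduced on a much smaller, hypothetical
      structure. The sketch below is not struct ocelot_port: the demo_* types
      are assumptions made for illustration, and the exact sizes depend on the
      ABI. It only shows that sorting members from largest to smallest
      alignment reduces the bytes lost to holes and tail padding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with members in the order they happened to be added. */
struct demo_port_holes {
	void *owner;
	bool vlan_aware;	/* padding follows, up to pointer alignment */
	void *pvid;
	uint8_t ptp_cmd;
	uint16_t ring_id;
	void *bridge;
	bool learn_ena;
	int speed;
};

/* Same members, sorted from largest to smallest alignment. */
struct demo_port_packed {
	void *owner;
	void *pvid;
	void *bridge;
	int speed;
	uint16_t ring_id;
	uint8_t ptp_cmd;
	bool vlan_aware;
	bool learn_ena;
};

/* Sum of the member sizes; identical for both layouts. */
#define DEMO_SUM_MEMBERS \
	(3 * sizeof(void *) + sizeof(int) + sizeof(uint16_t) + \
	 sizeof(uint8_t) + 2 * sizeof(bool))

static_assert(sizeof(struct demo_port_packed) <= sizeof(struct demo_port_holes),
	      "reordering must not grow the structure");

/* Bytes lost to holes and tail padding in the unsorted layout. */
size_t demo_waste_holes(void)
{
	return sizeof(struct demo_port_holes) - DEMO_SUM_MEMBERS;
}

/* Bytes lost to holes and tail padding in the sorted layout. */
size_t demo_waste_packed(void)
{
	return sizeof(struct demo_port_packed) - DEMO_SUM_MEMBERS;
}
```

      Comparing demo_waste_holes() with demo_waste_packed() gives a rough
      equivalent of pahole's "sum holes" plus "padding" figures for the two
      layouts.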
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6d0be600
    • Vladimir Oltean's avatar
      net: mscc: ocelot: delete ocelot_port :: xmit_template · 15f6d01e
      Vladimir Oltean authored
      This is no longer used since commit 7c4bb540 ("net: dsa: tag_ocelot:
      create separate tagger for Seville").
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      15f6d01e
    • Jakub Kicinski's avatar
      Merge branch 'dsa-changes-for-multiple-cpu-ports-part-1' · 879c610c
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      DSA changes for multiple CPU ports (part 1)
      
      I am trying to enable the second internal port pair from the NXP LS1028A
      Felix switch for DSA-tagged traffic via "ocelot-8021q". This series
      represents part 1 (of an unknown number) of that effort.
      
      It does some preparation work, like managing host flooding in DSA via a
      dedicated method, and removing the CPU port as argument from the tagging
      protocol change procedure.
      
      In terms of driver-specific changes, it reworks the 2 tag protocol
      implementations in the Felix driver to have a structured data format.
      It enables host flooding towards all tag_8021q CPU ports. It dynamically
      updates the tag_8021q CPU port used for traps. It also fixes a bug
      introduced by a previous refactoring/oversimplification commit in
      net-next.
      ====================
      
      Link: https://lore.kernel.org/r/20220511095020.562461-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      879c610c
    • Vladimir Oltean's avatar
      net: dsa: felix: reimplement tagging protocol change with function pointers · 7a29d220
      Vladimir Oltean authored
      The error handling for the current tagging protocol change procedure is
      a bit brittle (we dismantle the previous tagging protocol entirely
      before setting up the new one). By identifying which parts of a tagging
      protocol are unique to itself and which parts are shared with the other,
      we can implement a protocol change procedure where error handling is a
      bit more robust, because we start setting up the new protocol first, and
      tear down the old one only after the setup of the specific and shared
      parts succeeded.
      
      The protocol change is also a bit too open-coded in the area of
      migrating host flood settings and MDBs. By identifying what differs
      between tagging protocols (the forwarding masks for host flooding) we
      can implement a more straightforward migration procedure which is
      handled in the shared portion of the protocol change, rather than
      individually by each protocol.
      
      Therefore, a more structured approach calls for the introduction of a
      structure of function pointers per tagging protocol. This covers setup,
      teardown and the host forwarding mask. In the future it will also cover
      how to prepare for a new DSA master.
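
      A minimal sketch of this pattern follows. It is not the Felix driver
      code: all demo_* names, the mask fields and the fail_next_setup test
      hook are assumptions made for illustration. What it takes from the text
      is the per-protocol structure of function pointers (setup, teardown,
      host forwarding mask) and the ordering in which the new protocol is set
      up before the old one is torn down.

```c
#include <assert.h>
#include <stddef.h>

struct demo_switch;

/* One instance per tagging protocol. */
struct demo_tag_proto_ops {
	int (*setup)(struct demo_switch *sw);
	void (*teardown)(struct demo_switch *sw);
	unsigned long (*get_host_fwd_mask)(struct demo_switch *sw);
};

struct demo_switch {
	const struct demo_tag_proto_ops *proto;	/* protocol in use */
	unsigned long host_flood_mask;		/* shared between protocols */
	unsigned long npi_mask;			/* hypothetical CPU port module mask */
	unsigned long tag_8021q_cpu_mask;	/* hypothetical tag_8021q CPU ports */
	int npi_active;
	int tag_8021q_active;
	int fail_next_setup;			/* test hook: simulate a setup error */
};

static int demo_npi_setup(struct demo_switch *sw)
{
	if (sw->fail_next_setup)
		return -1;
	sw->npi_active = 1;
	return 0;
}

static void demo_npi_teardown(struct demo_switch *sw)
{
	sw->npi_active = 0;
}

static unsigned long demo_npi_fwd_mask(struct demo_switch *sw)
{
	return sw->npi_mask;
}

static int demo_8021q_setup(struct demo_switch *sw)
{
	if (sw->fail_next_setup)
		return -1;
	sw->tag_8021q_active = 1;
	return 0;
}

static void demo_8021q_teardown(struct demo_switch *sw)
{
	sw->tag_8021q_active = 0;
}

static unsigned long demo_8021q_fwd_mask(struct demo_switch *sw)
{
	return sw->tag_8021q_cpu_mask;
}

const struct demo_tag_proto_ops demo_npi_ops = {
	.setup = demo_npi_setup,
	.teardown = demo_npi_teardown,
	.get_host_fwd_mask = demo_npi_fwd_mask,
};

const struct demo_tag_proto_ops demo_8021q_ops = {
	.setup = demo_8021q_setup,
	.teardown = demo_8021q_teardown,
	.get_host_fwd_mask = demo_8021q_fwd_mask,
};

/* Set up the new protocol first; tear down the old one only on success. */
int demo_change_tag_proto(struct demo_switch *sw,
			  const struct demo_tag_proto_ops *new_ops)
{
	const struct demo_tag_proto_ops *old_ops = sw->proto;
	int err;

	assert(new_ops != NULL);
	if (new_ops == old_ops)
		return 0;

	err = new_ops->setup(sw);
	if (err)
		return err;	/* the old protocol is still fully in place */

	/* Shared part: migrate host flooding to the new protocol's mask. */
	sw->host_flood_mask = new_ops->get_host_fwd_mask(sw);

	if (old_ops)
		old_ops->teardown(sw);

	sw->proto = new_ops;
	return 0;
}
```

      In this sketch a failed setup leaves the previous protocol untouched,
      and the host flood migration lives in the shared function rather than
      in each protocol's callbacks.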
      
      The initial tagging protocol setup (at driver probe time) and the final
      teardown (at driver removal time) are also adapted to call into the
      structured methods of the specific protocol in current use. This is
      especially relevant for teardown, where we previously called
      felix_del_tag_protocol() only for the first CPU port. But by not
      specifying which CPU port this is for, we gain more flexibility to
      support multiple CPU ports in the future.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7a29d220
    • Vladimir Oltean's avatar
      net: dsa: felix: dynamically determine tag_8021q CPU port for traps · c352e5e8
      Vladimir Oltean authored
      Ocelot switches support a single active CPU port at a time (at least as
      a trapping destination, i.e. for control traffic). This is true
      regardless of whether we are using the native copy-to-CPU-port-module
      functionality, or a redirect action towards the software-defined
      tag_8021q CPU port.
      
      Currently we assume that the trapping destination in tag_8021q mode is
      the first CPU port, yet in the future we may want to migrate the user
      ports to the second CPU port.
      
      For that to work, we need to make sure that the tag_8021q trapping
      destination is a CPU port that is active, i.e. is used by at least some
      user port on which the trap was added. Otherwise, we may end up
      redirecting the traffic to a CPU port which isn't even up.
      
      Note that due to the current design where we simply choose the CPU port
      of the first port from the trap's ingress port mask, it may be that a
      CPU port absorbs control traffic from user ports which aren't affine to
      it as per user space's request. This isn't ideal, but is the lesser of
      two evils. Following the user-configured affinity for traps would mean
      that we can no longer reuse a single TCAM entry for multiple traps,
      which is what we actually do for e.g. PTP. Either we duplicate and
      deduplicate TCAM entries on the fly when user-to-CPU-port mappings
      change (which is unnecessarily complicated), or we redirect trapped
      traffic to all tag_8021q CPU ports if multiple such ports are in use.
      The latter would have been nice if it actually worked, but it
      doesn't, since an OCELOT_MASK_MODE_REDIRECT action towards multiple ports
      would not take PGID_SRC into consideration, and it would just duplicate
      the packet towards each (CPU) port, leading to duplicates in software.
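
      The selection rule described above (use the CPU port of the first port
      in the trap's ingress port mask) can be sketched as follows. This is an
      illustration under assumptions, not the driver code: the function name,
      the cpu_of_port table and the bitmask representation are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the CPU port used by the lowest-numbered port set in
 * ingress_port_mask, or -1 if the mask selects no port.
 * cpu_of_port[i] is the CPU port that user port i is affine to.
 * num_ports must not exceed the number of bits in unsigned long.
 */
int demo_trap_cpu_port(unsigned long ingress_port_mask,
		       const int *cpu_of_port, int num_ports)
{
	assert(cpu_of_port != NULL);

	for (int port = 0; port < num_ports; port++) {
		if (ingress_port_mask & (1UL << port))
			return cpu_of_port[port];
	}

	return -1;
}
```

      With user ports 0 and 1 affine to CPU port 4 and user ports 2 and 3
      affine to CPU port 5, a trap installed on ports 1 and 2 resolves to CPU
      port 4 in this sketch, so port 2's control traffic lands on a CPU port
      it is not affine to. That is the trade-off the text accepts.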
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c352e5e8