1. 16 Nov, 2019 22 commits
    • selftests: net: avoid ptl lock contention in tcp_mmap · 597b01ed
      Eric Dumazet authored
      tcp_mmap is used as a reference program for TCP rx zerocopy,
      so it is important to point out some potential issues.
      
      If multiple threads are concurrently using getsockopt(...
      TCP_ZEROCOPY_RECEIVE), there is a chance the low-level mm
      functions contend on a shared ptl lock if the VMAs are arbitrarily placed.
      
      Instead of letting the mm layer place the chunks back to back,
      this patch enforces an alignment so that each thread uses
      a different ptl lock.
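
      The alignment idea can be sketched as plain address arithmetic. This
      is a minimal illustration and not the tcp_mmap code: the 2 MiB value
      is an assumed Hugepagesize, and align_up() and chunk_bases() are
      hypothetical helper names introduced here.

```python
HUGEPAGE_SIZE = 2 * 1024 * 1024  # assumed Hugepagesize (2 MiB); see /proc/meminfo


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def chunk_bases(start, chunk_size, nthreads, alignment):
    """Return one aligned, non-overlapping base address per thread."""
    bases = []
    addr = start
    for _ in range(nthreads):
        base = align_up(addr, alignment)
        bases.append(base)
        addr = base + chunk_size  # next chunk starts after this one
    return bases


if __name__ == "__main__":
    # Back-to-back 128 KiB chunks would all sit in the same 2 MiB region;
    # with alignment, every chunk starts in a region of its own.
    for base in chunk_bases(0x7F0000001000, 128 * 1024, 4, HUGEPAGE_SIZE):
        print(hex(base))
```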
      
      Performance measured on a 100 Gbit NIC, with 8 tcp_mmap clients
      launched at the same time :
      
      $ for f in {1..8}; do ./tcp_mmap -H 2002:a05:6608:290:: & done
      
      In the following run, we reproduce the old behavior by requesting no alignment :
      
      $ tcp_mmap -sz -C $((128*1024)) -a 4096
      received 32768 MB (100 % mmap'ed) in 9.69532 s, 28.3516 Gbit
        cpu usage user:0.08634 sys:3.86258, 120.511 usec per MB, 171839 c-switches
      received 32768 MB (100 % mmap'ed) in 25.4719 s, 10.7914 Gbit
        cpu usage user:0.055268 sys:21.5633, 659.745 usec per MB, 9065 c-switches
      received 32768 MB (100 % mmap'ed) in 28.5419 s, 9.63069 Gbit
        cpu usage user:0.057401 sys:23.8761, 730.392 usec per MB, 14987 c-switches
      received 32768 MB (100 % mmap'ed) in 28.655 s, 9.59268 Gbit
        cpu usage user:0.059689 sys:23.8087, 728.406 usec per MB, 18509 c-switches
      received 32768 MB (100 % mmap'ed) in 28.7808 s, 9.55074 Gbit
        cpu usage user:0.066042 sys:23.4632, 718.056 usec per MB, 24702 c-switches
      received 32768 MB (100 % mmap'ed) in 28.8259 s, 9.5358 Gbit
        cpu usage user:0.056547 sys:23.6628, 723.858 usec per MB, 23518 c-switches
      received 32768 MB (100 % mmap'ed) in 28.8808 s, 9.51767 Gbit
        cpu usage user:0.059357 sys:23.8515, 729.703 usec per MB, 14691 c-switches
      received 32768 MB (100 % mmap'ed) in 28.8879 s, 9.51534 Gbit
        cpu usage user:0.047115 sys:23.7349, 725.769 usec per MB, 21773 c-switches
      
      With the new behavior (automatic alignment based on Hugepagesize),
      the system overhead is dramatically reduced.
      
      $ tcp_mmap -sz -C $((128*1024))
      received 32768 MB (100 % mmap'ed) in 13.5339 s, 20.3103 Gbit
        cpu usage user:0.122644 sys:3.4125, 107.884 usec per MB, 168567 c-switches
      received 32768 MB (100 % mmap'ed) in 16.0335 s, 17.1439 Gbit
        cpu usage user:0.132428 sys:3.55752, 112.608 usec per MB, 188557 c-switches
      received 32768 MB (100 % mmap'ed) in 17.5506 s, 15.6621 Gbit
        cpu usage user:0.155405 sys:3.24889, 103.891 usec per MB, 226652 c-switches
      received 32768 MB (100 % mmap'ed) in 19.1924 s, 14.3222 Gbit
        cpu usage user:0.135352 sys:3.35583, 106.542 usec per MB, 207404 c-switches
      received 32768 MB (100 % mmap'ed) in 22.3649 s, 12.2906 Gbit
        cpu usage user:0.142429 sys:3.53187, 112.131 usec per MB, 250225 c-switches
      received 32768 MB (100 % mmap'ed) in 22.5336 s, 12.1986 Gbit
        cpu usage user:0.140654 sys:3.61971, 114.757 usec per MB, 253754 c-switches
      received 32768 MB (100 % mmap'ed) in 22.5483 s, 12.1906 Gbit
        cpu usage user:0.134035 sys:3.55952, 112.718 usec per MB, 252997 c-switches
      received 32768 MB (100 % mmap'ed) in 22.6442 s, 12.139 Gbit
        cpu usage user:0.126173 sys:3.71251, 117.147 usec per MB, 253728 c-switches
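
      The figures in each output line can be cross-checked with a little
      arithmetic. The sketch below assumes 1 MB = 2**20 bytes and
      1 Gbit = 10**9 bits, and that "usec per MB" is user plus sys CPU
      time divided by the megabytes received; summarize() is a
      hypothetical helper, not part of tcp_mmap.

```python
def summarize(mbytes, seconds, user, system):
    """Recompute throughput (Gbit/s) and CPU cost (usec per MB) for one run."""
    bits = mbytes * 1024 * 1024 * 8            # assumes 1 MB = 2**20 bytes
    gbit_per_s = bits / seconds / 1e9          # assumes 1 Gbit = 10**9 bits
    usec_per_mb = (user + system) / mbytes * 1e6
    return gbit_per_s, usec_per_mb


if __name__ == "__main__":
    # First line of the unaligned run and first line of the aligned run.
    print(summarize(32768, 9.69532, 0.08634, 3.86258))
    print(summarize(32768, 13.5339, 0.122644, 3.4125))
```
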
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8169: load firmware for RTL8168fp/RTL8117 · 229c1e0d
      Heiner Kallweit authored
      Load Realtek-provided firmware for RTL8168fp/RTL8117. Unlike the
      firmware for other chip versions which is for the PHY, firmware for
      RTL8168fp/RTL8117 is for the MAC.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8169: improve conditional firmware loading for RTL8168d · 718af5bc
      Heiner Kallweit authored
      Using constant MII_EXPANSION is misleading here because register 0x06
      has a different meaning on page 0x0005. Here a proprietary PHY
      parameter is read by writing the parameter id to register 0x05 on page
      0x0005, followed by reading the parameter value from register 0x06.
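
      The access sequence described above can be modelled with a toy PHY.
      This is only a sketch: ToyPhy and read_phy_param() are hypothetical,
      the parameter id and value are made up, and the use of register 0x1f
      for page selection is an assumption about how the page is switched.

```python
PAGE_SELECT_REG = 0x1F  # assumed page-select register


class ToyPhy:
    """Toy paged PHY: on page 0x0005, reg 0x05 selects a parameter, reg 0x06 reads it."""

    def __init__(self, params):
        self.page = 0
        self.param_id = 0
        self.params = params  # made-up parameter table: id -> value

    def write(self, reg, val):
        if reg == PAGE_SELECT_REG:
            self.page = val
        elif self.page == 0x0005 and reg == 0x05:
            self.param_id = val

    def read(self, reg):
        if self.page == 0x0005 and reg == 0x06:
            return self.params.get(self.param_id, 0)
        return 0  # other registers are not modelled


def read_phy_param(phy, param_id):
    """Write the parameter id to reg 0x05 on page 0x0005, then read reg 0x06."""
    phy.write(PAGE_SELECT_REG, 0x0005)
    phy.write(0x05, param_id)
    value = phy.read(0x06)
    phy.write(PAGE_SELECT_REG, 0x0000)  # return to the default page
    return value
```
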
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phylink: update to use phy_support_asym_pause() · 725ea4bf
      Russell King authored
      Use phy_support_asym_pause() rather than open-coding it.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'wireless-drivers-next-2019-11-15' of... · 50bef719
      David S. Miller authored
      Merge tag 'wireless-drivers-next-2019-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Kalle Valo says:
      
      ====================
      wireless-drivers-next patches for v5.5
      
      Second set of patches for v5.5. Nothing special this time, smaller
      features to various drivers and of course fixes all over.
      
      Major changes:
      
      iwlwifi
      
      * update scan FW API
      
      * bump the supported FW API version
      
      * add debug dump collection on assert in WoWLAN
      
      * enable adaptive dwell on P2P interfaces
      
      ath10k
      
      * request for PM_QOS_CPU_DMA_LATENCY to improve firmware initialisation time
      
      qtnfmac
      
      * add support for getting/setting transmit power
      
      * handle MIC failure event from firmware
      
      rtl8xxxu
      
      * add support for Edimax EW-7611ULB
      
      wil6210
      
      * add SPDX license identifiers
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: symmetric ICMP transmit · df98be06
      Matteo Croce authored
      A bond with layer2+3 or layer3+4 hashing uses the IP addresses and the ports
      to balance packets between slaves. With some network errors, we receive an ICMP
      error packet from the remote host or a router. If sent by a router, the source IP
      can differ from that of the remote host. Additionally, ICMP has no port
      numbers, so a layer3+4 bond will compute a different hash than for the original flow.
      These two conditions can send the packet through a different interface than
      the other packets of the same flow:
      
          # tcpdump -qltnni veth0 |sed 's/^/0: /' &
          # tcpdump -qltnni veth1 |sed 's/^/1: /' &
          # hping3 -2 192.168.0.2 -p 9
          0: IP 192.168.0.1.2251 > 192.168.0.2.9: UDP, length 0
          1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
          1: IP 192.168.0.1.2252 > 192.168.0.2.9: UDP, length 0
          1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
          1: IP 192.168.0.1.2253 > 192.168.0.2.9: UDP, length 0
          1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
          0: IP 192.168.0.1.2254 > 192.168.0.2.9: UDP, length 0
          1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
      
      An ICMP error packet contains the header of the packet which caused the network
      error, so inspect it and match the flow against it, so we can send the ICMP via
      the same interface as the previous packet in the flow.
      Move the IP and port dissect code into a generic function bond_flow_ip() and if
      we are dissecting an ICMP error packet, call it again with the adjusted offset.
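
      The idea can be sketched with a toy hash. This is not the bonding
      driver's hash: flow_hash() and pick_slave() are hypothetical, packets
      are plain dicts, and the XOR-based hash is a simplifying assumption
      that makes both directions of a flow hash alike.

```python
import ipaddress


def ip(addr):
    """Convert a dotted-quad string to an integer."""
    return int(ipaddress.ip_address(addr))


def flow_hash(pkt):
    """Toy layer3+4 hash: XOR of addresses and ports (direction-agnostic)."""
    # For an ICMP error, dissect the embedded header of the offending
    # packet instead of the outer ICMP header.
    if pkt.get("icmp_error"):
        pkt = pkt["inner"]
    h = ip(pkt["src"]) ^ ip(pkt["dst"])
    h ^= pkt.get("sport", 0) ^ pkt.get("dport", 0)
    return h


def pick_slave(pkt, nslaves):
    """Map a packet to a slave index."""
    return flow_hash(pkt) % nslaves


if __name__ == "__main__":
    udp = {"src": "192.168.0.1", "dst": "192.168.0.2", "sport": 2251, "dport": 9}
    # ICMP error sent by a router (made-up address) about the UDP packet above.
    icmp = {"src": "10.0.0.254", "dst": "192.168.0.1", "icmp_error": True, "inner": udp}
    print(pick_slave(udp, 2), pick_slave(icmp, 2))
```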
      
          # hping3 -2 192.168.0.2 -p 9
          1: IP 192.168.0.1.1224 > 192.168.0.2.9: UDP, length 0
          1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
          1: IP 192.168.0.1.1225 > 192.168.0.2.9: UDP, length 0
          1: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
          0: IP 192.168.0.1.1226 > 192.168.0.2.9: UDP, length 0
          0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
          0: IP 192.168.0.1.1227 > 192.168.0.2.9: UDP, length 0
          0: IP 192.168.0.2 > 192.168.0.1: ICMP 192.168.0.2 udp port 9 unreachable, length 36
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: omit error check from of_get_phy_mode · 4214fa1e
      Horatiu Vultur authored
      Commit 0c65b2b9 ("net: of_get_phy_mode: Change API to solve
      int/unit warnings") updated the declaration of of_get_phy_mode.
      It now returns an error code: if the node contains neither the
      'phy-mode' nor the 'phy-connection-type' property, it returns -EINVAL and
      sets the phy_interface_t to PHY_INTERFACE_MODE_NA.
      
      Ocelot VSC7514 has 4 internal phys which have the phy interface
      PHY_INTERFACE_MODE_NA. So because of_get_phy_mode would assign
      PHY_INTERFACE_MODE_NA to phy_mode when there is an error, there is no need
      to add the error check.
      
      Updates for v2:
       - drop error check because of_get_phy_mode already sets phy_interface
         to PHY_INTERFACE_MODE_NA in case of error.
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: allow fast GRO for skbs with Ethernet header in head · 8aef998d
      Alexander Lobakin authored
      Commit 78d3fd0b ("gro: Only use skb_gro_header for completely
      non-linear packets") back in May'09 (v2.6.31-rc1) has changed the
      original condition '!skb_headlen(skb)' to
      'skb->mac_header == skb->tail' in gro_reset_offset() saying: "Since
      the drivers that need this optimisation all provide completely
      non-linear packets" (note that this condition has become the current
      'skb_mac_header(skb) == skb_tail_pointer(skb)' later with commit
      ced14f68 ("net: Correct comparisons and calculations using
      skb->tail and skb-transport_header") without any functional changes).
      
      For now, we have the following rough statistics for v5.4-rc7:
      1) napi_gro_frags: 14
      2) napi_gro_receive with skb->head containing (most of) payload: 83
      3) napi_gro_receive with skb->head containing all the headers: 20
      4) napi_gro_receive with skb->head containing only Ethernet header: 2
      
      With the current condition, fast GRO with the usage of
      NAPI_GRO_CB(skb)->frag0 is available only in the [1] case.
      Packets pushed by [2] and [3] go through the 'slow' path, but
      it's not a problem for them as they already contain all the needed
      headers in skb->head, so pskb_may_pull() only moves skb->data.
      
      The layout of skbs in the fourth [4] case at the moment of
      dev_gro_receive() is identical to skbs that have come through [1],
      as napi_frags_skb() pulls Ethernet header to skb->head. The only
      difference is that the mentioned condition is always false for them,
      because skb_put() and friends irreversibly alter the tail pointer.
      They also go through the 'slow' path, but now every single
      pskb_may_pull() in every single .gro_receive() will call the *really*
      slow __pskb_pull_tail() to pull headers to head. This significantly
      decreases the overall performance for no visible reasons.
      
      The only two users of method [4] are:
      * drivers/staging/qlge
      * drivers/net/wireless/iwlwifi (all three variants: dvm, mvm, mvm-mq)
      
      Note that in the case of wireless drivers we can't use [1]
      (napi_gro_frags()), at least for now, and the mac80211 stack always
      performs pushes and pulls anyway, so a performance hit is unavoidable.
      
      At the moment of v2.6.31 the mentioned change was necessary (that's
      why I don't add the "Fixes:" tag), but it became obsolete since
      skb_gro_mac_header() has gone in commit a50e233c ("net-gro:
      restore frag0 optimization"), so we can simply revert the condition
      in gro_reset_offset() to allow skbs from [4] to go through the 'fast'
      path just like in case [1].
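
      The difference between the two conditions can be shown with a toy
      model of the skb head offsets. This is a sketch under assumptions:
      ToySkb is hypothetical, only the first clause of the check is
      modelled (the frag checks are left out), and the offsets for cases
      [1], [3] and [4] reflect the layouts described above.

```python
from dataclasses import dataclass

ETH_HLEN = 14  # length of an Ethernet header


@dataclass
class ToySkb:
    data: int        # offset of the data pointer within the head buffer
    tail: int        # offset of the end of the linear data
    mac_header: int  # offset of the Ethernet header

    def headlen(self):
        """Bytes of linear data still in the head (tail - data)."""
        return self.tail - self.data


def fast_path_old(skb):
    """Condition before the patch: mac header pointer equals tail pointer."""
    return skb.mac_header == skb.tail


def fast_path_new(skb):
    """Restored condition: no linear data left in the head."""
    return skb.headlen() == 0


# Case [1]: head is still empty when the offsets are reset.
frags_skb = ToySkb(data=0, tail=0, mac_header=0)
# Case [4]: Ethernet header was put into the head and already pulled.
eth_only_skb = ToySkb(data=ETH_HLEN, tail=ETH_HLEN, mac_header=0)
# Case [3]: all headers in the head (40 bytes of headers remain after Ethernet).
headers_skb = ToySkb(data=ETH_HLEN, tail=ETH_HLEN + 40, mac_header=0)
```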
      
      This was tested on a 600 MHz MIPS CPU with a custom driver, and this
      patch gave boosts of up to 40 Mbps to method [4] in both directions
      compared to net-next, which made overall performance relatively
      close to [1] (without it, [4] is the slowest).
      
      v2:
      - Add more references and explanations to commit message
      - Fix some typos ibid
      - No functional changes
      Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bnx2x-Remove-function-casts' · f92e88db
      David S. Miller authored
      Kees Cook says:
      
      ====================
      bnx2x: Remove function casts
      
      In order to make the entire kernel usable under Clang's Control Flow
      Integrity protections, function prototype casts need to be avoided
      because this will trip CFI checks at runtime (i.e. a mismatch between
      the caller's expected function prototype and the destination function's
      prototype). Many of these cases can be found with -Wcast-function-type,
      which found that bnx2x had a bunch of needless (or at least confusing)
      function casts. This series removes them all.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: Remove hw_reset_t function casts · 548e5ffe
      Kees Cook authored
      All .hw_reset callbacks except bnx2x_84833_hw_reset_phy() use a
      void return type. No callers of .hw_reset check a return value and
      bnx2x_84833_hw_reset_phy() unconditionally returns 0. Remove all
      hw_reset_t casts and fix the return type to void.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: Remove format_fw_ver_t function casts · 26658f6b
      Kees Cook authored
      The return values for format_fw_ver_t callbacks are supposed to be
      "int", not "u8". Ultimately, the top-level caller doesn't actually check
      the return value at all, but just clean this all up anyway and fix the
      prototypes so that casts are no longer needed.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: Remove config_init_t function casts · 3e19d1f2
      Kees Cook authored
      No callers of .config_init check return values. Remove the casting and
      change all callbacks to have the correct function prototype.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: Remove read_status_t function casts · 2c855d73
      Kees Cook authored
      The function casts for .read_status callbacks end up casting some int
      return values to u8. This seems to be bug-prone (-EINVAL being returned
      into something that appears to be true/false), but fixing the function
      prototypes doesn't change the existing behavior. Fix the return values
      to remove the casts.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bnx2x: Drop redundant callback function casts · 86c1fe88
      Kees Cook authored
      NULL is already "void *" so it will auto-cast in assignments and
      initializers. Additionally, all the callbacks for .link_reset,
      .config_loopback, .set_link_led, and .phy_specific_func are already
      correct. No casting is needed for these, so remove them.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • enetc: update TSN Qbv PSPEED set according to adjust link speed · 2e47cb41
      Po Liu authored
      ENETC has a register, PSPEED, that indicates the link speed of the
      hardware. The PSPEED field must be kept in sync with the port speed
      for Qbv scheduling purposes. Otherwise, if PSPEED and the PHY speed
      do not match, a gate slot may not be freed by a frame occupying the
      MAC. So update PSPEED whenever the link is adjusted. This is
      implemented in adjust_link.
      Signed-off-by: Po Liu <Po.Liu@nxp.com>
      Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • enetc: Configure the Time-Aware Scheduler via tc-taprio offload · 34c6adf1
      Po Liu authored
      ENETC supports time-based egress shaping in hardware according
      to IEEE 802.1Qbv. This patch enables Qbv through the tc-taprio
      qdisc hardware offload.
      Also move the cbdr writeback to the upper level, since the control
      BD ring may write data back to the control BD ring.
      Signed-off-by: Po Liu <Po.Liu@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: do not release pool until inflight == 0. · c3f812ce
      Jonathan Lemon authored
      The page pool keeps track of the number of pages in flight, and
      it isn't safe to remove the pool until all pages are returned.
      
      Disallow removing the pool until all pages are back, so the pool
      is always available for page producers.
      
      Make the page pool responsible for its own delayed destruction
      instead of relying on XDP, so the page pool can be used without
      the xdp memory model.
      
      When all pages are returned, free the pool and notify xdp if the
      pool is registered with the xdp memory system.  Have the callback
      perform a table walk since some drivers (cpsw) may share the pool
      among multiple xdp_rxq_info.
      
      Note that the increment of pages_state_release_cnt may result in
      inflight == 0, resulting in the pool being released.
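
      The bookkeeping can be illustrated with a toy pool. This is a sketch
      and not the kernel implementation: ToyPagePool and its methods are
      hypothetical, and the counters only mirror the idea that inflight is
      the difference between pages handed out and pages returned.

```python
class ToyPagePool:
    """Toy pool that is only released once destroy was requested and inflight == 0."""

    def __init__(self, on_release=None):
        self.hold_cnt = 0              # pages handed out
        self.release_cnt = 0           # pages returned
        self.destroy_requested = False
        self.released = False
        self.on_release = on_release   # e.g. notify a registered memory model

    def inflight(self):
        return self.hold_cnt - self.release_cnt

    def alloc_page(self):
        if self.released:
            raise RuntimeError("pool already released")
        self.hold_cnt += 1

    def return_page(self):
        # Incrementing the release counter may bring inflight to zero,
        # which in turn may release the pool.
        self.release_cnt += 1
        self._maybe_release()

    def destroy(self):
        # Destruction is delayed while pages are still in flight.
        self.destroy_requested = True
        self._maybe_release()

    def _maybe_release(self):
        if self.destroy_requested and not self.released and self.inflight() == 0:
            self.released = True
            if self.on_release is not None:
                self.on_release(self)
```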
      
      Fixes: d956a048 ("xdp: force mem allocator removal and periodic warning")
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'smc-last-part-of-termination-improvements' · 3af7ff93
      David S. Miller authored
      Karsten Graul says:
      
      ====================
      last part of termination improvements
      
      Patches 1 and 2 finish the set of termination patches, introducing
      a reboot handler that terminates all link groups. Patch 3 adds an
      rcu_barrier before the module is unloaded, and patch 4 is cleanup.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: remove unused constant · ab8536ca
      Ursula Braun authored
      The constant SMC_CLOSE_WAIT_LISTEN_CLCSOCK_TIME is defined but has not
      been used since commit 3d502067 ("net/smc: simplify wait when closing
      listen socket"). Remove it.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: use rcu_barrier() on module unload · 4ead9c96
      Ursula Braun authored
      Add rcu_barrier() to make sure no RCU readers or callbacks are
      pending when the module is unloaded.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: guarantee removal of link groups in reboot · a33a803c
      Ursula Braun authored
      When rebooting it should be guaranteed all link groups are cleaned
      up and freed.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: introduce bookkeeping of SMCR link groups · 6dabd405
      Ursula Braun authored
      If the smc module is unloaded, return control from the exit routine only
      once all link groups are freed.
      If an IB device is removed, return control from device removal only
      once all link groups belonging to this device are freed.
      Counters for the total number of SMCR link groups and for the total
      number of SMCR links per IB device are introduced. smc module unloading
      continues only if the total number of SMCR link groups is zero. IB device
      removal continues only if the total number of SMCR links per IB device
      has decreased to zero.
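
      The wait-until-zero pattern can be sketched in user space as follows.
      This is an illustration only: LinkGroupCounter is hypothetical and
      uses a condition variable, which is an assumption and not necessarily
      how the smc module implements its counters.

```python
import threading


class LinkGroupCounter:
    """Counter whose waiters are released once it drops back to zero."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def inc(self):
        """Account for a newly created link group."""
        with self._cond:
            self._count += 1

    def dec(self):
        """Account for a freed link group and wake waiters at zero."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_zero(self, timeout=None):
        """Block until the count is zero; return False if the timeout expires first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)
```

      An unload or device-removal path would call wait_zero() and continue
      only after it returns True.
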
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 15 Nov, 2019 18 commits