1. 10 Sep, 2020 19 commits
    • net: ena: ethtool: convert stat_offset to 64 bit resolution · f1852d64
      Sameeh Jubran authored
      The type of all stat fields is u64, therefore when iterating over stat
      fields in a stats struct, it makes sense to use an offset in 64 bit
      resolution. Doing so allows us to drop some of the casting that is
      currently used when referencing stats.
      Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
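The idea of a stat offset expressed in 64-bit units can be sketched as follows. This is a minimal illustration, not the driver's code: `struct demo_stats`, `DEMO_STAT_OFFSET` and `demo_read_stat` are hypothetical names, and the sketch assumes a stats struct that contains only 64-bit counters with no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stats struct: every field is a 64-bit counter. */
struct demo_stats {
    uint64_t rx_packets;
    uint64_t rx_bytes;
    uint64_t rx_drops;
};

/* Offset of a field counted in u64 slots rather than in bytes. */
#define DEMO_STAT_OFFSET(field) \
    (offsetof(struct demo_stats, field) / sizeof(uint64_t))

/*
 * Read a counter by slot index. Because the offset is already in u64
 * units, plain u64 pointer indexing is enough and no detour through a
 * byte pointer is required. Relies on the struct holding only u64 fields.
 */
static uint64_t demo_read_stat(const struct demo_stats *stats, size_t slot)
{
    assert(slot < sizeof(*stats) / sizeof(uint64_t));
    return ((const uint64_t *)stats)[slot];
}
```

With a byte-resolution offset the same lookup would need a cast to a byte pointer, an addition, and a cast back to a 64-bit pointer.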
    • selftests/mptcp: Better delay & reordering configuration · e5484658
      Christoph Paasch authored
      The delay was intended to be configured to "simulate" a high(er) BDP
      link. As such, it needs to be set as part of the loss-configuration and
      not as part of the netem reordering configuration.
      
      The reordering-config also requires a delay but that delay is the
      reordering extent. So, a good approach is to set the reordering extent
      as a function of the configured latency. E.g., 25% of the overall
      latency.
      
      To speed up the selftests, we limit the delay to 50ms maximum to avoid
      having the selftests run for too long.
      
      Finally, the intention of tc_reorder was that when it is unset, the test
      picks a random configuration. However, currently it is always initialized
      and thus the random config won't be picked up.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/6
      Reported-and-reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
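The arithmetic described in this commit message can be sketched as below. The selftest itself is a shell script; this C fragment only illustrates the two rules stated above (cap the delay at 50 ms, derive the reordering delay as 25% of the latency). The function names and the constant name are hypothetical.

```c
#include <assert.h>

/* Upper bound on the simulated link delay, per the commit message. */
#define DEMO_MAX_DELAY_MS 50

/* Clamp a requested link delay so the tests do not run for too long. */
static int demo_link_delay_ms(int requested_ms)
{
    assert(requested_ms >= 0);
    return requested_ms > DEMO_MAX_DELAY_MS ? DEMO_MAX_DELAY_MS : requested_ms;
}

/* Reordering delay chosen as 25% of the configured link latency. */
static int demo_reorder_delay_ms(int link_delay_ms)
{
    return link_delay_ms / 4;
}
```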
    • Merge branch 'tcp-add-tos-reflection-feature' · d095c462
      David S. Miller authored
      Wei Wang says:
      
      ====================
      tcp: add tos reflection feature
      
      This patch series adds a new tcp feature to reflect TOS value received in
      SYN, and send it out in SYN-ACK, and eventually set the TOS value of the
      established socket with this reflected TOS value. This provides a way to
      set the traffic class/QoS level for all traffic in the same connection
      to be the same as the incoming SYN. It could be useful for datacenters
      to provide equivalent QoS according to the incoming request.
      This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
      default turned off.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: reflect tos value received in SYN to the socket · ac8f1710
      Wei Wang authored
      This commit adds a new TCP feature to reflect the tos value received in
      SYN, and send it out on the SYN-ACK, and eventually set the tos value of
      the established socket with this reflected tos value. This provides a
      way to set the traffic class/QoS level for all traffic in the same
      connection to be the same as the incoming SYN request. It could be
      useful in data centers to provide equivalent QoS according to the
      incoming request.
      This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
      default turned off.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
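The selection logic described here can be modelled in a few lines of userspace C. This is a sketch under assumptions, not the kernel code: the struct and function names are hypothetical, and the `reflect_tos` flag stands in for the value of /proc/sys/net/ipv4/tcp_reflect_tos.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request socket: holds the TOS recorded during the handshake. */
struct demo_request {
    uint8_t syn_tos;
};

/* Hypothetical listening socket with its own configured TOS. */
struct demo_listener {
    uint8_t tos;
};

/*
 * Pick the TOS for the SYN-ACK (and later for the established socket):
 * the value seen in the incoming SYN when reflection is enabled,
 * otherwise the listener's own TOS.
 */
static uint8_t demo_synack_tos(bool reflect_tos,
                               const struct demo_request *req,
                               const struct demo_listener *lsk)
{
    assert(req != NULL && lsk != NULL);
    return reflect_tos ? req->syn_tos : lsk->tos;
}
```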
    • ip: pass tos into ip_build_and_send_pkt() · de033b7d
      Wei Wang authored
      This commit adds tos as a new passed in parameter to
      ip_build_and_send_pkt() which will be used in the later commit.
      This is a pure restructure and does not have any functional change.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: record received TOS value in the request socket · e9b12edc
      Wei Wang authored
      A new field is added to the request sock to record the TOS value
      received on the listening socket during 3WHS:
      When not under SYN flood, it records the TOS value sent in the SYN.
      When under SYN flood, it records the TOS value sent in the ACK.
      This is a preparation patch in order to do TOS reflection in the later
      commit.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mventa: drop mvneta_stats from mvneta_swbm_rx_frame signature · 3a8c4ad1
      Lorenzo Bianconi authored
      Remove mvneta_stats from the mvneta_swbm_rx_frame signature since stats
      are now accounted in the mvneta_run_xdp routine.
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netpoll-make-sure-napi_list-is-safe-for-RCU-traversal' · 6198f446
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      netpoll: make sure napi_list is safe for RCU traversal
      
      This series is a follow-up to the fix in commit 96e97bc0 ("net:
      disable netpoll on fresh napis"). To avoid any latent race conditions
      convert dev->napi_list to a proper RCU list. We need minor restructuring
      because it looks like netif_napi_del() used to be idempotent, and
      it may be quite hard to track down everyone who depends on that.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: make sure napi_list is safe for RCU traversal · 5251ef82
      Jakub Kicinski authored
      netpoll needs to traverse dev->napi_list under RCU, make
      sure it uses the right iterator and that removal from this
      list is handled safely.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: manage napi add/del idempotence explicitly · 4d092dd2
      Jakub Kicinski authored
      To RCUify napi->dev_list we need to replace list_del_init()
      with list_del_rcu(). There is no _init() version for RCU for
      obvious reasons. Up until now netif_napi_del() was idempotent,
      so to make sure it remains such, add a bit which is set when
      NAPI is listed, and cleared when it is removed. Since we don't
      expect multiple calls to netif_napi_add() to be correct,
      add a warning on that side.
      
      Now that napi_hash_add / napi_hash_del are only called by
      napi_add / del we can actually steal its bit. We just need
      to make sure hash node is initialized correctly.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
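The "listed" bit described above can be illustrated with a small userspace model. This is only a sketch: the names are hypothetical, a plain bool stands in for an atomic state bit, and the counter merely makes list operations observable. Deleting twice is a harmless no-op, while adding twice is rejected (the point where the commit adds a warning).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a NAPI instance. */
struct demo_napi {
    bool listed;   /* set while the instance is on the device list */
    int list_ops;  /* counts real list insertions and removals */
};

/* Returns false (the kernel would warn) if the instance is already listed. */
static bool demo_napi_add(struct demo_napi *n)
{
    assert(n != NULL);
    if (n->listed)
        return false;
    n->listed = true;
    n->list_ops++;  /* models adding the entry to the list */
    return true;
}

/* Idempotent delete: only the first call touches the list. */
static bool demo_napi_del(struct demo_napi *n)
{
    assert(n != NULL);
    if (!n->listed)
        return false;
    n->listed = false;
    n->list_ops++;  /* models removing the entry from the list */
    return true;
}
```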
    • net: remove napi_hash_del() from driver-facing API · 5198d545
      Jakub Kicinski authored
      We allow drivers to call napi_hash_del() before calling
      netif_napi_del() to batch RCU grace periods. This makes
      the API asymmetric and leaks internal implementation details.
      Soon we will want the grace period to protect more than just
      the NAPI hash table.
      
      Restructure the API and have drivers call a new function -
      __netif_napi_del() if they want to take care of RCU waits.
      
      Note that only core was checking the return status from
      napi_hash_del() so the new helper does not report if the
      NAPI was actually deleted.
      
      Some notes on driver oddness:
       - veth observed the grace period before calling netif_napi_del()
         but that should not matter
       - myri10ge observed normal RCU flavor
       - bnx2x and enic did not actually observe the grace period
         (unless they did so implicitly)
       - virtio_net and enic only unhashed Rx NAPIs
      
      The last two points seem to indicate that the calls to
      napi_hash_del() were a leftover rather than an optimization.
      Regardless, it's easy enough to correct them.
      
      This patch may introduce extra synchronize_net() calls for
      interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
      free_netdev() to call netif_napi_del(). This seems inevitable
      since we want to use RCU for netpoll dev->napi_list traversal,
      and almost no drivers set IFF_DISABLE_NETPOLL.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlx4-avoid-devlink-port-type-not-set-warnings' · 8b40f21b
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      mlx4: avoid devlink port type not set warnings
      
      This small set addresses the issue of mlx4 potentially not setting
      devlink port type when Ethernet or IB driver is not built, but
      port has that type.
      
      v2:
       - add patch 1
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: make sure to always set the port type · 0313c7c2
      Jakub Kicinski authored
      Even though mlx4_core registers the devlink ports, it's mlx4_en
      and mlx4_ib which set their type. In situations where one of
      the two is not built, yet the machine has ports of the given type,
      we see the devlink warning from devlink_port_type_warn() trigger.
      
      Having ports of a type not supported by the kernel may seem
      surprising, but it does occur in practice - when the unsupported
      port is not plugged in to a switch anyway, users are more than happy
      not to see it (and potentially allocate any resources to it).
      
      Set the type in mlx4_core if type-specific driver is not built.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: don't crash if netdev is NULL · 3ea87ca7
      Jakub Kicinski authored
      A following change will add support for a corner case where
      we may not have a netdev to pass to devlink_port_type_eth_set()
      but we still want to set port type.
      
      This is definitely a corner case, and drivers should not normally
      pass NULL netdev - print a warning message when this happens.
      
      Sadly for other port types (ib) switches don't have a device
      reference, the way we always do for Ethernet, so we can't put
      the warning in __devlink_port_type_set().
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvneta: rely on MVNETA_MAX_RX_BUF_SIZE for pkt split in mvneta_swbm_rx_frame() · 6eb8b7fb
      Lorenzo Bianconi authored
      In order to easily change the rx buffer size, rely on
      MVNETA_MAX_RX_BUF_SIZE instead of PAGE_SIZE in mvneta_swbm_rx_frame
      routine for rx buffer split. Currently this is not an issue since we set
      MVNETA_MAX_RX_BUF_SIZE to PAGE_SIZE - MVNETA_SKB_PAD, but it is good to
      be able to configure a different rx buffer size.
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Allow-more-than-255-IPv4-multicast-interfaces' · 8c5c49a6
      David S. Miller authored
      Paul Davey says:
      
      ====================
      Allow more than 255 IPv4 multicast interfaces
      
      Currently it is not possible to use more than 255 multicast interfaces
      for IPv4 due to the format of the igmpmsg header which only has 8 bits
      available for the VIF ID.  There is space available in the igmpmsg
      header to store the full VIF ID in the form of an unused byte following
      the VIF ID field.  There is also enough space for the full VIF ID in
      the Netlink cache notifications, however the value is currently taken
      directly from the igmpmsg header and has thus already been truncated.
      
      Adding the high byte of the VIF ID into the unused3 byte of igmpmsg
      allows use of more than 255 IPv4 multicast interfaces. The full VIF ID
      is also available in the Netlink notification by assembling it from
      both bytes from the igmpmsg.
      
      Additionally this reveals a deficiency in the Netlink cache report
      notifications: they lack any means for differentiating cache reports
      relating to different multicast routing tables.  This is easily
      resolved by adding the multicast route table ID to the cache reports.
      
      changes in v2:
       - Added high byte of VIF ID to igmpmsg struct replacing unused3
         member.
       - Assemble VIF ID in Netlink notification from both bytes in igmpmsg
         header.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipmr: Use full VIF ID in netlink cache reports · bb82067c
      Paul Davey authored
      Insert the full 16 bit VIF ID into ipmr Netlink cache reports.
      
      The VIF_ID attribute has 32 bits of space so can store the full VIF ID
      extracted from the high and low byte fields in the igmpmsg.
      Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipmr: Add high byte of VIF ID to igmpmsg · c8715a8e
      Paul Davey authored
      Use the unused3 byte in struct igmpmsg to hold the high 8 bits of the
      VIF ID.
      
      Using more than 255 IPv4 multicast interfaces requires a VIF ID for
      cache reports that is wider than 8 bits, but the VIF ID in the igmpmsg
      reports sent to mroute_sk was only 8 bits wide. Adding the high 8 bits
      of the 16 bit VIF ID in the unused byte allows use of more than 255
      IPv4 multicast interfaces.
      Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
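Splitting a 16-bit VIF ID across a low byte and a formerly unused high byte, and reassembling it on the receiving side, might look like this. The sketch uses a hypothetical `struct demo_igmpmsg` that models only the two relevant bytes; the field name `im_vif_hi` and the helper names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two VIF ID bytes in the message header. */
struct demo_igmpmsg {
    uint8_t im_vif;     /* low 8 bits of the VIF ID */
    uint8_t im_vif_hi;  /* high 8 bits, stored in the previously unused byte */
};

/* Sender side: split the 16-bit VIF ID across both bytes. */
static void demo_set_vif(struct demo_igmpmsg *msg, uint16_t vif)
{
    assert(msg != NULL);
    msg->im_vif = (uint8_t)(vif & 0xff);
    msg->im_vif_hi = (uint8_t)(vif >> 8);
}

/* Receiver side: reassemble the full VIF ID from both bytes. */
static uint16_t demo_get_vif(const struct demo_igmpmsg *msg)
{
    assert(msg != NULL);
    return (uint16_t)(msg->im_vif | (msg->im_vif_hi << 8));
}
```

A reader that looks only at the low byte still sees the same value as before for VIF IDs below 256, since the high byte is zero in that case.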
    • ipmr: Add route table ID to netlink cache reports · 501cb008
      Paul Davey authored
      Insert the multicast route table ID as a Netlink attribute to Netlink
      cache report notifications.
      
      When multiple route tables are in use it is necessary to have a way to
      determine which route table a given cache report belongs to when
      receiving the cache report.
      Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 Sep, 2020 21 commits