1. 17 May, 2020 5 commits
    • mptcp: fill skb extension cache outside of mptcp_sendmsg_frag · 149f7c71
      Florian Westphal authored
      The mptcp_sendmsg_frag helper contains a loop that will wait on the
      subflow sk.
      
      It seems preferable to only wait in mptcp_sendmsg() when blocking io is
      requested.  mptcp_sendmsg already has such a wait loop that is used when
      no subflow socket is available for transmission.
      
      This is a preparation patch that makes sure we call
      mptcp_sendmsg_frag only if a skb extension has been allocated.
      
      Moreover, such allocation currently uses GFP_ATOMIC while it
      could use sleeping allocation instead.
      
      Followup patches will remove the wait loop from mptcp_sendmsg_frag()
      and will allow a sleeping allocation for the extension.
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: avoid blocking in tcp_sendpages · 72511aab
      Florian Westphal authored
      The transmit loop continues to xmit new data until an error is returned
      or all data was transmitted.
      
      For the blocking i/o case, this means that tcp_sendpages() may block on
      the subflow until more space becomes available, i.e. we end up sleeping
      with the mptcp socket lock held.
      
      Instead we should check if a different subflow is ready to be used.
      
      This restarts the subflow sk lookup when the tx operation succeeded
      and the tcp subflow can't accept more data or if tcp_sendpages
      indicates -EAGAIN on a blocking mptcp socket.
      
      In that case we also need to set the NOSPACE bit to make sure we get
      notified once memory becomes available.
      
      In case all subflows are busy, the existing logic will wait until a
      subflow is ready, releasing the mptcp socket lock while doing so.
      
      The mptcp worker already sets DONTWAIT, so no need to make changes there.
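
      The lookup-restart idea described above can be sketched in userspace
      C. This is only a rough model, not the kernel code: struct subflow,
      pick_subflow() and send_all() are hypothetical names, and each subflow
      is reduced to a counter of free send space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a TCP subflow: only tracks free send space. */
struct subflow {
    size_t free_space;   /* bytes the subflow can still accept */
    size_t sent;         /* bytes queued on this subflow so far */
};

/* Return the first subflow that can accept data, or NULL if all are busy. */
static struct subflow *pick_subflow(struct subflow *flows, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (flows[i].free_space > 0)
            return &flows[i];
    return NULL;
}

/*
 * Queue up to len bytes. When the current subflow fills up, restart the
 * lookup instead of sleeping on that subflow. Returns the bytes queued;
 * a short count means every subflow was busy and the caller would have
 * to wait (in the kernel, with the mptcp socket lock released).
 */
static size_t send_all(struct subflow *flows, size_t n, size_t len)
{
    size_t done = 0;

    while (done < len) {
        struct subflow *ssk = pick_subflow(flows, n);
        size_t chunk;

        if (!ssk)
            break;                  /* all subflows busy: caller waits */

        chunk = len - done;
        if (chunk > ssk->free_space)
            chunk = ssk->free_space;
        assert(chunk > 0);          /* picked subflow always has space */

        ssk->free_space -= chunk;
        ssk->sent += chunk;
        done += chunk;
    }
    return done;
}
```

      With two subflows that have 100 and 50 bytes free, a 120-byte send
      fills the first and spills 20 bytes onto the second without ever
      blocking on the full one.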
      
      v2:
       * set NOSPACE bit
       * add a comment to clarify that mptcp-sk sndbuf limits need to
         be checked as well.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: break and restart in case mptcp sndbuf is full · fb529e62
      Florian Westphal authored
      It's not enough to check for available tcp send space.
      
      We also hold on to transmitted data for mptcp-level retransmits.
      Right now we will send more and more data if the peer can ack data
      at the tcp level fast enough, since that frees up tcp send buffer space.
      
      But we also need to check that data was acked and reclaimed at the mptcp
      level.
      
      Therefore add needed check in mptcp_sendmsg, flush tcp data and
      wait until more mptcp snd space becomes available if we are over the
      limit.  Before we wait for more data, also make sure we start the
      retransmit timer if we ran out of sndbuf space.
      
      Otherwise there is a very small chance that we wait forever:
      
       * receiver is waiting for data
       * sender is blocked because mptcp socket buffer is full
       * at tcp level, all data was acked
       * mptcp-level snd_una was not updated, because last ack
         that acknowledged the last data packet carried an older
         MPTCP-ack.
      
      Restarting the retransmit timer avoids this problem: if TCP
      subflow is idle, data is retransmitted from the RTX queue.
      
      New data will make the peer send a new, updated MPTCP-Ack.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: move common nospace-pattern to a helper · a0e17064
      Florian Westphal authored
      Paolo noticed that ssk_check_wmem() has the same pattern, so add and
      use a common helper for both places.
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: Drop 'pref medium' in route checks · eb682677
      David Ahern authored
      The 'pref medium' attribute was moved in iproute2 to be near the prefix
      which is where it applies versus after the last nexthop. The nexthop
      tests were updated to drop the string from route checking, but it crept
      in again with the compat tests.
      
      Fixes: 4dddb5be ("selftests: net: add new testcases for nexthop API compat mode sysctl")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 May, 2020 19 commits
  3. 15 May, 2020 16 commits
    • Merge tag 'mlx5-updates-2020-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · ea6119aa
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2020-05-15
      
      mlx5 core and mlx5e (netdev) updates:
      
      1) Two fixes for release all FW pages support.
      2) Improvement in calculating the send queue stop room on tx
      3) Flow steering auto-groups creation improvements
      4) TC offload fix for Connection tracking with NAT action
      5) IPoIB support for self loopback to allow communication between ipoib
      pkey child interfaces on the same host.
      6) DCBNL cleanup to avoid #ifdef DCBNL all over the main mlx5e code
      7) Small and trivial code cleanup
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethernet: ti: am65-cpts: Add missing inline qualifier to stub functions · 2ea46dc6
      Nathan Chancellor authored
      When building with Clang:
      
      In file included from drivers/net/ethernet/ti/am65-cpsw-ethtool.c:15:
      drivers/net/ethernet/ti/am65-cpts.h:58:12: warning: unused function
      'am65_cpts_ns_gettime' [-Wunused-function]
      static s64 am65_cpts_ns_gettime(struct am65_cpts *cpts)
                 ^
      drivers/net/ethernet/ti/am65-cpts.h:63:12: warning: unused function
      'am65_cpts_estf_enable' [-Wunused-function]
      static int am65_cpts_estf_enable(struct am65_cpts *cpts,
                 ^
      drivers/net/ethernet/ti/am65-cpts.h:69:13: warning: unused function
      'am65_cpts_estf_disable' [-Wunused-function]
      static void am65_cpts_estf_disable(struct am65_cpts *cpts, int idx)
                  ^
      3 warnings generated.
      
      These functions need to be marked as inline, which adds __maybe_unused,
      to avoid these warnings, which is the pattern for stub functions.
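
      The stub pattern mentioned above can be illustrated with a small
      sketch. It is not the actual am65-cpts.h contents: HAVE_CPTS is a
      hypothetical stand-in for the kernel config check, int64_t replaces
      s64, the stub's return value of 0 is an assumption, and check_stub()
      exists only to show a caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct am65_cpts;   /* opaque here; the stub never dereferences it */

#ifdef HAVE_CPTS    /* hypothetical macro standing in for the Kconfig test */
int64_t am65_cpts_ns_gettime(struct am65_cpts *cpts);
#else
/*
 * "static inline" rather than plain "static": a file that includes this
 * header without calling the stub is not expected to trigger
 * -Wunused-function.
 */
static inline int64_t am65_cpts_ns_gettime(struct am65_cpts *cpts)
{
    (void)cpts;     /* feature compiled out: nothing to read */
    return 0;       /* assumed stub value for this sketch */
}
#endif

/* Example caller: builds the same way whether or not HAVE_CPTS is set. */
void check_stub(void)
{
#ifndef HAVE_CPTS
    assert(am65_cpts_ns_gettime(NULL) == 0);
#endif
}
```

      Callers do not need any #ifdef of their own; the header decides
      whether they get the real function or the stub.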
      
      Fixes: ec008fa2 ("ethernet: ti: am65-cpts: add routines to support taprio offload")
      Link: https://github.com/ClangBuiltLinux/linux/issues/1026
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Take DCBNL-related definitions into dedicated files · 3f3ab178
      Tariq Toukan authored
      Take DCBNL-related definitions out of the common en.h header and
      use a dedicated header file to expose them.
      Some need not be exposed; use them locally in the .c file.
      Use stubs to eliminate use of CONFIG_MLX5_CORE_EN_DCB in the
      generic control flows.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Calculate SQ stop room in a robust way · 5ffb4d85
      Maxim Mikityanskiy authored
      Currently, different formulas are used to estimate the space that may be
      taken by WQEs in the SQ during a single packet transmit. This space is
      called stop room, and it's checked at the end of packet transmit to find
      out if the next packet could overflow the SQ. If it could, the driver
      tells the kernel to stop sending further packets.
      
      Many factors affect the stop room:
      
      1. Padding with NOPs to avoid WQEs spanning over page boundaries.
      
      2. Enabled and disabled offloads (TLS, upcoming MPWQE).
      
      3. The maximum size of a WQE.
      
      The padding is performed before every WQE if it doesn't fit the current
      page.
      
      The current formula assumes that only one padding will be required per
      packet, and it doesn't take into account that the WQEs posted during the
      transmission of a single packet might exceed the page size in very rare
      circumstances. For example, to hit this condition with 4096-byte pages,
      TLS offload will have to interrupt an almost-full MPWQE session, be in
      the resync flow and try to transmit a near-maximum amount of data.
      
      To avoid SQ overflows in such rare cases after MPWQE is added, this
      patch introduces a more robust formula to estimate the stop room. The
      new formula uses the fact that a WQE of size X will not require more
      than X-1 WQEBBs of padding. More exact estimations are possible, but
      they result in much more complex and error-prone code for little gain.
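
      The bound used by the new formula can be checked with a short sketch.
      The sizes below (64-byte WQEBBs, 4096-byte pages) and the helper names
      padding_needed() and stop_room_for_wqe() are assumptions made for
      illustration, not the driver's definitions.

```c
#include <assert.h>

/* Assumed sizes for the sketch: 64-byte WQEBBs in a 4096-byte page. */
#define WQEBB_SIZE       64u
#define PAGE_BYTES       4096u
#define WQEBBS_PER_PAGE  (PAGE_BYTES / WQEBB_SIZE)

/*
 * NOP padding (in WQEBBs) needed before a WQE of wqe_size WQEBBs posted
 * at producer index pi, so that the WQE does not span a page boundary.
 */
static unsigned int padding_needed(unsigned int pi, unsigned int wqe_size)
{
    /* WQEBBs left in the current page, counted from pi. */
    unsigned int contig = WQEBBS_PER_PAGE - (pi % WQEBBS_PER_PAGE);

    assert(wqe_size >= 1 && wqe_size <= WQEBBS_PER_PAGE);
    /* Pad out the rest of the page only if the WQE does not fit. */
    return contig < wqe_size ? contig : 0;
}

/* Worst case for one WQE: up to wqe_size - 1 of padding plus the WQE. */
static unsigned int stop_room_for_wqe(unsigned int wqe_size)
{
    return (wqe_size - 1) + wqe_size;
}
```

      Padding is only needed when fewer than X WQEBBs remain in the page,
      so it can never exceed X-1, which is where the 2X-1 estimate per WQE
      comes from.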
      
      Before this patch, the TLS stop room included space for both INNOVA and
      ConnectX TLS offloads that couldn't run at the same time anyway, so this
      patch accounts only for the active one.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: IPoIB, Drop multicast packets that this interface sent · 8b46d424
      Erez Shitrit authored
      After enabling loopback packets for IPoIB, we need to drop the packets
      that this HCA has replicated and that came back to the same interface
      that sent them.
      
      Fixes: 4c6c615e ("net/mlx5e: IPoIB, Add PKEY child interface nic profile")
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Reviewed-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: IPoIB, Enable loopback packets for IPoIB interfaces · 80639b19
      Erez Shitrit authored
      Enable loopback of unicast and multicast traffic for IPoIB enhanced
      mode.
      This will allow interfaces with the same pkey to communicate with
      each other, e.g. cloned interfaces that are located in different
      namespaces.
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Reviewed-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: CT: Fix offload with CT action after CT NAT action · 9102d836
      Roi Dayan authored
      A chain of rules may apply the CT action again after a CT NAT action.
      Before this fix, matching would break because we enter the CT table,
      and not the CT NAT table, after the NAT changes.
      Fix this by adding pre ct and pre ct nat tables to skip the ct/ct_nat
      tables and go straight to the post_ct table if ct/nat was already done.
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Move internal timer read function to clock library · 90bf1c8d
      Eran Ben Elisha authored
      Move mlx5_read_internal_timer() into lib/clock.c file as it is being
      used there. As such, make this function a static one.
      
      In addition, rearrange the header includes to support the function move.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: Aya Levin <ayal@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Wait for inactive autogroups · 49c0355d
      Paul Blakey authored
      Currently, if one thread tries to add an entry to an autogrouped table
      with no free matching group, while another thread is in the process of
      creating a new matching autogroup, it doesn't wait for the new group
      creation, and creates an unnecessary new autogroup.
      
      Instead of skipping inactive groups, wait on the write lock of those groups.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Mark Bloch <markb@mellanox.com>
      Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Drain wq first during PCI device removal · 41798df9
      Parav Pandit authored
      mlx5_unload_one() is done with cleanup = true only once.
      
      So instead of draining the health wq inside the if(), do it directly
      during PCI device removal.
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Have single error unwinding path · 4162f58b
      Parav Pandit authored
      Having multiple error unwinding paths is error prone.
      Let's have just one error unwinding path.
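
      A single unwinding path usually means one chain of goto labels that
      release resources in reverse order of acquisition. The sketch below
      shows the general pattern only; struct dev, dev_init(), dev_cleanup()
      and the fail_step parameter are hypothetical and do not reflect the
      mlx5 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device with three resources acquired in order. */
struct dev {
    void *cmd;
    void *pages;
    void *eqs;
};

/*
 * fail_step (1..3) forces the matching allocation to fail so that the
 * unwinding can be exercised; 0 means no forced failure.
 */
static int dev_init(struct dev *d, int fail_step)
{
    assert(d != NULL);
    d->cmd = NULL;
    d->pages = NULL;
    d->eqs = NULL;

    d->cmd = (fail_step == 1) ? NULL : malloc(16);
    if (!d->cmd)
        goto err_out;

    d->pages = (fail_step == 2) ? NULL : malloc(16);
    if (!d->pages)
        goto err_cmd;

    d->eqs = (fail_step == 3) ? NULL : malloc(16);
    if (!d->eqs)
        goto err_pages;

    return 0;

    /* Single unwinding path: labels release in reverse order of setup. */
err_pages:
    free(d->pages);
    d->pages = NULL;
err_cmd:
    free(d->cmd);
    d->cmd = NULL;
err_out:
    return -ENOMEM;
}

/* Full teardown after a successful dev_init(). */
static void dev_cleanup(struct dev *d)
{
    free(d->eqs);
    free(d->pages);
    free(d->cmd);
    d->eqs = NULL;
    d->pages = NULL;
    d->cmd = NULL;
}
```

      Each failure jumps to the label that undoes exactly what has been set
      up so far, so adding a new resource means adding one allocation and
      one label rather than touching several separate cleanup sequences.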
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Fix a bug of releasing wrong chunks on > 4K page size systems · e7f860e2
      Eran Ben Elisha authored
      On systems with page size larger than 4K, a fwp object has a few 4K chunks.
      Fix a bug in fwp free flow where the chunk address was dropped and
      fwp->addr was used instead (first chunk address). This caused a wrong
      update of fwp->bitmask which later can cause errors in re-alloc fwp
      chunk flow.
      
      In order to fix this, refactor the release flow:
      - Free 4k: Releases a specific 4k chunk inside the fwp, defined by
        starting address.
      - Free fwp: Unconditionally release the whole fwp and its resources.
      Free addr will call free fwp if all chunks were released, in order to do
      code sharing.
      
      In addition, fix npages to count for all released chunks correctly.
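
      The chunk bookkeeping described above can be modeled in a few lines.
      This is a simplified sketch under assumed sizes (64K system pages,
      4K chunks); struct fw_page, chunk_index() and free_4k() here are
      illustrative and are not the driver's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 64K system pages split into 4K firmware chunks. */
#define SYS_PAGE_SIZE   65536u
#define CHUNK_SHIFT     12
#define CHUNKS_PER_PAGE (SYS_PAGE_SIZE >> CHUNK_SHIFT)   /* 16 */

/* Hypothetical, simplified fw page: a set bit marks a free chunk. */
struct fw_page {
    uint64_t addr;           /* page-aligned address of the first chunk */
    uint32_t bitmask;        /* free-chunk bitmap */
    unsigned int free_count; /* number of free chunks */
};

/* Index of the 4K chunk that addr falls into within its system page. */
static unsigned int chunk_index(uint64_t addr)
{
    return (unsigned int)((addr & (SYS_PAGE_SIZE - 1)) >> CHUNK_SHIFT);
}

/*
 * Release one chunk. The index must be derived from chunk_addr; deriving
 * it from fwp->addr (the first chunk) would always yield index 0, which
 * is the kind of wrong bitmask update the description refers to.
 */
static void free_4k(struct fw_page *fwp, uint64_t chunk_addr)
{
    unsigned int n = chunk_index(chunk_addr);

    assert(n < CHUNKS_PER_PAGE);
    assert(!(fwp->bitmask & (1u << n)));   /* chunk must be in use */
    fwp->bitmask |= 1u << n;
    fwp->free_count++;
}
```

      On a system with 4K pages there is only one chunk per page, so both
      addresses give index 0 and the mistake stays hidden; it only shows up
      with larger pages.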
      
      Fixes: c6168161 ("net/mlx5: Add support for release all pages event")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Dedicate fw page to the requesting function · 2726cd4a
      Eran Ben Elisha authored
      The cited patch assumes that all chunks in a fw page belong to the same
      function, thus the driver must dedicate a fw page to the requesting
      function, which is actually what was intended in the original fw pages
      allocator design, hence the fwp->func_id!

      Up until the cited patch everything worked ok, but now "release all pages"
      is broken on systems with page_size > 4k.
      
      Fix this by dedicating fw page to the requesting function id via adding a
      func_id parameter to alloc_4k() function.
      
      Fixes: c6168161 ("net/mlx5: Add support for release all pages event")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · da07f52d
      David S. Miller authored
      Move the bpf verifier trace check into the new switch statement in
      HEAD.
      
      Resolve the overlapping changes in hinic, where bug fixes overlap
      the addition of VF support.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f85c1598
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix sk_psock reference count leak on receive, from Xiyu Yang.
      
       2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.
      
       3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this,
          from Maciej Żenczykowski.
      
       4) ipv4 route redirect backoff wasn't actually enforced, from Paolo
          Abeni.
      
       5) Fix netprio cgroup v2 leak, from Zefan Li.
      
       6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.
      
       7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.
      
       8) Various bpf probe handling fixes, from Daniel Borkmann.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
        selftests: mptcp: pm: rm the right tmp file
        dpaa2-eth: properly handle buffer size restrictions
        bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
        bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
        bpf: Restrict bpf_probe_read{, str}() only to archs where they work
        MAINTAINERS: Mark networking drivers as Maintained.
        ipmr: Add lockdep expression to ipmr_for_each_table macro
        ipmr: Fix RCU list debugging warning
        drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
        net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
        tcp: fix error recovery in tcp_zerocopy_receive()
        MAINTAINERS: Add Jakub to networking drivers.
        MAINTAINERS: another add of Karsten Graul for S390 networking
        drivers: ipa: fix typos for ipa_smp2p structure doc
        pppoe: only process PADT targeted at local interfaces
        selftests/bpf: Enforce returning 0 for fentry/fexit programs
        bpf: Enforce returning 0 for fentry/fexit progs
        net: stmmac: fix num_por initialization
        security: Fix the default value of secid_to_secctx hook
        libbpf: Fix register naming in PT_REGS s390 macros
        ...
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · d5dfe4f1
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "A few minor bug fixes for user visible defects, and one regression:
      
         - Various bugs from static checkers and syzkaller
      
         - Add missing error checking in mlx4
      
         - Prevent RTNL lock recursion in i40iw
      
         - Fix segfault in cxgb4 in peer abort cases
      
         - Fix a regression added in 5.7 where the IB_EVENT_DEVICE_FATAL could
           be lost, and wasn't delivered to all the FDs"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/uverbs: Move IB_EVENT_DEVICE_FATAL to destroy_uobj
        RDMA/uverbs: Do not discard the IB_EVENT_DEVICE_FATAL event
        RDMA/iw_cxgb4: Fix incorrect function parameters
        RDMA/core: Fix double put of resource
        IB/core: Fix potential NULL pointer dereference in pkey cache
        IB/hfi1: Fix another case where pq is left on waitlist
        IB/i40iw: Remove bogus call to netdev_master_upper_dev_get()
        IB/mlx4: Test return value of calls to ib_get_cached_pkey
        RDMA/rxe: Always return ERR_PTR from rxe_create_mmap_info()
        i40iw: Fix error handling in i40iw_manage_arp_cache()