1. 17 Nov, 2020 32 commits
  2. 16 Nov, 2020 8 commits
    • Merge branch 'mptcp-improve-multiple-xmit-streams-support' · 72308ecb
      Jakub Kicinski authored
      Paolo Abeni says:
      
      ====================
      mptcp: improve multiple xmit streams support
      
      This series improves MPTCP handling of multiple concurrent
      xmit streams.
      
      The to-be-transmitted data is enqueued to a subflow only when
      the send window is open, keeping the subflows' xmit queue shorter
      and allowing for faster switch-over.
      
      The above requires a more accurate msk socket state tracking
      and some additional infrastructure to allow pushing the data
      pending in the msk xmit queue as soon as the MPTCP send window
      opens (patches 6-10).
      
      As a side effect, the MPTCP socket could enqueue data to subflows
      after close() time, to completely spool the data sitting in the
      msk xmit queue. Dealing with that requires some infrastructure and
      core TCP changes (patches 1-5).
      
      Finally, patches 11-12 introduce a more accurate tracking of the other
      end's receive window.
      
      Overall this refactors the MPTCP xmit path without introducing
      new features; the new code is covered by the existing self-tests.
      
      v2 -> v3:
       - rebased,
       - fixed checkpatch issue in patch 1/13
       - fixed some state tracking issues in patch 8/13
      
      v1 -> v2:
       - this is just a repost, to cope with patchwork issues, no changes
         at all
      ====================
      
      Link: https://lore.kernel.org/r/cover.1605458224.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      72308ecb
    • mptcp: send explicit ack on delayed ack_seq incr · 7ed90803
      Paolo Abeni authored
      When the worker moves some bytes from the OoO queue into
      the receive queue, msk->ack_seq is updated, but the MPTCP-level
      ack carrying that value has to wait for the next ingress packet,
      possibly slowing down or hanging the peer.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7ed90803
    • mptcp: keep track of advertised windows right edge · 6f8a612a
      Florian Westphal authored
      Before sending 'x' new bytes also check that the new snd_una would
      be within the permitted receive window.
      
      For every ACK that also contains a DSS ack, check whether its tcp-level
      receive window would advance the current mptcp window right edge and
      update it if so.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Co-developed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6f8a612a
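The right-edge tracking described in this commit can be illustrated with a small user-space sketch. It is not the kernel code: `struct win_state`, `seq_after()`, `update_wnd_end()` and `can_send()` are hypothetical names chosen for illustration, and the sketch assumes the check is made against the next sequence number to be sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified connection state; not the kernel's struct. */
struct win_state {
    uint64_t snd_nxt;  /* next MPTCP-level sequence number to send */
    uint64_t wnd_end;  /* right edge of the peer's advertised window */
};

/* Wrap-safe "a is after b" comparison on 64-bit sequence numbers. */
static bool seq_after(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) > 0;
}

/*
 * On an ACK that carries a DSS ack: compute the right edge implied by
 * the TCP-level window and advance the stored edge, never moving it back.
 */
static void update_wnd_end(struct win_state *ws, uint64_t data_ack,
                           uint32_t tcp_wnd)
{
    uint64_t new_end = data_ack + tcp_wnd;

    assert(ws);
    if (seq_after(new_end, ws->wnd_end))
        ws->wnd_end = new_end;
}

/* Before sending len new bytes: would they still fit inside the window? */
static bool can_send(const struct win_state *ws, uint32_t len)
{
    assert(ws);
    return !seq_after(ws->snd_nxt + len, ws->wnd_end);
}
```

An ACK that would shrink the window is simply ignored by `update_wnd_end()`, which mirrors the "update it if so" wording above.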
    • mptcp: rework poll+nospace handling · 8edf0864
      Florian Westphal authored
      MPTCP maintains a status bit, MPTCP_SEND_SPACE, that is set when at
      least one subflow and the mptcp socket itself are writeable.
      
      mptcp_poll returns EPOLLOUT if the bit is set.
      
      mptcp_sendmsg makes sure MPTCP_SEND_SPACE gets cleared when the last
      write has used up all subflows or the mptcp socket wmem.
      
      This reworks nospace handling as follows:
      
      MPTCP_SEND_SPACE is replaced with MPTCP_NOSPACE, i.e. inverted meaning.
      This bit is set when the mptcp socket is not writeable.
      The mptcp-level ack path will then schedule the mptcp worker
      to allow it to free already-acked data (and reduce wmem usage).
      
      This will then wake userspace processes that wait for a POLLOUT event.
      
      sendmsg will set MPTCP_NOSPACE only when it has to wait for more
      wmem (blocking I/O case).
      
      poll path will set MPTCP_NOSPACE in case the mptcp socket is
      not writeable.
      
      Normal tcp-level notification (SOCK_NOSPACE) is only enabled
      in case the subflow socket has no available wmem.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8edf0864
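The inverted-flag flow described in this commit (poll marks the socket as out of space, the ack path schedules the worker, the worker frees acked data and wakes waiters) might look roughly like the toy model below. Everything here is an assumption made for illustration: `struct toy_sock`, `SK_NOSPACE` and the `toy_*` helpers are hypothetical and only stand in for the kernel's MPTCP_NOSPACE handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_NOSPACE (1u << 0)  /* hypothetical stand-in for MPTCP_NOSPACE */

/* Hypothetical, simplified socket state. */
struct toy_sock {
    unsigned int flags;
    size_t wmem_queued;     /* bytes currently held in the write buffer */
    size_t sndbuf;          /* write buffer limit */
    bool worker_scheduled;  /* true once the ack path asked for the worker */
};

static bool toy_writeable(const struct toy_sock *sk)
{
    return sk->wmem_queued < sk->sndbuf;
}

/* poll path: report writeability; set the flag only when not writeable. */
static bool toy_poll_out(struct toy_sock *sk)
{
    assert(sk);
    if (toy_writeable(sk))
        return true;
    sk->flags |= SK_NOSPACE;
    return false;
}

/* ack path: if someone is waiting for space, schedule the worker. */
static void toy_data_acked(struct toy_sock *sk)
{
    assert(sk);
    if (sk->flags & SK_NOSPACE)
        sk->worker_scheduled = true;
}

/*
 * worker: free already-acked data; once the socket is writeable again,
 * clear the flag and return true to signal that waiters should be woken.
 */
static bool toy_worker(struct toy_sock *sk, size_t acked)
{
    assert(sk);
    sk->worker_scheduled = false;
    sk->wmem_queued -= acked < sk->wmem_queued ? acked : sk->wmem_queued;
    if (!toy_writeable(sk))
        return false;
    sk->flags &= ~SK_NOSPACE;
    return true;
}
```

The flag is only set on the slow path (no space left), so the common case of a writeable socket does no extra bookkeeping.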
    • mptcp: try to push pending data on snd una updates · 813e0a68
      Paolo Abeni authored
      After the previous patch we may end up with unsent data
      in the write buffer. If that buffer is full, the writer
      will block indefinitely.
      
      We need to trigger the MPTCP xmit path from the
      subflow rx path as well, on MPTCP snd_una updates.
      
      Keep things simple and just schedule the work queue if
      needed.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      813e0a68
    • mptcp: move page frag allocation in mptcp_sendmsg() · d9ca1de8
      Paolo Abeni authored
      mptcp_sendmsg() is refactored so that it first copies
      the data provided from user space into the send queue,
      and then tries to spool the send queue via sendmsg_frag.
      
      There is a subtle change in the mptcp-level collapsing of
      consecutive data fragments: we now allow that only on unsent
      data.
      
      sendmsg_frag no longer needs to deal with msghdr data
      and can be simplified considerably.
      
      snd_nxt and write_seq are now tracked independently.
      
      Overall this allows some significant cleanup and will
      allow sending pending mptcp data on msk una update in
      a later patch.
      Co-developed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d9ca1de8
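The "copy first, spool later" split described in this commit, with write_seq and snd_nxt moving independently, can be sketched as below. This is a toy model under stated assumptions: `struct toy_msk`, `toy_enqueue()` and `toy_push_pending()` are hypothetical names, the `budget` parameter stands in for whatever the send window and subflows currently allow, and sequence wrap-around is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified MPTCP-level send state. */
struct toy_msk {
    uint64_t snd_una;    /* oldest unacknowledged sequence number */
    uint64_t snd_nxt;    /* next sequence number to hand to a subflow */
    uint64_t write_seq;  /* sequence number after the last queued byte */
};

/* Step 1: the writer queues len bytes; only write_seq advances. */
static void toy_enqueue(struct toy_msk *msk, uint32_t len)
{
    assert(msk);
    msk->write_seq += len;
}

/*
 * Step 2: push queued-but-unsent data, limited by budget bytes.
 * Only snd_nxt advances. Returns the number of bytes pushed.
 */
static uint32_t toy_push_pending(struct toy_msk *msk, uint32_t budget)
{
    uint64_t pending;
    uint32_t n;

    assert(msk);
    pending = msk->write_seq - msk->snd_nxt;
    n = pending < budget ? (uint32_t)pending : budget;
    msk->snd_nxt += n;
    return n;
}
```

With this split, data in the range [snd_nxt, write_seq) is queued but unsent, which is the only data the commit allows to be collapsed into an existing fragment.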
    • mptcp: refactor shutdown and close · e16163b6
      Paolo Abeni authored
      We must not close the subflows before all the MPTCP-level
      data, including the DATA_FIN, has been acked at the MPTCP
      level, otherwise we could be unable to retransmit as needed.
      
      __mptcp_wr_shutdown() is responsible for checking for the
      correct status and closing all subflows. It is called by the output
      path after spooling any data and at shutdown/close time.
      
      In a similar way, __mptcp_destroy_sock() is responsible for cleaning up
      the MPTCP-level status, and is called when the msk transitions
      to TCP_CLOSE.
      
      The protocol-level close() no longer forces the TCP_CLOSE
      status, but orphans the msk socket and all the subflows.
      Orphaned msk sockets are forcibly closed after a timeout or
      when all MPTCP-level data is acked.
      
      There is a caveat about keeping the orphaned subflows around:
      the TCP stack can asynchronously call tcp_cleanup_ulp() on them via
      tcp_close(). To prevent accessing freed memory on later MPTCP-level
      operations, the msk acquires a reference to each subflow
      socket and prevents subflow_ulp_release() from releasing the
      subflow context before __mptcp_destroy_sock().
      
      The additional subflow references are released by __mptcp_done(),
      and the async ULP release is detected by checking the ULP ops. If that
      field has already been cleared by the ULP release path, the
      dangling context is freed directly by __mptcp_done().
      Co-developed-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e16163b6
    • mptcp: introduce MPTCP snd_nxt · eaa2ffab
      Paolo Abeni authored
      Track the next MPTCP sequence number used on xmit,
      currently always equal to write_seq.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      eaa2ffab