An error occurred fetching the project authors.
  1. 22 Jul, 2020 1 commit
    • Eric Dumazet's avatar
      tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key() · d9e26f65
      Eric Dumazet authored
      [ Upstream commit 6a2febec ]
      
      MD5 keys are read with RCU protection, and tcp_md5_do_add()
      might update in-place a prior key.
      
      Normally, typical RCU updates would allocate a new piece
      of memory. In this case only key->key and key->keylen might
      be updated, and we do not care if an incoming packet could
      see the old key, the new one, or some intermediate value,
      since changing the key on a live flow is known to be problematic
      anyway.
      
      We only want to make sure that in the case key->keylen
      is changed, cpus in tcp_md5_hash_key() wont try to use
      uninitialized data, or crash because key->keylen was
      read twice to feed sg_init_one() and ahash_request_set_crypt()
      
      Fixes: 9ea88a15 ("tcp: md5: check md5 signature without socket lock")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9e26f65
  2. 14 Feb, 2020 4 commits
  3. 04 Aug, 2019 1 commit
  4. 17 Jun, 2019 1 commit
    • Eric Dumazet's avatar
      tcp: limit payload size of sacked skbs · cc1b58cc
      Eric Dumazet authored
      commit 3b4929f6 upstream.
      
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertized the smallest
      MSS that linux TCP accepts : 48
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Backport notes, provided by Joao Martins <joao.m.martins@oracle.com>
      
      v4.15 or since commit 737ff314 ("tcp: use sequence distance to
      detect reordering") had switched from the packet-based FACK tracking and
      switched to sequence-based.
      
      v4.14 and older still have the old logic and hence on
      tcp_skb_shift_data() needs to retain its original logic and have
      @fack_count in sync. In other words, we keep the increment of pcount with
      tcp_skb_pcount(skb) to later used that to update fack_count. To make it
      more explicit we track the new skb that gets incremented to pcount in
      @next_pcount, and we get to avoid the constant invocation of
      tcp_skb_pcount(skb) all together.
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarJonathan Looney <jtl@netflix.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Reviewed-by: default avatarTyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc1b58cc
  5. 23 Feb, 2019 1 commit
  6. 24 Aug, 2018 1 commit
  7. 25 Jul, 2018 1 commit
  8. 19 May, 2018 1 commit
  9. 16 May, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: fix TCP_REPAIR_QUEUE bound checking · 869f5381
      Eric Dumazet authored
      commit bf2acc94 upstream.
      
      syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out()
      with following C-repro :
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
      sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
      	1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
      setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
      writev(3, [{"\270", 1}], 1)             = 1
      setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
      writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144
      
      The 3rd system call looks odd :
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      
      This patch makes sure bound checking is using an unsigned compare.
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      869f5381
  10. 29 Apr, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets · 228ce13c
      Eric Dumazet authored
      [ Upstream commit 72123032 ]
      
      syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]
      
      I believe this was caused by a TCP_MD5SIG being set on live
      flow.
      
      This is highly unexpected, since TCP option space is limited.
      
      For instance, presence of TCP MD5 option automatically disables
      TCP TimeStamp option at SYN/SYNACK time, which we can not do
      once flow has been established.
      
      Really, adding/deleting an MD5 key only makes sense on sockets
      in CLOSE or LISTEN state.
      
      [1]
      BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
      CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
       tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
       tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
       tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
       tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __release_sock+0x2d6/0x680 net/core/sock.c:2271
       release_sock+0x97/0x2a0 net/core/sock.c:2786
       tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
       inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
       SyS_sendto+0x8a/0xb0 net/socket.c:1715
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x448fe9
      RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9
      RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004
      RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010
      R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000
      R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
       kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
       kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
       slab_post_alloc_hook mm/slab.h:445 [inline]
       slab_alloc_node mm/slub.c:2737 [inline]
       __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
       __kmalloc_reserve net/core/skbuff.c:138 [inline]
       __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
       alloc_skb include/linux/skbuff.h:984 [inline]
       tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
       __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
       tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
       tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
       tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __release_sock+0x2d6/0x680 net/core/sock.c:2271
       release_sock+0x97/0x2a0 net/core/sock.c:2786
       tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
       inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
       SyS_sendto+0x8a/0xb0 net/socket.c:1715
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: cfb6eeb4 ("[TCP]: MD5 Signature Option (RFC2385) support.")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      228ce13c
  11. 13 Feb, 2018 1 commit
  12. 31 Jan, 2018 1 commit
    • Dan Streetman's avatar
      net: tcp: close sock if net namespace is exiting · cf67be7a
      Dan Streetman authored
      [ Upstream commit 4ee806d5 ]
      
      When a tcp socket is closed, if it detects that its net namespace is
      exiting, close immediately and do not wait for FIN sequence.
      
      For normal sockets, a reference is taken to their net namespace, so it will
      never exit while the socket is open.  However, kernel sockets do not take a
      reference to their net namespace, so it may begin exiting while the kernel
      socket is still open.  In this case if the kernel socket is a tcp socket,
      it will stay open trying to complete its close sequence.  The sock's dst(s)
      hold a reference to their interface, which are all transferred to the
      namespace's loopback interface when the real interfaces are taken down.
      When the namespace tries to take down its loopback interface, it hangs
      waiting for all references to the loopback interface to release, which
      results in messages like:
      
      unregister_netdevice: waiting for lo to become free. Usage count = 1
      
      These messages continue until the socket finally times out and closes.
      Since the net namespace cleanup holds the net_mutex while calling its
      registered pernet callbacks, any new net namespace initialization is
      blocked until the current net namespace finishes exiting.
      
      After this change, the tcp socket notices the exiting net namespace, and
      closes immediately, releasing its dst(s) and their reference to the
      loopback interface, which lets the net namespace continue exiting.
      
      Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811Signed-off-by: default avatarDan Streetman <ddstreet@canonical.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf67be7a
  13. 02 Jan, 2018 1 commit
  14. 21 Nov, 2017 1 commit
    • Soheil Hassas Yeganeh's avatar
      tcp: provide timestamps for partial writes · fe975496
      Soheil Hassas Yeganeh authored
      [ Upstream commit ad02c4f5 ]
      
      For TCP sockets, TX timestamps are only captured when the user data
      is successfully and fully written to the socket. In many cases,
      however, TCP writes can be partial for which no timestamp is
      collected.
      
      Collect timestamps whenever any user data is (fully or partially)
      copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
      instead of the local skb pointer since it can be set to NULL on
      the error path.
      
      Note that tcp_write_queue_tail can be NULL, even if bytes have been
      copied to the socket. This is because acknowledgements are being
      processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
      called tcp_write_queue_tail can be NULL. For such cases, this patch
      does not collect any timestamps (i.e., it is best-effort).
      
      This patch is written with suggestions from Willem de Bruijn and
      Eric Dumazet.
      
      Change-log V1 -> V2:
      	- Use sockc.tsflags instead of sk->sk_tsflags.
      	- Use the same code path for normal writes and errors.
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fe975496
  15. 20 Sep, 2017 1 commit
  16. 21 Jul, 2017 1 commit
  17. 07 Jun, 2017 1 commit
    • Wei Wang's avatar
      tcp: avoid fastopen API to be used on AF_UNSPEC · 4b81271e
      Wei Wang authored
      [ Upstream commit ba615f67 ]
      
      Fastopen API should be used to perform fastopen operations on the TCP
      socket. It does not make sense to use fastopen API to perform disconnect
      by calling it with AF_UNSPEC. The fastopen data path is also prone to
      race conditions and bugs when using with AF_UNSPEC.
      
      One issue reported and analyzed by Vegard Nossum is as follows:
      +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      Thread A:                            Thread B:
      ------------------------------------------------------------------------
      sendto()
       - tcp_sendmsg()
           - sk_stream_memory_free() = 0
               - goto wait_for_sndbuf
      	     - sk_stream_wait_memory()
      	        - sk_wait_event() // sleep
                |                          sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
      	  |                           - tcp_sendmsg()
      	  |                              - tcp_sendmsg_fastopen()
      	  |                                 - __inet_stream_connect()
      	  |                                    - tcp_disconnect() //because of AF_UNSPEC
      	  |                                       - tcp_transmit_skb()// send RST
      	  |                                    - return 0; // no reconnect!
      	  |                           - sk_stream_wait_connect()
      	  |                                 - sock_error()
      	  |                                    - xchg(&sk->sk_err, 0)
      	  |                                    - return -ECONNRESET
      	- ... // wake up, see sk->sk_err == 0
          - skb_entail() on TCP_CLOSE socket
      
      If the connection is reopened then we will send a brand new SYN packet
      after thread A has already queued a buffer. At this point I think the
      socket internal state (sequence numbers etc.) becomes messed up.
      
      When the new connection is closed, the FIN-ACK is rejected because the
      sequence number is outside the window. The other side tries to
      retransmit,
      but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which
      corrupts the skb data length and hits a BUG() in copy_and_csum_bits().
      +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
      
      Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
      and return EOPNOTSUPP to user if such case happens.
      
      Fixes: cf60af03 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b81271e
  18. 03 May, 2017 1 commit
    • Eric Dumazet's avatar
      tcp: clear saved_syn in tcp_disconnect() · 479beb4c
      Eric Dumazet authored
      [ Upstream commit 17c3060b ]
      
      In the (very unlikely) case a passive socket becomes a listener,
      we do not want to duplicate its saved SYN headers.
      
      This would lead to double frees, use after free, and please hackers and
      various fuzzers
      
      Tested:
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, IPPROTO_TCP, TCP_SAVE_SYN, [1], 4) = 0
         +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
      
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 5) = 0
      
         +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
        +.1 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 connect(4, AF_UNSPEC, ...) = 0
         +0 close(3) = 0
         +0 bind(4, ..., ...) = 0
         +0 listen(4, 5) = 0
      
         +0 < S 0:0(0) win 32972 <mss 1460,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
        +.1 < . 1:1(0) ack 1 win 257
      
      Fixes: cd8ae852 ("tcp: provide SYN headers for passive connections")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      479beb4c
  19. 18 Feb, 2017 1 commit
  20. 03 Nov, 2016 2 commits
    • Eric Dumazet's avatar
      tcp: fix return value for partial writes · 79d8665b
      Eric Dumazet authored
      After my commit, tcp_sendmsg() might restart its loop after
      processing socket backlog.
      
      If sk_err is set, we blindly return an error, even though we
      copied data to user space before.
      
      We should instead return number of bytes that could be copied,
      otherwise user space might resend data and corrupt the stream.
      
      This might happen if another thread is using recvmsg(MSG_ERRQUEUE)
      to process timestamps.
      
      Issue was diagnosed by Soheil and Willem, big kudos to them !
      
      Fixes: d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Tested-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      79d8665b
    • Eric Dumazet's avatar
      tcp: fix potential memory corruption · ac9e70b1
      Eric Dumazet authored
      Imagine initial value of max_skb_frags is 17, and last
      skb in write queue has 15 frags.
      
      Then max_skb_frags is lowered to 14 or smaller value.
      
      tcp_sendmsg() will then be allowed to add additional page frags
      and eventually go past MAX_SKB_FRAGS, overflowing struct
      skb_shared_info.
      
      Fixes: 5f74f82e ("net:Add sysctl_max_skb_frags")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
      Cc: Håkon Bugge <haakon.bugge@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ac9e70b1
  21. 08 Oct, 2016 1 commit
  22. 04 Oct, 2016 1 commit
    • Al Viro's avatar
      skb_splice_bits(): get rid of callback · 25869262
      Al Viro authored
      since pipe_lock is the outermost now, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which kills the need to have a socket-specific callback; we can just
      call splice_to_pipe() and be done with that.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      25869262
  23. 21 Sep, 2016 4 commits
    • Yuchung Cheng's avatar
      tcp: export data delivery rate · eb8329e0
      Yuchung Cheng authored
      This commit export two new fields in struct tcp_info:
      
        tcpi_delivery_rate: The most recent goodput, as measured by
          tcp_rate_gen(). If the socket is limited by the sending
          application (e.g., no data to send), it reports the highest
          measurement instead of the most recent. The unit is bytes per
          second (like other rate fields in tcp_info).
      
        tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
          was measured when the socket's throughput was limited by the
          sending application.
      
      This delivery rate information can be useful for applications that
      want to know the current throughput the TCP connection is seeing,
      e.g. adaptive bitrate video streaming. It can also be very useful for
      debugging or troubleshooting.
      Signed-off-by: default avatarVan Jacobson <vanj@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNandita Dukkipati <nanditad@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eb8329e0
    • Soheil Hassas Yeganeh's avatar
      tcp: track application-limited rate samples · d7722e85
      Soheil Hassas Yeganeh authored
      This commit adds code to track whether the delivery rate represented
      by each rate_sample was limited by the application.
      
      Upon each transmit, we store in the is_app_limited field in the skb a
      boolean bit indicating whether there is a known "bubble in the pipe":
      a point in the rate sample interval where the sender was
      application-limited, and did not transmit even though the cwnd and
      pacing rate allowed it.
      
      This logic marks the flow app-limited on a write if *all* of the
      following are true:
      
        1) There is less than 1 MSS of unsent data in the write queue
           available to transmit.
      
        2) There is no packet in the sender's queues (e.g. in fq or the NIC
           tx queue).
      
        3) The connection is not limited by cwnd.
      
        4) There are no lost packets to retransmit.
      
      The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
      the connection is application-limited at the moment. If the flow is
      application-limited, it sets the tp->app_limited field. If the flow is
      application-limited then that means there is effectively a "bubble" of
      silence in the pipe now, and this silence will be reflected in a lower
      bandwidth sample for any rate samples from now until we get an ACK
      indicating this bubble has exited the pipe: specifically, until we get
      an ACK for the next packet we transmit.
      
      When we send every skb we record in scb->tx.is_app_limited whether the
      resulting rate sample will be application-limited.
      
      The code in tcp_rate_gen() checks to see when it is safe to mark all
      known application-limited bubbles of silence as having exited the
      pipe. It does this by checking to see when the delivered count moves
      past the tp->app_limited marker. At this point it zeroes the
      tp->app_limited marker, as all known bubbles are out of the pipe.
      
      We make room for the tx.is_app_limited bit in the skb by borrowing a
      bit from the in_flight field used by NV to record the number of bytes
      in flight. The receive window in the TCP header is 16 bits, and the
      max receive window scaling shift factor is 14 (RFC 1323). So the max
      receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
      only need 30 bits for the tx.in_flight used by NV.
      Signed-off-by: default avatarVan Jacobson <vanj@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNandita Dukkipati <nanditad@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d7722e85
    • Eric Dumazet's avatar
      tcp: switch back to proper tcp_skb_cb size check in tcp_init() · b2d3ea4a
      Eric Dumazet authored
      Revert to the tcp_skb_cb size check that tcp_init() had before commit
      b4772ef8 ("net: use common macro for assering skb->cb[] available
      size in protocol families"). As related commit 744d5a3e ("net:
      move skb->dropcount to skb->cb[]") explains, the
      sock_skb_cb_check_size() mechanism was added to ensure that there is
      space for dropcount, "for protocol families using it". But TCP is not
      a protocol using dropcount, so tcp_init() doesn't need to provision
      space for dropcount in the skb->cb[], and thus we can revert to the
      older form of the tcp_skb_cb size check. Doing so allows TCP to use 4
      more bytes of the skb->cb[] space.
      
      Fixes: b4772ef8 ("net: use common macro for assering skb->cb[] available size in protocol families")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b2d3ea4a
    • Neal Cardwell's avatar
      tcp: use windowed min filter library for TCP min_rtt estimation · 64033892
      Neal Cardwell authored
      Refactor the TCP min_rtt code to reuse the new win_minmax library in
      lib/win_minmax.c to simplify the TCP code.
      
      This is a pure refactor: the functionality is exactly the same. We
      just moved the windowed min code to make TCP easier to read and
      maintain, and to allow other parts of the kernel to use the windowed
      min/max filter code.
      Signed-off-by: default avatarVan Jacobson <vanj@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNandita Dukkipati <nanditad@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      64033892
  24. 17 Sep, 2016 1 commit
    • Eric Dumazet's avatar
      tcp: prepare skbs for better sack shifting · 3613b3db
      Eric Dumazet authored
      With large BDP TCP flows and lossy networks, it is very important
      to keep a low number of skbs in the write queue.
      
      RACK and SACK processing can perform a linear scan of it.
      
      We should avoid putting any payload in skb->head, so that SACK
      shifting can be done if needed.
      
      With this patch, we allow to pack ~0.5 MB per skb instead of
      the 64KB initially cooked at tcp_sendmsg() time.
      
      This gives a reduction of number of skbs in write queue by eight.
      tcp_rack_detect_loss() likes this.
      
      We still allow payload in skb->head for first skb put in the queue,
      to not impact RPC workloads.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3613b3db
  25. 09 Sep, 2016 1 commit
    • Yaogong Wang's avatar
      tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang authored
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering to reach the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to a RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: default avatarYaogong Wang <wygivan@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: default avatarIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9f5afeae
  26. 29 Aug, 2016 1 commit
  27. 24 Aug, 2016 1 commit
  28. 20 Aug, 2016 1 commit
  29. 01 Jul, 2016 1 commit
  30. 30 Jun, 2016 1 commit
    • Andrey Vagin's avatar
      tcp: add an ability to dump and restore window parameters · b1ed4c4f
      Andrey Vagin authored
      We found that sometimes a restored tcp socket doesn't work.
      
      A reason of this bug is incorrect window parameters and in this case
      tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The
      other side drops packets with this seq, because seq is less than
      tp->rcv_nxt ( tcp_sequence() ).
      
      Data from a send queue is sent only if there is enough space in a
      window, so when we restore unacked data, we need to expand a window to
      fit this data.
      
      This was in a first version of this patch:
      "tcp: extend window to fit all restored unacked data in a send queue"
      
      Then Alexey recommended me to restore window parameters instead of
      adjusted them according with data in a sent queue. This sounds resonable.
      
      rcv_wnd has to be restored, because it was reported to another side
      and the offered window is never shrunk.
      One of reasons why we need to restore snd_wnd was described above.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarAndrey Vagin <avagin@openvz.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b1ed4c4f
  31. 04 May, 2016 1 commit
  32. 02 May, 2016 2 commits
    • Eric Dumazet's avatar
      tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Eric Dumazet authored
      Large sendmsg()/write() hold socket lock for the duration of the call,
      unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
      are parked into socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out of order queue with additional cpu
      overhead, and also possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt since the backlog can become
      quite big if the copy from user space triggers IO (page faults)
      
      Some applications learnt to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      Kernel should know better, right ?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested) and checking socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership,
      so is not equivalent to a {release_sock(sk); lock_sock(sk);},
      to ensure implicit atomicity rules that sendmsg() was
      giving to (possibly buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d41a69f1
    • Eric Dumazet's avatar
      tcp: do not block bh during prequeue processing · fb3477c0
      Eric Dumazet authored
      AFAIK, nothing in current TCP stack absolutely wants BH
      being disabled once socket is owned by a thread running in
      process context.
      
      As mentioned in my prior patch ("tcp: give prequeue mode some care"),
      processing a batch of packets might take time, better not block BH
      at all.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fb3477c0