1. 30 Aug, 2013 4 commits
    • Yuchung Cheng's avatar
      tcp: do not use cached RTT for RTT estimation · 1b7fdd2a
      Yuchung Cheng authored
      RTT cached in the TCP metrics are valuable for the initial timeout
      because SYN RTT usually does not account for serialization delays
      on low BW path.
      
      However using it to seed the RTT estimator maybe disruptive because
      other components (e.g., pacing) require the smooth RTT to be obtained
      from actual connection.
      
      The solution is to use the higher cached RTT to set the first RTO
      conservatively like tcp_rtt_estimator(), but avoid seeding the other
      RTT estimator variables such as srtt.  It is also a good idea to
      keep RTO conservative to obtain the first RTT sample, and the
      performance is insured by TCP loss probe if SYN RTT is available.
      
      To keep the seeding formula consistent across SYN RTT and cached RTT,
      the rttvar is twice the cached RTT instead of cached RTTVAR value. The
      reason is because cached variation may be too small (near min RTO)
      which defeats the purpose of being conservative on first RTO. However
      the metrics still keep the RTT variations as they might be useful for
      user applications (through ip).
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Tested-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1b7fdd2a
    • Eric Dumazet's avatar
      pkt_sched: fq: prefetch() fix · 08f89b98
      Eric Dumazet authored
      kbuild bot reported following m68k build error :
      
        net/sched/sch_fq.c: In function 'fq_dequeue':
      >> net/sched/sch_fq.c:491:2: error: implicit declaration of function
      'prefetch' [-Werror=implicit-function-declaration]
         cc1: some warnings being treated as errors
      
      While we are fixing this, move this prefetch() call a bit earlier.
      Reported-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      08f89b98
    • Joe Perches's avatar
      drivers:net: Convert dma_alloc_coherent(...__GFP_ZERO) to dma_zalloc_coherent · ede23fa8
      Joe Perches authored
      __GFP_ZERO is an uncommon flag and perhaps is better
      not used.  static inline dma_zalloc_coherent exists
      so convert the uses of dma_alloc_coherent with __GFP_ZERO
      to the more common kernel style with zalloc.
      
      Remove memset from the static inline dma_zalloc_coherent
      and add just one use of __GFP_ZERO instead.
      
      Trivially reduces the size of the existing uses of
      dma_zalloc_coherent.
      
      Realign arguments as appropriate.
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarJesse Brandeburg <jesse.brandeburg@intel.com>
      Acked-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ede23fa8
    • Eric Dumazet's avatar
      pkt_sched: fq: Fair Queue packet scheduler · afe4fd06
      Eric Dumazet authored
      - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
      - Uses the new_flow/old_flow separation from FQ_codel
      - New flows get an initial credit allowing IW10 without added delay.
      - Special FIFO queue for high prio packets (no need for PRIO + FQ)
      - Uses a hash table of RB trees to locate the flows at enqueue() time
      - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
        unused flows)
      - Dynamic memory allocations.
      - Designed to allow millions of concurrent flows per Qdisc.
      - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
      - Single high resolution timer for throttled flows (if any).
      - One RB tree to link throttled flows.
      - Ability to have a max rate per flow. We might add a socket option
        to add per socket limitation.
      
      Attempts have been made to add TCP pacing in TCP stack, but this
      seems to add complex code to an already complex stack.
      
      TCP pacing is welcomed for flows having idle times, as the cwnd
      permits TCP stack to queue a possibly large number of packets.
      
      This removes the 'slow start after idle' choice, hitting badly
      large BDP flows, and applications delivering chunks of data
      as video streams.
      
      Nicely spaced packets :
      Here interface is 10Gbit, but flow bottleneck is ~20Mbit
      
      cwin is big, yet FQ avoids the typical bursts generated by TCP
      (as in netperf TCP_RR -- -r 100000,100000)
      
      15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
      15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
      15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
      15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
      15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
      15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
      15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
      15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
      15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
      15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
      
      TCP timestamps show that most packets from B were queued in the same ms
      timeframe (TSval 1159799{3,4}), but FQ managed to send them right
      in time to avoid a big burst.
      
      In slow start or steady state, very few packets are throttled [1]
      
      FQ gets a bunch of tunables as :
      
        limit : max number of packets on whole Qdisc (default 10000)
      
        flow_limit : max number of packets per flow (default 100)
      
        quantum : the credit per RR round (default is 2 MTU)
      
        initial_quantum : initial credit for new flows (default is 10 MTU)
      
        maxrate : max per flow rate (default : unlimited)
      
        buckets : number of RB trees (default : 1024) in hash table.
                     (consumes 8 bytes per bucket)
      
        [no]pacing : disable/enable pacing (default is enable)
      
      All of them can be changed on a live qdisc.
      
      $ tc qd add dev eth0 root fq help
      Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
                    [ quantum BYTES ] [ initial_quantum BYTES ]
                    [ maxrate RATE  ] [ buckets NUMBER ]
                    [ [no]pacing ]
      
      $ tc -s -d qd
      qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
       Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
       backlog 0b 0p requeues 14
        511 flows, 511 inactive, 0 throttled
        110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
      
      [1] Except if initial srtt is overestimated, as if using
      cached srtt in tcp metrics. We'll provide a fix for this issue.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      afe4fd06
  2. 29 Aug, 2013 36 commits