1. 28 Dec, 2019 6 commits
    • Merge branch 'tcp_cubic-various-fixes' · 36a78867
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp_cubic: various fixes
      
      This patch series converts tcp_cubic to usec clock resolution
      for Hystart logic.
      
      This makes Hystart more relevant for data-center flows.
      Prior to this series, Hystart was either not kicking in, or was
      kicking in without good reason, since the 1 ms clock was too coarse.
      
      Last patch also fixes an issue with Hystart vs TCP pacing.
      
      v2: removed a last-minute debug chunk from last patch
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: make Hystart aware of pacing · ede656e8
      Eric Dumazet authored
      For years we disabled Hystart ACK train detection at Google
      because it was fooled by TCP pacing.
      
      ACK train detection uses a simple heuristic, detecting whether
      we receive an ACK past half the RTT, to exit slow start before
      hitting the bottleneck and experiencing massive drops.
      
      But pacing by design might delay packets up to RTT/2,
      so we need to tweak the Hystart logic to be aware of this
      extra delay.
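
      A minimal sketch of the resulting ACK-train exit check, using the
      existing hystart_update() fields (ca->delay_min, ca->round_start,
      ca->found) and the socket pacing status; this is an illustration of
      the idea, not the verbatim diff:

        /* Only halve the delay_min threshold when the flow is not paced,
         * since pacing may legitimately spread packets over up to RTT/2
         * during slow start.
         */
        u32 threshold = ca->delay_min;

        if (sk->sk_pacing_status == SK_PACING_NONE)
                threshold >>= 1;        /* unpaced: keep historical delay_min/2 */

        if ((s32)(now - ca->round_start) > threshold) {
                ca->found = 1;
                tp->snd_ssthresh = tp->snd_cwnd;        /* exit slow start */
        }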
      
      Tested:
       Added a 100 usec delay at receiver.
      
      Before:
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9117
         7057
         9553
         8300
         7030
         6849
         9533
        10126
         6876
         8473
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1230               0.0
      
      After :
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
         9845
        10103
        10866
        11096
        11936
        11487
        11773
        12188
        11066
        11894
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       6462               0.0
      
      Disabling Hystart ACK Train detection gives similar numbers
      
      echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
      nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11173
        10954
        12455
        10627
        11578
        11583
        11222
        10880
        10665
        11366
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: tweak Hystart detection for short RTT flows · 42f3a8aa
      Eric Dumazet authored
      After switching ca->delay_min to usec resolution, we exit
      slow start prematurely for very low RTT flows, setting
      snd_ssthresh to 20.
      
      The reason is that delay_min is fed with the RTT of small packet
      trains. Then, as cwnd increases, TCP sends bigger TSO packets.
      
      LRO/GRO aggregation and/or interrupt mitigation strategies
      on the receiver tend to inflate RTT samples.
      
      Fix this by adding to delay_min the expected delay of
      two TSO packets, given the current pacing rate.
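
      A sketch following this description; the helper name and the exact
      constants are assumptions here, not necessarily the merged code:

        /* Estimate the serialization delay (usec) of two full-size TSO
         * packets at the current pacing rate, to be added to delay_min
         * before it is used as a Hystart threshold.
         */
        static u32 hystart_ack_delay(const struct sock *sk)
        {
                unsigned long rate = READ_ONCE(sk->sk_pacing_rate);

                if (!rate)
                        return 0;
                return div64_ul((u64)GSO_MAX_SIZE * 2 * USEC_PER_SEC, rate);
        }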
      
      Tested:
      
      Sender uses pfifo_fast qdisc
      
      Before :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        11348
        11707
        11562
        11428
        11773
        11534
         9878
        11693
        10597
        10968
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       200                0.0
      
      After :
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        14877
        14517
        15797
        18466
        17376
        14833
        17558
        17933
        16039
        18059
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1670               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: switch bictcp_clock() to usec resolution · cff04e2d
      Eric Dumazet authored
      The current 1 ms clock feeds ca->round_start, ca->delay_min,
      and ca->last_ack.
      
      This is quite problematic for data-center flows, where delay_min
      is way below 1 ms.
      
      This means Hystart train detection triggers every time the jiffies
      value is updated, since the expression
      "((s32)(now - ca->round_start) > ca->delay_min >> 4)" becomes true.
      
      This kind of random behavior can be solved by reusing the existing
      usec timestamp that TCP keeps in tp->tcp_mstamp.
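
      A minimal sketch of the usec clock built on that timestamp (the
      helper name follows the bictcp_clock() it replaces):

        static inline u32 bictcp_clock_us(const struct sock *sk)
        {
                return tcp_sk(sk)->tcp_mstamp;  /* usec timestamp TCP already maintains */
        }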
      
      Note that a followup patch will tweak things a bit, because
      during slow start, GRO aggregation on receivers naturally
      increases the RTT as TSO packets gradually come to ~64KB size.
      
      To recap, right after this patch, CUBIC Hystart train detection
      is more aggressive, since short RTT flows might exit slow start at
      cwnd = 20, instead of being possibly unbounded.
      
      Following patch will address this problem.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: remove one conditional from hystart_update() · 35821fc2
      Eric Dumazet authored
      If we initialize ca->curr_rtt to ~0U, we do not need to test
      for a zero value in hystart_update().
      
      We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
      been processed, and thus ca->curr_rtt will have a sane value.
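
      In code, the simplification looks roughly like this (a sketch, not
      the exact diff):

        ca->curr_rtt = ~0U;                      /* primed in bictcp_hystart_reset() */

        /* was: if (ca->curr_rtt == 0 || ca->curr_rtt > delay)
         *              ca->curr_rtt = delay;
         */
        ca->curr_rtt = min(ca->curr_rtt, delay); /* in hystart_update() */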
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_cubic: optimize hystart_update() · 473900a5
      Eric Dumazet authored
      We do not care which bit in ca->found is set.
      
      We avoid accessing hystart and hystart_detect unless really needed,
      possibly avoiding one cache line miss.
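
      Roughly, the reordering looks like this (a sketch of the intent,
      not the verbatim diff):

        static void hystart_update(struct sock *sk, u32 delay)
        {
                struct bictcp *ca = inet_csk_ca(sk);

                /* was: if (ca->found & hystart_detect) return;
                 * Checking ca->found alone avoids reading the hystart /
                 * hystart_detect module parameters on the common path.
                 */
                if (ca->found)
                        return;
                /* ... ACK-train and delay-increase detection follow ... */
        }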
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 27 Dec, 2019 2 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 2bbc078f
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2019-12-27
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      We've added 127 non-merge commits during the last 17 day(s) which contain
      a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).
      
      There are three merge conflicts. The conflicts and their resolutions look as follows:
      
      1) Merge conflict in net/bpf/test_run.c:
      
      There was a tree-wide cleanup c593642c ("treewide: Use sizeof_field() macro")
      which conflicts with b590cb5f ("bpf: Switch to offsetofend in
      BPF_PROG_TEST_RUN"):
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                                   sizeof_field(struct __sk_buff, priority),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
        >>>>>>> 7c8dce4b
      
      There are a few occurrences that look similar to this. Always take the chunk with
      offsetofend(). Note that there is one where the fields differ:
      
        <<<<<<< HEAD
                if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                                   sizeof_field(struct __sk_buff, tstamp),
        =======
                if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
        >>>>>>> 7c8dce4b
      
      Just take the one with offsetofend() /and/ gso_segs. The latter is correct due to
      850a88cc ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").
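
      For reference, the two conflicting forms compute the same offset:
      offsetofend() is essentially shorthand for the offsetof() +
      sizeof_field() pattern, roughly defined as:

        #define offsetofend(TYPE, MEMBER) \
                (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))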
      
      2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:
      
      (I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)
      
        <<<<<<< HEAD
                if (is_13b_check(off, insn))
                        return -1;
                emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
        =======
                emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
        >>>>>>> 7c8dce4b
      
      Result should look like:
      
                emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
      
      3) Merge conflict in arch/riscv/include/asm/pgtable.h:
      
        <<<<<<< HEAD
        =======
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        #define vmemmap         ((struct page *)VMEMMAP_START)
      
        >>>>>>> 7c8dce4b
      
      Only take the BPF_* defines from there and move them higher up in the
      same file. Remove the rest from the chunk. The VMALLOC_* etc. defines
      were moved via 01f52e16 ("riscv: define vmemmap before pfn_to_page
      calls"). Result:
      
        [...]
        #define __S101  PAGE_READ_EXEC
        #define __S110  PAGE_SHARED_EXEC
        #define __S111  PAGE_SHARED_EXEC
      
        #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
        #define VMALLOC_END      (PAGE_OFFSET - 1)
        #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)
      
        #define BPF_JIT_REGION_SIZE     (SZ_128M)
        #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
        #define BPF_JIT_REGION_END      (VMALLOC_END)
      
        /*
         * Roughly size the vmemmap space to be large enough to fit enough
         * struct pages to map half the virtual address space. Then
         * position vmemmap directly below the VMALLOC region.
         */
        #define VMEMMAP_SHIFT \
                (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
        #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
        #define VMEMMAP_END     (VMALLOC_START - 1)
        #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)
      
        [...]
      
      Let me know if there are any other issues.
      
      Anyway, the main changes are:
      
      1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
         to a provided BPF object file. This provides an alternative, simplified API
         compared to standard libbpf interaction. Also, add libbpf extern variable
         resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.
      
      2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
         generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
         add various BPF riscv JIT improvements, from Björn Töpel.
      
      3) Extend bpftool to allow matching BPF programs and maps by name,
         from Paul Chaignon.
      
      4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
         flag for allowing updates without service interruption, from Andrey Ignatov.
      
      5) Cleanup and simplification of ring access functions for AF_XDP with a
         bonus of 0-5% performance improvement, from Magnus Karlsson.
      
      6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
         audit support for BPF, from Daniel Borkmann, the latter with Jiri Olsa.
      
      7) Move and extend test_select_reuseport into BPF program tests under
         BPF selftests, from Jakub Sitnicki.
      
      8) Various BPF sample improvements for xdpsock for customizing parameters
         to set up and benchmark AF_XDP, from Jay Jayatheerthan.
      
      9) Improve libbpf to provide a ulimit hint on permission denied errors.
         Also change XDP sample programs to attach in driver mode by default,
         from Toke Høiland-Jørgensen.
      
      10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
          programs, from Nikita V. Shirokov.
      
      11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.
      
      12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
          libbpf conversion, from Jesper Dangaard Brouer.
      
      13) Minor misc improvements from various others.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpftool: Make skeleton C code compilable with C++ compiler · 7c8dce4b
      Andrii Nakryiko authored
      When auto-generated BPF skeleton C code is included from a C++ application, it
      triggers a compilation error due to void * being implicitly cast to whatever
      target pointer type. This is supported by C, but not by C++. To solve this
      problem, add explicit casts where necessary.
      
      To ensure issues like this are captured going forward, add skeleton usage to
      the test_cpp test.
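
      A hypothetical, minimal illustration of the class of error being
      fixed (not code from the generated skeleton itself):

        #include <stdlib.h>

        /* Stand-in for a skeleton-internal struct. */
        struct my_obj { int x; };

        int main(void)
        {
                /* Without the cast, the void * returned by malloc() converts
                 * implicitly: fine in C, a hard error when the same code is
                 * built by a C++ compiler.  The explicit cast is accepted by
                 * both languages.
                 */
                struct my_obj *obj = (struct my_obj *)malloc(sizeof(*obj));

                free(obj);
                return 0;
        }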
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191226210253.3132060-1-andriin@fb.com
  3. 26 Dec, 2019 32 commits