1. 31 May, 2019 9 commits
    • Roman Gushchin's avatar
      bpf: add memlock precharge for socket local storage · d50836cd
      Roman Gushchin authored
      Socket local storage maps lack the memlock precharge check,
      which is performed before the memory allocation for
      most other bpf map types.
      
      Let's add it in order to unify all map types.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d50836cd
    • Roman Gushchin's avatar
      bpf: add memlock precharge check for cgroup_local_storage · ffc8b144
      Roman Gushchin authored
      Cgroup local storage maps lack the memlock precharge check,
      which is performed before the memory allocation for
      most other bpf map types.
      
      Let's add it in order to unify all map types.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ffc8b144
    • Alexei Starovoitov's avatar
      Merge branch 'propagate-cn-to-tcp' · 576240cf
      Alexei Starovoitov authored
      Lawrence Brakmo says:
      
      ====================
      This patchset adds support for propagating congestion notifications (cn)
      to TCP from cgroup inet skb egress BPF programs.
      
      Current cgroup skb BPF programs cannot trigger TCP congestion window
      reductions, even when they drop a packet. This patch-set adds support
      for cgroup skb BPF programs to send congestion notifications in the
      return value when the packets are TCP packets. Rather than the
      current 1 for keeping the packet and 0 for dropping it, they can
      now return:
          NET_XMIT_SUCCESS    (0)    - continue with packet output
          NET_XMIT_DROP       (1)    - drop packet and do cn
          NET_XMIT_CN         (2)    - continue with packet output and do cn
          -EPERM                     - drop packet
      
      Finally, HBM programs are modified to collect and return more
      statistics.
      
      There has been some discussion regarding the best place to manage
      bandwidths. Some believe this should be done in the qdisc where it can
      also be managed with a BPF program. We believe there are advantages
      for doing it with a BPF program in the cgroup/skb callback. For example,
      it reduces overheads in the cases where there is on primary workload and
      one or more secondary workloads, where each workload is running on its
      own cgroupv2. In this scenario, we only need to throttle the secondary
      workloads and there is no overhead for the primary workload since there
      will be no BPF program attached to its cgroup.
      
      Regardless, we agree that this mechanism should not penalize those that
      are not using it. We tested this by doing 1 byte req/reply RPCs over
      loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs.
      Each test was repeated 50 times with a 1 minute delay between each set
      of 10. We then calculated the average RPCs/sec over the 50 tests. We
      compare upstream with upstream + patchset and no BPF program as well
      as upstream + patchset and a BPF program that just returns ALLOW_PKT.
      Here are the results:
      
      upstream                           80937 RPCs/sec
      upstream + patches, no BPF program 80894 RPCs/sec
      upstream + patches, BPF program    80634 RPCs/sec
      
      These numbers indicate that there is no penalty for these patches
      
      The use of congestion notifications improves the performance of HBM when
      using Cubic. Without congestion notifications, Cubic will not decrease its
      cwnd and HBM will need to drop a large percentage of the packets.
      
      The following results are obtained for rate limits of 1Gbps,
      between two servers using netperf, and only one flow. We also show how
      reducing the max delayed ACK timer can improve the performance when
      using Cubic.
      
      Command used was:
        ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                         -s=<server running netserver>
        where:
           <rate>   is 1000
           --no_cn  specifies no cwr notifications
           dctcp    uses dctcp
      
                             Cubic                    DCTCP
      Lim, DA      Mbps cwnd cred drops  Mbps cwnd cred drops
      --------     ---- ---- ---- -----  ---- ---- ---- -----
        1G, 40       35  462 -320 67%     995    1 -212  0.05%
        1G, 40,cn   736    9  -78  0.07   995    1 -212  0.05
        1G,  5,cn   941    2 -189  0.13   995    1 -212  0.05
      
      Notes:
        --no_cn has no effect with DCTCP
        Lim = rate limit
        DA = maximum delay ack timer
        cred = credit in packets
        drops = % packets dropped
      
      v1->v2: Insures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3
              New egress values apply to all protocols, not just TCP
              Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers
              Removed changes to __tcp_transmit_skb (patch 5), no longer needed
              Removed sample use of EDT
      v2->v3: Removed the probe timer related changes
      v3->v4: Replaced preempt_enable_no_resched() by preempt_enable()
              in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro
      ====================
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      576240cf
    • brakmo's avatar
      bpf: Add more stats to HBM · d58c6f72
      brakmo authored
      Adds more stats to HBM, including average cwnd and rtt of all TCP
      flows, percents of packets that are ecn ce marked and distribution
      of return values.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d58c6f72
    • brakmo's avatar
      bpf: Add cn support to hbm_out_kern.c · ffd81558
      brakmo authored
      Update hbm_out_kern.c to support returning cn notifications.
      Also updates relevant files to allow disabling cn notifications.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ffd81558
    • brakmo's avatar
      bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls · 956fe219
      brakmo authored
      Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
      congestion notifications from the BPF programs.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      956fe219
    • brakmo's avatar
      bpf: Update __cgroup_bpf_run_filter_skb with cn · e7a3160d
      brakmo authored
      For egress packets, __cgroup_bpf_fun_filter_skb() will now call
      BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY()
      in order to propagate congestion notifications (cn) requests to TCP
      callers.
      
      For egress packets, this function can return:
         NET_XMIT_SUCCESS    (0)    - continue with packet output
         NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
         NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                      to call cwr
         -EPERM                     - drop packet
      
      For ingress packets, this function will return -EPERM if any attached
      program was found and if it returned != 1 during execution. Otherwise 0
      is returned.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e7a3160d
    • brakmo's avatar
      bpf: cgroup inet skb programs can return 0 to 3 · 5cf1e914
      brakmo authored
      Allows cgroup inet skb programs to return values in the range [0, 3].
      The second bit is used to deterine if congestion occurred and higher
      level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr()
      
      The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
      at load time if it uses the new return values (i.e. 2 or 3).
      
      The expected_attach_type is currently not enforced for
      BPF_PROG_TYPE_CGROUP_SKB.  e.g Meaning the current bpf_prog with
      expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to
      BPF_CGROUP_INET_INGRESS.  Blindly enforcing expected_attach_type will
      break backward compatibility.
      
      This patch adds a enforce_expected_attach_type bit to only
      enforce the expected_attach_type when it uses the new
      return value.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      5cf1e914
    • brakmo's avatar
      bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY · 1f52f6c0
      brakmo authored
      Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
      __cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
      request cwr for TCP packets.
      
      Current cgroup skb programs can only return 0 or 1 (0 to drop the
      packet. This macro changes the behavior so the low order bit
      indicates whether the packet should be dropped (0) or not (1)
      and the next bit is used for congestion notification (cn).
      
      Hence, new allowed return values of CGROUP EGRESS BPF programs are:
        0: drop packet
        1: keep packet
        2: drop packet and call cwr
        3: keep packet and call cwr
      
      This macro then converts it to one of NET_XMIT values or -EPERM
      that has the effect of dropping the packet with no cn.
        0: NET_XMIT_SUCCESS  skb should be transmitted (no cn)
        1: NET_XMIT_DROP     skb should be dropped and cwr called
        2: NET_XMIT_CN       skb should be transmitted and cwr called
        3: -EPERM            skb should be dropped (no cn)
      
      Note that when more than one BPF program is called, the packet is
      dropped if at least one of programs requests it be dropped, and
      there is cn if at least one program returns cn.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1f52f6c0
  2. 29 May, 2019 15 commits
  3. 28 May, 2019 15 commits
  4. 25 May, 2019 1 commit