1. 14 Jun, 2019 6 commits
  2. 12 Jun, 2019 1 commit
    • Valdis Klētnieks's avatar
      bpf: silence warning messages in core · aee450cb
      Valdis Klētnieks authored
      Compiling kernel/bpf/core.c with W=1 causes a flood of warnings:
      
      kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init]
       1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
            |                                                                 ^~~~
      kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
       1087 |  INSN_3(ALU, ADD,  X),   \
            |  ^~~~~~
      kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
       1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
            |   ^~~~~~~~~~~~
      kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]')
       1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
            |                                                                 ^~~~
      kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
       1087 |  INSN_3(ALU, ADD,  X),   \
            |  ^~~~~~
      kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
       1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
            |   ^~~~~~~~~~~~
      
      98 copies of the above.
      
      The attached patch silences the warnings, because we *know* we're overwriting
      the default initializer. That leaves bpf/core.c with only 6 other warnings,
      which become more visible in comparison.
      Signed-off-by: default avatarValdis Kletnieks <valdis.kletnieks@vt.edu>
      Acked-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      aee450cb
  3. 11 Jun, 2019 12 commits
  4. 06 Jun, 2019 2 commits
  5. 04 Jun, 2019 2 commits
  6. 01 Jun, 2019 4 commits
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 0462eaac
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2019-05-31
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      Lots of exciting new features in the first PR of this developement cycle!
      The main changes are:
      
      1) misc verifier improvements, from Alexei.
      
      2) bpftool can now convert btf to valid C, from Andrii.
      
      3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
         This feature greatly improves BPF speed on 32-bit architectures. From Jiong.
      
      4) cgroups will now auto-detach bpf programs. This fixes issue of thousands
         bpf programs got stuck in dying cgroups. From Roman.
      
      5) new bpf_send_signal() helper, from Yonghong.
      
      6) cgroup inet skb programs can signal CN to the stack, from Lawrence.
      
      7) miscellaneous cleanups, from many developers.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0462eaac
    • Alan Maguire's avatar
      selftests/bpf: measure RTT from xdp using xdping · cd538502
      Alan Maguire authored
      xdping allows us to get latency estimates from XDP.  Output looks
      like this:
      
      ./xdping -I eth4 192.168.55.8
      Setting up XDP for eth4, please wait...
      XDP setup disrupts network connectivity, hit Ctrl+C to quit
      
      Normal ping RTT data
      [Ignore final RTT; it is distorted by XDP using the reply]
      PING 192.168.55.8 (192.168.55.8) from 192.168.55.7 eth4: 56(84) bytes of data.
      64 bytes from 192.168.55.8: icmp_seq=1 ttl=64 time=0.302 ms
      64 bytes from 192.168.55.8: icmp_seq=2 ttl=64 time=0.208 ms
      64 bytes from 192.168.55.8: icmp_seq=3 ttl=64 time=0.163 ms
      64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.275 ms
      
      4 packets transmitted, 4 received, 0% packet loss, time 3079ms
      rtt min/avg/max/mdev = 0.163/0.237/0.302/0.054 ms
      
      XDP RTT data:
      64 bytes from 192.168.55.8: icmp_seq=5 ttl=64 time=0.02808 ms
      64 bytes from 192.168.55.8: icmp_seq=6 ttl=64 time=0.02804 ms
      64 bytes from 192.168.55.8: icmp_seq=7 ttl=64 time=0.02815 ms
      64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.02805 ms
      
      The xdping program loads the associated xdping_kern.o BPF program
      and attaches it to the specified interface.  If run in client
      mode (the default), it will add a map entry keyed by the
      target IP address; this map will store RTT measurements, current
      sequence number etc.  Finally in client mode the ping command
      is executed, and the xdping BPF program will use the last ICMP
      reply, reformulate it as an ICMP request with the next sequence
      number and XDP_TX it.  After the reply to that request is received
      we can measure RTT and repeat until the desired number of
      measurements is made.  This is why the sequence numbers in the
      normal ping are 1, 2, 3 and 8.  We XDP_TX a modified version
      of ICMP reply 4 and keep doing this until we get the 4 replies
      we need; hence the networking stack only sees reply 8, where
      we have XDP_PASSed it upstream since we are done.
      
      In server mode (-s), xdping simply takes ICMP requests and replies
      to them in XDP rather than passing the request up to the networking
      stack.  No map entry is required.
      
      xdping can be run in native XDP mode (the default, or specified
      via -N) or in skb mode (-S).
      
      A test program test_xdping.sh exercises some of these options.
      
      Note that native XDP does not seem to XDP_TX for veths, hence -N
      is not tested.  Looking at the code, it looks like XDP_TX is
      supported so I'm not sure if that's expected.  Running xdping in
      native mode for ixgbe as both client and server works fine.
      
      Changes since v4
      
      - close fds on cleanup (Song Liu)
      
      Changes since v3
      
      - fixed seq to be __be16 (Song Liu)
      - fixed fd checks in xdping.c (Song Liu)
      
      Changes since v2
      
      - updated commit message to explain why seq number of last
        ICMP reply is 8 not 4 (Song Liu)
      - updated types of seq number, raddr and eliminated csum variable
        in xdpclient/xdpserver functions as it was not needed (Song Liu)
      - added XDPING_DEFAULT_COUNT definition and usage specification of
        default/max counts (Song Liu)
      
      Changes since v1
       - moved from RFC to PATCH
       - removed unused variable in ipv4_csum() (Song Liu)
       - refactored ICMP checks into icmp_check() function called by client
         and server programs and reworked client and server programs due
         to lack of shared code (Song Liu)
       - added checks to ensure that SKB and native mode are not requested
         together (Song Liu)
      Signed-off-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      cd538502
    • David S. Miller's avatar
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 33aae282
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2019-05-31
      
      This series contains updates to the iavf driver.
      
      Nathan Chancellor converts the use of gnu_printf to printf.
      
      Aleksandr modifies the driver to limit the number of RSS queues to the
      number of online CPUs in order to avoid creating misconfigured RSS
      queues.
      
      Gustavo A. R. Silva converts a couple of instances where sizeof() can be
      replaced with struct_size().
      
      Alice makes the remaining changes to the iavf driver to cleanup all the
      old "i40evf" references in the driver to iavf, including the file names
      that still contained the old driver reference.  There was no functional
      changes made, just cosmetic to reduce any confusion going forward now
      that the iavf driver is the virtual function driver for both i40e and
      ice drivers.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      33aae282
    • Jiong Wang's avatar
      bpf: doc: update answer for 32-bit subregister question · c231c22a
      Jiong Wang authored
      There has been quite a few progress around the two steps mentioned in the
      answer to the following question:
      
        Q: BPF 32-bit subregister requirements
      
      This patch updates the answer to reflect what has been done.
      
      v2:
       - Add missing full stop. (Song Liu)
       - Minor tweak on one sentence. (Song Liu)
      
      v1:
       - Integrated rephrase from Quentin and Jakub
      Reviewed-by: default avatarQuentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarJiong Wang <jiong.wang@netronome.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c231c22a
  7. 31 May, 2019 13 commits
    • Alexei Starovoitov's avatar
      Merge branch 'map-charge-cleanup' · d168286d
      Alexei Starovoitov authored
      Roman Gushchin says:
      
      ====================
      During my work on memcg-based memory accounting for bpf maps
      I've done some cleanups and refactorings of the existing
      memlock rlimit-based code. It makes it more robust, unifies
      size to pages conversion, size checks and corresponding error
      codes. Also it adds coverage for cgroup local storage and
      socket local storage maps.
      
      It looks like some preliminary work on the mm side might be
      required to start working on the memcg-based accounting,
      so I'm sending these patches as a separate patchset.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d168286d
    • Roman Gushchin's avatar
      bpf: move memory size checks to bpf_map_charge_init() · c85d6913
      Roman Gushchin authored
      Most bpf map types doing similar checks and bytes to pages
      conversion during memory allocation and charging.
      
      Let's unify these checks by moving them into bpf_map_charge_init().
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c85d6913
    • Roman Gushchin's avatar
      bpf: rework memlock-based memory accounting for maps · b936ca64
      Roman Gushchin authored
      In order to unify the existing memlock charging code with the
      memcg-based memory accounting, which will be added later, let's
      rework the current scheme.
      
      Currently the following design is used:
        1) .alloc() callback optionally checks if the allocation will likely
           succeed using bpf_map_precharge_memlock()
        2) .alloc() performs actual allocations
        3) .alloc() callback calculates map cost and sets map.memory.pages
        4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
           and performs actual charging; in case of failure the map is
           destroyed
        <map is in use>
        1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
           performs uncharge and releases the user
        2) .map_free() callback releases the memory
      
      The scheme can be simplified and made more robust:
        1) .alloc() calculates map cost and calls bpf_map_charge_init()
        2) bpf_map_charge_init() sets map.memory.user and performs actual
          charge
        3) .alloc() performs actual allocations
        <map is in use>
        1) .map_free() callback releases the memory
        2) bpf_map_charge_finish() performs uncharge and releases the user
      
      The new scheme also allows to reuse bpf_map_charge_init()/finish()
      functions for memcg-based accounting. Because charges are performed
      before actual allocations and uncharges after freeing the memory,
      no bogus memory pressure can be created.
      
      In cases when the map structure is not available (e.g. it's not
      created yet, or is already destroyed), on-stack bpf_map_memory
      structure is used. The charge can be transferred with the
      bpf_map_charge_move() function.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b936ca64
    • Roman Gushchin's avatar
      bpf: group memory related fields in struct bpf_map_memory · 3539b96e
      Roman Gushchin authored
      Group "user" and "pages" fields of bpf_map into the bpf_map_memory
      structure. Later it can be extended with "memcg" and other related
      information.
      
      The main reason for a such change (beside cosmetics) is to pass
      bpf_map_memory structure to charging functions before the actual
      allocation of bpf_map.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3539b96e
    • Roman Gushchin's avatar
      bpf: add memlock precharge for socket local storage · d50836cd
      Roman Gushchin authored
      Socket local storage maps lack the memlock precharge check,
      which is performed before the memory allocation for
      most other bpf map types.
      
      Let's add it in order to unify all map types.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d50836cd
    • Roman Gushchin's avatar
      bpf: add memlock precharge check for cgroup_local_storage · ffc8b144
      Roman Gushchin authored
      Cgroup local storage maps lack the memlock precharge check,
      which is performed before the memory allocation for
      most other bpf map types.
      
      Let's add it in order to unify all map types.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ffc8b144
    • Alexei Starovoitov's avatar
      Merge branch 'propagate-cn-to-tcp' · 576240cf
      Alexei Starovoitov authored
      Lawrence Brakmo says:
      
      ====================
      This patchset adds support for propagating congestion notifications (cn)
      to TCP from cgroup inet skb egress BPF programs.
      
      Current cgroup skb BPF programs cannot trigger TCP congestion window
      reductions, even when they drop a packet. This patch-set adds support
      for cgroup skb BPF programs to send congestion notifications in the
      return value when the packets are TCP packets. Rather than the
      current 1 for keeping the packet and 0 for dropping it, they can
      now return:
          NET_XMIT_SUCCESS    (0)    - continue with packet output
          NET_XMIT_DROP       (1)    - drop packet and do cn
          NET_XMIT_CN         (2)    - continue with packet output and do cn
          -EPERM                     - drop packet
      
      Finally, HBM programs are modified to collect and return more
      statistics.
      
      There has been some discussion regarding the best place to manage
      bandwidths. Some believe this should be done in the qdisc where it can
      also be managed with a BPF program. We believe there are advantages
      for doing it with a BPF program in the cgroup/skb callback. For example,
      it reduces overheads in the cases where there is on primary workload and
      one or more secondary workloads, where each workload is running on its
      own cgroupv2. In this scenario, we only need to throttle the secondary
      workloads and there is no overhead for the primary workload since there
      will be no BPF program attached to its cgroup.
      
      Regardless, we agree that this mechanism should not penalize those that
      are not using it. We tested this by doing 1 byte req/reply RPCs over
      loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs.
      Each test was repeated 50 times with a 1 minute delay between each set
      of 10. We then calculated the average RPCs/sec over the 50 tests. We
      compare upstream with upstream + patchset and no BPF program as well
      as upstream + patchset and a BPF program that just returns ALLOW_PKT.
      Here are the results:
      
      upstream                           80937 RPCs/sec
      upstream + patches, no BPF program 80894 RPCs/sec
      upstream + patches, BPF program    80634 RPCs/sec
      
      These numbers indicate that there is no penalty for these patches
      
      The use of congestion notifications improves the performance of HBM when
      using Cubic. Without congestion notifications, Cubic will not decrease its
      cwnd and HBM will need to drop a large percentage of the packets.
      
      The following results are obtained for rate limits of 1Gbps,
      between two servers using netperf, and only one flow. We also show how
      reducing the max delayed ACK timer can improve the performance when
      using Cubic.
      
      Command used was:
        ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                         -s=<server running netserver>
        where:
           <rate>   is 1000
           --no_cn  specifies no cwr notifications
           dctcp    uses dctcp
      
                             Cubic                    DCTCP
      Lim, DA      Mbps cwnd cred drops  Mbps cwnd cred drops
      --------     ---- ---- ---- -----  ---- ---- ---- -----
        1G, 40       35  462 -320 67%     995    1 -212  0.05%
        1G, 40,cn   736    9  -78  0.07   995    1 -212  0.05
        1G,  5,cn   941    2 -189  0.13   995    1 -212  0.05
      
      Notes:
        --no_cn has no effect with DCTCP
        Lim = rate limit
        DA = maximum delay ack timer
        cred = credit in packets
        drops = % packets dropped
      
      v1->v2: Insures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3
              New egress values apply to all protocols, not just TCP
              Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers
              Removed changes to __tcp_transmit_skb (patch 5), no longer needed
              Removed sample use of EDT
      v2->v3: Removed the probe timer related changes
      v3->v4: Replaced preempt_enable_no_resched() by preempt_enable()
              in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro
      ====================
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      576240cf
    • brakmo's avatar
      bpf: Add more stats to HBM · d58c6f72
      brakmo authored
      Adds more stats to HBM, including average cwnd and rtt of all TCP
      flows, percents of packets that are ecn ce marked and distribution
      of return values.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d58c6f72
    • brakmo's avatar
      bpf: Add cn support to hbm_out_kern.c · ffd81558
      brakmo authored
      Update hbm_out_kern.c to support returning cn notifications.
      Also updates relevant files to allow disabling cn notifications.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ffd81558
    • brakmo's avatar
      bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls · 956fe219
      brakmo authored
      Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
      congestion notifications from the BPF programs.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      956fe219
    • brakmo's avatar
      bpf: Update __cgroup_bpf_run_filter_skb with cn · e7a3160d
      brakmo authored
      For egress packets, __cgroup_bpf_fun_filter_skb() will now call
      BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY()
      in order to propagate congestion notifications (cn) requests to TCP
      callers.
      
      For egress packets, this function can return:
         NET_XMIT_SUCCESS    (0)    - continue with packet output
         NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
         NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                      to call cwr
         -EPERM                     - drop packet
      
      For ingress packets, this function will return -EPERM if any attached
      program was found and if it returned != 1 during execution. Otherwise 0
      is returned.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e7a3160d
    • brakmo's avatar
      bpf: cgroup inet skb programs can return 0 to 3 · 5cf1e914
      brakmo authored
      Allows cgroup inet skb programs to return values in the range [0, 3].
      The second bit is used to deterine if congestion occurred and higher
      level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr()
      
      The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
      at load time if it uses the new return values (i.e. 2 or 3).
      
      The expected_attach_type is currently not enforced for
      BPF_PROG_TYPE_CGROUP_SKB.  e.g Meaning the current bpf_prog with
      expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to
      BPF_CGROUP_INET_INGRESS.  Blindly enforcing expected_attach_type will
      break backward compatibility.
      
      This patch adds a enforce_expected_attach_type bit to only
      enforce the expected_attach_type when it uses the new
      return value.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      5cf1e914
    • brakmo's avatar
      bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY · 1f52f6c0
      brakmo authored
      Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
      __cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
      request cwr for TCP packets.
      
      Current cgroup skb programs can only return 0 or 1 (0 to drop the
      packet. This macro changes the behavior so the low order bit
      indicates whether the packet should be dropped (0) or not (1)
      and the next bit is used for congestion notification (cn).
      
      Hence, new allowed return values of CGROUP EGRESS BPF programs are:
        0: drop packet
        1: keep packet
        2: drop packet and call cwr
        3: keep packet and call cwr
      
      This macro then converts it to one of NET_XMIT values or -EPERM
      that has the effect of dropping the packet with no cn.
        0: NET_XMIT_SUCCESS  skb should be transmitted (no cn)
        1: NET_XMIT_DROP     skb should be dropped and cwr called
        2: NET_XMIT_CN       skb should be transmitted and cwr called
        3: -EPERM            skb should be dropped (no cn)
      
      Note that when more than one BPF program is called, the packet is
      dropped if at least one of programs requests it be dropped, and
      there is cn if at least one program returns cn.
      Signed-off-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1f52f6c0