1. 18 Apr, 2019 7 commits
    • bpf: Document BPF_PROG_TYPE_CGROUP_SYSCTL · da703149
      Andrey Ignatov authored
      Add documentation for BPF_PROG_TYPE_CGROUP_SYSCTL, including general
      info, attach type, context, return code, helpers, example and usage
      considerations.
      
      A separate file prog_cgroup_sysctl.rst is added to Documentation/bpf/.
      
      In the future more program types can be documented in their own
      prog_<name>.rst files.
      
      Another way to place program type specific documentation would be to
      group program types somehow (e.g. cgroup.rst for all cgroup-bpf
      programs), but it may not scale well since some program types may belong
      to different groups, e.g. BPF_PROG_TYPE_CGROUP_SKB can be documented
      together with either cgroup-bpf programs or programs that access skb.
      
      The new file is added to the index and verified by `make htmldocs` /
      sanity-check by lynx.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: fix a compilation error · ba02de1a
      Yonghong Song authored
      I hit the following compilation error with gcc 4.8.5.
      
        prog_tests/flow_dissector.c: In function ‘test_flow_dissector’:
        prog_tests/flow_dissector.c:155:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
          for (int i = 0; i < ARRAY_SIZE(tests); i++) {
          ^
        prog_tests/flow_dissector.c:155:2: note: use option -std=c99 or -std=gnu99 to compile your code
      
      Let us fix the issue by avoiding this particular c99 feature.
      
      Fixes: a5cb3346 ("selftests/bpf: make flow dissector tests more extensible")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
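The kind of change described in the commit above can be sketched as follows: the loop counter is declared at the top of the block rather than inside the `for` statement, so the code also builds with the pre-C99 default dialect of older GCC releases. The `tests` table, `count_positive()` and `run_tests()` are hypothetical stand-ins, not the selftest code.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical test table standing in for the selftest's `tests` array. */
static const int tests[] = { 3, -1, 4, -1, 5 };

/* Counts the positive entries of v[0..n-1]. The counter `i` is declared
 * at the top of the block, not in the for statement, which avoids the
 * "'for' loop initial declarations" error in pre-C99 mode. */
static int count_positive(const int *v, size_t n)
{
	size_t i;
	int count = 0;

	assert(v != NULL);
	for (i = 0; i < n; i++) {
		if (v[i] > 0)
			count++;
	}
	return count;
}

/* Walks the whole table, mirroring the ARRAY_SIZE() loop bound. */
static int run_tests(void)
{
	return count_positive(tests, ARRAY_SIZE(tests));
}
```

The alternative fix, passing -std=gnu99 to the compiler, is the one the error message suggests; the commit instead avoids the C99 feature.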
    • Merge branch 'bulk-cpumap-redirect' · 193d0002
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      This patchset utilizes a number of different kernel bulk APIs to optimize
      the performance of the XDP cpumap redirect feature.
      
      Benchmark details are available here:
       https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org
      
      Performance measurements can be considered micro benchmarks, as they measure
      dropping packets at different stages in the network stack.
      Summary based on above:
      
      Baseline benchmarks
      - baseline-redirect: UdpNoPorts: 3,180,074
      - baseline-redirect: iptables-raw drop: 6,193,534
      
      Patch1: bpf: cpumap use ptr_ring_consume_batched
      - redirect: UdpNoPorts: 3,327,729
      - redirect: iptables-raw drop: 6,321,540
      
      Patch2: net: core: introduce build_skb_around
      - redirect: UdpNoPorts: 3,221,303
      - redirect: iptables-raw drop: 6,320,066
      
      Patch3: bpf: cpumap do bulk allocation of SKBs
      - redirect: UdpNoPorts: 3,290,563
      - redirect: iptables-raw drop: 6,650,112
      
      Patch4: bpf: cpumap memory prefetchw optimizations for struct page
      - redirect: UdpNoPorts: 3,520,250
      - redirect: iptables-raw drop: 7,649,604
      
      In this V2 submission I have chosen to drop the SKB-list patch using
      netif_receive_skb_list() as it was not showing a performance improvement for
      these micro benchmarks.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: cpumap memory prefetchw optimizations for struct page · 86d23145
      Jesper Dangaard Brouer authored
      A lot of the performance gain comes from this patch.
      
      While analysing performance overhead it was found that the largest CPU
      stalls were caused when touching the struct page area. It is first read with
      a READ_ONCE from build_skb_around via page_is_pfmemalloc(), and when freed
      it is written by the page_frag_free() call.
      
      Measurements show that the prefetchw (W) variant operation is needed to
      achieve the performance gain. We believe this optimization is twofold:
      first, the W-variant saves one step in the cache-coherency protocol, and
      second, it helps us to avoid the non-temporal prefetch HW optimizations and
      bring this into all cache-levels. It might be worth investigating if
      prefetch into L2 will have the same benefit.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
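A user-space analogue of the write-intent prefetch described above might look like this. It is only a sketch: it uses the GCC/Clang builtin `__builtin_prefetch()` with `rw = 1` rather than the kernel's `prefetchw()`, and `struct item` and `touch_batch()` are hypothetical names that stand in for struct page and the cpumap code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object that is read first and written later, loosely
 * playing the role of struct page in the commit message. */
struct item {
	int refcount;
	char pad[60];
};

/* Issues write-intent prefetches for the whole batch, then updates each
 * object. Returns the sum of the updated refcounts. */
static int touch_batch(struct item **items, size_t n)
{
	size_t i;
	int sum = 0;

	assert(items != NULL);

	/* rw = 1 requests the line for writing; locality = 3 asks for it
	 * to be kept in all cache levels. */
	for (i = 0; i < n; i++)
		__builtin_prefetch(items[i], 1, 3);

	for (i = 0; i < n; i++) {
		items[i]->refcount++;
		sum += items[i]->refcount;
	}
	return sum;
}
```

Whether the prefetch helps depends on the CPU and on how much work separates the prefetch from the first access, so any gain would have to be measured, as the commit author did.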
    • bpf: cpumap do bulk allocation of SKBs · 8f0504a9
      Jesper Dangaard Brouer authored
      As cpumap now batch consumes xdp_frames from the ptr_ring, it knows how many
      SKBs it needs to allocate. Thus, let's bulk allocate these SKBs via the
      kmem_cache_alloc_bulk() API, and use the previously introduced function
      build_skb_around().
      
      Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the
      memory for us. This does clear a larger area than needed, but my micro
      benchmarks on Intel CPUs show that this is slightly faster because a
      cacheline-aligned area is cleared for the SKBs. (For the SLUB allocator,
      there is a future optimization potential, because SKBs will with high
      probability originate from the same page. If we can find/identify contiguous
      memory areas then the Intel CPU memset rep stos will have a real performance
      gain.)
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: core: introduce build_skb_around · ba0509b6
      Jesper Dangaard Brouer authored
      The function build_skb() also has the responsibility to allocate and clear
      the SKB structure. Introduce a new function build_skb_around(), that moves
      the responsibility of allocation and clearing to the caller. This allows
      the caller to use the kmem_cache (slab/slub) bulk allocation API.
      
      The next patch uses this function combined with kmem_cache_alloc_bulk.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
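The split of responsibilities described above (the caller allocates and clears, the helper only initialises) can be illustrated with a small user-space model. This is a sketch of the pattern under stated assumptions, not the kernel implementation: `struct buf_hdr`, `hdr_build_around()` and `hdr_build_bulk()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical metadata header, standing in for struct sk_buff. */
struct buf_hdr {
	unsigned char *data;
	size_t size;
	int refs;
};

/* Initialises a caller-provided, already cleared header around `data`.
 * Allocation and clearing are the caller's job. Returns NULL if either
 * pointer is missing. */
static struct buf_hdr *hdr_build_around(struct buf_hdr *hdr,
					unsigned char *data, size_t size)
{
	if (!hdr || !data)
		return NULL;
	hdr->data = data;
	hdr->size = size;
	hdr->refs = 1;
	return hdr;
}

/* Bulk path: clears all headers in one memset, then builds each header
 * around its buffer. Returns how many headers were built. */
static size_t hdr_build_bulk(struct buf_hdr *hdrs, unsigned char **bufs,
			     size_t size, size_t n)
{
	size_t i, built = 0;

	assert(hdrs != NULL && bufs != NULL);
	memset(hdrs, 0, n * sizeof(*hdrs));
	for (i = 0; i < n; i++) {
		if (hdr_build_around(&hdrs[i], bufs[i], size))
			built++;
	}
	return built;
}
```

Because the helper no longer allocates, the caller is free to obtain and clear many headers in one step, which is the property the bulk-allocation commit relies on.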
    • bpf: cpumap use ptr_ring_consume_batched · 77361825
      Jesper Dangaard Brouer authored
      Move the ptr_ring dequeue outside the loop that allocates SKBs and calls
      the network stack, as these operations can take some time. The ptr_ring is
      a communication channel between CPUs, where we want to reduce/limit any
      cacheline bouncing.
      
      Do a concentrated bulk dequeue via ptr_ring_consume_batched, to shorten the
      period during which, and the number of times, the remote cacheline in
      ptr_ring is read.
      
      Batch size 8 is chosen both to (1) limit the BH-disable period, and (2)
      consume one cacheline on 64-bit archs. After reducing the BH-disable section
      further we can consider changing this, while still keeping the L1 cacheline
      size in mind.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
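The batching idea above can be sketched with a single-threaded toy ring: pointers are first dequeued in one short burst, and only then processed. This is an illustrative model, not the kernel's ptr_ring. `struct ring`, `ring_produce()`, `ring_consume_batched()` and `drain_sum()` are hypothetical names, and the model has no locking or memory barriers. With 8-byte pointers, a batch of 8 covers 64 bytes, which matches one cacheline on CPUs with 64-byte lines.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 16u  /* hypothetical capacity, must be a power of two */
#define BATCH 8u       /* batch size named in the commit message */

struct ring {
	void *slots[RING_SIZE];
	unsigned int head;  /* next slot to produce into */
	unsigned int tail;  /* next slot to consume from */
};

/* Enqueues one pointer; returns -1 if the ring is full. */
static int ring_produce(struct ring *r, void *p)
{
	assert(r != NULL);
	if (r->head - r->tail == RING_SIZE)
		return -1;
	r->slots[r->head++ % RING_SIZE] = p;
	return 0;
}

/* Dequeues up to `max` pointers in one burst; returns how many. */
static unsigned int ring_consume_batched(struct ring *r, void **out,
					 unsigned int max)
{
	unsigned int n = 0;

	assert(r != NULL && out != NULL);
	while (n < max && r->tail != r->head)
		out[n++] = r->slots[r->tail++ % RING_SIZE];
	return n;
}

/* Drains the ring batch by batch. The (possibly slow) processing step,
 * here just summing ints, runs outside the dequeue burst. */
static long drain_sum(struct ring *r)
{
	void *batch[BATCH];
	unsigned int i, n;
	long sum = 0;

	while ((n = ring_consume_batched(r, batch, BATCH)) > 0) {
		for (i = 0; i < n; i++)
			sum += *(int *)batch[i];
	}
	return sum;
}
```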
  2. 17 Apr, 2019 13 commits
    • Merge branch 'af_xdp-smp_mb-fixes' · 00967e84
      Alexei Starovoitov authored
      Magnus Karlsson says:
      
      ====================
      This patch set fixes one bug and removes two dependencies on Linux
      kernel headers from the XDP socket code in libbpf. A number of people
      have pointed out that these two dependencies make it hard to build the
      XDP socket part of libbpf without any kernel header dependencies. The
      two removed dependencies are:
      
      * Remove the usage of likely and unlikely (compiler.h) in xsk.h. It
        has been reported that the use of these actually decreases the
        performance of the ring access code due to an increase in
        instruction cache misses, so let us just remove these.
      
      * Remove the dependency on barrier.h as it brings in a lot of kernel
        headers. As the XDP socket code only uses two simple functions from
        it, we can reimplement these. As a bonus, the new implementation is
        faster as it uses the same barrier primitives as the kernel does
        when the same code is compiled there. Without this patch, the user
        land code uses lfence and sfence on x86, which are unnecessarily
        harsh/thorough.
      
      In the process of removing these dependencies a missing barrier
      function for at least PPC64 was discovered. For a full explanation on
      the missing barrier, please refer to patch 1. So the patch set now
      starts with two patches fixing this. I have also added a patch at the
      end removing this full memory barrier for x86 only, as it is not
      needed there.
      
      Structure of the patch set:
      Patch 1-2: Adds the missing barrier function in kernel and user space.
      Patch 3-4: Removes the dependencies
      Patch 5: Optimizes the added barrier from patch 2 so that it does not
               do unnecessary work on x86.
      
      v2 -> v3:
      * Added missing memory barrier in ring code
      * Added an explanation on the three barriers we use in the code
      * Moved barrier functions from xsk.h to libbpf_util.h
      * Added comment on why we have these functions in libbpf_util.h
      * Added a new barrier function in user space that makes it possible to
        remove the full memory barrier on x86.
      
      v1 -> v2:
      * Added comment about validity of ARM 32-bit barriers.
        Only armv7 and above.
      
      /Magnus
      ====================
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: optimize barrier for XDP socket rings · 2c5935f1
      Magnus Karlsson authored
      The full memory barrier in the XDP socket rings on the consumer side
      between the load of the data and the store of the consumer ring is
      there to protect the store from being executed before the load of the
      data. If this was allowed to happen, the producer might overwrite the
      data field with a new entry before the consumer got the chance to read
      it.
      
      On x86, stores are guaranteed not to be reordered with older loads, so
      it does not need a full memory barrier here. A compile time barrier
      would be enough. This patch introduces a new primitive in
      libbpf_util.h that implements a new barrier type (libbpf_smp_rwmb)
      that prevents stores from being reordered with older loads. It is then
      used in the XDP socket ring access code in libbpf to improve performance.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
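The ordering requirement described above (the consumer's load of the data must not be reordered after its store to the consumer index) can be expressed portably with C11 atomics. The sketch below swaps in a release store for the libbpf_smp_rwmb() macro, so it is not the libbpf code; `struct spsc_ring`, `ring_put()` and `ring_get()` are hypothetical names for a single-producer, single-consumer ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8u  /* hypothetical size, must be a power of two */

struct spsc_ring {
	_Atomic uint32_t producer;  /* written only by the producer */
	_Atomic uint32_t consumer;  /* written only by the consumer */
	uint64_t desc[RING_ENTRIES];
};

/* Producer side: returns -1 if the ring is full. */
static int ring_put(struct spsc_ring *r, uint64_t v)
{
	uint32_t prod, cons;

	assert(r != NULL);
	prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	cons = atomic_load_explicit(&r->consumer, memory_order_acquire);
	if (prod - cons == RING_ENTRIES)
		return -1;
	r->desc[prod % RING_ENTRIES] = v;
	/* Publish the entry only after its data has been written. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return 0;
}

/* Consumer side: returns -1 if the ring is empty. */
static int ring_get(struct spsc_ring *r, uint64_t *v)
{
	uint32_t cons, prod;

	assert(r != NULL && v != NULL);
	cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	prod = atomic_load_explicit(&r->producer, memory_order_acquire);
	if (cons == prod)
		return -1;
	*v = r->desc[cons % RING_ENTRIES];  /* load of the data */
	/* Release store: the load above may not be reordered after it, so
	 * the producer cannot reuse the slot before it has been read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return 0;
}
```

On x86, compilers typically emit a plain store for a release store, which is in line with the commit's observation that only a compiler barrier is needed there. Weaker architectures get whatever fence or instruction the release ordering requires.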
    • libbpf: remove dependency on barrier.h in xsk.h · b7e3a280
      Magnus Karlsson authored
      The use of smp_rmb() and smp_wmb() creates a Linux header dependency
      on barrier.h that is unnecessary in most parts. This patch implements
      the two small defines that are needed from barrier.h. As a bonus, the
      new implementations are faster than the default ones: the defaults use
      sfence and lfence on x86, while we only need a compiler barrier in
      our case, just as when the same ring access code is compiled in
      the kernel.
      
      Fixes: 1cad0788 ("libbpf: add support for using AF_XDP sockets")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: remove likely/unlikely in xsk.h · a06d7296
      Magnus Karlsson authored
      This patch removes the use of likely and unlikely in xsk.h since they
      create a dependency on Linux headers as reported by several
      users. There have also been reports that the use of these decreases
      performance as the compiler puts the code on two different cache lines
      instead of on a single one. All in all, I think we are better off
      without them.
      
      Fixes: 1cad0788 ("libbpf: add support for using AF_XDP sockets")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: fix XDP socket ring buffer memory ordering · d5e63fdd
      Magnus Karlsson authored
      The ring buffer code of XDP sockets is missing a memory barrier on the
      consumer side between the load of the data and the write that signals
      that it is ok for the producer to put new data into the buffer. On
      architectures that do not guarantee that stores are not reordered
      with older loads, the producer might put data into the ring before the
      consumer had the chance to read it. As IA does guarantee this
      ordering, it would only need a compiler barrier here, but there are no
      primitives in barrier.h for this specific case (preventing writes from
      being ordered before older reads) so I had to add a smp_mb() here which
      will translate into a run-time synch operation on IA.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: fix XDP socket ring buffer memory ordering · f63666de
      Magnus Karlsson authored
      The ring buffer code of XDP sockets is missing a memory barrier on the
      consumer side between the load of the data and the write that signals
      that it is ok for the producer to put new data into the buffer. On
      architectures that do not guarantee that stores are not reordered
      with older loads, the producer might put data into the ring before the
      consumer had the chance to read it. As IA does guarantee this
      ordering, it would only need a compiler barrier here, but there are no
      primitives in Linux for this specific case (preventing writes from
      being ordered before older reads) so I had to add a smp_mb() here which
      will translate into a run-time synch operation on IA.
      
      Added a longish comment in the code explaining what each barrier in
      the ring implementation accomplishes and what would happen if we
      removed one of them.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tools/bpftool: show btf_id in map listing · d1b7725d
      Prashant Bhole authored
      Let's print btf id of map similar to the way we are printing it
      for programs.
      
      Sample output:
      user@test# bpftool map -f
      61: lpm_trie  flags 0x1
      	key 20B  value 8B  max_entries 1  memlock 4096B
      133: array  name test_btf_id  flags 0x0
      	key 4B  value 4B  max_entries 4  memlock 4096B
      	pinned /sys/fs/bpf/test100
      	btf_id 174
      170: array  name test_btf_id  flags 0x0
      	key 4B  value 4B  max_entries 4  memlock 4096B
      	btf_id 240
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tools/bpftool: re-organize newline printing for map listing · d459b59e
      Prashant Bhole authored
      Let's move the final newline printing in show_map_close_plain() to
      the end of the function because it looks correct and consistent with
      prog.c. Also let's make the related changes to the line which prints
      the pinned file name.
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: Support sysctl hook · f25377ee
      Andrey Ignatov authored
      Add support for recently added BPF_PROG_TYPE_CGROUP_SYSCTL program type
      and BPF_CGROUP_SYSCTL attach type.
      
      Example of bpftool output with sysctl program from selftests:
      
        # bpftool p load ./test_sysctl_prog.o /mnt/bpf/sysctl_prog type cgroup/sysctl
        # bpftool p l
        9: cgroup_sysctl  name sysctl_tcp_mem  tag 0dd05f81a8d0d52e  gpl
                loaded_at 2019-04-16T12:57:27-0700  uid 0
                xlated 1008B  jited 623B  memlock 4096B
        # bpftool c a /mnt/cgroup2/bla sysctl id 9
        # bpftool c t
        CgroupPath
        ID       AttachType      AttachFlags     Name
        /mnt/cgroup2/bla
            9        sysctl                          sysctl_tcp_mem
        # bpftool c d /mnt/cgroup2/bla sysctl id 9
        # bpftool c t
        CgroupPath
        ID       AttachType      AttachFlags     Name
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: fix printf formatter for ptrdiff_t argument · e1d1dc46
      Andrii Nakryiko authored
      Using %ld for printing out value of ptrdiff_t type is not portable
      between 32-bit and 64-bit archs. This is causing compilation errors for
      libbpf on 32-bit platform (discovered as part of an effort to integrate
      libbpf into systemd ([0])). Proper formatter is %td, which is used in
      this patch.
      
      v1->v2:
        - add Reported-by
        - provide more context on how this issue was discovered
      
      [0] https://github.com/systemd/systemd/pull/12151
      
      Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
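The portability point made in the commit above can be shown with a short example: `%td` is the standard C99 conversion for `ptrdiff_t`, so the same format string works whether `ptrdiff_t` is 32 or 64 bits wide. The helper `format_offset()` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats the distance between two pointers into `out`. Using %td keeps
 * the format string correct on both 32-bit and 64-bit targets, whereas
 * %ld only matches where ptrdiff_t happens to be long. */
static int format_offset(char *out, size_t len,
			 const char *base, const char *pos)
{
	ptrdiff_t off;

	assert(out != NULL && base != NULL && pos != NULL);
	off = pos - base;
	return snprintf(out, len, "offset %td", off);
}
```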
    • bpf: use BPF_CAST_CALL for casting bpf call · 0d306c31
      Prashant Bhole authored
      verifier.c uses BPF_CAST_CALL for casting bpf call except at one
      place in jit_subprogs(). Let's use the macro for consistency.
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: allow clearing all sock_ops callback flags · 725721a6
      Viet Hoang Tran authored
      The helper function bpf_sock_ops_cb_flags_set() can be used to both
      set and clear the sock_ops callback flags. However, its current
      behavior is not consistent. A BPF program may clear a flag if more than
      one is set, or replace a flag with another one, but cannot clear all
      flags.
      
      This patch also updates the documentation to clarify the ability to
      clear flags of this helper function.
      Signed-off-by: Hoang Tran <hoang.tran@uclouvain.be>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests: bpf: add VRF test cases to lwt_ip_encap test. · 809041e7
      Peter Oskolkov authored
      This patch adds tests validating that VRF and BPF-LWT
      encap work together well, as requested by David Ahern.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 16 Apr, 2019 16 commits
  4. 13 Apr, 2019 4 commits