1. 05 Jun, 2018 11 commits
  2. 04 Jun, 2018 12 commits
    • bpf: guard bpf_get_current_cgroup_id() with CONFIG_CGROUPS · 34ea38ca
      Yonghong Song authored
      Commit bf6fa2c8 ("bpf: implement bpf_get_current_cgroup_id()
      helper") introduced a new helper bpf_get_current_cgroup_id().
      The helper has a dependency on CONFIG_CGROUPS.
      
      When CONFIG_CGROUPS is not defined, using the helper results in
      the following verifier error:
        kernel subsystem misconfigured func bpf_get_current_cgroup_id#80
      which is hard for users to interpret.
      Guarding the reference to bpf_get_current_cgroup_id_proto with
      CONFIG_CGROUPS instead yields the clearer message:
        unknown func bpf_get_current_cgroup_id#80
      
      Fixes: bf6fa2c8 ("bpf: implement bpf_get_current_cgroup_id() helper")
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      34ea38ca
    • Merge branch 'bpf-af-xdp-fixes' · 64995362
      Daniel Borkmann authored
      Björn Töpel says:
      
      ====================
      An issue with the current AF_XDP uapi raised by Mykyta Iziumtsev (see
      https://www.spinics.net/lists/netdev/msg503664.html) is that it does
      not support NICs that have a "type-writer" model in an efficient
      way. In this model, a memory window is passed to the hardware and
      multiple frames might be filled into that window, instead of just
      one as in the current fixed frame-size model.
      
      This patch set fixes two bugs in the current implementation and then
      changes the uapi so that the type-writer model can be supported
      efficiently by a possible future extension of AF_XDP.
      
      These are the uapi changes in this patch:
      
      * Change the "u32 idx" in the descriptors to "u64 addr". The current
        idx-based format does NOT work for the type-writer model (as packets
        can start anywhere within a frame), whereas a relative address
        pointer (the u64 addr) works well for both models in the prototype
        code we have that supports both. We widened it from u32 to u64 to
        support umems larger than 4G. We have also removed the u16 offset,
        since with a "u64 addr" that information is already carried in the
        least significant bits of the address.
      
      * We want to use the "u8 padding[5]" for something useful in the future
        (since we will not be allowed to change its name later), so we now
        call it just options, so that it can be extended for various
        purposes in the future. It is a u32, as that is what is left of the
        16-byte descriptor.
      
      * We changed the name of frame_size in the UMEM_REG setsockopt to
        chunk_size since this naming also makes sense to the type-writer
        model.
      
      With these changes to the uapi, we believe the type-writer model can
      be supported without having to resort to a new descriptor format. The
      type-writer model could then be supported, from the uapi point of
      view, by setting a flag at bind time and providing a new flag bit in
      the options field of the descriptor that signals to user space that
      all packets have been written in a chunk. Or with a new chunk
      completion queue as suggested by Mykyta in his latest feedback mail on
      the list.
      
      We based this patch set on bpf-next commit bd3a08aa ("bpf:
      flowlabel in bpf_fib_lookup should be flowinfo")
      
      The structure of the patch set is as follows:
      
      Patches 1-2: Fixes two bugs in the current implementation.
      Patches 3-4: Prepares the uapi for a "type-writer" model and modifies
                   the sample application so that it works with the new
                   uapi.
      Patch 5: Small performance improvement patch for the sample application.
      
      Cheers: Magnus and Björn
      ====================
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      64995362
    • Magnus Karlsson
    • samples/bpf: adapted to new uapi · a412ef54
      Björn Töpel authored
      Here, the xdpsock sample application is adjusted to the new descriptor
      format.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a412ef54
    • xsk: new descriptor addressing scheme · bbff2f32
      Björn Töpel authored
      Currently, AF_XDP only supports a fixed frame-size memory scheme where
      each frame is referenced via an index (idx). A user passes the frame
      index to the kernel, and the kernel acts upon the data.  Some NICs,
      however, do not have a fixed frame-size model, instead they have a
      model where a memory window is passed to the hardware and multiple
      frames are filled into that window (referred to as the "type-writer"
      model).
      
      By changing the descriptor format from the current frame index
      addressing scheme, AF_XDP can in the future be extended to support
      these kinds of NICs.
      
      In the index-based model, an idx refers to a frame of size
      frame_size. Addressing a frame in the UMEM is done by offsetting the
      UMEM starting address by idx * frame_size + offset.
      Communication via the fill- and completion-rings is done by means
      of idx.
      
      In this commit, the idx is removed in favor of an address (addr),
      which is a relative address ranging over the UMEM. To convert an
      idx-based address to the new addr is simply: addr = idx * frame_size +
      offset.
      
      We also stop referring to the UMEM "frame" as a frame. Instead it is
      simply called a chunk.
      
      To transfer ownership of a chunk to the kernel, the addr of the chunk
      is passed in the fill-ring. Note that the kernel will mask addr to
      make it chunk aligned, so there is no need for userspace to do
      that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
      3000 to the fill-ring will refer to the same chunk.
      
      On the completion-ring, the addr will match that of the Tx descriptor,
      passed to the kernel.
      
      Changing the descriptor format to use chunks/addr will allow for
      future changes to move to a type-writer based model, where multiple
      frames can reside in one chunk. In this model, passing one single
      chunk into the fill-ring would potentially result in multiple Rx
      descriptors.
      
      This commit changes the uapi of AF_XDP sockets, and updates the
      documentation.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      bbff2f32
    • xsk: proper Rx drop statistics update · a509a955
      Björn Töpel authored
      Previously, rx_dropped could be updated incorrectly, e.g. if the XDP
      program redirected the frame to a socket bound to a different queue
      than where the XDP program was executing.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a509a955
    • xsk: proper fill queue descriptor validation · 4e64c835
      Björn Töpel authored
      Previously the fill queue descriptor was not copied to kernel space
      prior to validating it, making it possible for userland to change the
      descriptor post-kernel-validation.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4e64c835
    • bpf: flowlabel in bpf_fib_lookup should be flowinfo · bd3a08aa
      David Ahern authored
      As Michal noted, the flow struct takes both the flow label and priority.
      Update the bpf_fib_lookup API to note that it is flowinfo and not just
      the flow label.
      
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bd3a08aa
    • Merge branch 'bpf_get_current_cgroup_id' · 432bdb58
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools are currently
      either filtered based on pid or system-wide.
      
      Containers have been used quite extensively in industry and
      cgroup is often used together to provide resource isolation
      and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which will return the id of the cgroup within which
      the current task is running.
      
      Patch #1 implements the new helper in the kernel.
      Patch #2 syncs the uapi bpf.h header and helper between tools
      and kernel.
      Patch #3 shows how to get the same cgroup id in user space,
      so a filter or policy could be configured in the bpf program
      based on the current task's cgroup.
      
      Changelog:
        v1 -> v2:
           . rebase to resolve merge conflict with latest bpf-next.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      432bdb58
    • tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper · f269099a
      Yonghong Song authored
      Syscall name_to_handle_at() can be used to get the cgroup id
      for a particular cgroup path in user space. The selftest
      gets the cgroup id from both user and kernel space, and compares
      them to ensure they are equal.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f269099a
    • tools/bpf: sync uapi bpf.h for bpf_get_current_cgroup_id() helper · c7ddbbaf
      Yonghong Song authored
      Sync kernel uapi/linux/bpf.h with tools uapi/linux/bpf.h.
      Also add the necessary helper define in bpf_helpers.h.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c7ddbbaf
    • bpf: implement bpf_get_current_cgroup_id() helper · bf6fa2c8
      Yonghong Song authored
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools are currently
      either filtered based on pid or system-wide.
      
      Containers have been used quite extensively in industry and
      cgroup is often used together to provide resource isolation
      and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which will return the id of the cgroup within which
      the current task is running.
      
      A later patch will provide an example showing that
      userspace can get the same cgroup id, so it can
      configure a filter or policy in the bpf program based on
      the task's cgroup id.
      
      The helper is currently implemented for tracing. It can
      be added to other program types as well when needed.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bf6fa2c8
  3. 03 Jun, 2018 17 commits
    • Merge branch 'ndo_xdp_xmit-cleanup' · ea9916ea
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      As I mentioned in merge commit 10f67868 ("Merge branch 'xdp_xmit-bulking'")
      I plan to change the API for ndo_xdp_xmit once more, by adding a flags
      argument, which is done in this patchset.
      
      I know it is late in the cycle (currently at rc7), but it would be
      nice to avoid changing NDOs over several kernel releases, as that is
      annoying to vendors and distro backporters. This is not strictly
      UAPI, so the change is allowed (according to Alexei).
      
      The end-goal is getting rid of the ndo_xdp_flush operation, as it will
      make it possible for drivers to implement a TXQ synchronization mechanism
      that is not necessarily derived from the CPU id (smp_processor_id).
      
      This patchset removes all callers of the ndo_xdp_flush operation, but
      it doesn't take the last step of removing it from all drivers.  This
      can be done later, or I can update the patchset on request.
      
      Micro-benchmarks only show a very small performance improvement, for
      map-redirect around ~2 ns, and for non-map redirect ~7 ns.  I've not
      benchmarked this with CONFIG_RETPOLINE, but the performance benefit
      should be more visible given we end up removing an indirect call.
      
      ---
      V2: Updated based on feedback from Song Liu <songliubraving@fb.com>
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ea9916ea
    • bpf/xdp: devmap can avoid calling ndo_xdp_flush · c1ece6b2
      Jesper Dangaard Brouer authored
      The XDP_REDIRECT map devmap can avoid using ndo_xdp_flush by instead
      instructing ndo_xdp_xmit to flush via the XDP_XMIT_FLUSH flag in
      appropriate places.
      
      Notice after this patch it is possible to remove ndo_xdp_flush
      completely, as this is the last user of ndo_xdp_flush. This is left
      for later patches, to keep driver changes separate.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c1ece6b2
    • bpf/xdp: non-map redirect can avoid calling ndo_xdp_flush · 1e67575a
      Jesper Dangaard Brouer authored
      This is the first real user of the XDP_XMIT_FLUSH flag.
      
      As pointed out many times, XDP_REDIRECT without using BPF maps is
      significantly slower than the map variant.  This is primarily due to
      the lack of bulking, as the ndo_xdp_flush operation is required after
      each frame (to avoid frames hanging on the egress device).
      
      It is still possible to optimize this case.  Instead of invoking two
      NDO indirect calls, which are very expensive with CONFIG_RETPOLINE,
      we instruct ndo_xdp_xmit to flush via the XDP_XMIT_FLUSH flag.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1e67575a
    • xdp: done implementing ndo_xdp_xmit flush flag for all drivers · 73de5717
      Jesper Dangaard Brouer authored
      Removing XDP_XMIT_FLAGS_NONE as all drivers now implement
      a flush operation in their ndo_xdp_xmit call.  The compiler
      will catch it if any user of XDP_XMIT_FLAGS_NONE remains.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      73de5717
    • virtio_net: implement flush flag for ndo_xdp_xmit · 5d274cb4
      Jesper Dangaard Brouer authored
      When passed the XDP_XMIT_FLUSH flag virtnet_xdp_xmit now performs the
      same virtqueue_kick as virtnet_xdp_flush.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5d274cb4
    • tun: implement flush flag for ndo_xdp_xmit · 0c9d917b
      Jesper Dangaard Brouer authored
      When passed the XDP_XMIT_FLUSH flag tun_xdp_xmit now performs the same
      kind of socket wake up as in tun_xdp_flush(). The wake up code from
      tun_xdp_flush is generalized and shared with tun_xdp_xmit.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0c9d917b
    • ixgbe: implement flush flag for ndo_xdp_xmit · 5e2e6095
      Jesper Dangaard Brouer authored
      When passed the XDP_XMIT_FLUSH flag ixgbe_xdp_xmit now performs the
      same kind of ring tail update as in ixgbe_xdp_flush.  The update tail
      code in ixgbe_xdp_flush is generalized and shared with ixgbe_xdp_xmit.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5e2e6095
    • i40e: implement flush flag for ndo_xdp_xmit · cdb57ed0
      Jesper Dangaard Brouer authored
      When passed the XDP_XMIT_FLUSH flag i40e_xdp_xmit now performs the
      same kind of ring tail update as in i40e_xdp_flush.  The advantage is
      that all the necessary checks have already been performed and xdp_ring
      can be updated, instead of having to perform the exact same
      steps/checks in i40e_xdp_flush.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cdb57ed0
    • xdp: add flags argument to ndo_xdp_xmit API · 42b33468
      Jesper Dangaard Brouer authored
      This patch only changes the API and rejects any use of flags. This is
      an intermediate step that allows us to implement the flush flag
      operation later, in a separate patch for each individual driver.
      
      The plan is to implement flush operation via XDP_XMIT_FLUSH flag
      and then remove XDP_XMIT_FLAGS_NONE when done.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      42b33468
    • Merge branch 'misc-BPF-improvements' · 69b45078
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      This set adds various patches I still had in my queue. The first two
      are test cases to provide coverage for the recent two fixes that
      went to the bpf tree, then a small improvement on the error message
      for gpl helpers. Next, we expose prog and map id into fdinfo in
      order to allow for inspection of these objects currently used
      in applications. The patch after that removes a retpoline call for
      the map lookup/update/delete helpers. A new helper is added in the
      subsequent patch to look up the skb's socket's cgroup v2 id, which
      can be used in an efficient way for e.g. lookups on the egress side.
      Next one is a fix to fully clear state info in tunnel/xfrm helpers.
      Given this is full cap_sys_admin from init ns and has the same priv
      requirements as tracing, bpf-next should be okay. A small bug
      fix for bpf_asm follows, and next a fix for context access in
      tracing which was recently reported. Lastly, a small update to
      the maintainers file to add the patchwork url and missing files.
      
      Thanks!
      
      v2 -> v3:
        - Noticed a merge artefact inside uapi header comment, sigh,
          fixed now.
      v1 -> v2:
        - minor fix in getting context access work on 32 bit for tracing
        - add paragraph to uapi helper doc to better describe kernel
          build deps for cggroup helper
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      69b45078
    • bpf, doc: add missing patchwork url and libbpf to maintainers · 10a76564
      Daniel Borkmann authored
      Add missing bits under tools/lib/bpf/ and also a Q: entry in order to
      make it easier for people to retrieve the current patch queue.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      10a76564
    • bpf: sync bpf uapi header with tools · 6b6a1925
      Daniel Borkmann authored
      Pull in recent changes from include/uapi/linux/bpf.h.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6b6a1925
    • bpf: fix context access in tracing progs on 32 bit archs · bc23105c
      Daniel Borkmann authored
      Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
      program type in test_verifier report the following errors on x86_32:
      
        172/p unpriv: spill/fill of different pointers ldx FAIL
        Unexpected error message!
        0: (bf) r6 = r10
        1: (07) r6 += -8
        2: (15) if r1 == 0x0 goto pc+3
        R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
        3: (bf) r2 = r10
        4: (07) r2 += -76
        5: (7b) *(u64 *)(r6 +0) = r2
        6: (55) if r1 != 0x0 goto pc+1
        R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
        7: (7b) *(u64 *)(r6 +0) = r1
        8: (79) r1 = *(u64 *)(r6 +0)
        9: (79) r1 = *(u64 *)(r1 +68)
        invalid bpf_context access off=68 size=8
      
        378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (71) r0 = *(u8 *)(r1 +68)
        invalid bpf_context access off=68 size=1
      
        379/p check bpf_perf_event_data->sample_period half load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (69) r0 = *(u16 *)(r1 +68)
        invalid bpf_context access off=68 size=2
      
        380/p check bpf_perf_event_data->sample_period word load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (61) r0 = *(u32 *)(r1 +68)
        invalid bpf_context access off=68 size=4
      
        381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (79) r0 = *(u64 *)(r1 +68)
        invalid bpf_context access off=68 size=8
      
      Reason is that struct pt_regs on x86_32 doesn't fully align to an
      8 byte boundary due to its size of 68 bytes. Therefore,
      bpf_ctx_narrow_access_ok() will bail out, since off & (size_default - 1),
      which is 68 & 7, doesn't cleanly align in the case of sample_period
      access from struct bpf_perf_event_data; hence the verifier wrongly
      thinks we might be doing an unaligned access here even though the
      underlying arch can handle it just fine. Therefore adjust this down
      to machine size and check and rewrite the offset for narrow access
      on that basis. We also need to fix the corresponding
      pe_prog_is_valid_access(), since we hit the check for off % size != 0
      (e.g. 68 % 8 -> 4) in the first and last test. With that in place,
      progs for tracing work on x86_32.
      Reported-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bc23105c
    • bpf: fix cbpf parser bug for octal numbers · b3bbba35
      Daniel Borkmann authored
      Range is 0-7, not 0-9, otherwise the parser silently excludes the
      digit from the strtol() call rather than throwing an error.
      Reported-by: Marc Boschma <marc@boschma.cx>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b3bbba35
    • bpf: make sure to clear unused fields in tunnel/xfrm state fetch · 1fbc2e0c
      Daniel Borkmann authored
      Since the remaining bits are not filled in struct bpf_tunnel_key
      and struct bpf_xfrm_state, respectively, and originate from
      uninitialized stack space, we should make sure to clear them
      before handing control back to the program.
      
      Also add a padding element to struct bpf_xfrm_state for future use,
      similar to what we have in struct bpf_tunnel_key, and clear it as
      well.
      
        struct bpf_xfrm_state {
            __u32                      reqid;            /*     0     4 */
            __u32                      spi;              /*     4     4 */
            __u16                      family;           /*     8     2 */
      
            /* XXX 2 bytes hole, try to pack */
      
            union {
                __u32              remote_ipv4;          /*           4 */
                __u32              remote_ipv6[4];       /*          16 */
            };                                           /*    12    16 */
      
            /* size: 28, cachelines: 1, members: 4 */
            /* sum members: 26, holes: 1, sum holes: 2 */
            /* last cacheline: 28 bytes */
        };
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1fbc2e0c
    • bpf: add bpf_skb_cgroup_id helper · cb20b08e
      Daniel Borkmann authored
      Add a new bpf_skb_cgroup_id() helper that allows retrieving the
      cgroup id from the skb's socket. This is useful in particular to
      enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in
      cgroup v2 by allowing ID based matching on egress. This can in
      particular be used in combination with applying policy e.g. from
      map lookups, and also complements the older bpf_skb_under_cgroup()
      interface. In user space the cgroup id for a given path can be
      retrieved through the f_handle as demonstrated in [0] recently.
      
        [0] https://lkml.org/lkml/2018/5/22/1190
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cb20b08e
    • bpf: avoid retpoline for lookup/update/delete calls on maps · 09772d92
      Daniel Borkmann authored
      While some of the BPF map lookup helpers provide a ->map_gen_lookup()
      callback for inlining the map lookup altogether, it is not available
      for every map, so the remaining ones have to call bpf_map_lookup_elem()
      helper which does a dispatch to map->ops->map_lookup_elem(). In
      times of retpolines, this will control and trap speculative execution
      rather than letting it do its work for the indirect call and will
      therefore cause a slowdown. Likewise, bpf_map_update_elem() and
      bpf_map_delete_elem() do not have an inlined version and need to call
      into their map->ops->map_update_elem() resp. map->ops->map_delete_elem()
      handlers.
      
      Before:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#232656
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
         16: (95) exit                                 helper
      
      After:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#233328
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
         16: (95) exit
      
      In all three lookup/update/delete cases, however, we can use the actual
      address of the map callback directly if we find that there's only a
      single path with a map pointer leading to the helper call, meaning
      when the map pointer has not been poisoned from verifier side.
      Example code can be seen above for the delete case.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      09772d92