1. 25 May, 2018 14 commits
    • Jesper Dangaard Brouer's avatar
      xdp: change ndo_xdp_xmit API to support bulking · 735fc405
      Jesper Dangaard Brouer authored
      This patch change the API for ndo_xdp_xmit to support bulking
      xdp_frames.
      
      When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
      Most of the slowdown is caused by DMA API indirect function calls, but
      also the net_device->ndo_xdp_xmit() call.
      
      Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
      single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
      performance improved:
       for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
       for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
      
      With frames avail as a bulk inside the driver ndo_xdp_xmit call,
      further optimizations are possible, like bulk DMA-mapping for TX.
      
      Testing without CONFIG_RETPOLINE show the same performance for
      physical NIC drivers.
      
      The virtual NIC driver tun sees a huge performance boost, as it can
      avoid doing per frame producer locking, but instead amortize the
      locking cost over the bulk.
      
      V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
      V4: Isolated ndo, driver changes and callers.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      735fc405
    • Jesper Dangaard Brouer's avatar
      xdp: introduce xdp_return_frame_rx_napi · 389ab7f0
      Jesper Dangaard Brouer authored
      When sending an xdp_frame through xdp_do_redirect call, then error
      cases can happen where the xdp_frame needs to be dropped, and
      returning an -errno code isn't sufficient/possible any-longer
      (e.g. for cpumap case). This is already fully supported, by simply
      calling xdp_return_frame.
      
      This patch is an optimization, which provides xdp_return_frame_rx_napi,
      which is a faster variant for these error cases.  It take advantage of
      the protection provided by XDP RX running under NAPI protection.
      
      This change is mostly relevant for drivers using the page_pool
      allocator as it can take advantage of this. (Tested with mlx5).
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      389ab7f0
    • Jesper Dangaard Brouer's avatar
      samples/bpf: xdp_monitor use tracepoint xdp:xdp_devmap_xmit · 9940fbf6
      Jesper Dangaard Brouer authored
      The xdp_monitor sample/tool is updated to use the new tracepoint
      xdp:xdp_devmap_xmit the previous patch just introduced.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      9940fbf6
    • Jesper Dangaard Brouer's avatar
      xdp: add tracepoint for devmap like cpumap have · 38edddb8
      Jesper Dangaard Brouer authored
      Notice how this allow us get XDP statistic without affecting the XDP
      performance, as tracepoint is no-longer activated on a per packet basis.
      
      V5: Spotted by John Fastabend.
       Fix 'sent' also counted 'drops' in this patch, a later patch corrected
       this, but it was a mistake in this intermediate step.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      38edddb8
    • Jesper Dangaard Brouer's avatar
      bpf: devmap prepare xdp frames for bulking · 5d053f9d
      Jesper Dangaard Brouer authored
      Like cpumap create queue for xdp frames that will be bulked.  For now,
      this patch simply invoke ndo_xdp_xmit foreach frame.  This happens,
      either when the map flush operation is envoked, or when the limit
      DEV_MAP_BULK_SIZE is reached.
      
      V5: Avoid memleak on error path in dev_map_update_elem()
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      5d053f9d
    • Jesper Dangaard Brouer's avatar
      bpf: devmap introduce dev_map_enqueue · 67f29e07
      Jesper Dangaard Brouer authored
      Functionality is the same, but the ndo_xdp_xmit call is now
      simply invoked from inside the devmap.c code.
      
      V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
      
      V5: Cleanups requested by Daniel
       - Newlines before func definition
       - Use BUILD_BUG_ON checks
       - Remove unnecessary use return value store in dev_map_enqueue
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      67f29e07
    • Alexei Starovoitov's avatar
      Merge branch 'bpf-task-fd-query' · f80acbd2
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Currently, suppose a userspace application has loaded a bpf program
      and attached it to a tracepoint/kprobe/uprobe, and a bpf
      introspection tool, e.g., bpftool, wants to show which bpf program
      is attached to which tracepoint/kprobe/uprobe. Such attachment
      information will be really useful to understand the overall bpf
      deployment in the system.
      
      There is a name field (16 bytes) for each program, which could
      be used to encode the attachment point. There are some drawbacks
      for this approaches. First, bpftool user (e.g., an admin) may not
      really understand the association between the name and the
      attachment point. Second, if one program is attached to multiple
      places, encoding a proper name which can imply all these
      attachments becomes difficult.
      
      This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
      Given a pid and fd, this command will return bpf related information
      to user space. Right now it only supports tracepoint/kprobe/uprobe
      perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
         . prog_id
         . tracepoint name, or
         . k[ret]probe funcname + offset or kernel addr, or
         . u[ret]probe filename + offset
      to the userspace.
      The user can use "bpftool prog" to find more information about
      bpf program itself with prog_id.
      
      Patch #1 adds function perf_get_event() in kernel/events/core.c.
      Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
      Patch #3 syncs tools bpf.h header and also add bpf_task_fd_query()
      in the libbpf library for samples/selftests/bpftool to use.
      Patch #4 adds ksym_get_addr() utility function.
      Patch #5 add a test in samples/bpf for querying k[ret]probes and
      u[ret]probes.
      Patch #6 add a test in tools/testing/selftests/bpf for querying
      raw_tracepoint and tracepoint.
      Patch #7 add a new subcommand "perf" to bpftool.
      
      Changelogs:
        v4 -> v5:
           . return strlen(buf) instead of strlen(buf) + 1
             in the attr.buf_len. As long as user provides
             non-empty buffer, it will be filed with empty
             string, truncated string, or full string
             based on the buffer size and the length of
             to-be-copied string.
        v3 -> v4:
           . made attr buf_len input/output. The length of
             actual buffter is written to buf_len so user space knows
             what is actually needed. If user provides a buffer
             with length >= 1 but less than required, do partial
             copy and return -ENOSPC.
           . code simplification with put_user.
           . changed query result attach_info to fd_type.
           . add tests at selftests/bpf to test zero len, null buf and
             insufficient buf.
        v2 -> v3:
           . made perf_get_event() return perf_event pointer const.
             this was to ensure that event fields are not meddled.
           . detect whether newly BPF_TASK_FD_QUERY is supported or
             not in "bpftool perf" and warn users if it is not.
        v1 -> v2:
           . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
             to BPF_TASK_FD_QUERY.
           . fixed various "bpftool perf" issues and added documentation
             and auto-completion.
      ====================
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f80acbd2
    • Yonghong Song's avatar
      tools/bpftool: add perf subcommand · b04df400
      Yonghong Song authored
      The new command "bpftool perf [show | list]" will traverse
      all processes under /proc, and if any fd is associated
      with a perf event, it will print out related perf event
      information. Documentation is also added.
      
      Below is an example to show the results using bcc commands.
      Running the following 4 bcc commands:
        kprobe:     trace.py '__x64_sys_nanosleep'
        kretprobe:  trace.py 'r::__x64_sys_nanosleep'
        tracepoint: trace.py 't:syscalls:sys_enter_nanosleep'
        uprobe:     trace.py 'p:/home/yhs/a.out:main'
      
      The bpftool command line and result:
      
        $ bpftool perf
        pid 21711  fd 5: prog_id 5  kprobe  func __x64_sys_write  offset 0
        pid 21765  fd 5: prog_id 7  kretprobe  func __x64_sys_nanosleep  offset 0
        pid 21767  fd 5: prog_id 8  tracepoint  sys_enter_nanosleep
        pid 21800  fd 5: prog_id 9  uprobe  filename /home/yhs/a.out  offset 1159
      
        $ bpftool -j perf
        [{"pid":21711,"fd":5,"prog_id":5,"fd_type":"kprobe","func":"__x64_sys_write","offset":0}, \
         {"pid":21765,"fd":5,"prog_id":7,"fd_type":"kretprobe","func":"__x64_sys_nanosleep","offset":0}, \
         {"pid":21767,"fd":5,"prog_id":8,"fd_type":"tracepoint","tracepoint":"sys_enter_nanosleep"}, \
         {"pid":21800,"fd":5,"prog_id":9,"fd_type":"uprobe","filename":"/home/yhs/a.out","offset":1159}]
      
        $ bpftool prog
        5: kprobe  name probe___x64_sys  tag e495a0c82f2c7a8d  gpl
      	  loaded_at 2018-05-15T04:46:37-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 4
        7: kprobe  name probe___x64_sys  tag f2fdee479a503abf  gpl
      	  loaded_at 2018-05-15T04:48:32-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 7
        8: tracepoint  name tracepoint__sys  tag 5390badef2395fcf  gpl
      	  loaded_at 2018-05-15T04:48:48-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 8
        9: kprobe  name probe_main_1  tag 0a87bdc2e2953b6d  gpl
      	  loaded_at 2018-05-15T04:49:52-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 9
      
        $ ps ax | grep "python ./trace.py"
        21711 pts/0    T      0:03 python ./trace.py __x64_sys_write
        21765 pts/0    S+     0:00 python ./trace.py r::__x64_sys_nanosleep
        21767 pts/2    S+     0:00 python ./trace.py t:syscalls:sys_enter_nanosleep
        21800 pts/3    S+     0:00 python ./trace.py p:/home/yhs/a.out:main
        22374 pts/1    S+     0:00 grep --color=auto python ./trace.py
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b04df400
    • Yonghong Song's avatar
      tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs · f699cf7a
      Yonghong Song authored
      The new tests are added to query perf_event information
      for raw_tracepoint and tracepoint attachment. For tracepoint,
      both syscalls and non-syscalls tracepoints are queries as
      they are treated slightly differently inside the kernel.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f699cf7a
    • Yonghong Song's avatar
      samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY · ecb96f7f
      Yonghong Song authored
      This is mostly to test kprobe/uprobe which needs kernel headers.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ecb96f7f
    • Yonghong Song's avatar
      tools/bpf: add ksym_get_addr() in trace_helpers · 73bc4d9f
      Yonghong Song authored
      Given a kernel function name, ksym_get_addr() will return the kernel
      address for this function, or 0 if it cannot find this function name
      in /proc/kallsyms. This function will be used later when a kernel
      address is used to initiate a kprobe perf event.
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      73bc4d9f
    • Yonghong Song's avatar
      tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in libbpf · 30687ad9
      Yonghong Song authored
      Sync kernel header bpf.h to tools/include/uapi/linux/bpf.h and
      implement bpf_task_fd_query() in libbpf. The test programs
      in samples/bpf and tools/testing/selftests/bpf, and later bpftool
      will use this libbpf function to query kernel.
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      30687ad9
    • Yonghong Song's avatar
      bpf: introduce bpf subcommand BPF_TASK_FD_QUERY · 41bdc4b4
      Yonghong Song authored
      Currently, suppose a userspace application has loaded a bpf program
      and attached it to a tracepoint/kprobe/uprobe, and a bpf
      introspection tool, e.g., bpftool, wants to show which bpf program
      is attached to which tracepoint/kprobe/uprobe. Such attachment
      information will be really useful to understand the overall bpf
      deployment in the system.
      
      There is a name field (16 bytes) for each program, which could
      be used to encode the attachment point. There are some drawbacks
      for this approaches. First, bpftool user (e.g., an admin) may not
      really understand the association between the name and the
      attachment point. Second, if one program is attached to multiple
      places, encoding a proper name which can imply all these
      attachments becomes difficult.
      
      This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
      Given a pid and fd, if the <pid, fd> is associated with a
      tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
         . prog_id
         . tracepoint name, or
         . k[ret]probe funcname + offset or kernel addr, or
         . u[ret]probe filename + offset
      to the userspace.
      The user can use "bpftool prog" to find more information about
      bpf program itself with prog_id.
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      41bdc4b4
    • Yonghong Song's avatar
      perf/core: add perf_get_event() to return perf_event given a struct file · f8d959a5
      Yonghong Song authored
      A new extern function, perf_get_event(), is added to return a perf event
      given a struct file. This function will be used in later patches.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f8d959a5
  2. 24 May, 2018 19 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-ipv6-seg6-bpf-action' · 31ad3923
      Daniel Borkmann authored
      Mathieu Xhonneux says:
      
      ====================
      As of Linux 4.14, it is possible to define advanced local processing for
      IPv6 packets with a Segment Routing Header through the seg6local LWT
      infrastructure. This LWT implements the network programming principles
      defined in the IETF "SRv6 Network Programming" draft.
      
      The implemented operations are generic, and it would be very interesting to
      be able to implement user-specific seg6local actions, without having to
      modify the kernel directly. To do so, this patchset adds an End.BPF action
      to seg6local, powered by some specific Segment Routing-related helpers,
      which provide SR functionalities that can be applied on the packet. This
      BPF hook would then allow to implement specific actions at native kernel
      speed such as OAM features, advanced SR SDN policies, SRv6 actions like
      Segment Routing Header (SRH) encapsulation depending on the content of
      the packet, etc.
      
      This patchset is divided in 6 patches, whose main features are :
      
      - A new seg6local action End.BPF with the corresponding new BPF program
        type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
        passed to the LWT seg6local through netlink, the same way as the LWT
        BPF hook operates.
      - 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
        shrink a SRH and apply on a packet some of the generic SRv6 actions.
      - 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
        encapsulation (via IPv6 encapsulation or inlining if the packet contains
        already an IPv6 header).
      
      As this patchset adds a new LWT BPF hook, I took into account the result
      of the discussions when the LWT BPF infrastructure got merged. Hence, the
      seg6local BPF hook doesn't allow write access to skb->data directly, only
      the SRH can be modified through specific helpers, which ensures that the
      integrity of the packet is maintained. More details are available in the
      related patches messages.
      
      The performances of this BPF hook have been assessed with the BPF JIT
      enabled on an Intel Xeon X3440 processors with 4 cores and 8 threads
      clocked at 2.53 GHz. No throughput losses are noted with the seg6local
      BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
      TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
      drops the throughput to 410kpps, and inlining a SRH via bpf_lwt_seg6_action
      drops the throughput to 420kpps. All throughputs are stable.
      
      Changelog:
      
      v2: move the SRH integrity state from skb->cb to a per-cpu buffer
      v3: - document helpers in man-page style
          - fix kbuild bugs
          - un-break BPF LWT out hook
          - bpf_push_seg6_encap is now static
          - preempt_enable is now called when the packet is dropped in
            input_action_end_bpf
      v4: fix kbuild bugs when CONFIG_IPV6=m
      v5: fix kbuild sparse warnings when CONFIG_IPV6=m
      v6: fix skb pointers-related bugs in helpers
      v7: - fix memory leak in error path of End.BPF setup
          - add freeing of BPF data in seg6_local_destroy_state
          - new enums SEG6_LOCAL_BPF_* instead of re-using ones of lwt bpf for
            netlink nested bpf attributes
          - SEG6_LOCAL_BPF_PROG attr now contains prog->aux->id when dumping
            state
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      31ad3923
    • Mathieu Xhonneux's avatar
      selftests/bpf: test for seg6local End.BPF action · c99a84ea
      Mathieu Xhonneux authored
      Add a new test for the seg6local End.BPF action. The following helpers
      are also tested:
      
      - bpf_lwt_push_encap within the LWT BPF IN hook
      - bpf_lwt_seg6_action
      - bpf_lwt_seg6_adjust_srh
      - bpf_lwt_seg6_store_bytes
      
      A chain of End.BPF actions is built. The SRH is injected through a LWT
      BPF IN hook before entering this chain. Each End.BPF action validates
      the previous one, otherwise the packet is dropped. The test succeeds
      if the last node in the chain receives the packet and the UDP datagram
      contained can be retrieved from userspace.
      Signed-off-by: default avatarMathieu Xhonneux <m.xhonneux@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c99a84ea
    • Mathieu Xhonneux's avatar
      ipv6: sr: Add seg6local action End.BPF · 004d4b27
      Mathieu Xhonneux authored
      This patch adds the End.BPF action to the LWT seg6local infrastructure.
      This action works like any other seg6local End action, meaning that an IPv6
      header with SRH is needed, whose DA has to be equal to the SID of the
      action. It will also advance the SRH to the next segment, the BPF program
      does not have to take care of this.
      
      Since the BPF program may not be a source of instability in the kernel, it
      is important to ensure that the integrity of the packet is maintained
      before yielding it back to the IPv6 layer. The hook hence keeps track if
      the SRH has been altered through the helpers, and re-validates its
      content if needed with seg6_validate_srh. The state kept for validation is
      stored in a per-CPU buffer. The BPF program is not allowed to directly
      write into the packet, and only some fields of the SRH can be altered
      through the helper bpf_lwt_seg6_store_bytes.
      
      Performances profiling has shown that the SRH re-validation does not induce
      a significant overhead. If the altered SRH is deemed as invalid, the packet
      is dropped.
      
      This validation is also done before executing any action through
      bpf_lwt_seg6_action, and will not be performed again if the SRH is not
      modified after calling the action.
      
      The BPF program may return 3 types of return codes:
          - BPF_OK: the End.BPF action will look up the next destination through
                   seg6_lookup_nexthop.
          - BPF_REDIRECT: if an action has been executed through the
                bpf_lwt_seg6_action helper, the BPF program should return this
                value, as the skb's destination is already set and the default
                lookup should not be performed.
          - BPF_DROP : the packet will be dropped.
      Signed-off-by: default avatarMathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: default avatarDavid Lebrun <dlebrun@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      004d4b27
    • Mathieu Xhonneux's avatar
      bpf: Split lwt inout verifier structures · cd3092c7
      Mathieu Xhonneux authored
      The new bpf_lwt_push_encap helper should only be accessible within the
      LWT BPF IN hook, and not the OUT one, as this may lead to a skb under
      panic.
      
      At the moment, both LWT BPF IN and OUT share the same list of helpers,
      whose calls are authorized by the verifier. This patch separates the
      verifier ops for the IN and OUT hooks, and allows the IN hook to call the
      bpf_lwt_push_encap helper.
      
      This patch is also the occasion to put all lwt_*_func_proto functions
      together for clarity. At the moment, socks_op_func_proto is in the middle
      of lwt_inout_func_proto and lwt_xmit_func_proto.
      Signed-off-by: default avatarMathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: default avatarDavid Lebrun <dlebrun@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      cd3092c7
    • Mathieu Xhonneux's avatar
      bpf: Add IPv6 Segment Routing helpers · fe94cc29
      Mathieu Xhonneux authored
      The BPF seg6local hook should be powerful enough to enable users to
      implement most of the use-cases one could think of. After some thinking,
      we figured out that the following actions should be possible on a SRv6
      packet, requiring 3 specific helpers :
          - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
          - bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH
                                     (to add/delete TLVs)
          - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                                 (specifically End.X, End.T, End.B6 and
                                  End.B6.Encap)
      
      The specifications of these helpers are provided in the patch (see
      include/uapi/linux/bpf.h).
      
      The non-sensitive fields of the SRH are the following : flags, tag and
      TLVs. The other fields can not be modified, to maintain the SRH
      integrity. Flags, tag and TLVs can easily be modified as their validity
      can be checked afterwards via seg6_validate_srh. It is not allowed to
      modify the segments directly. If one wants to add segments on the path,
      he should stack a new SRH using the End.B6 action via
      bpf_lwt_seg6_action.
      
      Growing, shrinking or editing TLVs via the helpers will flag the SRH as
      invalid, and it will have to be re-validated before re-entering the IPv6
      layer. This flag is stored in a per-CPU buffer, along with the current
      header length in bytes.
      
      Storing the SRH len in bytes in the control block is mandatory when using
      bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
      len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes
      boundary). When adding/deleting TLVs within the BPF program, the SRH may
      temporary be in an invalid state where its length cannot be rounded to 8
      bytes without remainder, hence the need to store the length in bytes
      separately. The caller of the BPF program can then ensure that the SRH's
      final length is valid using this value. Again, a final SRH modified by a
      BPF program which doesn’t respect the 8-bytes boundary will be discarded
      as it will be considered as invalid.
      
      Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
      available from the LWT BPF IN hook, but not from the seg6local BPF one.
      This helper allows to encapsulate a Segment Routing Header (either with
      a new outer IPv6 header, or by inlining it directly in the existing IPv6
      header) into a non-SRv6 packet. This helper is required if we want to
      offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet,
      as the BPF seg6local hook only works on traffic already containing a SRH.
      This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
      the same purpose but with a static SRH per route.
      
      These helpers require CONFIG_IPV6=y (and not =m).
      Signed-off-by: default avatarMathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: default avatarDavid Lebrun <dlebrun@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      fe94cc29
    • Mathieu Xhonneux's avatar
      ipv6: sr: export function lookup_nexthop · 1c1e761e
      Mathieu Xhonneux authored
      The function lookup_nexthop is essential to implement most of the seg6local
      actions. As we want to provide a BPF helper allowing to apply some of these
      actions on the packet being processed, the helper should be able to call
      this function, hence the need to make it public.
      
      Moreover, if one argument is incorrect or if the next hop can not be found,
      an error should be returned by the BPF helper so the BPF program can adapt
      its processing of the packet (return an error, properly force the drop,
      ...). This patch hence makes this function return dst->error to indicate a
      possible error.
      Signed-off-by: default avatarMathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: default avatarDavid Lebrun <dlebrun@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1c1e761e
    • Mathieu Xhonneux's avatar
      ipv6: sr: make seg6.h includable without IPv6 · 63526e1c
      Mathieu Xhonneux authored
      include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
      not enabled:
         include/net/seg6.h: In function 'seg6_pernet':
      >> include/net/seg6.h:52:14: error: 'struct net' has no member named
                                              'ipv6'; did you mean 'ipv4'?
           return net->ipv6.seg6_data;
                       ^~~~
                       ipv4
      
      This commit makes seg6_pernet return NULL if IPv6 is not compiled, hence
      allowing seg6.h to be included regardless of the configuration.
      Signed-off-by: default avatarMathieu Xhonneux <m.xhonneux@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      63526e1c
    • Daniel Borkmann's avatar
      Merge branch 'bpf-multi-prog-improvements' · 30cfe3b4
      Daniel Borkmann authored
      Sandipan Das says:
      
      ====================
      [1] Support for bpf-to-bpf function calls in the powerpc64 JIT compiler.
      
      [2] Provide a way for resolving function calls because of the way JITed
          images are allocated in powerpc64.
      
      [3] Fix to get JITed instruction dumps for multi-function programs from
          the bpf system call.
      
      [4] Fix for bpftool to show delimited multi-function JITed image dumps.
      
      v4:
       - Incorporate review comments from Jakub.
       - Fix JSON output for bpftool.
      
      v3:
       - Change base tree tag to bpf-next.
       - Incorporate review comments from Alexei, Daniel and Jakub.
       - Make sure that the JITed image does not grow or shrink after
         the last pass due to the way the instruction sequence used
         to load a callee's address maybe optimized.
       - Make additional changes to the bpf system call and bpftool to
         make multi-function JITed dumps easier to correlate.
      
      v2:
       - Incorporate review comments from Jakub.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      30cfe3b4
    • Sandipan Das's avatar
      tools: bpftool: add delimiters to multi-function JITed dumps · f7f62c71
      Sandipan Das authored
      This splits up the contiguous JITed dump obtained via the bpf
      system call into more relatable chunks for each function in
      the program. If the kernel symbols corresponding to these are
      known, they are printed in the header for each JIT image dump
      otherwise the masked start address is printed.
      
      Before applying this patch:
      
        # bpftool prog dump jited id 1
      
           0:	push   %rbp
           1:	mov    %rsp,%rbp
        ...
          70:	leaveq
          71:	retq
          72:	push   %rbp
          73:	mov    %rsp,%rbp
        ...
          dd:	leaveq
          de:	retq
      
        # bpftool -p prog dump jited id 1
      
        [{
                "pc": "0x0",
                "operation": "push",
                "operands": ["%rbp"
                ]
            },{
        ...
            },{
                "pc": "0x71",
                "operation": "retq",
                "operands": [null
                ]
            },{
                "pc": "0x72",
                "operation": "push",
                "operands": ["%rbp"
                ]
            },{
        ...
            },{
                "pc": "0xde",
                "operation": "retq",
                "operands": [null
                ]
            }
        ]
      
      After applying this patch:
      
        # echo 0 > /proc/sys/net/core/bpf_jit_kallsyms
        # bpftool prog dump jited id 1
      
        0xffffffffc02c7000:
           0:	push   %rbp
           1:	mov    %rsp,%rbp
        ...
          70:	leaveq
          71:	retq
      
        0xffffffffc02cf000:
           0:	push   %rbp
           1:	mov    %rsp,%rbp
        ...
          6b:	leaveq
          6c:	retq
      
        # bpftool -p prog dump jited id 1
      
        [{
                "name": "0xffffffffc02c7000",
                "insns": [{
                        "pc": "0x0",
                        "operation": "push",
                        "operands": ["%rbp"
                        ]
                    },{
        ...
                    },{
                        "pc": "0x71",
                        "operation": "retq",
                        "operands": [null
                        ]
                    }
                ]
            },{
                "name": "0xffffffffc02cf000",
                "insns": [{
                        "pc": "0x0",
                        "operation": "push",
                        "operands": ["%rbp"
                        ]
                    },{
        ...
                    },{
                        "pc": "0x6c",
                        "operation": "retq",
                        "operands": [null
                        ]
                    }
                ]
            }
        ]
      
        # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
        # bpftool prog dump jited id 1
      
        bpf_prog_b811aab41a39ad3d_foo:
           0:	push   %rbp
           1:	mov    %rsp,%rbp
        ...
          70:	leaveq
          71:	retq
      
        bpf_prog_cf418ac8b67bebd9_F:
           0:	push   %rbp
           1:	mov    %rsp,%rbp
        ...
          6b:	leaveq
          6c:	retq
      
        # bpftool -p prog dump jited id 1
      
        [{
                "name": "bpf_prog_b811aab41a39ad3d_foo",
                "insns": [{
                        "pc": "0x0",
                        "operation": "push",
                        "operands": ["%rbp"
                        ]
                    },{
        ...
                    },{
                        "pc": "0x71",
                        "operation": "retq",
                        "operands": [null
                        ]
                    }
                ]
            },{
                "name": "bpf_prog_cf418ac8b67bebd9_F",
                "insns": [{
                        "pc": "0x0",
                        "operation": "push",
                        "operands": ["%rbp"
                        ]
                    },{
        ...
                    },{
                        "pc": "0x6c",
                        "operation": "retq",
                        "operands": [null
                        ]
                    }
                ]
            }
        ]
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f7f62c71
    • Sandipan Das's avatar
      tools: bpf: sync bpf uapi header · bd980d43
      Sandipan Das authored
      Syncing the bpf.h uapi header with tools so that struct
      bpf_prog_info has the two new fields for passing on the
      JITed image lengths of each function in a multi-function
      program.
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      bd980d43
    • Sandipan Das's avatar
      bpf: get JITed image lengths of functions via syscall · 815581c1
      Sandipan Das authored
      This adds new two new fields to struct bpf_prog_info. For
      multi-function programs, these fields can be used to pass
      a list of the JITed image lengths of each function for a
      given program to userspace using the bpf system call with
      the BPF_OBJ_GET_INFO_BY_FD command.
      
      This can be used by userspace applications like bpftool
      to split up the contiguous JITed dump, also obtained via
      the system call, into more relatable chunks corresponding
      to each function.
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      815581c1
    • Sandipan Das's avatar
      bpf: fix multi-function JITed dump obtained via syscall · 4d56a76e
      Sandipan Das authored
      Currently, for multi-function programs, we cannot get the JITed
      instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD
      command. Because of this, userspace tools such as bpftool fail
      to identify a multi-function program as being JITed or not.
      
      With the JIT enabled and the test program running, this can be
      verified as follows:
      
        # cat /proc/sys/net/core/bpf_jit_enable
        1
      
      Before applying this patch:
      
        # bpftool prog list
        1: kprobe  name foo  tag b811aab41a39ad3d  gpl
                loaded_at 2018-05-16T11:43:38+0530  uid 0
                xlated 216B  not jited  memlock 65536B
        ...
      
        # bpftool prog dump jited id 1
        no instructions returned
      
      After applying this patch:
      
        # bpftool prog list
        1: kprobe  name foo  tag b811aab41a39ad3d  gpl
                loaded_at 2018-05-16T12:13:01+0530  uid 0
                xlated 216B  jited 308B  memlock 65536B
        ...
      
        # bpftool prog dump jited id 1
           0:   nop
           4:   nop
           8:   mflr    r0
           c:   std     r0,16(r1)
          10:   stdu    r1,-112(r1)
          14:   std     r31,104(r1)
          18:   addi    r31,r1,48
          1c:   li      r3,10
        ...
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4d56a76e
    • Sandipan Das's avatar
      tools: bpftool: resolve calls without using imm field · f84192ee
      Sandipan Das authored
      Currently, we resolve the callee's address for a JITed function
      call by using the imm field of the call instruction as an offset
      from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further
      use this address to get the callee's kernel symbol's name.
      
      For some architectures, such as powerpc64, the imm field is not
      large enough to hold this offset. So, instead of assigning this
      offset to the imm field, the verifier now assigns the subprog
      id. Also, a list of kernel symbol addresses for all the JITed
      functions is provided in the program info. We now use the imm
      field as an index for this list to lookup a callee's symbol's
      address and resolve its name.
      Suggested-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f84192ee
    • Sandipan Das's avatar
      tools: bpf: sync bpf uapi header · dd0c5f07
      Sandipan Das authored
      Syncing the bpf.h uapi header with tools so that struct
      bpf_prog_info has the two new fields for passing on the
      addresses of the kernel symbols corresponding to each
      function in a program.
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      dd0c5f07
    • Sandipan Das's avatar
      bpf: get kernel symbol addresses via syscall · dbecd738
      Sandipan Das authored
      This adds new two new fields to struct bpf_prog_info. For
      multi-function programs, these fields can be used to pass
      a list of kernel symbol addresses for all functions in a
      given program to userspace using the bpf system call with
      the BPF_OBJ_GET_INFO_BY_FD command.
      
      When bpf_jit_kallsyms is enabled, we can get the address
      of the corresponding kernel symbol for a callee function
      and resolve the symbol's name. The address is determined
      by adding the value of the call instruction's imm field
      to __bpf_call_base. This offset gets assigned to the imm
      field by the verifier.
      
      For some architectures, such as powerpc64, the imm field
      is not large enough to hold this offset.
      
      We resolve this by:
      
      [1] Assigning the subprog id to the imm field of a call
          instruction in the verifier instead of the offset of
          the callee's symbol's address from __bpf_call_base.
      
      [2] Determining the address of a callee's corresponding
          symbol by using the imm field as an index for the
          list of kernel symbol addresses now available from
          the program info.
      Suggested-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      dbecd738
    • Sandipan Das's avatar
      bpf: powerpc64: add JIT support for multi-function programs · 8484ce83
      Sandipan Das authored
      This adds support for bpf-to-bpf function calls in the powerpc64
      JIT compiler. The JIT compiler converts the bpf call instructions
      to native branch instructions. After a round of the usual passes,
      the start addresses of the JITed images for the callee functions
      are known. Finally, to fixup the branch target addresses, we need
      to perform an extra pass.
      
      Because of the address range in which JITed images are allocated
      on powerpc64, the offsets of the start addresses of these images
      from __bpf_call_base are as large as 64 bits. So, for a function
      call, we cannot use the imm field of the instruction to determine
      the callee's address. Instead, we use the alternative method of
      getting it from the list of function addresses in the auxiliary
      data of the caller by using the off field as an index.
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      8484ce83
    • Sandipan Das's avatar
      bpf: powerpc64: pad function address loads with NOPs · 4ea69b2f
      Sandipan Das authored
      For multi-function programs, loading the address of a callee
      function to a register requires emitting instructions whose
      count varies from one to five depending on the nature of the
      address.
      
      Since we come to know of the callee's address only before the
      extra pass, the number of instructions required to load this
      address may vary from what was previously generated. This can
      make the JITed image grow or shrink.
      
      To avoid this, we should generate a constant five-instruction
      when loading function addresses by padding the optimized load
      sequence with NOPs.
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4ea69b2f
    • Sandipan Das's avatar
      bpf: support 64-bit offsets for bpf function calls · 2162fed4
      Sandipan Das authored
      The imm field of a bpf instruction is a signed 32-bit integer.
      For JITed bpf-to-bpf function calls, it holds the offset of the
      start address of the callee's JITed image from __bpf_call_base.
      
      For some architectures, such as powerpc64, this offset may be
      as large as 64 bits and cannot be accomodated in the imm field
      without truncation.
      
      We resolve this by:
      
      [1] Additionally using the auxiliary data of each function to
          keep a list of start addresses of the JITed images for all
          functions determined by the verifier.
      
      [2] Retaining the subprog id inside the off field of the call
          instructions and using it to index into the list mentioned
          above and lookup the callee's address.
      
      To make sure that the existing JIT compilers continue to work
      without requiring changes, we keep the imm field as it is.
      Signed-off-by: default avatarSandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      2162fed4
    • Martin KaFai Lau's avatar
      bpf: btf: Avoid variable length array · a2889a4c
      Martin KaFai Lau authored
      Sparse warning:
      kernel/bpf/btf.c:1985:34: warning: Variable length array is used.
      
      This patch directly uses ARRAY_SIZE().
      
      Fixes: f80442a4 ("bpf: btf: Change how section is supported in btf_header")
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a2889a4c
  3. 23 May, 2018 7 commits
    • Sirio Balmelli's avatar
      tools/lib/libbpf.c: fix string format to allow build on arm32 · a1c81810
      Sirio Balmelli authored
      On arm32, 'cd tools/testing/selftests/bpf && make' fails with:
      
      libbpf.c:80:10: error: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘int64_t {aka long long int}’ [-Werror=format=]
         (func)("libbpf: " fmt, ##__VA_ARGS__); \
                ^
      libbpf.c:83:30: note: in expansion of macro ‘__pr’
       #define pr_warning(fmt, ...) __pr(__pr_warning, fmt, ##__VA_ARGS__)
                                    ^~~~
      libbpf.c:1072:3: note: in expansion of macro ‘pr_warning’
         pr_warning("map:%s value_type:%s has BTF type_size:%ld != value_size:%u\n",
      
      To fix, typecast 'key_size' and amend format string.
      Signed-off-by: default avatarSirio Balmelli <sirio@b-ad.ch>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a1c81810
    • Sirio Balmelli's avatar
      selftests/bpf: Makefile fix "missing" headers on build with -idirafter · 167381f3
      Sirio Balmelli authored
      Selftests fail to build on several distros/architectures because of
      	missing headers files.
      
      On a Ubuntu/x86_64 some missing headers are:
      	asm/byteorder.h, asm/socket.h, asm/sockios.h
      
      On a Debian/arm32 build already fails at sys/cdefs.h
      
      In both cases, these already exist in /usr/include/<arch-specific-dir>,
      but Clang does not include these when using '-target bpf' flag,
      since it is no longer compiling against the host architecture.
      
      The solution is to:
      
      - run Clang without '-target bpf' and extract the include chain for the
      current system
      
      - add these to the bpf build with '-idirafter'
      
      The choice of -idirafter is to catch this error without injecting
      unexpected include behavior: if an arch-specific tree is built
      for bpf in the future, this will be correctly found by Clang.
      Signed-off-by: default avatarSirio Balmelli <sirio@b-ad.ch>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      167381f3
    • Daniel Borkmann's avatar
      Merge branch 'btf-uapi-cleanups' · ff4fb475
      Daniel Borkmann authored
      Martin KaFai Lau says:
      
      ====================
      This patch set makes some changes to cleanup the unused
      bits in BTF uapi.  It also makes the btf_header extensible.
      
      Please see individual patches for details.
      
      v2:
      - Remove NR_SECS from patch 2
      - Remove "unsigned" check on array->index_type from patch 3
      - Remove BTF_INT_VARARGS and further limit BTF_INT_ENCODING
        from 8 bits to 4 bits in patch 4
      - Adjustments in test_btf.c to reflect changes in v2
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      ff4fb475
    • Martin KaFai Lau's avatar
      bpf: btf: Add tests for the btf uapi changes · 61746dbe
      Martin KaFai Lau authored
      This patch does the followings:
      1. Modify libbpf and test_btf to reflect the uapi changes in btf
      2. Add test for the btf_header changes
      3. Add tests for array->index_type
      4. Add err_str check to the tests
      5. Fix a 4 bytes hole in "struct test #1" by swapping "m" and "n"
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      61746dbe
    • Martin KaFai Lau's avatar
      bpf: btf: Sync bpf.h and btf.h to tools · f03b15d3
      Martin KaFai Lau authored
      This patch sync the uapi bpf.h and btf.h to tools.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f03b15d3
    • Martin KaFai Lau's avatar
      bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info · 9b2cf328
      Martin KaFai Lau authored
      In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id"
      could cause confusion because the "id" of "btf_id" means the BPF obj id
      given to the BTF object while
      "btf_key_id" and "btf_value_id" means the BTF type id within
      that BTF object.
      
      To make it clear, btf_key_id and btf_value_id are
      renamed to btf_key_type_id and btf_value_type_id.
      Suggested-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      9b2cf328
    • Martin KaFai Lau's avatar
      bpf: btf: Remove unused bits from uapi/linux/btf.h · aea2f7b8
      Martin KaFai Lau authored
      This patch does the followings:
      1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k.  We can
         raise it later.
      
      2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID.  They are
         currently encoded at the highest bit of a u32.
         It is because the current use case does not require supporting
         parent type (i.e type_id referring to a type in another BTF file).
         It also does not support referring to a string in ELF.
      
         The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced
         by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are
         defined in btf.c instead of uapi/linux/btf.h.
      
      3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough.
         There is unused bits headroom if we ever needed it later.
      
      4. The root bit in BTF_INFO is also removed because it is not
         used in the current use case.
      
      5. Remove BTF_INT_VARARGS since func type is not supported now.
         The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits.
      
      The above can be added back later because the verifier
      ensures the unused bits are zeros.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      aea2f7b8