1. 03 May, 2018 16 commits
    • Magnus Karlsson's avatar
      samples/bpf: sample application and documentation for AF_XDP sockets · b4b8faa1
      Magnus Karlsson authored
      This is a sample application for AF_XDP sockets. The application
      supports three different modes of operation: rxdrop, txonly and l2fwd.
      
      To show-case a simple round-robin load-balancing between a set of
      sockets in an xskmap, set the RR_LB compile time define option to 1 in
      "xdpsock.h".
      
      v2: The entries variable was calculated twice in {umem,xq}_nb_avail.
      Co-authored-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b4b8faa1
    • Magnus Karlsson's avatar
      xsk: statistics support · af75d9e0
      Magnus Karlsson authored
      In this commit, a new getsockopt is added: XDP_STATISTICS. This is
      used to obtain stats from the sockets.
      
      v2: getsockopt now returns size of stats structure.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      af75d9e0
    • Magnus Karlsson's avatar
      xsk: support for Tx · 35fcde7f
      Magnus Karlsson authored
      Here, Tx support is added. The user fills the Tx queue with frames to
      be sent by the kernel, and let's the kernel know using the sendmsg
      syscall.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      35fcde7f
    • Magnus Karlsson's avatar
      dev: packet: make packet_direct_xmit a common function · 865b03f2
      Magnus Karlsson authored
      The new dev_direct_xmit will be used by AF_XDP in later commits.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      865b03f2
    • Magnus Karlsson's avatar
      xsk: add Tx queue setup and mmap support · f6145903
      Magnus Karlsson authored
      Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
      a queue, where the user process can pass frames to be transmitted by
      the kernel.
      
      The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f6145903
    • Magnus Karlsson's avatar
      xsk: add umem completion queue support and mmap · fe230832
      Magnus Karlsson authored
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the
      process can ask the kernel to allocate a queue (ring buffer) and also
      mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      kernel to user process. This will be used by the TX path to tell user
      space that a certain frame has been transmitted and user space can use
      it for something else, if it wishes.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      fe230832
    • Björn Töpel's avatar
      xsk: wire up XDP_SKB side of AF_XDP · 02671e23
      Björn Töpel authored
      This commit wires up the xskmap to XDP_SKB layer.
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      02671e23
    • Björn Töpel's avatar
      xsk: wire up XDP_DRV side of AF_XDP · 1b1a251c
      Björn Töpel authored
      This commit wires up the xskmap to XDP_DRV layer.
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1b1a251c
    • Björn Töpel's avatar
      bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP · fbfc504a
      Björn Töpel authored
      The xskmap is yet another BPF map, very much inspired by
      dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
      adds AF_XDP sockets into the map, and by using the bpf_redirect_map
      helper, an XDP program can redirect XDP frames to an AF_XDP socket.
      
      Note that a socket that is bound to certain ifindex/queue index will
      *only* accept XDP frames from that netdev/queue index. If an XDP
      program tries to redirect from a netdev/queue index other than what
      the socket is bound to, the frame will not be received on the socket.
      
      A socket can reside in multiple maps.
      
      v3: Fixed race and simplified code.
      v2: Removed one indirection in map lookup.
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      fbfc504a
    • Björn Töpel's avatar
      xsk: add Rx receive functions and poll support · c497176c
      Björn Töpel authored
      Here the actual receive functions of AF_XDP are implemented, that in a
      later commit, will be called from the XDP layers.
      
      There's one set of functions for the XDP_DRV side and another for
      XDP_SKB (generic).
      
      A new XDP API, xdp_return_buff, is also introduced.
      
      Adding xdp_return_buff, which is analogous to xdp_return_frame, but
      acts upon an struct xdp_buff. The API will be used by AF_XDP in future
      commits.
      
      Support for the poll syscall is also implemented.
      
      v2: xskq_validate_id did not update cons_tail.
          The entries variable was calculated twice in xskq_nb_avail.
          Squashed xdp_return_buff commit.
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c497176c
    • Magnus Karlsson's avatar
      xsk: add support for bind for Rx · 965a9909
      Magnus Karlsson authored
      Here, the bind syscall is added. Binding an AF_XDP socket, means
      associating the socket to an umem, a netdev and a queue index. This
      can be done in two ways.
      
      The first way, creating a "socket from scratch". Create the umem using
      the XDP_UMEM_REG setsockopt and an associated fill queue with
      XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
      setsockopt. Call bind passing ifindex and queue index ("channel" in
      ethtool speak).
      
      The second way to bind a socket, is simply skipping the
      umem/netdev/queue index, and passing another already setup AF_XDP
      socket. The new socket will then have the same umem/netdev/queue index
      as the parent so it will share the same umem. You must also set the
      flags field in the socket address to XDP_SHARED_UMEM.
      
      v2: Use PTR_ERR instead of passing error variable explicitly.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      965a9909
    • Björn Töpel's avatar
      xsk: add Rx queue setup and mmap support · b9b6b68e
      Björn Töpel authored
      Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
      a queue, where the kernel can pass completed Rx frames from the kernel
      to user process.
      
      The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b9b6b68e
    • Magnus Karlsson's avatar
      xsk: add umem fill queue support and mmap · 423f3832
      Magnus Karlsson authored
      Here, we add another setsockopt for registered user memory (umem)
      called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
      ask the kernel to allocate a queue (ring buffer) and also mmap it
      (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
      
      The queue is used to explicitly pass ownership of umem frames from the
      user process to the kernel. These frames will in a later patch be
      filled in with Rx packet data by the kernel.
      
      v2: Fixed potential crash in xsk_mmap.
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      423f3832
    • Björn Töpel's avatar
      xsk: add user memory registration support sockopt · c0c77d8f
      Björn Töpel authored
      In this commit the base structure of the AF_XDP address family is set
      up. Further, we introduce the abilty register a window of user memory
      to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
      window is viewed by an AF_XDP socket as a set of equally large
      frames. After a user memory registration all frames are "owned" by the
      user application, and not the kernel.
      
      v2: More robust checks on umem creation and unaccount on error.
          Call set_page_dirty_lock on cleanup.
          Simplified xdp_umem_reg.
      Co-authored-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c0c77d8f
    • Björn Töpel's avatar
      net: initial AF_XDP skeleton · 68e8b849
      Björn Töpel authored
      Buildable skeleton of AF_XDP without any functionality. Just what it
      takes to register a new address family.
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      68e8b849
    • Wang YanQing's avatar
      bpf, x86_32: add eBPF JIT compiler for ia32 · 03f5781b
      Wang YanQing authored
      The JIT compiler emits ia32 bit instructions. Currently, It supports eBPF
      only. Classic BPF is supported because of the conversion by BPF core.
      
      Almost all instructions from eBPF ISA supported except the following:
      BPF_ALU64 | BPF_DIV | BPF_K
      BPF_ALU64 | BPF_DIV | BPF_X
      BPF_ALU64 | BPF_MOD | BPF_K
      BPF_ALU64 | BPF_MOD | BPF_X
      BPF_STX | BPF_XADD | BPF_W
      BPF_STX | BPF_XADD | BPF_DW
      
      It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
      
      IA32 has few general purpose registers, EAX|EDX|ECX|EBX|ESI|EDI. I use
      EAX|EDX|ECX|EBX as temporary registers to simulate instructions in eBPF
      ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding, all others
      eBPF registers, R0-R10, are simulated through scratch space on stack.
      
      The reasons behind the hardware registers allocation policy are:
      1:MUL need EAX:EDX, shift operation need ECX, so they aren't fit
        for general eBPF 64bit register simulation.
      2:We need at least 4 registers to simulate most eBPF ISA operations
        on registers operands instead of on register&memory operands.
      3:We need to put BPF_REG_AX on hardware registers, or constant blinding
        will degrade jit performance heavily.
      
      Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
      Testing results on i5-5200U:
      1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
      2) test_progs: Summary: 83 PASSED, 0 FAILED.
      3) test_lpm: OK
      4) test_lru_map: OK
      5) test_verifier: Summary: 828 PASSED, 0 FAILED.
      
      Above tests are all done in following two conditions separately:
      1:bpf_jit_enable=1 and bpf_jit_harden=0
      2:bpf_jit_enable=1 and bpf_jit_harden=2
      
      Below are some numbers for this jit implementation:
      Note:
        I run test_progs in kselftest 100 times continuously for every condition,
        the numbers are in format: total/times=avg.
        The numbers that test_bpf reports show almost the same relation.
      
      a:jit_enable=0 and jit_harden=0            b:jit_enable=1 and jit_harden=0
        test_pkt_access:PASS:ipv4:15622/100=156    test_pkt_access:PASS:ipv4:10674/100=106
        test_pkt_access:PASS:ipv6:9130/100=91      test_pkt_access:PASS:ipv6:4855/100=48
        test_xdp:PASS:ipv4:240198/100=2401         test_xdp:PASS:ipv4:138912/100=1389
        test_xdp:PASS:ipv6:137326/100=1373         test_xdp:PASS:ipv6:68542/100=685
        test_l4lb:PASS:ipv4:61100/100=611          test_l4lb:PASS:ipv4:37302/100=373
        test_l4lb:PASS:ipv6:101000/100=1010        test_l4lb:PASS:ipv6:55030/100=550
      
      c:jit_enable=1 and jit_harden=2
        test_pkt_access:PASS:ipv4:10558/100=105
        test_pkt_access:PASS:ipv6:5092/100=50
        test_xdp:PASS:ipv4:131902/100=1319
        test_xdp:PASS:ipv6:77932/100=779
        test_l4lb:PASS:ipv4:38924/100=389
        test_l4lb:PASS:ipv6:57520/100=575
      
      The numbers show we get 30%~50% improvement.
      
      See Documentation/networking/filter.txt for more information.
      
      Changelog:
      
       Changes v5-v6:
       1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
         consistence reason.
       2:Clean up non-standard comments, reported by Daniel Borkmann.
       3:Fix a memory leak issue, repoted by Daniel Borkmann.
      
       Changes v4-v5:
       1:Delete is_on_stack, BPF_REG_AX is the only one
         on real hardware registers, so just check with
         it.
       2:Apply commit 1612a981 ("bpf, x64: fix JIT emission
         for dead code"), suggested by Daniel Borkmann.
      
       Changes v3-v4:
       1:Fix changelog in commit.
         I install llvm-6.0, then test_progs willn't report errors.
         I submit another patch:
         "bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
         to fix another problem, after that patch, test_verifier willn't report errors too.
       2:Fix clear r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation.
      
       Changes v2-v3:
       1:Move BPF_REG_AX to real hardware registers for performance reason.
       3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
       4:Delete partial codes in 1c2a088a, suggested by Daniel Borkmann.
       5:Some bug fixes and comments improvement.
      
       Changes v1-v2:
       1:Fix bug in emit_ia32_neg64.
       2:Fix bug in emit_ia32_arsh_r64.
       3:Delete filename in top level comment, suggested by Thomas Gleixner.
       4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
       5:Rewrite some words in changelog.
       6:CodingSytle improvement and a little more comments.
      Signed-off-by: default avatarWang YanQing <udknight@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      03f5781b
  2. 02 May, 2018 2 commits
    • Quentin Monnet's avatar
      bpf: relax constraints on formatting for eBPF helper documentation · 6f96674d
      Quentin Monnet authored
      The Python script used to parse and extract eBPF helpers documentation
      from include/uapi/linux/bpf.h expects a very specific formatting for the
      descriptions (single dot represents a space, '>' stands for a tab):
      
          /*
           ...
           *.int bpf_helper(list of arguments)
           *.>    Description
           *.>    >       Start of description
           *.>    >       Another line of description
           *.>    >       And yet another line of description
           *.>    Return
           *.>    >       0 on success, or a negative error in case of failure
           ...
           */
      
      This is too strict, and painful for developers who wants to add
      documentation for new helpers. Worse, it is extremely difficult to check
      that the formatting is correct during reviews. Change the format
      expected by the script and make it more flexible. The script now works
      whether or not the initial space (right after the star) is present, and
      accepts both tabs and white spaces (or a combination of both) for
      indenting description sections and contents.
      
      Concretely, something like the following would now be supported:
      
          /*
           ...
           *int bpf_helper(list of arguments)
           *......Description
           *.>    >       Start of description...
           *>     >       Another line of description
           *..............And yet another line of description
           *>     Return
           *.>    ........0 on success, or a negative error in case of failure
           ...
           */
      
      While at it, remove unnecessary carets from each regex used with match()
      in the script. They are redundant, as match() tries to match from the
      beginning of the string by default.
      
      v2: Remove unnecessary caret when a regex is used with match().
      Signed-off-by: default avatarQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      6f96674d
    • Ingo Molnar's avatar
      x86/bpf: Clean up non-standard comments, to make the code more readable · a2c7a983
      Ingo Molnar authored
      So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and
      noticed the weird and inconsistent comment style it mistakenly learned from
      the networking code:
      
       /* Multi-line comment ...
        * ... looks like this.
        */
      
      Fix this to use the standard comment style specified in Documentation/CodingStyle
      and used in arch/x86/ as well:
      
       /*
        * Multi-line comment ...
        * ... looks like this.
        */
      
      Also, to quote Linus's ... more explicit views about this:
      
        http://article.gmane.org/gmane.linux.kernel.cryptoapi/21066
      
        > But no, the networking code picked *none* of the above sane formats.
        > Instead, it picked these two models that are just half-arsed
        > shit-for-brains:
        >
        >  (no)
        >      /* This is disgusting drug-induced
        >        * crap, and should die
        >        */
        >
        >   (no-no-no)
        >       /* This is also very nasty
        >        * and visually unbalanced */
        >
        > Please. The networking code actually has the *worst* possible comment
        > style. You can literally find that (no-no-no) style, which is just
        > really horribly disgusting and worse than the otherwise fairly similar
        > (d) in pretty much every way.
      
      Also improve the comments and some other details while at it:
      
       - Don't mix same-line and previous-line comment style on otherwise
         identical code patterns within the same function,
      
       - capitalize 'BPF' and x86 register names consistently,
      
       - capitalize sentences consistently,
      
       - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term,
      
       - use more consistent punctuation,
      
       - use standard coding style in macros as well,
      
       - fix typos and a few other minor details.
      
      Consistent coding style is not optional, at least in arch/x86/.
      
      No change in functionality.
      
      ( In case this commit causes conflicts with pending development code
        I'll be glad to help resolve any conflicts! )
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a2c7a983
  3. 01 May, 2018 1 commit
    • Quentin Monnet's avatar
      tools: bpftool: change time format for program 'loaded at:' information · a3fe1f6f
      Quentin Monnet authored
      To make eBPF program load time easier to parse from "bpftool prog"
      output for machines, change the time format used by the program. The
      format now differs for plain and JSON version:
      
      - Plain version uses a string formatted according to ISO 8601.
      - JSON uses the number of seconds since the Epoch, wich is less friendly
        for humans but even easier to process.
      
      Example output:
      
          # ./bpftool prog
          41298: xdp  tag a04f5eef06a7f555 dev foo
                  loaded_at 2018-04-18T17:19:47+0100  uid 0
                  xlated 16B  not jited  memlock 4096B
      
          # ./bpftool prog -p
          [{
                  "id": 41298,
                  "type": "xdp",
                  "tag": "a04f5eef06a7f555",
                  "gpl_compatible": false,
                  "dev": {
                      "ifindex": 14,
                      "ns_dev": 3,
                      "ns_inode": 4026531993,
                      "ifname": "foo"
                  },
                  "loaded_at": 1524068387,
                  "uid": 0,
                  "bytes_xlated": 16,
                  "jited": false,
                  "bytes_memlock": 4096
              }
          ]
      
      Previously, "Apr 18/17:19" would be used at both places.
      Suggested-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarQuentin Monnet <quentin.monnet@netronome.com>
      Acked-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a3fe1f6f
  4. 30 Apr, 2018 7 commits
  5. 29 Apr, 2018 14 commits
    • Teng Qin's avatar
      bpf: Allow bpf_current_task_under_cgroup in interrupt · 7ef37712
      Teng Qin authored
      Currently, the bpf_current_task_under_cgroup helper has a check where if
      the BPF program is running in_interrupt(), it will return -EINVAL. This
      prevents the helper to be used in many useful scenarios, particularly
      BPF programs attached to Perf Events.
      
      This commit removes the check. Tested a few NMI (Perf Event) and some
      softirq context, the helper returns the correct result.
      Signed-off-by: default avatarTeng Qin <qinteng@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7ef37712
    • Alexei Starovoitov's avatar
      Merge branch 'fix-bpf-helpers-doc' · fcf85729
      Alexei Starovoitov authored
      Andrey Ignatov says:
      
      ====================
      BPF helpers documentation in UAPI refers to kernel ctx structures when it
      has to refer to user visible ones. Fix it.
      ====================
      Reviewed-by: default avatarQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      fcf85729
    • Andrey Ignatov's avatar
      bpf: Sync bpf.h to tools/ · 96871b9f
      Andrey Ignatov authored
      The patch syncs bpf.h to tools/.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      96871b9f
    • Andrey Ignatov's avatar
      bpf: Fix helpers ctx struct types in uapi doc · a3ef8e9a
      Andrey Ignatov authored
      Helpers may operate on two types of ctx structures: user visible ones
      (e.g. `struct bpf_sock_ops`) when used in user programs, and kernel ones
      (e.g. `struct bpf_sock_ops_kern`) in kernel implementation.
      
      UAPI documentation must refer to only user visible structures.
      
      The patch replaces references to `_kern` structures in BPF helpers
      description by corresponding user visible structures.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a3ef8e9a
    • Alexei Starovoitov's avatar
      Merge branch 'bpf_get_stack' · f60ad0a0
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Currently, stackmap and bpf_get_stackid helper are provided
      for bpf program to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table regardless of
      whether BPF_F_REUSE_STACKID is specified or not,
      so some stack traces may be missing from user perspective.
      
      This patch implements a new helper, bpf_get_stack, will
      send stack traces directly to bpf program. The bpf program
      is able to see all stack traces, and then can do in-kernel
      processing or send stack traces to user space through
      shared map or bpf_perf_event_output.
      
      Patches #1 and #2 implemented the core kernel support.
      Patch #3 removes two never-hit branches in verifier.
      Patches #4 and #5 are two verifier improves to make
      bpf programming easier. Patch #6 synced the new helper
      to tools headers. Patch #7 moved perf_event polling code
      and ksym lookup code from samples/bpf to
      tools/testing/selftests/bpf. Patch #8 added a verifier
      test in tools/bpf for new verifier change.
      Patches #9 and #10 added tests for raw tracepoint prog
      and tracepoint prog respectively.
      
      Changelogs:
        v8 -> v9:
          . make function perf_event_mmap (in trace_helpers.c) extern
            to decouple perf_event_mmap and perf_event_poller.
          . add jit enabled handling for kernel stack verification
            in Patch #9. Since we did not have a good way to
            verify jit enabled kernel stack, just return true if
            the kernel stack is not empty.
          . In path #9, using raw_syscalls/sys_enter instead of
            sched/sched_switch, removed calling cmd
            "task 1 dd if=/dev/zero of=/dev/null" which is left
            with dangling process after the program exited.
        v7 -> v8:
          . rebase on top of latest bpf-next
          . simplify BPF_ARSH dst_reg->smin_val/smax_value tracking
          . rewrite the description of bpf_get_stack() in uapi bpf.h
            based on new format.
        v6 -> v7:
          . do perf callchain buffer allocation inside the
            verifier. so if the prog->has_callchain_buf is set,
            it is guaranteed that the buffer has been allocated.
          . change condition "trace_nr <= skip" to "trace_nr < skip"
            so that for zero size buffer, return 0 instead of -EFAULT
        v5 -> v6:
          . after refining return register smax_value and umax_value
            for helpers bpf_get_stack and bpf_probe_read_str,
            bounds and var_off of the return register are further refined.
          . added missing commit message for tools header sync commit.
          . removed one unnecessary empty line.
        v4 -> v5:
          . relied on dst_reg->var_off to refine umin_val/umax_val
            in verifier handling BPF_ARSH value range tracking,
            suggested by Edward.
        v3 -> v4:
          . fixed a bug when meta ptr is set to NULL in check_func_arg.
          . introduced tnum_arshift and added detailed comments for
            the underlying implementation
          . avoided using VLA in tools/bpf test_progs.
        v2 -> v3:
          . used meta to track helper memory size argument
          . implemented range checking for ARSH in verifier
          . moved perf event polling and ksym related functions
            from samples/bpf to tools/bpf
          . added test to compare build id's between bpf_get_stackid
            and bpf_get_stack
        v1 -> v2:
          . fixed compilation error when CONFIG_PERF_EVENTS is not enabled
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f60ad0a0
    • Yonghong Song's avatar
      tools/bpf: add a test for bpf_get_stack with tracepoint prog · 79b45350
      Yonghong Song authored
      The test_stacktrace_map and test_stacktrace_build_id are
      enhanced to call bpf_get_stack in the helper to get the
      stack trace as well.  The stack traces from bpf_get_stack
      and bpf_get_stackid are compared to ensure that for the
      same stack as represented as the same hash, their ip addresses
      or build id's must be the same.
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      79b45350
    • Yonghong Song's avatar
      tools/bpf: add a test for bpf_get_stack with raw tracepoint prog · 173965fb
      Yonghong Song authored
      The test attached a raw_tracepoint program to raw_syscalls/sys_enter.
      It tested to get stack for user space, kernel space and user
      space with build_id request. It also tested to get user
      and kernel stack into the same buffer with back-to-back
      bpf_get_stack helper calls.
      
      If jit is not enabled, the user space application will check
      to ensure that the kernel function for raw_tracepoint
      ___bpf_prog_run is part of the stack.
      
      If jit is enabled, we did not have a reliable way to
      verify the kernel stack, so just assume the kernel stack
      is good when the kernel stack size is greater than 0.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      173965fb
    • Yonghong Song's avatar
      tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH · 2abe611c
      Yonghong Song authored
      The test_verifier already has a few ARSH test cases.
      This patch adds a new test case which takes advantage of newly
      improved verifier behavior for bpf_get_stack and ARSH.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      2abe611c
    • Yonghong Song's avatar
      samples/bpf: move common-purpose trace functions to selftests · 28dbf861
      Yonghong Song authored
      There is no functionality change in this patch. The common-purpose
      trace functions, including perf_event polling and ksym lookup,
      are moved from trace_output_user.c and bpf_load.c to
      selftests/bpf/trace_helpers.c so that these function can
      be reused later in selftests.
      Acked-by: default avatarAlexei Starovoitov <ast@fb.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      28dbf861
    • Yonghong Song's avatar
      tools/bpf: add bpf_get_stack helper to tools headers · de2ff05f
      Yonghong Song authored
      The tools header file bpf.h is synced with kernel uapi bpf.h.
      The new helper is also added to bpf_helpers.h.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      de2ff05f
    • Yonghong Song's avatar
      bpf/verifier: improve register value range tracking with ARSH · 9cbe1f5a
      Yonghong Song authored
      When helpers like bpf_get_stack returns an int value
      and later on used for arithmetic computation, the LSH and ARSH
      operations are often required to get proper sign extension into
      64-bit. For example, without this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0)
      With this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
      With better range of "R8", later on when "R8" is added to other register,
      e.g., a map pointer or scalar-value register, the better register
      range can be derived and verifier failure may be avoided.
      
      In our later example,
          ......
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0)
              return 0;
          ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
          ......
      Without improving ARSH value range tracking, the register representing
      "max_len - usize" will have smin_value equal to S64_MIN and will be
      rejected by verifier.
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      9cbe1f5a
    • Yonghong Song's avatar
      bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals · afbe1a5b
      Yonghong Song authored
      In verifier function adjust_scalar_min_max_vals,
      when src_known is false and the opcode is BPF_LSH/BPF_RSH,
      early return will happen in the function. So remove
      the branch in handling BPF_LSH/BPF_RSH when src_known is false.
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      afbe1a5b
    • Yonghong Song's avatar
      bpf/verifier: refine retval R0 state for bpf_get_stack helper · 849fa506
      Yonghong Song authored
      The special property of return values for helpers bpf_get_stack
      and bpf_probe_read_str are captured in verifier.
      Both helpers return a negative error code or
      a length, which is equal to or smaller than the buffer
      size argument. This additional information in the
      verifier can avoid the condition such as "retval > bufsize"
      in the bpf program. For example, for the code blow,
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0 || usize > max_len)
              return 0;
      The verifier may have the following errors:
          52: (85) call bpf_get_stack#65
           R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
           R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R9_w=inv800 R10=fp0,call_-1
          53: (bf) r8 = r0
          54: (bf) r1 = r8
          55: (67) r1 <<= 32
          56: (bf) r2 = r1
          57: (77) r2 >>= 32
          58: (25) if r2 > 0x31f goto pc+33
           R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
                               umax_value=18446744069414584320,
                               var_off=(0x0; 0xffffffff00000000))
           R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R8=inv(id=0) R9=inv800 R10=fp0,call_-1
          59: (1f) r9 -= r8
          60: (c7) r1 s>>= 32
          61: (bf) r2 = r7
          62: (0f) r2 += r1
          math between map_value pointer and register with unbounded
          min value is not allowed
      The failure is due to llvm compiler optimization where register "r2",
      which is a copy of "r1", is tested for condition while later on "r1"
      is used for map_ptr operation. The verifier is not able to track such
      inst sequence effectively.
      
      Without the "usize > max_len" condition, there is no llvm optimization
      and the below generated code passed verifier:
          52: (85) call bpf_get_stack#65
           R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
           R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R9_w=inv800 R10=fp0,call_-1
          53: (b7) r1 = 0
          54: (bf) r8 = r0
          55: (67) r8 <<= 32
          56: (c7) r8 s>>= 32
          57: (6d) if r1 s> r8 goto pc+24
           R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
           R1=inv0 R6=ctx(id=0,off=0,imm=0)
           R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
           R10=fp0,call_-1
          58: (bf) r2 = r7
          59: (0f) r2 += r8
          60: (1f) r9 -= r8
          61: (bf) r1 = r6
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      849fa506
    • Yonghong Song's avatar
      bpf: add bpf_get_stack helper · c195651e
      Yonghong Song authored
      Currently, stackmap and bpf_get_stackid helper are provided
      for bpf program to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table,
      so some stack traces are missing from user perspective.
      
      This patch implements a new helper, bpf_get_stack, will
      send stack traces directly to bpf program. The bpf program
      is able to see all stack traces, and then can do in-kernel
      processing or send stack traces to user space through
      shared map or bpf_perf_event_output.
      Acked-by: default avatarAlexei Starovoitov <ast@fb.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      c195651e