1. 29 Apr, 2018 8 commits
    • tools/bpf: add a verifier test case for bpf_get_stack helper and ARSH · 2abe611c
      Yonghong Song authored
      The test_verifier already has a few ARSH test cases.
      This patch adds a new test case which takes advantage of newly
      improved verifier behavior for bpf_get_stack and ARSH.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • samples/bpf: move common-purpose trace functions to selftests · 28dbf861
      Yonghong Song authored
      There is no functional change in this patch. The common-purpose
      trace functions, including perf_event polling and ksym lookup,
      are moved from trace_output_user.c and bpf_load.c to
      selftests/bpf/trace_helpers.c so that these functions can
      be reused later in selftests.
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tools/bpf: add bpf_get_stack helper to tools headers · de2ff05f
      Yonghong Song authored
      The tools header file bpf.h is synced with kernel uapi bpf.h.
      The new helper is also added to bpf_helpers.h.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf/verifier: improve register value range tracking with ARSH · 9cbe1f5a
      Yonghong Song authored
      When a helper like bpf_get_stack returns an int value that is
      later used in arithmetic computation, the LSH and ARSH
      operations are often required to sign-extend it properly to
      64 bits. For example, without this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0)
      With this patch:
          54: R0=inv(id=0,umax_value=800)
          54: (bf) r8 = r0
          55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800)
          55: (67) r8 <<= 32
          56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000))
          56: (c7) r8 s>>= 32
          57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff))
      With a better range for "R8", when "R8" is later added to another register,
      e.g., a map pointer or a scalar-value register, a better register
      range can be derived and verifier failure may be avoided.
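The shift pair in the traces above can be modelled outside the kernel. The following Python sketch is an illustration only, not verifier code: the function names (`to_signed64`, `lsh64`, `arsh64`, `sign_extend_32`) are made up for this example, and a register is modelled as an unsigned 64-bit integer.

```python
MASK64 = (1 << 64) - 1


def to_signed64(x):
    """Interpret a 64-bit register value as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def lsh64(x, n):
    """Model of 'r <<= n' on a 64-bit register."""
    return (x << n) & MASK64


def arsh64(x, n):
    """Model of 'r s>>= n': arithmetic shift that replicates the sign bit."""
    return (to_signed64(x) >> n) & MASK64


def sign_extend_32(r0):
    """Model of 'r8 = r0; r8 <<= 32; r8 s>>= 32' from the trace."""
    return to_signed64(arsh64(lsh64(r0, 32), 32))


if __name__ == "__main__":
    print(lsh64(800, 32))              # upper bound after 'r8 <<= 32'
    print(sign_extend_32(800))         # a non-negative length
    print(sign_extend_32(0xFFFFFFF2))  # a negative 32-bit error code
```

In this model every value in [0, 800] survives the round trip unchanged, which matches the bound `umax_value=800` that the patched verifier keeps for R8 after the ARSH.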
      
      In our later example,
          ......
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0)
              return 0;
          ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
          ......
      Without improved ARSH value range tracking, the register representing
      "max_len - usize" will have smin_value equal to S64_MIN and will be
      rejected by the verifier.
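Why an unbounded "usize" poisons "max_len - usize" can be seen with a small interval calculation. The sketch below is a simplified model of signed range tracking written for this page, not the verifier's implementation; `sub_signed_bounds` is a hypothetical name, and the value 800 for max_len is an assumption taken from the trace above.

```python
S64_MIN = -(1 << 63)
S64_MAX = (1 << 63) - 1


def sub_signed_bounds(dst, src):
    """Signed bounds of 'dst -= src'; each argument is a (smin, smax) pair.

    If either extreme could leave the signed 64-bit range, give up and
    report the result as unbounded.
    """
    lo = dst[0] - src[1]
    hi = dst[1] - src[0]
    if lo < S64_MIN or hi > S64_MAX:
        return (S64_MIN, S64_MAX)
    return (lo, hi)


if __name__ == "__main__":
    max_len = (800, 800)
    # usize known to lie in [0, 800]: the difference stays in [0, 800].
    print(sub_signed_bounds(max_len, (0, 800)))
    # usize unknown: the difference becomes unbounded, smin is S64_MIN.
    print(sub_signed_bounds(max_len, (S64_MIN, S64_MAX)))
```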
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: remove never-hit branches in verifier adjust_scalar_min_max_vals · afbe1a5b
      Yonghong Song authored
      In the verifier function adjust_scalar_min_max_vals,
      when src_known is false and the opcode is BPF_LSH or BPF_RSH,
      the function returns early. So remove the branches
      that handle BPF_LSH/BPF_RSH when src_known is false.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf/verifier: refine retval R0 state for bpf_get_stack helper · 849fa506
      Yonghong Song authored
      The special properties of the return values of helpers bpf_get_stack
      and bpf_probe_read_str are captured in the verifier.
      Both helpers return a negative error code or
      a length that is less than or equal to the buffer
      size argument. With this additional information, the
      verifier makes conditions such as "retval > bufsize"
      unnecessary in the bpf program. For example, for the code below,
          usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
          if (usize < 0 || usize > max_len)
              return 0;
      the verifier may report the following error:
          52: (85) call bpf_get_stack#65
           R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
           R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R9_w=inv800 R10=fp0,call_-1
          53: (bf) r8 = r0
          54: (bf) r1 = r8
          55: (67) r1 <<= 32
          56: (bf) r2 = r1
          57: (77) r2 >>= 32
          58: (25) if r2 > 0x31f goto pc+33
           R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512,
                               umax_value=18446744069414584320,
                               var_off=(0x0; 0xffffffff00000000))
           R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff))
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R8=inv(id=0) R9=inv800 R10=fp0,call_-1
          59: (1f) r9 -= r8
          60: (c7) r1 s>>= 32
          61: (bf) r2 = r7
          62: (0f) r2 += r1
          math between map_value pointer and register with unbounded
          min value is not allowed
      The failure is due to an llvm compiler optimization where register "r2",
      which is a copy of "r1", is tested for the condition while later on "r1"
      is used for the map_ptr operation. The verifier is not able to track such
      an instruction sequence effectively.
      
      Without the "usize > max_len" condition, there is no llvm optimization
      and the generated code below passes the verifier:
          52: (85) call bpf_get_stack#65
           R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0)
           R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256
           R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R9_w=inv800 R10=fp0,call_-1
          53: (b7) r1 = 0
          54: (bf) r8 = r0
          55: (67) r8 <<= 32
          56: (c7) r8 s>>= 32
          57: (6d) if r1 s> r8 goto pc+24
           R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
           R1=inv0 R6=ctx(id=0,off=0,imm=0)
           R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
           R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800
           R10=fp0,call_-1
          58: (bf) r2 = r7
          59: (0f) r2 += r8
          60: (1f) r9 -= r8
          61: (bf) r1 = r6
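The reasoning behind dropping the "usize > max_len" test can be sketched as a toy range model. This is an assumption-laden illustration, not the verifier's logic: `retval_bounds` and `after_nonneg_check` are hypothetical names, only signed bounds are tracked, and the buffer size 800 is taken from the trace.

```python
S64_MIN = -(1 << 63)
S64_MAX = (1 << 63) - 1


def retval_bounds(bufsize, refined):
    """Signed (smin, smax) assumed for R0 right after the helper call."""
    smax = bufsize if refined else S64_MAX
    return (S64_MIN, smax)


def after_nonneg_check(bounds):
    """Bounds on the fall-through path of 'if (usize < 0) return 0;'."""
    smin, smax = bounds
    return (max(smin, 0), smax)


if __name__ == "__main__":
    # With the refinement, the single 'usize < 0' test already bounds usize.
    print(after_nonneg_check(retval_bounds(800, refined=True)))
    # Without it, the upper bound is still unknown after the same test.
    print(after_nonneg_check(retval_bounds(800, refined=False)))
```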
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add bpf_get_stack helper · c195651e
      Yonghong Song authored
      Currently, stackmap and the bpf_get_stackid helper are provided
      for a bpf program to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table,
      so some stack traces are missing from the user's perspective.
      
      This patch implements a new helper, bpf_get_stack, which
      sends stack traces directly to the bpf program. The bpf program
      is able to see all stack traces, and can then do in-kernel
      processing or send stack traces to user space through
      a shared map or bpf_perf_event_output.
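The limitation described above can be illustrated with a toy model. Everything in this sketch is hypothetical: the stand-in hash, the bucket count and both function names are invented for the example, and the real stackmap has collision handling that is not modelled here.

```python
def record_with_stackmap(traces, nbuckets):
    """Toy stackmap: one slot per hash bucket, so a colliding trace is lost."""
    table = {}
    for trace in traces:
        bucket = sum(trace) % nbuckets  # stand-in hash, purely illustrative
        table.setdefault(bucket, tuple(trace))
    return table


def record_with_get_stack(traces):
    """Toy bpf_get_stack: every trace is handed to the program as-is."""
    return [tuple(trace) for trace in traces]


if __name__ == "__main__":
    # The first two traces collide under the stand-in hash (both sum to 6).
    traces = [(1, 2, 3), (3, 2, 1), (4, 5, 6)]
    print(len(record_with_stackmap(traces, nbuckets=16)))
    print(len(record_with_get_stack(traces)))
```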
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: change prototype for stack_map_get_build_id_offset · 5f412632
      Yonghong Song authored
      This patch introduces no functional change. The function prototype
      is changed so that the same function can be reused later.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 27 Apr, 2018 2 commits
    • bpf, doc: Update bpf_jit_enable limitation for CONFIG_BPF_JIT_ALWAYS_ON · 2c25fc9a
      Leo Yan authored
      When CONFIG_BPF_JIT_ALWAYS_ON is enabled, the kernel restricts
      bpf_jit_enable: it has the fixed value 1 and cannot be set to 2
      for JIT opcode dumping. This patch updates the doc accordingly.
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 79741a38
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-04-27
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Add extensive BPF helper description into include/uapi/linux/bpf.h
         and a new script bpf_helpers_doc.py which allows for generating a
         man page out of it. Thus, every helper in BPF now comes with proper
         function signature, detailed description and return code explanation,
         from Quentin.
      
      2) Migrate the BPF collect metadata tunnel tests from BPF samples over
         to the BPF selftests and further extend them with v6 vxlan, geneve
         and ipip tests, simplify the ipip tests, improve documentation and
         convert to bpf_ntoh*() / bpf_hton*() api, from William.
      
      3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only
         access stack and packet memory. Extend this to allow such helpers
         to also use map values, which enabled use cases where value from
         a first lookup can be directly used as a key for a second lookup,
         from Paul.
      
      4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in
         order to retrieve XFRM state information containing SPI, peer
         address and reqid values, from Eyal.
      
      5) Various optimizations in nfp driver's BPF JIT in order to turn ADD
         and SUB instructions with negative immediate into the opposite
         operation with a positive immediate such that nfp can better fit
         small immediates into instructions. Savings in instruction count
         up to 4% have been observed, from Jakub.
      
      6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info
         and add support for dumping this through bpftool, from Jiri.
      
      7) Move the BPF sockmap samples over into BPF selftests instead since
         sockmap was rather a series of tests than sample anyway and this way
         this can be run from automated bots, from John.
      
      8) Follow-up fix for bpf_adjust_tail() helper in order to make it work
         with generic XDP, from Nikita.
      
      9) Some follow-up cleanups to BTF, namely, removing unused defines from
         BTF uapi header and renaming 'name' struct btf_* members into name_off
         to make it more clear they are offsets into string section, from Martin.
      
      10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since
          not run directly but invoked from test_sock_addr.sh, from Yonghong.
      
      11) Remove redundant ret assignment in sample BPF loader, from Wang.
      
      12) Add couple of missing files to BPF selftest's gitignore, from Anders.
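The rewrite mentioned in item 5 (ADD/SUB with a negative immediate turned into the opposite operation) can be sketched as follows. This is a generic illustration of the idea, not the nfp JIT code: the function names are hypothetical, and leaving the most negative 32-bit immediate untouched is an assumption made here because its negation does not fit in a signed 32-bit field.

```python
MASK64 = (1 << 64) - 1
S32_MIN = -(1 << 31)


def flip_negative_imm(op, imm):
    """Turn ADD/SUB with a negative immediate into the opposite operation."""
    if op not in ("add", "sub") or imm >= 0 or imm == S32_MIN:
        return op, imm
    return ("sub" if op == "add" else "add"), -imm


def apply(op, imm, x):
    """Evaluate 'x op imm' with 64-bit wrap-around, to check equivalence."""
    return (x + imm if op == "add" else x - imm) & MASK64


if __name__ == "__main__":
    print(flip_negative_imm("add", -3))
    print(flip_negative_imm("sub", -1))
    print(flip_negative_imm("add", 7))
```

Both forms compute the same 64-bit result; the positive immediate is simply smaller to encode.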
      
      There are two trivial merge conflicts while pulling:
      
        1) Remove samples/sockmap/Makefile since all sockmap tests have been
           moved to selftests.
        2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the
           file since git should ignore all of them.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 26 Apr, 2018 30 commits
    • samples, bpf: remove redundant ret assignment in bpf_load_program() · c0885f61
      Wang Sheng-Hui authored
      Two redundant ret assignments are removed:
      
      * 'ret = 1' before the logic 'if (data_maps)', where any error jumps to
        label 'done'. No 'ret = 1' is needed before the error jump.
      
      * After the '/* load programs */' part, if everything goes well, the
        BPF code will be loaded and 'ret' set to 0 by load_and_attach().
        If something goes wrong, 'ret' is set to non-zero, and the redundant
        'ret = 0' after the for clause causes the error to be skipped.
      
        For example, if some BPF code cannot provide supported program types
        in ELF SEC("unknown"), the for clause will not call load_and_attach()
        to load the BPF code. 1 should be returned to callers instead of 0.
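The SEC("unknown") case can be reproduced with a simplified control-flow model. This sketch is not the sample loader: `load_sections`, the `SUPPORTED_PREFIXES` tuple and the `redundant_reset` switch are hypothetical, and only the handling of 'ret' around the loop is modelled.

```python
SUPPORTED_PREFIXES = ("kprobe/", "socket", "xdp")  # hypothetical subset


def load_sections(sections, load_and_attach, redundant_reset):
    """Return 0 on success and non-zero on failure, like the loader's 'ret'."""
    ret = 1  # pessimistic default: nothing has been loaded yet
    for name in sections:
        if name.startswith(SUPPORTED_PREFIXES):
            ret = load_and_attach(name)
            if ret != 0:
                return ret  # models 'goto done'
    if redundant_reset:
        ret = 0  # the assignment removed by the patch
    return ret


if __name__ == "__main__":
    always_ok = lambda name: 0
    # Old behaviour: an unsupported section is silently reported as success.
    print(load_sections(["unknown"], always_ok, redundant_reset=True))
    # New behaviour: the caller sees a failure.
    print(load_sections(["unknown"], always_ok, redundant_reset=False))
```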
      Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'bpf-uapi-helper-doc' · a6712d45
      Daniel Borkmann authored
      Quentin Monnet says:
      
      ====================
      eBPF helper functions can be called from within eBPF programs to perform
      a variety of tasks that would be otherwise hard or impossible to do with
      eBPF itself. There is a growing number of such helper functions in the
      kernel, but documentation is scarce. The main user space header file
      does contain a short commented description of most helpers, but it is
      somewhat outdated and not complete. It is more a "cheat sheet" than
      real documentation accessible to new eBPF developers.
      
      This commit attempts to improve the situation by replacing the existing
      overview for the helpers with a more developed description. Furthermore,
      a Python script is added to generate a manual page for eBPF helpers. The
      workflow is the following, and requires the rst2man utility:
      
          $ ./scripts/bpf_helpers_doc.py \
                  --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
          $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
          $ man /tmp/bpf-helpers.7
      
      The objective is to keep all documentation related to the helpers in a
      single place, and to be able to generate from here a manual page that
      could be packaged in the man-pages repository and shipped with most
      distributions.
      
      Additionally, parsing the prototypes of the helper functions could
      hopefully be reused, with a different Printer object, to generate
      header files needed in some eBPF-related projects.
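The idea of parsing a prototype once and rendering it with different printers can be sketched as follows. This is not the code of bpf_helpers_doc.py: the regular expression, `parse_prototype`, and the two printer classes are assumptions written for illustration, and the output formats are invented.

```python
import re

# Return type (ending in whitespace or '*'), helper name, raw argument list.
PROTO_RE = re.compile(r"^(?P<ret>.+?[\s*]+)(?P<name>bpf_\w+)\((?P<args>.*)\)$")


def parse_prototype(proto):
    """Split a C helper prototype into return type, name and arguments."""
    m = PROTO_RE.match(proto.strip())
    if m is None:
        raise ValueError("not a helper prototype: %r" % proto)
    args = [a.strip() for a in m.group("args").split(",") if a.strip()]
    return {"ret": m.group("ret").strip(), "name": m.group("name"), "args": args}


class RstPrinter:
    """Hypothetical printer: reStructuredText with the helper name in bold."""

    def render(self, helper):
        return "%s **%s**\\ (%s)" % (
            helper["ret"], helper["name"], ", ".join(helper["args"]))


class HeaderPrinter:
    """Hypothetical printer: a C function-pointer declaration."""

    def render(self, helper):
        return "static %s (*%s)(%s);" % (
            helper["ret"], helper["name"], ", ".join(helper["args"]))


if __name__ == "__main__":
    proto = "void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)"
    helper = parse_prototype(proto)
    for printer in (RstPrinter(), HeaderPrinter()):
        print(printer.render(helper))
```

Keeping the parsed form as a plain dictionary means a new output format only needs a new class with a `render` method.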
      
      Regarding the description of each helper, it comprises several items:
      
      - The function prototype.
      - A description of the function and of its arguments (except for a
        couple of cases, when there are no arguments and the return value
        makes the function usage really obvious).
      - A description of return values (if not void).
      
      Additional items such as the list of compatible eBPF program and map
      types for each helper, Linux kernel version that introduced the helper,
      GPL-only restriction, and commit hash could be added in the future, but
      it was decided on the mailing list to leave them aside for now.
      
      For several helpers, descriptions are inspired (at times, nearly copied)
      from the commit logs introducing them in the kernel--Many thanks to
      their respective authors! Some sentences were also adapted from comments
      from the reviews, thanks to the reviewers as well. Descriptions were
      completed as much as possible, the objective being to have something easily
      accessible even for people just starting with eBPF. There is probably a bit
      more work to do in this direction for some helpers.
      
      Some RST formatting is used in the descriptions (not in function
      prototypes, to keep them readable, but the Python script provided in
      order to generate the RST for the manual page does add formatting to
      prototypes, to produce something pretty) to get "bold" and "italics" in
      manual pages. Hopefully, the descriptions in the bpf.h file remain
      perfectly readable. Note that the few trailing white spaces are
      intentional: removing them would break paragraphs for rst2man.
      
      The descriptions should ideally be updated each time someone adds a new
      helper, or updates the behaviour (new socket option supported, ...) or
      the interface (new flags available, ...) of existing ones.
      
      To ease the review process, the documentation has been split into several
      patches.
      
      v3 -> v4:
      - Add a patch (#9) for newly added BPF helpers.
      - Add a patch (#10) to update UAPI bpf.h version under tools/.
      - Use SPDX tag in Python script.
      - Several fixes on man page header and footer, and helpers documentation.
        Please refer to individual patches for details.
      
      RFC v2 -> PATCH v3:
      Several fixes on man page header and footer, and helpers documentation.
      Please refer to individual patches for details.
      
      RFC v1 -> RFC v2:
      - Remove "For" (compatible program and map types), "Since" (minimal
        Linux kernel version required), "GPL only" sections and commit hashes
        for the helpers.
      - Add comment on top of the description list to explain how this
        documentation is supposed to be processed.
      - Update Python script accordingly (remove the same sections, and remove
        paragraphs on program types and GPL restrictions from man page
        header).
      - Split series into several patches.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: linux-doc@vger.kernel.org
      Cc: linux-man@vger.kernel.org
    • bpf: update bpf.h uapi header for tools · 9cde0c88
      Quentin Monnet authored
      Update tools/include/uapi/linux/bpf.h file in order to reflect the
      changes for BPF helper functions documentation introduced in previous
      commits.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add documentation for eBPF helpers (65-66) · 2d020dd7
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions:
      
      Helper from Nikita:
      - bpf_xdp_adjust_tail()
      
      Helper from Eyal:
      - bpf_skb_get_xfrm_state()
      
      v4:
      - New patch (helpers did not exist yet for previous versions).
      
      Cc: Nikita V. Shirokov <tehnerd@tehnerd.com>
      Cc: Eyal Birger <eyal.birger@gmail.com>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add documentation for eBPF helpers (58-64) · ab127040
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions, all
      written by John:
      
      - bpf_redirect_map()
      - bpf_sk_redirect_map()
      - bpf_sock_map_update()
      - bpf_msg_redirect_map()
      - bpf_msg_apply_bytes()
      - bpf_msg_cork_bytes()
      - bpf_msg_pull_data()
      
      v4:
      - bpf_redirect_map(): Fix typos: "XDP_ABORT" changed to "XDP_ABORTED",
        "his" to "this". Also add a paragraph on performance improvement over
        bpf_redirect() helper.
      
      v3:
      - bpf_sk_redirect_map(): Improve description of BPF_F_INGRESS flag.
      - bpf_msg_redirect_map(): Improve description of BPF_F_INGRESS flag.
      - bpf_redirect_map(): Fix note on CPU redirection, not fully implemented
        for generic XDP but supported on native XDP.
      - bpf_msg_pull_data(): Clarify comment about invalidated verifier
        checks.
      
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add documentation for eBPF helpers (51-57) · 7aa79a86
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions:
      
      Helpers from Lawrence:
      - bpf_setsockopt()
      - bpf_getsockopt()
      - bpf_sock_ops_cb_flags_set()
      
      Helpers from Yonghong:
      - bpf_perf_event_read_value()
      - bpf_perf_prog_read_value()
      
      Helper from Josef:
      - bpf_override_return()
      
      Helper from Andrey:
      - bpf_bind()
      
      v4:
      - bpf_perf_event_read_value(): State that this helper should be
        preferred over bpf_perf_event_read().
      
      v3:
      - bpf_perf_event_read_value(): Fix time of selection for perf event type
        in description. Remove occurrences of "cores" to avoid confusion with
        "CPU".
      - bpf_bind(): Remove last paragraph of description, which was off topic.
      
      Cc: Lawrence Brakmo <brakmo@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      [for bpf_perf_event_read_value(), bpf_perf_prog_read_value()]
      Acked-by: Andrey Ignatov <rdna@fb.com>
      [for bpf_bind()]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add documentation for eBPF helpers (42-50) · c6b5fb86
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions:
      
      Helper from Kaixu:
      - bpf_perf_event_read()
      
      Helpers from Martin:
      - bpf_skb_under_cgroup()
      - bpf_xdp_adjust_head()
      
      Helpers from Sargun:
      - bpf_probe_write_user()
      - bpf_current_task_under_cgroup()
      
      Helper from Thomas:
      - bpf_skb_change_head()
      
      Helper from Gianluca:
      - bpf_probe_read_str()
      
      Helpers from Chenbo:
      - bpf_get_socket_cookie()
      - bpf_get_socket_uid()
      
      v4:
      - bpf_perf_event_read(): State that bpf_perf_event_read_value() should
        be preferred over this helper.
      - bpf_skb_change_head(): Clarify comment about invalidated verifier
        checks.
      - bpf_xdp_adjust_head(): Clarify comment about invalidated verifier
        checks.
      - bpf_probe_write_user(): Add that dst must be a valid user space
        address.
      - bpf_get_socket_cookie(): Improve description by making clearer that
        the cookie belongs to the socket, and state that it remains stable for
        the life of the socket.
      
      v3:
      - bpf_perf_event_read(): Fix time of selection for perf event type in
        description. Remove occurrences of "cores" to avoid confusion with
        "CPU".
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Gianluca Borello <g.borello@gmail.com>
      Cc: Chenbo Feng <fengc@google.com>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      [for bpf_skb_under_cgroup(), bpf_xdp_adjust_head()]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add documentation for eBPF helpers (33-41) · fa15601a
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions, all
      written by Daniel:
      
      - bpf_get_hash_recalc()
      - bpf_skb_change_tail()
      - bpf_skb_pull_data()
      - bpf_csum_update()
      - bpf_set_hash_invalid()
      - bpf_get_numa_node_id()
      - bpf_set_hash()
      - bpf_skb_adjust_room()
      - bpf_xdp_adjust_meta()
      
      v4:
      - bpf_skb_change_tail(): Clarify comment about invalidated verifier
        checks.
      - bpf_skb_pull_data(): Clarify the motivation for using this helper or
        bpf_skb_load_bytes(), on non-linear buffers. Fix RST formatting for
        *skb*. Clarify comment about invalidated verifier checks.
      - bpf_csum_update(): Fix description of checksum (entire packet, not IP
        checksum). Fix a typo: "header" instead of "helper".
      - bpf_set_hash_invalid(): Mention bpf_get_hash_recalc().
      - bpf_get_numa_node_id(): State that the helper is not restricted to
        programs attached to sockets.
      - bpf_skb_adjust_room(): Clarify comment about invalidated verifier
        checks.
      - bpf_xdp_adjust_meta(): Clarify comment about invalidated verifier
        checks.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add documentation for eBPF helpers (23-32) · 1fdd08be
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions, all
      written by Daniel:
      
      - bpf_get_prandom_u32()
      - bpf_get_smp_processor_id()
      - bpf_get_cgroup_classid()
      - bpf_get_route_realm()
      - bpf_skb_load_bytes()
      - bpf_csum_diff()
      - bpf_skb_get_tunnel_opt()
      - bpf_skb_set_tunnel_opt()
      - bpf_skb_change_proto()
      - bpf_skb_change_type()
      
      v4:
      - bpf_get_prandom_u32(): Warn that the prng is not cryptographically
        secure.
      - bpf_get_smp_processor_id(): Fix a typo (case).
      - bpf_get_cgroup_classid(): Clarify description. Add notes on the helper
        being limited to cgroup v1, and to egress path.
      - bpf_get_route_realm(): Add comparison with bpf_get_cgroup_classid().
        Add a note about usage with TC and advantage of clsact. Fix a typo in
        return value ("sdb" instead of "skb").
      - bpf_skb_load_bytes(): Make explicit loading large data loads it to the
        eBPF stack.
      - bpf_csum_diff(): Add a note on seed that can be cascaded. Link to
        bpf_l3|l4_csum_replace().
      - bpf_skb_get_tunnel_opt(): Add a note about usage with "collect
        metadata" mode, and example of this with Geneve.
      - bpf_skb_set_tunnel_opt(): Add a link to bpf_skb_get_tunnel_opt()
        description.
      - bpf_skb_change_proto(): Mention that the main use case is NAT64.
        Clarify comment about invalidated verifier checks.
      
      v3:
      - bpf_get_prandom_u32(): Fix helper name :(. Add description, including
        a note on the internal random state.
      - bpf_get_smp_processor_id(): Add description, including a note on the
        processor id remaining stable during program run.
      - bpf_get_cgroup_classid(): State that CONFIG_CGROUP_NET_CLASSID is
        required to use the helper. Add a reference to related documentation.
        State that placing a task in net_cls controller disables cgroup-bpf.
      - bpf_get_route_realm(): State that CONFIG_CGROUP_NET_CLASSID is
        required to use this helper.
      - bpf_skb_load_bytes(): Fix comment on current use cases for the helper.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add documentation for eBPF helpers (12-22) · c456dec4
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions, all
      written by Alexei:
      
      - bpf_get_current_pid_tgid()
      - bpf_get_current_uid_gid()
      - bpf_get_current_comm()
      - bpf_skb_vlan_push()
      - bpf_skb_vlan_pop()
      - bpf_skb_get_tunnel_key()
      - bpf_skb_set_tunnel_key()
      - bpf_redirect()
      - bpf_perf_event_output()
      - bpf_get_stackid()
      - bpf_get_current_task()
      
      v4:
      - bpf_redirect(): Fix typo: "XDP_ABORT" changed to "XDP_ABORTED". Add
        note on bpf_redirect_map() providing better performance. Replace "Save
        for" with "Except for".
      - bpf_skb_vlan_push(): Clarify comment about invalidated verifier
        checks.
      - bpf_skb_vlan_pop(): Clarify comment about invalidated verifier
        checks.
      - bpf_skb_get_tunnel_key(): Add notes on tunnel_id, "collect metadata"
        mode, and example tunneling protocols with which it can be used.
      - bpf_skb_set_tunnel_key(): Add a reference to the description of
        bpf_skb_get_tunnel_key().
      - bpf_perf_event_output(): Specify that, and for what purpose, the
        helper can be used with programs attached to TC and XDP.
      
      v3:
      - bpf_skb_get_tunnel_key(): Change and improve description and example.
      - bpf_redirect(): Improve description of BPF_F_INGRESS flag.
      - bpf_perf_event_output(): Fix first sentence of description. Delete
        wrong statement on context being evaluated as a struct pt_reg. Remove
        the long yet incomplete example.
      - bpf_get_stackid(): Add a note about PERF_MAX_STACK_DEPTH being
        configurable.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c456dec4
    • Quentin Monnet's avatar
      bpf: add documentation for eBPF helpers (01-11) · ad4a5223
      Quentin Monnet authored
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions, all
      written by Alexei:
      
      - bpf_map_lookup_elem()
      - bpf_map_update_elem()
      - bpf_map_delete_elem()
      - bpf_probe_read()
      - bpf_ktime_get_ns()
      - bpf_trace_printk()
      - bpf_skb_store_bytes()
      - bpf_l3_csum_replace()
      - bpf_l4_csum_replace()
      - bpf_tail_call()
      - bpf_clone_redirect()
      
      v4:
      - bpf_map_lookup_elem(): Add "const" qualifier for key.
      - bpf_map_update_elem(): Add "const" qualifier for key and value.
      - bpf_map_delete_elem(): Add "const" qualifier for key.
      - bpf_skb_store_bytes(): Clarify comment about invalidated verifier
        checks.
      - bpf_l3_csum_replace(): Mention L3 instead of just IP, and add a note
        about bpf_csum_diff().
      - bpf_l4_csum_replace(): Mention L4 instead of just TCP/UDP, and add a
        note about bpf_csum_diff().
      - bpf_tail_call(): Bring minor edits to description.
      - bpf_clone_redirect(): Add a note about the relation with
        bpf_redirect(). Also clarify comment about invalidated verifier
        checks.
      
      v3:
      - bpf_map_lookup_elem(): Fix description of restrictions for flags
        related to the existence of the entry.
      - bpf_trace_printk(): State that trace_pipe can be configured. Fix
        return value in case an unknown format specifier is met. Add a note on
        kernel log notice when the helper is used. Edit example.
      - bpf_tail_call(): Improve comment on stack inheritance.
      - bpf_clone_redirect(): Improve description of BPF_F_INGRESS flag.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      ad4a5223
    • Quentin Monnet's avatar
      bpf: add script and prepare bpf.h for new helpers documentation · 56a092c8
      Quentin Monnet authored
      Remove previous "overview" of eBPF helpers from user bpf.h header.
      Replace it by a comment explaining how to process the new documentation
      (to come in following patches) with a Python script to produce RST, then
      man page documentation.
      
      Also add the aforementioned Python script under scripts/. It is used to
      process include/uapi/linux/bpf.h and to extract helper descriptions, to
      turn it into a RST document that can further be processed with rst2man
      to produce a man page. The script takes one "--filename <path/to/file>"
      option. If the script is launched from scripts/ in the kernel root
      directory, it should be able to find the location of the header to
      parse, and "--filename <path/to/file>" is then optional. If it cannot
      find the file, then the option becomes mandatory. RST-formatted
      documentation is printed to standard output.
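
      The extraction step described above could be sketched roughly as
      follows. This is not the script from this patch: the comment layout in
      SAMPLE, the helper names bpf_example_one/bpf_example_two, and the
      functions parse_helpers() and to_rst() are all hypothetical and only
      illustrate the idea of turning a structured header comment into RST.

```python
import re

# Hypothetical, simplified header comment; the real bpf.h layout may differ.
SAMPLE = (
    "/*\n"
    " * int bpf_example_one(void *ctx)\n"
    " *     Description\n"
    " *         First illustrative helper.\n"
    " *     Return\n"
    " *         0 on success.\n"
    " *\n"
    " * int bpf_example_two(void *ctx, int flags)\n"
    " *     Description\n"
    " *         Second illustrative helper.\n"
    " *     Return\n"
    " *         A negative error on failure.\n"
    " */\n"
)

# A prototype starts right after " * " and names a bpf_* function.
PROTO_RE = re.compile(r"^ \* (\S.*\bbpf_\w+\(.*\))$")
# Section headings are indented and consist of a single keyword.
SECTION_RE = re.compile(r"^ \*\s+(Description|Return)$")


def parse_helpers(text):
    """Collect one dict per helper: prototype plus section bodies."""
    helpers = []
    section = None
    for line in text.splitlines():
        proto = PROTO_RE.match(line)
        if proto:
            helpers.append({"proto": proto.group(1),
                            "Description": [], "Return": []})
            section = None
            continue
        heading = SECTION_RE.match(line)
        if heading:
            section = heading.group(1)
            continue
        # Strip the leading " *" of the comment and surrounding whitespace.
        body = line[2:].strip() if line.startswith(" *") else ""
        if helpers and section and body and body != "/":
            helpers[-1][section].append(body)
    return helpers


def to_rst(helpers):
    """Render the parsed helpers as a small RST-like document."""
    out = []
    for h in helpers:
        out.append("**" + h["proto"] + "**")
        out.append("    Description: " + " ".join(h["Description"]))
        out.append("    Return: " + " ".join(h["Return"]))
        out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    print(to_rst(parse_helpers(SAMPLE)))
```

      Keeping parsing and rendering separate means the same parsed data
      could feed another output format later.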
      
      Typical workflow for producing the final man page would be:
      
          $ ./scripts/bpf_helpers_doc.py \
                  --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
          $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
          $ man /tmp/bpf-helpers.7
      
      Note that the tool kernel-doc cannot be used to document eBPF helpers,
      whose signatures are not available directly in the header files
      (pre-processor directives are used to produce them at the beginning of
      the compilation process).
      
      v4:
      - Also remove overviews for newly added bpf_xdp_adjust_tail() and
        bpf_skb_get_xfrm_state().
      - Remove vague statement about what helpers are restricted to GPL
        programs in "LICENSE" section for man page footer.
      - Replace license boilerplate with SPDX tag for Python script.
      
      v3:
      - Change license for man page.
      - Remove "for safety reasons" from man page header text.
      - Change "packets metadata" to "packets" in man page header text.
      - Move and fix comment on helpers introducing no overhead.
      - Remove "NOTES" section from man page footer.
      - Add "LICENSE" section to man page footer.
      - Edit description of file include/uapi/linux/bpf.h in man page footer.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      56a092c8
    • Daniel Borkmann's avatar
      Merge branch 'bpf-tunnel-metadata-selftests' · 3f13de6d
      Daniel Borkmann authored
      William Tu says:
      
      ====================
      This patch series provides an end-to-end eBPF tunnel testsuite.  A common
      topology, shown below, is created for all types of tunnels:
      
      Topology:
      ---------
           root namespace   |     at_ns0 namespace
                            |
            -----------     |     -----------
            | tnl dev |     |     | tnl dev |  (overlay network)
            -----------     |     -----------
            metadata-mode   |     native-mode
             with bpf       |
                            |
            ----------      |     ----------
            |  veth1  | --------- |  veth0  |  (underlay network)
            ----------    peer    ----------
      
      Device Configuration
      --------------------
       Root namespace with metadata-mode tunnel + BPF
       Device names and addresses:
             veth1 IPv4: 172.16.1.200, IPv6: 00::22 (underlay)
             tunnel dev <type>11, ex: gre11, IPv4: 10.1.1.200 (overlay)
      
       Namespace at_ns0 with native tunnel
       Device names and addresses:
             veth0 IPv4: 172.16.1.100, IPv6: 00::11 (underlay)
             tunnel dev <type>00, ex: gre00, IPv4: 10.1.1.100 (overlay)
      
      End-to-end ping packet flow
      ---------------------------
       Most of the tests start with namespace creation and device configuration,
       then ping the underlay and overlay networks.  When doing 'ping 10.1.1.100'
       from the root namespace, the following operations happen:
       1) Route lookup shows 10.1.1.100/24 belongs to the tnl dev; fwd to tnl dev.
       2) The tnl device's egress BPF program is triggered and sets the tunnel
          metadata, with remote_ip=172.16.1.200 and others.
       3) The outer tunnel header is prepended and the packet is routed to
          veth1's egress.
       4) veth0's ingress queue receives the tunneled packet in namespace at_ns0.
       5) The tunnel protocol handler, ex: vxlan_rcv, decapsulates the packet.
       6) The packet is forwarded to the overlay tnl dev.
      
      Test Cases
      -----------------------------
       Tunnel Type |  BPF Programs
      -----------------------------
       GRE:          gre_set_tunnel, gre_get_tunnel
       IP6GRE:       ip6gretap_set_tunnel, ip6gretap_get_tunnel
       ERSPAN:       erspan_set_tunnel, erspan_get_tunnel
       IP6ERSPAN:    ip4ip6erspan_set_tunnel, ip4ip6erspan_get_tunnel
       VXLAN:        vxlan_set_tunnel, vxlan_get_tunnel
       IP6VXLAN:     ip6vxlan_set_tunnel, ip6vxlan_get_tunnel
       GENEVE:       geneve_set_tunnel, geneve_get_tunnel
       IP6GENEVE:    ip6geneve_set_tunnel, ip6geneve_get_tunnel
       IPIP:         ipip_set_tunnel, ipip_get_tunnel
       IP6IP:        ipip6_set_tunnel, ipip6_get_tunnel,
                     ip6ip6_set_tunnel, ip6ip6_get_tunnel
       XFRM:         xfrm_get_state
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3f13de6d
    • William Tu's avatar
      samples/bpf: remove the bpf tunnel testsuite. · b05cd740
      William Tu authored
      Move the testsuite to
      selftests/bpf/{test_tunnel_kern.c, test_tunnel.sh}
      Signed-off-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b05cd740
    • William Tu's avatar
      selftests/bpf: bpf tunnel test. · 933a741e
      William Tu authored
      The patch migrates the original tests at samples/bpf/tcbpf2_kern.c
      and samples/bpf/test_tunnel_bpf.sh to selftests.  There are a couple of
      changes from the original:
          1) add ipv6 vxlan, ipv6 geneve, ipv6 ipip tests
          2) simplify the original ipip tests (remove iperf tests)
          3) improve documentation
          4) use bpf_ntoh* and bpf_hton* api
      
      In summary, 'test_tunnel_kern.o' contains the following bpf programs:
        GRE: gre_set_tunnel, gre_get_tunnel
        IP6GRE: ip6gretap_set_tunnel, ip6gretap_get_tunnel
        ERSPAN: erspan_set_tunnel, erspan_get_tunnel
        IP6ERSPAN: ip4ip6erspan_set_tunnel, ip4ip6erspan_get_tunnel
        VXLAN: vxlan_set_tunnel, vxlan_get_tunnel
        IP6VXLAN: ip6vxlan_set_tunnel, ip6vxlan_get_tunnel
        GENEVE: geneve_set_tunnel, geneve_get_tunnel
        IP6GENEVE: ip6geneve_set_tunnel, ip6geneve_get_tunnel
        IPIP: ipip_set_tunnel, ipip_get_tunnel
        IP6IP: ipip6_set_tunnel, ipip6_get_tunnel,
               ip6ip6_set_tunnel, ip6ip6_get_tunnel
        XFRM: xfrm_get_state
      Signed-off-by: William Tu <u9012063@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      933a741e
    • Nikita V. Shirokov's avatar
      bpf: fix xdp_generic for bpf_adjust_tail usecase · f7613120
      Nikita V. Shirokov authored
      When bpf_adjust_tail was introduced for generic xdp, it changed the skb's
      tail pointer so that it pointed to the new "end of the packet". However,
      the skb's len field wasn't properly modified, so the ethernet frame on the
      wire kept its original size (or an even bigger one, if adjust_head was
      used). This patch fixes that.
      
      Fixes: 198d83bb ("bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail")
      Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f7613120
    • Jiri Olsa's avatar
      tools, bpftool: Display license GPL compatible in prog show/list · 9b984a20
      Jiri Olsa authored
      Display the license "gpl" string in bpftool prog command, like:
      
        # bpftool prog list
        5: tracepoint  name func  tag 57cd311f2e27366b  gpl
                loaded_at Apr 26/09:37  uid 0
                xlated 16B  not jited  memlock 4096B
      
        # bpftool --json --pretty prog show
        [{
                "id": 5,
                "type": "tracepoint",
                "name": "func",
                "tag": "57cd311f2e27366b",
                "gpl_compatible": true,
                "loaded_at": "Apr 26/09:37",
                "uid": 0,
                "bytes_xlated": 16,
                "jited": false,
                "bytes_memlock": 4096
            }
        ]
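
      As a small illustration of consuming the JSON form, the sketch below
      parses the sample output shown above and lists the ids of programs
      flagged as GPL compatible. The function name gpl_compatible_ids() is
      hypothetical; only the "id" and "gpl_compatible" keys come from the
      output above.

```python
import json

# Sample output copied from the "bpftool --json --pretty prog show" example.
BPFTOOL_JSON = """
[{
        "id": 5,
        "type": "tracepoint",
        "name": "func",
        "tag": "57cd311f2e27366b",
        "gpl_compatible": true,
        "loaded_at": "Apr 26/09:37",
        "uid": 0,
        "bytes_xlated": 16,
        "jited": false,
        "bytes_memlock": 4096
    }
]
"""


def gpl_compatible_ids(raw):
    """Return the ids of all programs whose gpl_compatible flag is true."""
    programs = json.loads(raw)
    # Treat a missing key as "not GPL compatible" (e.g. older bpftool output).
    return [p["id"] for p in programs if p.get("gpl_compatible", False)]


if __name__ == "__main__":
    print(gpl_compatible_ids(BPFTOOL_JSON))
```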
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9b984a20
    • Jiri Olsa's avatar
      tools, bpf: Sync bpf.h uapi header · fb6ef42b
      Jiri Olsa authored
      Syncing the bpf.h uapi header with tools.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      fb6ef42b
    • Jiri Olsa's avatar
      bpf: Add gpl_compatible flag to struct bpf_prog_info · b85fab0e
      Jiri Olsa authored
      Adding gpl_compatible flag to struct bpf_prog_info
      so it can be dumped via bpf_prog_get_info_by_fd and
      displayed via bpftool progs dump.
      
      Alexei noticed 4-byte hole in struct bpf_prog_info,
      so we put the u32 flags field in there, and we can
      keep adding bit fields in there without breaking
      user space.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b85fab0e
    • David S. Miller's avatar
      Merge branch 'udp-gso' · cb586c63
      David S. Miller authored
      Willem de Bruijn says:
      
      ====================
      udp gso
      
      Segmentation offload reduces cycles/byte for large packets by
      amortizing the cost of protocol stack traversal.
      
      This patchset implements GSO for UDP. A process can concatenate and
      submit multiple datagrams to the same destination in one send call
      by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
      or passing an analogous cmsg at send time.
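
      From user space, the socket option variant might be used as in the
      minimal sketch below. It assumes UDP_SEGMENT has the value 103 (as in
      linux/udp.h) and defines it locally in case the socket module does not
      export it; enable_udp_gso() is a hypothetical helper name. The call is
      expected to fail with OSError on kernels without this patchset, which
      the sketch reports as False.

```python
import socket

# Assumption: value of UDP_SEGMENT from linux/udp.h.
UDP_SEGMENT = 103


def enable_udp_gso(sock, segment_size):
    """Try to set UDP_SEGMENT on sock; return True on success."""
    try:
        # IPPROTO_UDP has the same numeric value as SOL_UDP (17).
        sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, segment_size)
    except OSError:
        # Kernel or platform without UDP GSO support.
        return False
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print("UDP GSO enabled:", enable_udp_gso(s, 1400))
```

      Once the option is set, a single send call with a payload larger than
      the segment size is split by the stack into datagrams of that size.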
      
      The stack will send the entire large (up to network layer max size)
      datagram through the protocol layer. At the GSO layer, it is broken
      up into individual segments. All receive the same network layer header
      and UDP src and dst port. All but the last segment have the same UDP
      header, but the last may differ in length and checksum.
      
      Initial results show a significant reduction in UDP cycles/byte.
      See the main patch for more details and benchmark results.
      
              udp
                876 MB/s 14873 msg/s 624666 calls/s
                  11,205,777,429      cycles
      
              udp gso
               2139 MB/s 36282 msg/s 36282 calls/s
                  11,204,374,561      cycles
      
      The patch set is broken down as follows:
      - patch 1 is a prerequisite: code rearrangement, noop otherwise
      - patch 2 implements the gso logic
      - patch 3 adds protocol stack support for UDP_SEGMENT
      - patch 4,5,7 are refinements
      - patch 6 adds the cmsg interface
      - patch 8..11 are tests
      
      This idea was presented previously at netconf 2017-2
      http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
      
      Changes v1 -> v2
        - Convert __udp_gso_segment to modify headers after skb_segment
        - Split main patch into two, one for gso logic, one for UDP_SEGMENT
      
      Changes RFC -> v1
        - MSG_MORE:
            fixed, by allowing checksum offload with corking if gso
        - SKB_GSO_UDP_L4:
            made independent from SKB_GSO_UDP
            and removed skb_is_ufo() wrapper
        - NETIF_F_GSO_UDP_L4:
            add to netdev_features_string
            and to netdev-features.txt
            add BUILD_BUG_ON to match SKB_GSO_UDP_L4 value
        - UDP_MAX_SEGMENTS:
            introduce limit on number of segments per gso skb
            to avoid extreme cases like IP_MAX_MTU/IPV4_MIN_MTU
        - CHECKSUM_PARTIAL:
            test against missing feature after ndo_features_check
            if not supported return error, analogous to udp_send_check
        - MSG_ZEROCOPY: removed, deferred for now
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb586c63
    • Willem de Bruijn's avatar
      selftests: udp gso benchmark · 3a687bef
      Willem de Bruijn authored
      Send udp data between a source and sink, optionally with udp gso.
      The two processes are expected to be run on separate hosts.
      
      A script is included that runs them together over loopback in a
      single namespace for functionality testing.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a687bef
    • Willem de Bruijn's avatar
      selftests: udp gso with corking · 3f12817f
      Willem de Bruijn authored
      Corked sockets take a different path to construct a udp datagram than
      the lockless fast path. Test this alternate path.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f12817f
    • Willem de Bruijn's avatar
      selftests: udp gso with connected sockets · e5b2d91c
      Willem de Bruijn authored
      Connected sockets use path mtu instead of device mtu.
      
      Test this path by inserting a route mtu that is lower than the device
      mtu. Verify that the path mtu for the connection matches this lower
      number, then run the same test as in the connectionless case.
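
      One way to observe the path mtu of a connected socket from user space
      is the IP_MTU socket option, sketched below. This is not the selftest
      from this patch: connected_path_mtu() is a hypothetical helper, and
      IP_MTU is defined locally with the value 14 from linux/in.h. The
      option is only valid on a connected socket, and the example is
      Linux-specific.

```python
import socket

# Assumption: value of IP_MTU from linux/in.h.
IP_MTU = 14


def connected_path_mtu(dest):
    """Connect a UDP socket to dest and return the kernel's path mtu."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on UDP sends nothing; it only fixes the destination
        # and lets the kernel resolve the route used for the path mtu.
        s.connect(dest)
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)


if __name__ == "__main__":
    print(connected_path_mtu(("127.0.0.1", 9)))
```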
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e5b2d91c
    • Willem de Bruijn's avatar
      selftests: udp gso · a1607257
      Willem de Bruijn authored
      Validate udp gso, including edge cases (such as min/max gso sizes).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1607257
    • Willem de Bruijn's avatar
      udp: add gso support to virtual devices · 83aa025f
      Willem de Bruijn authored
      Virtual devices such as tunnels and bonding can handle large packets.
      Only segment packets when reaching a physical or loopback device.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83aa025f
    • Willem de Bruijn's avatar
      udp: add gso segment cmsg · 2e8de857
      Willem de Bruijn authored
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
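
      A user-space sketch of the per-call variant follows. It assumes the
      control message carries the segment size as a native 16-bit unsigned
      value at level SOL_UDP (numerically equal to IPPROTO_UDP) with type
      UDP_SEGMENT (103 in linux/udp.h); gso_cmsg() and send_segmented() are
      hypothetical helper names.

```python
import socket
import struct

# Assumption: value of UDP_SEGMENT from linux/udp.h.
UDP_SEGMENT = 103


def gso_cmsg(segment_size):
    """Build the ancillary data item carrying the segment size."""
    # "H" packs a native-endian unsigned 16-bit integer.
    return (socket.IPPROTO_UDP, UDP_SEGMENT, struct.pack("H", segment_size))


def send_segmented(sock, payload, segment_size, dest):
    """Send payload in one call, asking the stack to segment it."""
    return sock.sendmsg([payload], [gso_cmsg(segment_size)], 0, dest)


if __name__ == "__main__":
    level, ctype, data = gso_cmsg(1400)
    print(level, ctype, len(data))
```

      Compared with setsockopt, the segment size can change from one send
      call to the next without an extra system call.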
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e8de857
    • Willem de Bruijn's avatar
      udp: paged allocation with gso · 15e36f5b
      Willem de Bruijn authored
      When sending large datagrams that are later segmented, store data in
      page frags to avoid copying from linear in skb_segment.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15e36f5b
    • Willem de Bruijn's avatar
      udp: better wmem accounting on gso · ad405857
      Willem de Bruijn authored
      skb_segment by default transfers allocated wmem from the gso skb
      to the tail of the segment list. This underreports real truesize
      of the list, especially if the tail might be dropped.
      
      Similar to tcp_gso_segment, update wmem_alloc with the aggregate
      list truesize and make each segment responsible for its own
      share by setting skb->destructor.
      
      Clear gso_skb->destructor prior to calling skb_segment to skip
      the default assignment to tail.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad405857
    • Willem de Bruijn's avatar
      udp: generate gso with UDP_SEGMENT · bec1f6f6
      Willem de Bruijn authored
      Support generic segmentation offload for udp datagrams. Callers can
      concatenate and send at once the payload of multiple datagrams with
      the same destination.
      
      To set segment size, the caller sets socket option UDP_SEGMENT to the
      length of each discrete payload. This value must be smaller than or
      equal to the relevant MTU.
      
      A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
      per send call basis.
      
      Total byte length may then exceed MTU. If not an exact multiple of
      segment size, the last segment will be shorter.
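
      The resulting split can be illustrated with a few lines of arithmetic.
      The function below is only a model of the payload lengths, not kernel
      code, and gso_segment_lengths() is a hypothetical name.

```python
def gso_segment_lengths(total_len, segment_size):
    """Return the payload length of each datagram after segmentation."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    full, rest = divmod(total_len, segment_size)
    # Every segment is segment_size bytes, except a shorter final one
    # when total_len is not an exact multiple.
    return [segment_size] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(gso_segment_lengths(3000, 1400))  # [1400, 1400, 200]
    print(gso_segment_lengths(2800, 1400))  # [1400, 1400]
```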
      
      The implementation adds a gso_size field to the udp socket, ip(v6)
      cmsg cookie and inet_cork structure to be able to set the value at
      setsockopt or cmsg time and to work with both lockless and corked
      paths.
      
      Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
      
          tcp tso
           3197 MB/s 54232 msg/s 54232 calls/s
               6,457,754,262      cycles
      
          tcp gso
           1765 MB/s 29939 msg/s 29939 calls/s
              11,203,021,806      cycles
      
          tcp without tso/gso *
            739 MB/s 12548 msg/s 12548 calls/s
              11,205,483,630      cycles
      
          udp
            876 MB/s 14873 msg/s 624666 calls/s
              11,205,777,429      cycles
      
          udp gso
           2139 MB/s 36282 msg/s 36282 calls/s
              11,204,374,561      cycles
      
         [*] after reverting commit 0a6b2a1d
             ("tcp: switch to GSO being always on")
      
      Measured total system cycles ('-a') for one core while pinning both
      the network receive path and benchmark process to that core:
      
        perf stat -a -C 12 -e cycles \
          ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
      
      Note the reduction in calls/s with GSO. Bytes per syscall
      increases from 1470 to 61818.
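
      These per-call figures can be reproduced from the throughput and
      calls/s columns, under the assumption that "MB/s" counts units of
      2**20 bytes; that interpretation is not stated in the benchmark
      output, but it matches the quoted numbers. bytes_per_call() is a
      hypothetical helper.

```python
# Assumption: the benchmark's "MB" is 2**20 bytes.
MIB = 1 << 20


def bytes_per_call(mb_per_s, calls_per_s):
    """Average payload bytes handed to the kernel per send call."""
    return mb_per_s * MIB // calls_per_s


if __name__ == "__main__":
    print(bytes_per_call(876, 624666))   # udp:     1470
    print(bytes_per_call(2139, 36282))   # udp gso: 61818
```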
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bec1f6f6
    • Willem de Bruijn's avatar
      udp: add udp gso · ee80d1eb
      Willem de Bruijn authored
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type SKB_GSO_UDP_L4 to differentiate it
      from UFO (SKB_GSO_UDP).
      
      IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
      registered.
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee80d1eb