1. 21 Sep, 2020 8 commits
  2. 18 Sep, 2020 9 commits
    • bpf: Use hlist_add_head_rcu when linking to local_storage · 70b97111
      Martin KaFai Lau authored
      The local_storage->list will be traversed by rcu reader in parallel.
      Thus, hlist_add_head_rcu() is needed in bpf_selem_link_storage_nolock().
      This patch fixes it.
      
      This part of the code has recently been refactored in bpf-next
      and this patch makes changes to the new file "bpf_local_storage.c".
      Instead of using the original offending commit in the Fixes tag,
      the commit that created the file "bpf_local_storage.c" is used.
      
      A separate fix has been provided to the bpf tree.
      
      Fixes: 450af8d0 ("bpf: Split bpf_local_storage to bpf_sk_storage")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200916204453.2003915-1-kafai@fb.com
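The publish-after-initialise ordering that the commit above relies on can be sketched in userspace with C11 atomics. This is only an analogy: the `node` and `list` types and both function names are hypothetical, and a release store stands in for what the kernel helper does when it publishes the new head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a kernel hlist node. */
struct node {
    int value;
    struct node *next;
};

/* List head that lock-free readers may load concurrently. */
struct list {
    _Atomic(struct node *) first;
};

/*
 * Insert at the head. The caller is assumed to hold the writer-side lock.
 * The node is fully initialised first; the release store then publishes
 * it, so a reader that sees the new head also sees its fields.
 */
static void list_add_head_publish(struct list *l, struct node *n, int value)
{
    assert(l && n);
    n->value = value;
    n->next = atomic_load_explicit(&l->first, memory_order_relaxed);
    atomic_store_explicit(&l->first, n, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static int list_sum(struct list *l)
{
    int sum = 0;
    struct node *n = atomic_load_explicit(&l->first, memory_order_acquire);

    for (; n; n = n->next)
        sum += n->value;
    return sum;
}
```

A plain (non-atomic) store to `first` would give readers no such guarantee, which is the class of problem the commit addresses.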
    • samples/bpf: Fix test_map_in_map on s390 · f55f4c34
      Ilya Leoshkevich authored
      s390 uses socketcall multiplexer instead of individual socket syscalls.
      Therefore, "kprobe/" SYSCALL(sys_connect) does not trigger and
      test_map_in_map fails. Fix by using "kprobe/__sys_connect" instead.
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200915115519.3769807-1-iii@linux.ibm.com
    • selftests/bpf: Fix endianness issue in test_sockopt_sk · fec47bbc
      Ilya Leoshkevich authored
      getsetsockopt() calls getsockopt() with optlen == 1, but then checks
      the resulting int. It is ok on little endian, but not on big endian.
      
      Fix by checking char instead.
      
      Fixes: 8a027dc0 ("selftests/bpf: add sockopt test that exercises sk helpers")
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200915113928.3768496-1-iii@linux.ibm.com
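A minimal sketch of why checking an int after a one-byte read depends on byte order. `fake_getsockopt_1byte()` is a hypothetical stand-in for a getsockopt() call with optlen == 1; the other helper names are illustrative as well.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in: fills exactly one byte of the caller's buffer. */
static void fake_getsockopt_1byte(void *buf, unsigned char val)
{
    assert(buf);
    memcpy(buf, &val, 1);
}

/* Portable check: compare only the byte that was actually written. */
static int check_as_char(unsigned char expected)
{
    union { char c[sizeof(int)]; int i; } buf = { .i = 0 };

    fake_getsockopt_1byte(&buf, expected);
    return (unsigned char)buf.c[0] == expected;
}

/* Fragile check: read the whole int. Passes only on little endian. */
static int check_as_int(unsigned char expected)
{
    int buf = 0;

    fake_getsockopt_1byte(&buf, expected);
    return buf == expected;
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
    unsigned int one = 1;
    unsigned char first;

    memcpy(&first, &one, 1);
    return first == 1;
}
```

On a big-endian host the written byte lands in the most significant position of the int, so `check_as_int()` fails while `check_as_char()` still passes.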
    • selftests/bpf: Fix endianness issue in sk_assign · b6ed6cf4
      Ilya Leoshkevich authored
      server_map's value size is 8, but the test tries to put an int there.
      This sort of works on x86 (unless followed by non-0), but hard fails on
      s390.
      
      Fix by using __s64 instead of int.
      
      Fixes: 2d7824ff ("selftests: bpf: Add test for sk_assign")
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200915113815.3768217-1-iii@linux.ibm.com
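The size mismatch described above can be modelled without any BPF map. In this sketch `read_value8()` is a hypothetical stand-in for a consumer that always copies 8 bytes from the caller, and the struct makes the "whatever follows the int" explicit instead of reading past an object.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an update that always copies 8 bytes. */
static int64_t read_value8(const void *user_ptr)
{
    int64_t v;

    assert(user_ptr);
    memcpy(&v, user_ptr, sizeof(v));
    return v;
}

/* Caller memory: a 4-byte int followed by 4 bytes of neighbouring data. */
struct int_then_neighbour {
    int32_t value;
    int32_t neighbour;
};

/* Passing a 4-byte int: the other 4 bytes come from the neighbour. */
static int64_t store_int(int32_t value, int32_t neighbour)
{
    struct int_then_neighbour mem = { value, neighbour };

    return read_value8(&mem);
}

/* Passing a full 8-byte value: the result is the value on any host. */
static int64_t store_s64(int64_t value)
{
    return read_value8(&value);
}
```

With a zero neighbour, `store_int()` happens to return the intended value on little endian only; with a non-zero neighbour it is wrong on both byte orders, which matches the "sort of works on x86" remark.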
    • selftests/bpf: Add tailcall_bpf2bpf tests · 3b037911
      Maciej Fijalkowski authored
      Add four tests to the tailcalls selftest, explicitly named
      "tailcall_bpf2bpf_X", whose purpose is to validate that the combination
      of tailcalls with bpf2bpf calls works properly.
      These tests also validate LD_ABS from subprograms.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add abnormal return checks. · 09b28d76
      Alexei Starovoitov authored
      LD_[ABS|IND] instructions may return from the function early. The bpf_tail_call
      pseudo instruction either falls through or returns. Allow them in
      subprograms only when the subprograms are BTF annotated and have scalar return
      types. Allow ld_abs and tail_call in the main program even if it calls into
      subprograms. In the past that was not OK for ld_abs, since it was JITed
      with a special exit sequence. Since bpf_gen_ld_abs() was introduced, ld_abs
      looks like a normal exit insn from the JIT's point of view, so it is safe to
      allow it in the main program.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: allow for tailcalls in BPF subprograms for x64 JIT · e411901c
      Maciej Fijalkowski authored
      Relax the verifier's restriction that was meant to forbid tailcall usage
      when the subprog count was higher than 1.

      Also, do not max out the stack depth of a program that utilizes tailcalls.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Maciej Fijalkowski authored
      This commit serves two purposes:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within a BPF subprogram

      The two points are related, since 2) could not be achieved without 1).
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
      
      Since the layout of the stack has changed, tail call counter handling can no
      longer rely on popping it into rbx, as was done for the constant prologue
      case, and later overwriting rbx with the actual value of rbx pushed to the
      stack. Therefore, use one of the registers (%rcx) that is considered
      volatile/caller-saved and pop the value of the tail call counter into it in
      the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce a new poke target, 'tailcall_bypass', in the poke descriptor. It is
      dedicated to skipping the register pops and stack unwind that are
      generated right before the actual jump to the target program.
      When the target program is not present, the BPF program skips
      the pop instructions and the nop5 reserved for jmpq $target. An example of
      such a state, when R6 is the only callee-saved register used by the program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When target program is inserted, the jump that was there to skip
      pops/nop5 will become the nop5, so CPU will go over pops and do the
      actual tailcall.
      
      One might ask why there simply can not be pushes after the nop5?
      In the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall
      and jump target is not present. ctx is in %rbx register and BPF
      subprogram that we will call into on ffffffffc037034c is relying on it,
      e.g. it will pick ctx from there. Such code layout is therefore broken
      as we would overwrite the content of %rbx with the value that was pushed
      on the prologue. That is the reason for the 'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of
      tailcall target. In case when target program is not present, the CPU
      must not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (it leads to incorrectness). The state transitions between
      tailcall install/update/remove work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
         grace period so that other programs finish their execution; after
         that it is safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way the CPU is never exposed to the forbidden state A ("nop, unwind, nop").
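The ordering of the two pokes can be sketched as a tiny state model. Everything here is hypothetical (`struct poke_site`, the function names); the RCU grace period is only marked by a comment, and real code patches instructions rather than flipping booleans. The update step is omitted because it changes the target only and leaves the bypass alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one poke site. */
struct poke_site {
    bool bypass_is_jump;   /* true: "skip" over the unwind; false: nop */
    bool target_present;   /* true: "tail" (jump to target); false: nop */
};

/* State A: the unwind runs but no tailcall follows it. */
static bool is_forbidden(const struct poke_site *s)
{
    return !s->bypass_is_jump && !s->target_present;
}

/* Install: poke the target first (C -> D), then drop the skip (D -> B). */
static void install(struct poke_site *s)
{
    s->target_present = true;
    assert(!is_forbidden(s));
    s->bypass_is_jump = false;
    assert(!is_forbidden(s));
}

/* Remove: restore the skip first, drop the target only afterwards. */
static void remove_target(struct poke_site *s)
{
    s->bypass_is_jump = true;
    assert(!is_forbidden(s));
    /* An RCU grace period would elapse here in the real code. */
    s->target_present = false;
    assert(!is_forbidden(s));
}
```

Swapping the two stores in either function would make the site pass through state A, which is what the asserts guard against.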
      
      Last but not least, when tailcalls are mixed with bpf2bpf calls, clearing
      the tailcall counter could lead to an endless loop. Consider, for example,
      a subprogram-based variant of the tailcall3 program from the BPF selftests,
      in which the tailcall is placed within a BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
      
      To address this, the idea is to go through the call chain of bpf2bpf progs
      and look for a tailcall presence throughout whole chain. If we saw a single
      tail call then each node in this call chain needs to be marked as a subprog
      that can reach the tailcall. We would later feed the JIT with this info
      and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
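The marking step can be illustrated with a simplified call-graph walk. This is a sketch, not the verifier's code: the `subprog` layout and `mark_reachable()` are hypothetical, and the call graph is assumed to be acyclic.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SUBPROGS 8
#define MAX_CALLEES  4

/* Hypothetical per-subprogram record. */
struct subprog {
    bool has_tail_call;        /* a tailcall appears in its own insns */
    bool tail_call_reachable;  /* set by the walk below */
    int ncallees;
    int callees[MAX_CALLEES];
};

/*
 * Depth-first walk over the bpf2bpf call graph. A subprogram is marked
 * when it contains a tailcall itself or when any of its (transitive)
 * callees does. Every callee is visited so that each gets its own mark.
 */
static bool mark_reachable(struct subprog *sp, int idx)
{
    bool reachable;

    assert(idx >= 0 && idx < MAX_SUBPROGS);
    reachable = sp[idx].has_tail_call;
    for (int i = 0; i < sp[idx].ncallees; i++)
        if (mark_reachable(sp, sp[idx].callees[i]))
            reachable = true;
    sp[idx].tail_call_reachable = reachable;
    return reachable;
}
```

In this model a JIT would consult `tail_call_reachable` to decide whether a subprogram has to carry the counter even though it contains no tailcall of its own.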
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Limit caller's stack depth 256 for subprogs with tailcalls · 7f6e4312
      Maciej Fijalkowski authored
      Protect against a potential stack overflow that might happen when bpf2bpf
      calls are combined with tailcalls. Limit the caller's stack depth in
      that case to 256 so that the worst-case scenario results in an 8k
      stack size (the tailcall limit of 32 * 256 = 8k).
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
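The worst-case figure quoted above is plain multiplication; the sketch below spells it out. The constant and function names are made up for illustration, and only the two numbers come from the commit message.

```c
#include <assert.h>

enum {
    TAIL_CALL_LIMIT  = 32,   /* tailcall limit quoted in the commit message */
    CALLER_STACK_CAP = 256,  /* per-caller cap introduced by the commit */
};

/* Worst-case stack use in bytes if every tailcall leaves one frame behind. */
static unsigned int worst_case_stack(unsigned int frames,
                                     unsigned int bytes_per_frame)
{
    /* Guard against overflow of the multiplication. */
    assert(bytes_per_frame == 0 || frames <= ~0u / bytes_per_frame);
    return frames * bytes_per_frame;
}
```

`worst_case_stack(TAIL_CALL_LIMIT, CALLER_STACK_CAP)` gives 8192 bytes, the 8k mentioned in the message.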
  3. 17 Sep, 2020 3 commits
    • bpf: rename poke descriptor's 'ip' member to 'tailcall_target' · cf71b174
      Maciej Fijalkowski authored
      Reflect the actual purpose of poke->ip and rename it to
      poke->tailcall_target so that it will not be confused with another
      poke target that will be introduced in the next commit.
      
      While at it, do the same thing with poke->ip_stable - rename it to
      poke->tailcall_target_stable.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: propagate poke descriptors to subprograms · a748c697
      Maciej Fijalkowski authored
      Previously, there was no need for poke descriptors being present in
      subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
      in them. Each subprog is JITed independently so in order to enable
      JITing subprograms that use tailcalls, do the following:
      
      - in fixup_bpf_calls() store the index of tailcall insn onto the generated
        poke descriptor,
      - in case when insn patching occurs, adjust the tailcall insn idx from
        bpf_patch_insn_data,
      - then in jit_subprogs() check whether the given poke descriptor belongs
        to the current subprog by checking if that previously stored absolute
        index of tail call insn is in the scope of the insns of given subprog,
      - update the insn->imm with new poke descriptor slot so that while JITing
        the proper poke descriptor will be grabbed
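The ownership check in the steps above amounts to a range test on the stored instruction index. The sketch below uses hypothetical, trimmed-down types; the position of a descriptor in `out` plays the role of the new per-subprog slot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical descriptor: only the stored absolute insn index is kept. */
struct poke_desc {
    int insn_idx;
};

/* Instruction range of one subprogram, as the half-open span [start, end). */
struct subprog_range {
    int start;
    int end;
};

static bool poke_in_subprog(const struct poke_desc *p,
                            const struct subprog_range *r)
{
    return p->insn_idx >= r->start && p->insn_idx < r->end;
}

/* Copy the descriptors that belong to the subprog; returns how many. */
static int collect_pokes(const struct poke_desc *all, int n,
                         const struct subprog_range *r,
                         struct poke_desc *out, int out_cap)
{
    int cnt = 0;

    for (int i = 0; i < n; i++) {
        if (!poke_in_subprog(&all[i], r))
            continue;
        assert(cnt < out_cap);
        out[cnt++] = all[i];
    }
    return cnt;
}
```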
      
      This way each of the main program's poke descriptors is distributed
      across the subprograms' poke descriptor arrays, so the main program's
      descriptors can be untracked from the prog array map.

      Also add each subprog's aux struct to the BPF map poke_progs list by
      calling map_poke_track() on it.

      In case of any error, call map_poke_untrack() on the subprog aux
      structs that have already been registered with the prog array map.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x64: use %rcx instead of %rax for tail call retpolines · 0d4ddce3
      Maciej Fijalkowski authored
      Currently, %rax is used to store the jump target when BPF program is
      emitting the retpoline instructions that are handling the indirect
      tailcall.
      
      There is a plan to use %rax for a different purpose: storing the
      tail call counter. In order to preserve this value across tailcalls,
      adjust the BPF indirect tailcalls so that the target program resides
      in %rcx, and teach the retpoline instructions about the new location of
      the jump target.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 16 Sep, 2020 7 commits
    • selftests/bpf: Merge most of test_btf into test_progs · c64779e2
      Andrii Nakryiko authored
      Merge 183 tests from test_btf into test_progs framework to be exercised
      regularly. All the test_btf tests that were moved are modeled as proper
      sub-tests in test_progs framework for ease of debugging and reporting.
      
      No functional or behavioral changes were intended; I tried to preserve
      the original behavior as much as possible. E.g., `test_progs -v` will activate
      the "always_log" flag to emit the BTF validation log.
      
      The only difference is in reducing the max_entries limit for pretty-printing
      tests from (128 * 1024) to just 128 to reduce tests running time without
      reducing the coverage.
      
      Example test run:
      
        $ sudo ./test_progs -n 8
        ...
        #8 btf:OK
        Summary: 1/183 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200916004819.3767489-1-andriin@fb.com
    • Merge branch 'bpf_metadata' · ffa915f4
      Alexei Starovoitov authored
      Stanislav Fomichev says:
      
      ====================
      Currently, if a user wants to store arbitrary metadata for an eBPF
      program, for example, the program build commit hash or version, they
      could store it in a map, and conveniently libbpf uses .data section to
      populate an internal map. However, if the program does not actually
      reference the map, then the map would be de-refcounted and freed.
      
      This patch set introduces a new syscall BPF_PROG_BIND_MAP to add a map
      to a program's used_maps, even if the program instructions do not
      reference the map.
      
      libbpf is extended to always BPF_PROG_BIND_MAP .rodata section so the
      metadata is kept in place.
      bpftool is also extended to print metadata in the 'bpftool prog' list.
      
      The variable is considered metadata if it starts with the
      magic 'bpf_metadata_' prefix; everything after the prefix is the
      metadata name.
      
      An example use of this would be BPF C file declaring:
      
        volatile const char bpf_metadata_commit_hash[] SEC(".rodata") = "abcdef123456";
      
      and bpftool would emit:
      
        $ bpftool prog
        [...]
              metadata:
                      commit_hash = "abcdef123456"
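The prefix rule can be sketched with `strncmp`, which the v6 notes below mention for bpftool. `metadata_name()` is a hypothetical helper written for this illustration and is not taken from bpftool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BPF_METADATA_PREFIX     "bpf_metadata_"
#define BPF_METADATA_PREFIX_LEN (sizeof(BPF_METADATA_PREFIX) - 1)

/*
 * Returns the metadata name (the part after the prefix), or NULL when
 * the variable name does not start with the prefix.
 */
static const char *metadata_name(const char *var_name)
{
    assert(var_name);
    if (strncmp(var_name, BPF_METADATA_PREFIX, BPF_METADATA_PREFIX_LEN) != 0)
        return NULL;
    return var_name + BPF_METADATA_PREFIX_LEN;
}
```

For the variable in the example above, `metadata_name("bpf_metadata_commit_hash")` yields `"commit_hash"`. A name that merely contains the prefix somewhere in the middle is rejected, which is the practical difference from a `strstr`-based check.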
      
      v6 changes:
      * libbpf: drop FEAT_GLOBAL_DATA from probe_prog_bind_map (Andrii Nakryiko)
      * bpftool: combine find_metadata_map_id & find_metadata;
        drops extra bpf_map_get_fd_by_id and bpf_map_get_fd_by_id (Andrii Nakryiko)
      * bpftool: use strncmp instead of strstr (Andrii Nakryiko)
      * bpftool: memset(map_info) and extra empty line (Andrii Nakryiko)
      
      v5 changes:
      * selftest: verify that prog holds rodata (Andrii Nakryiko)
      * selftest: use volatile for metadata (Andrii Nakryiko)
      * bpftool: use sizeof in BPF_METADATA_PREFIX_LEN (Andrii Nakryiko)
      * bpftool: new find_metadata that does map lookup (Andrii Nakryiko)
      * libbpf: don't generalize probe_create_global_data (Andrii Nakryiko)
      * libbpf: use OPTS_VALID in bpf_prog_bind_map (Andrii Nakryiko)
      * libbpf: keep LIBBPF_0.2.0 sorted (Andrii Nakryiko)
      
      v4 changes:
      * Don't return EEXIST from syscall if already bound (Andrii Nakryiko)
      * Removed --metadata argument (Andrii Nakryiko)
      * Removed custom .metadata section (Alexei Starovoitov)
      * Addressed Andrii's suggestions about btf helpers and vsi (Andrii Nakryiko)
      * Moved bpf_prog_find_metadata into bpftool (Alexei Starovoitov)
      
      v3 changes:
      * API changes for bpf_prog_find_metadata (Toke Høiland-Jørgensen)
      
      v2 changes:
      * Made struct bpf_prog_bind_opts in libbpf so flags is optional.
      * Deduped probe_kern_global_data and probe_prog_bind_map into a common
        helper.
      * Added comment regarding why EEXIST is ignored in libbpf bind map.
      * Froze all LIBBPF_MAP_METADATA internal maps.
      * Moved bpf_prog_bind_map into new LIBBPF_0.1.1 in libbpf.map.
      * Added p_err() calls on error cases in bpftool show_prog_metadata.
      * Reverse christmas tree coding style in bpftool show_prog_metadata.
      * Made bpftool gen skeleton recognize .metadata as an internal map and
        generate datasec definition in skeleton.
      * Added C test using skeleton to assert that the metadata is what we
        expect and rebinding causes EEXIST.
      
      v1 changes:
      * Fixed a few missing unlocks, and missing close while iterating map fds.
      * Move mutex initialization to right after prog aux allocation, and mutex
        destroy to right after prog aux free.
      * s/ADD_MAP/BIND_MAP/
      * Use mutex only instead of RCU to protect the used_map array & count.
      
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      ====================
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test load and dump metadata with btftool and skel · d42d1cc4
      YiFei Zhu authored
      This is a simple test to check that loading and dumping metadata
      in bpftool works, whether or not the metadata contents are used by the
      program.
      
      A C test is also added to make sure the skeleton code can read the
      metadata values.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-6-sdf@google.com
    • bpftool: Support dumping metadata · aff52e68
      YiFei Zhu authored
      Dump metadata in the 'bpftool prog' list if it's present.
      To format the values, some BTF code is placed directly in the
      metadata dumping code. Sanity checks on the map and on the kind of the
      btf_type make sure we are actually dumping what we expect.
      
      A helper jsonw_reset is added to json writer so we can reuse the same
      json writer without having extraneous commas.
      
      Sample output:
      
        $ bpftool prog
        6: cgroup_skb  name prog  tag bcf7977d3b93787c  gpl
        [...]
        	btf_id 4
        	metadata:
        		a = "foo"
        		b = 1
      
        $ bpftool prog --json --pretty
        [{
                "id": 6,
        [...]
                "btf_id": 4,
                "metadata": {
                    "a": "foo",
                    "b": 1
                }
            }
        ]
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-5-sdf@google.com
    • libbpf: Add BPF_PROG_BIND_MAP syscall and use it on .rodata section · 5d23328d
      YiFei Zhu authored
      The patch adds a simple wrapper, bpf_prog_bind_map, around the syscall.
      When libbpf loads a program, it probes the kernel for
      support of this syscall and unconditionally binds the .rodata section
      to the program.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-4-sdf@google.com
    • bpf: Add BPF_PROG_BIND_MAP syscall · ef15314a
      YiFei Zhu authored
      This syscall binds a map to a program. Returns success if the map is
      already bound to the program.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-3-sdf@google.com
    • bpf: Mutex protect used_maps array and count · 984fe94f
      YiFei Zhu authored
      To support modifying the used_maps array, we use a mutex to protect
      the use of the counter and the array. The mutex is initialized right
      after the prog aux is allocated, and destroyed right before prog
      aux is freed. This way we guarantee it's initialized for both cBPF
      and eBPF.
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: YiFei Zhu <zhuyifei1999@gmail.com>
      Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
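A userspace model of the locking scheme, using a pthread mutex in place of the kernel mutex. It is a sketch under stated assumptions: the struct is a cut-down stand-in, map ids replace map pointers, and the "already bound counts as success" behaviour is borrowed from the BPF_PROG_BIND_MAP description earlier in this log.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_USED 8

/* Hypothetical userspace model of the per-program aux data. */
struct prog_aux {
    pthread_mutex_t used_maps_mutex;
    int used_maps[MAX_USED];   /* map ids stand in for map pointers */
    int used_map_cnt;
};

/* The mutex is initialised right after the allocation... */
static struct prog_aux *aux_alloc(void)
{
    struct prog_aux *aux = calloc(1, sizeof(*aux));

    if (!aux)
        return NULL;
    pthread_mutex_init(&aux->used_maps_mutex, NULL);
    return aux;
}

/* ...and destroyed right before the memory is freed. */
static void aux_free(struct prog_aux *aux)
{
    assert(aux);
    pthread_mutex_destroy(&aux->used_maps_mutex);
    free(aux);
}

/* Returns 0 on success or if already bound, -1 if the array is full. */
static int aux_bind_map(struct prog_aux *aux, int map_id)
{
    int ret = 0;

    pthread_mutex_lock(&aux->used_maps_mutex);
    for (int i = 0; i < aux->used_map_cnt; i++)
        if (aux->used_maps[i] == map_id)
            goto out;
    if (aux->used_map_cnt == MAX_USED) {
        ret = -1;
        goto out;
    }
    aux->used_maps[aux->used_map_cnt++] = map_id;
out:
    pthread_mutex_unlock(&aux->used_maps_mutex);
    return ret;
}
```

Both the array and the count are touched only while the mutex is held, so a concurrent binder cannot observe one updated without the other.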
  5. 15 Sep, 2020 7 commits
    • libbpf: Fix a compilation error with xsk.c for ubuntu 16.04 · d317b0a8
      Yonghong Song authored
      When syncing latest libbpf repo to bcc, ubuntu 16.04 (4.4.0 LTS kernel)
      failed compilation for xsk.c:
        In file included from /tmp/debuild.0jkauG/bcc/src/cc/libbpf/src/xsk.c:23:0:
        /tmp/debuild.0jkauG/bcc/src/cc/libbpf/src/xsk.c: In function ‘xsk_get_ctx’:
        /tmp/debuild.0jkauG/bcc/src/cc/libbpf/include/linux/list.h:81:9: warning: implicit
        declaration of function ‘container_of’ [-Wimplicit-function-declaration]
                 container_of(ptr, type, member)
                 ^
        /tmp/debuild.0jkauG/bcc/src/cc/libbpf/include/linux/list.h:83:9: note: in expansion
        of macro ‘list_entry’
                 list_entry((ptr)->next, type, member)
        ...
        src/cc/CMakeFiles/bpf-static.dir/build.make:209: recipe for target
        'src/cc/CMakeFiles/bpf-static.dir/libbpf/src/xsk.c.o' failed
      
      Commit 2f6324a3 ("libbpf: Support shared umems between queues and devices")
      added include file <linux/list.h>, which uses macro "container_of".
      xsk.c file also includes <linux/ethtool.h> before <linux/list.h>.
      
      In a more recent distro kernel, <linux/ethtool.h> includes <linux/kernel.h>,
      which contains the macro definition for "container_of", so compilation succeeds.
      But in the ubuntu 16.04 kernel, <linux/ethtool.h> does not include <linux/kernel.h>,
      which caused the above compilation error.

      Explicitly include <linux/kernel.h> in xsk.c to avoid the compilation error
      on old distros.
      
      Fixes: 2f6324a3 ("libbpf: Support shared umems between queues and devices")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200914223210.1831262-1-yhs@fb.com
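For readers unfamiliar with the macro involved, here is a self-contained sketch of `container_of` with a guarded fallback definition. This is not what the commit does (it adds the missing include); the fallback and the `ctx` struct are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback for environments whose headers do not provide container_of. */
#ifndef container_of
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#endif

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical structure embedding a list node. */
struct ctx {
    int ifindex;
    struct list_head list;
};

/* Recover the enclosing struct from a pointer to its embedded member. */
static struct ctx *ctx_from_list(struct list_head *node)
{
    assert(node);
    return container_of(node, struct ctx, list);
}
```

Without a definition in scope, the compiler treats `container_of(...)` as an implicitly declared function, which is the warning shown in the build log above.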
    • bpftool: Fix build failure · 63bea244
      Yonghong Song authored
      When building bpf selftests like
        make -C tools/testing/selftests/bpf -j20
      I hit the following errors:
        ...
        GEN      /net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-gen.8
        <stdin>:75: (WARNING/2) Block quote ends without a blank line; unexpected unindent.
        <stdin>:71: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:85: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:57: (WARNING/2) Block quote ends without a blank line; unexpected unindent.
        <stdin>:66: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:109: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:175: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        <stdin>:273: (WARNING/2) Literal block ends without a blank line; unexpected unindent.
        make[1]: *** [/net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-perf.8] Error 12
        make[1]: *** Waiting for unfinished jobs....
        make[1]: *** [/net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-iter.8] Error 12
        make[1]: *** [/net-next/tools/testing/selftests/bpf/tools/build/bpftool/Documentation/bpftool-struct_ops.8] Error 12
        ...
      
      I am using:
        -bash-4.4$ rst2man --version
        rst2man (Docutils 0.11 [repository], Python 2.7.5, on linux2)
        -bash-4.4$
      
      The Makefile generated final .rst file (e.g., bpftool-cgroup.rst) looks like
        ...
            ID       AttachType      AttachFlags     Name
        \n SEE ALSO\n========\n\t**bpf**\ (2),\n\t**bpf-helpers**\
        (7),\n\t**bpftool**\ (8),\n\t**bpftool-btf**\
        (8),\n\t**bpftool-feature**\ (8),\n\t**bpftool-gen**\
        (8),\n\t**bpftool-iter**\ (8),\n\t**bpftool-link**\
        (8),\n\t**bpftool-map**\ (8),\n\t**bpftool-net**\
        (8),\n\t**bpftool-perf**\ (8),\n\t**bpftool-prog**\
        (8),\n\t**bpftool-struct_ops**\ (8)\n
      
      The rst2man generated .8 file looks like
      Literal block ends without a blank line; unexpected unindent.
       .sp
       n SEEALSOn========nt**bpf**(2),nt**bpf\-helpers**(7),nt**bpftool**(8),nt**bpftool\-btf**(8),nt**
       bpftool\-feature**(8),nt**bpftool\-gen**(8),nt**bpftool\-iter**(8),nt**bpftool\-link**(8),nt**
       bpftool\-map**(8),nt**bpftool\-net**(8),nt**bpftool\-perf**(8),nt**bpftool\-prog**(8),nt**
       bpftool\-struct_ops**(8)n
      
      It looks like that particular version of rst2man needs an actual newline
      instead of \n.

      Since `echo -e` may not be available in some environments, let us use `printf`.
      The format string "%b" is used for `printf` to ensure all escape characters are
      interpreted properly.
      
      Fixes: 18841da9 ("tools: bpftool: Automate generation for "SEE ALSO" sections in man pages")
      Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Cc: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200914183110.999906-1-yhs@fb.com
    • xsk: Fix refcount warning in xp_dma_map · bf74a370
      Magnus Karlsson authored
      Fix a potential refcount warning that a zero value is increased to one
      in xp_dma_map, by initializing the refcount to one to start with,
      instead of zero plus a refcount_inc().
      
      Fixes: 921b6869 ("xsk: Enable sharing of dma mappings")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/1600095036-23868-1-git-send-email-magnus.karlsson@gmail.com
      bf74a370
    • samples/bpf: Add quiet option to xdpsock · 74e00676
      Magnus Karlsson authored
      Add a quiet option (-Q) that disables the statistics printouts of
      xdpsock. This is useful when measuring 0% loss rate performance,
      which suffers badly if the application calls printf.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1599726666-8431-4-git-send-email-magnus.karlsson@gmail.com
      74e00676
    • samples/bpf: Fix possible deadlock in xdpsock · 5a2a0dd8
      Magnus Karlsson authored
      Fix a possible deadlock in the l2fwd application in xdpsock that can
      occur when there is no space in the Tx ring. There are two ways to get
      the kernel to consume entries in the Tx ring: calling sendto() to make
      it send packets and freeing entries from the completion ring, as the
      kernel will not send a packet if there is no space for it to add a
      completion entry in the completion ring. The Tx loop in l2fwd previously
      only called sendto(). This patch adds cleaning of the completion ring
      to that loop.
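
      The stall can be sketched with a toy model of the two rings. Everything
      here is hypothetical (toy_xsk, toy_kick_tx, toy_clean_cq, and
      toy_reserve_tx are illustration names, not the xdpsock or libbpf API),
      and the retry limit only exists so that the stuck case terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: counts of occupied entries in each ring. */
struct toy_xsk {
    unsigned int tx_used;   /* Tx entries waiting for the kernel */
    unsigned int cq_used;   /* completion entries waiting for the application */
    unsigned int tx_size;
    unsigned int cq_size;
};

/* Models sendto(): the kernel sends only while the completion ring has room. */
static void toy_kick_tx(struct toy_xsk *x)
{
    while (x->tx_used > 0 && x->cq_used < x->cq_size) {
        x->tx_used--;
        x->cq_used++;
    }
}

/* Models the application freeing all entries from the completion ring. */
static void toy_clean_cq(struct toy_xsk *x)
{
    x->cq_used = 0;
}

/*
 * Try to reserve n Tx entries, retrying at most max_tries times.
 * With clean_cq == false this mirrors a loop that only calls sendto().
 */
static bool toy_reserve_tx(struct toy_xsk *x, unsigned int n,
                           bool clean_cq, unsigned int max_tries)
{
    for (unsigned int i = 0; i < max_tries; i++) {
        if (x->tx_size - x->tx_used >= n) {
            x->tx_used += n;
            return true;
        }
        if (clean_cq)
            toy_clean_cq(x);
        toy_kick_tx(x);
    }
    return false;   /* a loop without the retry limit would spin forever */
}
```

      Starting with both rings full, toy_reserve_tx() never succeeds unless
      clean_cq is set, which is the situation the commit message describes.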
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1599726666-8431-3-git-send-email-magnus.karlsson@gmail.com
      5a2a0dd8
    • samples/bpf: Fix one packet sending in xdpsock · 3131cf66
      Magnus Karlsson authored
      Fix the sending of a single packet (or small burst) in xdpsock when
      executing in copy mode. Currently, the l2fwd application in xdpsock
      only transmits the packets after a batch of them has been received,
      which might be confusing if you only send one packet and expect that
      it is returned pronto. Fix this by calling sendto() more often and add
      a comment in the code that states that this can be optimized if
      needed.
      Reported-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/1599726666-8431-2-git-send-email-magnus.karlsson@gmail.com
      3131cf66
    • s390/bpf: Fix multiple tail calls · d72714c1
      Ilya Leoshkevich authored
      In order to branch around tail calls (due to out-of-bounds index,
      exceeding tail call count or missing tail call target), JIT uses
      label[0] field, which contains the address of the instruction following
      the tail call. When there are multiple tail calls, label[0] value comes
      from handling of a previous tail call, which is incorrect.
      
      Fix by getting rid of label array and resolving the label address
      locally: for all 3 branches that jump to it, emit 0 offsets at the
      beginning, and then backpatch them with the correct value.
      
      Also, do not use the long jump infrastructure: the tail call sequence
      is known to be short, so make all 3 jumps short.
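
      The "emit zero offsets, then backpatch" idea can be sketched as follows.
      This is not the s390 JIT: toy_jit, toy_insn, and the helpers are
      hypothetical, offsets are counted in instructions rather than bytes,
      and the filler instructions only stand in for the real checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_INSNS 32

enum toy_op { TOY_BRANCH, TOY_OTHER };

/* Hypothetical instruction: an opcode plus a relative branch offset. */
struct toy_insn {
    enum toy_op op;
    int16_t off;    /* in instructions, relative to this instruction */
};

struct toy_jit {
    struct toy_insn insns[TOY_MAX_INSNS];
    size_t len;
};

/* Append one instruction and return its index. */
static size_t toy_emit(struct toy_jit *jit, enum toy_op op, int16_t off)
{
    assert(jit->len < TOY_MAX_INSNS);
    jit->insns[jit->len].op = op;
    jit->insns[jit->len].off = off;
    return jit->len++;
}

/*
 * Emit one tail-call sequence. Its three bail-out branches all target the
 * instruction that follows the sequence, so each is emitted with offset 0
 * and patched once that address is known. The indices are returned in
 * patch[] only so that callers can inspect the result.
 */
static void toy_emit_tail_call(struct toy_jit *jit, size_t patch[3])
{
    patch[0] = toy_emit(jit, TOY_BRANCH, 0);    /* index out of bounds */
    toy_emit(jit, TOY_OTHER, 0);
    patch[1] = toy_emit(jit, TOY_BRANCH, 0);    /* tail call count exceeded */
    toy_emit(jit, TOY_OTHER, 0);
    patch[2] = toy_emit(jit, TOY_BRANCH, 0);    /* missing tail call target */
    toy_emit(jit, TOY_OTHER, 0);                /* jump into the target */

    /* jit->len is now the index of the instruction after the sequence. */
    for (int i = 0; i < 3; i++)
        jit->insns[patch[i]].off = (int16_t)(jit->len - patch[i]);
}
```

      Because the target is resolved inside each call, a second tail call
      cannot pick up a stale address left over from the first one.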
      
      Fixes: 6651ee07 ("s390/bpf: implement bpf_tail_call() helper")
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200909232141.3099367-1-iii@linux.ibm.com
      d72714c1
  6. 11 Sep, 2020 6 commits
    • Merge branch 'improve-bpf-tcp-cc-init' · 2bab48c5
      Alexei Starovoitov authored
      Neal Cardwell says:
      
      ====================
      This patch series reorganizes TCP congestion control initialization so that if
      EBPF code called by tcp_init_transfer() sets the congestion control algorithm
      by calling setsockopt(TCP_CONGESTION) then the TCP stack initializes the
      congestion control module immediately, instead of having tcp_init_transfer()
      later initialize the congestion control module.
      
      This increases flexibility for the EBPF code that runs at connection
      establishment time, and simplifies the code.
      
      This has the following benefits:
      
      (1) This allows CC module customizations made by the EBPF called in
          tcp_init_transfer() to persist, and not be wiped out by a later
          call to tcp_init_congestion_control() in tcp_init_transfer().
      
      (2) Does not flip the order of EBPF and CC init, to avoid causing bugs
          for existing code upstream that depends on the current order.
      
      (3) Does not cause 2 initializations for CC in the case where the
          EBPF called in tcp_init_transfer() wants to set the CC to a new CC
          algorithm.
      
      (4) Allows follow-on simplifications to the code in net/core/filter.c
          and net/ipv4/tcp_cong.c, which currently both have some complexity
          to special-case CC initialization to avoid double CC
          initialization if EBPF sets the CC.
      
      changes in v2:
      
      o rebase onto bpf-next
      
      o add another follow-on simplification suggested by Martin KaFai Lau:
         "tcp: simplify tcp_set_congestion_control() load=false case"
      
      changes in v3:
      
      o no change in commits
      
      o resent patch series from @gmail.com, since mail from ncardwell@google.com
        stopped being accepted at netdev@vger.kernel.org mid-way through processing
        the v2 patch series (between patches 2 and 3), confusing patchwork about
        which patches belonged to the v2 patch series
      ====================
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2bab48c5
    • tcp: Simplify tcp_set_congestion_control() load=false case · 5050bef8
      Neal Cardwell authored
      Simplify tcp_set_congestion_control() by removing the initialization
      code path for the !load case.
      
      There are only two call sites for tcp_set_congestion_control(). The
      EBPF call site is the only one that passes load=false; it also passes
      cap_net_admin=true. Because of that, the exact same behavior can be
      achieved by removing the special if (!load) branch of the logic. Both
      before and after this commit, the EBPF case will call
      bpf_try_module_get(), and if that succeeds then call
      tcp_reinit_congestion_control() or if that fails then return EBUSY.
      
      Note that this returns the logic to a structure very similar to the
      structure before:
        commit 91b5b21c ("bpf: Add support for changing congestion control")
      except that the CAP_NET_ADMIN status is passed in as a function
      argument.
      
      This clean-up was suggested by Martin KaFai Lau.
      Suggested-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Kevin Yang <yyd@google.com>
      5050bef8
    • tcp: simplify _bpf_setsockopt(): Remove flags argument · 5cdc744c
      Neal Cardwell authored
      Now that the previous patches have removed the code that uses the
      flags argument to _bpf_setsockopt(), we can remove that argument.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      5cdc744c
    • tcp: simplify tcp_set_congestion_control(): Always reinitialize · 29a94932
      Neal Cardwell authored
      Now that the previous patches ensure that all call sites for
      tcp_set_congestion_control() want to initialize congestion control, we
      can simplify tcp_set_congestion_control() by removing the reinit
      argument and the code to support it.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      29a94932
    • tcp: Simplify EBPF TCP_CONGESTION to always init CC · e7b10a4d
      Neal Cardwell authored
      Now that the previous patch ensures we don't initialize the congestion
      control twice, when EBPF sets the congestion control algorithm at
      connection establishment we can simplify the code by simply
      initializing the congestion control module at that time.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      e7b10a4d
    • tcp: Only init congestion control if not initialized already · 8919a9b3
      Neal Cardwell authored
      Change tcp_init_transfer() to only initialize congestion control if it
      has not been initialized already.
      
      With this new approach, we can arrange things so that if the EBPF code
      sets the congestion control by calling setsockopt(TCP_CONGESTION) then
      tcp_init_transfer() will not re-initialize the CC module.
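
      The intended behavior can be sketched with a toy model. The structure,
      the ca_initialized flag, and all toy_* functions are hypothetical names
      chosen for illustration and do not reproduce the kernel's data
      structures; the model assumes that setting the CC from EBPF initializes
      it right away, as the series describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-socket state for the sketch. */
struct toy_sock {
    bool ca_initialized;    /* has the CC module been initialized? */
    int ca_init_calls;      /* how many times initialization ran */
    int ca_custom;          /* stands in for state customized by EBPF */
};

/* Initialization resets CC state, so running it twice loses customizations. */
static void toy_init_cc(struct toy_sock *sk)
{
    sk->ca_custom = 0;
    sk->ca_init_calls++;
    sk->ca_initialized = true;
}

/* Stand-in for EBPF code setting the CC and then customizing it. */
static void toy_bpf_set_cc(struct toy_sock *sk, int custom)
{
    toy_init_cc(sk);
    sk->ca_custom = custom;
}

/* Stand-in for tcp_init_transfer(): initialize only if nobody did already. */
static void toy_init_transfer(struct toy_sock *sk, bool bpf_sets_cc)
{
    if (bpf_sets_cc)
        toy_bpf_set_cc(sk, 42);
    if (!sk->ca_initialized)
        toy_init_cc(sk);
}
```

      In this model the CC is initialized exactly once on both paths, and the
      value set by the EBPF stand-in survives the rest of toy_init_transfer().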
      
      This is an approach that has the following beneficial properties:
      
      (1) This allows CC module customizations made by the EBPF called in
          tcp_init_transfer() to persist, and not be wiped out by a later
          call to tcp_init_congestion_control() in tcp_init_transfer().
      
      (2) Does not flip the order of EBPF and CC init, to avoid causing bugs
          for existing code upstream that depends on the current order.
      
      (3) Does not cause 2 initializations for CC in the case where the
          EBPF called in tcp_init_transfer() wants to set the CC to a new CC
          algorithm.
      
      (4) Allows follow-on simplifications to the code in net/core/filter.c
          and net/ipv4/tcp_cong.c, which currently both have some complexity
          to special-case CC initialization to avoid double CC
          initialization if EBPF sets the CC.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Kevin Yang <yyd@google.com>
      Cc: Lawrence Brakmo <brakmo@fb.com>
      8919a9b3