1. 27 Apr, 2019 8 commits
    • Merge branch 'writeable-bpf-tracepoints' · 3745dc24
      Alexei Starovoitov authored
      Matt Mullins says:
      
      ====================
      This adds an opt-in interface for tracepoints to expose a writable context to
      BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE programs that are attached, while
      supporting read-only access from existing BPF_PROG_TYPE_RAW_TRACEPOINT
      programs, as well as from non-BPF-based tracepoints.
      
      The initial motivation is to support tracing that can be observed from the
      remote end of an NBD socket, e.g. by adding flags to the struct nbd_request
      header.  Earlier attempts included adding an NBD-specific tracepoint fd, but in
      code review, I was recommended to implement it more generically -- as a result,
      this patchset is far simpler than my initial try.
      
      v4->v5:
        * rebased onto bpf-next/master and fixed merge conflicts
        * "tools: sync bpf.h" also syncs comments that have previously changed
          in bpf-next
      
      v3->v4:
        * fixed a silly copy/paste typo in include/trace/events/bpf_test_run.h
          (_TRACE_NBD_H -> _TRACE_BPF_TEST_RUN_H)
        * fixed incorrect/misleading wording in patch 1's commit message,
          since the pointer cannot be directly dereferenced in a
          BPF_PROG_TYPE_RAW_TRACEPOINT
        * cleaned up the error message wording if the prog_tests fail
        * Addressed feedback from Yonghong
          * reject non-pointer-sized accesses to the buffer pointer
          * use sizeof(struct nbd_request) as one-byte-past-the-end in
            raw_tp_writable_reject_nbd_invalid.c
          * use BPF_MOV64_IMM instead of BPF_LD_IMM64
      
      v2->v3:
        * Andrew addressed Josef's comments:
          * C-style commenting in nbd.c
          * Collapsed identical events into a single DECLARE_EVENT_CLASS.
            This saves about 2kB of kernel text
      
      v1->v2:
        * add selftests
          * sync tools/include/uapi/linux/bpf.h
        * reject variable offset into the buffer
        * add string representation of PTR_TO_TP_BUFFER to reg_type_str
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests: bpf: test writable buffers in raw tps · e950e843
      Matt Mullins authored
      This tests that:
        * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it
          uses either:
          * a variable offset to the tracepoint buffer, or
          * an offset beyond the size of the tracepoint buffer
        * a tracer can modify the buffer provided when attached to a writable
          tracepoint in bpf_prog_test_run
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tools: sync bpf.h · 4635b0ae
      Matt Mullins authored
      This adds BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, and fixes up the
      
      	error: enumeration value ‘BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE’ not handled in switch [-Werror=switch-enum]
      
      build errors it would otherwise cause in libbpf.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • nbd: add tracepoints for send/receive timing · 2abd2de7
      Andrew Hall authored
      This adds four tracepoints to nbd, enabling separate tracing of payload
      and header sending/receipt.
      
      In the send path for headers that have already been sent, we also
      explicitly initialize the handle so it can be referenced by the later
      tracepoint.
      Signed-off-by: Andrew Hall <hall@fb.com>
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • nbd: trace sending nbd requests · ea106722
      Matt Mullins authored
      This adds a tracepoint that can both observe the nbd request being sent
      to the server, as well as modify that request, e.g., setting a flag in
      the request that will cause the server to collect detailed tracing data.
      
      The struct request * being handled is included to permit correlation
      with the block tracepoints.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add writable context for raw tracepoints · 9df1c28b
      Matt Mullins authored
      This is an opt-in interface that allows a tracepoint to provide a safe
      buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
      The size of the buffer must be a compile-time constant, and is checked
      before allowing a BPF program to attach to a tracepoint that uses this
      feature.
      
      The pointer to this buffer will be the first argument of tracepoints
      that opt in; the pointer is valid and can be bpf_probe_read() by both
      BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
      programs that attach to such a tracepoint, but the buffer to which it
      points may only be written by the latter.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
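The bounds rule described for writable tracepoint buffers (constant offsets only, and every access inside a buffer whose size is a compile-time constant) can be illustrated with a small userspace model. This is only a sketch: `struct tp_access` and `tp_buffer_access_ok()` are hypothetical names invented here, and the code is not the kernel verifier's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one access a program makes to the buffer. */
struct tp_access {
	bool   var_off; /* true if the offset is not a load-time constant */
	size_t off;     /* constant byte offset into the buffer */
	size_t size;    /* width of the access in bytes */
};

/*
 * Return true if the access uses a constant offset and lies entirely
 * inside a buffer of writable_size bytes. Written to avoid overflow in
 * off + size.
 */
static bool tp_buffer_access_ok(const struct tp_access *a, size_t writable_size)
{
	assert(a != NULL);

	if (a->var_off)
		return false; /* variable offsets are rejected outright */
	if (a->size == 0 || a->size > writable_size)
		return false;
	return a->off <= writable_size - a->size;
}
```

With a hypothetical 28-byte buffer, a one-byte access at offset 27 passes, while a one-byte access at offset 28 (one byte past the end) and any variable-offset access are refused.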
    • bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd · 34b8ab09
      Daniel Borkmann authored
      Since the ARMv8.1 supplement introduced LSE atomic instructions back in 2016,
      let's add support for STADD and use that in favor of the LDXR / STXR loop for
      the XADD mapping if available. STADD is encoded as an alias for LDADD with
      XZR as the destination register, therefore add LDADD to the instruction
      encoder along with STADD as special case and use it in the JIT for CPUs
      that advertise LSE atomics in CPUID register. If immediate offset in the
      BPF XADD insn is 0, then use dst register directly instead of temporary
      one.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm64: remove prefetch insn in xadd mapping · 8968c67a
      Daniel Borkmann authored
      Prefetch-with-intent-to-write is currently part of the XADD mapping in
      the AArch64 JIT and follows the kernel's implementation of atomic_add.
      This may interfere with other threads executing the LDXR/STXR loop,
      leading to potential starvation and fairness issues. Drop the optional
      prefetch instruction.
      
      Fixes: 85f68fe8 ("bpf, arm64: implement jiting of BPF_XADD")
      Reported-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 26 Apr, 2019 6 commits
    • Merge branch 'btf-dump' · 0c0cad2c
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      This patch set adds a new `bpftool btf dump` sub-command, which dumps
      BTF contents (only types for now). Currently it only outputs low-level
      content, almost 1:1 with binary BTF format, but follow-up patches will add
      the ability to dump BTF types as a compilable C header file. JSON output is
      supported as well.
      
      Patch #1 adds `btf` sub-command, dumping BTF types in human-readable format.
      It also implements reading .BTF data from ELF file.
      Patch #2 adds minimal documentation with output format examples and different
      ways to specify source of BTF data.
      Patch #3 adds support for btf command in bash-completion/bpftool script.
      Patch #4 fixes minor indentation issue in bash-completion script.
      
      Output format is mostly following existing format of BPF verifier log, but
      deviates from it in a few places. More details are in commit message for patch 1.

      Examples of output for all supported BTF kinds are in patch #2 as part of
      documentation. Some field names are quite verbose and I'd rather shorten them,
      if we don't feel like being very close to BPF verifier names is a necessity,
      but in this patch I left them exactly the same as in verifier log.
      
      v3->v4:
        - reverse Christmas tree (Quentin)
        - better docs (Quentin)
      
      v2->v3:
        - make map's key|value|kv|all suggestion more precise (Quentin)
        - fix default case indentations (Quentin)
      
      v1->v2:
        - fix unnecessary trailing whitespaces in bpftool-btf.rst (Yonghong)
        - add btf in main.c for a list of possible OBJECTs
        - handle unknown keyword under `bpftool btf dump` (Yonghong)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: fix indendation in bash-completion/bpftool · 8ed1875b
      Andrii Nakryiko authored
      Fix misaligned default case branch for `prog dump` sub-command.
      Reported-by: Quentin Monnet <quentin.monnet@netronome.com>
      Cc: Yonghong Song <yhs@fb.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: add bash completions for btf command · 4a714fee
      Andrii Nakryiko authored
      Add full support for btf command in bash-completion script.
      
      Cc: Quentin Monnet <quentin.monnet@netronome.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool/docs: add btf sub-command documentation · ca253339
      Andrii Nakryiko authored
      Document usage and sample output format for `btf dump` sub-command.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: add ability to dump BTF types · c93cc690
      Andrii Nakryiko authored
      Add new `btf dump` sub-command to bpftool. It dumps a human-readable,
      low-level representation of BTF types. BTF can be retrieved from a few
      different sources:
        - from BTF object by ID;
        - from PROG, if it has associated BTF;
        - from MAP, if it has associated BTF data; it's possible to narrow
          down types to either key type, value type, both, or all BTF types;
        - from ELF file (.BTF section).
      
      Output format mostly follows BPF verifier log format with a few notable
      exceptions:
        - all the type/field/param/etc names are enclosed in single quotes to
          allow easier grepping and to stand out a little bit more;
        - FUNC_PROTO output follows STRUCT/UNION/ENUM format of having one
          line per each argument; this is more uniform and allows easy
          grepping, as opposed to succinct, but inconvenient format that BPF
          verifier log is using.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: Fix errno variable usage · 77d76426
      Benjamin Poirier authored
      The test meant to use the saved value of errno. Given the current code, it
      makes no practical difference however.
      
      Fixes: bf598a8f ("bpftool: Improve handling of ENOENT on map dumps")
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
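The general pattern behind this kind of fix is to copy errno into a local variable immediately after the failing call and to test the copy, since any later library call may overwrite errno. The sketch below is illustrative only: `failing_lookup()`, `noisy_cleanup()` and `lookup_hit_enoent()` are hypothetical stand-ins, not bpftool functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical lookup that fails with ENOENT. */
static int failing_lookup(void)
{
	errno = ENOENT;
	return -1;
}

/* Hypothetical cleanup step that happens to clobber errno. */
static void noisy_cleanup(void)
{
	errno = EINVAL;
}

/* Return true if the lookup failed specifically with ENOENT. */
static bool lookup_hit_enoent(void)
{
	int saved_errno;

	if (failing_lookup() == 0)
		return false;

	saved_errno = errno;     /* capture before anything else runs */
	noisy_cleanup();
	assert(errno != ENOENT); /* the live errno no longer tells the truth */

	return saved_errno == ENOENT; /* test the saved copy, not errno */
}
```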
  3. 25 Apr, 2019 7 commits
    • bpftool: show flow_dissector attachment status · 7f0c57fe
      Stanislav Fomichev authored
      Right now there is no way to query whether a BPF flow_dissector program
      is attached to a network namespace or not. The previous commit added
      support for querying that info; show it when doing `bpftool net`:
      
      $ bpftool prog loadall ./bpf_flow.o \
      	/sys/fs/bpf/flow type flow_dissector \
      	pinmaps /sys/fs/bpf/flow
      $ bpftool prog
      3: flow_dissector  name _dissect  tag 8c9e917b513dd5cc  gpl
              loaded_at 2019-04-23T16:14:48-0700  uid 0
              xlated 656B  jited 461B  memlock 4096B  map_ids 1,2
              btf_id 1
      ...
      
      $ bpftool net -j
      [{"xdp":[],"tc":[],"flow_dissector":[]}]
      
      $ bpftool prog attach pinned \
      	/sys/fs/bpf/flow/flow_dissector flow_dissector
      $ bpftool net -j
      [{"xdp":[],"tc":[],"flow_dissector":["id":3]}]
      
      Doesn't show up in a different net namespace:
      $ ip netns add test
      $ ip netns exec test bpftool net -j
      [{"xdp":[],"tc":[],"flow_dissector":[]}]
      
      Non-json output:
      $ bpftool net
      xdp:
      
      tc:
      
      flow_dissector:
      id 3
      
      v2:
      * initialization order (Jakub Kicinski)
      * clear errno for batch mode (Quentin Monnet)
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_type · 118c8e9a
      Stanislav Fomichev authored
      target_fd is target namespace. If there is a flow dissector BPF program
      attached to that namespace, its (single) id is returned.
      
      v5:
      * drop net ref right after rcu unlock (Daniel Borkmann)
      
      v4:
      * add missing put_net (Jann Horn)
      
      v3:
      * add missing inline to skb_flow_dissector_prog_query static def
        (kbuild test robot <lkp@intel.com>)
      
      v2:
      * don't sleep in rcu critical section (Jakub Kicinski)
      * check input prog_cnt (exit early)
      
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples: bpf: add hbm sample to .gitignore · ead442a0
      Daniel T. Lee authored
      This commit adds hbm, which is currently
      omitted from the ignore file, to .gitignore.
      Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: fix samples/bpf build failure due to undefined UINT32_MAX · 32e621e5
      Daniel T. Lee authored
      Currently, building bpf samples will cause the following error.
      
          ./tools/lib/bpf/bpf.h:132:27: error: 'UINT32_MAX' undeclared here (not in a function) ..
           #define BPF_LOG_BUF_SIZE (UINT32_MAX >> 8) /* verifier maximum in kernels <= 5.1 */
                                     ^
          ./samples/bpf/bpf_load.h:31:25: note: in expansion of macro 'BPF_LOG_BUF_SIZE'
           extern char bpf_log_buf[BPF_LOG_BUF_SIZE];
                                   ^~~~~~~~~~~~~~~~
      
      Due to commit 4519efa6 ("libbpf: fix BPF_LOG_BUF_SIZE off-by-one error")
      hard-coded size of BPF_LOG_BUF_SIZE has been replaced with UINT32_MAX which is
      defined in <stdint.h> header.
      
      Even with this change, bpf selftests are running fine since these are built
      with clang and it includes header(-idirafter) from clang/6.0.0/include.
      (it has <stdint.h>)
      
          clang -I. -I./include/uapi -I../../../include/uapi -idirafter /usr/local/include -idirafter /usr/include \
          -idirafter /usr/lib/llvm-6.0/lib/clang/6.0.0/include -idirafter /usr/include/x86_64-linux-gnu \
          -Wno-compare-distinct-pointer-types -O2 -target bpf -emit-llvm -c progs/test_sysctl_prog.c -o - | \
          llc -march=bpf -mcpu=generic  -filetype=obj -o /linux/tools/testing/selftests/bpf/test_sysctl_prog.o
      
      But bpf samples are compiled with GCC, and it only searches and includes
      headers declared at the target file. As '#include <stdint.h>' hasn't been
      declared in tools/lib/bpf/bpf.h, it causes build failure of bpf samples.
      
          gcc -Wp,-MD,./samples/bpf/.sockex3_user.o.d -Wall -Wmissing-prototypes -Wstrict-prototypes \
          -O2 -fomit-frame-pointer -std=gnu89 -I./usr/include -I./tools/lib/ -I./tools/testing/selftests/bpf/ \
          -I./tools/  lib/ -I./tools/include -I./tools/perf -c -o ./samples/bpf/sockex3_user.o ./samples/bpf/sockex3_user.c;
      
      This commit adds '#include <stdint.h>' to tools/lib/bpf/bpf.h
      to fix this problem.
      Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
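The underlying rule is that a header using UINT32_MAX should include <stdint.h> itself instead of relying on whatever the including file happens to pull in. A minimal sketch follows; `EXAMPLE_LOG_BUF_SIZE` and `example_log_buf_size()` are hypothetical names that mirror the shape of the macro quoted in the error message, not the libbpf definition.

```c
#include <assert.h>
#include <stdint.h> /* provides UINT32_MAX; without it the macro below fails to expand */

/* Same shape as the macro in the build error: the top 24 bits of a u32. */
#define EXAMPLE_LOG_BUF_SIZE (UINT32_MAX >> 8)

/* Compile-time check that the expression has the expected value. */
_Static_assert(EXAMPLE_LOG_BUF_SIZE == 0xFFFFFF,
	       "UINT32_MAX >> 8 should be 0xFFFFFF");

/* Small accessor so the value can also be inspected at run time. */
static uint32_t example_log_buf_size(void)
{
	return EXAMPLE_LOG_BUF_SIZE;
}
```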
    • Merge branch 'libbpf-fixes' · 0e33d334
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      Two small fixes in relation to global data handling. Thanks!
      ====================
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, libbpf: fix segfault in bpf_object__init_maps' pr_debug statement · 4f8827d2
      Daniel Borkmann authored
      Ran into it while testing; in bpf_object__init_maps() data can be NULL
      in the case where no map section is present. Therefore we simply cannot
      access data->d_size before NULL test. Move the pr_debug() where it's
      safe to access.
      
      Fixes: d859900c ("bpf, libbpf: support global data/bss/rodata sections")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
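The shape of the fix (test the pointer first, then log from it) can be sketched as below. `struct section_data` and `init_maps()` are hypothetical stand-ins for illustration, and the return values are assumptions made for this example; this is not the libbpf code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the descriptor of an ELF section's data. */
struct section_data {
	size_t d_size;
};

/*
 * Return 0 when there is no map section to process, 1 otherwise.
 * The debug message dereferences data, so it must come after the
 * NULL test; placing it before the test is the crash being fixed.
 */
static int init_maps(const struct section_data *data)
{
	if (data == NULL)
		return 0;

	fprintf(stderr, "maps section: %zu bytes\n", data->d_size);
	return 1;
}
```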
    • bpf, libbpf: handle old kernels more graceful wrt global data sections · 8837fe5d
      Daniel Borkmann authored
      Andrii reported a corner case where e.g. global static data is present
      in the BPF ELF file in form of .data/.bss/.rodata section, but without
      any relocations to it. Such programs could be loaded before commit
      d859900c ("bpf, libbpf: support global data/bss/rodata sections"),
      whereas afterwards if kernel lacks support then loading would fail.
      
      Add a probing mechanism which skips setting up libbpf internal maps
      in case of missing kernel support. In presence of relocation entries,
      we abort the load attempt.
      
      Fixes: d859900c ("bpf, libbpf: support global data/bss/rodata sections")
      Reported-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 23 Apr, 2019 19 commits
    • Merge branch 'bpf-proto-fixes' · a21b48a2
      Daniel Borkmann authored
      Willem de Bruijn says:
      
      ====================
      Expand the tc tunnel encap support with protocols that convert the
      network layer protocol, such as 6in4. This is analogous to existing
      support in bpf_skb_proto_6_to_4.
      
      Patch 1 implements the straightforward logic
      Patch 2 tests it with a 6in4 tunnel
      
      Changes v1->v2
        - improve documentation in test
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: expand test_tc_tunnel with SIT encap · f6ad6acc
      Willem de Bruijn authored
      So far, all BPF tc tunnel testcases encapsulate in the same network
      protocol. Add an encap testcase that requires updating skb->protocol.
      
      The 6in4 tunnel encapsulates an IPv6 packet inside an IPv4 tunnel.
      Verify that bpf_skb_net_grow correctly updates skb->protocol to
      select the right protocol handler in __netif_receive_skb_core.
      
      The BPF program should also manually update the link layer header to
      encode the right network protocol.
      
      Changes v1->v2
        - improve documentation of non-obvious logic
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Tested-by: Alan Maguire <alan.maguire@oracle.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: update skb->protocol in bpf_skb_net_grow · 1b00e0df
      Willem de Bruijn authored
      Some tunnels, like sit, change the network protocol of the packet.
      If so, update skb->protocol to match the new type.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
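To illustrate why the protocol field has to change for a tunnel such as 6in4, the userspace sketch below picks the EtherType that matches the new outer IP header. The function and macro names are hypothetical, values are in host byte order for simplicity, and the logic is a model of the idea only, not the kernel's bpf_skb_net_grow.

```c
#include <assert.h>
#include <stdint.h>

/* Standard EtherType values for IPv4 and IPv6, in host byte order. */
#define EXAMPLE_ETH_P_IP   0x0800u
#define EXAMPLE_ETH_P_IPV6 0x86DDu

/*
 * Given the packet's current EtherType and the IP version of the outer
 * header just pushed in front of it (4 or 6), return the EtherType the
 * packet should carry afterwards. Unknown versions leave it unchanged.
 */
static uint16_t ethertype_after_encap(uint16_t current, unsigned int outer_ip_version)
{
	switch (outer_ip_version) {
	case 4:
		return EXAMPLE_ETH_P_IP;
	case 6:
		return EXAMPLE_ETH_P_IPV6;
	default:
		return current;
	}
}
```

For 6in4, an IPv6 packet (0x86DD) gains an outer IPv4 header, so the result is 0x0800; when inner and outer families match, the value is unchanged.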
    • Merge branch 'bpf-eth-get-headlen' · 2aad3261
      Daniel Borkmann authored
      Stanislav Fomichev says:
      
      ====================
      Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
      skb. Because we use passed skb to lookup associated networking namespace
      to find whether we have a BPF program attached or not, we always use
      C-based flow dissector in this case.
      
      The goal of this patch series is to add new networking namespace argument
      to the eth_get_headlen and make BPF flow dissector programs be able to
      work in the skb-less case.
      
      The series goes like this:
      * use new kernel context (struct bpf_flow_dissector) for flow dissector
        programs; this makes it easy to distinguish between skb and no-skb
        case and supports calling BPF flow dissector on a chunk of raw data
      * convert BPF_PROG_TEST_RUN to use raw data
      * plumb network namespace into __skb_flow_dissect from all callers
      * handle no-skb case in __skb_flow_dissect
      * update eth_get_headlen to include net namespace argument and
        convert all existing users
      * add selftest to make sure bpf_skb_load_bytes is not allowed in
        the no-skb mode
      * extend test_progs to exercise skb-less flow dissection as well
      * stop adjusting nhoff/thoff by ETH_HLEN in BPF_PROG_TEST_RUN
      
      v6:
      * more suggestions by Alexei:
        * eth_get_headlen now takes net dev, not net namespace
        * test skb-less case via tun eth_get_headlen
      * fix return errors in bpf_flow_load
      * don't adjust nhoff/thoff by ETH_HLEN
      
      v5:
      * API changes have been submitted via bpf/stable tree
      
      v4:
      * prohibit access to vlan fields as well (otherwise, inconsistent
        between skb/skb-less cases)
      * drop extra unneeded check for skb->vlan_present in bpf_flow.c
      
      v3:
      * new kernel xdp_buff-like context per Alexei suggestion
      * drop skb_net helper
      * properly clamp flow_keys->nhoff
      
      v2:
      * moved temporary skb from stack into percpu (avoids memset of ~200 bytes
        per packet)
      * tightened down access to __sk_buff fields from flow dissector programs to
        avoid touching shinfo (whitelist only relevant fields)
      * addressed suggestions from Willem
      ====================
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf/flow_dissector: don't adjust nhoff by ETH_HLEN in BPF_PROG_TEST_RUN · 02ee0658
      Stanislav Fomichev authored
      Now that we use skb-less flow dissector let's return true nhoff and
      thoff. We used to adjust them by ETH_HLEN because that's how it was
      done in the skb case. For VLAN tests that looks confusing: nhoff is
      pointing to vlan parts :-\
      
      Warning, this is an API change for BPF_PROG_TEST_RUN! Feel free to drop
      if you think that it's too late at this point to fix it.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: properly return error from bpf_flow_load · fe993c64
      Stanislav Fomichev authored
      Right now we incorrectly return 'ret' which is always zero at that
      point.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: run flow dissector tests in skb-less mode · 0905beec
      Stanislav Fomichev authored
      Export last_dissection map from flow dissector and use a known place in
      tun driver to trigger BPF flow dissection.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: add flow dissector bpf_skb_load_bytes helper test · c9cb2c1e
      Stanislav Fomichev authored
      When flow dissector is called without skb, we want to make sure
      bpf_skb_load_bytes invocations return error. Add small test which tries
      to read single byte from a packet.
      
      bpf_skb_load_bytes should always fail under BPF_PROG_TEST_RUN because
      it was converted to the skb-less mode.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: pass net_device argument to the eth_get_headlen · c43f1255
      Stanislav Fomichev authored
      Update all users of eth_get_headlen to pass network device, fetch
      network namespace from it and pass it down to the flow dissector.
      This commit is a noop until administrator inserts BPF flow dissector
      program.
      
      Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Michael Chan <michael.chan@broadcom.com>
      Cc: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • flow_dissector: handle no-skb use case · 9b52e3f2
      Stanislav Fomichev authored
      When called without skb, gather all required data from the
      __skb_flow_dissect's arguments and use the recently introduced
      no-skb mode of bpf flow dissector.
      
      Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: plumb network namespace into __skb_flow_dissect · 3cbf4ffb
      Stanislav Fomichev authored
      This new argument will be used in the next patches for the
      eth_get_headlen use case. eth_get_headlen calls flow dissector
      with only data (without skb) so there is currently no way to
      pull attached BPF flow dissector program. With this new argument,
      we can amend the callers to explicitly pass network namespace
      so we can use attached BPF program.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode · 7b8a1304
      Stanislav Fomichev authored
      Now that we have bpf_flow_dissect which can work on raw data,
      use it when doing BPF_PROG_TEST_RUN for flow dissector.
      
      Simplifies bpf_prog_test_run_flow_dissector and allows us to
      test no-skb mode.
      
      Note, that previously, with bpf_flow_dissect_skb we used to call
      eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called
      skb_reset_network_header. That means flow_keys->nhoff would be
      initialized to 0 (skb_network_offset) in init_flow_keys.
      Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need
      to undo it once the dissection is done to preserve the existing behavior.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • flow_dissector: switch kernel context to struct bpf_flow_dissector · 089b19a9
      Stanislav Fomichev authored
      struct bpf_flow_dissector has a small subset of sk_buff fields that
      a flow dissector BPF program is allowed to access and an optional
      pointer to the real skb. The real skb is used only in the
      bpf_skb_load_bytes helper to read non-linear data.
      
      The real motivation for this is to be able to call flow dissector
      from eth_get_headlen context where we don't have an skb and need
      to dissect raw bytes.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      089b19a9
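To make the shape of such a context concrete, here is a userspace sketch assuming a context that carries bounds for directly readable bytes plus an optional packet pointer. The names `skb_model`, `flow_dissector_ctx_model`, and `load_bytes_model` are hypothetical and only model the description above; the real kernel structure and helper may differ in fields and error handling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a real skb: all packet bytes, including
 * any that are not in the directly readable (linear) area. */
struct skb_model {
	const unsigned char *bytes;
	size_t len;
};

/* Modelled on the description above: a few fields plus an optional skb. */
struct flow_dissector_ctx_model {
	const unsigned char *data;     /* start of directly readable bytes */
	const unsigned char *data_end; /* one past the last readable byte */
	const struct skb_model *skb;   /* NULL when dissecting raw bytes */
};

/* Copy len bytes starting at offset through the optional skb.
 * Returns 0 on success, -1 if there is no skb or the range is invalid. */
static int load_bytes_model(const struct flow_dissector_ctx_model *ctx,
			    size_t offset, void *to, size_t len)
{
	if (!ctx->skb || offset > ctx->skb->len ||
	    len > ctx->skb->len - offset)
		return -1;
	memcpy(to, ctx->skb->bytes + offset, len);
	return 0;
}
```

With a NULL skb the helper in this model simply fails, which mirrors the raw-bytes case where only the data and data_end bounds are available.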
    • Florian Fainelli's avatar
      net: systemport: Remove need for DMA descriptor · 7e6e185c
      Florian Fainelli authored
      All we do is write the length/status and address bits to a DMA
      descriptor, only to write its contents into on-chip registers right
      after. Eliminate this unnecessary step.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e6e185c
    • Ido Schimmel's avatar
      bridge: Fix possible use-after-free when deleting bridge port · 697cd36c
      Ido Schimmel authored
      When a bridge port is being deleted, do not dereference it later in
      br_vlan_port_event() as it can result in a use-after-free [1] if the RCU
      callback was executed before invoking the function.
      
      [1]
      [  129.638551] ==================================================================
      [  129.646904] BUG: KASAN: use-after-free in br_vlan_port_event+0x53c/0x5fd
      [  129.654406] Read of size 8 at addr ffff8881e4aa1ae8 by task ip/483
      [  129.663008] CPU: 0 PID: 483 Comm: ip Not tainted 5.1.0-rc5-custom-02265-ga946bd73daac #1383
      [  129.672359] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
      [  129.682484] Call Trace:
      [  129.685242]  dump_stack+0xa9/0x10e
      [  129.689068]  print_address_description.cold.2+0x9/0x25e
      [  129.694930]  kasan_report.cold.3+0x78/0x9d
      [  129.704420]  br_vlan_port_event+0x53c/0x5fd
      [  129.728300]  br_device_event+0x2c7/0x7a0
      [  129.741505]  notifier_call_chain+0xb5/0x1c0
      [  129.746202]  rollback_registered_many+0x895/0xe90
      [  129.793119]  unregister_netdevice_many+0x48/0x210
      [  129.803384]  rtnl_delete_link+0xe1/0x140
      [  129.815906]  rtnl_dellink+0x2a3/0x820
      [  129.844166]  rtnetlink_rcv_msg+0x397/0x910
      [  129.868517]  netlink_rcv_skb+0x137/0x3a0
      [  129.882013]  netlink_unicast+0x49b/0x660
      [  129.900019]  netlink_sendmsg+0x755/0xc90
      [  129.915758]  ___sys_sendmsg+0x761/0x8e0
      [  129.966315]  __sys_sendmsg+0xf0/0x1c0
      [  129.988918]  do_syscall_64+0xa4/0x470
      [  129.993032]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  129.998696] RIP: 0033:0x7ff578104b58
      ...
      [  130.073811] Allocated by task 479:
      [  130.077633]  __kasan_kmalloc.constprop.5+0xc1/0xd0
      [  130.083008]  kmem_cache_alloc_trace+0x152/0x320
      [  130.088090]  br_add_if+0x39c/0x1580
      [  130.092005]  do_set_master+0x1aa/0x210
      [  130.096211]  do_setlink+0x985/0x3100
      [  130.100224]  __rtnl_newlink+0xc52/0x1380
      [  130.104625]  rtnl_newlink+0x6b/0xa0
      [  130.108541]  rtnetlink_rcv_msg+0x397/0x910
      [  130.113136]  netlink_rcv_skb+0x137/0x3a0
      [  130.117538]  netlink_unicast+0x49b/0x660
      [  130.121939]  netlink_sendmsg+0x755/0xc90
      [  130.126340]  ___sys_sendmsg+0x761/0x8e0
      [  130.130645]  __sys_sendmsg+0xf0/0x1c0
      [  130.134753]  do_syscall_64+0xa4/0x470
      [  130.138864]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [  130.146195] Freed by task 0:
      [  130.149421]  __kasan_slab_free+0x125/0x170
      [  130.154016]  kfree+0xf3/0x310
      [  130.157349]  kobject_put+0x1a8/0x4c0
      [  130.161363]  rcu_core+0x859/0x19b0
      [  130.165175]  __do_softirq+0x250/0xa26
      [  130.170956] The buggy address belongs to the object at ffff8881e4aa1ae8
                      which belongs to the cache kmalloc-1k of size 1024
      [  130.184972] The buggy address is located 0 bytes inside of
                      1024-byte region [ffff8881e4aa1ae8, ffff8881e4aa1ee8)
      
      Fixes: 9c0ec2e7 ("bridge: support binding vlan dev link state to vlan member bridge ports")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Cc: Mike Manning <mmanning@vyatta.att-mail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      697cd36c
    • Crag.Wang's avatar
      r8152: sync sa_family with the media type of network device · a6cbcb77
      Crag.Wang authored
      Without this patch, the socket address family sporadically gets a wrong
      value, which causes dev_set_mac_address() to fail to set the desired
      MAC address.
      
      Fixes: 25766271 ("r8152: Refresh MAC address during USBDEVFS_RESET")
      Signed-off-by: Crag.Wang <crag.wang@dell.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-By: Mario Limonciello <mario.limonciello@dell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6cbcb77
    • David S. Miller's avatar
      Merge branch 'mlxsw-Shared-buffer-improvements' · 6f97955f
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Shared buffer improvements
      
      This patchset includes two improvements with regard to shared buffer
      configuration in mlxsw.
      
      The first part of this patchset forbids the user from performing illegal
      shared buffer configuration that can result in unnecessary packet loss.
      In order to better communicate these configuration failures to the user,
      extack is propagated from devlink towards drivers. This is done in
      patches #1-#8.
      
      The second part of the patchset deals with the shared buffer
      configuration of the CPU port. When a packet is trapped by the device,
      it is sent across the PCI bus to the attached host CPU. From the
      device's perspective, it is as if the packet is transmitted through the
      CPU port.
      
      While testing traffic directed at the CPU it became apparent that for
      certain packet sizes and certain burst sizes, the current shared buffer
      configuration of the CPU port is inadequate and results in packet drops.
      The configuration is adjusted by patches #9-#14 that create two new pools
      - ingress & egress - which are dedicated for CPU traffic.
      ====================
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f97955f
    • Ido Schimmel's avatar
      mlxsw: spectrum_buffers: Adjust CPU port shared buffer egress quotas · 7a1ff9f4
      Ido Schimmel authored
      Switch the CPU port to use the new dedicated egress pool instead of the
      previously used egress pool, which was shared with normal front panel
      ports.
      
      Add per-port quotas for the amount of traffic that can be buffered for
      the CPU port and also adjust the per-{port, TC} quotas.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7a1ff9f4
    • Ido Schimmel's avatar
      mlxsw: spectrum_buffers: Allow skipping ingress port quota configuration · 6d28725c
      Ido Schimmel authored
      The CPU port is used to transmit traffic that is trapped to the host
      CPU. It is therefore irrelevant to define ingress quota for it.
      
      Add a 'skip_ingress' argument to the function tasked with configuring
      per-port quotas, so that ingress quotas can be skipped when the
      passed local port is the CPU port.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6d28725c
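The effect of a 'skip_ingress' style argument can be sketched in userspace as follows. This is only a model of the behaviour described in the commit message: `pool_model`, `pool_dir_model`, and `configure_port_quotas_model` are hypothetical names, and "configuring" a pool is reduced to counting it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool direction, for illustration only. */
enum pool_dir_model {
	POOL_DIR_INGRESS,
	POOL_DIR_EGRESS,
};

/* Hypothetical shared buffer pool descriptor. */
struct pool_model {
	enum pool_dir_model dir;
};

/* Walk all pools and "configure" a per-port quota for each one.
 * When skip_ingress is true, ingress pools are left untouched.
 * Returns the number of pools that were configured. */
static size_t configure_port_quotas_model(const struct pool_model *pools,
					  size_t n_pools, bool skip_ingress)
{
	size_t configured = 0;

	for (size_t i = 0; i < n_pools; i++) {
		if (skip_ingress && pools[i].dir == POOL_DIR_INGRESS)
			continue;
		configured++;
	}
	return configured;
}
```

A caller handling the CPU port would pass true for skip_ingress, while front panel ports would pass false and have every pool configured.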