1. 27 Apr, 2019 12 commits
    • bpf: Refactor BTF encoding macro to test_btf.h · 3f4d4c74
      Martin KaFai Lau authored
      Refactor common BTF encoding macros for other tests to use.
      libbpf may reuse some of them in the future, which requires
      some more thought before publishing them as a libbpf API.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Support BPF_MAP_TYPE_SK_STORAGE in bpf map probing · a19f89f3
      Martin KaFai Lau authored
      This patch supports probing for the new BPF_MAP_TYPE_SK_STORAGE.
      BPF_MAP_TYPE_SK_STORAGE enforces BTF usage, so the new probe
      also needs to create and load a BTF.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Sync bpf.h to tools · 948d930e
      Martin KaFai Lau authored
      This patch syncs bpf.h to tools/.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Introduce bpf sk local storage · 6ac99e8f
      Martin KaFai Lau authored
      After allowing a bpf prog to
      - directly read the skb->sk ptr
      - get the fullsock bpf_sock by "bpf_sk_fullsock()"
      - get the bpf_tcp_sock by "bpf_tcp_sock()"
      - get the listener sock by "bpf_get_listener_sock()"
      - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
        into different bpf running context.
      
      this patch is another effort to make bpf's network programming
      more intuitive to do (together with memory and performance benefit).
      
      When a bpf prog needs to store data for a sk, the current practice is to
      define a map with the usual 4-tuple (src/dst ip/port) as the key.
      If multiple bpf progs need to store different sk data, multiple maps
      have to be defined, which wastes memory on storing the duplicated
      keys (i.e. the 4-tuple here) in each of the bpf maps.
      [ The smallest key could be the sk pointer itself which requires
        some enhancement in the verifier and it is a separate topic. ]
      
      Also, the bpf prog needs to clean up the elem when sk is freed.
      Otherwise, the bpf map will become full and un-usable quickly.
      The sk-free tracking currently could be done during sk state
      transition (e.g. BPF_SOCK_OPS_STATE_CB).
      
      The size of the map needs to be predefined, which usually ends up
      with an over-provisioned map in production.  Even if the map were resizable,
      sk's naturally come and go already, so this potential resize
      operation is arguably redundant if the data can be directly connected
      to the sk itself instead of proxying through a bpf map.
      
      This patch introduces sk->sk_bpf_storage to provide local storage space
      at sk for bpf prog to use.  The space will be allocated when the first bpf
      prog has created data for this particular sk.
      
      The design optimizes the bpf prog's lookup (and then optionally followed by
      an inline update).  bpf_spin_lock should be used if the inline update needs
      to be protected.
      
      BPF_MAP_TYPE_SK_STORAGE:
      -----------------------
      To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
      this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
      be created to fit different bpf progs' needs.  The map enforces
      BTF to allow printing the sk-local-storage during a system-wide
      sk dump (e.g. "ss -ta") in the future.
      
      The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to look up, update or
      delete "sk-local-storage" data from a particular sk.
      Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
      particular "type" of "sk-local-storage" data can then be stored in any sk.
      
      The main purposes of this map are mostly:
      1. Define the size of a "sk-local-storage" type.
      2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
         map-id, map-btf...etc.)
      3. Keep track of all sk's storages of this "type" and clean them up
         when the map is freed.
      
      sk->sk_bpf_storage:
      ------------------
      The main lookup/update/delete is done on sk->sk_bpf_storage (which
      is a "struct bpf_sk_storage").  When doing a lookup,
      the "map" pointer is now used as the "key" to search on the
      sk_storage->list.  The "map" pointer is actually serving
      as the "type" of the "sk-local-storage" that is being
      requested.
      
      To allow very fast lookup, it should be as fast as looking up an
      array at a stable-offset.  At the same time, it is not ideal to
      set a hard limit on the number of sk-local-storage "type" that the
      system can have.  Hence, this patch takes a cache approach.
      The last search result from sk_storage->list is cached in
      sk_storage->cache[] which is a stable sized array.  Each
      "sk-local-storage" type has a stable offset to the cache[] array.
      In the future, a map's flag could be introduced to do cache
      opt-out/enforcement if it became necessary.
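
      The cache approach described above can be modelled in userspace.  The
      following is only a sketch, not the kernel code: the class names, the
      round-robin assignment of cache slots and the slow_lookups counter are
      assumptions made for illustration.  It shows the list walk keyed by the
      "map" object and the fixed-size cache[] that short-circuits it.

```python
import itertools

CACHE_SIZE = 16                # cache size stated in the commit message
_next_idx = itertools.count()  # hypothetical slot allocator (round-robin)


class StorageMap:
    """Stands in for one BPF_MAP_TYPE_SK_STORAGE map, i.e. one storage "type"."""

    def __init__(self, name):
        self.name = name
        # Each "type" gets a stable offset into every socket's cache[].
        self.cache_idx = next(_next_idx) % CACHE_SIZE


class SkStorage:
    """Stands in for sk->sk_bpf_storage: a list of elems plus a small cache."""

    def __init__(self):
        self.elems = []                   # list of (map, data) pairs
        self.cache = [None] * CACHE_SIZE  # last search result per slot
        self.slow_lookups = 0             # counts list walks (illustration only)

    def add(self, smap, data):
        self.elems.append((smap, data))

    def lookup(self, smap):
        # Fast path: the map object itself is the key, checked at a fixed slot.
        hit = self.cache[smap.cache_idx]
        if hit is not None and hit[0] is smap:
            return hit[1]
        # Slow path: walk the list and remember the result in the cache.
        self.slow_lookups += 1
        for elem in self.elems:
            if elem[0] is smap:
                self.cache[smap.cache_idx] = elem
                return elem[1]
        return None
```

      In this model two "types" that happen to share a slot simply evict each
      other, so the cache affects only speed, never the result of a lookup.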
      
      The cache size is 16 (i.e. 16 types of "sk-local-storage").
      Programs can share a map.  On the program side, having a few bpf_progs
      running in the networking hotpath is already a lot.  The bpf_prog
      should have already consolidated the existing sock-key-ed map usage
      to minimize the map lookup penalty.  16 has enough runway to grow.
      
      All sk-local-storage data will be removed from sk->sk_bpf_storage
      during sk destruction.
      
      bpf_sk_storage_get() and bpf_sk_storage_delete():
      ------------------------------------------------
      Instead of using bpf_map_(lookup|update|delete)_elem(),
      the bpf prog needs to use the new helper bpf_sk_storage_get() and
      bpf_sk_storage_delete().  The verifier can then enforce the
      ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() can also
      "create" a new elem if one does not exist in the sk.  It is done by
      the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
      provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
      The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
      it has eliminated the potential use cases for an equivalent
      bpf_map_update_elem() API (for bpf_prog) in this patch.
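
      The get-or-create behaviour described above can be sketched as a small
      userspace model.  This is an illustration under assumptions, not the
      helper implementation: the Python function names mirror the helpers but
      are hypothetical, the boolean return of the delete function is a
      simplification, and zero-filling a new elem when no initial value is
      given is an assumption of this model.

```python
BPF_SK_STORAGE_GET_F_CREATE = 1  # flag name from the commit; value assumed to be bit 0


class StorageType:
    """Stands in for a BPF_MAP_TYPE_SK_STORAGE map: it only defines the value size."""

    def __init__(self, value_size):
        self.value_size = value_size


class Sock:
    """Stands in for a socket; storage plays the role of sk->sk_bpf_storage."""

    def __init__(self):
        self.storage = {}  # StorageType object -> bytearray


def sk_storage_get(smap, sk, init_value=None, flags=0):
    """Return the sk's data for this type, optionally creating it."""
    data = sk.storage.get(smap)
    if data is not None:
        return data
    if not flags & BPF_SK_STORAGE_GET_F_CREATE:
        return None
    # Create: copy the optional initial value, otherwise start zero-filled.
    if init_value is not None:
        data = bytearray(init_value[:smap.value_size])
    else:
        data = bytearray(smap.value_size)
    sk.storage[smap] = data
    return data


def sk_storage_delete(smap, sk):
    """Remove the sk's data for this type; True if something was removed."""
    return sk.storage.pop(smap, None) is not None


# Usage: look up (creating on first use), then update the value in place.
sk, cnt_type = Sock(), StorageType(value_size=4)
cnt = sk_storage_get(cnt_type, sk, flags=BPF_SK_STORAGE_GET_F_CREATE)
cnt[0] += 1
```

      Because the returned buffer is the stored value itself, the in-place
      increment needs no separate update call, which is the point the commit
      message makes about not needing an update helper.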
      
      Misc notes:
      ----------
      1. map_get_next_key is not supported.  From the userspace syscall
         perspective,  the map has the socket fd as the key while the map
         can be shared by pinned-file or map-id.
      
         Since btf is enforced, the existing "ss" could be enhanced to pretty
         print the local-storage.
      
         Supporting a kernel defined btf with 4 tuples as the return key could
         be explored later also.
      
      2. The sk->sk_lock cannot be acquired.  Atomic operations are used instead.
         e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
         Please refer to the source code comments for the details in
         synchronization cases and considerations.
      
      3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
      
      Benchmark:
      ---------
      Here is the benchmark data collected by turning on
      the "kernel.bpf_stats_enabled" sysctl.
      Two bpf progs are tested:
      
      One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
      sk ptr as the key. (The verifier is modified to support sk ptr as the key.
      That should have shortened the key lookup time.)
      
      Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
      
      Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
      each egress skb and then bump the cnt.  netperf is used to drive
      data with 4096 connected UDP sockets.
      
      BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
      27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
          loaded_at 2019-04-15T13:46:39-0700  uid 0
          xlated 344B  jited 258B  memlock 4096B  map_ids 16
          btf_id 5
      
      BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
      30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
          loaded_at 2019-04-15T13:47:54-0700  uid 0
          xlated 168B  jited 156B  memlock 4096B  map_ids 17
          btf_id 6
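
      The per-run figures quoted above follow from dividing run_time_ns by
      run_cnt.  A short check using only the numbers from the two bpftool
      listings (the function name is just for illustration):

```python
def ns_per_run(run_time_ns, run_cnt):
    """Average time of one bpf prog run, from the bpf_stats counters."""
    return run_time_ns / run_cnt


hash_ns = ns_per_run(58280107540, 381347633)     # BPF_MAP_TYPE_HASH prog
storage_ns = ns_per_run(25617093319, 390989739)  # BPF_MAP_TYPE_SK_STORAGE prog

print(f"hashmap:    {hash_ns:.1f} ns per run")
print(f"sk_storage: {storage_ns:.1f} ns per run")
print(f"ratio:      {hash_ns / storage_ns:.2f}x")
```

      Note that the two quoted figures appear to be rounded differently: the
      first matches the quotient truncated, the second matches it rounded up.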
      
      Here is a high-level picture of how the objects are organized:
      
             sk
          ┌──────┐
          │      │
          │      │
          │      │
          │*sk_bpf_storage───── bpf_sk_storage
          └──────┘                 ┌───────┐
                       ┌───────────┤ list  │
                       │           │       │
                       │           │       │
                       │           │       │
                       │           └───────┘
                       │
                       │     elem
                       │  ┌────────┐
                       ├─│ snode  │
                       │  ├────────┤
                       │  │  data  │          bpf_map
                       │  ├────────┤        ┌─────────┐
                       │  │map_node│─┬─────┤  list   │
                       │  └────────┘  │     │         │
                       │              │     │         │
                       │     elem     │     │         │
                       │  ┌────────┐  │     └─────────┘
                       └─│ snode  │  │
                          ├────────┤  │
         bpf_map          │  data  │  │
       ┌─────────┐        ├────────┤  │
       │  list   ├───────│map_node│  │
       │         │        └────────┘  │
       │         │                    │
       │         │           elem     │
       └─────────┘        ┌────────┐  │
                       ┌─│ snode  │  │
                       │  ├────────┤  │
                       │  │  data  │  │
                       │  ├────────┤  │
                       │  │map_node│─┘
                       │  └────────┘
                       │
                       │
                       │          ┌───────┐
           sk          └──────────│ list  │
        ┌──────┐                  │       │
        │      │                  │       │
        │      │                  │       │
        │      │                  └───────┘
        │*sk_bpf_storage───────bpf_sk_storage
        └──────┘
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'writeable-bpf-tracepoints' · 3745dc24
      Alexei Starovoitov authored
      Matt Mullins says:
      
      ====================
      This adds an opt-in interface for tracepoints to expose a writable context to
      BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE programs that are attached, while
      supporting read-only access from existing BPF_PROG_TYPE_RAW_TRACEPOINT
      programs, as well as from non-BPF-based tracepoints.
      
      The initial motivation is to support tracing that can be observed from the
      remote end of an NBD socket, e.g. by adding flags to the struct nbd_request
      header.  Earlier attempts included adding an NBD-specific tracepoint fd, but in
      code review, I was recommended to implement it more generically -- as a result,
      this patchset is far simpler than my initial try.
      
      v4->v5:
        * rebased onto bpf-next/master and fixed merge conflicts
        * "tools: sync bpf.h" also syncs comments that have previously changed
          in bpf-next
      
      v3->v4:
        * fixed a silly copy/paste typo in include/trace/events/bpf_test_run.h
          (_TRACE_NBD_H -> _TRACE_BPF_TEST_RUN_H)
        * fixed incorrect/misleading wording in patch 1's commit message,
          since the pointer cannot be directly dereferenced in a
          BPF_PROG_TYPE_RAW_TRACEPOINT
        * cleaned up the error message wording if the prog_tests fail
        * Addressed feedback from Yonghong
          * reject non-pointer-sized accesses to the buffer pointer
          * use sizeof(struct nbd_request) as one-byte-past-the-end in
            raw_tp_writable_reject_nbd_invalid.c
          * use BPF_MOV64_IMM instead of BPF_LD_IMM64
      
      v2->v3:
        * Andrew addressed Josef's comments:
          * C-style commenting in nbd.c
          * Collapsed identical events into a single DECLARE_EVENT_CLASS.
            This saves about 2kB of kernel text
      
      v1->v2:
        * add selftests
          * sync tools/include/uapi/linux/bpf.h
        * reject variable offset into the buffer
        * add string representation of PTR_TO_TP_BUFFER to reg_type_str
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests: bpf: test writable buffers in raw tps · e950e843
      Matt Mullins authored
      This tests that:
        * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it
          uses either:
          * a variable offset to the tracepoint buffer, or
          * an offset beyond the size of the tracepoint buffer
        * a tracer can modify the buffer provided when attached to a writable
          tracepoint in bpf_prog_test_run
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tools: sync bpf.h · 4635b0ae
      Matt Mullins authored
      This adds BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, and fixes up the
      
      	error: enumeration value ‘BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE’ not handled in switch [-Werror=switch-enum]
      
      build errors it would otherwise cause in libbpf.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • nbd: add tracepoints for send/receive timing · 2abd2de7
      Andrew Hall authored
      This adds four tracepoints to nbd, enabling separate tracing of payload
      and header sending/receipt.
      
      In the send path for headers that have already been sent, we also
      explicitly initialize the handle so it can be referenced by the later
      tracepoint.
      Signed-off-by: Andrew Hall <hall@fb.com>
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • nbd: trace sending nbd requests · ea106722
      Matt Mullins authored
      This adds a tracepoint that can both observe the nbd request being sent
      to the server and modify that request, e.g., by setting a flag in
      the request that will cause the server to collect detailed tracing data.
      
      The struct request * being handled is included to permit correlation
      with the block tracepoints.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add writable context for raw tracepoints · 9df1c28b
      Matt Mullins authored
      This is an opt-in interface that allows a tracepoint to provide a safe
      buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE program.
      The size of the buffer must be a compile-time constant, and is checked
      before allowing a BPF program to attach to a tracepoint that uses this
      feature.
      
      The pointer to this buffer will be the first argument of tracepoints
      that opt in; the pointer is valid and can be bpf_probe_read() by both
      BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
      programs that attach to such a tracepoint, but the buffer to which it
      points may only be written by the latter.
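
      The two-stage check implied above (bound each access while the program
      is verified, then compare against the tracepoint's constant buffer size
      at attach time) can be sketched as follows.  This is a userspace model
      under assumptions: the function and field names are hypothetical, and
      the 28-byte buffer size is only an example value.

```python
class Prog:
    """Stands in for a program under verification."""

    def __init__(self):
        self.max_tp_access = 0  # furthest byte (exclusive) the prog touches


def check_buffer_access(prog, off, size, const_off=True):
    """Verification-time check for one access to the tracepoint buffer."""
    if not const_off:
        return False  # variable offsets into the buffer are rejected
    if off < 0:
        return False  # negative offsets are rejected in this model
    prog.max_tp_access = max(prog.max_tp_access, off + size)
    return True


def can_attach(prog, writable_size):
    """Attach-time check against the tracepoint's compile-time buffer size."""
    return prog.max_tp_access <= writable_size


WRITABLE_SIZE = 28  # hypothetical buffer size for the example

ok_prog = Prog()
check_buffer_access(ok_prog, off=0, size=4)
print(can_attach(ok_prog, WRITABLE_SIZE))   # access ends at byte 4

bad_prog = Prog()
check_buffer_access(bad_prog, off=WRITABLE_SIZE, size=1)
print(can_attach(bad_prog, WRITABLE_SIZE))  # access starts one byte past the end
```

      Splitting the check this way lets the same verified program be accepted
      or rejected depending on which tracepoint it is attached to.
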
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd · 34b8ab09
      Daniel Borkmann authored
      Since the ARMv8.1 supplement introduced LSE atomic instructions back in 2016,
      let's add support for STADD and use that in favor of the LDXR / STXR loop for
      the XADD mapping if available. STADD is encoded as an alias for LDADD with
      XZR as the destination register, therefore add LDADD to the instruction
      encoder along with STADD as special case and use it in the JIT for CPUs
      that advertise LSE atomics in CPUID register. If immediate offset in the
      BPF XADD insn is 0, then use dst register directly instead of temporary
      one.
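
      The instruction selection described above can be sketched as a small
      model that returns pseudo-assembly strings.  This is not the JIT code:
      the function names, the symbolic temporaries (tmp, tmp2, tmp3) and the
      textual output format are assumptions made for illustration.

```python
def stadd_as_ldadd(src, addr):
    """STADD is an alias of LDADD with XZR as the destination register."""
    return f"ldadd {src}, xzr, [{addr}]"


def jit_xadd(dst, src, off, has_lse):
    """Return pseudo-assembly for a BPF XADD of src to memory at dst + off."""
    insns = []
    if off == 0:
        addr = dst  # zero offset: use the dst register directly
    else:
        insns.append(f"mov tmp, #{off}")
        insns.append(f"add tmp, tmp, {dst}")
        addr = "tmp"
    if has_lse:
        # Single atomic store-add when the CPU advertises LSE atomics.
        insns.append(f"stadd {src}, [{addr}]")
    else:
        # Fallback: load-exclusive / store-exclusive retry loop.
        insns.append(f"ldxr tmp2, [{addr}]")
        insns.append(f"add tmp2, tmp2, {src}")
        insns.append(f"stxr tmp3, tmp2, [{addr}]")
        insns.append("cbnz tmp3, <back to ldxr>")
    return insns


for line in jit_xadd("x1", "x2", off=0, has_lse=True):
    print(line)
for line in jit_xadd("x1", "x2", off=8, has_lse=False):
    print(line)
```

      In this model the LSE path with a zero offset collapses to a single
      instruction, while the fallback path with a non-zero offset needs six.
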
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm64: remove prefetch insn in xadd mapping · 8968c67a
      Daniel Borkmann authored
      Prefetch-with-intent-to-write is currently part of the XADD mapping in
      the AArch64 JIT and follows the kernel's implementation of atomic_add.
      This may interfere with other threads executing the LDXR/STXR loop,
      leading to potential starvation and fairness issues. Drop the optional
      prefetch instruction.
      
      Fixes: 85f68fe8 ("bpf, arm64: implement jiting of BPF_XADD")
      Reported-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 26 Apr, 2019 6 commits
    • Merge branch 'btf-dump' · 0c0cad2c
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      This patch set adds a new `bpftool btf dump` sub-command, which dumps
      BTF contents (only types for now). Currently it only outputs low-level
      content, almost 1:1 with the binary BTF format, but follow-up patches will add
      the ability to dump BTF types as a compilable C header file. JSON output is
      supported as well.
      
      Patch #1 adds `btf` sub-command, dumping BTF types in human-readable format.
      It also implements reading .BTF data from ELF file.
      Patch #2 adds minimal documentation with output format examples and different
      ways to specify source of BTF data.
      Patch #3 adds support for btf command in bash-completion/bpftool script.
      Patch #4 fixes minor indentation issue in bash-completion script.
      
      Output format mostly follows the existing format of the BPF verifier log, but
      deviates from it in a few places. More details are in the commit message for patch 1.
      
      Examples of output for all supported BTF kinds are in patch #2 as part of the
      documentation. Some field names are quite verbose and I'd rather shorten them,
      if we don't feel that staying very close to BPF verifier names is a necessity,
      but in this patch I left them exactly the same as in the verifier log.
      
      v3->v4:
        - reverse Christmas tree (Quentin)
        - better docs (Quentin)
      
      v2->v3:
        - make map's key|value|kv|all suggestion more precise (Quentin)
        - fix default case indentations (Quentin)
      
      v1->v2:
        - fix unnecessary trailing whitespaces in bpftool-btf.rst (Yonghong)
        - add btf in main.c for a list of possible OBJECTs
        - handle unknown keyword under `bpftool btf dump` (Yonghong)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: fix indendation in bash-completion/bpftool · 8ed1875b
      Andrii Nakryiko authored
      Fix misaligned default case branch for `prog dump` sub-command.
      Reported-by: Quentin Monnet <quentin.monnet@netronome.com>
      Cc: Yonghong Song <yhs@fb.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: add bash completions for btf command · 4a714fee
      Andrii Nakryiko authored
      Add full support for btf command in bash-completion script.
      
      Cc: Quentin Monnet <quentin.monnet@netronome.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool/docs: add btf sub-command documentation · ca253339
      Andrii Nakryiko authored
      Document usage and sample output format for `btf dump` sub-command.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: add ability to dump BTF types · c93cc690
      Andrii Nakryiko authored
      Add a new `btf dump` sub-command to bpftool. It dumps a human-readable,
      low-level representation of BTF types. BTF can
      be retrieved from a few different sources:
        - from BTF object by ID;
        - from PROG, if it has associated BTF;
        - from MAP, if it has associated BTF data; it's possible to narrow
          down types to either key type, value type, both, or all BTF types;
        - from ELF file (.BTF section).
      
      Output format mostly follows the BPF verifier log format with a few notable
      exceptions:
        - all the type/field/param/etc names are enclosed in single quotes to
          allow easier grepping and to stand out a little bit more;
        - FUNC_PROTO output follows STRUCT/UNION/ENUM format of having one
          line per each argument; this is more uniform and allows easy
          grepping, as opposed to succinct, but inconvenient format that BPF
          verifier log is using.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpftool: Fix errno variable usage · 77d76426
      Benjamin Poirier authored
      The test meant to use the saved value of errno. Given the current code, it
      makes no practical difference however.
      
      Fixes: bf598a8f ("bpftool: Improve handling of ENOENT on map dumps")
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 25 Apr, 2019 7 commits
    • bpftool: show flow_dissector attachment status · 7f0c57fe
      Stanislav Fomichev authored
      Right now there is no way to query whether a BPF flow_dissector program
      is attached to a network namespace or not. The previous commit added
      support for querying that info; show it when doing `bpftool net`:
      
      $ bpftool prog loadall ./bpf_flow.o \
      	/sys/fs/bpf/flow type flow_dissector \
      	pinmaps /sys/fs/bpf/flow
      $ bpftool prog
      3: flow_dissector  name _dissect  tag 8c9e917b513dd5cc  gpl
              loaded_at 2019-04-23T16:14:48-0700  uid 0
              xlated 656B  jited 461B  memlock 4096B  map_ids 1,2
              btf_id 1
      ...
      
      $ bpftool net -j
      [{"xdp":[],"tc":[],"flow_dissector":[]}]
      
      $ bpftool prog attach pinned \
      	/sys/fs/bpf/flow/flow_dissector flow_dissector
      $ bpftool net -j
      [{"xdp":[],"tc":[],"flow_dissector":["id":3]}]
      
      Doesn't show up in a different net namespace:
      $ ip netns add test
      $ ip netns exec test bpftool net -j
      [{"xdp":[],"tc":[],"flow_dissector":[]}]
      
      Non-json output:
      $ bpftool net
      xdp:
      
      tc:
      
      flow_dissector:
      id 3
      
      v2:
      * initialization order (Jakub Kicinski)
      * clear errno for batch mode (Quentin Monnet)
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_type · 118c8e9a
      Stanislav Fomichev authored
      target_fd is target namespace. If there is a flow dissector BPF program
      attached to that namespace, its (single) id is returned.
      
      v5:
      * drop net ref right after rcu unlock (Daniel Borkmann)
      
      v4:
      * add missing put_net (Jann Horn)
      
      v3:
      * add missing inline to skb_flow_dissector_prog_query static def
        (kbuild test robot <lkp@intel.com>)
      
      v2:
      * don't sleep in rcu critical section (Jakub Kicinski)
      * check input prog_cnt (exit early)
      
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples: bpf: add hbm sample to .gitignore · ead442a0
      Daniel T. Lee authored
      This commit adds hbm to .gitignore, from which it is
      currently omitted.
      Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: fix samples/bpf build failure due to undefined UINT32_MAX · 32e621e5
      Daniel T. Lee authored
      Currently, building bpf samples will cause the following error.
      
          ./tools/lib/bpf/bpf.h:132:27: error: 'UINT32_MAX' undeclared here (not in a function) ..
           #define BPF_LOG_BUF_SIZE (UINT32_MAX >> 8) /* verifier maximum in kernels <= 5.1 */
                                     ^
          ./samples/bpf/bpf_load.h:31:25: note: in expansion of macro 'BPF_LOG_BUF_SIZE'
           extern char bpf_log_buf[BPF_LOG_BUF_SIZE];
                                   ^~~~~~~~~~~~~~~~
      
      Due to commit 4519efa6 ("libbpf: fix BPF_LOG_BUF_SIZE off-by-one error")
      hard-coded size of BPF_LOG_BUF_SIZE has been replaced with UINT32_MAX which is
      defined in <stdint.h> header.
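
      For reference, the value the macro expands to can be checked with plain
      integer arithmetic.  The snippet below only evaluates the expression
      from the error message; UINT32_MAX is written out by hand since Python
      has no such constant.

```python
UINT32_MAX = 2**32 - 1              # value of UINT32_MAX from <stdint.h>
BPF_LOG_BUF_SIZE = UINT32_MAX >> 8  # same expression as in tools/lib/bpf/bpf.h

print(BPF_LOG_BUF_SIZE, hex(BPF_LOG_BUF_SIZE))  # 16777215 0xffffff
```

      The result is one byte short of 16 MiB, i.e. 2^24 - 1.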
      
      Even with this change, bpf selftests are running fine since these are built
      with clang and it includes header(-idirafter) from clang/6.0.0/include.
      (it has <stdint.h>)
      
          clang -I. -I./include/uapi -I../../../include/uapi -idirafter /usr/local/include -idirafter /usr/include \
          -idirafter /usr/lib/llvm-6.0/lib/clang/6.0.0/include -idirafter /usr/include/x86_64-linux-gnu \
          -Wno-compare-distinct-pointer-types -O2 -target bpf -emit-llvm -c progs/test_sysctl_prog.c -o - | \
          llc -march=bpf -mcpu=generic  -filetype=obj -o /linux/tools/testing/selftests/bpf/test_sysctl_prog.o
      
      But bpf samples are compiled with GCC, and it only searches and includes
      headers declared at the target file. As '#include <stdint.h>' hasn't been
      declared in tools/lib/bpf/bpf.h, it causes build failure of bpf samples.
      
          gcc -Wp,-MD,./samples/bpf/.sockex3_user.o.d -Wall -Wmissing-prototypes -Wstrict-prototypes \
          -O2 -fomit-frame-pointer -std=gnu89 -I./usr/include -I./tools/lib/ -I./tools/testing/selftests/bpf/ \
          -I./tools/  lib/ -I./tools/include -I./tools/perf -c -o ./samples/bpf/sockex3_user.o ./samples/bpf/sockex3_user.c;
      
      This commit adds '#include <stdint.h>' to tools/lib/bpf/bpf.h
      to fix this problem.
      Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'libbpf-fixes' · 0e33d334
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      Two small fixes in relation to global data handling. Thanks!
      ====================
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, libbpf: fix segfault in bpf_object__init_maps' pr_debug statement · 4f8827d2
      Daniel Borkmann authored
      Ran into it while testing; in bpf_object__init_maps() data can be NULL
      in the case where no map section is present. Therefore we simply cannot
      access data->d_size before NULL test. Move the pr_debug() where it's
      safe to access.
      
      Fixes: d859900c ("bpf, libbpf: support global data/bss/rodata sections")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, libbpf: handle old kernels more graceful wrt global data sections · 8837fe5d
      Daniel Borkmann authored
      Andrii reported a corner case where e.g. global static data is present
      in the BPF ELF file in form of .data/.bss/.rodata section, but without
      any relocations to it. Such programs could be loaded before commit
      d859900c ("bpf, libbpf: support global data/bss/rodata sections"),
      whereas afterwards if kernel lacks support then loading would fail.
      
      Add a probing mechanism which skips setting up libbpf internal maps
      in case of missing kernel support. In presence of relocation entries,
      we abort the load attempt.
      
      Fixes: d859900c ("bpf, libbpf: support global data/bss/rodata sections")
      Reported-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      8837fe5d
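
      The fallback policy described in 8837fe5d can be sketched as a small
      decision function. All names here are hypothetical, and the -ENOTSUP
      error code is an assumption chosen for illustration; the commit message
      does not say which error libbpf returns. Only the three outcomes
      (create the maps, skip them, abort the load) come from the message.

```c
#include <errno.h>
#include <stdbool.h>

/* What the loader should do with the internal global-data maps. */
enum sketch_action {
	SKETCH_CREATE_MAPS,	/* kernel supports global data sections */
	SKETCH_SKIP_MAPS,	/* old kernel, but nothing references the data */
};

/*
 * Returns 0 and sets *action when loading may continue, or a negative
 * errno (assumed value) when relocations exist but the kernel cannot
 * back them.
 */
static int sketch_global_data_policy(bool kernel_supports, bool has_relocs,
				     enum sketch_action *action)
{
	if (kernel_supports) {
		*action = SKETCH_CREATE_MAPS;
		return 0;
	}
	if (has_relocs)
		return -ENOTSUP;	/* abort the load attempt */

	*action = SKETCH_SKIP_MAPS;
	return 0;
}
```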
  4. 23 Apr, 2019 15 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-proto-fixes' · a21b48a2
      Daniel Borkmann authored
      Willem de Bruijn says:
      
      ====================
      Expand the tc tunnel encap support with protocols that convert the
      network layer protocol, such as 6in4. This is analogous to existing
      support in bpf_skb_proto_6_to_4.
      
      Patch 1 implements the straightforward logic
      Patch 2 tests it with a 6in4 tunnel
      
      Changes v1->v2
        - improve documentation in test
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a21b48a2
    • Willem de Bruijn's avatar
      selftests/bpf: expand test_tc_tunnel with SIT encap · f6ad6acc
      Willem de Bruijn authored
      So far, all BPF tc tunnel testcases encapsulate in the same network
      protocol. Add an encap testcase that requires updating skb->protocol.
      
      The 6in4 tunnel encapsulates an IPv6 packet inside an IPv4 tunnel.
      Verify that bpf_skb_net_grow correctly updates skb->protocol to
      select the right protocol handler in __netif_receive_skb_core.
      
      The BPF program should also manually update the link layer header to
      encode the right network protocol.
      
      Changes v1->v2
        - improve documentation of non-obvious logic
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Tested-by: Alan Maguire <alan.maguire@oracle.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f6ad6acc
    • Willem de Bruijn's avatar
      bpf: update skb->protocol in bpf_skb_net_grow · 1b00e0df
      Willem de Bruijn authored
      Some tunnels, like sit, change the network protocol of the packet.
      If so, update skb->protocol to match the new type.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1b00e0df
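
      The protocol update described in 1b00e0df can be sketched as follows.
      This is an illustration only, not the kernel code: the function and enum
      names are hypothetical, and values are kept in host byte order for
      simplicity. The two EtherType constants (0x0800 for IPv4, 0x86DD for
      IPv6) are the standard values.

```c
#include <stdint.h>

#define SKETCH_ETH_P_IP   0x0800	/* EtherType for IPv4 */
#define SKETCH_ETH_P_IPV6 0x86DD	/* EtherType for IPv6 */

/* Network-layer family of the outer (tunnel) header being added. */
enum sketch_outer {
	SKETCH_OUTER_IPV4,
	SKETCH_OUTER_IPV6,
};

/*
 * Returns the protocol the packet should carry after encapsulation.
 * For 6in4 (IPv6 inside IPv4) the result switches to IPv4; when the
 * outer and inner families match, the protocol is left unchanged.
 */
static uint16_t sketch_proto_after_encap(uint16_t inner_proto,
					 enum sketch_outer outer)
{
	if (inner_proto == SKETCH_ETH_P_IPV6 && outer == SKETCH_OUTER_IPV4)
		return SKETCH_ETH_P_IP;
	if (inner_proto == SKETCH_ETH_P_IP && outer == SKETCH_OUTER_IPV6)
		return SKETCH_ETH_P_IPV6;
	return inner_proto;
}
```

      The link layer header is a separate concern: as the selftest message
      notes, the BPF program is expected to rewrite that field itself.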
    • Daniel Borkmann's avatar
      Merge branch 'bpf-eth-get-headlen' · 2aad3261
      Daniel Borkmann authored
      Stanislav Fomichev says:
      
      ====================
      Currently, when eth_get_headlen calls the flow dissector, it doesn't pass
      any skb. Because we use the passed skb to look up the associated
      networking namespace to find whether we have a BPF program attached or
      not, we always use the C-based flow dissector in this case.
      
      The goal of this patch series is to add a new networking namespace
      argument to eth_get_headlen and make BPF flow dissector programs able to
      work in the skb-less case.
      
      The series goes like this:
      * use new kernel context (struct bpf_flow_dissector) for flow dissector
        programs; this makes it easy to distinguish between skb and no-skb
        case and supports calling BPF flow dissector on a chunk of raw data
      * convert BPF_PROG_TEST_RUN to use raw data
      * plumb network namespace into __skb_flow_dissect from all callers
      * handle no-skb case in __skb_flow_dissect
      * update eth_get_headlen to include net namespace argument and
        convert all existing users
      * add selftest to make sure bpf_skb_load_bytes is not allowed in
        the no-skb mode
      * extend test_progs to exercise skb-less flow dissection as well
      * stop adjusting nhoff/thoff by ETH_HLEN in BPF_PROG_TEST_RUN
      
      v6:
      * more suggestions by Alexei:
        * eth_get_headlen now takes net dev, not net namespace
        * test skb-less case via tun eth_get_headlen
      * fix return errors in bpf_flow_load
      * don't adjust nhoff/thoff by ETH_HLEN
      
      v5:
      * API changes have been submitted via bpf/stable tree
      
      v4:
      * prohibit access to vlan fields as well (otherwise, inconsistent
        between skb/skb-less cases)
      * drop extra unneeded check for skb->vlan_present in bpf_flow.c
      
      v3:
      * new kernel xdp_buff-like context per Alexei suggestion
      * drop skb_net helper
      * properly clamp flow_keys->nhoff
      
      v2:
      * moved temporary skb from stack into percpu (avoids memset of ~200 bytes
        per packet)
      * tightened down access to __sk_buff fields from flow dissector programs to
        avoid touching shinfo (whitelist only relevant fields)
      * addressed suggestions from Willem
      ====================
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2aad3261
    • Stanislav Fomichev's avatar
      bpf/flow_dissector: don't adjust nhoff by ETH_HLEN in BPF_PROG_TEST_RUN · 02ee0658
      Stanislav Fomichev authored
      Now that we use skb-less flow dissector let's return true nhoff and
      thoff. We used to adjust them by ETH_HLEN because that's how it was
      done in the skb case. For VLAN tests that looks confusing: nhoff is
      pointing to vlan parts :-\
      
      Warning, this is an API change for BPF_PROG_TEST_RUN! Feel free to drop
      if you think that it's too late at this point to fix it.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      02ee0658
    • Stanislav Fomichev's avatar
      selftests/bpf: properly return error from bpf_flow_load · fe993c64
      Stanislav Fomichev authored
      Right now we incorrectly return 'ret' which is always zero at that
      point.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      fe993c64
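
      The class of bug fixed by fe993c64 can be shown with a small sketch.
      This is not the selftest code: sketch_load() and its parameters are
      hypothetical, and the two inputs simply stand in for the results of an
      earlier load step and a later lookup step. Once the first step has
      succeeded, 'ret' is zero, so a later failure has to be reported through
      its own error value.

```c
/*
 * load_ret:  result of an earlier step (0 on success, negative on error).
 * lookup_fd: result of a later step (a descriptor, or negative on error).
 * On success stores the descriptor in *out_fd and returns 0.
 */
static int sketch_load(int load_ret, int lookup_fd, int *out_fd)
{
	int ret = load_ret;

	if (ret)
		return ret;

	/* From here on ret is always 0. */
	if (lookup_fd < 0)
		return lookup_fd;	/* "return ret" would report success */

	*out_fd = lookup_fd;
	return 0;
}
```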
    • Stanislav Fomichev's avatar
      selftests/bpf: run flow dissector tests in skb-less mode · 0905beec
      Stanislav Fomichev authored
      Export last_dissection map from flow dissector and use a known place in
      tun driver to trigger BPF flow dissection.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0905beec
    • Stanislav Fomichev's avatar
      selftests/bpf: add flow dissector bpf_skb_load_bytes helper test · c9cb2c1e
      Stanislav Fomichev authored
      When flow dissector is called without skb, we want to make sure
      bpf_skb_load_bytes invocations return error. Add small test which tries
      to read single byte from a packet.
      
      bpf_skb_load_bytes should always fail under BPF_PROG_TEST_RUN because
      it was converted to the skb-less mode.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c9cb2c1e
    • Stanislav Fomichev's avatar
      net: pass net_device argument to the eth_get_headlen · c43f1255
      Stanislav Fomichev authored
      Update all users of eth_get_headlen to pass network device, fetch
      network namespace from it and pass it down to the flow dissector.
      This commit is a noop until an administrator inserts a BPF flow
      dissector program.
      
      Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Michael Chan <michael.chan@broadcom.com>
      Cc: Igor Russkikh <igor.russkikh@aquantia.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c43f1255
    • Stanislav Fomichev's avatar
      flow_dissector: handle no-skb use case · 9b52e3f2
      Stanislav Fomichev authored
      When called without an skb, gather all required data from
      __skb_flow_dissect's arguments and use the recently introduced
      no-skb mode of the BPF flow dissector.
      
      Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9b52e3f2
    • Stanislav Fomichev's avatar
      net: plumb network namespace into __skb_flow_dissect · 3cbf4ffb
      Stanislav Fomichev authored
      This new argument will be used in the next patches for the
      eth_get_headlen use case. eth_get_headlen calls flow dissector
      with only data (without skb) so there is currently no way to
      pull attached BPF flow dissector program. With this new argument,
      we can amend the callers to explicitly pass network namespace
      so we can use attached BPF program.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3cbf4ffb
    • Stanislav Fomichev's avatar
      bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode · 7b8a1304
      Stanislav Fomichev authored
      Now that we have bpf_flow_dissect which can work on raw data,
      use it when doing BPF_PROG_TEST_RUN for flow dissector.
      
      Simplifies bpf_prog_test_run_flow_dissector and allows us to
      test no-skb mode.
      
      Note that previously, with bpf_flow_dissect_skb, we used to call
      eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called
      skb_reset_network_header. That means flow_keys->nhoff would be
      initialized to 0 (skb_network_offset) in init_flow_keys.
      Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need
      to undo it once the dissection is done to preserve the existing behavior.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      7b8a1304
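
      The offset arithmetic described in 7b8a1304 can be worked through with a
      short sketch. The struct and function names are hypothetical; only the
      Ethernet header length of 14 bytes and the nhoff/thoff field names come
      from the messages in this log. Dissection starts with nhoff at the end
      of the Ethernet header, and the sketch subtracts that length afterwards
      so callers see the same values as in the old skb-based path.

```c
#include <stdint.h>

#define SKETCH_ETH_HLEN 14	/* Ethernet header length in bytes */

/* Hypothetical subset of the dissection result. */
struct sketch_flow_keys {
	uint16_t nhoff;	/* network header offset */
	uint16_t thoff;	/* transport header offset */
};

/*
 * Convert offsets measured from the start of the frame into offsets
 * measured from the end of the Ethernet header.
 */
static void sketch_undo_eth_offset(struct sketch_flow_keys *keys)
{
	keys->nhoff -= SKETCH_ETH_HLEN;
	keys->thoff -= SKETCH_ETH_HLEN;
}
```

      For an untagged IPv4 packet with a 20-byte IP header, dissection from
      the frame start gives nhoff = 14 and thoff = 34; after the adjustment
      the reported values are 0 and 20.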
    • Stanislav Fomichev's avatar
      flow_dissector: switch kernel context to struct bpf_flow_dissector · 089b19a9
      Stanislav Fomichev authored
      struct bpf_flow_dissector has a small subset of sk_buff fields that
      flow dissector BPF program is allowed to access and an optional
      pointer to real skb. Real skb is used only in bpf_skb_load_bytes
      helper to read non-linear data.
      
      The real motivation for this is to be able to call flow dissector
      from eth_get_headlen context where we don't have an skb and need
      to dissect raw bytes.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      089b19a9
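
      The idea behind 089b19a9, a context with an optional skb pointer, can be
      sketched as below. Nothing here is the kernel definition: the struct
      layouts, the sketch_load_bytes() helper, and the -EFAULT return value
      are assumptions made for illustration. The sketch only shows why a
      helper that needs a real skb has to fail when the dissector runs on raw
      bytes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a real skb: just a byte buffer. */
struct sketch_skb {
	const uint8_t *bytes;
	size_t len;
};

/* Hypothetical dissector context: raw data plus an optional skb. */
struct sketch_flow_ctx {
	const uint8_t *data;
	const uint8_t *data_end;
	const struct sketch_skb *skb;	/* NULL in skb-less mode */
};

/*
 * Copy len bytes at offset from the skb into 'to'.
 * Fails (and zeroes 'to') when there is no skb or the range is invalid.
 */
static int sketch_load_bytes(const struct sketch_flow_ctx *ctx, size_t offset,
			     void *to, size_t len)
{
	if (!ctx->skb || offset > ctx->skb->len ||
	    len > ctx->skb->len - offset) {
		memset(to, 0, len);
		return -EFAULT;	/* assumed error code */
	}
	memcpy(to, ctx->skb->bytes + offset, len);
	return 0;
}
```

      With a context whose skb field is NULL, every call fails, which is the
      behaviour the bpf_skb_load_bytes selftest in this series checks for.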
    • Florian Fainelli's avatar
      net: systemport: Remove need for DMA descriptor · 7e6e185c
      Florian Fainelli authored
      All we do is write the length/status and address bits to a DMA
      descriptor only to write its contents into on-chip registers right
      after. Eliminate this unnecessary step.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e6e185c
    • Ido Schimmel's avatar
      bridge: Fix possible use-after-free when deleting bridge port · 697cd36c
      Ido Schimmel authored
      When a bridge port is being deleted, do not dereference it later in
      br_vlan_port_event() as it can result in a use-after-free [1] if the RCU
      callback was executed before invoking the function.
      
      [1]
      [  129.638551] ==================================================================
      [  129.646904] BUG: KASAN: use-after-free in br_vlan_port_event+0x53c/0x5fd
      [  129.654406] Read of size 8 at addr ffff8881e4aa1ae8 by task ip/483
      [  129.663008] CPU: 0 PID: 483 Comm: ip Not tainted 5.1.0-rc5-custom-02265-ga946bd73daac #1383
      [  129.672359] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
      [  129.682484] Call Trace:
      [  129.685242]  dump_stack+0xa9/0x10e
      [  129.689068]  print_address_description.cold.2+0x9/0x25e
      [  129.694930]  kasan_report.cold.3+0x78/0x9d
      [  129.704420]  br_vlan_port_event+0x53c/0x5fd
      [  129.728300]  br_device_event+0x2c7/0x7a0
      [  129.741505]  notifier_call_chain+0xb5/0x1c0
      [  129.746202]  rollback_registered_many+0x895/0xe90
      [  129.793119]  unregister_netdevice_many+0x48/0x210
      [  129.803384]  rtnl_delete_link+0xe1/0x140
      [  129.815906]  rtnl_dellink+0x2a3/0x820
      [  129.844166]  rtnetlink_rcv_msg+0x397/0x910
      [  129.868517]  netlink_rcv_skb+0x137/0x3a0
      [  129.882013]  netlink_unicast+0x49b/0x660
      [  129.900019]  netlink_sendmsg+0x755/0xc90
      [  129.915758]  ___sys_sendmsg+0x761/0x8e0
      [  129.966315]  __sys_sendmsg+0xf0/0x1c0
      [  129.988918]  do_syscall_64+0xa4/0x470
      [  129.993032]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  129.998696] RIP: 0033:0x7ff578104b58
      ...
      [  130.073811] Allocated by task 479:
      [  130.077633]  __kasan_kmalloc.constprop.5+0xc1/0xd0
      [  130.083008]  kmem_cache_alloc_trace+0x152/0x320
      [  130.088090]  br_add_if+0x39c/0x1580
      [  130.092005]  do_set_master+0x1aa/0x210
      [  130.096211]  do_setlink+0x985/0x3100
      [  130.100224]  __rtnl_newlink+0xc52/0x1380
      [  130.104625]  rtnl_newlink+0x6b/0xa0
      [  130.108541]  rtnetlink_rcv_msg+0x397/0x910
      [  130.113136]  netlink_rcv_skb+0x137/0x3a0
      [  130.117538]  netlink_unicast+0x49b/0x660
      [  130.121939]  netlink_sendmsg+0x755/0xc90
      [  130.126340]  ___sys_sendmsg+0x761/0x8e0
      [  130.130645]  __sys_sendmsg+0xf0/0x1c0
      [  130.134753]  do_syscall_64+0xa4/0x470
      [  130.138864]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [  130.146195] Freed by task 0:
      [  130.149421]  __kasan_slab_free+0x125/0x170
      [  130.154016]  kfree+0xf3/0x310
      [  130.157349]  kobject_put+0x1a8/0x4c0
      [  130.161363]  rcu_core+0x859/0x19b0
      [  130.165175]  __do_softirq+0x250/0xa26
      [  130.170956] The buggy address belongs to the object at ffff8881e4aa1ae8
                      which belongs to the cache kmalloc-1k of size 1024
      [  130.184972] The buggy address is located 0 bytes inside of
                      1024-byte region [ffff8881e4aa1ae8, ffff8881e4aa1ee8)
      
      Fixes: 9c0ec2e7 ("bridge: support binding vlan dev link state to vlan member bridge ports")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Cc: Mike Manning <mmanning@vyatta.att-mail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Mike Manning <mmanning@vyatta.att-mail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      697cd36c