1. 08 Oct, 2018 4 commits
  2. 05 Oct, 2018 6 commits
    • Merge branch 'bpf-xsk-fix-mixed-mode' · df1ea77b
      Daniel Borkmann authored
      Magnus Karlsson says:
      
      ====================
      Previously, the xsk code did not record which umem was bound to a
      specific queue id. This was not required if all drivers were zero-copy
      enabled as this had to be recorded in the driver anyway. So if a user
      tried to bind two umems to the same queue, the driver would say
      no. But if copy-mode was first enabled and then zero-copy mode (or the
      reverse order), we mistakenly enabled both of them on the same umem
      leading to buggy behavior. The main culprit for this is that we did
      not store the association of umem to queue id in the copy case and
      only relied on the driver reporting this. As this relation was not
      stored in the driver for copy mode (it does not rely on the AF_XDP
      NDOs), this obviously could not work.
      
      This patch fixes the problem by always recording the umem to queue id
      relationship in the netdev_queue and netdev_rx_queue structs. This way
      we always know what kind of umem has been bound to a queue id and can
      act appropriately at bind time. To make the bind semantics consistent
      with ethtool queue manipulations and to facilitate the implementation
      of drivers, we also forbid decreasing the number of queues/channels
      with ethtool if there is an active AF_XDP socket in the set of queues
      that are disabled.
      
      Jakub, please take a look at your patches. The last one I had to
      change slightly to make it fit with the new interface
      xdp_get_umem_from_qid(). An added bonus of this function is that we
      can, in the future, also use it from the driver to get a umem, thus
      simplifying driver implementations (and later remove the umem from the
      NDO completely). Björn will mail patches at a later point in time that
      use this in the i40e and ixgbe drivers and remove a good chunk of
      code from the ZC implementations. I also made your code aware of Tx
      queues. If we create a socket that only has a Tx queue, then the queue
      id will refer to a Tx queue id only and could be larger than the
      available number of Rx queues. Please take a look at it.
      
      Differences against v1:
      * Included patches from Jakub that forbid decreasing the number of active
        queues if a queue to be deactivated has an AF_XDP socket. These have
        been adapted somewhat to the new interfaces in patch 2.
      * Removed redundant check against real_num_[rt]x_queues in xsk_bind
      * Only need to test against real_num_[rt]x_queues in
        xdp_clear_umem_at_qid.
      
      Patch 1: Introduces a umem reference in the netdev_rx_queue and
               netdev_queue structs.
      Patch 2: Records which queue_id is bound to which umem and makes sure
               that you cannot bind two different umems to the same queue_id.
      Patch 3: Pre-patch to ethtool_set_channels.
      Patch 4: Forbids decreasing the number of active queues if a deactivated
               queue has an AF_XDP socket.
      Patch 5: Simplifies xdp_clear_umem_at_qid now that ethtool cannot deactivate
               the queue id we are running on.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: simplify xdp_clear_umem_at_qid implementation · a41b4f3c
      Magnus Karlsson authored
      As we now do not allow ethtool to deactivate the queue id we are
      running an AF_XDP socket on, we can simplify the implementation of
      xdp_clear_umem_at_qid().
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • ethtool: don't allow disabling queues with umem installed · 1661d346
      Jakub Kicinski authored
      We already check that the RSS indirection table does not use queues which
      would be disabled by channel reconfiguration. Make sure the user does not
      try to disable queues which have a UMEM and zero-copy AF_XDP socket
      installed.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
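
The check described in this commit can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: `struct dev_model`, `set_channels()` and `MAX_QUEUES` are hypothetical names invented for illustration. It shows the idea of refusing a channel reduction when any queue that would be disabled still has a umem installed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_QUEUES 8 /* hypothetical upper bound for this model */

/* Hypothetical per-device state: has_umem[i] is true when queue i
 * has a umem (and therefore an AF_XDP socket) installed. */
struct dev_model {
	unsigned int active_queues;
	bool has_umem[MAX_QUEUES];
};

/* Returns 0 on success, -1 if the request is invalid or would
 * disable a queue that still has a umem installed. */
int set_channels(struct dev_model *dev, unsigned int new_count)
{
	unsigned int q;

	assert(dev != NULL);
	if (new_count == 0 || new_count > MAX_QUEUES)
		return -1;

	/* Queues in [new_count, active_queues) would be disabled. */
	for (q = new_count; q < dev->active_queues; q++)
		if (dev->has_umem[q])
			return -1;

	dev->active_queues = new_count;
	return 0;
}
```

In this model, increasing the queue count never trips the check, because the loop body only runs for queues that are about to be disabled.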
    • ethtool: rename local variable max -> curr · b8c8a2e2
      Jakub Kicinski authored
      ethtool_set_channels() validates the config against driver's max
      settings. It retrieves the current config and stores it in a
      variable called max. This was okay when only max settings were
      accessed but we will soon want to access current settings as
      well, so calling the entire structure max makes the code less
      readable.
      
      While at it drop unnecessary parentheses.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: fix bug when trying to use both copy and zero-copy on one queue id · c9b47cc1
      Magnus Karlsson authored
      Previously, the xsk code did not record which umem was bound to a
      specific queue id. This was not required if all drivers were zero-copy
      enabled as this had to be recorded in the driver anyway. So if a user
      tried to bind two umems to the same queue, the driver would say
      no. But if copy-mode was first enabled and then zero-copy mode (or the
      reverse order), we mistakenly enabled both of them on the same umem
      leading to buggy behavior. The main culprit for this is that we did
      not store the association of umem to queue id in the copy case and
      only relied on the driver reporting this. As this relation was not
      stored in the driver for copy mode (it does not rely on the AF_XDP
      NDOs), this obviously could not work.
      
      This patch fixes the problem by always recording the umem to queue id
      relationship in the netdev_queue and netdev_rx_queue structs. This way
      we always know what kind of umem has been bound to a queue id and can
      act appropriately at bind time.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
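
The fix described above, always recording which umem is bound to a queue id and consulting that record at bind time, can be sketched as a user-space model. Everything here is an assumption made for illustration: `struct umem_model`, `struct queue_model`, `bind_umem()`, `get_umem_from_qid()` and `clear_umem_at_qid()` are hypothetical stand-ins and do not reproduce the kernel's netdev_queue/netdev_rx_queue code.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_QUEUES 4 /* hypothetical queue count for this model */

/* Hypothetical umem: only records whether it is zero-copy. */
struct umem_model {
	int zero_copy;
};

/* Hypothetical queue: umem is NULL while nothing is bound. */
struct queue_model {
	struct umem_model *umem;
};

static struct queue_model queues[NUM_QUEUES];

/* Look up the umem recorded for a queue id, or NULL. */
struct umem_model *get_umem_from_qid(unsigned int qid)
{
	return qid < NUM_QUEUES ? queues[qid].umem : NULL;
}

/* Record the binding regardless of copy or zero-copy mode.
 * Returns -1 if a different umem already owns the queue id. */
int bind_umem(struct umem_model *umem, unsigned int qid)
{
	assert(umem != NULL);
	if (qid >= NUM_QUEUES)
		return -1;
	if (queues[qid].umem != NULL && queues[qid].umem != umem)
		return -1;

	queues[qid].umem = umem;
	return 0;
}

/* Forget the binding when the socket goes away. */
void clear_umem_at_qid(unsigned int qid)
{
	if (qid < NUM_QUEUES)
		queues[qid].umem = NULL;
}
```

Because the record is kept for both modes, a zero-copy bind after a copy-mode bind on the same queue id is rejected in this model, which is the mixed-mode case the commit message describes.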
    • net: add umem reference in netdev{_rx}_queue · 661b8d1b
      Magnus Karlsson authored
      These references to the umem will be used to store information
      about what kind of AF_XDP umem, if any, is bound to a queue id.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. 04 Oct, 2018 10 commits
    • bpf: typo fix in Documentation/networking/af_xdp.rst · 7ccc4f18
      Konrad Djimeli authored
      Fix a simple typo: Completetion -> Completion
      Signed-off-by: Konrad Djimeli <kdjimeli@igalia.com>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf, tracex3_user: erase "ARRAY_SIZE" redefined · 20cdeb54
      Bo YU authored
      There is a warning when compiling bpf sample programs in samples/bpf:
      
        make -C /home/foo/bpf/samples/bpf/../../tools/lib/bpf/ RM='rm -rf' LDFLAGS= srctree=/home/foo/bpf/samples/bpf/../../ O=
          HOSTCC  /home/foo/bpf/samples/bpf/tracex3_user.o
        /home/foo/bpf/samples/bpf/tracex3_user.c:20:0: warning: "ARRAY_SIZE" redefined
         #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
      
        In file included from /home/foo/bpf/samples/bpf/tracex3_user.c:18:0:
        ./tools/testing/selftests/bpf/bpf_util.h:48:0: note: this is the location of the previous definition
         # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
      Signed-off-by: Bo YU <tsu.yubo@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Merge branch 'bpf-libbpf-consistent-iface' · fc1dc766
      Daniel Borkmann authored
      Andrey Ignatov says:
      
      ====================
      This patch set renames a few interfaces in libbpf, mostly netlink related,
      so that all symbols provided by the library have only three possible
      prefixes:
      
      % nm -D tools/lib/bpf/libbpf.so  | \
          awk '$2 == "T" {sub(/[_\(].*/, "", $3); if ($3) print $3}' | \
          sort | \
          uniq -c
           91 bpf
            8 btf
           14 libbpf
      
      libbpf is used more and more outside the kernel tree. That means the library
      should follow good practices in library design and implementation to
      play well with third-party code that uses it.
      
      One such practice is to have a common prefix (or a few) for every
      interface, function or data structure the library provides. It helps to
      avoid name conflicts with other libraries and keeps the API/ABI consistent.
      
      Inconsistent names in libbpf already cause problems in real life. E.g.
      an application can't use both libbpf and libnl due to conflicting
      symbols (specifically nla_parse, nla_parse_nested and a few others).
      
      Some of the problematic global symbols are not part of the ABI and can be
      restricted from export with either a visibility attribute/pragma or an export
      map (which is useful by itself and can be done in addition). That won't
      solve the problem for those that are part of the ABI, though. Also, export
      restrictions would help only in the DSO case. If a third-party application
      links libbpf statically it won't help, and people do it (e.g. Facebook links
      most libraries statically, including libbpf).
      
      libbpf already uses the following prefixes for its interfaces:
      * bpf_ for bpf system call wrappers, program/map/elf-object
        abstractions and a few other things;
      * btf_ for BTF related API;
      * libbpf_ for everything else.
      
      The patch set adds the libbpf_ prefix to interfaces that use none of the
      prefixes mentioned above and don't fit well into the first two categories.
      
      The long-term benefits of having a common prefix should outweigh the possible
      inconvenience of changing the API for those functions now.
      
      Patches 2-4 add the libbpf_ prefix to libbpf interfaces, with a separate patch
      per header. The other patches are simple API improvements.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Use __u32 instead of u32 in bpf_program__load · e5b0863c
      Andrey Ignatov authored
      Make bpf_program__load consistent with other interfaces: use __u32
      instead of u32. That in turn fixes build of samples:
      
      In file included from ./samples/bpf/trace_output_user.c:21:0:
      ./tools/lib/bpf/libbpf.h:132:9: error: unknown type name ‘u32’
               u32 kern_version);
               ^
      
      Fixes: commit 29cd77f4 ("libbpf: Support loading individual progs")
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Make include guards consistent · eff81908
      Andrey Ignatov authored
      Rename include guards to have consistent names "__LIBBPF_<header_name>".
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Consistent prefixes for interfaces in str_error.h. · 24d6a808
      Andrey Ignatov authored
      libbpf is used more and more outside the kernel tree. That means the library
      should follow good practices in library design and implementation to
      play well with third-party code that uses it.
      
      One such practice is to have a common prefix (or a few) for every
      interface, function or data structure the library provides. It helps to
      avoid name conflicts with other libraries and keeps the API consistent.
      
      Inconsistent names in libbpf already cause problems in real life. E.g.
      an application can't use both libbpf and libnl due to conflicting
      symbols.
      
      Having a common prefix will help fix current problems and avoid future ones.
      
      libbpf already uses the following prefixes for its interfaces:
      * bpf_ for bpf system call wrappers, program/map/elf-object
        abstractions and a few other things;
      * btf_ for BTF related API;
      * libbpf_ for everything else.
      
      The patch renames the function in str_error.h to have the libbpf_ prefix,
      since it lacks one and doesn't fit well into the first two categories.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Consistent prefixes for interfaces in nlattr.h. · f04bc8a4
      Andrey Ignatov authored
      libbpf is used more and more outside the kernel tree. That means the library
      should follow good practices in library design and implementation to
      play well with third-party code that uses it.
      
      One such practice is to have a common prefix (or a few) for every
      interface, function or data structure the library provides. It helps to
      avoid name conflicts with other libraries and keeps the API consistent.
      
      Inconsistent names in libbpf already cause problems in real life. E.g.
      an application can't use both libbpf and libnl due to conflicting
      symbols.
      
      Having a common prefix will help fix current problems and avoid future ones.
      
      libbpf already uses the following prefixes for its interfaces:
      * bpf_ for bpf system call wrappers, program/map/elf-object
        abstractions and a few other things;
      * btf_ for BTF related API;
      * libbpf_ for everything else.
      
      The patch adds the libbpf_ prefix to interfaces in nlattr.h that use none of
      the prefixes mentioned above and don't fit well into the first two
      categories.
      
      Since the affected part of the API is used in bpftool, the patch applies the
      corresponding change to bpftool as well. Having it in a separate patch
      would leave the tree in a state where bpftool is broken, which may not be a
      good idea.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Consistent prefixes for interfaces in libbpf.h. · aae57780
      Andrey Ignatov authored
      libbpf is used more and more outside the kernel tree. That means the library
      should follow good practices in library design and implementation to
      play well with third-party code that uses it.
      
      One such practice is to have a common prefix (or a few) for every
      interface, function or data structure the library provides. It helps to
      avoid name conflicts with other libraries and keeps the API consistent.
      
      Inconsistent names in libbpf already cause problems in real life. E.g.
      an application can't use both libbpf and libnl due to conflicting
      symbols.
      
      Having a common prefix will help fix current problems and avoid future ones.
      
      libbpf already uses the following prefixes for its interfaces:
      * bpf_ for bpf system call wrappers, program/map/elf-object
        abstractions and a few other things;
      * btf_ for BTF related API;
      * libbpf_ for everything else.
      
      The patch adds the libbpf_ prefix to functions and the typedef in libbpf.h
      that use none of the prefixes mentioned above and don't fit well into the
      first two categories.
      
      Since the affected part of the API is used in bpftool, the patch applies the
      corresponding change to bpftool as well. Having it in a separate patch
      would leave the tree in a state where bpftool is broken, which may not be a
      good idea.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Move __dump_nlmsg_t from API to implementation · 434fe9d4
      Andrey Ignatov authored
      This typedef is used only by implementation in netlink.c. Nothing uses
      it in public API. Move it to netlink.c.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net: core: Fix build with CONFIG_IPV6=m · d71019b5
      Joe Stringer authored
      Stephen Rothwell reports the following link failure with IPv6 as module:
      
        x86_64-linux-gnu-ld: net/core/filter.o: in function `sk_lookup':
        (.text+0x19219): undefined reference to `__udp6_lib_lookup'
      
      Fix the build by only enabling the IPv6 socket lookup if IPv6 support is
      compiled into the kernel.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  4. 03 Oct, 2018 14 commits
    • Merge branch 'bpf-sk-lookup' · 33d9a7fd
      Daniel Borkmann authored
      Joe Stringer says:
      
      ====================
      This series proposes a new helper for the BPF API which allows BPF programs to
      perform lookups for sockets in a network namespace. This would allow programs
      to determine early on in processing whether the stack is expecting to receive
      the packet, and perform some action (eg drop, forward somewhere) based on this
      information.
      
      The series is structured roughly into:
      * Misc refactor
      * Add the socket pointer type
      * Add reference tracking to ensure that socket references are freed
      * Extend the BPF API to add sk_lookup_xxx() / sk_release() functions
      * Add tests/documentation
      
      The helper proposed in this series includes a parameter for a tuple which must
      be filled in by the caller to determine the socket to look up. The simplest
      case would be filling with the contents of the packet, ie mapping the packet's
      5-tuple into the parameter. In common cases, it may alternatively be useful to
      reverse the direction of the tuple and perform a lookup, to find the socket
      that initiates this connection; and if the BPF program ever performs a form of
      IP address translation, it may further be useful to be able to look up
      arbitrary tuples that are not based upon the packet, but instead based on state
      held in BPF maps or hardcoded in the BPF program.
      
      Currently, access to the socket's fields is limited to those which are
      otherwise already accessible, and is restricted to read-only access.
      
      Changes since v3:
      * New patch: "bpf: Reuse canonical string formatter for ctx errs"
      * Add PTR_TO_SOCKET to is_ctx_reg().
      * Add a few new checks to prevent mixing of socket/non-socket pointers.
      * Swap order of checks in sock_filter_is_valid_access().
      * Prefix register spill macros with "bpf_".
      * Add acks from previous round
      * Rebase
      
      Changes since v2:
      * New patch: "selftests/bpf: Generalize dummy program types".
        This enables adding verifier tests for socket lookup with tail calls.
      * Define the semantics of the new helpers more clearly in uAPI header.
      * Fix release of caller_net when netns is not specified.
      * Use skb->sk to find caller net when skb->dev is unavailable.
      * Fix build with !CONFIG_NET.
      * Replace ptr_id defensive coding when releasing reference state with an
        internal error (-EFAULT).
      * Remove flags argument to sk_release().
      * Add several new assembly tests suggested by Daniel.
      * Add a few new C tests.
      * Fix typo in verifier error message.
      
      Changes since v1:
      * Limit netns_id field to 32 bits
      * Reuse reg_type_mismatch() in more places
      * Reduce the number of passes at convert_ctx_access()
      * Replace ptr_id defensive coding when releasing reference state with an
        internal error (-EFAULT)
      * Rework 'struct bpf_sock_tuple' to allow passing a packet pointer
      * Allow direct packet access from helper
      * Fix compile error with CONFIG_IPV6 enabled
      * Improve commit messages
      
      Changes since RFC:
      * Split up sk_lookup() into sk_lookup_tcp(), sk_lookup_udp().
      * Only take references on the socket when necessary.
        * Make sk_release() only free the socket reference in this case.
      * Fix some runtime reference leaks:
        * Disallow BPF_LD_[ABS|IND] instructions while holding a reference.
        * Disallow bpf_tail_call() while holding a reference.
      * Prevent the same instruction being used for reference and other
        pointer type.
      * Simplify locating copies of a reference during helper calls by caching
        the pointer id from the caller.
      * Fix kbuild compilation warnings with particular configs.
      * Improve code comments describing the new verifier pieces.
      * Tested by Nitin
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
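
The cover letter mentions reversing the direction of the tuple before a lookup. A minimal sketch of that step is shown below, assuming a simplified IPv4-only tuple. `struct tuple_model` and `reverse_tuple()` are hypothetical names for illustration; the real `struct bpf_sock_tuple` is laid out differently and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4-only tuple used only for this sketch. */
struct tuple_model {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Swap source and destination so that a lookup describes the
 * opposite direction of the same flow. */
struct tuple_model reverse_tuple(const struct tuple_model *t)
{
	struct tuple_model r;

	assert(t != NULL);
	r.saddr = t->daddr;
	r.daddr = t->saddr;
	r.sport = t->dport;
	r.dport = t->sport;
	return r;
}
```

Reversing twice yields the original tuple, so a program could keep a single tuple value and flip it as needed.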
    • Documentation: Describe bpf reference tracking · a610b665
      Joe Stringer authored
      Document the new pointer types in the verifier and how the pointer ID
      tracking works to ensure that references which are taken are later
      released.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Add C tests for reference tracking · de375f4e
      Joe Stringer authored
      Add some tests that demonstrate and test the balanced lookup/free
      nature of socket lookup. Section names that start with "fail" represent
      programs that are expected to fail verification; all others should
      succeed.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Support loading individual progs · 29cd77f4
      Joe Stringer authored
      Allow the individual program load to be invoked. This will help with
      testing, where a single ELF may contain several sections, some of which
      denote subprograms that are expected to fail verification, along with
      some which are expected to pass verification. By allowing programs to be
      iterated and individually loaded, each program can be independently
      checked against its expected verification result.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Add tests for reference tracking · b584ab88
      Joe Stringer authored
      reference tracking: leak potential reference
      reference tracking: leak potential reference on stack
      reference tracking: leak potential reference on stack 2
      reference tracking: zero potential reference
      reference tracking: copy and zero potential references
      reference tracking: release reference without check
      reference tracking: release reference
      reference tracking: release reference twice
      reference tracking: release reference twice inside branch
      reference tracking: alloc, check, free in one subbranch
      reference tracking: alloc, check, free in both subbranches
      reference tracking in call: free reference in subprog
      reference tracking in call: free reference in subprog and outside
      reference tracking in call: alloc & leak reference in subprog
      reference tracking in call: alloc in subprog, release outside
      reference tracking in call: sk_ptr leak into caller stack
      reference tracking in call: sk_ptr spill into caller stack
      reference tracking: allow LD_ABS
      reference tracking: forbid LD_ABS while holding reference
      reference tracking: allow LD_IND
      reference tracking: forbid LD_IND while holding reference
      reference tracking: check reference or tail call
      reference tracking: release reference then tail call
      reference tracking: leak possible reference over tail call
      reference tracking: leak checked reference over tail call
      reference tracking: mangle and release sock_or_null
      reference tracking: mangle and release sock
      reference tracking: access member
      reference tracking: write to member
      reference tracking: invalid 64-bit access of member
      reference tracking: access after release
      reference tracking: direct access for lookup
      unpriv: spill/fill of different pointers stx - ctx and sock
      unpriv: spill/fill of different pointers stx - leak sock
      unpriv: spill/fill of different pointers stx - sock and ctx (read)
      unpriv: spill/fill of different pointers stx - sock and ctx (write)
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Generalize dummy program types · 0c586079
      Joe Stringer authored
      Don't hardcode the dummy program types to SOCKET_FILTER type, as this
      prevents testing bpf_tail_call in conjunction with other program types.
      Instead, use the program type specified in the test case.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add helper to retrieve socket in BPF · 6acc9b43
      Joe Stringer authored
      This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
      bpf_sk_lookup_udp(), which allow BPF programs to find out if there is a
      socket listening on this host, and return a socket pointer which the
      BPF program can then access to determine, for instance, whether to
      forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
      socket, so when a BPF program makes use of this function, it must
      subsequently pass the returned pointer into the newly added
      bpf_sk_release() to return the reference.
      
      By way of example, the following pseudocode would filter inbound
      connections at XDP if there is no corresponding service listening for
      the traffic:
      
        struct bpf_sock_tuple tuple;
        struct bpf_sock *sk;
      
        populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
        sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
        if (!sk) {
          // Couldn't find a socket listening for this traffic. Drop.
          return TC_ACT_SHOT;
        }
        bpf_sk_release(sk, 0);
        return TC_ACT_OK;
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add reference tracking to verifier · fd978bf7
      Joe Stringer authored
      Allow helper functions to acquire a reference and return it into a
      register. Specific pointer types such as the PTR_TO_SOCKET will
      implicitly represent such a reference. The verifier must ensure that
      these references are released exactly once in each path through the
      program.
      
      To achieve this, this commit assigns an id to the pointer and tracks it
      in the 'bpf_func_state', then when the function or program exits,
      verifies that all of the acquired references have been freed. When the
      pointer is passed to a function that frees the reference, it is removed
      from the 'bpf_func_state' and all existing copies of the pointer in
      registers are marked invalid.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
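
The bookkeeping described in this commit (assign an id on acquire, drop it and invalidate every copy on release, and require an empty set at exit) can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the verifier's code: `struct func_state_model`, `acquire_ref()`, `copy_reg()`, `release_ref()` and `exit_ok()` are hypothetical names, and registers are reduced to a single tracked id each.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REFS 8 /* hypothetical limits for this model */
#define NUM_REGS 4

struct func_state_model {
	int ref_ids[MAX_REFS];    /* ids of references not yet released */
	int nr_refs;
	int reg_ref_id[NUM_REGS]; /* 0 means no tracked pointer in the register */
	int next_id;
};

/* A helper call acquires a reference and returns it in a register. */
int acquire_ref(struct func_state_model *st, int reg)
{
	int id;

	assert(reg >= 0 && reg < NUM_REGS);
	if (st->nr_refs == MAX_REFS)
		return -1;

	id = ++st->next_id;
	st->ref_ids[st->nr_refs++] = id;
	st->reg_ref_id[reg] = id;
	return id;
}

/* Copying a register copies the id it carries. */
void copy_reg(struct func_state_model *st, int dst, int src)
{
	st->reg_ref_id[dst] = st->reg_ref_id[src];
}

/* Releasing removes the id and invalidates every register copy. */
int release_ref(struct func_state_model *st, int reg)
{
	int id = st->reg_ref_id[reg];
	int i, r;

	if (id == 0)
		return -1; /* nothing to release, or already released */

	for (i = 0; i < st->nr_refs; i++) {
		if (st->ref_ids[i] != id)
			continue;
		st->ref_ids[i] = st->ref_ids[--st->nr_refs];
		for (r = 0; r < NUM_REGS; r++)
			if (st->reg_ref_id[r] == id)
				st->reg_ref_id[r] = 0;
		return 0;
	}
	return -1;
}

/* At exit, every acquired reference must have been released. */
bool exit_ok(const struct func_state_model *st)
{
	return st->nr_refs == 0;
}
```

In this model, a second release through a copied register fails because the copy was invalidated by the first release, which mirrors the "exactly once" requirement stated in the commit message.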
    • bpf: Macrofy stack state copy · 84dbf350
      Joe Stringer authored
      An upcoming commit will need very similar copy/realloc boilerplate, so
      refactor the existing stack copy/realloc functions into macros to
      simplify it.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add PTR_TO_SOCKET verifier type · c64b7983
      Joe Stringer authored
      Teach the verifier a little bit about a new type of pointer, a
      PTR_TO_SOCKET. This pointer type is accessed from BPF through the
      'struct bpf_sock' structure.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Generalize ptr_or_null regs check · 840b9615
      Joe Stringer authored
      This check will be reused by an upcoming commit for conditional jump
      checks for sockets. Refactor it a bit to simplify the later commit.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Reuse canonical string formatter for ctx errs · 9d2be44a
      Joe Stringer authored
      The array "reg_type_str" provides canonical formatting of register
      types, however a couple of places would previously check whether a
      register represented the context and write the name "context" directly.
      An upcoming commit will add another pointer type to these statements, so
      to provide more accurate error messages in the verifier, update these
      error messages to use "reg_type_str" instead.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Simplify ptr_min_max_vals adjustment · aad2eeaf
      Joe Stringer authored
      An upcoming commit will add another two pointer types that need very
      similar behaviour, so generalise this function now.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add iterator for spilled registers · f3709f69
      Joe Stringer authored
      Add an iterator for spilled registers. It concentrates the details of
      how to get the current frame's spilled registers into a single macro
      while clarifying the intention of the code which is calling the macro.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. 02 Oct, 2018 4 commits
  6. 01 Oct, 2018 2 commits
    • Merge branch 'bpf-per-cpu-cgroup-storage' · cb86d0f8
      Daniel Borkmann authored
      Roman Gushchin says:
      
      ====================
      This patchset implements per-cpu cgroup local storage and provides
      an example of how per-cpu and shared cgroup local storage can be used
      for efficient accounting of network traffic.
      
      v4->v3:
        1) incorporated Alexei's feedback
      
      v3->v2:
        1) incorporated Song's feedback
        2) rebased on top of current bpf-next
      
      v2->v1:
        1) added a selftest implementing network counters
        2) added a missing free() in cgroup local storage selftest
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: cgroup local storage-based network counters · 371e4fcc
      Roman Gushchin authored
      This commit adds a bpf kselftest, which demonstrates how percpu
      and shared cgroup local storage can be used for efficient lookup-free
      network accounting.
      
      Cgroup local storage provides a generic memory area with very efficient
      lookup-free access. To avoid expensive atomic operations for each
      packet, per-cpu cgroup local storage is used. Each packet is initially
      charged to a per-cpu counter, and only when the counter reaches a certain
      value (32 in this case) is the charge moved into the global atomic
      counter. This amortizes the cost of atomic operations while keeping
      reasonable accuracy.
      
      The test also implements naive network traffic throttling, mostly to
      demonstrate the possibility of bpf cgroup-based network bandwidth
      control.
      
      Expected output:
        ./test_netcnt
        test_netcnt:PASS
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
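
The batching scheme described in this commit, charging each packet to a per-cpu counter and moving the charge to a shared atomic counter only at a threshold of 32, can be sketched in plain C11. This is a user-space illustration under stated assumptions, not the selftest's BPF code: `percpu_packets`, `total_packets`, `account_packet()` and `read_total()` are hypothetical names, and a plain array indexed by a caller-supplied CPU number stands in for real per-cpu storage.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS_MODEL 4    /* hypothetical CPU count for this model */
#define FLUSH_THRESHOLD 32 /* value quoted in the commit message */

/* Stand-in for per-cpu storage: one slot per CPU, touched without atomics. */
static unsigned long percpu_packets[NR_CPUS_MODEL];

/* Shared counter, updated with one atomic operation per flush. */
static atomic_ulong total_packets;

/* Charge one packet to the given CPU's local counter and flush the
 * accumulated charge to the shared counter once the threshold is hit. */
void account_packet(unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);
	if (++percpu_packets[cpu] < FLUSH_THRESHOLD)
		return;

	atomic_fetch_add(&total_packets, percpu_packets[cpu]);
	percpu_packets[cpu] = 0;
}

/* The shared total lags by at most FLUSH_THRESHOLD - 1 packets per CPU. */
unsigned long read_total(void)
{
	return atomic_load(&total_packets);
}
```

The trade-off is visible in `read_total()`: the shared value can trail the true count by a bounded amount per CPU, in exchange for one atomic operation per 32 packets instead of one per packet.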