1. 31 Mar, 2018 10 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-cgroup-bind-connect' · 7828f20e
      Daniel Borkmann authored
      Andrey Ignatov says:
      
      ====================
      v2->v3:
      - rebase due to conflicts
      - fix ipv6=m build
      
      v1->v2:
      - support expected_attach_type at prog load time so that prog (incl.
        context accesses and calls to helpers) can be validated with regard to
        specific attach point it is supposed to be attached to.
        Later, at attach time, attach type is checked so that it must be same as
        at load time if it was provided
      - reworked hooks to rely on expected_attach_type, and reduced number of new
        prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
      - reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks
      - add selftests for post-sys_bind hook
      
      For our container management we've been using complicated and fragile setup
      consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
      all containerized applications. Unfortunately it doesn't work for apps that
      don't use glibc and changing all applications that run in the datacenter
      is not possible due to 3rd party code and libraries (despite being
      open source code) and sheer amount of legacy code that has to be rewritten
      (we're rewriting what we can in parallel)
      
      These applications are written without containers in mind and have
      builtin assumptions about network services. Like an application X
      expects to connect localhost:special_port and find service Y in there.
      To move application X and service Y into two different containers
      LD_PRELOAD approach is used to help one service connect to another
      without rewriting them.
      Moving these two applications into different L2 (netns) or L3 (vrf)
      network isolation scopes doesn't help to solve the problem, since
      applications need to see each other like they were running on
      the host without containers.
      So if app X and app Y would run in different netns something
      would need to punch a connectivity hole in those namespaces.
      That would be real layering violation (with corresponding
      network debugging pains), since clean l2, l3 abstraction would
      suddenly support something that breaks through the layers.
      
      Instead we used LD_PRELOAD (and now bpf programs) at bind/connect
      time to help applications discover and connect to each other.
      All applications are running in init_nens and there are no vrfs.
      After bind/connect the normal fib/neighbor core networking
      logic works as it should always do and the whole system is
      clean from network point of view and can be debugged with
      standard tools.
      
      We also considered resurrecting Hannes's afnetns work,
      but all hierarchical namespace abstraction don't work due
      to these builtin networking assumptions inside the apps.
      To run an application inside cgroup container that was not written
      with containers in mind we have to make an illusion of running
      in non-containerized environment.
      In some cases we remember the port and container id in the post-bind hook
      in a bpf map and when some other task in a different container is trying
      to connect to a service we need to know where this service is running.
      It can be remote and can be local. Both client and service may or may not
      be written with containers in mind and this sockaddr rewrite is providing
      connectivity and load balancing feature.
      
      BPF+cgroup looks to be the best solution for this problem.
      Hence we introduce 3 hooks:
      - at entry into sys_bind and sys_connect
        to let bpf prog look and modify 'struct sockaddr' provided
        by user space and fail bind/connect when appropriate
      - post sys_bind after port is allocated
      
      The approach works great and has zero overhead for anyone who doesn't
      use it and very low overhead when deployed.
      
      Different use case for this feature is to do low overhead firewall
      that doesn't need to inspect all packets and works at bind/connect time.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      7828f20e
    • Andrey Ignatov's avatar
      selftests/bpf: Selftest for sys_bind post-hooks. · 1d436885
      Andrey Ignatov authored
      Add selftest for attach types `BPF_CGROUP_INET4_POST_BIND` and
      `BPF_CGROUP_INET6_POST_BIND`.
      
      The main things tested are:
      * prog load behaves as expected (valid/invalid accesses in prog);
      * prog attach behaves as expected (load- vs attach-time attach types);
      * `BPF_CGROUP_INET_SOCK_CREATE` can be attached in a backward compatible
        way;
      * post-hooks return expected result and errno.
      
      Example:
        # ./test_sock
        Test case: bind4 load with invalid access: src_ip6 .. [PASS]
        Test case: bind4 load with invalid access: mark .. [PASS]
        Test case: bind6 load with invalid access: src_ip4 .. [PASS]
        Test case: sock_create load with invalid access: src_port .. [PASS]
        Test case: sock_create load w/o expected_attach_type (compat mode) ..
        [PASS]
        Test case: sock_create load w/ expected_attach_type .. [PASS]
        Test case: attach type mismatch bind4 vs bind6 .. [PASS]
        Test case: attach type mismatch bind6 vs bind4 .. [PASS]
        Test case: attach type mismatch default vs bind4 .. [PASS]
        Test case: attach type mismatch bind6 vs sock_create .. [PASS]
        Test case: bind4 reject all .. [PASS]
        Test case: bind6 reject all .. [PASS]
        Test case: bind6 deny specific IP & port .. [PASS]
        Test case: bind4 allow specific IP & port .. [PASS]
        Test case: bind4 allow all .. [PASS]
        Test case: bind6 allow all .. [PASS]
        Summary: 16 PASSED, 0 FAILED
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1d436885
    • Andrey Ignatov's avatar
      bpf: Post-hooks for sys_bind · aac3fc32
      Andrey Ignatov authored
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair to some map to
      use it in later calls to sys_connect. E.g. if some TCP server inside
      cgroup was bound to some IP:port_n, it can be recorded to a map. And
      later when some TCP client inside same cgroup is trying to connect to
      127.0.0.1:port_n, BPF hook for sys_connect can override the destination
      and connect application to IP:port_n instead of 127.0.0.1:port_n. That
      helps forcing all applications inside a cgroup to use desired IP and not
      break those applications if they e.g. use localhost to communicate
      between each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid access
      to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
      `inet6_bind()` since those fields might not make sense in such cases.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      aac3fc32
    • Andrey Ignatov's avatar
      selftests/bpf: Selftest for sys_connect hooks · 622adafb
      Andrey Ignatov authored
      Add selftest for BPF_CGROUP_INET4_CONNECT and BPF_CGROUP_INET6_CONNECT
      attach types.
      
      Try to connect(2) to specified IP:port and test that:
      * remote IP:port pair is overridden;
      * local end of connection is bound to specified IP.
      
      All combinations of IPv4/IPv6 and TCP/UDP are tested.
      
      Example:
        # tcpdump -pn -i lo -w connect.pcap 2>/dev/null &
        [1] 478
        # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh
        Wait for testing IPv4/IPv6 to become available ... OK
        Load bind4 with invalid type (can pollute stderr) ... REJECTED
        Load bind4 with valid type ... OK
        Attach bind4 with invalid type ... REJECTED
        Attach bind4 with valid type ... OK
        Load connect4 with invalid type (can pollute stderr) libbpf: load bpf \
          program failed: Permission denied
        libbpf: -- BEGIN DUMP LOG ---
        libbpf:
        0: (b7) r2 = 23569
        1: (63) *(u32 *)(r1 +24) = r2
        2: (b7) r2 = 16777343
        3: (63) *(u32 *)(r1 +4) = r2
        invalid bpf_context access off=4 size=4
        [ 1518.404609] random: crng init done
      
        libbpf: -- END LOG --
        libbpf: failed to load program 'cgroup/connect4'
        libbpf: failed to load object './connect4_prog.o'
        ... REJECTED
        Load connect4 with valid type ... OK
        Attach connect4 with invalid type ... REJECTED
        Attach connect4 with valid type ... OK
        Test case #1 (IPv4/TCP):
                Requested: bind(192.168.1.254, 4040) ..
                   Actual: bind(127.0.0.1, 4444)
                Requested: connect(192.168.1.254, 4040) from (*, *) ..
                   Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56068)
        Test case #2 (IPv4/UDP):
                Requested: bind(192.168.1.254, 4040) ..
                   Actual: bind(127.0.0.1, 4444)
                Requested: connect(192.168.1.254, 4040) from (*, *) ..
                   Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56447)
        Load bind6 with invalid type (can pollute stderr) ... REJECTED
        Load bind6 with valid type ... OK
        Attach bind6 with invalid type ... REJECTED
        Attach bind6 with valid type ... OK
        Load connect6 with invalid type (can pollute stderr) libbpf: load bpf \
          program failed: Permission denied
        libbpf: -- BEGIN DUMP LOG ---
        libbpf:
        0: (b7) r6 = 0
        1: (63) *(u32 *)(r1 +12) = r6
        invalid bpf_context access off=12 size=4
      
        libbpf: -- END LOG --
        libbpf: failed to load program 'cgroup/connect6'
        libbpf: failed to load object './connect6_prog.o'
        ... REJECTED
        Load connect6 with valid type ... OK
        Attach connect6 with invalid type ... REJECTED
        Attach connect6 with valid type ... OK
        Test case #3 (IPv6/TCP):
                Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
                   Actual: bind(::1, 6666)
                Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
                   Actual: connect(::1, 6666) from (::6, 37458)
        Test case #4 (IPv6/UDP):
                Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
                   Actual: bind(::1, 6666)
                Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
                   Actual: connect(::1, 6666) from (::6, 39315)
        ### SUCCESS
        # egrep 'connect\(.*AF_INET' connect.trace | \
        > egrep -vw 'htons\(1025\)' | fold -b -s -w 72
        502   connect(7, {sa_family=AF_INET, sin_port=htons(4040),
        sin_addr=inet_addr("192.168.1.254")}, 128) = 0
        502   connect(8, {sa_family=AF_INET, sin_port=htons(4040),
        sin_addr=inet_addr("192.168.1.254")}, 128) = 0
        502   connect(9, {sa_family=AF_INET6, sin6_port=htons(6060),
        inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
        sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
        502   connect(10, {sa_family=AF_INET6, sin6_port=htons(6060),
        inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
        sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
        # fg
        tcpdump -pn -i lo -w connect.pcap 2> /dev/null
        # tcpdump -r connect.pcap -n tcp | cut -c 1-72
        reading from file connect.pcap, link-type EN10MB (Ethernet)
        17:57:40.383533 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [S], seq 1333
        17:57:40.383566 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [S.], seq 112
        17:57:40.383589 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [.], ack 1, w
        17:57:40.384578 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [R.], seq 1,
        17:57:40.403327 IP6 ::6.37458 > ::1.6666: Flags [S], seq 406513443, win
        17:57:40.403357 IP6 ::1.6666 > ::6.37458: Flags [S.], seq 2448389240, ac
        17:57:40.403376 IP6 ::6.37458 > ::1.6666: Flags [.], ack 1, win 342, opt
        17:57:40.404263 IP6 ::1.6666 > ::6.37458: Flags [R.], seq 1, ack 1, win
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      622adafb
    • Andrey Ignatov's avatar
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov authored
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
    • Andrey Ignatov's avatar
      net: Introduce __inet_bind() and __inet6_bind · 3679d585
      Andrey Ignatov authored
      Refactor `bind()` code to make it ready to be called from BPF helper
      function `bpf_bind()` (will be added soon). Implementation of
      `inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and
      `__inet6_bind()` correspondingly. These function can be used from both
      `sk_prot->bind` and `bpf_bind()` contexts.
      
      New functions have two additional arguments.
      
      `force_bind_address_no_port` forces binding to IP only w/o checking
      `inet_sock.bind_address_no_port` field. It'll allow to bind local end of
      a connection to desired IP in `bpf_bind()` w/o changing
      `bind_address_no_port` field of a socket. It's useful since `bpf_bind()`
      can return an error and we'd need to restore original value of
      `bind_address_no_port` in that case if we changed this before calling to
      the helper.
      
      `with_lock` specifies whether to lock socket when working with `struct
      sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old
      behavior is preserved. But it will be set to `false` for `bpf_bind()`
      use-case. The reason is all call-sites, where `bpf_bind()` will be
      called, already hold that socket lock.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      3679d585
    • Andrey Ignatov's avatar
      selftests/bpf: Selftest for sys_bind hooks · e50b0a6f
      Andrey Ignatov authored
      Add selftest to work with bpf_sock_addr context from
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` programs.
      
      Try to bind(2) on IP:port and apply:
      * loads to make sure context can be read correctly, including narrow
        loads (byte, half) for IP and full-size loads (word) for all fields;
      * stores to those fields allowed by verifier.
      
      All combination from IPv4/IPv6 and TCP/UDP are tested.
      
      Both scenarios are tested:
      * valid programs can be loaded and attached;
      * invalid programs can be neither loaded nor attached.
      
      Test passes when expected data can be read from context in the
      BPF-program, and after the call to bind(2) socket is bound to IP:port
      pair that was written by BPF-program to the context.
      
      Example:
        # ./test_sock_addr
        Attached bind4 program.
        Test case #1 (IPv4/TCP):
                Requested: bind(192.168.1.254, 4040) ..
                   Actual: bind(127.0.0.1, 4444)
        Test case #2 (IPv4/UDP):
                Requested: bind(192.168.1.254, 4040) ..
                   Actual: bind(127.0.0.1, 4444)
        Attached bind6 program.
        Test case #3 (IPv6/TCP):
                Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
                   Actual: bind(::1, 6666)
        Test case #4 (IPv6/UDP):
                Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
                   Actual: bind(::1, 6666)
        ### SUCCESS
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      e50b0a6f
    • Andrey Ignatov's avatar
      bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov authored
      == The problem ==
      
      There is a use-case when all processes inside a cgroup should use one
      single IP address on a host that has multiple IP configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provides hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4fbac77d
    • Andrey Ignatov's avatar
      libbpf: Support expected_attach_type at prog load · d7be143b
      Andrey Ignatov authored
      Support setting `expected_attach_type` at prog load time in both
      `bpf/bpf.h` and `bpf/libbpf.h`.
      
      Since both headers already have API to load programs, new functions are
      added not to break backward compatibility for existing ones:
      * `bpf_load_program_xattr()` is added to `bpf/bpf.h`;
      * `bpf_prog_load_xattr()` is added to `bpf/libbpf.h`.
      
      Both new functions accept structures, `struct bpf_load_program_attr` and
      `struct bpf_prog_load_attr` correspondingly, where new fields can be
      added in the future w/o changing the API.
      
      Standard `_xattr` suffix is used to name the new API functions.
      
      Since `bpf_load_program_name()` is not used as heavily as
      `bpf_load_program()`, it was removed in favor of more generic
      `bpf_load_program_xattr()`.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      d7be143b
    • Andrey Ignatov's avatar
      bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov authored
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. context structure may have fields for both IPv4 and IPv6 but it
      doesn't make sense to read from / write to IPv6 field when attach point
      is somewhere in IPv4 stack.
      
      Same applies to BPF-helpers: it may make sense to call some helper from
      some attach point, but not from other for same prog type.
      
      == The solution ==
      
      Introduce `expected_attach_type` field in in `struct bpf_attr` for
      `BPF_PROG_LOAD` command. If scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time prog type is checked to see if attach type for it must
         be known to validate program permissions correctly. Prog will be
         rejected with EINVAL if it's the case and `expected_attach_type` is
         not specified or has invalid value.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if prog type requires to have one, and, if they differ, attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` () and can be used to check context
      accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      5e43f899
  2. 30 Mar, 2018 3 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-sockmap-sg-api-fixes' · 807ae7da
      Daniel Borkmann authored
      Prashant Bhole says:
      
      ====================
      These patches fix sg api usage in sockmap. Previously sockmap didn't
      use sg_init_table(), which caused hitting BUG_ON in sg api, when
      CONFIG_DEBUG_SG is enabled
      
      v1: added sg_init_table() calls wherever needed.
      
      v2:
      - Patch1 adds new helper function in sg api. sg_init_marker()
      - Patch2 sg_init_marker() and sg_init_table() in appropriate places
      
      Backgroud:
      While reviewing v1, John Fastabend raised a valid point about
      unnecessary memset in sg_init_table() because sockmap uses sg table
      which embedded in a struct. As enclosing struct is zeroed out, there
      is unnecessary memset in sg_init_table.
      
      So Daniel Borkmann suggested to define another static inline function
      in scatterlist.h which only initializes sg_magic. Also this function
      will be called from sg_init_table. From this suggestion I defined a
      function sg_init_marker() which sets sg_magic and calls sg_mark_end()
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      807ae7da
    • Prashant Bhole's avatar
      bpf: sockmap: initialize sg table entries properly · 6ef6d84c
      Prashant Bhole authored
      When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized in
      sg_init_table() and it is verified in sg api while navigating. We hit
      BUG_ON when magic check is failed.
      
      In functions sg_tcp_sendpage and sg_tcp_sendmsg, the struct containing
      the scatterlist is already zeroed out. So to avoid extra memset, we
      use sg_init_marker() to initialize sg_magic.
      
      Fixed following things:
      - In bpf_tcp_sendpage: initialize sg using sg_init_marker
      - In bpf_tcp_sendmsg: Replace sg_init_table with sg_init_marker
      - In bpf_tcp_push: Replace memset with sg_init_table where consumed
        sg entry needs to be re-initialized.
      Signed-off-by: default avatarPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      6ef6d84c
    • Prashant Bhole's avatar
      lib/scatterlist: add sg_init_marker() helper · f3851786
      Prashant Bhole authored
      sg_init_marker initializes sg_magic in the sg table and calls
      sg_mark_end() on the last entry of the table. This can be useful to
      avoid memset in sg_init_table() when scatterlist is already zeroed out
      
      For example: when scatterlist is embedded inside other struct and that
      container struct is zeroed out
      Suggested-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f3851786
  3. 29 Mar, 2018 20 commits
  4. 28 Mar, 2018 7 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-raw-tracepoints' · f6ef5658
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      v7->v8:
      - moved 'u32 num_args' from 'struct tracepoint' into 'struct bpf_raw_event_map'
        that increases memory overhead, but can be optimized/compressed later.
        Now it's zero changes in tracepoint.[ch]
      
      v6->v7:
      - adopted Steven's bpf_raw_tp_map section approach to find tracepoint
        and corresponding bpf probe function instead of kallsyms approach.
        dropped kernel_tracepoint_find_by_name() patch
      
      v5->v6:
      - avoid changing semantics of for_each_kernel_tracepoint() function, instead
        introduce kernel_tracepoint_find_by_name() helper
      
      v4->v5:
      - adopted Daniel's fancy REPEAT macro in bpf_trace.c in patch 6
      
      v3->v4:
      - adopted Linus's CAST_TO_U64 macro to cast any integer, pointer, or small
        struct to u64. That nicely reduced the size of patch 1
      
      v2->v3:
      - with Linus's suggestion introduced generic COUNT_ARGS and CONCATENATE macros
        (or rather moved them from apparmor)
        that cleaned up patch 6
      - added patch 4 to refactor trace_iwlwifi_dev_ucode_error() from 17 args to 4
        Now any tracepoint with >12 args will have build error
      
      v1->v2:
      - simplified api by combing bpf_raw_tp_open(name) + bpf_attach(prog_fd) into
        bpf_raw_tp_open(name, prog_fd) as suggested by Daniel.
        That simplifies bpf_detach as well which is now simple close() of fd.
      - fixed memory leak in error path which was spotted by Daniel.
      - fixed bpf_get_stackid(), bpf_perf_event_output() called from raw tracepoints
      - added more tests
      - fixed allyesconfig build caught by buildbot
      
      v1:
      This patch set is a different way to address the pressing need to access
      task_struct pointers in sched tracepoints from bpf programs.
      
      The first approach simply added these pointers to sched tracepoints:
      https://lkml.org/lkml/2017/12/14/753
      which Peter nacked.
      Few options were discussed and eventually the discussion converged on
      doing bpf specific tracepoint_probe_register() probe functions.
      Details here:
      https://lkml.org/lkml/2017/12/20/929
      
      Patch 1 is kernel wide cleanup of pass-struct-by-value into
      pass-struct-by-reference into tracepoints.
      
      Patches 2 and 3 are minor cleanups to address allyesconfig build
      
      Patch 4 refactor trace_iwlwifi_dev_ucode_error from 17 to 4 args
      
      Patch 5 introduces COUNT_ARGS macro
      
      Patch 6 introduces BPF_RAW_TRACEPOINT api.
      the auto-cleanup and multiple concurrent users are must have
      features of tracing api. For bpf raw tracepoints it looks like:
        // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
        prog_fd = bpf_prog_load(...);
      
        // receive anon_inode fd for given bpf_raw_tracepoint
        // and attach bpf program to it
        raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool will automatically
      detach bpf program, unload it and unregister tracepoint probe.
      More details in patch 6.
      
      Patch 7 - trivial support in libbpf
      Patches 8, 9 - user space tests
      
      samples/bpf/test_overhead performance on 1 cpu:
      
      tracepoint    base  kprobe+bpf tracepoint+bpf raw_tracepoint+bpf
      task_rename   1.1M   769K        947K            1.0M
      urandom_read  789K   697K        750K            755K
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f6ef5658
    • Alexei Starovoitov's avatar
      selftests/bpf: test for bpf_get_stackid() from raw tracepoints · 3bbe0869
      Alexei Starovoitov authored
      similar to traditional traceopint test add bpf_get_stackid() test
      from raw tracepoints
      and reduce verbosity of existing stackmap test
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      3bbe0869
    • Alexei Starovoitov's avatar
      samples/bpf: raw tracepoint test · 4662a4e5
      Alexei Starovoitov authored
      add empty raw_tracepoint bpf program to test overhead similar
      to kprobe and traditional tracepoint tests
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4662a4e5
    • Alexei Starovoitov's avatar
      libbpf: add bpf_raw_tracepoint_open helper · a0fe3e57
      Alexei Starovoitov authored
      add bpf_raw_tracepoint_open(const char *name, int prog_fd) api to libbpf
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a0fe3e57
    • Alexei Starovoitov's avatar
      bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Alexei Starovoitov authored
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      >From bpf program point of view the access to the arguments look like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      kprobe+bpf infrastructure allows programs access function arguments.
      This feature allows programs access raw tracepoint arguments.
      
      Similar to proposed 'dynamic ftrace events' there are no abi guarantees
      to what the tracepoints arguments are and what their meaning is.
      The program needs to type cast args properly and use bpf_probe_read()
      helper to access struct fields when argument is a pointer.
      
      For every tracepoint __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet is casting 32-bit 'act' field into 'u64'
      to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
      All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
      and in total this approach adds 7k bytes to .text.
      
      This approach gives the lowest possible overhead
      while calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf are used at speeds of 1M+ events per second
      this is valuable optimization.
      
      The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns anon_inode FD of 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool that uses this feature
      will automatically detach bpf program, unload it and
      unregister tracepoint probe.
      
      On the kernel side the __bpf_raw_tp_map section of pointers to
      tracepoint definition and to __bpf_trace_*() probe function is used
      to find a tracepoint with "xdp_exception" name and
      corresponding __bpf_trace_xdp_exception() probe function
      which are passed to tracepoint_probe_register() to connect probe
      with tracepoint.
      
      Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
      Each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      __bpf_raw_tp_map section logic was contributed by Steven Rostedt
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c4f6699d
    • Alexei Starovoitov's avatar
      macro: introduce COUNT_ARGS() macro · cf14f27f
      Alexei Starovoitov authored
      move COUNT_ARGS() macro from apparmor to generic header and extend it
      to count till twelve.
      
      COUNT() was an alternative name for this logic, but it's used for
      different purpose in many other places.
      
      Similarly for CONCATENATE() macro.
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      cf14f27f
    • Alexei Starovoitov's avatar
      net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint · 4fe43c2c
      Alexei Starovoitov authored
      fix iwlwifi_dev_ucode_error tracepoint to pass pointer to a table
      instead of all 17 arguments by value.
      dvm/main.c and mvm/utils.c have 'struct iwl_error_event_table'
      defined with very similar yet subtly different fields and offsets.
      tracepoint is still common and using definition of 'struct iwl_error_event_table'
      from dvm/commands.h while copying fields.
      Long term this tracepoint probably should be split into two.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      4fe43c2c