1. 10 Apr, 2019 11 commits
    • bpf, libbpf: support global data/bss/rodata sections · d859900c
      Daniel Borkmann authored
      This work adds BPF loader support for global data sections
      to libbpf. This allows writing BPF programs in a more natural
      C-like way by being able to define global variables and const
      data.
      
      Back at LPC 2018 [0] we presented a first prototype which
      implemented support for global data sections by extending BPF
      syscall where union bpf_attr would get additional memory/size
      pair for each section passed during prog load in order to later
      add this base address into the ldimm64 instruction along with
      the user provided offset when accessing a variable. Consensus
      from LPC was that for proper upstream support, it would be
      more desirable to use maps instead of bpf_attr extension as
      this would allow for introspection of these sections as well
      as potential live updates of their content. This work follows
      this path by taking the following steps from loader side:
      
       1) In bpf_object__elf_collect() step we pick up ".data",
          ".rodata", and ".bss" section information.
      
       2) If present, in bpf_object__init_internal_map() we add
          maps to the obj's map array that correspond to each
          of the present sections. Given section size and access
          properties can differ, a single entry array map is
          created with a value size that corresponds to the
          ELF section size of .data, .bss or .rodata. These
          internal maps are integrated into the normal map
          handling of libbpf such that when user traverses all
          obj maps, they can be differentiated from user-created
          ones via bpf_map__is_internal(). In later steps when
          we actually create these maps in the kernel via
          bpf_object__create_maps(), then for .data and .rodata
          sections their content is copied into the map through
          bpf_map_update_elem(). For .bss this is not necessary
          since array map is already zero-initialized by default.
          Additionally, for .rodata the map is frozen as read-only
          after setup, such that neither from program nor syscall
          side writes would be possible.
      
       3) In bpf_program__collect_reloc() step, we record the
          corresponding map, insn index, and relocation type for
          the global data.
      
       4) And last but not least in the actual relocation step in
          bpf_program__relocate(), we mark the ldimm64 instruction
          with src_reg = BPF_PSEUDO_MAP_VALUE where the first imm
          field holds the map's file descriptor, as is done for
          BPF_PSEUDO_MAP_FD, and the second imm field (as ldimm64
          is 2-insn wide) holds the access offset into the section.
          Given these maps have only a single element, ldimm64's
          off remains zero in both parts.
      
       5) On kernel side, this special marked BPF_PSEUDO_MAP_VALUE
          load will then store the actual target address in order
          to have a 'map-lookup'-free access. That is, the actual
          map value base address + offset. The destination register
          in the verifier will then be marked as PTR_TO_MAP_VALUE,
          containing the fixed offset as reg->off and backing BPF
          map as reg->map_ptr. Meaning, it's treated as any other
          normal map value from verification side, only with
          efficient, direct value access instead of actual call to
          map lookup helper as in the typical case.
      
      Currently, only support for static global variables has been
      added, and libbpf rejects non-static global variables from
      loading. This restriction can be lifted once we have proper
      semantics for how BPF will treat multi-object BPF loads. From BTF side,
      libbpf will set the value type id of the types corresponding
      to the ".bss", ".data" and ".rodata" names which LLVM will
      emit without the object name prefix. The key type will be
      left as zero, thus making use of the key-less BTF option in
      array maps.
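
As a rough illustration of what this enables, a BPF C source might declare static globals as below. This is a minimal sketch and not the selftest from this series: the variable names and `load_static_data_sketch()` are hypothetical, and the section annotation a real BPF program carries is left out so that the snippet also compiles as ordinary C.

```c
#include <stdint.h>

/* Zero-initialized static global: typically placed in .bss. */
static volatile uint64_t num_bss;

/* Initialized, writable static global: typically placed in .data. */
static volatile uint64_t num_data = 42;

/* Initialized, constant static global: typically placed in .rodata. */
static volatile const uint64_t num_rodata = 7;

/* Hypothetical program body touching all three sections. */
uint64_t load_static_data_sketch(void)
{
	num_bss += 1;           /* write into .bss */
	num_data += num_rodata; /* read .rodata, write .data */
	return num_bss + num_data;
}
```

With the loader support described above, each of the three variables would be backed by its own single-entry array map rather than by a map lookup written out by hand.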
      
      Simple example dump of a program using global vars in each
      section:
      
        # bpftool prog
        [...]
        6784: sched_cls  name load_static_dat  tag a7e1291567277844  gpl
              loaded_at 2019-03-11T15:39:34+0000  uid 0
              xlated 1776B  jited 993B  memlock 4096B  map_ids 2238,2237,2235,2236,2239,2240
      
        # bpftool map show id 2237
        2237: array  name test_glo.bss  flags 0x0
              key 4B  value 64B  max_entries 1  memlock 4096B
        # bpftool map show id 2235
        2235: array  name test_glo.data  flags 0x0
              key 4B  value 64B  max_entries 1  memlock 4096B
        # bpftool map show id 2236
        2236: array  name test_glo.rodata  flags 0x80
              key 4B  value 96B  max_entries 1  memlock 4096B
      
        # bpftool prog dump xlated id 6784
        int load_static_data(struct __sk_buff * skb):
        ; int load_static_data(struct __sk_buff *skb)
           0: (b7) r6 = 0
        ; test_reloc(number, 0, &num0);
           1: (63) *(u32 *)(r10 -4) = r6
           2: (bf) r2 = r10
        ; int load_static_data(struct __sk_buff *skb)
           3: (07) r2 += -4
        ; test_reloc(number, 0, &num0);
           4: (18) r1 = map[id:2238]
           6: (18) r3 = map[id:2237][0]+0    <-- direct addr in .bss area
           8: (b7) r4 = 0
           9: (85) call array_map_update_elem#100464
          10: (b7) r1 = 1
        ; test_reloc(number, 1, &num1);
        [...]
        ; test_reloc(string, 2, str2);
         120: (18) r8 = map[id:2237][0]+16   <-- same here at offset +16
         122: (18) r1 = map[id:2239]
         124: (18) r3 = map[id:2237][0]+16
         126: (b7) r4 = 0
         127: (85) call array_map_update_elem#100464
         128: (b7) r1 = 120
        ; str1[5] = 'x';
         129: (73) *(u8 *)(r9 +5) = r1
        ; test_reloc(string, 3, str1);
         130: (b7) r1 = 3
         131: (63) *(u32 *)(r10 -4) = r1
         132: (b7) r9 = 3
         133: (bf) r2 = r10
        ; int load_static_data(struct __sk_buff *skb)
         134: (07) r2 += -4
        ; test_reloc(string, 3, str1);
         135: (18) r1 = map[id:2239]
         137: (18) r3 = map[id:2235][0]+16   <-- direct addr in .data area
         139: (b7) r4 = 0
         140: (85) call array_map_update_elem#100464
         141: (b7) r1 = 111
        ; __builtin_memcpy(&str2[2], "hello", sizeof("hello"));
         142: (73) *(u8 *)(r8 +6) = r1       <-- further access based on .bss data
         143: (b7) r1 = 108
         144: (73) *(u8 *)(r8 +5) = r1
        [...]
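
The map names in the dump (test_glo.bss, test_glo.data, test_glo.rodata) look like a truncated object name followed by the section name. A minimal sketch of such a naming scheme follows; the 8 + 7 character split and the function name are assumptions chosen to match the dump, not a statement of what libbpf does.

```c
#include <stddef.h>
#include <stdio.h>

/* Build an internal map name from the object name and the section name.
 * Assumption: at most 8 characters of the object name and 7 of the
 * section name are kept, so the result fits a 16-byte name buffer
 * including the terminating NUL. */
void internal_map_name_sketch(char *dst, size_t dst_len,
			      const char *obj_name, const char *sec_name)
{
	snprintf(dst, dst_len, "%.8s%.7s", obj_name, sec_name);
}
```

For a hypothetical object named test_global_data and the section .rodata this yields test_glo.rodata.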
      
      For Cilium use-case in particular, this enables migrating configuration
      constants from Cilium daemon's generated header defines into global
      data sections such that expensive runtime recompilations with LLVM can
      be avoided altogether. Instead, the ELF file becomes effectively a
      "template", meaning, it is compiled only once (!) and the Cilium daemon
      will then rewrite relevant configuration data from the ELF's .data or
      .rodata sections directly instead of recompiling the program. The
      updated ELF is then loaded into the kernel and atomically replaces
      the existing program in the networking datapath. More info in [0].
      
      Based upon recent fix in LLVM, commit c0db6b6bd444 ("[BPF] Don't fail
      for static variables").
      
        [0] LPC 2018, BPF track, "ELF relocation for static data in BPF",
            http://vger.kernel.org/lpc-bpf2018.html#session-3

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, libbpf: refactor relocation handling · f8c7a4d4
      Joe Stringer authored
      Adjust the code for relocations slightly with no functional changes,
      so that upcoming patches that will introduce support for relocations
      into the .data, .rodata and .bss sections can be added independent
      of these changes.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: sync {btf, bpf}.h uapi header from tools infrastructure · c83fef6b
      Daniel Borkmann authored
      Pull in latest changes from both headers, so we can make use of
      them in libbpf.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: allow for key-less BTF in array map · 2824ecb7
      Daniel Borkmann authored
      Given we'll be reusing BPF array maps for global data/bss/rodata
      sections, we need a way to associate BTF DataSec type as its map
      value type. In usual cases we have this ugly BPF_ANNOTATE_KV_PAIR()
      macro hack e.g. via 38d5d3b3 ("bpf: Introduce BPF_ANNOTATE_KV_PAIR")
      to get initial map to type association going. While more use cases
      for it are discouraged, this also won't work for global data since
      the use of array map is a BPF loader detail and therefore unknown
      at compilation time. For array maps with just a single entry we make
      an exception in terms of BTF in that key type is declared optional
      if value type is of DataSec type. The latter LLVM is guaranteed to
      emit and it also aligns with how we regard global data maps as just
      a plain buffer area reusing existing map facilities for allowing
      things like introspection with existing tools.
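
The exception described above can be sketched as a small predicate. The enum, struct and function names are local stand-ins invented for this illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for BTF kinds; not the kernel's enum. */
enum sketch_kind { SKETCH_KIND_VOID, SKETCH_KIND_INT, SKETCH_KIND_DATASEC };

struct sketch_array_map {
	uint32_t max_entries;
	enum sketch_kind key_kind;
	enum sketch_kind value_kind;
};

/* Returns true if the key/value BTF association is acceptable. */
bool array_map_btf_ok_sketch(const struct sketch_array_map *m)
{
	if (m->key_kind == SKETCH_KIND_VOID) {
		/* Key-less BTF: only for a single-entry array map whose
		 * value type is a DataSec. */
		return m->max_entries == 1 &&
		       m->value_kind == SKETCH_KIND_DATASEC;
	}
	/* Usual case: array map keys are integers. */
	return m->key_kind == SKETCH_KIND_INT;
}
```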
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: kernel side support for BTF Var and DataSec · 1dc92851
      Daniel Borkmann authored
      This work adds kernel-side verification, logging and seq_show dumping
      of BTF Var and DataSec kinds which are emitted with latest LLVM. The
      following constraints apply:
      
      BTF Var must have:
      
      - Its kind_flag is 0
      - Its vlen is 0
      - Must point to a valid type
      - Type must not resolve to a forward type
      - Size of underlying type must be > 0
      - Must have a valid name
      - Can only be a source type, not sink or intermediate one
      - Name may include dots (e.g. in case of static variables
        inside functions)
      - Cannot be a member of a struct/union
      - Linkage so far can either only be static or global/allocated
      
      BTF DataSec must have:
      
      - Its kind_flag is 0
      - Its vlen cannot be 0
      - Its size cannot be 0
      - Must have a valid name
      - Can only be a source type, not sink or intermediate one
      - Name may include dots (e.g. to represent .bss, .data, .rodata etc)
      - Cannot be a member of a struct/union
      - Inner btf_var_secinfo array with {type,offset,size} triple
        must be sorted by offset in ascending order
      - Type must always point to BTF Var
      - BTF resolved size of Var must be <= size provided by triple
      - DataSec size must be >= sum of triple sizes (thus holes
        are allowed)
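
The layout-related DataSec constraints from the list above can be sketched as follows. This covers only the size and ordering rules; type resolution (that each entry points to a Var, and that the Var's resolved size fits) is not modeled. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the {type,offset,size} triple described above. */
struct sketch_var_secinfo {
	uint32_t type;   /* type id, expected to refer to a Var */
	uint32_t offset; /* offset of the variable inside the section */
	uint32_t size;   /* bytes reserved for the variable */
};

bool datasec_layout_ok_sketch(uint32_t sec_size,
			      const struct sketch_var_secinfo *vsi,
			      size_t vlen)
{
	uint64_t sum = 0;
	size_t i;

	/* vlen and size must both be non-zero. */
	if (vlen == 0 || sec_size == 0)
		return false;

	for (i = 0; i < vlen; i++) {
		/* Entries must be sorted by offset in ascending order. */
		if (i > 0 && vsi[i].offset < vsi[i - 1].offset)
			return false;
		sum += vsi[i].size;
	}

	/* Holes are fine, but the entries must not need more room
	 * than the section provides. */
	return sum <= sec_size;
}
```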
      
      btf_var_resolve(), btf_ptr_resolve() and btf_modifier_resolve()
      are on a high level quite similar but each come with slight,
      subtle differences. They could potentially be a bit refactored
      in future which hasn't been done here to ease review.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add specification for BTF Var and DataSec kinds · f063c889
      Daniel Borkmann authored
      This adds the BTF specification and UAPI bits for supporting BTF Var
      and DataSec kinds. This is following LLVM upstream commit ac4082b77e07
      ("[BPF] Add BTF Var and DataSec Support") which has been merged recently.
      Var itself is for describing a global variable and DataSec to describe
      ELF sections e.g. data/bss/rodata sections that hold one or multiple
      global variables.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: allow . char as part of the object name · 3e0ddc4f
      Daniel Borkmann authored
      Trivial addition to allow '.' aside from '_' as "special" characters
      in the object name. Used to allow for substrings in maps from loader
      side such as ".bss", ".data", ".rodata", but could also be useful for
      other purposes.
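
A minimal sketch of the relaxed character check, assuming that alphanumerics were already accepted and that length limits are enforced elsewhere; the function name is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* Accept alphanumeric characters plus the "special" ones '_' and '.'. */
bool obj_name_ok_sketch(const char *name)
{
	for (; *name; name++) {
		unsigned char c = (unsigned char)*name;

		if (!isalnum(c) && c != '_' && c != '.')
			return false;
	}
	return true;
}
```

With this rule a name such as test_glo.bss passes, while a name containing '-' is still rejected.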
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add syscall side map freeze support · 87df15de
      Daniel Borkmann authored
      This patch adds a new BPF_MAP_FREEZE command which allows
      "freezing" the map globally as read-only / immutable from syscall
      side.
      
      Map permission handling has been refactored into map_get_sys_perms()
      and drops FMODE_CAN_WRITE in case of locked map. Main use case is
      to allow for setting up .rodata sections from the BPF ELF which
      are loaded into the kernel, meaning BPF loader first allocates
      map, sets up map value by copying .rodata section into it and once
      complete, it calls BPF_MAP_FREEZE on the map fd to prevent further
      modifications.
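
The allocate, populate, freeze sequence can be illustrated with a small userspace model. This is a toy stand-in and not kernel code: the struct, the function names and the specific error codes are assumptions made for the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a single-entry array map. */
struct sketch_map {
	uint8_t value[64]; /* the one map value */
	bool frozen;       /* set once, never cleared */
};

/* Syscall-side update: refused once the map is frozen. */
int sketch_map_update(struct sketch_map *m, const void *src, size_t len)
{
	if (m->frozen)
		return -EPERM; /* assumed error code */
	if (len > sizeof(m->value))
		return -EINVAL;
	memcpy(m->value, src, len);
	return 0;
}

/* Freeze: from here on, syscall-side writes fail. */
int sketch_map_freeze(struct sketch_map *m)
{
	if (m->frozen)
		return -EBUSY; /* assumed error code */
	m->frozen = true;
	return 0;
}
```

A loader would call the update once with the .rodata contents and then freeze; any later update attempt fails in this model.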
      
      Right now BPF_MAP_FREEZE only takes map fd as argument while remaining
      bpf_attr members are required to be zero. I didn't add write-only
      locking here as counterpart since I don't have a concrete use-case
      for it on my side, and I think it makes probably more sense to wait
      once there is actually one. In that case bpf_attr can be extended
      as usual with a flag field and/or others where flag 0 means that
      we lock the map read-only, hence this doesn't prevent adding further
      extensions to BPF_MAP_FREEZE upon need.
      
      A map creation flag like BPF_F_WRONCE was not considered for a couple
      of reasons: i) in case of a generic implementation, a map can consist
      of more than just one element, thus there could be multiple map
      updates needed to set the map into a state where it can then be
      made immutable, ii) WRONCE indicates exact one-time write before
      it is then set immutable. A generic implementation would set a bit
      atomically on map update entry (if unset), indicating that every
      subsequent update from then onwards will need to bail out there.
      However, map updates can fail, so upon failure that flag would need
      to be unset again and the update attempt would need to be repeated
      for it to be eventually made immutable. While this can be made
      race-free, this approach feels less clean and in combination with
      reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE
      command directly sets the flag and caller has the guarantee that
      map is immutable from syscall side upon successful return for any
      future syscall invocations that would alter the map state, which
      is also more intuitive from an API point of view. A command name
      such as BPF_MAP_LOCK has been avoided as it's too close with BPF
      map spin locks (which already has BPF_F_LOCK flag). BPF_MAP_FREEZE
      is so far only enabled for privileged users.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add program side {rd, wr}only support for maps · 591fe988
      Daniel Borkmann authored
      This work adds two new map creation flags BPF_F_RDONLY_PROG
      and BPF_F_WRONLY_PROG in order to allow for read-only or
      write-only BPF maps from a BPF program side.
      
      Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
      applies to system call side, meaning the BPF program has full
      read/write access to the map as usual while bpf(2) calls with
      map fd can either only read or write into the map depending
      on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allow
      for the exact opposite such that verifier is going to reject
      program loads if write into a read-only map or a read into a
      write-only map is detected. For read-only map case also some
      helpers are forbidden for programs that would alter the map
      state such as map deletion, update, etc. As opposed to the two
      BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
      as BPF_F_WRONLY_PROG really do correspond to the map lifetime.
      
      We've enabled this generic map extension to various non-special
      maps holding normal user data: array, hash, lru, lpm, local
      storage, queue and stack. Further generic map types could be
      followed up in future depending on use-case. Main use case
      here is to forbid writes into .rodata map values from verifier
      side.
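
How a verifier could turn the two new flags into program-side capabilities is sketched below. The flag values are assumed to match the UAPI header (0x80 agrees with the flags shown for the .rodata map in the bpftool output earlier in this log); the capability bits and the function name are hypothetical.

```c
#include <stdint.h>

/* Assumed to match the UAPI values of the two new flags. */
#define SKETCH_F_RDONLY_PROG (1U << 7)
#define SKETCH_F_WRONLY_PROG (1U << 8)

/* Hypothetical capability bits used only in this sketch. */
#define SKETCH_CAN_READ  (1U << 0)
#define SKETCH_CAN_WRITE (1U << 1)

/* Map creation flags -> what a BPF program may do with the map. */
uint32_t prog_caps_sketch(uint32_t map_flags)
{
	if (map_flags & SKETCH_F_RDONLY_PROG)
		return SKETCH_CAN_READ;
	if (map_flags & SKETCH_F_WRONLY_PROG)
		return SKETCH_CAN_WRITE;
	return SKETCH_CAN_READ | SKETCH_CAN_WRITE;
}
```

A store into a map value would then be rejected at load time whenever SKETCH_CAN_WRITE is missing, and a load whenever SKETCH_CAN_READ is missing.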
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: do not retain flags that are not tied to map lifetime · be70bcd5
      Daniel Borkmann authored
      Both BPF_F_WRONLY / BPF_F_RDONLY flags are tied to the map file
      descriptor, but not to the map object itself! Meaning, at map
      creation time BPF_F_RDONLY can be set to make the map read-only
      from syscall side, but this holds only for the returned fd, so
      any other fd either retrieved via bpf file system or via map id
      for the very same underlying map object can have read-write access
      instead.
      
      Given that, keeping the two flags around in the map_flags attribute
      and exposing them to user space upon map dump is misleading and
      may lead to false conclusions. Since these two flags are not
      tied to the map object, let's also not store them as map property.
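
In code, the idea amounts to masking the two fd-scoped flags out before the flags are stored on the map object. A minimal sketch, with flag values assumed to match the UAPI header and a hypothetical function name:

```c
#include <stdint.h>

/* Assumed to match the UAPI values of the fd-scoped flags. */
#define SKETCH_F_RDONLY (1U << 3)
#define SKETCH_F_WRONLY (1U << 4)

/* Keep only the flags that describe the map object itself. */
uint32_t retain_permanent_flags_sketch(uint32_t flags)
{
	return flags & ~(SKETCH_F_RDONLY | SKETCH_F_WRONLY);
}
```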
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement lookup-free direct value access for maps · d8eca5bb
      Daniel Borkmann authored
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for ldimm64 instruction that indicates
      that inside the first part of the double insns's imm field is a
      file descriptor which the verifier then replaces as a full 64bit
      address of the map into both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insns's imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset for maps that support this
      operation. Currently supported is array map with single entry.
      It is possible to support more than just single map element by
      reusing both 16bit off fields of the insns as a map index, so
      full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of concrete use case, but
      could easily be done so in future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
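
The encoding described above can be sketched as follows. The struct is a simplified stand-in (the real instruction packs dst_reg and src_reg into one byte), and the helper name is hypothetical; the opcode and pseudo value are assumed to match the UAPI header.

```c
#include <stdint.h>

/* Simplified stand-in for a BPF instruction. */
struct sketch_insn {
	uint8_t code;
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
	int32_t imm;
};

#define SKETCH_LD_IMM64         0x18 /* BPF_LD | BPF_DW | BPF_IMM */
#define SKETCH_PSEUDO_MAP_VALUE 2    /* assumed UAPI value */

/* Emit the two-instruction ldimm64 that asks the verifier for the
 * address "map value base + value_off". */
void emit_ld_map_value_sketch(struct sketch_insn insn[2], uint8_t dst,
			      int map_fd, uint32_t value_off)
{
	/* First half: opcode, destination, pseudo flag, map fd in imm. */
	insn[0].code = SKETCH_LD_IMM64;
	insn[0].dst_reg = dst;
	insn[0].src_reg = SKETCH_PSEUDO_MAP_VALUE;
	insn[0].off = 0; /* must stay 0 for now */
	insn[0].imm = map_fd;

	/* Second half: offset into the value in imm, everything else 0. */
	insn[1].code = 0;
	insn[1].dst_reg = 0;
	insn[1].src_reg = 0;
	insn[1].off = 0; /* must stay 0 for now */
	insn[1].imm = (int32_t)value_off;
}
```

At load time the verifier would overwrite both imm fields with the resolved 64-bit address, so the fd and the offset never reach the JIT.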
      
      BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
      BPF_PSEUDO_MAP_FD we could not distinguish, at offset 0, a load
      of the map pointer from a load of the map's value, and changing
      BPF_PSEUDO_MAP_FD's encoding to be off by one in order to tell a
      regular map pointer from a map value pointer would add unnecessary
      complexity and raise the barrier for debuggability, making it less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations since the maximum
      possible value size is in the u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call
      which needs to prepare registers according to calling convention,
      etc, without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 07 Apr, 2019 1 commit
  3. 05 Apr, 2019 9 commits
    • Merge branch 'bpf-varstack-fixes' · 347807d3
      Daniel Borkmann authored
      Andrey Ignatov says:
      
      ====================
      v2->v3:
      - sanity check max value for variable offset.
      
      v1->v2:
      - rely on meta = NULL to reject var_off stack access to uninit buffer.
      
      This patch set is a follow-up for discussion [1].
      
      It fixes variable offset stack access handling for raw and unprivileged
      mode, rejecting both of them, and sanity checks max variable offset value.
      
      Patch 1 handles raw (uninitialized) mode.
      Patch 2 adds test for raw mode.
      Patch 3 handles unprivileged mode.
      Patch 4 adds test for unprivileged mode.
      Patch 5 adds sanity check for max value of variable offset.
      Patch 6 adds test for variable offset max value checking.
      Patch 7 is a minor fix in verbose log.
      
      Unprivileged mode is an interesting case since one (and only?) way to come
      up with variable offset is to use pointer arithmetics. Though pointer
      arithmetics is already prohibited for unprivileged mode. I'm not sure if
      it's enough though and it seems like a good idea to still reject variable
      offset for unpriv in check_stack_boundary(). Please see patches 3 and 4
      for more details on this.
      
      [1] https://marc.info/?l=linux-netdev&m=155419526427742&w=2
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add missed newline in verifier verbose log · 1fbd20f8
      Andrey Ignatov authored
      check_stack_access() that prints verbose log is used in
      adjust_ptr_min_max_vals() that prints its own verbose log and now they
      stick together, e.g.:
      
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16
        size=1R2 stack pointer arithmetic goes out of range, prohibited for
        !root
      
      Add missing newline so that log is more readable:
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1
        R2 stack pointer arithmetic goes out of range, prohibited for !root
      
      Fixes: f1174f77 ("bpf/verifier: rework value tracking")
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Test unbounded var_off stack access · 07f91962
      Andrey Ignatov authored
      Test the case when reg->smax_value is too small/big and can overflow,
      and separately min and max values outside of stack bounds.
      
      Example of output:
        # ./test_verifier
        #856/p indirect variable-offset stack access, unbounded OK
        #857/p indirect variable-offset stack access, max out of bound OK
        #858/p indirect variable-offset stack access, min out of bound OK
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Sanity check max value for var_off stack access · 107c26a7
      Andrey Ignatov authored
      As discussed in [1] max value of variable offset has to be checked for
      overflow on stack access otherwise verifier would accept code like this:
      
        0: (b7) r2 = 6
        1: (b7) r3 = 28
        2: (7a) *(u64 *)(r10 -16) = 0
        3: (7a) *(u64 *)(r10 -8) = 0
        4: (79) r4 = *(u64 *)(r1 +168)
        5: (c5) if r4 s< 0x0 goto pc+4
         R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28
         R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0;
         0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm
        6: (17) r4 -= 16
        7: (0f) r4 += r10
        8: (b7) r5 = 8
        9: (85) call bpf_getsockopt#57
        10: (b7) r0 = 0
        11: (95) exit
      
      , where R4 obviously has unbounded max value.
      
      Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF;
      BPF_MAX_VAR_OFF) range.
      
      reg->smax_value is used instead of reg->umax_value because stack
      pointers are calculated using negative offset from fp. This is opposite
      to e.g. map access where offset must be non-negative and where
      umax_value is used.
      
      Also dedicated verbose logs are added for both min and max bound check
      failures to have diagnostics consistent with variable offset handling in
      check_map_access().
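
The check can be sketched as below. This is a simplified model and not the verifier's code: the constants are assumed to equal BPF_MAX_VAR_OFF and the BPF stack size, the function names are hypothetical, and the extra lower-bound test on smin exists only to keep this sketch's own arithmetic free of signed overflow.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_VAR_OFF (1 << 29) /* assumed BPF_MAX_VAR_OFF */
#define SKETCH_MAX_STACK   512       /* assumed BPF stack size */

/* Is [off, off + size) fully inside the stack below the frame pointer? */
static bool stack_range_ok_sketch(int64_t off, int64_t size)
{
	return size > 0 && off < 0 && off >= -SKETCH_MAX_STACK &&
	       off + size <= 0;
}

bool var_off_stack_access_ok_sketch(int64_t smin, int64_t smax,
				    int32_t fixed_off, int32_t size)
{
	/* Sanity check the signed maximum first, before any addition
	 * that could overflow. */
	if (smax >= SKETCH_MAX_VAR_OFF || smax <= -SKETCH_MAX_VAR_OFF)
		return false;
	/* Sketch-only guard: such a minimum is out of stack bounds anyway. */
	if (smin <= -SKETCH_MAX_VAR_OFF)
		return false;

	/* Both ends of the offset range must stay within the stack. */
	return stack_range_ok_sketch(smin + fixed_off, size) &&
	       stack_range_ok_sketch(smax + fixed_off, size);
}
```

For the program above, R4 after `r4 -= 16` has a signed maximum close to the largest 64-bit value, so the first test already rejects it.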
      
      [1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Test indirect var_off stack access in unpriv mode · 2c6927db
      Andrey Ignatov authored
      Test that verifier rejects indirect stack access with variable offset in
      unprivileged mode and accepts same code in privileged mode.
      
      Since pointer arithmetics is prohibited in unprivileged mode verifier
      should reject the program even before it gets to helper call that uses
      variable offset, at the time when that variable offset is trying to be
      constructed.
      
      Example of output:
        # ./test_verifier
        ...
        #859/u indirect variable-offset stack access, priv vs unpriv OK
        #859/p indirect variable-offset stack access, priv vs unpriv OK
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Reject indirect var_off stack access in unpriv mode · 088ec26d
      Andrey Ignatov authored
      Proper support of indirect stack access with variable offset in
      unprivileged mode (!root) requires corresponding support in Spectre
      masking for stack ALU in retrieve_ptr_limit().
      
      There is no use-case for variable offset in unprivileged mode though, so
      make the verifier reject such accesses for simplicity.
      
      Pointer arithmetics is one (and only?) way to cause variable offset and
      it's already rejected in unpriv mode so that verifier won't even get to
      helper function whose argument contains variable offset, e.g.:
      
        0: (7a) *(u64 *)(r10 -16) = 0
        1: (7a) *(u64 *)(r10 -8) = 0
        2: (61) r2 = *(u32 *)(r1 +0)
        3: (57) r2 &= 4
        4: (17) r2 -= 16
        5: (0f) r2 += r10
        variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2
        stack pointer arithmetic goes out of range, prohibited for !root
      
      Still it looks like a good idea to reject variable offset indirect stack
      access for unprivileged mode in check_stack_boundary() explicitly.
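
The var_off pair printed in the log above can be reproduced with a small model of tracked-number arithmetic. This is a sketch written for illustration, modeled on my understanding of how the verifier tracks known and unknown bits; all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tracked number: bits set in mask are unknown, the others are
 * known and given by value. */
struct sketch_tnum {
	uint64_t value;
	uint64_t mask;
};

static struct sketch_tnum tnum_const_sketch(uint64_t v)
{
	struct sketch_tnum t = { v, 0 };
	return t;
}

struct sketch_tnum tnum_and_sketch(struct sketch_tnum a, struct sketch_tnum b)
{
	uint64_t alpha = a.value | a.mask; /* bits that may be 1 in a */
	uint64_t beta = b.value | b.mask;  /* bits that may be 1 in b */
	uint64_t v = a.value & b.value;    /* bits known to be 1 */
	struct sketch_tnum t = { v, alpha & beta & ~v };
	return t;
}

struct sketch_tnum tnum_sub_sketch(struct sketch_tnum a, struct sketch_tnum b)
{
	uint64_t dv = a.value - b.value;
	uint64_t alpha = dv + a.mask;
	uint64_t beta = dv - b.mask;
	uint64_t chi = alpha ^ beta;          /* bits affected by borrows */
	uint64_t mu = chi | a.mask | b.mask;  /* all unknown bits */
	struct sketch_tnum t = { dv & ~mu, mu };
	return t;
}

/* A register has a variable offset when any bit is unknown. */
bool tnum_is_const_sketch(struct sketch_tnum t)
{
	return t.mask == 0;
}

/* Replays insns 2-4 of the listing: load u32, r2 &= 4, r2 -= 16. */
struct sketch_tnum replay_listing_sketch(void)
{
	struct sketch_tnum r2 = { 0, 0xffffffffULL }; /* unknown u32 */

	r2 = tnum_and_sketch(r2, tnum_const_sketch(4));
	r2 = tnum_sub_sketch(r2, tnum_const_sketch(16));
	return r2;
}
```

The result has value 0xfffffffffffffff0 and mask 0x4, i.e. it is not a constant, which is the condition under which unprivileged mode would reject the access.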
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: Test indirect var_off stack access in raw mode · f68a5b44
      Andrey Ignatov authored
      Test that verifier rejects indirect access to uninitialized stack with
      variable offset.
      
      Example of output:
        # ./test_verifier
        ...
        #859/p indirect variable-offset stack access, uninitialized OK
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Reject indirect var_off stack access in raw mode · f2bcd05e
      Andrey Ignatov authored
      It's hard to guarantee that whole memory is marked as initialized on
      helper return if uninitialized stack is accessed with variable offset
      since specific bounds are unknown to verifier. This may cause
      uninitialized stack leaking.
      
      Reject such an access in check_stack_boundary to prevent possible
      leaking.
      
      There are no known use-cases for indirect uninitialized stack access
      with variable offset so it shouldn't break anything.
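
One way to picture the resulting rule: with a variable offset, every byte the helper might touch has to be initialized already, because the verifier cannot tell afterwards which bytes the helper actually wrote. The toy model below assumes a 512-byte stack tracked per byte; the names and the representation are hypothetical and much simpler than the verifier's slot tracking.

```c
#include <stdbool.h>

#define SKETCH_STACK 512 /* assumed stack size in bytes */

/* init[i] tells whether the stack byte at fp - SKETCH_STACK + i
 * has been written. Offsets are relative to fp and negative. */
bool var_off_helper_access_ok_sketch(const bool init[SKETCH_STACK],
				     int min_off, int max_off, int size)
{
	int off;

	if (size <= 0 || min_off > max_off ||
	    min_off < -SKETCH_STACK || max_off + size > 0)
		return false;

	/* The access may start anywhere in [min_off, max_off], so check
	 * the union of all possible ranges. */
	for (off = min_off; off < max_off + size; off++)
		if (!init[SKETCH_STACK + off])
			return false;
	return true;
}
```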
      
      Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: fix build with new clang · 636e78b1
      Alexei Starovoitov authored
      clang started to error on invalid asm clobber usage in x86 headers
      and many bpf program samples failed to build with the message:
      
        CLANG-bpf  /data/users/ast/bpf-next/samples/bpf/xdp_redirect_kern.o
      In file included from /data/users/ast/bpf-next/samples/bpf/xdp_redirect_kern.c:14:
      In file included from ../include/linux/in.h:23:
      In file included from ../include/uapi/linux/in.h:24:
      In file included from ../include/linux/socket.h:8:
      In file included from ../include/linux/uio.h:14:
      In file included from ../include/crypto/hash.h:16:
      In file included from ../include/linux/crypto.h:26:
      In file included from ../include/linux/uaccess.h:5:
      In file included from ../include/linux/sched.h:15:
      In file included from ../include/linux/sem.h:5:
      In file included from ../include/uapi/linux/sem.h:5:
      In file included from ../include/linux/ipc.h:9:
      In file included from ../include/linux/refcount.h:72:
      ../arch/x86/include/asm/refcount.h:72:36: error: asm-specifier for input or output variable conflicts with asm clobber list
                                               r->refs.counter, e, "er", i, "cx");
                                                                            ^
      ../arch/x86/include/asm/refcount.h:86:27: error: asm-specifier for input or output variable conflicts with asm clobber list
                                               r->refs.counter, e, "cx");
                                                                   ^
      2 errors generated.
      
      Override volatile() to work around the problem.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      636e78b1
  4. 04 Apr, 2019 2 commits
  5. 03 Apr, 2019 11 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-verifier-scalability' · cc441a69
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      v1->v2:
      - fixed typo in patch 1
      - added a patch to convert kcalloc to kvcalloc
      - added a patch to verbose 16-bit jump offset check
      - added a test with 1m insns
      
      This patch set is the first step to be able to accept large programs.
      The verifier still suffers from its brute force algorithm and
      large programs can easily hit 1M insn_processed limit.
      A lot more work is necessary to be able to verify large programs.
      
      v1:
      Realize two key ideas to speed up verification by ~20 times:
      1. every 'branching' instruction records all verifier states.
         not all of them are useful for search pruning.
         add a simple heuristic to keep states that were successful in search pruning
         and remove those that were not
      2. mark_reg_read walks parentage chain of registers to mark parents as LIVE_READ.
         Once the register is marked there is no need to remark it again in the future.
         Hence stop walking the chain once first LIVE_READ is seen.
      
      1st optimization gives 10x speed up on large programs
      and 2nd optimization reduces the cost of mark_reg_read from ~40% of cpu to <1%.
      Combined, they deliver a ~20x speedup on large programs.
      
      Faster and bounded verification time allows increasing the insn_processed
      limit from 130k to 1 million.
      Worst case it takes 1/10 of a second to process that many instructions
      and peak memory consumption is peak_states * sizeof(struct bpf_verifier_state)
      which is around ~5Mbyte.
      
      Increase insn_per_program limit for root to insn_processed limit.
      
      Add verification stats and stress tests for verifier scalability.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      cc441a69
    • Alexei Starovoitov's avatar
      selftests/bpf: synthetic tests to push verifier limits · 8aa2d4b4
      Alexei Starovoitov authored
      Add a test to generate 1m ld_imm64 insns to stress the verifier.
      
      Bump the size of fill_ld_abs_vlan_push_pop test from 4k to 29k
      and jump_around_ld_abs from 4k to 5.5k.
      Larger sizes are not possible due to 16-bit offset encoding
      in jump instructions.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      8aa2d4b4
    • Alexei Starovoitov's avatar
      selftests/bpf: add few verifier scale tests · e5e7a8f2
      Alexei Starovoitov authored
      Add 3 basic tests that stress verifier scalability.
      
      test_verif_scale1.c calls non-inlined jhash() function 90 times at
      different positions in the packet.
      This test simulates network packet parsing.
      jhash function is ~140 instructions and main program is ~1200 insns.
      
      test_verif_scale2.c force inlines jhash() function 90 times.
      This program is ~15k instructions long.
      
      test_verif_scale3.c calls non-inlined jhash() function 90 times at
      different positions in the packet.
      But this time jhash has to process 32-bytes from the packet
      instead of 14-bytes in tests 1 and 2.
      jhash function is ~230 insns and main program is ~1200 insns.
      
      $ test_progs -s
      can be used to see verifier stats.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      e5e7a8f2
    • Alexei Starovoitov's avatar
      libbpf: teach libbpf about log_level bit 2 · da11b417
      Alexei Starovoitov authored
      Allow bpf_prog_load_xattr() to specify log_level for program loading.
      
      Teach libbpf to accept log_level with bit 2 set.
      
      Increase default BPF_LOG_BUF_SIZE from 256k to 16M.
      There is no downside to increasing it to the maximum allowed by old kernels.
      The existing 256k limit caused ENOSPC errors, and users were not able to see
      the verifier error, which is printed at the end of the verifier log.
      
      If ENOSPC is hit, double the verifier log and try again to capture
      the verifier error.
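The double-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not libbpf's actual code: `load_fn`, `load_with_growing_log()` and `fake_load()` are hypothetical names introduced here, and the real loader goes through the bpf() syscall rather than a callback.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical loader callback: returns 0 on success or a negative errno. */
typedef int (*load_fn)(char *log_buf, size_t log_size, void *ctx);

/*
 * Call load() with a log buffer of initial_size bytes. While it fails with
 * -ENOSPC and doubling would not exceed max_size, double the buffer and retry.
 * On return *out_log holds the last buffer used (the caller frees it) and
 * *out_size holds its size.
 */
int load_with_growing_log(load_fn load, void *ctx, size_t initial_size,
			  size_t max_size, char **out_log, size_t *out_size)
{
	size_t size = initial_size;

	if (size == 0)
		return -EINVAL;

	for (;;) {
		char *buf = malloc(size);
		int err;

		if (!buf)
			return -ENOMEM;
		buf[0] = '\0';

		err = load(buf, size, ctx);
		if (err != -ENOSPC || size > max_size / 2) {
			/* Success, a different error, or no room left to grow. */
			*out_log = buf;
			*out_size = size;
			return err;
		}

		free(buf);
		size *= 2;
	}
}

/* Demo loader: pretends the verifier log needs *(size_t *)ctx bytes and
 * that the program is rejected once the whole log fits. */
int fake_load(char *log_buf, size_t log_size, void *ctx)
{
	size_t needed = *(size_t *)ctx;

	(void)log_buf;
	return log_size < needed ? -ENOSPC : -EACCES;
}
```

With `fake_load` needing 1000 bytes and an initial size of 256, the helper retries with 512 and then 1024 bytes before it returns the verifier's real error together with the full log.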
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      da11b417
    • Alexei Starovoitov's avatar
      bpf: increase verifier log limit · 7a9f5c65
      Alexei Starovoitov authored
      The existing 16Mbyte verifier log limit is not enough for log_level=2
      even for small programs. Increase it to 1Gbyte.
      Note it's not a kernel memory limit.
      It's an amount of memory user space provides to store
      the verifier log. The kernel populates it 1k at a time.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      7a9f5c65
    • Alexei Starovoitov's avatar
      bpf: increase complexity limit and maximum program size · c04c0d2b
      Alexei Starovoitov authored
      Large verifier speed improvements make it possible to increase
      the verifier complexity limit.
      Now regardless of the program composition and its size it takes
      little time for the verifier to hit insn_processed limit.
      On a typical x86 machine a non-debug kernel processes 1M instructions
      in 1/10 of a second.
      (before these speed improvements specially crafted programs
      could be hitting multi-second verification times)
      Full kasan kernel with debug takes ~1 second for the same 1M insns.
      Hence bump the BPF_COMPLEXITY_LIMIT_INSNS limit to 1M.
      Also increase the number of instructions per program
      from 4k to internal BPF_COMPLEXITY_LIMIT_INSNS limit.
      4k limit was confusing to users, since small programs with hundreds
      of insns could be hitting BPF_COMPLEXITY_LIMIT_INSNS limit.
      Sometimes adding more insns and bpf_trace_printk debug statements
      would make the verifier accept the program while removing
      code would make the verifier reject it.
      Some user space applications started to add #define MAX_FOO to
      their programs and do:
        MAX_FOO=100;
      again:
        compile with MAX_FOO;
        try to load;
        if (fails_to_load) { reduce MAX_FOO; goto again; }
      to be able to fit maximum amount of processing into single program.
      Other users artificially split their single program into a set of programs
      and use all 32 iterations of tail_calls to increase compute limits.
      And the most advanced folks used unlimited tc-bpf filter list
      to execute many bpf programs.
      Essentially the users managed to work around the 4k insn limit.
      This patch removes the limit for root programs from uapi.
      BPF_COMPLEXITY_LIMIT_INSNS is the kernel internal limit
      and whether a program loads no longer depends on its size,
      but only on the 'smartness' of the verifier.
      The verifier will continue to get smarter with every kernel release.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c04c0d2b
    • Alexei Starovoitov's avatar
      bpf: verbose jump offset overflow check · 4f73379e
      Alexei Starovoitov authored
      Larger programs may trigger 16-bit jump offset overflow check
      during instruction patching. Make this error verbose otherwise
      users cannot decipher error code without printks in the verifier.
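For context, the jump offset in a BPF instruction is a signed 16-bit field, and the target is computed relative to the instruction that follows the jump. A minimal sketch of the range check is shown below; the function name `jump_offset_fits()` is an assumption made for illustration and is not the verifier's actual helper.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a jump located at instruction index from_idx can reach
 * instruction index to_idx with a signed 16-bit offset.
 * The offset is relative to the next instruction: to_idx = from_idx + 1 + off.
 */
bool jump_offset_fits(int64_t from_idx, int64_t to_idx)
{
	int64_t off = to_idx - from_idx - 1;

	return off >= INT16_MIN && off <= INT16_MAX;
}
```

When instruction patching inserts extra instructions between a jump and its target, the required offset grows, which is how a previously valid program can end up failing a check of this kind.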
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      4f73379e
    • Alexei Starovoitov's avatar
      bpf: convert temp arrays to kvcalloc · 71dde681
      Alexei Starovoitov authored
      Temporary arrays used during program verification need to be vmalloc-ed
      to support large bpf programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      71dde681
    • Alexei Starovoitov's avatar
      bpf: improve verification speed by not remarking live_read · 25af32da
      Alexei Starovoitov authored
      With large verifier speed improvement brought by the previous patch
      mark_reg_read() becomes the hottest function during verification.
      On a typical program it consumes 40% of cpu.
      mark_reg_read() walks parentage chain of registers to mark parents as LIVE_READ.
      Once the register is marked there is no need to remark it again in the future.
      Hence stop walking the chain once first LIVE_READ is seen.
      This optimization drops mark_reg_read() time from 40% of cpu to <1%
      and gives an overall 2x improvement in verification speed.
      For some programs the longest_mark_read_walk counter improves from ~500 to ~5.
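The early-stop idea can be illustrated with a simplified model of the parentage chain. This is only a sketch: `struct reg_state` and `mark_parents_read()` are hypothetical stand-ins, and the real mark_reg_read() handles more liveness cases than the single flag modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified register state: a parent link and a read mark. */
struct reg_state {
	struct reg_state *parent;
	bool live_read;
};

/*
 * Mark the ancestors of reg as read, starting from its parent.
 * The walk stops at the first ancestor that is already marked, because in
 * this model everything above it was marked by an earlier walk.
 * Returns the number of ancestors visited.
 */
int mark_parents_read(struct reg_state *reg)
{
	struct reg_state *parent = reg->parent;
	int steps = 0;

	while (parent) {
		steps++;
		if (parent->live_read)
			break;	/* already marked: no need to go further */
		parent->live_read = true;
		parent = parent->parent;
	}
	return steps;
}
```

The first walk over a chain visits every ancestor, while later walks through the same ancestors stop after a single step, which matches the drop in walk length reported above.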
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      25af32da
    • Alexei Starovoitov's avatar
      bpf: improve verification speed by dropping states · 9f4686c4
      Alexei Starovoitov authored
      Branch instructions, branch targets and calls in a bpf program are
      the places where the verifier remembers states that led to successful
      verification of the program.
      These states are used to prune brute force program analysis.
      For unprivileged programs there is a limit of 64 states per such
      'branching' instructions (maximum length is tracked by max_states_per_insn
      counter introduced in the previous patch).
      Simply reducing this threshold to 32 or lower increases insn_processed
      metric to the point that small valid programs get rejected.
      For root programs there is no limit and cilium programs can have
      max_states_per_insn to be 100 or higher.
      Walking 100+ states multiplied by number of 'branching' insns during
      verification consumes significant amount of cpu time.
      It turned out that a simple LRU-like mechanism can be used to remove states
      that are unlikely to be helpful in future search pruning.
      This patch introduces hit_cnt and miss_cnt counters:
      hit_cnt - this many times this state successfully pruned the search
      miss_cnt - this many times this state was not equivalent to other states
      (and those other states were added to the state list)
      
      The heuristic introduced in this patch is:
      if (sl->miss_cnt > sl->hit_cnt * 3 + 3)
        /* drop this state from future considerations */
      
      Higher numbers increase max_states_per_insn (allow more states to be
      considered for pruning) and slow verification speed, but do not meaningfully
      reduce insn_processed metric.
      Lower numbers drop too many states and insn_processed increases too much.
      Many different formulas were considered.
      This one is simple and works well enough in practice.
      (the analysis was done on selftests/progs/* and on cilium programs)
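As a rough sketch, the heuristic above can be expressed as a standalone predicate. `struct state_entry` and `should_drop_state()` are hypothetical names used for illustration; only the comparison itself is taken from the commit message.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for one entry in a per-insn state list. */
struct state_entry {
	unsigned int hit_cnt;	/* times this state pruned the search */
	unsigned int miss_cnt;	/* times this state did not match */
};

/* Returns true when the entry should be dropped from future consideration. */
bool should_drop_state(const struct state_entry *sl)
{
	return sl->miss_cnt > sl->hit_cnt * 3 + 3;
}
```

A state that has never pruned anything is dropped on its fourth miss, and each successful prune buys it three more misses.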
      
      The end result is this heuristic improves verification speed by 10 times.
      Large synthetic programs that used to take a second or more now take
      1/10 of a second.
      In cases where max_states_per_insn used to be 100 or more, now it's ~10.
      
      There is a slight increase in insn_processed for cilium progs:
                            before   after
      bpf_lb-DLB_L3.o       1831     1838
      bpf_lb-DLB_L4.o       3029     3218
      bpf_lb-DUNKNOWN.o     1064     1064
      bpf_lxc-DDROP_ALL.o   26309    26935
      bpf_lxc-DUNKNOWN.o    33517    34439
      bpf_netdev.o          9713     9721
      bpf_overlay.o         6184     6184
      bpf_lcx_jit.o         37335    39389
      And 2-3 times improvement in the verification speed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9f4686c4
    • Alexei Starovoitov's avatar
      bpf: add verifier stats and log_level bit 2 · 06ee7115
      Alexei Starovoitov authored
      In order to understand the verifier bottlenecks add various stats
      and extend log_level:
      log_level 1 and 2 are kept as-is:
      bit 0 - level=1 - print every insn and verifier state at branch points
      bit 1 - level=2 - print every insn and verifier state at every insn
      bit 2 - level=4 - print verifier error and stats at the end of verification
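Since log_level is a bitmask, the three bits can be combined and tested independently. The sketch below uses illustrative names (`LOG_LEVEL1`, `LOG_LEVEL2`, `LOG_STATS` and the helper functions are assumptions, not necessarily the kernel's identifiers); only the bit values come from the list above.

```c
#include <stdbool.h>

/* Illustrative names; the bit values follow the list in the commit message. */
enum {
	LOG_LEVEL1 = 1 << 0,	/* insns plus state at branch points */
	LOG_LEVEL2 = 1 << 1,	/* insns plus state at every insn */
	LOG_STATS  = 1 << 2,	/* error and stats at end of verification */
};

/* True when per-instruction output (bit 0 or bit 1) was requested. */
bool log_is_verbose(unsigned int log_level)
{
	return (log_level & (LOG_LEVEL1 | LOG_LEVEL2)) != 0;
}

/* True when the end-of-verification error and stats were requested. */
bool log_wants_stats(unsigned int log_level)
{
	return (log_level & LOG_STATS) != 0;
}
```

For example, a log_level of 4 asks only for the final error and stats, while 5 combines that with the branch-point trace.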
      
      When the verifier rejects the program, libbpf tries to load the program twice.
      Once with log_level=0 (no messages, only error code is reported to user space)
      and second time with log_level=1 to tell the user why the verifier rejected it.
      
      With introduction of bit 2 - level=4 the libbpf can choose to always use that
      level and load programs once, since the verification speed is not affected and
      in case of error the verbose message will be available.
      
      Note that the verifier stats are not part of uapi just like all other
      verbose messages. They're expected to change in the future.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      06ee7115
  6. 02 Apr, 2019 6 commits