1. 06 Mar, 2024 9 commits
    • libbpf: Allow version suffixes (___smth) for struct_ops types · a2a5172c
      Eduard Zingerman authored
      E.g. allow the following struct_ops definitions:
      
          struct bpf_testmod_ops___v1 { int (*test)(void); };
          struct bpf_testmod_ops___v2 { int (*test)(void); };
      
          SEC(".struct_ops.link")
          struct bpf_testmod_ops___v1 a = { .test = ... };
          SEC(".struct_ops.link")
          struct bpf_testmod_ops___v2 b = { .test = ... };
      
      Where both bpf_testmod_ops___v1 and bpf_testmod_ops___v2 would be
      resolved as 'struct bpf_testmod_ops' from kernel BTF.
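
      For illustration, a minimal sketch of the suffix-stripping rule
      described above (hypothetical helper, not libbpf's actual code):

        #include <string.h>

        /* Length of a type name with any "___suffix" ignored. */
        static size_t essential_name_len(const char *name)
        {
                const char *sep = strstr(name, "___");

                return sep ? (size_t)(sep - name) : strlen(name);
        }

        /* essential_name_len("bpf_testmod_ops___v1") == strlen("bpf_testmod_ops") */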
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240306104529.6453-2-eddyz87@gmail.com
    • Merge branch 'bpf-introduce-may_goto-and-cond_break' · 0f79bb89
      Andrii Nakryiko authored
      Alexei Starovoitov says:
      
      ====================
      bpf: Introduce may_goto and cond_break
      
      From: Alexei Starovoitov <ast@kernel.org>
      
      v5 -> v6:
      - Rename BPF_JMA to BPF_JCOND
      - Addressed Andrii's review comments
      
      v4 -> v5:
      - rewrote patch 1 to avoid fake may_goto_reg and use 'u32 may_goto_cnt' instead.
        This way may_goto handling is similar to bpf_loop() processing.
      - fixed bug in patch 2 that RANGE_WITHIN should not use
        rold->type == NOT_INIT as a safe signal.
      - patch 3 fixed negative offset computation in cond_break macro
      - using bpf_arena and cond_break recompiled lib/glob.c as bpf prog
        and it works! It will be added as a selftest to arena series.
      
      v3 -> v4:
      - fix drained issue reported by John.
        may_goto insn could be implemented with sticky state (once
        reaches 0 it stays 0), but the verifier shouldn't assume that.
        It has to explore both branches.
        Arguably drained iterator state shouldn't be there at all.
        bpf_iter_css_next() is not sticky. It can be fixed, but that would
        require auditing all iterators for stickiness, which is an
        orthogonal discussion.
      - explained JMA name reasons in patch 1
      - fixed test_progs-no_alu32 issue and added another test
      
      v2 -> v3: Major change
      - drop bpf_can_loop() kfunc and introduce a may_goto instruction instead.
        A kfunc is a function call, while may_goto doesn't consume any registers,
        so LLVM can produce much better code due to lower register pressure.
      - instead of counting up from zero to BPF_MAX_LOOPS, start at BPF_MAX_LOOPS
        and break out of the loop when the count reaches zero
      - use may_goto instruction in cond_break macro
      - recognize that 'exact' state comparison doesn't need to be truly exact.
        regsafe() should ignore precision and liveness marks, but range_within
        logic is safe to use while evaluating open coded iterators.
      ====================
      
      Link: https://lore.kernel.org/r/20240306031929.42666-1-alexei.starovoitov@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • Alexei Starovoitov's avatar
      0c8bbf99
    • bpf: Add cond_break macro · 06375801
      Alexei Starovoitov authored
      Use may_goto instruction to implement cond_break macro.
      Ideally the macro should be written as:
        asm volatile goto(".byte 0xe5;
                           .byte 0;
                           .short %l[l_break] ...
                           .long 0;
      but LLVM doesn't recognize a 2-byte PC-relative fixup yet.
      Hence use
        asm volatile goto(".byte 0xe5;
                           .byte 0;
                           .long %l[l_break] ...
                           .short 0;
      that produces correct asm on little endian.
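
      For reference, a sketch of the resulting macro (modelled on the
      cond_break added to the selftests' bpf_experimental.h; exact
      whitespace and details may differ):

        /* 0xe5 is BPF_JMP | BPF_JCOND; the .long places the 16-bit branch
         * offset into the insn's off field on little endian, and the
         * asm-goto label turns the branch target into a plain C 'break'.
         */
        #define cond_break                                              \
                ({ __label__ l_break, l_continue;                       \
                   asm volatile goto("1:.byte 0xe5;                     \
                        .byte 0;                                        \
                        .long ((%l[l_break] - 1b - 8) / 8) & 0xffff;    \
                        .short 0"                                       \
                        :::: l_break);                                  \
                   goto l_continue;                                     \
                   l_break: break;                                      \
                   l_continue:;                                         \
                })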
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Tested-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20240306031929.42666-4-alexei.starovoitov@gmail.com
    • bpf: Recognize that two registers are safe when their ranges match · 4f81c16f
      Alexei Starovoitov authored
      When open coded iterators, bpf_loop or may_goto are used, the following
      two states are equivalent and it is safe to prune the search:
      
      cur state: fp-8_w=scalar(id=3,smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf))
      old state: fp-8_rw=scalar(id=2,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=11,var_off=(0x0; 0xf))
      
      In other words, an "exact" state match should ignore liveness and
      precision marks, since the open coded iterator logic didn't complete
      their propagation. reg_old->type == NOT_INIT && reg_cur->type != NOT_INIT
      is also not safe to prune while looping, but the range_within logic that
      applies to scalars, ptr_to_mem, map_value and pkt_ptr is safe to rely on.
      
      Avoid doing such a comparison when the regular infinite loop detection
      logic is used, otherwise the bounded loop logic would flag such an
      "infinite loop" as a false positive. One such example is
      not_an_inifinite_loop() in progs/verifier_loops1.c.
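
      Conceptually, the range check requires the old (already verified)
      register range to contain the current one. A simplified sketch, with
      field names following the verifier's struct bpf_reg_state
      (illustration only, not the kernel code verbatim):

        static bool range_within(const struct bpf_reg_state *old,
                                 const struct bpf_reg_state *cur)
        {
                /* old must fully contain cur, in both the 64-bit and
                 * 32-bit signed/unsigned views
                 */
                return old->umin_value <= cur->umin_value &&
                       old->umax_value >= cur->umax_value &&
                       old->smin_value <= cur->smin_value &&
                       old->smax_value >= cur->smax_value &&
                       old->u32_min_value <= cur->u32_min_value &&
                       old->u32_max_value >= cur->u32_max_value &&
                       old->s32_min_value <= cur->s32_min_value &&
                       old->s32_max_value >= cur->s32_max_value;
        }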
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Tested-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20240306031929.42666-3-alexei.starovoitov@gmail.com
    • Merge branch 'mm-enforce-ioremap-address-space-and-introduce-sparse-vm_area' · 9a9d1d36
      Andrii Nakryiko authored
      Alexei Starovoitov says:
      
      ====================
      mm: Enforce ioremap address space and introduce sparse vm_area
      
      From: Alexei Starovoitov <ast@kernel.org>
      
      v3 -> v4
      - dropped VM_XEN patch for now. It will be in the follow up.
      - fixed constant as pointed out by Mike
      
      v2 -> v3
      - added Christoph's reviewed-by to patch 1
      - cap commit log lines to 75 chars
      - factored out common checks in patch 3 into helper
      - made vm_area_unmap_pages() return void
      
      There are various users of kernel virtual address space:
      vmalloc, vmap, ioremap, xen.
      
      - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag
      and these areas are treated differently by KASAN.
      
      - the areas created by vmap() function should be tagged with VM_MAP
      (as majority of the users do).
      
      - ioremap areas are tagged with VM_IOREMAP and vm area start is aligned
      to size of the area unlike vmalloc/vmap.
      
      - there is also xen usage that is marked as VM_IOREMAP, but it doesn't
      call ioremap_page_range() unlike all other VM_IOREMAP users.
      
      To clean this up a bit, enforce that ioremap_page_range() checks the range
      and VM_IOREMAP flag.
      
      In addition, BPF would like to reserve regions of kernel virtual address
      space and populate them lazily, similar to the xen use cases.
      For that reason, introduce VM_SPARSE flag and vm_area_[un]map_pages()
      helpers to populate this sparse area.
      
      In the end the /proc/vmallocinfo will show
      "vmalloc"
      "vmap"
      "ioremap"
      "sparse"
      categories for different kinds of address regions.
      
      ioremap and sparse regions will read as zero when dumped through /proc/kcore
      ====================
      
      Link: https://lore.kernel.org/r/20240305030516.41519-1-alexei.starovoitov@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • bpf: Introduce may_goto instruction · 011832b9
      Alexei Starovoitov authored
      Introduce may_goto instruction that from the verifier pov is similar to
      open coded iterators bpf_for()/bpf_repeat() and bpf_loop() helper, but it
      doesn't iterate any objects.
      In assembly 'may_goto' is a nop most of the time until bpf runtime has to
      terminate the program for whatever reason. In the current implementation
      may_goto has a hidden counter, but other mechanisms can be used.
      For programs written in C, a later patch introduces the 'cond_break' macro
      that combines 'may_goto' with a 'break' statement and has similar semantics:
      cond_break is a nop until the bpf runtime has to break out of this loop.
      It can be used in any normal "for" or "while" loop, like
      
        for (i = zero; i < cnt; cond_break, i++) {
      
      The verifier recognizes that may_goto is used in the program, reserves
      additional 8 bytes of stack, initializes them in subprog prologue, and
      replaces may_goto instruction with:
      aux_reg = *(u64 *)(fp - 40)
      if aux_reg == 0 goto pc+off
      aux_reg -= 1
      *(u64 *)(fp - 40) = aux_reg
      
      may_goto instruction can be used by LLVM to implement __builtin_memcpy,
      __builtin_strcmp.
      
      may_goto is not a full substitute for the bpf_for() macro.
      bpf_for() doesn't have an induction variable that the verifier sees,
      so 'i' in bpf_for(i, 0, 100) is seen as imprecise and bounded.
      
      But when the code is written as:
      for (i = 0; i < 100; cond_break, i++)
      the verifier sees 'i' as a precise constant zero,
      hence cond_break (aka may_goto) doesn't help the loop converge.
      A static or global variable can be used as a workaround:
      static int zero = 0;
      for (i = zero; i < 100; cond_break, i++) // works!
      
      may_goto works well with arena pointers that don't need to be bounds
      checked on access. Load/store from arena returns imprecise unbounded
      scalar and loops with may_goto pass the verifier.
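
      A complete (hypothetical) program using this pattern could look like
      the sketch below; the program name, section and loop bound are
      illustrative, and cond_break is assumed to come from the selftests'
      bpf_experimental.h:

        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        #include "bpf_experimental.h"

        char _license[] SEC("license") = "GPL";

        static int zero = 0;    /* keeps 'i' imprecise for the verifier */

        SEC("tc")
        int sum_demo(struct __sk_buff *skb)
        {
                long sum = 0;
                int i;

                /* bound is far too large to walk insn-by-insn; cond_break
                 * lets the verifier accept the loop anyway
                 */
                for (i = zero; i < 1000000; cond_break, i++)
                        sum += i;
                return sum & 1;
        }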
      
      Reserve new opcode BPF_JMP | BPF_JCOND for may_goto insn.
      JCOND stands for conditional pseudo jump.
      Since a goto_or_nop insn was also proposed, it may reuse the same opcode.
      may_goto vs goto_or_nop can be distinguished by src_reg:
      code = BPF_JMP | BPF_JCOND
      src_reg = 0 - may_goto
      src_reg = 1 - goto_or_nop
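
      As a sketch of the resulting encoding (assuming BPF_JCOND is defined
      as 0xe0, so BPF_JMP | BPF_JCOND == 0xe5, matching the byte sequence
      used by the cond_break macro; the helper name is illustrative):

        #include <linux/bpf.h>

        static struct bpf_insn may_goto_insn(__s16 off)
        {
                return (struct bpf_insn) {
                        .code    = BPF_JMP | BPF_JCOND, /* 0x05 | 0xe0 = 0xe5 */
                        .dst_reg = 0,
                        .src_reg = 0,   /* 0 = may_goto, 1 = goto_or_nop */
                        .off     = off, /* branch target, in instructions */
                        .imm     = 0,
                };
        }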
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Tested-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20240306031929.42666-2-alexei.starovoitov@gmail.com
    • mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages(). · e6f79822
      Alexei Starovoitov authored
      vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
      virtual space.
      
      get_vm_area() with appropriate flag is used to request an area of kernel
      address range. It's used for vmalloc, vmap, ioremap, xen use cases.
      - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
      - the areas created by vmap() function should be tagged with VM_MAP.
      - ioremap areas are tagged with VM_IOREMAP.
      
      BPF would like to extend the vmap API to implement a sparsely populated,
      yet virtually contiguous, kernel virtual space that is filled in lazily.
      Introduce the VM_SPARSE flag and the vm_area_map_pages(area, start_addr,
      count, pages) API to map a set of pages within a given area.
      It performs the same sanity checks as vmap() does.
      It also checks that the area was created by get_vm_area() with the
      VM_SPARSE flag, which identifies such areas in /proc/vmallocinfo
      and makes them read as zero pages through /proc/kcore.
      
      The next commits will introduce bpf_arena which is a sparsely populated
      shared memory region between bpf program and user space process. It will
      map privately-managed pages into a sparse vm area with the following steps:
      
        // request virtual memory region during bpf prog verification
        area = get_vm_area(area_size, VM_SPARSE);
      
        // on demand
        vm_area_map_pages(area, kaddr, kend, pages);
        vm_area_unmap_pages(area, kaddr, kend);
      
        // after bpf program is detached and unloaded
        free_vm_area(area);
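
      The helper prototypes implied by the description above (a sketch;
      see the patch itself for the exact signatures):

        struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);

        /* map/unmap pages covering [start, end) of a VM_SPARSE area */
        int vm_area_map_pages(struct vm_struct *area, unsigned long start,
                              unsigned long end, struct page **pages);
        void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
                                 unsigned long end);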
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Link: https://lore.kernel.org/bpf/20240305030516.41519-3-alexei.starovoitov@gmail.com
    • mm: Enforce VM_IOREMAP flag and range in ioremap_page_range. · 3e49a866
      Alexei Starovoitov authored
      There are various users of the get_vm_area() + ioremap_page_range() APIs.
      Enforce that the area was requested from get_vm_area() with the VM_IOREMAP
      type and that the range passed to ioremap_page_range() matches the created
      vm_area, to avoid accidentally ioremapping into the wrong address range.
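
      A simplified sketch of the kind of check this adds (hypothetical
      helper name; ioremap_page_range() would fail when such a check does
      not pass):

        static int check_ioremap_range(unsigned long addr, unsigned long end)
        {
                struct vm_struct *area = find_vm_area((void *)addr);

                /* must be an area created with the VM_IOREMAP flag ... */
                if (!area || !(area->flags & VM_IOREMAP))
                        return -EINVAL;
                /* ... and [addr, end) must match that area exactly */
                if (addr != (unsigned long)area->addr ||
                    end != (unsigned long)area->addr + get_vm_area_size(area))
                        return -ERANGE;
                return 0;
        }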
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/bpf/20240305030516.41519-2-alexei.starovoitov@gmail.com
  2. 04 Mar, 2024 8 commits
  3. 03 Mar, 2024 2 commits
    • bpf, docs: Use IETF format for field definitions in instruction-set.rst · 4e73e1bc
      Dave Thaler authored
      In preparation for publication as an IETF RFC, the WG chairs asked me
      to convert the document to use IETF packet format for field layout, so
      this patch attempts to make it consistent with other IETF documents.
      
      Some fields that are not byte aligned were previously inconsistent
      in how values were defined.  Some were defined as the value of the
      byte containing the field (like 0x20 for a field holding the high
      four bits of the byte), and others were defined as the value of the
      field itself (like 0x2).  This patch makes them consistent in using
      just the value of the field itself, which is the IETF convention.
      
      As a result, some of the defines that used BPF_* would no longer
      match the value in the spec, and so this patch also drops the BPF_*
      prefix to avoid confusion with the defines that are the full-byte
      equivalent values.  For consistency, BPF_* is then dropped from
      other fields too.  BPF_<foo> is thus the Linux implementation-specific
      define for <foo> as it appears in the BPF ISA specification.
      
      The syntax BPF_ADD | BPF_X | BPF_ALU only worked for full-byte
      values so the convention {ADD, X, ALU} is proposed for referring
      to field values instead.
      
      Also replace the redundant "LSB bits" with "least significant bits".
      
      A preview of what the resulting Internet Draft would look like can
      be seen at:
      https://htmlpreview.github.io/?https://raw.githubusercontent.com/dthaler/ebpf-docs-1/format/draft-ietf-bpf-isa.html
      
      v1->v2: Fix sphinx issue as recommended by David Vernet
      Signed-off-by: Dave Thaler <dthaler1968@gmail.com>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20240301222337.15931-1-dthaler1968@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 4b2765ae
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2024-02-29
      
      We've added 119 non-merge commits during the last 32 day(s) which contain
      a total of 150 files changed, 3589 insertions(+), 995 deletions(-).
      
      The main changes are:
      
      1) Extend the BPF verifier to enable static subprog calls in spin lock
         critical sections, from Kumar Kartikeya Dwivedi.
      
      2) Fix confusing and incorrect inference of PTR_TO_CTX argument type
         in BPF global subprogs, from Andrii Nakryiko.
      
      3) Larger batch of riscv BPF JIT improvements and enabling inlining
         of the bpf_kptr_xchg() for RV64, from Pu Lehui.
      
      4) Allow skeleton users to change the values of the fields in struct_ops
         maps at runtime, from Kui-Feng Lee.
      
      5) Extend the verifier's capabilities of tracking scalars when they
         are spilled to stack, especially when the spill or fill is narrowing,
         from Maxim Mikityanskiy & Eduard Zingerman.
      
      6) Various BPF selftest improvements to fix errors under gcc BPF backend,
         from Jose E. Marchesi.
      
      7) Avoid module loading failure when the module trying to register
         a struct_ops has its BTF section stripped, from Geliang Tang.
      
      8) Annotate all kfuncs in .BTF_ids section which eventually allows
         for automatic kfunc prototype generation from bpftool, from Daniel Xu.
      
      9) Several updates to the instruction-set.rst IETF standardization
         document, from Dave Thaler.
      
      10) Shrink the size of struct bpf_map resp. bpf_array,
          from Alexei Starovoitov.
      
      11) Initial small subset of BPF verifier prepwork for sleepable bpf_timer,
          from Benjamin Tissoires.
      
      12) Fix bpftool to be more portable to musl libc by using POSIX's
          basename(), from Arnaldo Carvalho de Melo.
      
      13) Add libbpf support to gcc in CORE macro definitions,
          from Cupertino Miranda.
      
      14) Remove a duplicate type check in perf_event_bpf_event,
          from Florian Lehner.
      
      15) Fix bpf_spin_{un,}lock BPF helpers to actually annotate them
          with notrace correctly, from Yonghong Song.
      
      16) Replace the deprecated bpf_lpm_trie_key 0-length array with flexible
          array to fix build warnings, from Kees Cook.
      
      17) Fix resolve_btfids cross-compilation to non host-native endianness,
          from Viktor Malik.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (119 commits)
        selftests/bpf: Test if shadow types work correctly.
        bpftool: Add an example for struct_ops map and shadow type.
        bpftool: Generated shadow variables for struct_ops maps.
        libbpf: Convert st_ops->data to shadow type.
        libbpf: Set btf_value_type_id of struct bpf_map for struct_ops.
        bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
        bpf, arm64: use bpf_prog_pack for memory management
        arm64: patching: implement text_poke API
        bpf, arm64: support exceptions
        arm64: stacktrace: Implement arch_bpf_stack_walk() for the BPF JIT
        bpf: add is_async_callback_calling_insn() helper
        bpf: introduce in_sleepable() helper
        bpf: allow more maps in sleepable bpf programs
        selftests/bpf: Test case for lacking CFI stub functions.
        bpf: Check cfi_stubs before registering a struct_ops type.
        bpf: Clarify batch lookup/lookup_and_delete semantics
        bpf, docs: specify which BPF_ABS and BPF_IND fields were zero
        bpf, docs: Fix typos in instruction-set.rst
        selftests/bpf: update tcp_custom_syncookie to use scalar packet offset
        bpf: Shrink size of struct bpf_map/bpf_array.
        ...
      ====================
      
      Link: https://lore.kernel.org/r/20240301001625.8800-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. 01 Mar, 2024 21 commits