1. 16 Jul, 2021 31 commits
  2. 15 Jul, 2021 9 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-timers' · 76283171
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      The first request to support timers in bpf was made in 2013 before sys_bpf
      syscall was added. That use case was periodic sampling. It was address with
      attaching bpf programs to perf_events. Then during XDP development the timers
      were requested to do garbage collection and health checks. They were worked
      around by implementing timers in user space and triggering progs with
      BPF_PROG_RUN command. The user space timers and perf_event+bpf timers are not
      armed by the bpf program. They're done asynchronously vs program execution.
      The XDP program cannot send a packet and arm the timer at the same time. The
      tracing prog cannot record an event and arm the timer right away. This large
      class of use cases remained unaddressed. The jiffy based and hrtimer based
      timers are essential part of the kernel development and with this patch set
      the hrtimer based timers will be available to bpf programs.
      
      TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
      to make sure bpf progs cannot crash the kernel.
      
      v6->v7:
      - address Andrii's comments and add his Acks.
      
      v5->v6:
      - address code review feedback from Martin and add his Acks.
      - add usercnt > 0 check to bpf_timer_init and remove timers_cancel_and_free
      second loop in map_free callbacks.
      - add cond_resched_rcu.
      
      v4->v5:
      - Martin noticed the following issues:
      . prog could be reallocated bpf_patch_insn_data().
      Fixed by passing 'aux' into bpf_timer_set_callback, since 'aux' is stable
      during insn patching.
      . Added missing rcu_read_lock.
      . Removed redundant record_map.
      - Discovered few bugs with stress testing:
      . One cpu does htab_free_prealloced_timers->bpf_timer_cancel_and_free->hrtimer_cancel
      while another is trying to do something with the timer like bpf_timer_start/set_callback.
      Those ops try to acquire bpf_spin_lock that is already taken by bpf_timer_cancel_and_free,
      so both cpus spin forever. The same problem existed in bpf_timer_cancel().
      One bpf prog on one cpu might call bpf_timer_cancel and wait, while another cpu is in
      the timer callback that tries to do bpf_timer_*() helper on the same timer.
      The fix is to do drop_prog_refcnt() and unlock. And only then hrtimer_cancel.
      Because of this had to add callback_fn != NULL check to bpf_timer_cb().
      Also removed redundant bpf_prog_inc/put from bpf_timer_cb() and replaced
      with rcu_dereference_check similar to recent rcu_read_lock-removal from drivers.
      bpf_timer_cb is in softirq.
      . Managed to hit refcnt==0 while doing bpf_prog_put from bpf_timer_cancel_and_free().
      That exposed the issue that bpf_prog_put wasn't ready to be called from irq context.
      Fixed similar to bpf_map_put which is irq ready.
      - Refactored BPF_CALL_1(bpf_spin_lock) into __bpf_spin_lock_irqsave() to
      make the main logic more clear, since Martin and Yonghong brought up this concern.
      
      v3->v4:
      1.
      Split callback_fn from bpf_timer_start into bpf_timer_set_callback as
      suggested by Martin. That makes bpf timer api match one to one to
      kernel hrtimer api and provides greater flexibility.
      2.
      Martin also discovered the following issue with uref approach:
      bpftool prog load xdp_timer.o /sys/fs/bpf/xdp_timer type xdp
      bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_timer dev lo
      rm /sys/fs/bpf/xdp_timer
      nc -6 ::1 8888
      bpftool net detach xdpgeneric dev lo
      The timer callback stays active in the kernel though the prog was detached
      and map usercnt == 0.
      It happened because 'bpftool prog load' pinned the prog only.
      The map usercnt went to zero. Subsequent attach and runs didn't
      affect map usercnt. The timer was able to start and bpf_prog_inc itself.
      When the prog was detached the prog stayed active.
      To address this issue added
      if (!atomic64_read(&(t->map->usercnt))) return -EPERM;
      to the first patch.
      Which means that timers are allowed only in the maps that are held
      by user space with open file descriptor or maps pinned in bpffs.
      3.
      Discovered that timers in inner maps were broken.
      The inner map pointers are dynamic. Therefore changed bpf_timer_init()
      to accept explicit map pointer supplied by the program instead
      of hidden map pointer supplied by the verifier.
      To make sure that pointer to a timer actually belongs to that map
      added the verifier check in patch 3.
      4.
      Addressed Yonghong's feedback. Improved comments and added
      dynamic in_nmi() check.
      Added Acks.
      
      v2->v3:
      The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
      called to make sure callback code doesn't disappear when timer is active and
      drop refcnt when timer cb is done. That led to a ton of race conditions between
      callback running and concurrent bpf_timer_init/start/cancel on another cpu,
      and concurrent bpf_map_update/delete_elem, and map destroy.
      
      Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
      timers that bpf prog armed in a link list and canceled them when prog refcnt
      went to zero. The race conditions disappeared, but timers in map-in-map could
      not be supported cleanly, since timers in inner maps have inner map's life time
      and don't match prog's life time.
      
      This v3 approach makes timers to be owned by maps. It allows timers in inner
      maps to be supported from the start. This apporach relies on "user refcnt"
      scheme used in prog_array that stores bpf programs for bpf_tail_call. The
      bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
      callback does decrement the refcnt. The ops->map_release_uref is
      responsible for cancelling the timers and dropping prog refcnt when user space
      reference to a map is dropped. That addressed all the races and simplified
      locking.
      
      Andrii presented a use case where specifying callback_fn in bpf_timer_init()
      is inconvenient vs specifying in bpf_timer_start(). The bpf_timer_init()
      typically is called outside for timer callback, while bpf_timer_start() most
      likely will be called from the callback.
      timer_cb() { ... bpf_timer_start(timer_cb); ...} looks like recursion and as
      infinite loop to the verifier. The verifier had to be made smarter to recognize
      such async callbacks. Patches 7,8,9 addressed that.
      
      Patch 1 and 2 refactoring.
      Patch 3 implements bpf timer helpers and locking.
      Patch 4 implements map side of bpf timer support.
      Patch 5 prevent pointer mismatch in bpf_timer_init.
      Patch 6 adds support for BTF in inner maps.
      Patch 7 teaches check_cfg() pass to understand async callbacks.
      Patch 8 teaches do_check() pass to understand async callbacks.
      Patch 9 teaches check_max_stack_depth() pass to understand async callbacks.
      Patches 10 and 11 are the tests.
      
      v1->v2:
      - Addressed great feedback from Andrii and Toke.
      - Fixed race between parallel bpf_timer_*() ops.
      - Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
      - Disallowed mmap and global timers.
      - Allow spin_lock and bpf_timer in an element.
      - Fixed memory leaks due to map destruction and LRU eviction.
      - A ton more tests.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      76283171
    • Alexei Starovoitov's avatar
      selftests/bpf: Add a test with bpf_timer in inner map. · 61f71e74
      Alexei Starovoitov authored
      Check that map-in-map supports bpf timers.
      
      Check that indirect "recursion" of timer callbacks works:
      timer_cb1() { bpf_timer_set_callback(timer_cb2); }
      timer_cb2() { bpf_timer_set_callback(timer_cb1); }
      
      Check that
        bpf_map_release
          htab_free_prealloced_timers
            bpf_timer_cancel_and_free
              hrtimer_cancel
      works while timer cb is running.
      "while true; do ./test_progs -t timer_mim; done"
      is a great stress test. It caught missing timer cancel in htab->extra_elems.
      
      timer_mim_reject.c is a negative test that checks
      that timer<->map mismatch is prevented.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-12-alexei.starovoitov@gmail.com
      61f71e74
    • Alexei Starovoitov's avatar
      selftests/bpf: Add bpf_timer test. · 3540f7c6
      Alexei Starovoitov authored
      Add bpf_timer test that creates timers in preallocated and
      non-preallocated hash, in array and in lru maps.
      Let array timer expire once and then re-arm it for 35 seconds.
      Arm lru timer into the same callback.
      Then arm and re-arm hash timers 10 times each.
      At the last invocation of prealloc hash timer cancel the array timer.
      Force timer free via LRU eviction and direct bpf_map_delete_elem.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-11-alexei.starovoitov@gmail.com
      3540f7c6
    • Alexei Starovoitov's avatar
      bpf: Teach stack depth check about async callbacks. · 7ddc80a4
      Alexei Starovoitov authored
      Teach max stack depth checking algorithm about async callbacks
      that don't increase bpf program stack size.
      Also add sanity check that bpf_tail_call didn't sneak into async cb.
      It's impossible, since PTR_TO_CTX is not available in async cb,
      hence the program cannot contain bpf_tail_call(ctx,...);
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
      7ddc80a4
    • Alexei Starovoitov's avatar
      bpf: Implement verifier support for validation of async callbacks. · bfc6bb74
      Alexei Starovoitov authored
      bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
      PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
      and pass them into the helpers as function callbacks.
      In case of bpf_for_each_map_elem() the callback is invoked synchronously
      and the verifier treats it as a normal subprogram call by adding another
      bpf_func_state and new frame in __check_func_call().
      bpf_timer_set_callback() doesn't invoke the callback directly.
      The subprogram will be called asynchronously from bpf_timer_cb().
      Teach the verifier to validate such async callbacks as special kind
      of jump by pushing verifier state into stack and let pop_stack() process it.
      
      Special care needs to be taken during state pruning.
      The call insn doing bpf_timer_set_callback has to be a prune_point.
      Otherwise short timer callbacks might not have prune points in front of
      bpf_timer_set_callback() which means is_state_visited() will be called
      after this call insn is processed in __check_func_call(). Which means that
      another async_cb state will be pushed to be walked later and the verifier
      will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
      Since push_async_cb() looks like another push_stack() branch the
      infinite loop detection will trigger false positive. To recognize
      this case mark such states as in_async_callback_fn.
      To distinguish infinite loop in async callback vs the same callback called
      with different arguments for different map and timer add async_entry_cnt
      to bpf_func_state.
      
      Enforce return zero from async callbacks.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
      bfc6bb74
    • Alexei Starovoitov's avatar
      bpf: Relax verifier recursion check. · 86fc6ee6
      Alexei Starovoitov authored
      In the following bpf subprogram:
      static int timer_cb(void *map, void *key, void *value)
      {
          bpf_timer_set_callback(.., timer_cb);
      }
      
      the 'timer_cb' is a pointer to a function.
      ld_imm64 insn is used to carry this pointer.
      bpf_pseudo_func() returns true for such ld_imm64 insn.
      
      Unlike bpf_for_each_map_elem() the bpf_timer_set_callback() is asynchronous.
      Relax control flow check to allow such "recursion" that is seen as an infinite
      loop by check_cfg(). The distinction between bpf_for_each_map_elem() the
      bpf_timer_set_callback() is done in the follow up patch.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-8-alexei.starovoitov@gmail.com
      86fc6ee6
    • Alexei Starovoitov's avatar
      bpf: Remember BTF of inner maps. · 40ec00ab
      Alexei Starovoitov authored
      BTF is required for 'struct bpf_timer' to be recognized inside map value.
      The bpf timers are supported inside inner maps.
      Remember 'struct btf *' in inner_map_meta to make it available
      to the verifier in the sequence:
      
      struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
      if (inner_map)
          timer = bpf_map_lookup_elem(&inner_map, ...);
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-7-alexei.starovoitov@gmail.com
      40ec00ab
    • Alexei Starovoitov's avatar
      bpf: Prevent pointer mismatch in bpf_timer_init. · 3e8ce298
      Alexei Starovoitov authored
      bpf_timer_init() arguments are:
      1. pointer to a timer (which is embedded in map element).
      2. pointer to a map.
      Make sure that pointer to a timer actually belongs to that map.
      
      Use map_uid (which is unique id of inner map) to reject:
      inner_map1 = bpf_map_lookup_elem(outer_map, key1)
      inner_map2 = bpf_map_lookup_elem(outer_map, key2)
      if (inner_map1 && inner_map2) {
          timer = bpf_map_lookup_elem(inner_map1);
          if (timer)
              // mismatch would have been allowed
              bpf_timer_init(timer, inner_map2);
      }
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
      3e8ce298
    • Alexei Starovoitov's avatar
      bpf: Add map side support for bpf timers. · 68134668
      Alexei Starovoitov authored
      Restrict bpf timers to array, hash (both preallocated and kmalloced), and
      lru map types. The per-cpu maps with timers don't make sense, since 'struct
      bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
      the number of timers depends on number of possible cpus and timers would not be
      accessible from all cpus. lpm map support can be added in the future.
      The timers in inner maps are supported.
      
      The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
      bpf_timer in a given map element.
      
      Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
      that map element indeed contains 'struct bpf_timer'.
      
      Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
      map element data is reused in preallocated htab and lru maps.
      
      Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
      map element. There could be one of each, but not more than one. Due to 'one
      bpf_timer in one element' restriction do not support timers in global data,
      since global data is a map of single element, but from bpf program side it's
      seen as many global variables and restriction of single global timer would be
      odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
      timers, since user space could have corrupted mmap element and crashed the
      kernel. The maps with timers cannot be readonly. Due to these restrictions
      search for bpf_timer in datasec BTF in case it was placed in the global data to
      report clear error.
      
      The previous patch allowed 'struct bpf_timer' as a first field in a map
      element only. Relax this restriction.
      
      Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
      the timer when lru map deletes an element as a part of it eviction algorithm.
      
      Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
      The timer operation are done through helpers only.
      This is similar to 'struct bpf_spin_lock'.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
      68134668