1. 23 Jun, 2022 2 commits
    • test_bpf: fix incorrect netdev features · 9676fecc
      Jian Shen authored
      The prototype of .features is netdev_features_t, it should use
      NETIF_F_LLTX and NETIF_F_HW_VLAN_STAG_TX, not NETIF_F_LLTX_BIT
      and NETIF_F_HW_VLAN_STAG_TX_BIT.
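
      To illustrate the distinction the fix relies on: a *_BIT constant is a
      bit position, while the feature flag is the mask with that bit set. The
      sketch below uses hypothetical stand-in names and illustrative bit
      positions, not the kernel's real definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for netdev_features_t. */
typedef uint64_t features_t;

/* Illustrative bit positions only; these are assumptions, not kernel values. */
enum {
	FEATURE_LLTX_BIT = 12,
	FEATURE_HW_VLAN_STAG_TX_BIT = 26,
};

/* Turn a bit position into the corresponding single-bit mask. */
static features_t feature_mask(unsigned int bit)
{
	assert(bit < 64);
	return (features_t)1 << bit;
}

#define FEATURE_LLTX            feature_mask(FEATURE_LLTX_BIT)
#define FEATURE_HW_VLAN_STAG_TX feature_mask(FEATURE_HW_VLAN_STAG_TX_BIT)

/* Returns nonzero when every bit of mask is set in features. */
static int has_feature(features_t features, features_t mask)
{
	return (features & mask) == mask;
}
```

      With these stand-ins, FEATURE_LLTX_BIT | FEATURE_HW_VLAN_STAG_TX_BIT
      evaluates to 30, which sets neither of the intended flags.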
      
      Fixes: cf204a71 ("bpf, testing: Introduce 'gso_linear_no_head_frag' skb_segment test")
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/r/20220622135002.8263-1-shenjian15@huawei.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add benchmark for local_storage get · 73087489
      Dave Marchevsky authored
      Add benchmarks to demonstrate the performance cliff for local_storage
      get as the number of local_storage maps increases beyond the current
      local_storage implementation's cache size.
      
      "sequential get" and "interleaved get" benchmarks are added, both of
      which do many bpf_task_storage_get calls on sets of task local_storage
      maps of various counts, while considering a single specific map to be
      'important' and counting task_storage_gets to the important map
      separately, in addition to the normal 'hits' count of all gets. The goal
      is to mimic a scenario where a particular program using one map - the
      important one - is running on a system where many other local_storage
      maps exist and are accessed often.
      
      While the "sequential get" benchmark does bpf_task_storage_get for map
      0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark
      interleaves 4 bpf_task_storage_gets for the important map for every 10
      map gets. This is meant to highlight performance differences when the
      important map is accessed far more frequently than non-important maps.
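
      The two access patterns can be sketched as a per-pass counter. This is
      a model under assumptions, not the benchmark source: it assumes map 0
      plays the important map and that the extra important-map get is issued
      on every third iteration, which yields 4 extra gets per 10 maps and
      matches the important_hits/hits ratios in the results below.

```c
#include <assert.h>
#include <stdbool.h>

/* Gets issued during one pass over all maps. */
struct pass_counts {
	unsigned int hits;           /* all gets */
	unsigned int important_hits; /* gets that target the important map */
};

/* Count the gets of one pass; interleave selects the "interleaved" pattern. */
static struct pass_counts count_pass(unsigned int num_maps, bool interleave)
{
	struct pass_counts c = { 0, 0 };

	assert(num_maps > 0);
	for (unsigned int i = 0; i < num_maps; i++) {
		c.hits++;                     /* regular get on map i */
		if (i == 0)
			c.important_hits++;   /* assumption: map 0 is the important map */
		if (interleave && i % 3 == 0) {
			c.hits++;             /* extra get on the important map */
			c.important_hits++;
		}
	}
	return c;
}
```

      For 10 maps this gives 1 important get out of 10 in the sequential
      pattern and 5 out of 14 in the interleaved one.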
      
      A "hashmap control" benchmark is also included for easy comparison of
      standard bpf hashmap lookup vs local_storage get. The benchmark is
      similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
      instead of local storage. Only one inner map is created - a hashmap
      meant to hold tid -> data mapping for all tasks. Size of the hashmap is
      hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these
      keys which are actually fetched as part of the benchmark is
      configurable.
      
      Addition of this benchmark is inspired by conversation with Alexei in a
      previous patchset's thread [0], which highlighted the need for such a
      benchmark to motivate and validate improvements to local_storage
      implementation. My approach in that series focused on improving
      performance for explicitly-marked 'important' maps and was rejected
      with feedback to make more generally-applicable improvements while
      avoiding explicitly marking maps as important. Thus the benchmark
      reports both general and important-map-focused metrics, so effect of
      future work on both is clear.
      
      Benchmark results on a powerful system (Skylake, 20 cores, 256 GB
      RAM):
      
      Hashmap Control
      ===============
              num keys: 10
      hashmap (control) sequential    get:  hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
      
              num keys: 1000
      hashmap (control) sequential    get:  hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
      
              num keys: 10000
      hashmap (control) sequential    get:  hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
      
              num keys: 100000
      hashmap (control) sequential    get:  hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
      
              num keys: 4194304
      hashmap (control) sequential    get:  hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s
      
      Local Storage
      =============
              num_maps: 1
      local_storage cache sequential  get:  hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
      local_storage cache interleaved get:  hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
      
              num_maps: 10
      local_storage cache sequential  get:  hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
      local_storage cache interleaved get:  hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
      
              num_maps: 16
      local_storage cache sequential  get:  hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
      local_storage cache interleaved get:  hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
      
              num_maps: 17
      local_storage cache sequential  get:  hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
      local_storage cache interleaved get:  hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
      
              num_maps: 24
      local_storage cache sequential  get:  hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
      local_storage cache interleaved get:  hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
      
              num_maps: 32
      local_storage cache sequential  get:  hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
      local_storage cache interleaved get:  hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
      
              num_maps: 100
      local_storage cache sequential  get:  hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
      local_storage cache interleaved get:  hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
      
              num_maps: 1000
      local_storage cache sequential  get:  hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
      local_storage cache interleaved get:  hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s
      
      Looking at the "sequential get" results, it's clear that as the
      number of task local_storage maps grows beyond the current cache size
      (16), there's a significant reduction in hits throughput. Note that
      current local_storage implementation assigns a cache_idx to maps as they
      are created. Since "sequential get" is creating maps 0..n in order and
      then doing bpf_task_storage_get calls in the same order, the benchmark
      is effectively ensuring that a map will not be in cache when the program
      tries to access it.
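
      A toy model makes the mechanism concrete. It assumes that map i is
      given cache slot i % 16 (creation order, round robin) and that a miss
      installs the map in its slot; both are simplifying assumptions, not a
      copy of the kernel's slot allocation.

```c
#include <assert.h>

#define CACHE_SIZE 16 /* cache size cited in the text */
#define NO_MAP (-1)

/* Cache hits during one steady-state sequential pass over maps 0..num_maps-1. */
static unsigned int steady_state_hits(unsigned int num_maps)
{
	int cache[CACHE_SIZE];
	unsigned int hits = 0;

	assert(num_maps > 0);
	for (unsigned int s = 0; s < CACHE_SIZE; s++)
		cache[s] = NO_MAP;

	/* Pass 0 warms the cache; the count from pass 1 is returned. */
	for (unsigned int pass = 0; pass < 2; pass++) {
		hits = 0;
		for (unsigned int map = 0; map < num_maps; map++) {
			unsigned int idx = map % CACHE_SIZE; /* assumed slot assignment */

			if (cache[idx] == (int)map)
				hits++;                 /* fast path: found in cache */
			else
				cache[idx] = (int)map;  /* slow path, then install in slot */
		}
	}
	return hits;
}
```

      In this model every get hits with 16 maps, 15 of 17 hit with 17 maps,
      8 of 24 with 24 maps, and none with 32 maps, since each pair of maps
      sharing a slot evicts the other before it is accessed again.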
      
      For "interleaved get" results, important-map hits throughput is greatly
      increased as the important map is more likely to be in cache by virtue
      of being accessed far more frequently. Throughput still decreases as
      the number of maps increases, though.
      
      To get a sense of the overhead of the benchmark program, I
      commented out bpf_task_storage_get/bpf_map_lookup_elem in
      local_storage_bench.c and ran the benchmark on the same host as the
      'real' run. Results:
      
      Hashmap Control
      ===============
              num keys: 10
      hashmap (control) sequential    get:  hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s
      
              num keys: 1000
      hashmap (control) sequential    get:  hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s
      
              num keys: 10000
      hashmap (control) sequential    get:  hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s
      
              num keys: 100000
      hashmap (control) sequential    get:  hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s
      
              num keys: 4194304
      hashmap (control) sequential    get:  hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s
      
      Local Storage
      =============
              num_maps: 1
      local_storage cache sequential  get:  hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
      local_storage cache interleaved get:  hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s
      
              num_maps: 10
      local_storage cache sequential  get:  hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
      local_storage cache interleaved get:  hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s
      
              num_maps: 16
      local_storage cache sequential  get:  hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
      local_storage cache interleaved get:  hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s
      
              num_maps: 17
      local_storage cache sequential  get:  hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
      local_storage cache interleaved get:  hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s
      
              num_maps: 24
      local_storage cache sequential  get:  hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
      local_storage cache interleaved get:  hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s
      
              num_maps: 32
      local_storage cache sequential  get:  hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
      local_storage cache interleaved get:  hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s
      
              num_maps: 100
      local_storage cache sequential  get:  hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
      local_storage cache interleaved get:  hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s
      
              num_maps: 1000
      local_storage cache sequential  get:  hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
      local_storage cache interleaved get:  hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s
      
      Adjusting for overhead, latency numbers for "hashmap control" and
      "sequential get" are:
      
      hashmap_control_1k:   ~53.8ns
      hashmap_control_10k:  ~124.2ns
      hashmap_control_100k: ~206.5ns
      sequential_get_1:     ~12.1ns
      sequential_get_10:    ~16.0ns
      sequential_get_16:    ~13.8ns
      sequential_get_17:    ~16.8ns
      sequential_get_24:    ~40.9ns
      sequential_get_32:    ~65.2ns
      sequential_get_100:   ~148.2ns
      sequential_get_1000:  ~2204ns
      
      These numbers clearly demonstrate a cliff.
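
      The arithmetic behind the adjusted numbers is a plain subtraction of
      the overhead-only latency from the measured latency. A small sketch,
      with hypothetical helper names:

```c
#include <assert.h>

/* Convert a throughput in M ops/s to a latency in ns/op. */
static double latency_ns_from_mops(double mops)
{
	assert(mops > 0.0);
	return 1000.0 / mops;
}

/* Subtract the overhead-only latency from the measured latency. */
static double adjusted_latency_ns(double measured_ns, double overhead_ns)
{
	assert(measured_ns >= overhead_ns);
	return measured_ns - overhead_ns;
}
```

      For example, adjusted_latency_ns(72.683, 18.899) gives the ~53.8ns
      listed for hashmap_control_1k.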
      
      In the discussion for v1 of this patch, Alexei noted that local_storage
      was 2.5x faster than a large hashmap when initially implemented [1]. The
      benchmark results show that local_storage is 5-10x faster: a
      long-running BPF application putting some pid-specific info into a
      hashmap for each pid it sees will probably see on the order of 10-100k
      pids. Bench numbers for hashmaps of this size are ~10x slower than
      sequential_get_16, but as the number of local_storage maps grows far
      past local_storage cache size the performance advantage shrinks and
      eventually reverses.
      
      When running the benchmarks it may be necessary to bump 'open files'
      ulimit for a successful run.
      
        [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
        [1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/

      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 22 Jun, 2022 1 commit
  3. 21 Jun, 2022 9 commits
    • bpf, arm64: Keep tail call count across bpf2bpf calls · d4609a5d
      Jakub Sitnicki authored
      Today doing a BPF tail call after a BPF to BPF call, that is from a
      subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features
      requires support from the JIT. The tail call count has to be tracked
      through BPF to BPF calls, as well as through BPF tail calls, to prevent
      unbounded chains of tail calls.
      
      The arm64 BPF JIT stores the tail call count (TCC) in a dedicated
      register (X26). This makes it easier to support bpf2bpf calls mixed with
      tail calls than on the x86 platform.
      
      In order to keep the tail call count intact throughout bpf2bpf calls, all
      we need to do is tweak the program prologue generator. When emitting the
      prologue for a subprogram, we skip the block that initializes the tail call
      count and emits a jump pad for the tail call.
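
      The effect of the prologue change can be modelled in plain C. The
      struct and function names here are hypothetical; the limit of 33 is
      taken from the 0x21 comparison against X26 in the listing below.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_TAIL_CALL_CNT 33 /* 0x21 in the disassembly */

/* Stands in for the dedicated TCC register (X26). */
struct tcc_model {
	unsigned long tcc;
};

/* Prologue: only the main program resets the counter; subprograms keep it. */
static void enter_prog(struct tcc_model *m, bool is_subprog)
{
	assert(m);
	if (!is_subprog)
		m->tcc = 0;
}

/* Returns 1 and counts the tail call if allowed, 0 once the limit is hit. */
static int try_tail_call(struct tcc_model *m)
{
	assert(m);
	if (m->tcc >= MODEL_MAX_TAIL_CALL_CNT)
		return 0;
	m->tcc++;
	return 1;
}

/* Repeat "bpf2bpf call, then tail call" and count how many tail calls pass. */
static unsigned int count_allowed_tail_calls(unsigned int attempts)
{
	struct tcc_model m;
	unsigned int allowed = 0;

	enter_prog(&m, false);            /* main program entry */
	for (unsigned int i = 0; i < attempts; i++) {
		enter_prog(&m, true);     /* subprogram prologue: TCC untouched */
		allowed += try_tail_call(&m);
	}
	return allowed;
}
```

      Because the subprogram prologue leaves the counter alone, the chain is
      capped at 33 tail calls no matter how many bpf2bpf calls sit between
      them. If the subprogram prologue reset the counter, the cap would never
      trigger in this model.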
      
      With this change, a sample execution flow where a bpf2bpf call is followed
      by a tail call would look like so:
      
      int entry(struct __sk_buff *skb):
         0xffffffc0090151d4:  paciasp
         0xffffffc0090151d8:  stp     x29, x30, [sp, #-16]!
         0xffffffc0090151dc:  mov     x29, sp
         0xffffffc0090151e0:  stp     x19, x20, [sp, #-16]!
         0xffffffc0090151e4:  stp     x21, x22, [sp, #-16]!
         0xffffffc0090151e8:  stp     x25, x26, [sp, #-16]!
         0xffffffc0090151ec:  stp     x27, x28, [sp, #-16]!
         0xffffffc0090151f0:  mov     x25, sp
         0xffffffc0090151f4:  mov     x26, #0x0                       // <- init TCC only
         0xffffffc0090151f8:  bti     j                               //    in main prog
         0xffffffc0090151fc:  sub     x27, x25, #0x0
         0xffffffc009015200:  sub     sp, sp, #0x10
         0xffffffc009015204:  mov     w1, #0x0
         0xffffffc009015208:  mov     x10, #0xffffffffffffffff
         0xffffffc00901520c:  strb    w1, [x25, x10]
         0xffffffc009015210:  mov     x10, #0xffffffffffffd25c
         0xffffffc009015214:  movk    x10, #0x902, lsl #16
         0xffffffc009015218:  movk    x10, #0xffc0, lsl #32
         0xffffffc00901521c:  blr     x10 -------------------.        // bpf2bpf call
         0xffffffc009015220:  add     x7, x0, #0x0 <-------------.
         0xffffffc009015224:  add     sp, sp, #0x10          |   |
         0xffffffc009015228:  ldp     x27, x28, [sp], #16    |   |
         0xffffffc00901522c:  ldp     x25, x26, [sp], #16    |   |
         0xffffffc009015230:  ldp     x21, x22, [sp], #16    |   |
         0xffffffc009015234:  ldp     x19, x20, [sp], #16    |   |
         0xffffffc009015238:  ldp     x29, x30, [sp], #16    |   |
         0xffffffc00901523c:  add     x0, x7, #0x0           |   |
         0xffffffc009015240:  autiasp                        |   |
         0xffffffc009015244:  ret                            |   |
                                                             |   |
      int subprog_tail(struct __sk_buff *skb):               |   |
         0xffffffc00902d25c:  paciasp <----------------------'   |
         0xffffffc00902d260:  stp     x29, x30, [sp, #-16]!      |
         0xffffffc00902d264:  mov     x29, sp                    |
         0xffffffc00902d268:  stp     x19, x20, [sp, #-16]!      |
         0xffffffc00902d26c:  stp     x21, x22, [sp, #-16]!      |
         0xffffffc00902d270:  stp     x25, x26, [sp, #-16]!      |
         0xffffffc00902d274:  stp     x27, x28, [sp, #-16]!      |
         0xffffffc00902d278:  mov     x25, sp                    |
         0xffffffc00902d27c:  sub     x27, x25, #0x0             |
         0xffffffc00902d280:  sub     sp, sp, #0x10              |    // <- end of prologue, notice:
         0xffffffc00902d284:  add     x19, x0, #0x0              |    //    1) TCC not touched, and
         0xffffffc00902d288:  mov     w0, #0x1                   |    //    2) no tail call jump pad
         0xffffffc00902d28c:  mov     x10, #0xfffffffffffffffc   |
         0xffffffc00902d290:  str     w0, [x25, x10]             |
         0xffffffc00902d294:  mov     x20, #0xffffff80ffffffff   |
         0xffffffc00902d298:  movk    x20, #0xc033, lsl #16      |
         0xffffffc00902d29c:  movk    x20, #0x4e00               |
         0xffffffc00902d2a0:  add     x0, x19, #0x0              |
         0xffffffc00902d2a4:  add     x1, x20, #0x0              |
         0xffffffc00902d2a8:  mov     x2, #0x0                   |
         0xffffffc00902d2ac:  mov     w10, #0x24                 |
         0xffffffc00902d2b0:  ldr     w10, [x1, x10]             |
         0xffffffc00902d2b4:  add     w2, w2, #0x0               |
         0xffffffc00902d2b8:  cmp     w2, w10                    |
         0xffffffc00902d2bc:  b.cs    0xffffffc00902d2f8         |
         0xffffffc00902d2c0:  mov     w10, #0x21                 |
         0xffffffc00902d2c4:  cmp     x26, x10                   |    // TCC >= MAX_TAIL_CALL_CNT?
         0xffffffc00902d2c8:  b.cs    0xffffffc00902d2f8         |
         0xffffffc00902d2cc:  add     x26, x26, #0x1             |    // TCC++
         0xffffffc00902d2d0:  mov     w10, #0x110                |
         0xffffffc00902d2d4:  add     x10, x1, x10               |
         0xffffffc00902d2d8:  lsl     x11, x2, #3                |
         0xffffffc00902d2dc:  ldr     x11, [x10, x11]            |
         0xffffffc00902d2e0:  cbz     x11, 0xffffffc00902d2f8    |
         0xffffffc00902d2e4:  mov     w10, #0x30                 |
         0xffffffc00902d2e8:  ldr     x10, [x11, x10]            |
         0xffffffc00902d2ec:  add     x10, x10, #0x24            |
         0xffffffc00902d2f0:  add     sp, sp, #0x10              |    // <- destroy just current
         0xffffffc00902d2f4:  br      x10 ---------------------. |    //    BPF stack frame
         0xffffffc00902d2f8:  mov     x10, #0xfffffffffffffffc | |    //    before the tail call
         0xffffffc00902d2fc:  ldr     w7, [x25, x10]           | |
         0xffffffc00902d300:  add     sp, sp, #0x10            | |
         0xffffffc00902d304:  ldp     x27, x28, [sp], #16      | |
         0xffffffc00902d308:  ldp     x25, x26, [sp], #16      | |
         0xffffffc00902d30c:  ldp     x21, x22, [sp], #16      | |
         0xffffffc00902d310:  ldp     x19, x20, [sp], #16      | |
         0xffffffc00902d314:  ldp     x29, x30, [sp], #16      | |
         0xffffffc00902d318:  add     x0, x7, #0x0             | |
         0xffffffc00902d31c:  autiasp                          | |
         0xffffffc00902d320:  ret                              | |
                                                               | |
      int classifier_0(struct __sk_buff *skb):                 | |
         0xffffffc008ff5874:  paciasp                          | |
         0xffffffc008ff5878:  stp     x29, x30, [sp, #-16]!    | |
         0xffffffc008ff587c:  mov     x29, sp                  | |
         0xffffffc008ff5880:  stp     x19, x20, [sp, #-16]!    | |
         0xffffffc008ff5884:  stp     x21, x22, [sp, #-16]!    | |
         0xffffffc008ff5888:  stp     x25, x26, [sp, #-16]!    | |
         0xffffffc008ff588c:  stp     x27, x28, [sp, #-16]!    | |
         0xffffffc008ff5890:  mov     x25, sp                  | |
         0xffffffc008ff5894:  mov     x26, #0x0                | |
         0xffffffc008ff5898:  bti     j <----------------------' |
         0xffffffc008ff589c:  sub     x27, x25, #0x0             |
         0xffffffc008ff58a0:  sub     sp, sp, #0x0               |
         0xffffffc008ff58a4:  mov     x0, #0xffffffc0ffffffff    |
         0xffffffc008ff58a8:  movk    x0, #0x8fc, lsl #16        |
         0xffffffc008ff58ac:  movk    x0, #0x6000                |
         0xffffffc008ff58b0:  mov     w1, #0x1                   |
         0xffffffc008ff58b4:  str     w1, [x0]                   |
         0xffffffc008ff58b8:  mov     w7, #0x0                   |
         0xffffffc008ff58bc:  mov     sp, sp                     |
         0xffffffc008ff58c0:  ldp     x27, x28, [sp], #16        |
         0xffffffc008ff58c4:  ldp     x25, x26, [sp], #16        |
         0xffffffc008ff58c8:  ldp     x21, x22, [sp], #16        |
         0xffffffc008ff58cc:  ldp     x19, x20, [sp], #16        |
         0xffffffc008ff58d0:  ldp     x29, x30, [sp], #16        |
         0xffffffc008ff58d4:  add     x0, x7, #0x0               |
         0xffffffc008ff58d8:  autiasp                            |
         0xffffffc008ff58dc:  ret -------------------------------'

      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com
    • bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT · 95acd881
      Tony Ambardar authored
      The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail
      calls for only x86-64. Change the logic to instead rely on a new weak
      function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable
      JIT backend can override.
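
      The weak-function pattern can be sketched in user-space C as follows.
      The attribute spelling is the GCC/Clang one (the kernel uses its __weak
      macro), and the caller is a hypothetical stand-in for the verifier
      check, not the actual verifier code.

```c
#include <stdbool.h>

/*
 * Default definition. Being weak, it is replaced at link time by any
 * strong definition of the same symbol, e.g. one provided by a JIT backend.
 */
__attribute__((weak)) bool bpf_jit_supports_subprog_tailcalls(void)
{
	return false;
}

/* Hypothetical caller: mixing is allowed only if the JIT is used and capable. */
static bool may_mix_subprogs_and_tail_calls(bool jit_requested)
{
	return jit_requested && bpf_jit_supports_subprog_tailcalls();
}
```

      A backend that supports the combination would define a non-weak
      bpf_jit_supports_subprog_tailcalls() returning true in its own
      translation unit; no change to the caller is needed.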
      
      Update the x86-64 eBPF JIT to reflect this.

      Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com>
      [jakub: drop MIPS bits and tweak patch subject]
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com
    • Merge branch 'bpf_loop inlining' · b40b414e
      Alexei Starovoitov authored
      Eduard Zingerman says:
      
      ====================
      
      Hi Everyone,
      
      This is the next iteration of the patch. It includes changes suggested
      by Song, Joanne and Alexei. Please find updated intro message and
      change log below.
      
      This patch implements inlining of calls to bpf_loop helper function
      when bpf_loop's callback is statically known. E.g. the rewrite does
      the following transformation during BPF program processing:
      
        bpf_loop(10, foo, NULL, 0);
      
       ->
      
        for (int i = 0; i < 10; ++i)
          foo(i, NULL);
      
      The transformation leads to measurable latency change for simple
      loops. Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU /
      KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.
      
      The change is split in five parts:
      
      * Update to test_verifier.c to specify expected and unexpected
        instruction sequences. This allows checking BPF program rewrites
        applied by e.g. the do_misc_fixups function.
      
      * Update to test_verifier.c to specify BTF function infos and types
        per test case. This is necessary for tests that load sub-program
        addresses to a variable because of the checks applied by
        check_ld_imm function.
      
      * The update to verifier.c that tracks state of the parameters for
        each bpf_loop call in a program and decides whether it could be
        replaced by a loop.
      
      * A set of test cases for `test_verifier` that use capabilities added
        by the first two patches to verify instructions produced by inlining
        logic.
      
      * Two test cases for `test_prog` to check that possible corner cases
        behave as expected.
      
      Additional details are available in commit messages for each patch.
      
      Changes since v7:
       - Call to `mark_chain_precision` is added in `loop_flag_is_zero` to
         avoid potential issues with state pruning and precision tracking.
       - `flags non-zero` test_verifier test case is updated to have two
         execution paths reaching `bpf_loop` call, one with flags = 0,
         another with flags = 1. Potentially this test case should be able
         to show that call to `mark_chain_precision` is necessary in
         `loop_flag_is_zero` but not at the moment. Please refer to
         discussion for [PATCH bpf-next v7 3/5] for additional details.
       - `stack_depth_extra` computation is updated to guarantee that R6, R7
         and R8 offsets are always aligned on 8 byte boundary.
       - `stack locations for loop vars` test_verifier test case updated to
         show that R6, R7, R8 offsets are indeed aligned when function stack
         depth is not a multiple of 8.
       - I removed Song Liu's ACK from commit message for [PATCH bpf-next v8
         4/5] because I updated the patch. (Please let me know if I had to
         keep the ACK tag).
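
      One way to satisfy the alignment requirement on `stack_depth_extra`
      mentioned above is to pad the current stack depth up to a multiple of
      8 before reserving three 8-byte slots. This is a sketch with
      hypothetical names, not the code from the patch:

```c
#include <assert.h>

#define MODEL_REG_SIZE 8 /* bytes per spilled 64-bit register */

/*
 * Extra stack needed to spill R6, R7 and R8 so that, after padding,
 * the three slots start on an 8-byte boundary.
 */
static unsigned int stack_depth_extra(unsigned int stack_depth)
{
	unsigned int rounded = (stack_depth + 7) / 8 * 8; /* round up to 8 */
	unsigned int padding = rounded - stack_depth;

	assert(padding < 8);
	return 3 * MODEL_REG_SIZE + padding;
}
```

      For a function stack depth of 12 this yields 28 extra bytes, so the
      total of 40 is again a multiple of 8.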
      
      Changes since v6:
       - Return value of the `optimize_bpf_loop` function is no longer
         ignored. This is necessary to properly propagate -ENOMEM error.
      
      Changes since v5:
       - Added function `loop_flag_is_zero` to skip a few checks in
         `update_loop_inline_state` when loop instruction is not fit for
         inline.
      
      Changes since v4:
       - Added missing `static` modifier for `update_loop_inline_state` and
         `inline_bpf_loop` functions.
       - `update_loop_inline_state` updated for better readability.
       - Fields `initialized` and `fit_for_inline` of `struct
         bpf_loop_inline_state` are changed back from `bool` to bitfields.
       - Acks from Song Liu added to comments for patches 1/5, 2/5, 4/5,
         5/5.
      
      Changes since v3:
       - Function `adjust_stack_depth_for_loop_inlining` is replaced by
         function `optimize_bpf_loop`. Function `optimize_bpf_loop` is
         responsible for both stack depth adjustment and call instruction
         replacement.
       - Changes in `do_misc_fixups` are reverted.
       - Changes in `adjust_subprog_starts_after_remove` are reverted and
         function `adjust_loop_inline_subprogno` is removed. This is
         possible because call to `optimize_bpf_loop` is placed before the
         dead code removal in `opt_remove_dead_code` (in contrast to the
         position of `do_misc_fixups` where inlining was done in v3).
       - Field `bpf_insn_aux_data.loop_inline_state` is now a part of
         anonymous union at the start of the `bpf_insn_aux_data`.
       - Data structure `bpf_loop_inline_state` is simplified to use single
         flag field `fit_for_inline` instead of separate fields
         `flags_is_zero` & `callback_is_constant`.
       - Macro definition `BPF_MAX_LOOPS` is moved from
         `include/linux/bpf_verifier.h` to `include/linux/bpf.h` to avoid
         include of `include/linux/bpf_verifier.h` in `bpf_iter.c`.
       - `inline_bpf_loop` changed back to use array initialization and hard
         coded offsets as in v2.
       - Style / formatting updates.
      
      Changes since v2:
       - fix for `stack_check` test case in `test_progs-no_alu32`, all tests
         are passing now;
       - v2 3/3 patch is split in three parts:
         - kernel changes
         - test_verifier changes
         - test_prog changes
       - updated `inline_bpf_loop` in `verifier.c` to calculate each offset
         used in instructions to avoid "magic" numbers;
       - removed newline handling logic in `fail_log` branch of
         `do_single_test` in `test_verifier.c` to simplify the patch set;
       - styling fixes suggested in review for v2 of this patch set.
      
      Changes since v1:
       - allow using SKIP_INSNS in instruction pattern specification in
         test_verifier tests;
       - fix for a bug in spill offset assignment for loop vars when
         bpf_loop is located in a non-main function.
      ====================

      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: BPF test_prog selftests for bpf_loop inlining · 0e1bf9ed
      Eduard Zingerman authored
      Two new test BPF programs for test_prog selftests checking bpf_loop
      behavior. Both are corner cases for the bpf_loop inlining transformation:
       - check that bpf_loop behaves correctly when callback function is not
         a compile time constant
       - check that local function variables are not affected by allocating
         additional stack storage for registers spilled by loop inlining

      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/r/20220620235344.569325-6-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: BPF test_verifier selftests for bpf_loop inlining · f8acfdd0
      Eduard Zingerman authored
      A number of test cases for BPF selftests test_verifier to check how
      bpf_loop inline transformation rewrites the BPF program. The following
      cases are covered:
       - happy path
       - no-rewrite when flags is non-zero
       - no-rewrite when callback is non-constant
       - subprogno in insn_aux is updated correctly when dead sub-programs
         are removed
       - check that correct stack offsets are assigned for spilling of R6-R8
         registers

      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/r/20220620235344.569325-5-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Inline calls to bpf_loop when callback is known · 1ade2371
      Eduard Zingerman authored
      Calls to `bpf_loop` are replaced with direct loops to avoid
      indirection. E.g. the following:
      
        bpf_loop(10, foo, NULL, 0);
      
      is replaced by the equivalent of the following:
      
        for (int i = 0; i < 10; ++i)
          foo(i, NULL);
      
      This transformation can be applied when:
      - the callback is known and does not change during program execution;
      - the flags passed to `bpf_loop` are always zero.
      
      The inlining logic works as follows:
      
      - During execution simulation, the function `update_loop_inline_state`
        tracks the following information for each `bpf_loop` call
        instruction:
        - is the callback known and constant?
        - are the flags constant and zero?
      - The function `optimize_bpf_loop` increases the stack depth for
        functions where `bpf_loop` calls can be inlined and invokes
        `inline_bpf_loop` to apply the inlining. The additional stack space
        is used to spill registers R6, R7 and R8. These registers are used as
        the loop counter, the loop maximal bound and the callback context
        parameter.
      
      Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on
      i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.
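
The effect of the transformation can be illustrated with a small user-space model. This is only a sketch, not the verifier's implementation: `model_bpf_loop`, `sum_cb` and `inlined_sum_loop` are hypothetical names introduced here, and the model relies on the documented helper behavior that a non-zero callback return value stops the loop and that the helper returns the number of iterations performed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Callback shape used by bpf_loop: return 0 to continue, non-zero to stop. */
typedef long (*loop_cb_t)(uint32_t index, void *ctx);

/* Hypothetical model of the helper: one indirect call per iteration. */
static long model_bpf_loop(uint32_t nr_loops, loop_cb_t cb, void *ctx,
			   uint64_t flags)
{
	uint32_t i;

	if (flags)
		return -EINVAL; /* non-zero flags are rejected */
	for (i = 0; i < nr_loops; i++)
		if (cb(i, ctx))
			return (long)i + 1; /* stopped early by the callback */
	return (long)i;
}

/* Example callback: accumulate the index, stop once index reaches 6. */
static long sum_cb(uint32_t index, void *ctx)
{
	*(long *)ctx += index;
	return index >= 6;
}

/*
 * Hypothetical shape of the inlined form for a known callback: a direct
 * call in a plain loop. The three locals play the roles the commit
 * assigns to R6, R7 and R8 (counter, maximal bound, callback context).
 */
static long inlined_sum_loop(uint32_t nr_loops, void *ctx)
{
	uint32_t counter = 0;
	uint32_t bound = nr_loops;
	void *cb_ctx = ctx;

	while (counter < bound) {
		long ret = sum_cb(counter, cb_ctx);

		counter++;
		if (ret)
			break;
	}
	return (long)counter;
}
```

Calling `model_bpf_loop(10, sum_cb, &acc, 0)` and `inlined_sum_loop(10, &acc)` on separate accumulators should leave both accumulators equal and return the same iteration count; the only difference is that the second form avoids the indirect call.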
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1ade2371
    • Eduard Zingerman's avatar
      selftests/bpf: allow BTF specs and func infos in test_verifier tests · 7a42008c
      Eduard Zingerman authored
      The BTF and func_info specification for test_verifier tests follows
      the same notation as in prog_tests/btf.c tests. E.g.:
      
        ...
        .func_info = { { 0, 6 }, { 8, 7 } },
        .func_info_cnt = 2,
        .btf_strings = "\0int\0",
        .btf_types = {
          BTF_TYPE_INT_ENC(1, BTF_INT_SIGNED, 0, 32, 4),
          BTF_PTR_ENC(1),
        },
        ...
      
      The BTF specification is loaded only when specified.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/r/20220620235344.569325-3-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7a42008c
    • Eduard Zingerman's avatar
      selftests/bpf: specify expected instructions in test_verifier tests · 933ff531
      Eduard Zingerman authored
      Allow specifying expected and unexpected instruction sequences in
      test_verifier test cases. The instructions are requested from the
      kernel after the BPF program is loaded, which makes it possible to
      check some of the transformations applied by the BPF verifier.
      
      - `expected_insns` field specifies a sequence of instructions expected
        to be found in the program;
      - `unexpected_insns` field specifies a sequence of instructions that
        are not expected to be found in the program;
      - `INSN_OFF_MASK` and `INSN_IMM_MASK` values can be used to mask
        `off` and `imm` fields;
      - `SKIP_INSNS` can be used to specify that some instructions in the
        (un)expected pattern are not important (behavior similar to usage of
        `\t` in `errstr` field).
      
      The intended usage is as follows:
      
        {
      	"inline simple bpf_loop call",
      	.insns = {
      	/* main */
      	BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
      	BPF_RAW_INSN(BPF_LD | BPF_IMM | BPF_DW, BPF_REG_2,
      			BPF_PSEUDO_FUNC, 0, 6),
          ...
      	BPF_EXIT_INSN(),
      	/* callback */
      	BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 1),
      	BPF_EXIT_INSN(),
      	},
      	.expected_insns = {
      	BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
      	SKIP_INSNS(),
      	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_CALL, 8, 1)
      	},
      	.unexpected_insns = {
      	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0,
      			INSN_OFF_MASK, INSN_IMM_MASK),
      	},
      	.prog_type = BPF_PROG_TYPE_TRACEPOINT,
      	.result = ACCEPT,
      	.runs = 0,
        },
      
      Here, the move of 1 to register 1 is expected to remain in place, and
      the helper function call instruction is expected to be replaced by a
      relative call instruction.
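
The matching rules described above (masked `off`/`imm` fields and a skip marker between fragments) can be sketched in plain C. This is an illustrative model only, not the test_verifier code: `struct toy_insn`, `TOY_OFF_ANY`, `TOY_IMM_ANY`, the `skip` flag and `toy_find_pattern` are all hypothetical stand-ins for `struct bpf_insn`, `INSN_OFF_MASK`, `INSN_IMM_MASK` and `SKIP_INSNS`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wildcard values, standing in for the off/imm masks. */
#define TOY_OFF_ANY INT16_MIN
#define TOY_IMM_ANY INT32_MIN

/* Simplified instruction; `skip` marks a "some instructions here" gap. */
struct toy_insn {
	uint8_t code;
	int16_t off;
	int32_t imm;
	bool skip;
};

/* Compare one instruction, ignoring fields the pattern marks as wildcards. */
static bool toy_insn_eq(const struct toy_insn *got, const struct toy_insn *want)
{
	if (got->code != want->code)
		return false;
	if (want->off != TOY_OFF_ANY && got->off != want->off)
		return false;
	if (want->imm != TOY_IMM_ANY && got->imm != want->imm)
		return false;
	return true;
}

/* Does the contiguous fragment pat[0..len) match prog starting at pos? */
static bool toy_chunk_at(const struct toy_insn *prog, size_t prog_len,
			 size_t pos, const struct toy_insn *pat, size_t len)
{
	if (pos + len > prog_len)
		return false;
	for (size_t i = 0; i < len; i++)
		if (!toy_insn_eq(&prog[pos + i], &pat[i]))
			return false;
	return true;
}

/*
 * Search prog for the pattern. Fragments separated by skip markers must
 * each match contiguously and appear in order; anything may sit between.
 */
static bool toy_find_pattern(const struct toy_insn *prog, size_t prog_len,
			     const struct toy_insn *pat, size_t pat_len)
{
	size_t pos = 0, p = 0;

	while (p < pat_len) {
		size_t len = 0;
		bool found = false;

		if (pat[p].skip) {
			p++;
			continue;
		}
		/* Length of the fragment up to the next skip marker. */
		while (p + len < pat_len && !pat[p + len].skip)
			len++;
		/* Leftmost match at or after the previous fragment's end. */
		for (; pos + len <= prog_len; pos++) {
			if (toy_chunk_at(prog, prog_len, pos, pat + p, len)) {
				found = true;
				break;
			}
		}
		if (!found)
			return false;
		pos += len;
		p += len;
	}
	return true;
}
```

Under this model, an "expected" check passes when `toy_find_pattern` returns true and an "unexpected" check passes when it returns false for the same program.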
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/r/20220620235344.569325-2-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      933ff531
    • Delyan Kratunov's avatar
      uprobe: gate bpf call behind BPF_EVENTS · aca80dd9
      Delyan Kratunov authored
      The call into bpf from uprobes needs to be gated now that it doesn't use
      the trace_events.h helpers.
      
      Randy found this as a randconfig build failure on linux-next [1].
      
        [1]: https://lore.kernel.org/linux-next/2de99180-7d55-2fdf-134d-33198c27cc58@infradead.org/
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Delyan Kratunov <delyank@fb.com>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Link: https://lore.kernel.org/r/cb8bfbbcde87ed5d811227a393ef4925f2aadb7b.camel@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      aca80dd9
  4. 20 Jun, 2022 23 commits
  5. 19 Jun, 2022 5 commits