1. 29 Jun, 2022 6 commits
    • bpf: expose bpf_{g,s}etsockopt to lsm cgroup · 9113d7e4
      Stanislav Fomichev authored
      I don't see how to make it nice without introducing btf id lists
      for the hooks where these helpers are allowed. Some LSM hooks
      work on locked sockets, some trigger early and don't grab any
      locks, so have two lists for now:
      
      1. LSM hooks which trigger under socket lock - minority of the hooks,
         but ideal case for us, we can expose existing BTF-based helpers
      2. LSM hooks which trigger without socket lock, but they trigger
         early in the socket creation path where it should be safe to
         do setsockopt without any locks
      3. The rest are prohibited. I'm thinking that this use-case might
         be a good gateway to sleeping lsm cgroup hooks in the future.
         We can either expose lock/unlock operations (and add tracking
         to the verifier) or have another set of bpf_setsockopt
         wrappers that grab the locks and might sleep.
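      
      For illustration, a minimal sketch of a case (2) caller. The hook name,
      option and return convention below are illustrative rather than taken from
      this patch, and SEC("lsm_cgroup/...") relies on the libbpf support added
      elsewhere in this series:
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>
      
      /* values from asm-generic/socket.h; not provided by vmlinux.h */
      #define SOL_SOCKET	1
      #define SO_KEEPALIVE	9
      
      char _license[] SEC("license") = "GPL";
      
      /* socket_post_create fires early in socket creation (case 2 above),
       * so calling bpf_setsockopt() without the socket lock is allowed
       */
      SEC("lsm_cgroup/socket_post_create")
      int BPF_PROG(sock_post_create, struct socket *sock, int family,
      	     int type, int protocol, int kern)
      {
      	int one = 1;
      
      	bpf_setsockopt(sock->sk, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one));
      	return 1;	/* 1 == allow for lsm cgroup programs */
      }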
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-7-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP · b79c9fc9
      Stanislav Fomichev authored
      We have two options:
      1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
      2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point
      
      I was doing (2) in the original patch, but switching to (1) here:
      
      * bpf_prog_query returns all attached BPF_LSM_CGROUP programs
      regardless of attach_btf_id
      * attach_btf_id is exported via bpf_prog_info
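      
      A minimal userspace sketch of option (1), assuming libbpf and an already
      open cgroup fd; the buffer size and error handling are illustrative:
      
      #include <stdio.h>
      #include <unistd.h>
      #include <bpf/bpf.h>
      
      static void query_lsm_cgroup(int cgroup_fd)
      {
      	__u32 prog_ids[64], prog_cnt = 64, i;
      
      	/* one query returns every attached BPF_LSM_CGROUP program,
      	 * regardless of attach_btf_id
      	 */
      	if (bpf_prog_query(cgroup_fd, BPF_LSM_CGROUP, 0, NULL,
      			   prog_ids, &prog_cnt))
      		return;
      
      	for (i = 0; i < prog_cnt; i++) {
      		struct bpf_prog_info info = {};
      		__u32 len = sizeof(info);
      		int fd = bpf_prog_get_fd_by_id(prog_ids[i]);
      
      		if (fd < 0)
      			continue;
      		if (!bpf_obj_get_info_by_fd(fd, &info, &len))
      			/* attach_btf_id (exported here) identifies the lsm hook */
      			printf("prog %u: attach_btf_id %u\n",
      			       prog_ids[i], info.attach_btf_id);
      		close(fd);
      	}
      }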
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: minimize number of allocated lsm slots per program · c0e19f2c
      Stanislav Fomichev authored
      The previous patch adds a 1:1 mapping between all 211 LSM hooks
      and the bpf_cgroup program array. Instead of reserving a slot per
      possible hook, reserve 10 slots per cgroup for lsm programs.
      Those slots are dynamically allocated on demand and reclaimed.
      
      struct cgroup_bpf {
      	struct bpf_prog_array *    effective[33];        /*     0   264 */
      	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
      	struct hlist_head          progs[33];            /*   264   264 */
      	/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
      	u8                         flags[33];            /*   528    33 */
      
      	/* XXX 7 bytes hole, try to pack */
      
      	struct list_head           storages;             /*   568    16 */
      	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
      	struct bpf_prog_array *    inactive;             /*   584     8 */
      	struct percpu_ref          refcnt;               /*   592    16 */
      	struct work_struct         release_work;         /*   608    72 */
      
      	/* size: 680, cachelines: 11, members: 7 */
      	/* sum members: 673, holes: 1, sum holes: 7 */
      	/* last cacheline: 40 bytes */
      };
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: per-cgroup lsm flavor · 69fd337a
      Stanislav Fomichev authored
      Allow attaching to lsm hooks in the cgroup context.
      
      Attaching to a per-cgroup LSM hook works exactly like attaching
      to other per-cgroup hooks. A new BPF_LSM_CGROUP attach type is
      added to trigger the new mode; the actual lsm hook we attach to
      is signaled via the existing attach_btf_id.
      
      For the hooks that have 'struct socket' or 'struct sock' as their first
      argument, we use the cgroup associated with that socket. For the rest,
      we use the 'current' cgroup (this is all on the default hierarchy == v2 only).
      Note that for some hooks that work on 'struct sock' we still
      take the cgroup from 'current' because some of them work on a socket
      that hasn't been properly initialized yet.
      
      Behind the scenes, we allocate a shim program that is attached
      to the trampoline and runs cgroup effective BPF programs array.
      This shim has some rudimentary ref counting and can be shared
      between several programs attaching to the same lsm hook from
      different cgroups.
      
      Note that this patch bloats the cgroup size because we add 211
      cgroup_bpf_attach_type(s) for simplicity's sake. This will be
      addressed in a subsequent patch.
      
      Also note that we only add the non-sleepable flavor for now. To enable
      sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
      shim programs have to be freed via trace rcu, cgroup_bpf.effective
      should also be trace-rcu-managed, plus maybe some other changes
      that I'm not aware of.
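      
      A sketch of the userspace attach flow, with assumptions: the skeleton and
      program names are hypothetical, and SEC("lsm_cgroup/...") plus cgroup link
      support come from other patches in this series; error handling is simplified:
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <bpf/libbpf.h>
      #include "lsm_cgroup_example.skel.h"	/* hypothetical skeleton */
      
      int attach_to_cgroup(const char *cgroup_path)
      {
      	struct lsm_cgroup_example *skel;
      	struct bpf_link *link;
      	int cgroup_fd;
      
      	cgroup_fd = open(cgroup_path, O_RDONLY);
      	if (cgroup_fd < 0)
      		return -1;
      
      	skel = lsm_cgroup_example__open_and_load();
      	if (!skel)
      		return -1;
      
      	/* attach type is BPF_LSM_CGROUP; the lsm hook itself comes from
      	 * the program's attach_btf_id, derived from its SEC() name
      	 */
      	link = bpf_program__attach_cgroup(skel->progs.socket_post_create,
      					  cgroup_fd);
      	if (libbpf_get_error(link)) {
      		fprintf(stderr, "attach failed\n");
      		return -1;
      	}
      	return 0;
      }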
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: convert cgroup_bpf.progs to hlist · 00442143
      Stanislav Fomichev authored
      This lets us reclaim some space to be used by new cgroup lsm slots.
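      
      The saving comes from hlist_head being a single pointer while list_head
      is two (definitions as in include/linux/types.h; sizes assume 64-bit):
      
      struct list_head {
      	struct list_head *next, *prev;		/* 16 bytes */
      };
      
      struct hlist_node {
      	struct hlist_node *next, **pprev;
      };
      
      struct hlist_head {
      	struct hlist_node *first;		/* 8 bytes */
      };
      
      /* progs[23]: 23 * 16 = 368 bytes  ->  23 * 8 = 184 bytes */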
      
      Before:
      struct cgroup_bpf {
      	struct bpf_prog_array *    effective[23];        /*     0   184 */
      	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
      	struct list_head           progs[23];            /*   184   368 */
      	/* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
      	u32                        flags[23];            /*   552    92 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
      	struct list_head           storages;             /*   648    16 */
      	struct bpf_prog_array *    inactive;             /*   664     8 */
      	struct percpu_ref          refcnt;               /*   672    16 */
      	struct work_struct         release_work;         /*   688    32 */
      
      	/* size: 720, cachelines: 12, members: 7 */
      	/* sum members: 716, holes: 1, sum holes: 4 */
      	/* last cacheline: 16 bytes */
      };
      
      After:
      struct cgroup_bpf {
      	struct bpf_prog_array *    effective[23];        /*     0   184 */
      	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
      	struct hlist_head          progs[23];            /*   184   184 */
      	/* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
      	u8                         flags[23];            /*   368    23 */
      
      	/* XXX 1 byte hole, try to pack */
      
      	/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
      	struct list_head           storages;             /*   392    16 */
      	struct bpf_prog_array *    inactive;             /*   408     8 */
      	struct percpu_ref          refcnt;               /*   416    16 */
      	struct work_struct         release_work;         /*   432    72 */
      
      	/* size: 504, cachelines: 8, members: 7 */
      	/* sum members: 503, holes: 1, sum holes: 1 */
      	/* last cacheline: 56 bytes */
      };
      Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add bpf_func_t and trampoline helpers · af3f4134
      Stanislav Fomichev authored
      I'll be adding lsm cgroup specific helpers that grab the
      trampoline mutex.
      
      No functional changes.
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20220628174314.1216643-2-sdf@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 28 Jun, 2022 17 commits
  3. 24 Jun, 2022 7 commits
  4. 23 Jun, 2022 9 commits
    • selftests/bpf: Fix rare segfault in sock_fields prog test · 6dc7a0ba
      Jörn-Thorben Hinz authored
      test_sock_fields__detach() got called with a null pointer here when one
      of the CHECKs or ASSERTs up to the test_sock_fields__open_and_load()
      call resulted in a jump to the "done" label.
      
      A skeleton's *__detach() is not safe to call with a null pointer, though.
      This led to a segfault.
      
      Go the easy route and only call test_sock_fields__destroy() which is
      null-pointer safe and includes detaching.
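      
      A sketch of the resulting cleanup pattern (not the exact test code; the
      cgroup setup step is illustrative and the selftest harness plus generated
      skeleton header are assumed):
      
      void test_sock_fields(void)
      {
      	struct test_sock_fields *skel = NULL;
      	int cgroup_fd;
      
      	cgroup_fd = test__join_cgroup("/test_sock_fields");
      	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
      		goto done;	/* skel is still NULL here */
      
      	skel = test_sock_fields__open_and_load();
      	if (!ASSERT_OK_PTR(skel, "open_and_load"))
      		goto done;
      
      	/* ... the actual checks ... */
      done:
      	/* destroy() is null-pointer safe and detaches internally, so no
      	 * separate test_sock_fields__detach() call is needed here
      	 */
      	test_sock_fields__destroy(skel);
      	if (cgroup_fd >= 0)
      		close(cgroup_fd);
      }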
      
      Came across this while looking[1] to introduce the usage of
      bpf_tcp_helpers.h (included in progs/test_sock_fields.c) together with
      vmlinux.h.
      
      [1] https://lore.kernel.org/bpf/629bc069dd807d7ac646f836e9dca28bbc1108e2.camel@mailbox.tu-berlin.de/
      
      Fixes: 8f50f16f ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads")
      Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220621070116.307221-1-jthinz@mailbox.tu-berlin.de
    • Merge branch 'Align BPF TCP CCs implementing cong_control() with non-BPF CCs' · bb7a4257
      Alexei Starovoitov authored
      Jörn-Thorben Hinz says:
      
      ====================
      
      This series corrects some inconveniences for a BPF TCP CC that
      implements and uses tcp_congestion_ops.cong_control(). Until now, such a
      CC did not have all necessary write access to struct sock and
      unnecessarily needed to implement cong_avoid().
      
      v4:
       - Remove braces around single statements after if
       - Don’t check pointer passed to bpf_link__destroy()
      v3:
       - Add a selftest writing sk_pacing_*
       - Add a selftest with incomplete tcp_congestion_ops
       - Add a selftest with unsupported get_info()
       - Remove an unused variable
       - Reword a comment about reg() in bpf_struct_ops_map_update_elem()
      v2:
       - Drop redundant check for required functions and just rely on
         tcp_register_congestion_control() (Martin KaFai Lau)
      ====================
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test a BPF CC implementing the unsupported get_info() · f14a3f64
      Jörn-Thorben Hinz authored
      Test whether a TCP CC implemented in BPF that provides get_info() is
      rejected correctly. get_info() is unsupported in a BPF CC. The check for
      required functions in a BPF CC has moved; this test ensures unsupported
      functions are still rejected correctly.
      Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/r/20220622191227.898118-6-jthinz@mailbox.tu-berlin.de
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test an incomplete BPF CC · 0735627d
      Jörn-Thorben Hinz authored
      Test whether a TCP CC implemented in BPF providing neither cong_avoid()
      nor cong_control() is correctly rejected. This check solely depends on
      tcp_register_congestion_control() now, which is invoked during
      bpf_map__attach_struct_ops().
      Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
      Link: https://lore.kernel.org/r/20220622191227.898118-5-jthinz@mailbox.tu-berlin.de
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test a BPF CC writing sk_pacing_* · 6e945d57
      Jörn-Thorben Hinz authored
      Test whether a TCP CC implemented in BPF is allowed to write
      sk_pacing_rate and sk_pacing_status in struct sock. This is needed when
      cong_control() is implemented and used.
      Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
      Link: https://lore.kernel.org/r/20220622191227.898118-4-jthinz@mailbox.tu-berlin.de
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Require only one of cong_avoid() and cong_control() from a TCP CC · 9f0265e9
      Jörn-Thorben Hinz authored
      Remove the check for required and optional functions in a struct
      tcp_congestion_ops from bpf_tcp_ca.c. Rely on
      tcp_register_congestion_control() to reject a BPF CC that does not
      implement all required functions, as it will do for a non-BPF CC.
      
      When a CC implements tcp_congestion_ops.cong_control(), the alternate
      cong_avoid() is not in use in the TCP stack. Previously, a BPF CC was
      still forced to implement cong_avoid() as a no-op since it was
      non-optional in bpf_tcp_ca.c.
      Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/r/20220622191227.898118-3-jthinz@mailbox.tu-berlin.de
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Allow a TCP CC to write sk_pacing_rate and sk_pacing_status · 41c95dd6
      Jörn-Thorben Hinz authored
      A CC that implements tcp_congestion_ops.cong_control() should be able to
      control sk_pacing_rate and set sk_pacing_status, since
      tcp_update_pacing_rate() is never called in this case. A built-in CC or
      one from a kernel module is already able to write to both struct sock
      members. For a BPF program, write access has not been allowed yet.
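      
      A minimal sketch of what this enables on the BPF side; the CC and program
      names are illustrative, and a complete tcp_congestion_ops still needs the
      SEC(".struct_ops") map plus the other mandatory ops (ssthresh, undo_cwnd),
      omitted here:
      
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>
      
      char _license[] SEC("license") = "GPL";
      
      SEC("struct_ops/example_cong_control")
      void BPF_PROG(example_cong_control, struct sock *sk,
      	      const struct rate_sample *rs)
      {
      	/* with this change, a BPF cong_control() may write both fields,
      	 * just like a built-in or module CC that owns pacing
      	 */
      	sk->sk_pacing_status = SK_PACING_NEEDED;
      	sk->sk_pacing_rate = 1000000;	/* placeholder rate, bytes/sec */
      }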
      Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de>
      Reviewed-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/r/20220622191227.898118-2-jthinz@mailbox.tu-berlin.de
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • test_bpf: fix incorrect netdev features · 9676fecc
      Jian Shen authored
      The type of .features is netdev_features_t, so it should use
      NETIF_F_LLTX and NETIF_F_HW_VLAN_STAG_TX, not NETIF_F_LLTX_BIT
      and NETIF_F_HW_VLAN_STAG_TX_BIT.
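      
      A sketch of the distinction (kernel code; the helper function is
      illustrative):
      
      #include <linux/netdevice.h>
      
      static void set_test_features(struct net_device *dev)
      {
      	/* NETIF_F_LLTX == ((netdev_features_t)1 << NETIF_F_LLTX_BIT);
      	 * .features expects the mask, not the bit index
      	 */
      	dev->features |= NETIF_F_LLTX | NETIF_F_HW_VLAN_STAG_TX;
      
      	/* wrong: this ORs in the numeric bit positions instead of masks
      	 * dev->features |= NETIF_F_LLTX_BIT | NETIF_F_HW_VLAN_STAG_TX_BIT;
      	 */
      }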
      
      Fixes: cf204a71 ("bpf, testing: Introduce 'gso_linear_no_head_frag' skb_segment test")
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/r/20220622135002.8263-1-shenjian15@huawei.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add benchmark for local_storage get · 73087489
      Dave Marchevsky authored
      Add benchmarks to demonstrate the performance cliff for local_storage
      get as the number of local_storage maps increases beyond the current
      local_storage implementation's cache size.
      
      "sequential get" and "interleaved get" benchmarks are added, both of
      which do many bpf_task_storage_get calls on sets of task local_storage
      maps of various counts, while considering a single specific map to be
      'important' and counting task_storage_gets to the important map
      separately in addition to normal 'hits' count of all gets. Goal here is
      to mimic scenario where a particular program using one map - the
      important one - is running on a system where many other local_storage
      maps exist and are accessed often.
      
      While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
      ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
      bpf_task_storage_gets for the important map for every 10 map gets. This
      is meant to highlight performance differences when important map is
      accessed far more frequently than non-important maps.
      
      A "hashmap control" benchmark is also included for easy comparison of
      standard bpf hashmap lookup vs local_storage get. The benchmark is
      similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
      instead of local storage. Only one inner map is created - a hashmap
      meant to hold tid -> data mapping for all tasks. Size of the hashmap is
      hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these
      keys which are actually fetched as part of the benchmark is
      configurable.
      
      Addition of this benchmark is inspired by conversation with Alexei in a
      previous patchset's thread [0], which highlighted the need for such a
      benchmark to motivate and validate improvements to local_storage
      implementation. My approach in that series focused on improving
      performance for explicitly-marked 'important' maps and was rejected
      with feedback to make more generally-applicable improvements while
      avoiding explicitly marking maps as important. Thus the benchmark
      reports both general and important-map-focused metrics, so the effect
      of future work on both is clear.
      
      Regarding the benchmark results: on a powerful system (Skylake, 20
      cores, 256 GB RAM):
      
      Hashmap Control
      ===============
              num keys: 10
      hashmap (control) sequential    get:  hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
      
              num keys: 1000
      hashmap (control) sequential    get:  hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
      
              num keys: 10000
      hashmap (control) sequential    get:  hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
      
              num keys: 100000
      hashmap (control) sequential    get:  hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
      
              num keys: 4194304
      hashmap (control) sequential    get:  hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s
      
      Local Storage
      =============
              num_maps: 1
      local_storage cache sequential  get:  hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
      local_storage cache interleaved get:  hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
      
              num_maps: 10
      local_storage cache sequential  get:  hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
      local_storage cache interleaved get:  hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
      
              num_maps: 16
      local_storage cache sequential  get:  hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
      local_storage cache interleaved get:  hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
      
              num_maps: 17
      local_storage cache sequential  get:  hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
      local_storage cache interleaved get:  hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
      
              num_maps: 24
      local_storage cache sequential  get:  hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
      local_storage cache interleaved get:  hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
      
              num_maps: 32
      local_storage cache sequential  get:  hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
      local_storage cache interleaved get:  hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
      
              num_maps: 100
      local_storage cache sequential  get:  hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
      local_storage cache interleaved get:  hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
      
              num_maps: 1000
      local_storage cache sequential  get:  hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
      local_storage cache interleaved get:  hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s
      
      Looking at the "sequential get" results, it's clear that as the
      number of task local_storage maps grows beyond the current cache size
      (16), there's a significant reduction in hits throughput. Note that the
      current local_storage implementation assigns a cache_idx to maps as they
      are created. Since "sequential get" creates maps 0..n in order and
      then does bpf_task_storage_get calls in the same order, the benchmark
      effectively ensures that a map will not be in cache when the program
      tries to access it.
      
      For "interleaved get" results, important-map hits throughput is greatly
      increased as the important map is more likely to be in cache by virtue
      of being accessed far more frequently. Throughput still decreases as the
      number of maps increases, though.
      
      To get a sense of the overhead of the benchmark program, I
      commented out bpf_task_storage_get/bpf_map_lookup_elem in
      local_storage_bench.c and ran the benchmark on the same host as the
      'real' run. Results:
      
      Hashmap Control
      ===============
              num keys: 10
      hashmap (control) sequential    get:  hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s
      
              num keys: 1000
      hashmap (control) sequential    get:  hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s
      
              num keys: 10000
      hashmap (control) sequential    get:  hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s
      
              num keys: 100000
      hashmap (control) sequential    get:  hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s
      
              num keys: 4194304
      hashmap (control) sequential    get:  hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s
      
      Local Storage
      =============
              num_maps: 1
      local_storage cache sequential  get:  hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
      local_storage cache interleaved get:  hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s
      
              num_maps: 10
      local_storage cache sequential  get:  hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
      local_storage cache interleaved get:  hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s
      
              num_maps: 16
      local_storage cache sequential  get:  hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
      local_storage cache interleaved get:  hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s
      
              num_maps: 17
      local_storage cache sequential  get:  hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
      local_storage cache interleaved get:  hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s
      
              num_maps: 24
      local_storage cache sequential  get:  hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
      local_storage cache interleaved get:  hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s
      
              num_maps: 32
      local_storage cache sequential  get:  hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
      local_storage cache interleaved get:  hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s
      
              num_maps: 100
      local_storage cache sequential  get:  hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
      local_storage cache interleaved get:  hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s
      
              num_maps: 1000
      local_storage cache sequential  get:  hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
      local_storage cache interleaved get:  hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s
      
      Adjusting for overhead (subtracting the no-lookup run's latency from the
      real run's), latency numbers for "hashmap control" and "sequential get" are:
      
      hashmap_control_1k:   ~53.8ns
      hashmap_control_10k:  ~124.2ns
      hashmap_control_100k: ~206.5ns
      sequential_get_1:     ~12.1ns
      sequential_get_10:    ~16.0ns
      sequential_get_16:    ~13.8ns
      sequential_get_17:    ~16.8ns
      sequential_get_24:    ~40.9ns
      sequential_get_32:    ~65.2ns
      sequential_get_100:   ~148.2ns
      sequential_get_1000:  ~2204ns
      
      Clearly demonstrating a cliff.
      
      In the discussion for v1 of this patch, Alexei noted that local_storage
      was 2.5x faster than a large hashmap when initially implemented [1]. The
      benchmark results show that local_storage is 5-10x faster: a
      long-running BPF application putting some pid-specific info into a
      hashmap for each pid it sees will probably see on the order of 10-100k
      pids. Bench numbers for hashmaps of this size are ~10x slower than
      sequential_get_16, but as the number of local_storage maps grows far
      past local_storage cache size the performance advantage shrinks and
      eventually reverses.
      
      When running the benchmarks it may be necessary to bump 'open files'
      ulimit for a successful run.
      
        [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
        [1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  5. 22 Jun, 2022 1 commit