    selftests/bpf: Add benchmark runner infrastructure · 8e7c2a02
    Andrii Nakryiko authored
    While working on the BPF ringbuf implementation, testing, and benchmarking,
    I developed a fairly generic and modular benchmark runner. It seems broadly
    useful, as I've already used it for one more purpose (finding the fastest
    way to trigger a BPF program, to minimize the overhead of in-kernel code).
    
    This patch adds the generic part of the benchmark runner and sets up the
    Makefile for extending it with more sets of benchmarks.
    
    The benchmarker itself operates by spinning up the specified number of
    producer and consumer threads and setting up an interval timer that sends a
    SIGALRM signal to the application once a second. Every second, a snapshot
    of the hits/drops counters is collected and stored in an array. Drops are
    useful for producer/consumer benchmarks in which producers might overwhelm
    consumers.
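
    The once-a-second sampling described above can be sketched roughly as
    follows. This is a minimal illustration, not the code from this patch: the
    names hits, drops, samples, on_alarm, and start_sampling are hypothetical,
    and it assumes atomic_long is lock-free on the target so that it is safe to
    touch from a signal handler. Each SIGALRM takes the counters accumulated
    since the previous tick and appends them to a fixed-size array.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/time.h>

#define MAX_SAMPLES 64			/* hypothetical cap on stored snapshots */

struct sample {
	long hits;
	long drops;
};

atomic_long hits, drops;		/* bumped by producer/consumer threads */
struct sample samples[MAX_SAMPLES];
volatile sig_atomic_t nr_samples;

/* SIGALRM handler: store the counts since the last tick and reset them. */
static void on_alarm(int sig)
{
	int i = nr_samples;

	(void)sig;
	if (i >= MAX_SAMPLES)
		return;
	samples[i].hits = atomic_exchange(&hits, 0);
	samples[i].drops = atomic_exchange(&drops, 0);
	nr_samples = i + 1;
}

/* Arm a repeating ITIMER_REAL timer; returns 0 on success, -1 on error. */
int start_sampling(long interval_usec)
{
	struct sigaction sa;
	struct itimerval it;

	assert(interval_usec > 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sa.sa_flags = SA_RESTART;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, NULL))
		return -1;

	it.it_interval.tv_sec = interval_usec / 1000000;
	it.it_interval.tv_usec = interval_usec % 1000000;
	it.it_value = it.it_interval;
	return setitimer(ITIMER_REAL, &it, NULL);
}
```

    Calling start_sampling(1000000) would give the one-second cadence described
    in this commit message; a shorter interval is handy for quick experiments.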
    
    Once the test finishes after the given number of warm-up and testing
    seconds, the mean and stddev are calculated (ignoring warm-up results) and
    printed to stdout. This setup seems to give consistent and accurate results.
    
    To validate behavior, I added two atomic counting tests: global and local.
    For the global one, all the producer threads atomically increment the same
    counter as fast as possible. This, of course, leads to a huge drop in
    performance once there is more than one producer thread, due to CPUs
    fighting for the same memory location.
    
    Local counting, on the other hand, maintains one counter per producer
    thread, incremented independently. Once per second, all counters are read
    and added together to form the final "counting throughput" measurement. As
    expected, such a setup demonstrates linear scalability with the number of
    producers (as long as there are enough physical CPU cores, of course). See
    example output below. Also, this setup can nicely demonstrate the
    disastrous effects of false sharing, if care is not taken to separate those
    per-producer counters into independent cache lines.
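
    The difference between the two counting schemes can be sketched as below.
    This is an illustration under stated assumptions rather than the code in
    this patch: all names are hypothetical, a 64-byte cache line is assumed,
    and the producer bodies run a fixed number of iterations instead of
    spinning until the benchmark ends. Aligning each per-producer counter to
    the cache line size is what keeps neighbouring counters from sharing a
    line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE	64	/* assumption: typical x86-64 line size */
#define MAX_PRODUCERS	8	/* hypothetical upper bound */

/* Global scheme: every producer hammers the same memory location. */
atomic_long global_hits;

/* Local scheme: one counter per producer, each on its own cache line. */
struct local_counter {
	alignas(CACHE_LINE) atomic_long value;
};
static_assert(sizeof(struct local_counter) == CACHE_LINE,
	      "each counter must occupy exactly one cache line");

struct local_counter local_hits[MAX_PRODUCERS];

/* Body of a global-counter producer. */
void count_global(long n)
{
	while (n-- > 0)
		atomic_fetch_add(&global_hits, 1);
}

/* Body of a local-counter producer; touches only its own slot. */
void count_local(int producer, long n)
{
	while (n-- > 0)
		atomic_fetch_add(&local_hits[producer].value, 1);
}

/* Periodic collection: drain every per-producer counter and sum them. */
long collect_local(int nr_producers)
{
	long total = 0;
	int i;

	for (i = 0; i < nr_producers; i++)
		total += atomic_exchange(&local_hits[i].value, 0);
	return total;
}
```

    Dropping the alignas() would pack several counters into one cache line,
    which is the false-sharing case mentioned above.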
    
    The demo output shows the global counter first with 1 producer, then with 4.
    Both total and per-producer performance drop significantly (total hits fall
    from about 150M/s to about 54M/s). The last run is the local counter with 4
    producers, demonstrating near-perfect scalability.
    
    $ ./bench -a -w1 -d2 -p1 count-global
    Setting up benchmark 'count-global'...
    Benchmark 'count-global' started.
    Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
    Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
    Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
    Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
    Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s
    
    $ ./bench -a -w1 -d2 -p4 count-global
    Setting up benchmark 'count-global'...
    Benchmark 'count-global' started.
    Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
    Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
    Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
    Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
    Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s
    
    $ ./bench -a -w1 -d2 -p4 count-local
    Setting up benchmark 'count-local'...
    Benchmark 'count-local' started.
    Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
    Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
    Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
    Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
    Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s
    
    The benchmark runner supports setting thread affinity for producer and
    consumer threads. You can use the -a flag for the default CPU selection
    scheme, where the first consumer gets CPU #0, the next one gets CPU #1, and
    so on. Producer threads then pick up the next CPU and increment one by one
    as well. Alternatively, the user can specify a set of CPUs independently for
    producers and consumers with --prod-affinity 1,2-10,15 and --cons-affinity
    <set-of-cpus>. The latter allows forcing producers and consumers to share
    the same set of CPUs, if necessary.
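
    As a rough sketch of how a CPU list such as 1,2-10,15 might be turned into
    an affinity mask on Linux (the helper names parse_cpu_list and pin_self are
    hypothetical, and this is not the parser from this patch): each
    comma-separated item is either a single CPU or an inclusive lo-hi range,
    and the resulting cpu_set_t is applied to the calling thread with
    sched_setaffinity().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>

/*
 * Parse a list like "1,2-10,15" into `set`.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int parse_cpu_list(const char *spec, cpu_set_t *set)
{
	assert(spec && set);

	CPU_ZERO(set);
	if (!*spec)
		return -1;

	while (*spec) {
		char *end;
		long lo = strtol(spec, &end, 10);
		long hi = lo;
		long cpu;

		if (end == spec)
			return -1;		/* no digits where expected */
		if (*end == '-') {		/* inclusive range "lo-hi" */
			const char *p = end + 1;

			hi = strtol(p, &end, 10);
			if (end == p)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= CPU_SETSIZE)
			return -1;
		for (cpu = lo; cpu <= hi; cpu++)
			CPU_SET(cpu, set);

		if (*end == ',') {
			end++;
			if (!*end)
				return -1;	/* trailing comma */
		} else if (*end != '\0') {
			return -1;		/* unexpected character */
		}
		spec = end;
	}
	return 0;
}

/* Pin the calling thread to the CPUs in `set` (pid 0 means "this thread"). */
int pin_self(const cpu_set_t *set)
{
	return sched_setaffinity(0, sizeof(*set), set);
}
```

    A producer or consumer thread would call pin_self() at startup with the set
    parsed from its respective option; passing the same set to both groups
    corresponds to the shared-CPU case described above.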

    Signed-off-by: Andrii Nakryiko <andriin@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20200512192445.2351848-3-andriin@fb.com
bench.c 10.2 KB