    x86/csum: Improve performance of `csum_partial` · 688eb819
    Noah Goldstein authored
    1) Add special case for len == 40 as that is the hottest value. This
       nets a ~8-9% latency improvement and a ~30% throughput improvement
       in the len == 40 case.
    
    2) Use multiple accumulators in the 64-byte loop. This dramatically
       improves ILP and results in up to a 40% latency/throughput
       improvement (better for more iterations).
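
    The effect of point 2 can be illustrated with a portable sketch. This is
    not the kernel code (which uses inline asm with add-with-carry); the
    helper names and the choice of four accumulators are assumptions made
    for illustration only. With one accumulator every add waits on the
    previous one; with several, the chains are independent and are merged
    once after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Ones' complement add: a + b with the carry-out folded back in. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);	/* s < b exactly when the add wrapped */
}

/* One accumulator: every add depends on the result of the previous add. */
static uint64_t sum_one_acc(const unsigned char *p, size_t len)
{
	uint64_t acc = 0;

	assert(len % 64 == 0);	/* sketch only handles whole 64-byte blocks */
	for (size_t i = 0; i < len; i += 8) {
		uint64_t w;

		memcpy(&w, p + i, sizeof(w));
		acc = add64_carry(acc, w);
	}
	return acc;
}

/* Four accumulators: four independent dependency chains per 64 bytes. */
static uint64_t sum_four_acc(const unsigned char *p, size_t len)
{
	uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;

	assert(len % 64 == 0);
	for (size_t i = 0; i < len; i += 64) {
		uint64_t w[8];

		memcpy(w, p + i, sizeof(w));
		a0 = add64_carry(add64_carry(a0, w[0]), w[4]);
		a1 = add64_carry(add64_carry(a1, w[1]), w[5]);
		a2 = add64_carry(add64_carry(a2, w[2]), w[6]);
		a3 = add64_carry(add64_carry(a3, w[3]), w[7]);
	}
	/* Merge the chains once, after the loop. */
	return add64_carry(add64_carry(a0, a1), add64_carry(a2, a3));
}
```

    Because end-around-carry addition is commutative and associative, both
    functions return the same value for the same buffer; only the length of
    the dependency chain differs.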
    
    Results from benchmarking on Icelake. Times measured with rdtsc();
    r is the new/old ratio (lower is better):
     len   lat_new   lat_old      r    tput_new  tput_old      r
       8      3.58      3.47  1.032        3.58      3.51  1.021
      16      4.14      4.02  1.028        3.96      3.78  1.046
      24      4.99      5.03  0.992        4.23      4.03  1.050
      32      5.09      5.08  1.001        4.68      4.47  1.048
      40      5.57      6.08  0.916        3.05      4.43  0.690
      48      6.65      6.63  1.003        4.97      4.69  1.059
      56      7.74      7.72  1.003        5.22      4.95  1.055
      64      6.65      7.22  0.921        6.38      6.42  0.994
      96      9.43      9.96  0.946        7.46      7.54  0.990
     128      9.39     12.15  0.773        8.90      8.79  1.012
     200     12.65     18.08  0.699       11.63     11.60  1.002
     272     15.82     23.37  0.677       14.43     14.35  1.005
     440     24.12     36.43  0.662       21.57     22.69  0.951
     952     46.20     74.01  0.624       42.98     53.12  0.809
    1024     47.12     78.24  0.602       46.36     58.83  0.788
    1552     72.01    117.30  0.614       71.92     96.78  0.743
    2048     93.07    153.25  0.607       93.28    137.20  0.680
    2600    114.73    194.30  0.590      114.28    179.32  0.637
    3608    156.34    268.41  0.582      154.97    254.02  0.610
    4096    175.01    304.03  0.576      175.89    292.08  0.602
    
    There is no such thing as a free lunch, however, and the special case
    for len == 40 does add overhead to the len != 40 cases. This seems to
    amount to ~5% in throughput and slightly less in latency.
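
    Where that overhead comes from can be seen in a minimal sketch of the
    dispatch. The function names are hypothetical and the code is portable C
    rather than the kernel's inline asm: every call pays for the len == 40
    comparison, but only the 40-byte case gets the fixed-size, loop-free
    path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unaligned-safe 64-bit load. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t w;

	memcpy(&w, p, sizeof(w));
	return w;
}

/* Ones' complement add: a + b with the carry-out folded back in. */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t s = a + b;

	return s + (s < b);
}

/* Generic path: loop over whole words, then a zero-padded tail. */
static uint64_t sum_generic(const unsigned char *p, size_t len)
{
	uint64_t acc = 0;
	size_t i = 0;

	for (; i + 8 <= len; i += 8)
		acc = add64_carry(acc, load64(p + i));
	if (i < len) {
		unsigned char tail[8] = { 0 };

		memcpy(tail, p + i, len - i);
		acc = add64_carry(acc, load64(tail));
	}
	return acc;
}

/* Hypothetical entry point: one extra branch, then a straight-line path. */
static uint64_t sum_dispatch(const unsigned char *p, size_t len)
{
	if (len == 40) {
		/* Five words, no loop counter, no tail handling. */
		uint64_t lo = add64_carry(load64(p), load64(p + 8));
		uint64_t hi = add64_carry(load64(p + 16), load64(p + 24));

		return add64_carry(add64_carry(lo, hi), load64(p + 32));
	}
	return sum_generic(p, len);
}
```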
    
    Testing:
    Part of this change is a new kunit test. The tests check all
    alignment X length pairs in [0, 64) X [0, 512).
    There are three cases.
        1) Precomputed random inputs/seed. The expected results were
           generated using the generic implementation (which is assumed to
           be non-buggy).
        2) An input of all 1s. The goal of this test is to catch any case
           where a carry is missing.
        3) An input that never carries. The goal of this test is to catch
           any case of incorrectly carrying.
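
    The shape of such a test can be sketched in user space. This is not the
    kunit test itself: ref_csum(), fast_csum(), the fill patterns and the
    fixed seed are all assumptions made for illustration. A simple 16-bit
    reference is compared against a 64-bit implementation over the same
    alignment X length grid, once per input kind.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MAX_ALIGN = 64, MAX_LEN = 512 };
enum fill_kind { FILL_RANDOM, FILL_ALL_ONES, FILL_NO_CARRY };

/* Fold a wide ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: add native-endian 16-bit words, zero-padding an odd tail. */
static uint16_t ref_csum(const unsigned char *p, size_t len)
{
	uint64_t sum = 0;
	size_t i = 0;
	uint16_t w;

	for (; i + 2 <= len; i += 2) {
		memcpy(&w, p + i, 2);
		sum += w;
	}
	if (i < len) {
		unsigned char pad[2] = { p[i], 0 };

		memcpy(&w, pad, 2);
		sum += w;
	}
	return fold64(sum);
}

/* Implementation under test: 64-bit words with end-around carry. */
static uint16_t fast_csum(const unsigned char *p, size_t len)
{
	uint64_t sum = 0, w;
	size_t i = 0;

	for (; i + 8 <= len; i += 8) {
		memcpy(&w, p + i, 8);
		sum += w;
		sum += (sum < w);	/* fold the carry-out back in */
	}
	if (i < len) {
		unsigned char pad[8] = { 0 };

		memcpy(pad, p + i, len - i);
		memcpy(&w, pad, 8);
		sum += w;
		sum += (sum < w);
	}
	return fold64(sum);
}

static void fill_buf(unsigned char *buf, size_t n, enum fill_kind kind)
{
	uint32_t state = 12345u;	/* fixed seed keeps failures reproducible */

	for (size_t i = 0; i < n; i++) {
		if (kind == FILL_ALL_ONES) {
			buf[i] = 0xff;	/* every 64-bit add carries */
		} else if (kind == FILL_NO_CARRY) {
			buf[i] = 0x01;	/* 64-bit accumulator never wraps here */
		} else {
			state = state * 1664525u + 1013904223u;
			buf[i] = (unsigned char)(state >> 24);
		}
	}
}

/* Count (alignment, length) pairs where the two implementations differ. */
static int count_mismatches(enum fill_kind kind)
{
	static unsigned char buf[MAX_ALIGN + MAX_LEN];
	int bad = 0;

	fill_buf(buf, sizeof(buf), kind);
	for (size_t off = 0; off < MAX_ALIGN; off++)
		for (size_t len = 0; len < MAX_LEN; len++)
			bad += ref_csum(buf + off, len) != fast_csum(buf + off, len);
	return bad;
}
```

    A run passes when count_mismatches() returns 0 for each of the three
    fill kinds.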
    
    More exhaustive tests that test all alignment X length pairs in
    [0, 8192) X [0, 8192] on random data are also available here:
    https://github.com/goldsteinn/csum-reproduction
    
    The repository also has the code for reproducing the above benchmark
    numbers.
    Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com
csum-partial_64.c 4.15 KB