    crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM · e6e758fa
    Eric Biggers authored
    Rewrite the AES-NI implementations of AES-GCM, taking advantage of
    things I learned while writing the VAES-AVX10 implementations.  This is
    a complete rewrite that reduces the AES-NI GCM source code size by about
    70% and the binary code size by about 95%, while not regressing
    performance and in fact improving it significantly in many cases.
    
    The following summarizes the state before this patch:
    
    - The aesni-intel module registered algorithms "generic-gcm-aesni" and
      "rfc4106-gcm-aesni" with the crypto API that actually delegated to one
      of three underlying implementations according to the CPU capabilities
      detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.
    
    - The AES-NI + AVX and AES-NI + AVX2 assembly code was in
      aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
      257 KB of binary.  This massive binary size was not really
      appropriate, and depending on the kconfig it could take up over 1% of
      the size of the entire vmlinux.  The main loops did 8 blocks per
      iteration.  The AVX code minimized the use of carryless multiplication
      whereas the AVX2 code did not.  The "AVX2" code did not actually use
      AVX2; the check for AVX2 was really a check for Intel Haswell or later
      to detect support for fast carryless multiplication.  The long source
      length was caused by factors such as significant code duplication.
    
    - The AES-NI only assembly code was in aesni-intel_asm.S and consisted
      of 1501 lines of source and 15 KB of binary.  The main loops did 4
      blocks per iteration and minimized the use of carryless multiplication
      by using Karatsuba multiplication and a multiplication-less reduction.
    
    - The assembly code was contributed in 2010-2013.  Maintenance has been
      sporadic and most design choices haven't been revisited.
    
    - The assembly function prototypes and the corresponding glue code were
      separate from and were not consistent with the new VAES-AVX10 code I
      recently added.  The older code had several issues such as not
      precomputing the GHASH key powers, which hurt performance.
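    
    To illustrate why precomputed GHASH key powers matter, here is a minimal
    sketch in portable C.  It is not the kernel code: the type and function
    names (be128, gf128_mul, ghash_precompute, ghash_update_serial,
    ghash_update_8) are hypothetical, the multiply is the slow bit-by-bit
    reference method from NIST SP 800-38D rather than pclmulqdq, and the
    test values are arbitrary.  It only shows that an 8-block update using
    H^1..H^8 gives the same result as eight dependent multiplies by H:

```c
#include <stdint.h>

/* A 128-bit GHASH block: hi holds bytes 0..7 and lo holds bytes 8..15,
 * both loaded big-endian, so block bit 0 is the top bit of hi. */
typedef struct { uint64_t hi, lo; } be128;

static be128 be128_xor(be128 a, be128 b)
{
    be128 r = { a.hi ^ b.hi, a.lo ^ b.lo };
    return r;
}

/* Bit-by-bit GF(2^128) multiply following NIST SP 800-38D, Algorithm 1.
 * Slow reference code, used here only to keep the sketch portable. */
static be128 gf128_mul(be128 x, be128 y)
{
    be128 z = { 0, 0 }, v = y;

    for (int i = 0; i < 128; i++) {
        uint64_t word = (i < 64) ? x.hi : x.lo;
        uint64_t lsb;

        if ((word >> (63 - (i & 63))) & 1)
            z = be128_xor(z, v);
        /* v = v * x, reduced by the GCM polynomial */
        lsb = v.lo & 1;
        v.lo = (v.lo >> 1) | (v.hi << 63);
        v.hi >>= 1;
        if (lsb)
            v.hi ^= 0xE100000000000000ULL; /* R = 11100001 || 0^120 */
    }
    return z;
}

/* Precompute H^1..H^8 once per key: pow[k] = H^(k+1). */
static void ghash_precompute(be128 h, be128 pow[8])
{
    pow[0] = h;
    for (int k = 1; k < 8; k++)
        pow[k] = gf128_mul(pow[k - 1], h);
}

/* Serial (Horner) update: each multiply depends on the previous one. */
static be128 ghash_update_serial(be128 acc, const be128 blk[8], be128 h)
{
    for (int i = 0; i < 8; i++)
        acc = gf128_mul(be128_xor(acc, blk[i]), h);
    return acc;
}

/* Aggregated update: eight independent multiplies by precomputed powers. */
static be128 ghash_update_8(be128 acc, const be128 blk[8], const be128 pow[8])
{
    be128 sum = gf128_mul(be128_xor(acc, blk[0]), pow[7]);

    for (int i = 1; i < 8; i++)
        sum = be128_xor(sum, gf128_mul(blk[i], pow[7 - i]));
    return sum;
}

/* Returns 1 if both strategies agree on arbitrary test data. */
static int ghash_updates_agree(void)
{
    be128 h = { 0x66e94bd4ef8a2c3bULL, 0x884cfa59ca342b2eULL };
    be128 acc = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
    be128 blk[8], pow[8], a, b;
    uint64_t s = 0x9e3779b97f4a7c15ULL;

    for (int i = 0; i < 8; i++) {
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        blk[i].hi = s;
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        blk[i].lo = s;
    }
    ghash_precompute(h, pow);
    a = ghash_update_serial(acc, blk, h);
    b = ghash_update_8(acc, blk, pow);
    return a.hi == b.hi && a.lo == b.lo;
}
```

    In the aggregated form the eight products do not depend on each other,
    so they can be computed back to back and summed, which is what makes an
    8-blocks-per-iteration main loop practical.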
    
    This rewrite achieves the following goals:
    
    - Much shorter source and binary sizes.  The assembly source shrinks
      from 4300 lines to 1130 lines, and it produces about 9 KB of binary
      instead of 272 KB.  This is achieved via a better designed AES-GCM
      implementation that doesn't excessively unroll the code and instead
      prioritizes the parts that really matter.  Sharing the C glue code
      with the VAES-AVX10 implementations also saves 250 lines of C source.
    
    - Improve performance on most (possibly all) CPUs on which this code
      runs, for most (possibly all) message lengths.  Benchmark results are
      given in Tables 1 and 2 below.
    
    - Use the same function prototypes and glue code as the new VAES-AVX10
      algorithms.  This fixes some issues with the integration of the
      assembly and results in some significant performance improvements,
      primarily on short messages.  Also, the AVX and non-AVX
      implementations are now registered as separate algorithms with the
      crypto API, which makes them both testable by the self-tests.
    
    - Keep support for AES-NI without AVX (for Westmere, Silvermont,
      Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
      Since 256-bit vectors cannot be used without VAES anyway, this is made
      feasible by just using the non-VEX coded form of most instructions.
    
    - Use a unified approach where the main loop does 8 blocks per iteration
      and uses Karatsuba multiplication to save one pclmulqdq per block but
      does not use the multiplication-less reduction.  This strikes a good
      balance across the range of CPUs on which this code runs.
    
    - Don't spam the kernel log with an informational message on every boot.
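    
    The Karatsuba trade-off mentioned above can be sketched in portable C.
    This is an illustration under stated assumptions, not the assembly in
    this patch: clmul64() is a hypothetical software stand-in for a single
    pclmulqdq, the other names are invented for the example, and only the
    unreduced 128x128 -> 256-bit carryless product is shown (no GHASH bit
    reflection or reduction).  The schoolbook method needs four 64x64
    carryless multiplies, while Karatsuba needs three plus a few XORs:

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;

static int clmul_count; /* number of 64x64 carryless multiplies issued */

/* Software stand-in for one pclmulqdq: carryless 64x64 -> 128 bits. */
static u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = { 0, 0 };

    clmul_count++;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i) /* avoid an undefined shift by 64 when i == 0 */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* Schoolbook: four multiplies.  p[0] is the least significant word. */
static void clmul128_schoolbook(u128 a, u128 b, uint64_t p[4])
{
    u128 lo = clmul64(a.lo, b.lo);
    u128 hi = clmul64(a.hi, b.hi);
    u128 m1 = clmul64(a.lo, b.hi);
    u128 m2 = clmul64(a.hi, b.lo);

    p[0] = lo.lo;
    p[1] = lo.hi ^ m1.lo ^ m2.lo;
    p[2] = hi.lo ^ m1.hi ^ m2.hi;
    p[3] = hi.hi;
}

/* Karatsuba: three multiplies.  The middle term is recovered as
 * (a.lo ^ a.hi) * (b.lo ^ b.hi) ^ lo ^ hi. */
static void clmul128_karatsuba(u128 a, u128 b, uint64_t p[4])
{
    u128 lo = clmul64(a.lo, b.lo);
    u128 hi = clmul64(a.hi, b.hi);
    u128 mid = clmul64(a.lo ^ a.hi, b.lo ^ b.hi);

    mid.lo ^= lo.lo ^ hi.lo;
    mid.hi ^= lo.hi ^ hi.hi;
    p[0] = lo.lo;
    p[1] = lo.hi ^ mid.lo;
    p[2] = hi.lo ^ mid.hi;
    p[3] = hi.hi;
}
```

    Whether trading one multiply for extra XORs is a win depends on how
    fast carryless multiplication is on a given CPU.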
    
    The following tables summarize the improvement in AES-GCM throughput on
    various CPU microarchitectures as a result of this patch:
    
    Table 1: AES-256-GCM encryption throughput improvement,
             CPU microarchitecture vs. message length in bytes:
    
                       | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
    -------------------+-------+-------+-------+-------+-------+-------+
    Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
    Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
    Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
    AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
    AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
    AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |
    
                       |   300 |   200 |    64 |    63 |    16 |
    -------------------+-------+-------+-------+-------+-------+
    Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
    Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
    Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
    AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
    AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
    AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |
    
    Table 2: AES-256-GCM decryption throughput improvement,
             CPU microarchitecture vs. message length in bytes:
    
                       | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
    -------------------+-------+-------+-------+-------+-------+-------+
    Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
    Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
    Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
    AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
    AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
    AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |
    
                       |   300 |   200 |    64 |    63 |    16 |
    -------------------+-------+-------+-------+-------+-------+
    Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
    Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
    Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
    AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
    AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
    AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |
    
    The above numbers are percentage improvements in single-thread
    throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
    listed as 10%.  They were collected by directly measuring the Linux
    crypto API performance using a custom kernel module.  Note that indirect
    benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
    include more overhead and won't see quite as much of a difference.  All
    these benchmarks used an associated data length of 16 bytes.  Note that
    AES-GCM is almost always used with short associated data lengths.
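    
    For clarity, the percentage convention used in the tables can be
    written as a small C helper.  The function name is hypothetical and
    this is not the benchmark module itself, only the arithmetic:

```c
/* Relative throughput improvement in percent, e.g. 3000 MB/s -> 3300 MB/s
 * is reported as 10. */
static double improvement_percent(double old_mbps, double new_mbps)
{
    return 100.0 * (new_mbps - old_mbps) / old_mbps;
}
```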
    
    I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
    Intel low-power CPUs, as these weren't readily available to me.
    However, based on the design of the new code and the available
    information about these other CPU microarchitectures, I wouldn't expect
    any significant regressions, and there's a good chance performance is
    improved just as it is above.
    
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>