1. 06 Jul, 2024 8 commits
  2. 28 Jun, 2024 10 commits
  3. 21 Jun, 2024 9 commits
  4. 16 Jun, 2024 5 commits
    • crypto: ecc - Fix off-by-one missing to clear most significant digit · 1dcf865d
      Stefan Berger authored
      Fix an off-by-one error where the most significant digit was not
      initialized, leading to signature verification failures in testmgr.
      
      Example: If a curve requires ndigits (=9) and diff (=2) indicates that
      2 digits need to be set to zero, then start with digit 'ndigits - diff' (=7)
      and clear 'diff' digits starting from there, i.e. digits 7 and 8.
      Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
      Closes: https://lore.kernel.org/linux-crypto/619bc2de-b18a-4939-a652-9ca886bf6349@linux.ibm.com/T/#m045d8812409ce233c17fcdb8b88b6629c671f9f4
      Fixes: 2fd2a82c ("crypto: ecdsa - Use ecc_digits_from_bytes to create hash digits array")
      Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
      Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
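
      The indexing in the example above can be sketched in C as follows. This
      is an illustrative stand-in, not the kernel code: the helper name
      clear_top_digits, the uint64_t digit type, and the convention that
      digits[0] is the least significant digit are assumptions of the sketch.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not the kernel function): zero the 'diff' most
 * significant digits of an array of 'ndigits' digits, where digits[0]
 * is assumed to be the least significant digit.
 */
void clear_top_digits(uint64_t *digits, unsigned int ndigits, unsigned int diff)
{
    if (diff == 0 || diff > ndigits)
        return;
    /* Start at index ndigits - diff: ndigits = 9, diff = 2 clears 7 and 8. */
    memset(&digits[ndigits - diff], 0, diff * sizeof(digits[0]));
}
```

      Starting one index too high would leave digit 'ndigits - diff'
      uninitialized, which is the kind of off-by-one the commit describes.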
    • crypto: ecc - Add comment to ecc_digits_from_bytes about input byte array · 0eb3bed5
      Stefan Berger authored
      Add comment to ecc_digits_from_bytes kdoc that the first byte is expected
      to hold the most significant bits of the large integer that is converted
      into an array of digits.
      Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
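
      A minimal sketch of the byte-order convention described above, assuming
      64-bit digits stored least significant digit first. The function name
      digits_from_bytes_be is hypothetical and the loop is a simplified
      portable version, not the kernel implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a big-endian byte array (in[0] holds the
 * most significant bits) into 64-bit digits, with out[0] assumed to be
 * the least significant digit. Unused high digits are set to zero.
 */
void digits_from_bytes_be(const uint8_t *in, size_t nbytes,
                          uint64_t *out, size_t ndigits)
{
    for (size_t i = 0; i < ndigits; i++)
        out[i] = 0;

    /* Walk from the last (least significant) byte toward the first. */
    for (size_t i = 0; i < nbytes && i / 8 < ndigits; i++) {
        uint8_t byte = in[nbytes - 1 - i];

        out[i / 8] |= (uint64_t)byte << (8 * (i % 8));
    }
}
```

      With the nine input bytes 01 02 ... 09 and two digits, the first byte
      (01) ends up alone in the most significant digit, and the remaining
      eight bytes fill the least significant digit.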
    • hwrng: core - Remove list.h from the hw_random.h · 4604b388
      Andy Shevchenko authored
      The 'struct list_head' type is defined in types.h, so there is no need to
      include list.h for that.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • dt-bindings: rng: meson: add optional power-domains · 293695f1
      Neil Armstrong authored
      On newer SoCs, the random number generator can require a power domain to
      operate; add it as an optional property.
      Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
      Acked-by: Rob Herring (Arm) <robh@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: ccp - Fix null pointer dereference in __sev_snp_shutdown_locked · 468e3295
      Kim Phillips authored
      Fix a null pointer dereference induced by DEBUG_TEST_DRIVER_REMOVE.
      Return from __sev_snp_shutdown_locked() if the psp_device or the
      sev_device structs are not initialized. Without the fix, the driver will
      produce the following splat:
      
         ccp 0000:55:00.5: enabling device (0000 -> 0002)
         ccp 0000:55:00.5: sev enabled
         ccp 0000:55:00.5: psp enabled
         BUG: kernel NULL pointer dereference, address: 00000000000000f0
         #PF: supervisor read access in kernel mode
         #PF: error_code(0x0000) - not-present page
         PGD 0 P4D 0
         Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
         CPU: 262 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc1+ #29
         RIP: 0010:__sev_snp_shutdown_locked+0x2e/0x150
         Code: 00 55 48 89 e5 41 57 41 56 41 54 53 48 83 ec 10 41 89 f7 49 89 fe 65 48 8b 04 25 28 00 00 00 48 89 45 d8 48 8b 05 6a 5a 7f 06 <4c> 8b a0 f0 00 00 00 41 0f b6 9c 24 a2 00 00 00 48 83 fb 02 0f 83
         RSP: 0018:ffffb2ea4014b7b8 EFLAGS: 00010286
         RAX: 0000000000000000 RBX: ffff9e4acd2e0a28 RCX: 0000000000000000
         RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb2ea4014b808
         RBP: ffffb2ea4014b7e8 R08: 0000000000000106 R09: 000000000003d9c0
         R10: 0000000000000001 R11: ffffffffa39ff070 R12: ffff9e49d40590c8
         R13: 0000000000000000 R14: ffffb2ea4014b808 R15: 0000000000000000
         FS:  0000000000000000(0000) GS:ffff9e58b1e00000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00000000000000f0 CR3: 0000000418a3e001 CR4: 0000000000770ef0
         PKRU: 55555554
         Call Trace:
          <TASK>
          ? __die_body+0x6f/0xb0
          ? __die+0xcc/0xf0
          ? page_fault_oops+0x330/0x3a0
          ? save_trace+0x2a5/0x360
          ? do_user_addr_fault+0x583/0x630
          ? exc_page_fault+0x81/0x120
          ? asm_exc_page_fault+0x2b/0x30
          ? __sev_snp_shutdown_locked+0x2e/0x150
          __sev_firmware_shutdown+0x349/0x5b0
          ? pm_runtime_barrier+0x66/0xe0
          sev_dev_destroy+0x34/0xb0
          psp_dev_destroy+0x27/0x60
          sp_destroy+0x39/0x90
          sp_pci_remove+0x22/0x60
          pci_device_remove+0x4e/0x110
          really_probe+0x271/0x4e0
          __driver_probe_device+0x8f/0x160
          driver_probe_device+0x24/0x120
          __driver_attach+0xc7/0x280
          ? driver_attach+0x30/0x30
          bus_for_each_dev+0x10d/0x130
          driver_attach+0x22/0x30
          bus_add_driver+0x171/0x2b0
          ? unaccepted_memory_init_kdump+0x20/0x20
          driver_register+0x67/0x100
          __pci_register_driver+0x83/0x90
          sp_pci_init+0x22/0x30
          sp_mod_init+0x13/0x30
          do_one_initcall+0xb8/0x290
          ? sched_clock_noinstr+0xd/0x10
          ? local_clock_noinstr+0x3e/0x100
          ? stack_depot_save_flags+0x21e/0x6a0
          ? local_clock+0x1c/0x60
          ? stack_depot_save_flags+0x21e/0x6a0
          ? sched_clock_noinstr+0xd/0x10
          ? local_clock_noinstr+0x3e/0x100
          ? __lock_acquire+0xd90/0xe30
          ? sched_clock_noinstr+0xd/0x10
          ? local_clock_noinstr+0x3e/0x100
          ? __create_object+0x66/0x100
          ? local_clock+0x1c/0x60
          ? __create_object+0x66/0x100
          ? parameq+0x1b/0x90
          ? parse_one+0x6d/0x1d0
          ? parse_args+0xd7/0x1f0
          ? do_initcall_level+0x180/0x180
          do_initcall_level+0xb0/0x180
          do_initcalls+0x60/0xa0
          ? kernel_init+0x1f/0x1d0
          do_basic_setup+0x41/0x50
          kernel_init_freeable+0x1ac/0x230
          ? rest_init+0x1f0/0x1f0
          kernel_init+0x1f/0x1d0
          ? rest_init+0x1f0/0x1f0
          ret_from_fork+0x3d/0x50
          ? rest_init+0x1f0/0x1f0
          ret_from_fork_asm+0x11/0x20
          </TASK>
         Modules linked in:
         CR2: 00000000000000f0
         ---[ end trace 0000000000000000 ]---
         RIP: 0010:__sev_snp_shutdown_locked+0x2e/0x150
         Code: 00 55 48 89 e5 41 57 41 56 41 54 53 48 83 ec 10 41 89 f7 49 89 fe 65 48 8b 04 25 28 00 00 00 48 89 45 d8 48 8b 05 6a 5a 7f 06 <4c> 8b a0 f0 00 00 00 41 0f b6 9c 24 a2 00 00 00 48 83 fb 02 0f 83
         RSP: 0018:ffffb2ea4014b7b8 EFLAGS: 00010286
         RAX: 0000000000000000 RBX: ffff9e4acd2e0a28 RCX: 0000000000000000
         RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb2ea4014b808
         RBP: ffffb2ea4014b7e8 R08: 0000000000000106 R09: 000000000003d9c0
         R10: 0000000000000001 R11: ffffffffa39ff070 R12: ffff9e49d40590c8
         R13: 0000000000000000 R14: ffffb2ea4014b808 R15: 0000000000000000
         FS:  0000000000000000(0000) GS:ffff9e58b1e00000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00000000000000f0 CR3: 0000000418a3e001 CR4: 0000000000770ef0
         PKRU: 55555554
         Kernel panic - not syncing: Fatal exception
         Kernel Offset: 0x1fc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      
      Fixes: 1ca5614b ("crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP")
      Cc: stable@vger.kernel.org
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
      Reviewed-by: John Allen <john.allen@amd.com>
      Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
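
      The early return described in this commit can be sketched as below. The
      struct definitions, the global psp_master_sketch, and the function name
      snp_shutdown_sketch are simplified stand-ins invented for illustration;
      only the shape of the guard follows the commit message.

```c
#include <stddef.h>

/* Stand-in types for the sketch; the real driver structs are much larger. */
struct sev_device {
    int snp_initialized;
};

struct psp_device {
    struct sev_device *sev_data;
};

/* Stand-in for the driver's global device pointer; NULL until set up. */
struct psp_device *psp_master_sketch;

/* Returns 0 when there is nothing to shut down, 1 after a simulated shutdown. */
int snp_shutdown_sketch(void)
{
    struct psp_device *psp = psp_master_sketch;
    struct sev_device *sev;

    /* Guard: return before dereferencing an uninitialized device. */
    if (!psp || !psp->sev_data)
        return 0;

    sev = psp->sev_data;
    if (!sev->snp_initialized)
        return 0;

    sev->snp_initialized = 0;   /* the firmware command would be issued here */
    return 1;
}
```

      Without the guard, reading a field through a NULL device pointer faults
      at a small offset from address zero, which is consistent with the
      faulting address 0xf0 in the splat above.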
  5. 07 Jun, 2024 8 commits
    • hwrng: omap - add missing MODULE_DESCRIPTION() macro · 6d4e1993
      Jeff Johnson authored
      make allmodconfig && make W=1 C=1 reports:
      WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/hw_random/omap-rng.o
      WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/hw_random/omap3-rom-rng.o
      
      Add the missing invocation of the MODULE_DESCRIPTION() macro.
      Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: xilinx - add missing MODULE_DESCRIPTION() macro · ed6261d5
      Jeff Johnson authored
      make allmodconfig && make W=1 C=1 reports:
      WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/xilinx/zynqmp-aes-gcm.o
      
      Add the missing invocation of the MODULE_DESCRIPTION() macro.
      Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
      Reviewed-by: Michal Simek <michal.simek@amd.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: sa2ul - add missing MODULE_DESCRIPTION() macro · c8edb3cc
      Jeff Johnson authored
      make allmodconfig && make W=1 C=1 reports:
      WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/sa2ul.o
      
      Add the missing invocation of the MODULE_DESCRIPTION() macro.
      Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: keembay - add missing MODULE_DESCRIPTION() macro · f2cbb746
      Jeff Johnson authored
      make allmodconfig && make W=1 C=1 reports:
      WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/intel/keembay/keembay-ocs-hcu.o
      
      Add the missing invocation of the MODULE_DESCRIPTION() macro.
      Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: atmel-sha204a - add missing MODULE_DESCRIPTION() macro · 3aa461e3
      Jeff Johnson authored
      make allmodconfig && make W=1 C=1 reports:
      WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/atmel-sha204a.o
      
      Add the missing invocation of the MODULE_DESCRIPTION() macro.
      Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM · e6e758fa
      Eric Biggers authored
      Rewrite the AES-NI implementations of AES-GCM, taking advantage of
      things I learned while writing the VAES-AVX10 implementations.  This is
      a complete rewrite that reduces the AES-NI GCM source code size by about
      70% and the binary code size by about 95%, while not regressing
      performance and in fact improving it significantly in many cases.
      
      The following summarizes the state before this patch:
      
      - The aesni-intel module registered algorithms "generic-gcm-aesni" and
        "rfc4106-gcm-aesni" with the crypto API that actually delegated to one
        of three underlying implementations according to the CPU capabilities
        detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.
      
      - The AES-NI + AVX and AES-NI + AVX2 assembly code was in
        aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
        257 KB of binary.  This massive binary size was not really
        appropriate, and depending on the kconfig it could take up over 1% of
        the size of the entire vmlinux.  The main loops did 8 blocks per
        iteration.  The AVX code minimized the use of carryless multiplication
        whereas the AVX2 code did not.  The "AVX2" code did not actually use
        AVX2; the check for AVX2 was really a check for Intel Haswell or later
        to detect support for fast carryless multiplication.  The long source
        length was caused by factors such as significant code duplication.
      
      - The AES-NI only assembly code was in aesni-intel_asm.S and consisted
        of 1501 lines of source and 15 KB of binary.  The main loops did 4
        blocks per iteration and minimized the use of carryless multiplication
        by using Karatsuba multiplication and a multiplication-less reduction.
      
      - The assembly code was contributed in 2010-2013.  Maintenance has been
        sporadic and most design choices haven't been revisited.
      
      - The assembly function prototypes and the corresponding glue code were
        separate from and were not consistent with the new VAES-AVX10 code I
        recently added.  The older code had several issues such as not
        precomputing the GHASH key powers, which hurt performance.
      
      This rewrite achieves the following goals:
      
      - Much shorter source and binary sizes.  The assembly source shrinks
        from 4300 lines to 1130 lines, and it produces about 9 KB of binary
        instead of 272 KB.  This is achieved via a better designed AES-GCM
        implementation that doesn't excessively unroll the code and instead
        prioritizes the parts that really matter.  Sharing the C glue code
        with the VAES-AVX10 implementations also saves 250 lines of C source.
      
      - Improve performance on most (possibly all) CPUs on which this code
        runs, for most (possibly all) message lengths.  Benchmark results are
        given in Tables 1 and 2 below.
      
      - Use the same function prototypes and glue code as the new VAES-AVX10
        algorithms.  This fixes some issues with the integration of the
        assembly and results in some significant performance improvements,
        primarily on short messages.  Also, the AVX and non-AVX
        implementations are now registered as separate algorithms with the
        crypto API, which makes them both testable by the self-tests.
      
      - Keep support for AES-NI without AVX (for Westmere, Silvermont,
        Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
        Since 256-bit vectors cannot be used without VAES anyway, this is made
        feasible by just using the non-VEX coded form of most instructions.
      
      - Use a unified approach where the main loop does 8 blocks per iteration
        and uses Karatsuba multiplication to save one pclmulqdq per block but
        does not use the multiplication-less reduction.  This strikes a good
        balance across the range of CPUs on which this code runs.
      
      - Don't spam the kernel log with an informational message on every boot.
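
      The Karatsuba saving mentioned in the goals above (one fewer carryless
      multiplication per block) can be illustrated with a plain C sketch. It
      is not the assembly from this patch: clmul64 is a bit-by-bit software
      stand-in for one pclmulqdq, all names are hypothetical, and the GHASH
      reduction step is left out.

```c
#include <stdint.h>

/* 128-bit result of a 64 x 64 carryless multiplication. */
struct clmul_result {
    uint64_t lo, hi;
};

/* Software stand-in for one pclmulqdq (64 x 64 -> 128 bits). */
struct clmul_result clmul64(uint64_t a, uint64_t b)
{
    struct clmul_result r = { 0, 0 };

    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i)
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* Schoolbook 128 x 128 -> 256 bits: four 64 x 64 multiplications. */
void clmul128_schoolbook(const uint64_t a[2], const uint64_t b[2], uint64_t r[4])
{
    struct clmul_result lo = clmul64(a[0], b[0]);
    struct clmul_result hi = clmul64(a[1], b[1]);
    struct clmul_result m0 = clmul64(a[0], b[1]);
    struct clmul_result m1 = clmul64(a[1], b[0]);

    r[0] = lo.lo;
    r[1] = lo.hi ^ m0.lo ^ m1.lo;
    r[2] = hi.lo ^ m0.hi ^ m1.hi;
    r[3] = hi.hi;
}

/* Karatsuba: three multiplications; the middle term is recovered by XOR. */
void clmul128_karatsuba(const uint64_t a[2], const uint64_t b[2], uint64_t r[4])
{
    struct clmul_result lo = clmul64(a[0], b[0]);
    struct clmul_result hi = clmul64(a[1], b[1]);
    struct clmul_result mid = clmul64(a[0] ^ a[1], b[0] ^ b[1]);

    /* (a0 + a1)(b0 + b1) + a0*b0 + a1*b1 = a0*b1 + a1*b0 over GF(2). */
    mid.lo ^= lo.lo ^ hi.lo;
    mid.hi ^= lo.hi ^ hi.hi;

    r[0] = lo.lo;
    r[1] = lo.hi ^ mid.lo;
    r[2] = hi.lo ^ mid.hi;
    r[3] = hi.hi;
}
```

      Both functions produce the same 256-bit product; the Karatsuba variant
      trades one multiplication for a few extra XORs.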
      
      The following tables summarize the improvement in AES-GCM throughput on
      various CPU microarchitectures as a result of this patch:
      
      Table 1: AES-256-GCM encryption throughput improvement,
               CPU microarchitecture vs. message length in bytes:
      
                         | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
      -------------------+-------+-------+-------+-------+-------+-------+
      Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
      Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
      Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
      AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
      AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
      AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |
      
                         |   300 |   200 |    64 |    63 |    16 |
      -------------------+-------+-------+-------+-------+-------+
      Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
      Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
      Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
      AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
      AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
      AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |
      
      Table 2: AES-256-GCM decryption throughput improvement,
               CPU microarchitecture vs. message length in bytes:
      
                         | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
      -------------------+-------+-------+-------+-------+-------+-------+
      Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
      Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
      Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
      AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
      AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
      AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |
      
                         |   300 |   200 |    64 |    63 |    16 |
      -------------------+-------+-------+-------+-------+-------+
      Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
      Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
      Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
      AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
      AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
      AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |
      
      The above numbers are percentage improvements in single-thread
      throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
      listed as 10%.  They were collected by directly measuring the Linux
      crypto API performance using a custom kernel module.  Note that indirect
      benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
      include more overhead and won't see quite as much of a difference.  All
      these benchmarks used an associated data length of 16 bytes.  Note that
      AES-GCM is almost always used with short associated data lengths.
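
      For reference, the percentage convention used in the tables can be
      written as a one-line C function. The name improvement_percent is
      hypothetical and is not part of the benchmark module mentioned above.

```c
/* Percentage improvement in throughput, as listed in the tables. */
double improvement_percent(double before_mbps, double after_mbps)
{
    return (after_mbps - before_mbps) / before_mbps * 100.0;
}
```

      For example, improvement_percent(3000.0, 3300.0) evaluates to 10, up to
      floating-point rounding.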
      
      I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
      Intel low-power CPUs, as these weren't readily available to me.
      However, based on the design of the new code and the available
      information about these other CPU microarchitectures, I wouldn't expect
      any significant regressions, and there's a good chance performance is
      improved just as it is above.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM · b06affb1
      Eric Biggers authored
      Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
      AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
      AVX10.  There are two implementations, sharing most source code: one
      using 256-bit vectors and one using 512-bit vectors.  This patch
      improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.
      
      I wrote the new AES-GCM assembly code from scratch, focusing on
      correctness, performance, code size (both source and binary), and
      documenting the source.  The new assembly file aes-gcm-avx10-x86_64.S is
      about 1200 lines including extensive comments, and it generates less
      than 8 KB of binary code.  The main loop does 4 vectors at a time, with
      the AES and GHASH instructions interleaved.  Any remainder is handled
      using a simple 1 vector at a time loop, with masking.
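
      The loop structure described above (a main loop handling 4 vectors per
      iteration, then a 1-vector remainder loop with masking) can be sketched
      in plain C. This is only an outline under assumptions: VEC_BYTES, the
      XOR placeholder transform, and all names are hypothetical, and the
      interleaving of AES and GHASH instructions is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 64    /* assumed vector width for the sketch (512 bits) */

struct loop_counts {
    size_t main_iters;  /* iterations that handled 4 full vectors */
    size_t tail_iters;  /* iterations that handled 1 (possibly partial) vector */
};

/* Stand-in for the per-vector work; 'n' plays the role of the byte mask. */
static void process_vector(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] ^ 0x5a;     /* placeholder transform, not AES-GCM */
}

struct loop_counts process_buffer(uint8_t *dst, const uint8_t *src, size_t len)
{
    struct loop_counts c = { 0, 0 };

    /* Main loop: 4 vectors per iteration. */
    while (len >= 4 * VEC_BYTES) {
        for (int v = 0; v < 4; v++)
            process_vector(dst + v * VEC_BYTES, src + v * VEC_BYTES, VEC_BYTES);
        dst += 4 * VEC_BYTES;
        src += 4 * VEC_BYTES;
        len -= 4 * VEC_BYTES;
        c.main_iters++;
    }

    /* Remainder: 1 vector per iteration; the last one may be partial. */
    while (len > 0) {
        size_t n = len < VEC_BYTES ? len : VEC_BYTES;

        process_vector(dst, src, n);
        dst += n;
        src += n;
        len -= n;
        c.tail_iters++;
    }
    return c;
}
```

      With these assumed sizes, a 1000-byte buffer takes 3 main-loop
      iterations (768 bytes) and 4 remainder iterations (64 + 64 + 64 + 40
      bytes).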
      
      Several VAES + AVX512 implementations of AES-GCM exist from Intel,
      including one in OpenSSL and one proposed for inclusion in Linux in 2021
      (https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
      These aren't really suitable to be used, though, due to the massive
      amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
      as well as the significantly larger amount of assembly source (4978
      lines for OpenSSL, 1788 lines for Linux).  Also, Intel's code does not
      support 256-bit vectors, which makes it not usable on future
      AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
      downclocking issues.  So I ended up starting from scratch.  Usually my
      much shorter code is actually slightly faster than Intel's AVX512 code,
      though it depends on message length and on which of Intel's
      implementations is used; for details, see Tables 3 and 4 below.
      
      To facilitate potential integration into other projects, I've
      dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
      the same as the recently added RISC-V crypto code.
      
      The following two tables summarize the performance improvement over the
      existing AES-GCM code in Linux that uses AES-NI and AVX2:
      
      Table 1: AES-256-GCM encryption throughput improvement,
               CPU microarchitecture vs. message length in bytes:
      
                            | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
      ----------------------+-------+-------+-------+-------+-------+-------+
      Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
      Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
      Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
      AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |
      
                            |   300 |   200 |    64 |    63 |    16 |
      ----------------------+-------+-------+-------+-------+-------+
      Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
      Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
      Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
      AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |
      
      Table 2: AES-256-GCM decryption throughput improvement,
               CPU microarchitecture vs. message length in bytes:
      
                            | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
      ----------------------+-------+-------+-------+-------+-------+-------+
      Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
      Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
      Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
      AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |
      
                            |   300 |   200 |    64 |    63 |    16 |
      ----------------------+-------+-------+-------+-------+-------+
      Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
      Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
      Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
      AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |
      
      The above numbers are percentage improvements in single-thread
      throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
      listed as 50%.  They were collected by directly measuring the Linux
      crypto API performance using a custom kernel module.  Note that indirect
      benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
      include more overhead and won't see quite as much of a difference.  All
      these benchmarks used an associated data length of 16 bytes.  Note that
      AES-GCM is almost always used with short associated data lengths.
      
      The following two tables summarize how the performance of my code
      compares with Intel's AVX512 AES-GCM code, both the version that is in
      OpenSSL and the version that was proposed for inclusion in Linux.
      Neither version exists in Linux currently, but these are alternative
      AES-GCM implementations that could be chosen instead of mine.  I
      collected the following numbers on Emerald Rapids using a userspace
      benchmark program that calls the assembly functions directly.
      
      I've also included a comparison with Cloudflare's AES-GCM implementation
      from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.
      
      Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
               implementation name vs. message length in bytes:
      
                           | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
      ---------------------+-------+-------+-------+-------+-------+-------+
      This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
      AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
      AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
      AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |
      
                           |   300 |   200 |    64 |    63 |    16 |
      ---------------------+-------+-------+-------+-------+-------+
      This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
      AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
      AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
      AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |
      
      Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
               implementation name vs. message length in bytes:
      
                           | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
      ---------------------+-------+-------+-------+-------+-------+-------+
      This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
      AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
      AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
      AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |
      
                           |   300 |   200 |    64 |    63 |    16 |
      ---------------------+-------+-------+-------+-------+-------+
      This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
      AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
      AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
      AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |
      
      So, usually my code is actually slightly faster than Intel's code,
      though the OpenSSL implementation has a slight edge on messages shorter
      than 256 bytes in this microbenchmark.  (This also holds true when doing
      the same tests on AMD Zen 4.)  It can be seen that the large code size
      (up to 94x larger!) of the Intel implementations doesn't seem to bring
      much benefit, so starting from scratch with much smaller code, as I've
      done, seems appropriate.  The performance of my code on messages shorter
      than 256 bytes could be improved through a limited amount of unrolling,
      but it's unclear it would be worth it, given code size considerations
      (e.g. caches) that don't get measured in microbenchmarks.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: hisilicon/zip - optimize the address offset of the reg query function · c17b56d9
      Chenghai Huang authored
      Currently, the register to query is looked up in an array of fixed
      address offsets. When the number of accelerator cores changes, the
      code cannot adapt to the change.
      
      Therefore, calculate the address of the register to query from the
      comp or decomp core base address instead.
      Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
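
      The idea of computing a register address from a per-core base address,
      rather than reading it from a fixed offset table, can be sketched as
      follows. The base and stride values and the function name are
      hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/* Hypothetical layout values for the sketch, not the driver's constants. */
#define CORE_REGION_BASE    0x1000u  /* base of core 0's register block */
#define CORE_REGION_STRIDE  0x100u   /* distance between consecutive cores */

/* Compute a register address from the core's base address. */
uint32_t core_reg_addr(unsigned int core_index, uint32_t reg_offset)
{
    uint32_t core_base = CORE_REGION_BASE + core_index * CORE_REGION_STRIDE;

    return core_base + reg_offset;
}
```

      Because the address is derived from the core index, a different core
      count needs no change to a hard-coded table.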