Commits · 8323c036789b8b4a61925fce439a89dba17b7f2f · Kirill Smelkov / linux

06 Jul, 2024 8 commits

crypto: starfive - Fix nent assignment in rsa dec · 8323c036

Jia Jie Ho authored Jun 26, 2024

Missing src scatterlist nent assignment in rsa decrypt function.
Removing all unneeded assignment and use nents value from req->src
instead.
Signed-off-by: Jia Jie Ho <jiajie.ho@starfivetech.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

8323c036

crypto: starfive - Align rsa input data to 32-bit · 6aad7019

Jia Jie Ho authored Jun 26, 2024

Hardware expects RSA input plain/ciphertext to be 32-bit aligned.
Set fixed length for preallocated buffer to the maximum supported
keysize of the hardware and shift input text accordingly.
Signed-off-by: Jia Jie Ho <jiajie.ho@starfivetech.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

6aad7019

crypto: qat - fix unintentional re-enabling of error interrupts · f0622894

Hareshx Sankar Raj authored Jun 25, 2024

The logic that detects pending VF2PF interrupts unintentionally clears
the section of the error mask register(s) not related to VF2PF.
This might cause interrupts unrelated to VF2PF, reported through
errsou3 and errsou5, to be reported again after the execution
of the function disable_pending_vf2pf_interrupts() in dh895xcc
and GEN2 devices.

Fix by updating only section of errmsk3 and errmsk5 related to VF2PF.
Signed-off-by: Hareshx Sankar Raj <hareshx.sankar.raj@intel.com>
Reviewed-by: Damian Muszynski <damian.muszynski@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

f0622894

crypto: qat - extend scope of lock in adf_cfg_add_key_value_param() · 6424da7d

Nivas Varadharajan Mugunthakumar authored Jun 25, 2024

The function adf_cfg_add_key_value_param() attempts to access and modify
the key value store of the driver without locking.

Extend the scope of cfg->lock to avoid a potential race condition.

Fixes: 92bf269f ("crypto: qat - change behaviour of adf_cfg_add_key_value_param()")
Signed-off-by: Nivas Varadharajan Mugunthakumar <nivasx.varadharajan.mugunthakumar@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

6424da7d

Documentation: qat: fix auto_reset attribute details · d26cb4f5

Damian Muszynski authored Jun 25, 2024

The auto_reset attribute was introduced in kernel 6.9. Fix version and
date in documentation.

Fixes: f5419a42 ("crypto: qat - add auto reset on error")
Signed-off-by: Damian Muszynski <damian.muszynski@intel.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

d26cb4f5

crypto: sun8i-ce - add Allwinner H616 support · 1611f749

Andre Przywara authored Jun 25, 2024

The crypto engine in the Allwinner H616 is very similar to the H6, but
needs the base address for the task descriptor and the addresses within
it to be expressed in words, not in bytes.

Add a new variant struct entry for the H616, and set the new flag to
mark the use of 34 bit addresses. Also the internal 32K oscillator is
required for TRNG operation, so specify all four clocks.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Chen-Yu Tsai <wens@csie.org>
Tested-by: Ryan Walklin <ryan@testtoast.com>
Tested-by: Philippe Simons <simons.philippe@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

1611f749

crypto: sun8i-ce - wrap accesses to descriptor address fields · e0740bee

Andre Przywara authored Jun 25, 2024

The Allwinner H616 (and later) SoCs support more than 32 bits worth of
physical addresses. To accommodate the larger address space, the CE task
descriptor fields holding addresses are now encoded as "word addresses",
so take the actual address divided by four.
This is true for the fields within the descriptor, but also for the
descriptor base address, in the CE_TDA register.

Wrap all accesses to those fields in a function, which will do the
required division if needed. For now this in unused, so there should be
no change in behaviour.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

e0740bee

dt-bindings: crypto: sun8i-ce: Add compatible for H616 · 996f8a96

Andre Przywara authored Jun 25, 2024

The Allwinner H616 has a crypto engine very similar to the one in the
H6, although all addresses in the DMA descriptors are shifted by 2 bits,
to accommodate for the larger physical address space. That makes it
incompatible to the H6 variant, and thus requires a new compatible
string. Clock wise it relies on the internal oscillator for the TRNG,
so needs all four possible clocks specified.

Add the compatible string to the list of recognised names, and add the
H616 to list of devices requiring all four clocks.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

996f8a96

28 Jun, 2024 10 commits

hwrng: core - Fix wrong quality calculation at hw rng registration · 95c0f5c3

Harald Freudenberger authored Jun 21, 2024

When there are rng sources registering at the hwrng core via
hwrng_register() a struct hwrng is delivered. There is a quality
field in there which is used to decide which of the registered
hw rng sources will be used by the hwrng core.

With commit 16bdbae3 ("hwrng: core - treat default_quality as
a maximum and default to 1024") there came in a new default of
1024 in case this field is empty and all the known hw rng sources
at that time had been reworked to not fill this field and thus
use the default of 1024.

The code choosing the 'better' hw rng source during registration
of a new hw rng source has never been adapted to this and thus
used 0 if the hw rng implementation does not fill the quality field.
So when two rng sources register, one with 0 (meaning 1024) and
the other one with 999, the 999 hw rng will be chosen.

As the later invoked function hwrng_init() anyway adjusts the
quality field of the hw rng source, this adjustment is now done
during registration of this new hw rng source.

Tested on s390 with two hardware rng sources: crypto cards and
trng true random generator device driver.

Fixes: 16bdbae3 ("hwrng: core - treat default_quality as a maximum and default to 1024")
Reported-by: Christian Rund <Christian.Rund@de.ibm.com>
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

95c0f5c3

hwrng: exynos - Enable Exynos850 support · b0c2036d

Sam Protsenko authored Jun 20, 2024

Add Exynos850 compatible and its driver data. It's only possible to
access TRNG block via SMC calls in Exynos850, so specify that fact using
EXYNOS_SMC flag in the driver data.
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Łukasz Stelmach <l.stelmach@samsung.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

b0c2036d

hwrng: exynos - Add SMC based TRNG operation · 10bb6ac8

Sam Protsenko authored Jun 20, 2024

On some Exynos chips like Exynos850 the access to Security Sub System
(SSS) registers is protected with TrustZone, and therefore only possible
from EL3 monitor software. The Linux kernel is running in EL1, so the
only way for the driver to obtain TRNG data is via SMC calls to EL3
monitor. Implement such SMC operation and use it when EXYNOS_SMC flag is
set in the corresponding chip driver data.
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

10bb6ac8

hwrng: exynos - Implement bus clock control · e003d670

Sam Protsenko authored Jun 20, 2024

Some SoCs like Exynos850 might require the SSS bus clock (PCLK) to be
enabled in order to access TRNG registers. Add and handle the optional
PCLK clock accordingly to make it possible.
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Anand Moon <linux.amoon@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

e003d670

hwrng: exynos - Use devm_clk_get_enabled() to get the clock · 81da8056

Sam Protsenko authored Jun 20, 2024

Use devm_clk_get_enabled() helper instead of calling devm_clk_get() and
then clk_prepare_enable(). It simplifies the error handling and makes
the code more compact. Also use dev_err_probe() to handle possible
-EPROBE_DEFER errors if the clock is not available yet.
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Anand Moon <linux.amoon@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

81da8056

hwrng: exynos - Improve coding style · 76536caa

Sam Protsenko authored Jun 20, 2024

Fix obvious style issues. Some of those were found with checkpatch, and
some just contradict the kernel coding style guide.

No functional change.
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Łukasz Stelmach <l.stelmach@samsung.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

76536caa

dt-bindings: rng: Add Exynos850 support to exynos-trng · 70003f51

Sam Protsenko authored Jun 20, 2024

The TRNG block in Exynos850 is pretty much the same as in Exynos5250,
but there are two clocks that has to be controlled to make it work:
  1. Functional (operating) clock: called ACLK in Exynos850, the same as
     "secss" clock in Exynos5250
  2. Interface (bus) clock: called PCLK in Exynos850. It has to be
     enabled in order to access TRNG registers

Document Exynos850 compatible and the related clock changes.
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

70003f51

crypto: qat - initialize user_input.lock for rate_limiting · ccacbbc3

Jiwei Sun authored Jun 20, 2024

If the following configurations are set,
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_RWSEM_SPIN_ON_OWNER=y

And run the following command,
[root@localhost sys]# cat /sys/devices/pci0000:6b/0000:6b:00.0/qat_rl/pir
The following warning log appears,

------------[ cut here ]------------
DEBUG_RWSEMS_WARN_ON(sem->magic != sem): count = 0x0, magic = 0x0, owner = 0x1, curr 0xff11000119288040, list not empty
WARNING: CPU: 131 PID: 1254984 at kernel/locking/rwsem.c:1280 down_read+0x439/0x7f0
CPU: 131 PID: 1254984 Comm: cat Kdump: loaded Tainted: G        W          6.10.0-rc4+ #86 b2ae60c8ceabed15f4fd2dba03c1c5a5f7f4040c
Hardware name: Lenovo ThinkServer SR660 V3/SR660 V3, BIOS T8E166X-2.54 05/30/2024
RIP: 0010:down_read+0x439/0x7f0
Code: 44 24 10 80 3c 02 00 0f 85 05 03 00 00 48 8b 13 41 54 48 c7 c6 a0 3e 0e b4 48 c7 c7 e0 3e 0e b4 4c 8b 4c 24 08 e8 77 d5 40 fd <0f> 0b 59 e9 bc fc ff ff 0f 1f 44 00 00 e9 e2 fd ff ff 4c 8d 7b 08
RSP: 0018:ffa0000035f67a78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ff1100012b03a658 RCX: 0000000000000000
RDX: 0000000080000002 RSI: 0000000000000008 RDI: 0000000000000001
RBP: 1ff4000006becf53 R08: fff3fc0006becf17 R09: fff3fc0006becf17
R10: fff3fc0006becf16 R11: ffa0000035f678b7 R12: ffffffffb40e3e60
R13: ffffffffb627d1f4 R14: ff1100012b03a6d0 R15: ff1100012b03a6c8
FS:  00007fa9ff9a6740(0000) GS:ff1100081e600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa9ff984000 CR3: 00000002118ae006 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 pir_show+0x5d/0xe0 [intel_qat 9e297e249ab040329cf58b657b06f418fd5c5855]
 dev_attr_show+0x3f/0xc0
 sysfs_kf_seq_show+0x1ce/0x400
 seq_read_iter+0x3fa/0x10b0
 vfs_read+0x6f5/0xb20
 ksys_read+0xe9/0x1d0
 do_syscall_64+0x8a/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa9ff6fd9b2
Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 1d 0c 00 e8 c5 fd 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffc0616b968 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fa9ff6fd9b2
RDX: 0000000000020000 RSI: 00007fa9ff985000 RDI: 0000000000000003
RBP: 00007fa9ff985000 R08: 00007fa9ff984010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000022000
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
 </TASK>
irq event stamp: 0
hardirqs last  enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb102c126>] copy_process+0x21e6/0x6e70
softirqs last  enabled at (0): [<ffffffffb102c176>] copy_process+0x2236/0x6e70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace 0000000000000000 ]---

The rate_limiting->user_input.lock rwsem lock is not initialized before
use. Let's initialize it.
Signed-off-by: Jiwei Sun <sunjw10@lenovo.com>
Reviewed-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

ccacbbc3

crypto: deflate - Add aliases to deflate · e0eece0c

Kyle Meyer authored Jun 17, 2024

iaa_crypto depends on the deflate compression algorithm that's provided
by deflate.

If the algorithm is not available because CRYPTO_DEFLATE=m and deflate
is not inserted, iaa_crypto will request "crypto-deflate-generic".
Deflate will not be inserted because "crypto-deflate-generic" is not a
valid alias.

Add deflate-generic and crypto-deflate-generic aliases to deflate.
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

e0eece0c

crypto: tcrypt - add skcipher speed for given alg · 7b3058eb

Sergey Portnoy authored Jun 17, 2024

Allow to run skcipher speed for given algorithm.
Case 600 is modified to cover ENCRYPT and DECRYPT
directions.

Example:
   modprobe tcrypt mode=600 alg="qat_aes_xts" klen=32

If succeed, the performance numbers will be printed in dmesg:
   testing speed of multibuffer qat_aes_xts (qat_aes_xts) encryption
   test 0 (256 bit key, 16 byte blocks): 1 operation in 14596 cycles (16 bytes)
   ...
   test 6 (256 bit key, 4096 byte blocks): 1 operation in 8053 cycles (4096 bytes)
Signed-off-by: Sergey Portnoy <sergey.portnoy@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

7b3058eb

21 Jun, 2024 9 commits

crypto: arm/crc32 - add kCFI annotations to asm routines · ff33c2e6

Ard Biesheuvel authored Jun 10, 2024

The crc32/crc32c implementations using the scalar CRC32 instructions are
accessed via indirect calls, and so they must be annotated with type ids
in order to execute correctly when kCFI is enabled.

Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

ff33c2e6

crypto: lib - add missing MODULE_DESCRIPTION() macros · 3cbe18b0

Jeff Johnson authored Jun 15, 2024

With ARCH=arm, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in lib/crypto/libsha256.o

Add the missing invocation of the MODULE_DESCRIPTION() macro to all
files which have a MODULE_LICENSE().

This includes sha1.c and utils.c which, although they did not produce
a warning with the arm allmodconfig configuration, may cause this
warning with other configurations.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

3cbe18b0

crypto: arm - add missing MODULE_DESCRIPTION() macros · 70d57ffb

Jeff Johnson authored Jun 15, 2024

With ARCH=arm, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm/crypto/aes-arm-bs.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm/crypto/crc32-arm-ce.o

Add the missing invocation of the MODULE_DESCRIPTION() macro to all
files which have a MODULE_LICENSE().

This includes crct10dif-ce-glue.c and curve25519-glue.c which,
although they did not produce a warning with the arm allmodconfig
configuration, may cause this warning with other configurations.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

70d57ffb

crypto: lib/mpi - Use swap() in mpi_powm() · b44327eb

Jiapeng Chong authored Jun 14, 2024

Use existing swap() function rather than duplicating its implementation.

./lib/crypto/mpi/mpi-pow.c:211:11-12: WARNING opportunity for swap().
./lib/crypto/mpi/mpi-pow.c:239:12-13: WARNING opportunity for swap().
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9327Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

b44327eb

crypto: lib/mpi - Use swap() in mpi_ec_mul_point() · f0da7a23

Jiapeng Chong authored Jun 14, 2024

Use existing swap() function rather than duplicating its implementation.

./lib/crypto/mpi/ec.c:1291:20-21: WARNING opportunity for swap().
./lib/crypto/mpi/ec.c:1292:20-21: WARNING opportunity for swap().
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9328Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

f0da7a23

crypto: arm/poly1305 - add missing MODULE_DESCRIPTION() macro · 5ca95a90

Jeff Johnson authored Jun 13, 2024

With ARCH=arm, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm/crypto/poly1305-arm.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

5ca95a90

hwrng: drivers - add missing Arm & Cavium MODULE_DESCRIPTION() macros · 691eaf1d

Jeff Johnson authored Jun 13, 2024

With ARCH=arm64, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/hw_random/cavium-rng.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/hw_random/cavium-rng-vf.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/hw_random/arm_smccc_trng.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

691eaf1d

crypto: arm64 - add missing MODULE_DESCRIPTION() macros · b568826e

Jeff Johnson authored Jun 12, 2024

With ARCH=arm64, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm64/crypto/crct10dif-ce.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm64/crypto/poly1305-neon.o
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/arm64/crypto/aes-neon-bs.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

b568826e

crypto: qat - make adf_ctl_class constant · a654b354

Greg Kroah-Hartman authored Jun 10, 2024

Now that the driver core allows for struct class to be in read-only
memory, we should make all 'class' structures declared at build time
placing them into read-only memory, instead of having to be dynamically
allocated at runtime.

Cc: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Guerin <adam.guerin@intel.com>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Shashank Gupta <shashank.gupta@intel.com>
Cc: qat-linux@intel.com
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

a654b354

16 Jun, 2024 5 commits

crypto: ecc - Fix off-by-one missing to clear most significant digit · 1dcf865d

Stefan Berger authored Jun 13, 2024

Fix an off-by-one error where the most significant digit was not
initialized leading to signature verification failures by the testmgr.

Example: If a curve requires ndigits (=9) and diff (=2) indicates that
2 digits need to be set to zero then start with digit 'ndigits - diff' (=7)
and clear 'diff' digits starting from there, so 7 and 8.
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Closes: https://lore.kernel.org/linux-crypto/619bc2de-b18a-4939-a652-9ca886bf6349@linux.ibm.com/T/#m045d8812409ce233c17fcdb8b88b6629c671f9f4
Fixes: 2fd2a82c ("crypto: ecdsa - Use ecc_digits_from_bytes to create hash digits array")
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

1dcf865d

crypto: ecc - Add comment to ecc_digits_from_bytes about input byte array · 0eb3bed5

Stefan Berger authored Jun 07, 2024

Add comment to ecc_digits_from_bytes kdoc that the first byte is expected
to hold the most significant bits of the large integer that is converted
into an array of digits.
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

0eb3bed5

hwrng: core - Remove list.h from the hw_random.h · 4604b388

Andy Shevchenko authored Jun 05, 2024

The 'struct list' type is defined in types.h, no need to include list.h
for that.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

4604b388

dt-bindings: rng: meson: add optional power-domains · 293695f1

Neil Armstrong authored Jun 05, 2024

On newer SoCs, the random number generator can require a power-domain to
operate, add it as optional.
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

293695f1

crypto: ccp - Fix null pointer dereference in __sev_snp_shutdown_locked · 468e3295

Kim Phillips authored Jun 04, 2024

Fix a null pointer dereference induced by DEBUG_TEST_DRIVER_REMOVE.
Return from __sev_snp_shutdown_locked() if the psp_device or the
sev_device structs are not initialized. Without the fix, the driver will
produce the following splat:

   ccp 0000:55:00.5: enabling device (0000 -> 0002)
   ccp 0000:55:00.5: sev enabled
   ccp 0000:55:00.5: psp enabled
   BUG: kernel NULL pointer dereference, address: 00000000000000f0
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x0000) - not-present page
   PGD 0 P4D 0
   Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
   CPU: 262 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc1+ #29
   RIP: 0010:__sev_snp_shutdown_locked+0x2e/0x150
   Code: 00 55 48 89 e5 41 57 41 56 41 54 53 48 83 ec 10 41 89 f7 49 89 fe 65 48 8b 04 25 28 00 00 00 48 89 45 d8 48 8b 05 6a 5a 7f 06 <4c> 8b a0 f0 00 00 00 41 0f b6 9c 24 a2 00 00 00 48 83 fb 02 0f 83
   RSP: 0018:ffffb2ea4014b7b8 EFLAGS: 00010286
   RAX: 0000000000000000 RBX: ffff9e4acd2e0a28 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb2ea4014b808
   RBP: ffffb2ea4014b7e8 R08: 0000000000000106 R09: 000000000003d9c0
   R10: 0000000000000001 R11: ffffffffa39ff070 R12: ffff9e49d40590c8
   R13: 0000000000000000 R14: ffffb2ea4014b808 R15: 0000000000000000
   FS:  0000000000000000(0000) GS:ffff9e58b1e00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00000000000000f0 CR3: 0000000418a3e001 CR4: 0000000000770ef0
   PKRU: 55555554
   Call Trace:
    <TASK>
    ? __die_body+0x6f/0xb0
    ? __die+0xcc/0xf0
    ? page_fault_oops+0x330/0x3a0
    ? save_trace+0x2a5/0x360
    ? do_user_addr_fault+0x583/0x630
    ? exc_page_fault+0x81/0x120
    ? asm_exc_page_fault+0x2b/0x30
    ? __sev_snp_shutdown_locked+0x2e/0x150
    __sev_firmware_shutdown+0x349/0x5b0
    ? pm_runtime_barrier+0x66/0xe0
    sev_dev_destroy+0x34/0xb0
    psp_dev_destroy+0x27/0x60
    sp_destroy+0x39/0x90
    sp_pci_remove+0x22/0x60
    pci_device_remove+0x4e/0x110
    really_probe+0x271/0x4e0
    __driver_probe_device+0x8f/0x160
    driver_probe_device+0x24/0x120
    __driver_attach+0xc7/0x280
    ? driver_attach+0x30/0x30
    bus_for_each_dev+0x10d/0x130
    driver_attach+0x22/0x30
    bus_add_driver+0x171/0x2b0
    ? unaccepted_memory_init_kdump+0x20/0x20
    driver_register+0x67/0x100
    __pci_register_driver+0x83/0x90
    sp_pci_init+0x22/0x30
    sp_mod_init+0x13/0x30
    do_one_initcall+0xb8/0x290
    ? sched_clock_noinstr+0xd/0x10
    ? local_clock_noinstr+0x3e/0x100
    ? stack_depot_save_flags+0x21e/0x6a0
    ? local_clock+0x1c/0x60
    ? stack_depot_save_flags+0x21e/0x6a0
    ? sched_clock_noinstr+0xd/0x10
    ? local_clock_noinstr+0x3e/0x100
    ? __lock_acquire+0xd90/0xe30
    ? sched_clock_noinstr+0xd/0x10
    ? local_clock_noinstr+0x3e/0x100
    ? __create_object+0x66/0x100
    ? local_clock+0x1c/0x60
    ? __create_object+0x66/0x100
    ? parameq+0x1b/0x90
    ? parse_one+0x6d/0x1d0
    ? parse_args+0xd7/0x1f0
    ? do_initcall_level+0x180/0x180
    do_initcall_level+0xb0/0x180
    do_initcalls+0x60/0xa0
    ? kernel_init+0x1f/0x1d0
    do_basic_setup+0x41/0x50
    kernel_init_freeable+0x1ac/0x230
    ? rest_init+0x1f0/0x1f0
    kernel_init+0x1f/0x1d0
    ? rest_init+0x1f0/0x1f0
    ret_from_fork+0x3d/0x50
    ? rest_init+0x1f0/0x1f0
    ret_from_fork_asm+0x11/0x20
    </TASK>
   Modules linked in:
   CR2: 00000000000000f0
   ---[ end trace 0000000000000000 ]---
   RIP: 0010:__sev_snp_shutdown_locked+0x2e/0x150
   Code: 00 55 48 89 e5 41 57 41 56 41 54 53 48 83 ec 10 41 89 f7 49 89 fe 65 48 8b 04 25 28 00 00 00 48 89 45 d8 48 8b 05 6a 5a 7f 06 <4c> 8b a0 f0 00 00 00 41 0f b6 9c 24 a2 00 00 00 48 83 fb 02 0f 83
   RSP: 0018:ffffb2ea4014b7b8 EFLAGS: 00010286
   RAX: 0000000000000000 RBX: ffff9e4acd2e0a28 RCX: 0000000000000000
   RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb2ea4014b808
   RBP: ffffb2ea4014b7e8 R08: 0000000000000106 R09: 000000000003d9c0
   R10: 0000000000000001 R11: ffffffffa39ff070 R12: ffff9e49d40590c8
   R13: 0000000000000000 R14: ffffb2ea4014b808 R15: 0000000000000000
   FS:  0000000000000000(0000) GS:ffff9e58b1e00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00000000000000f0 CR3: 0000000418a3e001 CR4: 0000000000770ef0
   PKRU: 55555554
   Kernel panic - not syncing: Fatal exception
   Kernel Offset: 0x1fc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Fixes: 1ca5614b ("crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP")
Cc: stable@vger.kernel.org
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: John Allen <john.allen@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

468e3295

07 Jun, 2024 8 commits

hwrng: omap - add missing MODULE_DESCRIPTION() macro · 6d4e1993

Jeff Johnson authored Jun 03, 2024

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/hw_random/omap-rng.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/char/hw_random/omap3-rom-rng.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

6d4e1993

crypto: xilinx - add missing MODULE_DESCRIPTION() macro · ed6261d5

Jeff Johnson authored Jun 02, 2024

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/xilinx/zynqmp-aes-gcm.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Michal Simek <michal.simek@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

ed6261d5

crypto: sa2ul - add missing MODULE_DESCRIPTION() macro · c8edb3cc

Jeff Johnson authored Jun 02, 2024

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/sa2ul.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

c8edb3cc

crypto: keembay - add missing MODULE_DESCRIPTION() macro · f2cbb746

Jeff Johnson authored Jun 02, 2024

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/intel/keembay/keembay-ocs-hcu.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

f2cbb746

crypto: atmel-sha204a - add missing MODULE_DESCRIPTION() macro · 3aa461e3

Jeff Johnson authored Jun 02, 2024

make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/crypto/atmel-sha204a.o

Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

3aa461e3

crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM · e6e758fa

Eric Biggers authored Jun 02, 2024

Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations.  This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.

The following summarizes the state before this patch:

- The aesni-intel module registered algorithms "generic-gcm-aesni" and
  "rfc4106-gcm-aesni" with the crypto API that actually delegated to one
  of three underlying implementations according to the CPU capabilities
  detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.

- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
  aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
  257 KB of binary.  This massive binary size was not really
  appropriate, and depending on the kconfig it could take up over 1% the
  size of the entire vmlinux.  The main loops did 8 blocks per
  iteration.  The AVX code minimized the use of carryless multiplication
  whereas the AVX2 code did not.  The "AVX2" code did not actually use
  AVX2; the check for AVX2 was really a check for Intel Haswell or later
  to detect support for fast carryless multiplication.  The long source
  length was caused by factors such as significant code duplication.

- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
  of 1501 lines of source and 15 KB of binary.  The main loops did 4
  blocks per iteration and minimized the use of carryless multiplication
  by using Karatsuba multiplication and a multiplication-less reduction.

- The assembly code was contributed in 2010-2013.  Maintenance has been
  sporadic and most design choices haven't been revisited.

- The assembly function prototypes and the corresponding glue code were
  separate from and were not consistent with the new VAES-AVX10 code I
  recently added.  The older code had several issues such as not
  precomputing the GHASH key powers, which hurt performance.

This rewrite achieves the following goals:

- Much shorter source and binary sizes.  The assembly source shrinks
  from 4300 lines to 1130 lines, and it produces about 9 KB of binary
  instead of 272 KB.  This is achieved via a better designed AES-GCM
  implementation that doesn't excessively unroll the code and instead
  prioritizes the parts that really matter.  Sharing the C glue code
  with the VAES-AVX10 implementations also saves 250 lines of C source.

- Improve performance on most (possibly all) CPUs on which this code
  runs, for most (possibly all) message lengths.  Benchmark results are
  given in Tables 1 and 2 below.

- Use the same function prototypes and glue code as the new VAES-AVX10
  algorithms.  This fixes some issues with the integration of the
  assembly and results in some significant performance improvements,
  primarily on short messages.  Also, the AVX and non-AVX
  implementations are now registered as separate algorithms with the
  crypto API, which makes them both testable by the self-tests.

- Keep support for AES-NI without AVX (for Westmere, Silvermont,
  Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
  Since 256-bit vectors cannot be used without VAES anyway, this is made
  feasible by just using the non-VEX coded form of most instructions.

- Use a unified approach where the main loop does 8 blocks per iteration
  and uses Karatsuba multiplication to save one pclmulqdq per block but
  does not use the multiplication-less reduction.  This strikes a good
  balance across the range of CPUs on which this code runs.

- Don't spam the kernel log with an informational message on every boot.

The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                   |   300 |   200 |    64 |    63 |    16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.

I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

e6e758fa

crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM · b06affb1

Eric Biggers authored Jun 02, 2024

Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10.  There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors.  This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.

I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source.  The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code.  The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved.  Any remainder is handled
using a simple 1 vector at a time loop, with masking.

Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
and well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux).  Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues.  So I ended up starting from scratch.  Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.

To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.

The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.

The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine.  I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.

I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.

Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |

Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |

So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark.  (This also holds true when doing
the same tests on AMD Zen 4.)  It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate.  The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

b06affb1

crypto: hisilicon/zip - optimize the address offset of the reg query function · c17b56d9

Chenghai Huang authored Jun 01, 2024

Currently, the reg is queried based on the fixed address offset
array. When the number of accelerator cores changes, the system
can not flexibly respond to the change.

Therefore, the reg to be queried is calculated based on the
comp or decomp core base address.
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

c17b56d9