Commit b71acb0e authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
 "API:
   - Add 1472-byte test to tcrypt for IPsec
   - Reintroduced crypto stats interface with numerous changes
   - Support incremental algorithm dumps

  Algorithms:
   - Add xchacha12/20
   - Add nhpoly1305
   - Add adiantum
   - Add streebog hash
   - Mark cts(cbc(aes)) as FIPS allowed

  Drivers:
   - Improve performance of arm64/chacha20
   - Improve performance of x86/chacha20
   - Add NEON-accelerated nhpoly1305
   - Add SSE2 accelerated nhpoly1305
   - Add AVX2 accelerated nhpoly1305
   - Add support for 192/256-bit keys in gcmaes AVX
   - Add SG support in gcmaes AVX
   - ESN for inline IPsec tx in chcr
   - Add support for CryptoCell 703 in ccree
   - Add support for CryptoCell 713 in ccree
   - Add SM4 support in ccree
   - Add SM3 support in ccree
   - Add support for chacha20 in caam/qi2
   - Add support for chacha20 + poly1305 in caam/jr
   - Add support for chacha20 + poly1305 in caam/qi2
   - Add AEAD cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (130 commits)
  crypto: skcipher - remove remnants of internal IV generators
  crypto: cavium/nitrox - Fix build with !CONFIG_DEBUG_FS
  crypto: salsa20-generic - don't unnecessarily use atomic walk
  crypto: skcipher - add might_sleep() to skcipher_walk_virt()
  crypto: x86/chacha - avoid sleeping under kernel_fpu_begin()
  crypto: cavium/nitrox - Added AEAD cipher support
  crypto: mxc-scc - fix build warnings on ARM64
  crypto: api - document missing stats member
  crypto: user - remove unused dump functions
  crypto: chelsio - Fix wrong error counter increments
  crypto: chelsio - Reset counters on cxgb4 Detach
  crypto: chelsio - Handle PCI shutdown event
  crypto: chelsio - cleanup:send addr as value in function argument
  crypto: chelsio - Use same value for both channel in single WR
  crypto: chelsio - Swap location of AAD and IV sent in WR
  crypto: chelsio - remove set but not used variable 'kctx_len'
  crypto: ux500 - Use proper enum in hash_set_dma_transfer
  crypto: ux500 - Use proper enum in cryp_set_dma_transfer
  crypto: aesni - Add scatter/gather avx stubs, and use them in C
  crypto: aesni - Introduce partial block macro
  ..
parents e0c38a4d c79b411e
Programming Interface
=====================
Please note that the kernel crypto API contains the AEAD givcrypt API
(crypto_aead_giv\* and aead_givcrypt\* function calls in
include/crypto/aead.h). This API is obsolete and will be removed in the
future. To obtain the functionality of an AEAD cipher with internal IV
generation, use the IV generator as a regular cipher. For example,
rfc4106(gcm(aes)) is the AEAD cipher with external IV generation and
seqniv(rfc4106(gcm(aes))) implies that the kernel crypto API generates
the IV. Different IV generators are available.
.. class:: toc-title
Table of contents
......
......@@ -157,10 +157,6 @@ applicable to a cipher, it is not displayed:
- rng for random number generator
- givcipher for cipher with associated IV generator (see the geniv
entry below for the specification of the IV generator type used by
the cipher implementation)
- kpp for a Key-agreement Protocol Primitive (KPP) cipher such as
an ECDH or DH implementation
......@@ -174,16 +170,7 @@ applicable to a cipher, it is not displayed:
- digestsize: output size of the message digest
- geniv: IV generation type:
- eseqiv for encrypted sequence number based IV generation
- seqiv for sequence number based IV generation
- chainiv for chain iv generation
- <builtin> is a marker that the cipher implements IV generation and
handling as it is specific to the given cipher
- geniv: IV generator (obsolete)
Key Sizes
---------
......@@ -218,10 +205,6 @@ the aforementioned cipher types:
- CRYPTO_ALG_TYPE_ABLKCIPHER Asynchronous multi-block cipher
- CRYPTO_ALG_TYPE_GIVCIPHER Asynchronous multi-block cipher packed
together with an IV generator (see geniv field in the /proc/crypto
listing for the known IV generators)
- CRYPTO_ALG_TYPE_KPP Key-agreement Protocol Primitive (KPP) such as
an ECDH or DH implementation
......@@ -338,18 +321,14 @@ uses the API applicable to the cipher type specified for the block.
The following call sequence is applicable when the IPSEC layer triggers
an encryption operation with the esp_output function. During
configuration, the administrator set up the use of rfc4106(gcm(aes)) as
the cipher for ESP. The following call sequence is now depicted in the
ASCII art above:
configuration, the administrator set up the use of seqiv(rfc4106(gcm(aes)))
as the cipher for ESP. The following call sequence is now depicted in
the ASCII art above:
1. esp_output() invokes crypto_aead_encrypt() to trigger an
encryption operation of the AEAD cipher with IV generator.
In case of GCM, the SEQIV implementation is registered as GIVCIPHER
in crypto_rfc4106_alloc().
The SEQIV performs its operation to generate an IV where the core
function is seqiv_geniv().
The SEQIV generates the IV.
2. Now, SEQIV uses the AEAD API function calls to invoke the associated
AEAD cipher. In our case, during the instantiation of SEQIV, the
......
Arm TrustZone CryptoCell cryptographic engine
Required properties:
- compatible: Should be one of: "arm,cryptocell-712-ree",
"arm,cryptocell-710-ree" or "arm,cryptocell-630p-ree".
- compatible: Should be one of -
"arm,cryptocell-713-ree"
"arm,cryptocell-703-ree"
"arm,cryptocell-712-ree"
"arm,cryptocell-710-ree"
"arm,cryptocell-630p-ree"
- reg: Base physical address of the engine and length of memory mapped region.
- interrupts: Interrupt number for the device.
......
......@@ -6,6 +6,8 @@ Required properties:
- interrupts : Should contain MXS DCP interrupt numbers, VMI IRQ and DCP IRQ
must be supplied, optionally Secure IRQ can be present, but
is currently not implemented and not used.
- clocks : Clock reference (only required on some SOCs: 6ull and 6sll).
- clock-names : Must be "dcp".
Example:
......
......@@ -3484,6 +3484,7 @@ F: include/linux/spi/cc2520.h
F: Documentation/devicetree/bindings/net/ieee802154/cc2520.txt
CCREE ARM TRUSTZONE CRYPTOCELL REE DRIVER
M: Yael Chemla <yael.chemla@foss.arm.com>
M: Gilad Ben-Yossef <gilad@benyossef.com>
L: linux-crypto@vger.kernel.org
S: Supported
......@@ -7147,7 +7148,9 @@ F: crypto/842.c
F: lib/842/
IBM Power in-Nest Crypto Acceleration
M: Paulo Flabiano Smorigo <pfsmorigo@linux.ibm.com>
M: Breno Leitão <leitao@debian.org>
M: Nayna Jain <nayna@linux.ibm.com>
M: Paulo Flabiano Smorigo <pfsmorigo@gmail.com>
L: linux-crypto@vger.kernel.org
S: Supported
F: drivers/crypto/nx/Makefile
......@@ -7211,7 +7214,9 @@ S: Supported
F: drivers/scsi/ibmvscsi_tgt/
IBM Power VMX Cryptographic instructions
M: Paulo Flabiano Smorigo <pfsmorigo@linux.ibm.com>
M: Breno Leitão <leitao@debian.org>
M: Nayna Jain <nayna@linux.ibm.com>
M: Paulo Flabiano Smorigo <pfsmorigo@gmail.com>
L: linux-crypto@vger.kernel.org
S: Supported
F: drivers/crypto/vmx/Makefile
......
......@@ -69,6 +69,15 @@ config CRYPTO_AES_ARM
help
Use optimized AES assembler routines for ARM platforms.
On ARM processors without the Crypto Extensions, this is the
fastest AES implementation for single blocks. For multiple
blocks, the NEON bit-sliced implementation is usually faster.
This implementation may be vulnerable to cache timing attacks,
since it uses lookup tables. However, as countermeasures it
disables IRQs and preloads the tables; it is hoped this makes
such attacks very difficult.
config CRYPTO_AES_ARM_BS
tristate "Bit sliced AES using NEON instructions"
depends on KERNEL_MODE_NEON
......@@ -117,9 +126,14 @@ config CRYPTO_CRC32_ARM_CE
select CRYPTO_HASH
config CRYPTO_CHACHA20_NEON
tristate "NEON accelerated ChaCha20 symmetric cipher"
tristate "NEON accelerated ChaCha stream cipher algorithms"
depends on KERNEL_MODE_NEON
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
config CRYPTO_NHPOLY1305_NEON
tristate "NEON accelerated NHPoly1305 hash function (for Adiantum)"
depends on KERNEL_MODE_NEON
select CRYPTO_NHPOLY1305
endif
......@@ -9,7 +9,8 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
ce-obj-$(CONFIG_CRYPTO_SHA1_ARM_CE) += sha1-arm-ce.o
......@@ -52,7 +53,8 @@ aes-arm-ce-y := aes-ce-core.o aes-ce-glue.o
ghash-arm-ce-y := ghash-ce-core.o ghash-ce-glue.o
crct10dif-arm-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o
crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
ifdef REGENERATE_ARM_CRYPTO
quiet_cmd_perl = PERL $@
......
......@@ -10,7 +10,6 @@
#include <asm/hwcap.h>
#include <asm/neon.h>
#include <asm/hwcap.h>
#include <crypto/aes.h>
#include <crypto/internal/simd.h>
#include <crypto/internal/skcipher.h>
......
......@@ -10,6 +10,7 @@
*/
#include <linux/linkage.h>
#include <asm/assembler.h>
#include <asm/cache.h>
.text
......@@ -41,7 +42,7 @@
.endif
.endm
.macro __hround, out0, out1, in0, in1, in2, in3, t3, t4, enc, sz, op
.macro __hround, out0, out1, in0, in1, in2, in3, t3, t4, enc, sz, op, oldcpsr
__select \out0, \in0, 0
__select t0, \in1, 1
__load \out0, \out0, 0, \sz, \op
......@@ -73,6 +74,14 @@
__load t0, t0, 3, \sz, \op
__load \t4, \t4, 3, \sz, \op
.ifnb \oldcpsr
/*
* This is the final round and we're done with all data-dependent table
* lookups, so we can safely re-enable interrupts.
*/
restore_irqs \oldcpsr
.endif
eor \out1, \out1, t1, ror #24
eor \out0, \out0, t2, ror #16
ldm rk!, {t1, t2}
......@@ -83,14 +92,14 @@
eor \out1, \out1, t2
.endm
.macro fround, out0, out1, out2, out3, in0, in1, in2, in3, sz=2, op
.macro fround, out0, out1, out2, out3, in0, in1, in2, in3, sz=2, op, oldcpsr
__hround \out0, \out1, \in0, \in1, \in2, \in3, \out2, \out3, 1, \sz, \op
__hround \out2, \out3, \in2, \in3, \in0, \in1, \in1, \in2, 1, \sz, \op
__hround \out2, \out3, \in2, \in3, \in0, \in1, \in1, \in2, 1, \sz, \op, \oldcpsr
.endm
.macro iround, out0, out1, out2, out3, in0, in1, in2, in3, sz=2, op
.macro iround, out0, out1, out2, out3, in0, in1, in2, in3, sz=2, op, oldcpsr
__hround \out0, \out1, \in0, \in3, \in2, \in1, \out2, \out3, 0, \sz, \op
__hround \out2, \out3, \in2, \in1, \in0, \in3, \in1, \in0, 0, \sz, \op
__hround \out2, \out3, \in2, \in1, \in0, \in3, \in1, \in0, 0, \sz, \op, \oldcpsr
.endm
.macro __rev, out, in
......@@ -118,13 +127,14 @@
.macro do_crypt, round, ttab, ltab, bsz
push {r3-r11, lr}
// Load keys first, to reduce latency in case they're not cached yet.
ldm rk!, {r8-r11}
ldr r4, [in]
ldr r5, [in, #4]
ldr r6, [in, #8]
ldr r7, [in, #12]
ldm rk!, {r8-r11}
#ifdef CONFIG_CPU_BIG_ENDIAN
__rev r4, r4
__rev r5, r5
......@@ -138,6 +148,25 @@
eor r7, r7, r11
__adrl ttab, \ttab
/*
* Disable interrupts and prefetch the 1024-byte 'ft' or 'it' table into
* L1 cache, assuming cacheline size >= 32. This is a hardening measure
* intended to make cache-timing attacks more difficult. They may not
* be fully prevented, however; see the paper
* https://cr.yp.to/antiforgery/cachetiming-20050414.pdf
* ("Cache-timing attacks on AES") for a discussion of the many
* difficulties involved in writing truly constant-time AES software.
*/
save_and_disable_irqs t0
.set i, 0
.rept 1024 / 128
ldr r8, [ttab, #i + 0]
ldr r9, [ttab, #i + 32]
ldr r10, [ttab, #i + 64]
ldr r11, [ttab, #i + 96]
.set i, i + 128
.endr
push {t0} // oldcpsr
tst rounds, #2
bne 1f
......@@ -151,8 +180,21 @@
\round r4, r5, r6, r7, r8, r9, r10, r11
b 0b
2: __adrl ttab, \ltab
\round r4, r5, r6, r7, r8, r9, r10, r11, \bsz, b
2: .ifb \ltab
add ttab, ttab, #1
.else
__adrl ttab, \ltab
// Prefetch inverse S-box for final round; see explanation above
.set i, 0
.rept 256 / 64
ldr t0, [ttab, #i + 0]
ldr t1, [ttab, #i + 32]
.set i, i + 64
.endr
.endif
pop {rounds} // oldcpsr
\round r4, r5, r6, r7, r8, r9, r10, r11, \bsz, b, rounds
#ifdef CONFIG_CPU_BIG_ENDIAN
__rev r4, r4
......@@ -175,7 +217,7 @@
.endm
ENTRY(__aes_arm_encrypt)
do_crypt fround, crypto_ft_tab, crypto_ft_tab + 1, 2
do_crypt fround, crypto_ft_tab,, 2
ENDPROC(__aes_arm_encrypt)
.align 5
......
/*
* ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions
* ChaCha/XChaCha NEON helper functions
*
* Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*
......@@ -27,9 +27,9 @@
* (d) vtbl.8 + vtbl.8 (multiple of 8 bits rotations only,
* needs index vector)
*
* ChaCha20 has 16, 12, 8, and 7-bit rotations. For the 12 and 7-bit
* rotations, the only choices are (a) and (b). We use (a) since it takes
* two-thirds the cycles of (b) on both Cortex-A7 and Cortex-A53.
* ChaCha has 16, 12, 8, and 7-bit rotations. For the 12 and 7-bit rotations,
* the only choices are (a) and (b). We use (a) since it takes two-thirds the
* cycles of (b) on both Cortex-A7 and Cortex-A53.
*
* For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
* and doesn't need a temporary register.
......@@ -52,30 +52,20 @@
.fpu neon
.align 5
ENTRY(chacha20_block_xor_neon)
// r0: Input state matrix, s
// r1: 1 data block output, o
// r2: 1 data block input, i
//
// This function encrypts one ChaCha20 block by loading the state matrix
// in four NEON registers. It performs matrix operation on four words in
// parallel, but requireds shuffling to rearrange the words after each
// round.
//
// x0..3 = s0..3
add ip, r0, #0x20
vld1.32 {q0-q1}, [r0]
vld1.32 {q2-q3}, [ip]
vmov q8, q0
vmov q9, q1
vmov q10, q2
vmov q11, q3
/*
* chacha_permute - permute one block
*
* Permute one 64-byte block where the state matrix is stored in the four NEON
* registers q0-q3. It performs matrix operations on four words in parallel,
* but requires shuffling to rearrange the words after each round.
*
* The round count is given in r3.
*
* Clobbers: r3, ip, q4-q5
*/
chacha_permute:
adr ip, .Lrol8_table
mov r3, #10
vld1.8 {d10}, [ip, :64]
.Ldoubleround:
......@@ -139,9 +129,31 @@ ENTRY(chacha20_block_xor_neon)
// x3 = shuffle32(x3, MASK(0, 3, 2, 1))
vext.8 q3, q3, q3, #4
subs r3, r3, #1
subs r3, r3, #2
bne .Ldoubleround
bx lr
ENDPROC(chacha_permute)
ENTRY(chacha_block_xor_neon)
// r0: Input state matrix, s
// r1: 1 data block output, o
// r2: 1 data block input, i
// r3: nrounds
push {lr}
// x0..3 = s0..3
add ip, r0, #0x20
vld1.32 {q0-q1}, [r0]
vld1.32 {q2-q3}, [ip]
vmov q8, q0
vmov q9, q1
vmov q10, q2
vmov q11, q3
bl chacha_permute
add ip, r2, #0x20
vld1.8 {q4-q5}, [r2]
vld1.8 {q6-q7}, [ip]
......@@ -166,15 +178,33 @@ ENTRY(chacha20_block_xor_neon)
vst1.8 {q0-q1}, [r1]
vst1.8 {q2-q3}, [ip]
bx lr
ENDPROC(chacha20_block_xor_neon)
pop {pc}
ENDPROC(chacha_block_xor_neon)
ENTRY(hchacha_block_neon)
// r0: Input state matrix, s
// r1: output (8 32-bit words)
// r2: nrounds
push {lr}
vld1.32 {q0-q1}, [r0]!
vld1.32 {q2-q3}, [r0]
mov r3, r2
bl chacha_permute
vst1.32 {q0}, [r1]!
vst1.32 {q3}, [r1]
pop {pc}
ENDPROC(hchacha_block_neon)
.align 4
.Lctrinc: .word 0, 1, 2, 3
.Lrol8_table: .byte 3, 0, 1, 2, 7, 4, 5, 6
.align 5
ENTRY(chacha20_4block_xor_neon)
ENTRY(chacha_4block_xor_neon)
push {r4-r5}
mov r4, sp // preserve the stack pointer
sub ip, sp, #0x20 // allocate a 32 byte buffer
......@@ -184,9 +214,10 @@ ENTRY(chacha20_4block_xor_neon)
// r0: Input state matrix, s
// r1: 4 data blocks output, o
// r2: 4 data blocks input, i
// r3: nrounds
//
// This function encrypts four consecutive ChaCha20 blocks by loading
// This function encrypts four consecutive ChaCha blocks by loading
// the state matrix in NEON registers four times. The algorithm performs
// each operation on the corresponding word of each state matrix, hence
// requires no word shuffling. The words are re-interleaved before the
......@@ -219,7 +250,6 @@ ENTRY(chacha20_4block_xor_neon)
vdup.32 q0, d0[0]
adr ip, .Lrol8_table
mov r3, #10
b 1f
.Ldoubleround4:
......@@ -417,7 +447,7 @@ ENTRY(chacha20_4block_xor_neon)
vsri.u32 q5, q8, #25
vsri.u32 q6, q9, #25
subs r3, r3, #1
subs r3, r3, #2
bne .Ldoubleround4
// x0..7[0-3] are in q0-q7, x10..15[0-3] are in q10-q15.
......@@ -527,4 +557,4 @@ ENTRY(chacha20_4block_xor_neon)
pop {r4-r5}
bx lr
ENDPROC(chacha20_4block_xor_neon)
ENDPROC(chacha_4block_xor_neon)
/*
* ARM NEON accelerated ChaCha and XChaCha stream ciphers,
* including ChaCha20 (RFC7539)
*
* Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
* Based on:
* ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
*
* Copyright (C) 2015 Martin Willi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*/
#include <crypto/algapi.h>
#include <crypto/chacha.h>
#include <crypto/internal/skcipher.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/hwcap.h>
#include <asm/neon.h>
#include <asm/simd.h>
asmlinkage void chacha_block_xor_neon(const u32 *state, u8 *dst, const u8 *src,
int nrounds);
asmlinkage void chacha_4block_xor_neon(const u32 *state, u8 *dst, const u8 *src,
int nrounds);
asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes, int nrounds)
{
u8 buf[CHACHA_BLOCK_SIZE];
while (bytes >= CHACHA_BLOCK_SIZE * 4) {
chacha_4block_xor_neon(state, dst, src, nrounds);
bytes -= CHACHA_BLOCK_SIZE * 4;
src += CHACHA_BLOCK_SIZE * 4;
dst += CHACHA_BLOCK_SIZE * 4;
state[12] += 4;
}
while (bytes >= CHACHA_BLOCK_SIZE) {
chacha_block_xor_neon(state, dst, src, nrounds);
bytes -= CHACHA_BLOCK_SIZE;
src += CHACHA_BLOCK_SIZE;
dst += CHACHA_BLOCK_SIZE;
state[12]++;
}
if (bytes) {
memcpy(buf, src, bytes);
chacha_block_xor_neon(state, buf, buf, nrounds);
memcpy(dst, buf, bytes);
}
}
static int chacha_neon_stream_xor(struct skcipher_request *req,
struct chacha_ctx *ctx, u8 *iv)
{
struct skcipher_walk walk;
u32 state[16];
int err;
err = skcipher_walk_virt(&walk, req, false);
crypto_chacha_init(state, ctx, iv);
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
if (nbytes < walk.total)
nbytes = round_down(nbytes, walk.stride);
kernel_neon_begin();
chacha_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
nbytes, ctx->nrounds);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
return err;
}
static int chacha_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
return crypto_chacha_crypt(req);
return chacha_neon_stream_xor(req, ctx, req->iv);
}
static int xchacha_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
struct chacha_ctx subctx;
u32 state[16];
u8 real_iv[16];
if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
return crypto_xchacha_crypt(req);
crypto_chacha_init(state, ctx, req->iv);
kernel_neon_begin();
hchacha_block_neon(state, subctx.key, ctx->nrounds);
kernel_neon_end();
subctx.nrounds = ctx->nrounds;
memcpy(&real_iv[0], req->iv + 24, 8);
memcpy(&real_iv[8], req->iv + 16, 8);
return chacha_neon_stream_xor(req, &subctx, real_iv);
}
static struct skcipher_alg algs[] = {
{
.base.cra_name = "chacha20",
.base.cra_driver_name = "chacha20-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA_KEY_SIZE,
.max_keysize = CHACHA_KEY_SIZE,
.ivsize = CHACHA_IV_SIZE,
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 4 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
.encrypt = chacha_neon,
.decrypt = chacha_neon,
}, {
.base.cra_name = "xchacha20",
.base.cra_driver_name = "xchacha20-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA_KEY_SIZE,
.max_keysize = CHACHA_KEY_SIZE,
.ivsize = XCHACHA_IV_SIZE,
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 4 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
.encrypt = xchacha_neon,
.decrypt = xchacha_neon,
}, {
.base.cra_name = "xchacha12",
.base.cra_driver_name = "xchacha12-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA_KEY_SIZE,
.max_keysize = CHACHA_KEY_SIZE,
.ivsize = XCHACHA_IV_SIZE,
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 4 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha12_setkey,
.encrypt = xchacha_neon,
.decrypt = xchacha_neon,
}
};
static int __init chacha_simd_mod_init(void)
{
if (!(elf_hwcap & HWCAP_NEON))
return -ENODEV;
return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
}
static void __exit chacha_simd_mod_fini(void)
{
crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
}
module_init(chacha_simd_mod_init);
module_exit(chacha_simd_mod_fini);
MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (NEON accelerated)");
MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_CRYPTO("chacha20");
MODULE_ALIAS_CRYPTO("chacha20-neon");
MODULE_ALIAS_CRYPTO("xchacha20");
MODULE_ALIAS_CRYPTO("xchacha20-neon");
MODULE_ALIAS_CRYPTO("xchacha12");
MODULE_ALIAS_CRYPTO("xchacha12-neon");
/*
* ChaCha20 256-bit cipher algorithm, RFC7539, ARM NEON functions
*
* Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
* Based on:
* ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
*
* Copyright (C) 2015 Martin Willi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*/
#include <crypto/algapi.h>
#include <crypto/chacha20.h>
#include <crypto/internal/skcipher.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/hwcap.h>
#include <asm/neon.h>
#include <asm/simd.h>
asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
{
u8 buf[CHACHA20_BLOCK_SIZE];
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
chacha20_4block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE * 4;
src += CHACHA20_BLOCK_SIZE * 4;
dst += CHACHA20_BLOCK_SIZE * 4;
state[12] += 4;
}
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
src += CHACHA20_BLOCK_SIZE;
dst += CHACHA20_BLOCK_SIZE;
state[12]++;
}
if (bytes) {
memcpy(buf, src, bytes);
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
}
static int chacha20_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
u32 state[16];
int err;
if (req->cryptlen <= CHACHA20_BLOCK_SIZE || !may_use_simd())
return crypto_chacha20_crypt(req);
err = skcipher_walk_virt(&walk, req, true);
crypto_chacha20_init(state, ctx, walk.iv);
kernel_neon_begin();
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
if (nbytes < walk.total)
nbytes = round_down(nbytes, walk.stride);
chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
kernel_neon_end();
return err;
}
static struct skcipher_alg alg = {
.base.cra_name = "chacha20",
.base.cra_driver_name = "chacha20-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha20_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA20_KEY_SIZE,
.max_keysize = CHACHA20_KEY_SIZE,
.ivsize = CHACHA20_IV_SIZE,
.chunksize = CHACHA20_BLOCK_SIZE,
.walksize = 4 * CHACHA20_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
.encrypt = chacha20_neon,
.decrypt = chacha20_neon,
};
static int __init chacha20_simd_mod_init(void)
{
if (!(elf_hwcap & HWCAP_NEON))
return -ENODEV;
return crypto_register_skcipher(&alg);
}
static void __exit chacha20_simd_mod_fini(void)
{
crypto_unregister_skcipher(&alg);
}
module_init(chacha20_simd_mod_init);
module_exit(chacha20_simd_mod_fini);
MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_CRYPTO("chacha20");
/* SPDX-License-Identifier: GPL-2.0 */
/*
* NH - ε-almost-universal hash function, NEON accelerated version
*
* Copyright 2018 Google LLC
*
* Author: Eric Biggers <ebiggers@google.com>
*/
#include <linux/linkage.h>
.text
.fpu neon
KEY .req r0
MESSAGE .req r1
MESSAGE_LEN .req r2
HASH .req r3
PASS0_SUMS .req q0
PASS0_SUM_A .req d0
PASS0_SUM_B .req d1
PASS1_SUMS .req q1
PASS1_SUM_A .req d2
PASS1_SUM_B .req d3
PASS2_SUMS .req q2
PASS2_SUM_A .req d4
PASS2_SUM_B .req d5
PASS3_SUMS .req q3
PASS3_SUM_A .req d6
PASS3_SUM_B .req d7
K0 .req q4
K1 .req q5
K2 .req q6
K3 .req q7
T0 .req q8
T0_L .req d16
T0_H .req d17
T1 .req q9
T1_L .req d18
T1_H .req d19
T2 .req q10
T2_L .req d20
T2_H .req d21
T3 .req q11
T3_L .req d22
T3_H .req d23
.macro _nh_stride k0, k1, k2, k3
// Load next message stride
vld1.8 {T3}, [MESSAGE]!
// Load next key stride
vld1.32 {\k3}, [KEY]!
// Add message words to key words
vadd.u32 T0, T3, \k0
vadd.u32 T1, T3, \k1
vadd.u32 T2, T3, \k2
vadd.u32 T3, T3, \k3
// Multiply 32x32 => 64 and accumulate
vmlal.u32 PASS0_SUMS, T0_L, T0_H
vmlal.u32 PASS1_SUMS, T1_L, T1_H
vmlal.u32 PASS2_SUMS, T2_L, T2_H
vmlal.u32 PASS3_SUMS, T3_L, T3_H
.endm
/*
* void nh_neon(const u32 *key, const u8 *message, size_t message_len,
* u8 hash[NH_HASH_BYTES])
*
* It's guaranteed that message_len % 16 == 0.
*/
ENTRY(nh_neon)
vld1.32 {K0,K1}, [KEY]!
vmov.u64 PASS0_SUMS, #0
vmov.u64 PASS1_SUMS, #0
vld1.32 {K2}, [KEY]!
vmov.u64 PASS2_SUMS, #0
vmov.u64 PASS3_SUMS, #0
subs MESSAGE_LEN, MESSAGE_LEN, #64
blt .Lloop4_done
.Lloop4:
_nh_stride K0, K1, K2, K3
_nh_stride K1, K2, K3, K0
_nh_stride K2, K3, K0, K1
_nh_stride K3, K0, K1, K2
subs MESSAGE_LEN, MESSAGE_LEN, #64
bge .Lloop4
.Lloop4_done:
ands MESSAGE_LEN, MESSAGE_LEN, #63
beq .Ldone
_nh_stride K0, K1, K2, K3
subs MESSAGE_LEN, MESSAGE_LEN, #16
beq .Ldone
_nh_stride K1, K2, K3, K0
subs MESSAGE_LEN, MESSAGE_LEN, #16
beq .Ldone
_nh_stride K2, K3, K0, K1
.Ldone:
// Sum the accumulators for each pass, then store the sums to 'hash'
vadd.u64 T0_L, PASS0_SUM_A, PASS0_SUM_B
vadd.u64 T0_H, PASS1_SUM_A, PASS1_SUM_B
vadd.u64 T1_L, PASS2_SUM_A, PASS2_SUM_B
vadd.u64 T1_H, PASS3_SUM_A, PASS3_SUM_B
vst1.8 {T0-T1}, [HASH]
bx lr
ENDPROC(nh_neon)
// SPDX-License-Identifier: GPL-2.0
/*
* NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
* (NEON accelerated version)
*
* Copyright 2018 Google LLC
*/
#include <asm/neon.h>
#include <asm/simd.h>
#include <crypto/internal/hash.h>
#include <crypto/nhpoly1305.h>
#include <linux/module.h>
asmlinkage void nh_neon(const u32 *key, const u8 *message, size_t message_len,
u8 hash[NH_HASH_BYTES]);
/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
static void _nh_neon(const u32 *key, const u8 *message, size_t message_len,
__le64 hash[NH_NUM_PASSES])
{
nh_neon(key, message, message_len, (u8 *)hash);
}
static int nhpoly1305_neon_update(struct shash_desc *desc,
const u8 *src, unsigned int srclen)
{
if (srclen < 64 || !may_use_simd())
return crypto_nhpoly1305_update(desc, src, srclen);
do {
unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
kernel_neon_begin();
crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
kernel_neon_end();
src += n;
srclen -= n;
} while (srclen);
return 0;
}
static struct shash_alg nhpoly1305_alg = {
.base.cra_name = "nhpoly1305",
.base.cra_driver_name = "nhpoly1305-neon",
.base.cra_priority = 200,
.base.cra_ctxsize = sizeof(struct nhpoly1305_key),
.base.cra_module = THIS_MODULE,
.digestsize = POLY1305_DIGEST_SIZE,
.init = crypto_nhpoly1305_init,
.update = nhpoly1305_neon_update,
.final = crypto_nhpoly1305_final,
.setkey = crypto_nhpoly1305_setkey,
.descsize = sizeof(struct nhpoly1305_state),
};
static int __init nhpoly1305_mod_init(void)
{
if (!(elf_hwcap & HWCAP_NEON))
return -ENODEV;
return crypto_register_shash(&nhpoly1305_alg);
}
static void __exit nhpoly1305_mod_exit(void)
{
crypto_unregister_shash(&nhpoly1305_alg);
}
module_init(nhpoly1305_mod_init);
module_exit(nhpoly1305_mod_exit);
MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (NEON-accelerated)");
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
MODULE_ALIAS_CRYPTO("nhpoly1305");
MODULE_ALIAS_CRYPTO("nhpoly1305-neon");
......@@ -101,11 +101,16 @@ config CRYPTO_AES_ARM64_NEON_BLK
select CRYPTO_SIMD
config CRYPTO_CHACHA20_NEON
tristate "NEON accelerated ChaCha20 symmetric cipher"
tristate "ChaCha20, XChaCha20, and XChaCha12 stream ciphers using NEON instructions"
depends on KERNEL_MODE_NEON
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
config CRYPTO_NHPOLY1305_NEON
tristate "NHPoly1305 hash function using NEON instructions (for Adiantum)"
depends on KERNEL_MODE_NEON
select CRYPTO_NHPOLY1305
config CRYPTO_AES_ARM64_BS
tristate "AES in ECB/CBC/CTR/XTS modes using bit-sliced NEON algorithm"
depends on KERNEL_MODE_NEON
......
......@@ -50,8 +50,11 @@ sha256-arm64-y := sha256-glue.o sha256-core.o
obj-$(CONFIG_CRYPTO_SHA512_ARM64) += sha512-arm64.o
sha512-arm64-y := sha512-glue.o sha512-core.o
obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
chacha-neon-y := chacha-neon-core.o chacha-neon-glue.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
nhpoly1305-neon-y := nh-neon-core.o nhpoly1305-neon-glue.o
obj-$(CONFIG_CRYPTO_AES_ARM64) += aes-arm64.o
aes-arm64-y := aes-cipher-core.o aes-cipher-glue.o
......
/*
* ARM NEON accelerated ChaCha and XChaCha stream ciphers,
* including ChaCha20 (RFC7539)
*
* Copyright (C) 2016 - 2017 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
* Based on:
* ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
*
* Copyright (C) 2015 Martin Willi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*/
#include <crypto/algapi.h>
#include <crypto/chacha.h>
#include <crypto/internal/skcipher.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/hwcap.h>
#include <asm/neon.h>
#include <asm/simd.h>
asmlinkage void chacha_block_xor_neon(u32 *state, u8 *dst, const u8 *src,
int nrounds);
asmlinkage void chacha_4block_xor_neon(u32 *state, u8 *dst, const u8 *src,
int nrounds, int bytes);
asmlinkage void hchacha_block_neon(const u32 *state, u32 *out, int nrounds);
static void chacha_doneon(u32 *state, u8 *dst, const u8 *src,
int bytes, int nrounds)
{
while (bytes > 0) {
int l = min(bytes, CHACHA_BLOCK_SIZE * 5);
if (l <= CHACHA_BLOCK_SIZE) {
u8 buf[CHACHA_BLOCK_SIZE];
memcpy(buf, src, l);
chacha_block_xor_neon(state, buf, buf, nrounds);
memcpy(dst, buf, l);
state[12] += 1;
break;
}
chacha_4block_xor_neon(state, dst, src, nrounds, l);
bytes -= CHACHA_BLOCK_SIZE * 5;
src += CHACHA_BLOCK_SIZE * 5;
dst += CHACHA_BLOCK_SIZE * 5;
state[12] += 5;
}
}
static int chacha_neon_stream_xor(struct skcipher_request *req,
struct chacha_ctx *ctx, u8 *iv)
{
struct skcipher_walk walk;
u32 state[16];
int err;
err = skcipher_walk_virt(&walk, req, false);
crypto_chacha_init(state, ctx, iv);
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
if (nbytes < walk.total)
nbytes = rounddown(nbytes, walk.stride);
kernel_neon_begin();
chacha_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
nbytes, ctx->nrounds);
kernel_neon_end();
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
return err;
}
static int chacha_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
return crypto_chacha_crypt(req);
return chacha_neon_stream_xor(req, ctx, req->iv);
}
static int xchacha_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
struct chacha_ctx subctx;
u32 state[16];
u8 real_iv[16];
if (req->cryptlen <= CHACHA_BLOCK_SIZE || !may_use_simd())
return crypto_xchacha_crypt(req);
crypto_chacha_init(state, ctx, req->iv);
kernel_neon_begin();
hchacha_block_neon(state, subctx.key, ctx->nrounds);
kernel_neon_end();
subctx.nrounds = ctx->nrounds;
memcpy(&real_iv[0], req->iv + 24, 8);
memcpy(&real_iv[8], req->iv + 16, 8);
return chacha_neon_stream_xor(req, &subctx, real_iv);
}
static struct skcipher_alg algs[] = {
{
.base.cra_name = "chacha20",
.base.cra_driver_name = "chacha20-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA_KEY_SIZE,
.max_keysize = CHACHA_KEY_SIZE,
.ivsize = CHACHA_IV_SIZE,
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 5 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
.encrypt = chacha_neon,
.decrypt = chacha_neon,
}, {
.base.cra_name = "xchacha20",
.base.cra_driver_name = "xchacha20-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA_KEY_SIZE,
.max_keysize = CHACHA_KEY_SIZE,
.ivsize = XCHACHA_IV_SIZE,
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 5 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
.encrypt = xchacha_neon,
.decrypt = xchacha_neon,
}, {
.base.cra_name = "xchacha12",
.base.cra_driver_name = "xchacha12-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA_KEY_SIZE,
.max_keysize = CHACHA_KEY_SIZE,
.ivsize = XCHACHA_IV_SIZE,
.chunksize = CHACHA_BLOCK_SIZE,
.walksize = 5 * CHACHA_BLOCK_SIZE,
.setkey = crypto_chacha12_setkey,
.encrypt = xchacha_neon,
.decrypt = xchacha_neon,
}
};
static int __init chacha_simd_mod_init(void)
{
if (!(elf_hwcap & HWCAP_ASIMD))
return -ENODEV;
return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
}
static void __exit chacha_simd_mod_fini(void)
{
crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
}
module_init(chacha_simd_mod_init);
module_exit(chacha_simd_mod_fini);
MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (NEON accelerated)");
MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_CRYPTO("chacha20");
MODULE_ALIAS_CRYPTO("chacha20-neon");
MODULE_ALIAS_CRYPTO("xchacha20");
MODULE_ALIAS_CRYPTO("xchacha20-neon");
MODULE_ALIAS_CRYPTO("xchacha12");
MODULE_ALIAS_CRYPTO("xchacha12-neon");
/*
* ChaCha20 256-bit cipher algorithm, RFC7539, arm64 NEON functions
*
* Copyright (C) 2016 - 2017 Linaro, Ltd. <ard.biesheuvel@linaro.org>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
* Based on:
* ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
*
* Copyright (C) 2015 Martin Willi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*/
#include <crypto/algapi.h>
#include <crypto/chacha20.h>
#include <crypto/internal/skcipher.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <asm/hwcap.h>
#include <asm/neon.h>
#include <asm/simd.h>
asmlinkage void chacha20_block_xor_neon(u32 *state, u8 *dst, const u8 *src);
asmlinkage void chacha20_4block_xor_neon(u32 *state, u8 *dst, const u8 *src);
static void chacha20_doneon(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
{
u8 buf[CHACHA20_BLOCK_SIZE];
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
kernel_neon_begin();
chacha20_4block_xor_neon(state, dst, src);
kernel_neon_end();
bytes -= CHACHA20_BLOCK_SIZE * 4;
src += CHACHA20_BLOCK_SIZE * 4;
dst += CHACHA20_BLOCK_SIZE * 4;
state[12] += 4;
}
if (!bytes)
return;
kernel_neon_begin();
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_neon(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
src += CHACHA20_BLOCK_SIZE;
dst += CHACHA20_BLOCK_SIZE;
state[12]++;
}
if (bytes) {
memcpy(buf, src, bytes);
chacha20_block_xor_neon(state, buf, buf);
memcpy(dst, buf, bytes);
}
kernel_neon_end();
}
static int chacha20_neon(struct skcipher_request *req)
{
struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
struct chacha20_ctx *ctx = crypto_skcipher_ctx(tfm);
struct skcipher_walk walk;
u32 state[16];
int err;
if (!may_use_simd() || req->cryptlen <= CHACHA20_BLOCK_SIZE)
return crypto_chacha20_crypt(req);
err = skcipher_walk_virt(&walk, req, false);
crypto_chacha20_init(state, ctx, walk.iv);
while (walk.nbytes > 0) {
unsigned int nbytes = walk.nbytes;
if (nbytes < walk.total)
nbytes = round_down(nbytes, walk.stride);
chacha20_doneon(state, walk.dst.virt.addr, walk.src.virt.addr,
nbytes);
err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
return err;
}
static struct skcipher_alg alg = {
.base.cra_name = "chacha20",
.base.cra_driver_name = "chacha20-neon",
.base.cra_priority = 300,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct chacha20_ctx),
.base.cra_module = THIS_MODULE,
.min_keysize = CHACHA20_KEY_SIZE,
.max_keysize = CHACHA20_KEY_SIZE,
.ivsize = CHACHA20_IV_SIZE,
.chunksize = CHACHA20_BLOCK_SIZE,
.walksize = 4 * CHACHA20_BLOCK_SIZE,
.setkey = crypto_chacha20_setkey,
.encrypt = chacha20_neon,
.decrypt = chacha20_neon,
};
static int __init chacha20_simd_mod_init(void)
{
if (!(elf_hwcap & HWCAP_ASIMD))
return -ENODEV;
return crypto_register_skcipher(&alg);
}
static void __exit chacha20_simd_mod_fini(void)
{
crypto_unregister_skcipher(&alg);
}
module_init(chacha20_simd_mod_init);
module_exit(chacha20_simd_mod_fini);
MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_CRYPTO("chacha20");
/* SPDX-License-Identifier: GPL-2.0 */
/*
* NH - ε-almost-universal hash function, ARM64 NEON accelerated version
*
* Copyright 2018 Google LLC
*
* Author: Eric Biggers <ebiggers@google.com>
*/
#include <linux/linkage.h>
KEY .req x0
MESSAGE .req x1
MESSAGE_LEN .req x2
HASH .req x3
PASS0_SUMS .req v0
PASS1_SUMS .req v1
PASS2_SUMS .req v2
PASS3_SUMS .req v3
K0 .req v4
K1 .req v5
K2 .req v6
K3 .req v7
T0 .req v8
T1 .req v9
T2 .req v10
T3 .req v11
T4 .req v12
T5 .req v13
T6 .req v14
T7 .req v15
.macro _nh_stride k0, k1, k2, k3
// Load next message stride
ld1 {T3.16b}, [MESSAGE], #16
// Load next key stride
ld1 {\k3\().4s}, [KEY], #16
// Add message words to key words
add T0.4s, T3.4s, \k0\().4s
add T1.4s, T3.4s, \k1\().4s
add T2.4s, T3.4s, \k2\().4s
add T3.4s, T3.4s, \k3\().4s
// Multiply 32x32 => 64 and accumulate
mov T4.d[0], T0.d[1]
mov T5.d[0], T1.d[1]
mov T6.d[0], T2.d[1]
mov T7.d[0], T3.d[1]
umlal PASS0_SUMS.2d, T0.2s, T4.2s
umlal PASS1_SUMS.2d, T1.2s, T5.2s
umlal PASS2_SUMS.2d, T2.2s, T6.2s
umlal PASS3_SUMS.2d, T3.2s, T7.2s
.endm
/*
* void nh_neon(const u32 *key, const u8 *message, size_t message_len,
* u8 hash[NH_HASH_BYTES])
*
* It's guaranteed that message_len % 16 == 0.
*/
ENTRY(nh_neon)
ld1 {K0.4s,K1.4s}, [KEY], #32
movi PASS0_SUMS.2d, #0
movi PASS1_SUMS.2d, #0
ld1 {K2.4s}, [KEY], #16
movi PASS2_SUMS.2d, #0
movi PASS3_SUMS.2d, #0
subs MESSAGE_LEN, MESSAGE_LEN, #64
blt .Lloop4_done
.Lloop4:
_nh_stride K0, K1, K2, K3
_nh_stride K1, K2, K3, K0
_nh_stride K2, K3, K0, K1
_nh_stride K3, K0, K1, K2
subs MESSAGE_LEN, MESSAGE_LEN, #64
bge .Lloop4
.Lloop4_done:
ands MESSAGE_LEN, MESSAGE_LEN, #63
beq .Ldone
_nh_stride K0, K1, K2, K3
subs MESSAGE_LEN, MESSAGE_LEN, #16
beq .Ldone
_nh_stride K1, K2, K3, K0
subs MESSAGE_LEN, MESSAGE_LEN, #16
beq .Ldone
_nh_stride K2, K3, K0, K1
.Ldone:
// Sum the accumulators for each pass, then store the sums to 'hash'
addp T0.2d, PASS0_SUMS.2d, PASS1_SUMS.2d
addp T1.2d, PASS2_SUMS.2d, PASS3_SUMS.2d
st1 {T0.16b,T1.16b}, [HASH]
ret
ENDPROC(nh_neon)
// SPDX-License-Identifier: GPL-2.0
/*
* NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
* (ARM64 NEON accelerated version)
*
* Copyright 2018 Google LLC
*/
#include <asm/neon.h>
#include <asm/simd.h>
#include <crypto/internal/hash.h>
#include <crypto/nhpoly1305.h>
#include <linux/module.h>
asmlinkage void nh_neon(const u32 *key, const u8 *message, size_t message_len,
u8 hash[NH_HASH_BYTES]);
/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
static void _nh_neon(const u32 *key, const u8 *message, size_t message_len,
__le64 hash[NH_NUM_PASSES])
{
nh_neon(key, message, message_len, (u8 *)hash);
}
static int nhpoly1305_neon_update(struct shash_desc *desc,
const u8 *src, unsigned int srclen)
{
if (srclen < 64 || !may_use_simd())
return crypto_nhpoly1305_update(desc, src, srclen);
do {
unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
kernel_neon_begin();
crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon);
kernel_neon_end();
src += n;
srclen -= n;
} while (srclen);
return 0;
}
static struct shash_alg nhpoly1305_alg = {
.base.cra_name = "nhpoly1305",
.base.cra_driver_name = "nhpoly1305-neon",
.base.cra_priority = 200,
.base.cra_ctxsize = sizeof(struct nhpoly1305_key),
.base.cra_module = THIS_MODULE,
.digestsize = POLY1305_DIGEST_SIZE,
.init = crypto_nhpoly1305_init,
.update = nhpoly1305_neon_update,
.final = crypto_nhpoly1305_final,
.setkey = crypto_nhpoly1305_setkey,
.descsize = sizeof(struct nhpoly1305_state),
};
static int __init nhpoly1305_mod_init(void)
{
if (!(elf_hwcap & HWCAP_ASIMD))
return -ENODEV;
return crypto_register_shash(&nhpoly1305_alg);
}
static void __exit nhpoly1305_mod_exit(void)
{
crypto_unregister_shash(&nhpoly1305_alg);
}
module_init(nhpoly1305_mod_init);
module_exit(nhpoly1305_mod_exit);
MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (NEON-accelerated)");
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
MODULE_ALIAS_CRYPTO("nhpoly1305");
MODULE_ALIAS_CRYPTO("nhpoly1305-neon");
......@@ -137,7 +137,7 @@ static int fallback_init_cip(struct crypto_tfm *tfm)
struct s390_aes_ctx *sctx = crypto_tfm_ctx(tfm);
sctx->fallback.cip = crypto_alloc_cipher(name, 0,
CRYPTO_ALG_ASYNC | CRYPTO_ALG_NEED_FALLBACK);
CRYPTO_ALG_NEED_FALLBACK);
if (IS_ERR(sctx->fallback.cip)) {
pr_err("Allocating AES fallback algorithm %s failed\n",
......
......@@ -476,11 +476,6 @@ static bool __init sparc64_has_aes_opcode(void)
static int __init aes_sparc64_mod_init(void)
{
int i;
for (i = 0; i < ARRAY_SIZE(algs); i++)
INIT_LIST_HEAD(&algs[i].cra_list);
if (sparc64_has_aes_opcode()) {
pr_info("Using sparc64 aes opcodes optimized AES implementation\n");
return crypto_register_algs(algs, ARRAY_SIZE(algs));
......
......@@ -299,11 +299,6 @@ static bool __init sparc64_has_camellia_opcode(void)
static int __init camellia_sparc64_mod_init(void)
{
int i;
for (i = 0; i < ARRAY_SIZE(algs); i++)
INIT_LIST_HEAD(&algs[i].cra_list);
if (sparc64_has_camellia_opcode()) {
pr_info("Using sparc64 camellia opcodes optimized CAMELLIA implementation\n");
return crypto_register_algs(algs, ARRAY_SIZE(algs));
......
......@@ -510,11 +510,6 @@ static bool __init sparc64_has_des_opcode(void)
static int __init des_sparc64_mod_init(void)
{
int i;
for (i = 0; i < ARRAY_SIZE(algs); i++)
INIT_LIST_HEAD(&algs[i].cra_list);
if (sparc64_has_des_opcode()) {
pr_info("Using sparc64 des opcodes optimized DES implementation\n");
return crypto_register_algs(algs, ARRAY_SIZE(algs));
......
......@@ -8,6 +8,7 @@ OBJECT_FILES_NON_STANDARD := y
avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no)
avx2_supported := $(call as-instr,vpgatherdd %ymm0$(comma)(%eax$(comma)%ymm1\
$(comma)4)$(comma)%ymm2,yes,no)
avx512_supported :=$(call as-instr,vpmovm2b %k1$(comma)%zmm5,yes,no)
sha1_ni_supported :=$(call as-instr,sha1msg1 %xmm0$(comma)%xmm1,yes,no)
sha256_ni_supported :=$(call as-instr,sha256msg1 %xmm0$(comma)%xmm1,yes,no)
......@@ -23,7 +24,7 @@ obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o
obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha20-x86_64.o
obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
......@@ -46,6 +47,9 @@ obj-$(CONFIG_CRYPTO_MORUS1280_GLUE) += morus1280_glue.o
obj-$(CONFIG_CRYPTO_MORUS640_SSE2) += morus640-sse2.o
obj-$(CONFIG_CRYPTO_MORUS1280_SSE2) += morus1280-sse2.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o
obj-$(CONFIG_CRYPTO_NHPOLY1305_AVX2) += nhpoly1305-avx2.o
# These modules require assembler to support AVX.
ifeq ($(avx_supported),yes)
obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += \
......@@ -74,7 +78,7 @@ camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o
blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o
twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
chacha20-x86_64-y := chacha20-ssse3-x86_64.o chacha20_glue.o
chacha-x86_64-y := chacha-ssse3-x86_64.o chacha_glue.o
serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o
aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o
......@@ -84,6 +88,8 @@ aegis256-aesni-y := aegis256-aesni-asm.o aegis256-aesni-glue.o
morus640-sse2-y := morus640-sse2-asm.o morus640-sse2-glue.o
morus1280-sse2-y := morus1280-sse2-asm.o morus1280-sse2-glue.o
nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o
ifeq ($(avx_supported),yes)
camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o \
camellia_aesni_avx_glue.o
......@@ -97,10 +103,16 @@ endif
ifeq ($(avx2_supported),yes)
camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o
chacha20-x86_64-y += chacha20-avx2-x86_64.o
chacha-x86_64-y += chacha-avx2-x86_64.o
serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o
morus1280-avx2-y := morus1280-avx2-asm.o morus1280-avx2-glue.o
nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o
endif
ifeq ($(avx512_supported),yes)
chacha-x86_64-y += chacha-avx512vl-x86_64.o
endif
aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
/*
* ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
* ChaCha 256-bit cipher algorithm, x64 SSSE3 functions
*
* Copyright (C) 2015 Martin Willi
*
......@@ -10,6 +10,7 @@
*/
#include <linux/linkage.h>
#include <asm/frame.h>
.section .rodata.cst16.ROT8, "aM", @progbits, 16
.align 16
......@@ -23,35 +24,25 @@ CTRINC: .octa 0x00000003000000020000000100000000
.text
ENTRY(chacha20_block_xor_ssse3)
# %rdi: Input state matrix, s
# %rsi: 1 data block output, o
# %rdx: 1 data block input, i
# This function encrypts one ChaCha20 block by loading the state matrix
# in four SSE registers. It performs matrix operation on four words in
# parallel, but requireds shuffling to rearrange the words after each
# round. 8/16-bit word rotation is done with the slightly better
# performing SSSE3 byte shuffling, 7/12-bit word rotation uses
# traditional shift+OR.
# x0..3 = s0..3
movdqa 0x00(%rdi),%xmm0
movdqa 0x10(%rdi),%xmm1
movdqa 0x20(%rdi),%xmm2
movdqa 0x30(%rdi),%xmm3
movdqa %xmm0,%xmm8
movdqa %xmm1,%xmm9
movdqa %xmm2,%xmm10
movdqa %xmm3,%xmm11
/*
* chacha_permute - permute one block
*
* Permute one 64-byte block where the state matrix is in %xmm0-%xmm3. This
* function performs matrix operations on four words in parallel, but requires
* shuffling to rearrange the words after each round. 8/16-bit word rotation is
* done with the slightly better performing SSSE3 byte shuffling, 7/12-bit word
* rotation uses traditional shift+OR.
*
* The round count is given in %r8d.
*
* Clobbers: %r8d, %xmm4-%xmm7
*/
chacha_permute:
movdqa ROT8(%rip),%xmm4
movdqa ROT16(%rip),%xmm5
mov $10,%ecx
.Ldoubleround:
# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
paddd %xmm1,%xmm0
pxor %xmm0,%xmm3
......@@ -118,39 +109,129 @@ ENTRY(chacha20_block_xor_ssse3)
# x3 = shuffle32(x3, MASK(0, 3, 2, 1))
pshufd $0x39,%xmm3,%xmm3
dec %ecx
sub $2,%r8d
jnz .Ldoubleround
ret
ENDPROC(chacha_permute)
ENTRY(chacha_block_xor_ssse3)
# %rdi: Input state matrix, s
# %rsi: up to 1 data block output, o
# %rdx: up to 1 data block input, i
# %rcx: input/output length in bytes
# %r8d: nrounds
FRAME_BEGIN
# x0..3 = s0..3
movdqa 0x00(%rdi),%xmm0
movdqa 0x10(%rdi),%xmm1
movdqa 0x20(%rdi),%xmm2
movdqa 0x30(%rdi),%xmm3
movdqa %xmm0,%xmm8
movdqa %xmm1,%xmm9
movdqa %xmm2,%xmm10
movdqa %xmm3,%xmm11
mov %rcx,%rax
call chacha_permute
# o0 = i0 ^ (x0 + s0)
movdqu 0x00(%rdx),%xmm4
paddd %xmm8,%xmm0
cmp $0x10,%rax
jl .Lxorpart
movdqu 0x00(%rdx),%xmm4
pxor %xmm4,%xmm0
movdqu %xmm0,0x00(%rsi)
# o1 = i1 ^ (x1 + s1)
movdqu 0x10(%rdx),%xmm5
paddd %xmm9,%xmm1
pxor %xmm5,%xmm1
movdqu %xmm1,0x10(%rsi)
movdqa %xmm1,%xmm0
cmp $0x20,%rax
jl .Lxorpart
movdqu 0x10(%rdx),%xmm0
pxor %xmm1,%xmm0
movdqu %xmm0,0x10(%rsi)
# o2 = i2 ^ (x2 + s2)
movdqu 0x20(%rdx),%xmm6
paddd %xmm10,%xmm2
pxor %xmm6,%xmm2
movdqu %xmm2,0x20(%rsi)
movdqa %xmm2,%xmm0
cmp $0x30,%rax
jl .Lxorpart
movdqu 0x20(%rdx),%xmm0
pxor %xmm2,%xmm0
movdqu %xmm0,0x20(%rsi)
# o3 = i3 ^ (x3 + s3)
movdqu 0x30(%rdx),%xmm7
paddd %xmm11,%xmm3
pxor %xmm7,%xmm3
movdqu %xmm3,0x30(%rsi)
movdqa %xmm3,%xmm0
cmp $0x40,%rax
jl .Lxorpart
movdqu 0x30(%rdx),%xmm0
pxor %xmm3,%xmm0
movdqu %xmm0,0x30(%rsi)
.Ldone:
FRAME_END
ret
.Lxorpart:
# xor remaining bytes from partial register into output
mov %rax,%r9
and $0x0f,%r9
jz .Ldone
and $~0x0f,%rax
mov %rsi,%r11
lea 8(%rsp),%r10
sub $0x10,%rsp
and $~31,%rsp
lea (%rdx,%rax),%rsi
mov %rsp,%rdi
mov %r9,%rcx
rep movsb
pxor 0x00(%rsp),%xmm0
movdqa %xmm0,0x00(%rsp)
mov %rsp,%rsi
lea (%r11,%rax),%rdi
mov %r9,%rcx
rep movsb
lea -8(%r10),%rsp
jmp .Ldone
ENDPROC(chacha_block_xor_ssse3)
ENTRY(hchacha_block_ssse3)
# %rdi: Input state matrix, s
# %rsi: output (8 32-bit words)
# %edx: nrounds
FRAME_BEGIN
movdqa 0x00(%rdi),%xmm0
movdqa 0x10(%rdi),%xmm1
movdqa 0x20(%rdi),%xmm2
movdqa 0x30(%rdi),%xmm3
mov %edx,%r8d
call chacha_permute
movdqu %xmm0,0x00(%rsi)
movdqu %xmm3,0x10(%rsi)
FRAME_END
ret
ENDPROC(chacha20_block_xor_ssse3)
ENDPROC(hchacha_block_ssse3)
ENTRY(chacha20_4block_xor_ssse3)
ENTRY(chacha_4block_xor_ssse3)
# %rdi: Input state matrix, s
# %rsi: 4 data blocks output, o
# %rdx: 4 data blocks input, i
# %rsi: up to 4 data blocks output, o
# %rdx: up to 4 data blocks input, i
# %rcx: input/output length in bytes
# %r8d: nrounds
# This function encrypts four consecutive ChaCha20 blocks by loading the
# This function encrypts four consecutive ChaCha blocks by loading the
# the state matrix in SSE registers four times. As we need some scratch
# registers, we save the first four registers on the stack. The
# algorithm performs each operation on the corresponding word of each
......@@ -163,6 +244,7 @@ ENTRY(chacha20_4block_xor_ssse3)
lea 8(%rsp),%r10
sub $0x80,%rsp
and $~63,%rsp
mov %rcx,%rax
# x0..15[0-3] = s0..3[0..3]
movq 0x00(%rdi),%xmm1
......@@ -202,8 +284,6 @@ ENTRY(chacha20_4block_xor_ssse3)
# x12 += counter values 0-3
paddd %xmm1,%xmm12
mov $10,%ecx
.Ldoubleround4:
# x0 += x4, x12 = rotl32(x12 ^ x0, 16)
movdqa 0x00(%rsp),%xmm0
......@@ -421,7 +501,7 @@ ENTRY(chacha20_4block_xor_ssse3)
psrld $25,%xmm4
por %xmm0,%xmm4
dec %ecx
sub $2,%r8d
jnz .Ldoubleround4
# x0[0-3] += s0[0]
......@@ -573,58 +653,143 @@ ENTRY(chacha20_4block_xor_ssse3)
# xor with corresponding input, write to output
movdqa 0x00(%rsp),%xmm0
cmp $0x10,%rax
jl .Lxorpart4
movdqu 0x00(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x00(%rsi)
movdqa 0x10(%rsp),%xmm0
movdqu 0x80(%rdx),%xmm1
movdqu %xmm4,%xmm0
cmp $0x20,%rax
jl .Lxorpart4
movdqu 0x10(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x80(%rsi)
movdqu %xmm0,0x10(%rsi)
movdqu %xmm8,%xmm0
cmp $0x30,%rax
jl .Lxorpart4
movdqu 0x20(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x20(%rsi)
movdqu %xmm12,%xmm0
cmp $0x40,%rax
jl .Lxorpart4
movdqu 0x30(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x30(%rsi)
movdqa 0x20(%rsp),%xmm0
cmp $0x50,%rax
jl .Lxorpart4
movdqu 0x40(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x40(%rsi)
movdqu %xmm6,%xmm0
cmp $0x60,%rax
jl .Lxorpart4
movdqu 0x50(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x50(%rsi)
movdqu %xmm10,%xmm0
cmp $0x70,%rax
jl .Lxorpart4
movdqu 0x60(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x60(%rsi)
movdqu %xmm14,%xmm0
cmp $0x80,%rax
jl .Lxorpart4
movdqu 0x70(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x70(%rsi)
movdqa 0x10(%rsp),%xmm0
cmp $0x90,%rax
jl .Lxorpart4
movdqu 0x80(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x80(%rsi)
movdqu %xmm5,%xmm0
cmp $0xa0,%rax
jl .Lxorpart4
movdqu 0x90(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0x90(%rsi)
movdqu %xmm9,%xmm0
cmp $0xb0,%rax
jl .Lxorpart4
movdqu 0xa0(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0xa0(%rsi)
movdqu %xmm13,%xmm0
cmp $0xc0,%rax
jl .Lxorpart4
movdqu 0xb0(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0xb0(%rsi)
movdqa 0x30(%rsp),%xmm0
cmp $0xd0,%rax
jl .Lxorpart4
movdqu 0xc0(%rdx),%xmm1
pxor %xmm1,%xmm0
movdqu %xmm0,0xc0(%rsi)
movdqu 0x10(%rdx),%xmm1
pxor %xmm1,%xmm4
movdqu %xmm4,0x10(%rsi)
movdqu 0x90(%rdx),%xmm1
pxor %xmm1,%xmm5
movdqu %xmm5,0x90(%rsi)
movdqu 0x50(%rdx),%xmm1
pxor %xmm1,%xmm6
movdqu %xmm6,0x50(%rsi)
movdqu %xmm7,%xmm0
cmp $0xe0,%rax
jl .Lxorpart4
movdqu 0xd0(%rdx),%xmm1
pxor %xmm1,%xmm7
movdqu %xmm7,0xd0(%rsi)
movdqu 0x20(%rdx),%xmm1
pxor %xmm1,%xmm8
movdqu %xmm8,0x20(%rsi)
movdqu 0xa0(%rdx),%xmm1
pxor %xmm1,%xmm9
movdqu %xmm9,0xa0(%rsi)
movdqu 0x60(%rdx),%xmm1
pxor %xmm1,%xmm10
movdqu %xmm10,0x60(%rsi)
pxor %xmm1,%xmm0
movdqu %xmm0,0xd0(%rsi)
movdqu %xmm11,%xmm0
cmp $0xf0,%rax
jl .Lxorpart4
movdqu 0xe0(%rdx),%xmm1
pxor %xmm1,%xmm11
movdqu %xmm11,0xe0(%rsi)
movdqu 0x30(%rdx),%xmm1
pxor %xmm1,%xmm12
movdqu %xmm12,0x30(%rsi)
movdqu 0xb0(%rdx),%xmm1
pxor %xmm1,%xmm13
movdqu %xmm13,0xb0(%rsi)
movdqu 0x70(%rdx),%xmm1
pxor %xmm1,%xmm14
movdqu %xmm14,0x70(%rsi)
pxor %xmm1,%xmm0
movdqu %xmm0,0xe0(%rsi)
movdqu %xmm15,%xmm0
cmp $0x100,%rax
jl .Lxorpart4
movdqu 0xf0(%rdx),%xmm1
pxor %xmm1,%xmm15
movdqu %xmm15,0xf0(%rsi)
pxor %xmm1,%xmm0
movdqu %xmm0,0xf0(%rsi)
.Ldone4:
lea -8(%r10),%rsp
ret
ENDPROC(chacha20_4block_xor_ssse3)
.Lxorpart4:
# xor remaining bytes from partial register into output
mov %rax,%r9
and $0x0f,%r9
jz .Ldone4
and $~0x0f,%rax
mov %rsi,%r11
lea (%rdx,%rax),%rsi
mov %rsp,%rdi
mov %r9,%rcx
rep movsb
pxor 0x00(%rsp),%xmm0
movdqa %xmm0,0x00(%rsp)
mov %rsp,%rsi
lea (%r11,%rax),%rdi
mov %r9,%rcx
rep movsb
jmp .Ldone4
ENDPROC(chacha_4block_xor_ssse3)
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
// SPDX-License-Identifier: GPL-2.0
/*
* NHPoly1305 - ε-almost-∆-universal hash function for Adiantum
* (AVX2 accelerated version)
*
* Copyright 2018 Google LLC
*/
#include <crypto/internal/hash.h>
#include <crypto/nhpoly1305.h>
#include <linux/module.h>
#include <asm/fpu/api.h>
asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
u8 hash[NH_HASH_BYTES]);
/* wrapper to avoid indirect call to assembly, which doesn't work with CFI */
static void _nh_avx2(const u32 *key, const u8 *message, size_t message_len,
__le64 hash[NH_NUM_PASSES])
{
nh_avx2(key, message, message_len, (u8 *)hash);
}
static int nhpoly1305_avx2_update(struct shash_desc *desc,
const u8 *src, unsigned int srclen)
{
if (srclen < 64 || !irq_fpu_usable())
return crypto_nhpoly1305_update(desc, src, srclen);
do {
unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE);
kernel_fpu_begin();
crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2);
kernel_fpu_end();
src += n;
srclen -= n;
} while (srclen);
return 0;
}
static struct shash_alg nhpoly1305_alg = {
.base.cra_name = "nhpoly1305",
.base.cra_driver_name = "nhpoly1305-avx2",
.base.cra_priority = 300,
.base.cra_ctxsize = sizeof(struct nhpoly1305_key),
.base.cra_module = THIS_MODULE,
.digestsize = POLY1305_DIGEST_SIZE,
.init = crypto_nhpoly1305_init,
.update = nhpoly1305_avx2_update,
.final = crypto_nhpoly1305_final,
.setkey = crypto_nhpoly1305_setkey,
.descsize = sizeof(struct nhpoly1305_state),
};
static int __init nhpoly1305_mod_init(void)
{
if (!boot_cpu_has(X86_FEATURE_AVX2) ||
!boot_cpu_has(X86_FEATURE_OSXSAVE))
return -ENODEV;
return crypto_register_shash(&nhpoly1305_alg);
}
static void __exit nhpoly1305_mod_exit(void)
{
crypto_unregister_shash(&nhpoly1305_alg);
}
module_init(nhpoly1305_mod_init);
module_exit(nhpoly1305_mod_exit);
MODULE_DESCRIPTION("NHPoly1305 ε-almost-∆-universal hash function (AVX2-accelerated)");
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
MODULE_ALIAS_CRYPTO("nhpoly1305");
MODULE_ALIAS_CRYPTO("nhpoly1305-avx2");
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
......@@ -54,7 +54,8 @@ cryptomgr-y := algboss.o testmgr.o
obj-$(CONFIG_CRYPTO_MANAGER2) += cryptomgr.o
obj-$(CONFIG_CRYPTO_USER) += crypto_user.o
crypto_user-y := crypto_user_base.o crypto_user_stat.o
crypto_user-y := crypto_user_base.o
crypto_user-$(CONFIG_CRYPTO_STATS) += crypto_user_stat.o
obj-$(CONFIG_CRYPTO_CMAC) += cmac.o
obj-$(CONFIG_CRYPTO_HMAC) += hmac.o
obj-$(CONFIG_CRYPTO_VMAC) += vmac.o
......@@ -71,6 +72,7 @@ obj-$(CONFIG_CRYPTO_SHA256) += sha256_generic.o
obj-$(CONFIG_CRYPTO_SHA512) += sha512_generic.o
obj-$(CONFIG_CRYPTO_SHA3) += sha3_generic.o
obj-$(CONFIG_CRYPTO_SM3) += sm3_generic.o
obj-$(CONFIG_CRYPTO_STREEBOG) += streebog_generic.o
obj-$(CONFIG_CRYPTO_WP512) += wp512.o
CFLAGS_wp512.o := $(call cc-option,-fno-schedule-insns) # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149
obj-$(CONFIG_CRYPTO_TGR192) += tgr192.o
......@@ -84,6 +86,8 @@ obj-$(CONFIG_CRYPTO_LRW) += lrw.o
obj-$(CONFIG_CRYPTO_XTS) += xts.o
obj-$(CONFIG_CRYPTO_CTR) += ctr.o
obj-$(CONFIG_CRYPTO_KEYWRAP) += keywrap.o
obj-$(CONFIG_CRYPTO_ADIANTUM) += adiantum.o
obj-$(CONFIG_CRYPTO_NHPOLY1305) += nhpoly1305.o
obj-$(CONFIG_CRYPTO_GCM) += gcm.o
obj-$(CONFIG_CRYPTO_CCM) += ccm.o
obj-$(CONFIG_CRYPTO_CHACHA20POLY1305) += chacha20poly1305.o
......@@ -116,7 +120,7 @@ obj-$(CONFIG_CRYPTO_KHAZAD) += khazad.o
obj-$(CONFIG_CRYPTO_ANUBIS) += anubis.o
obj-$(CONFIG_CRYPTO_SEED) += seed.o
obj-$(CONFIG_CRYPTO_SALSA20) += salsa20_generic.o
obj-$(CONFIG_CRYPTO_CHACHA20) += chacha20_generic.o
obj-$(CONFIG_CRYPTO_CHACHA20) += chacha_generic.o
obj-$(CONFIG_CRYPTO_POLY1305) += poly1305_generic.o
obj-$(CONFIG_CRYPTO_DEFLATE) += deflate.o
obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment