1. 16 Jan, 2020 5 commits
    • crypto: {arm,arm64,mips}/poly1305 - remove redundant non-reduction from emit · 31899908
      Jason A. Donenfeld authored
      This appears to be some kind of copy and paste error, and is actually
      dead code.
      
      Pre: f = 0 ⇒ (f >> 32) = 0
          f = (f >> 32) + le32_to_cpu(digest[0]);
      Post: 0 ≤ f < 2³²
          put_unaligned_le32(f, dst);
      
      Pre: 0 ≤ f < 2³² ⇒ (f >> 32) = 0
          f = (f >> 32) + le32_to_cpu(digest[1]);
      Post: 0 ≤ f < 2³²
          put_unaligned_le32(f, dst + 4);
      
      Pre: 0 ≤ f < 2³² ⇒ (f >> 32) = 0
          f = (f >> 32) + le32_to_cpu(digest[2]);
      Post: 0 ≤ f < 2³²
          put_unaligned_le32(f, dst + 8);
      
      Pre: 0 ≤ f < 2³² ⇒ (f >> 32) = 0
          f = (f >> 32) + le32_to_cpu(digest[3]);
      Post: 0 ≤ f < 2³²
          put_unaligned_le32(f, dst + 12);
      
      Therefore this sequence is redundant. And Andy's code appears to handle
      misalignment acceptably.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Tested-by: Ard Biesheuvel <ardb@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
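
The pre/post-condition argument in the commit above can be checked mechanically. What follows is a minimal userspace sketch, not the kernel code: `load_le32` and `store_le32` are hypothetical stand-ins for `le32_to_cpu` and `put_unaligned_le32`, and the digest is modeled as a plain byte array. Because `f` starts at 0 and only ever receives a 32-bit value, the carry term `f >> 32` is always 0 and the sequence reduces to a copy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for le32_to_cpu() on an unaligned byte pointer. */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical stand-in for put_unaligned_le32(). */
static void store_le32(uint32_t v, uint8_t *p)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* The removed sequence: the carry (f >> 32) is 0 on every iteration. */
static void emit_with_dead_carry(const uint8_t digest[16], uint8_t dst[16])
{
	uint64_t f = 0;

	for (int i = 0; i < 4; i++) {
		f = (f >> 32) + load_le32(digest + 4 * i);
		assert(f <= UINT32_MAX);	/* Post: 0 <= f < 2^32 */
		store_le32((uint32_t)f, dst + 4 * i);
	}
}

/* Returns 1 when the sequence left the digest bytes unchanged. */
static int emit_is_identity(const uint8_t digest[16])
{
	uint8_t dst[16];

	emit_with_dead_carry(digest, dst);
	return memcmp(digest, dst, 16) == 0;
}
```

Under these assumptions `emit_is_identity` holds for every input, including a digest of all `0xff` bytes, which is the case most likely to produce a carry if one were possible.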
    • crypto: x86/poly1305 - wire up faster implementations for kernel · d7d7b853
      Jason A. Donenfeld authored
      These x86_64 vectorized implementations support AVX, AVX-2, and AVX-512F.
      The AVX-512F implementation is disabled on Skylake, due to throttling,
      but it is quite fast on >= Cannonlake.
      
      On the left are cycle counts on a Core i7 6700HQ using the AVX-2
      codepath, comparing this implementation ("new") to the implementation in
      the current crypto api ("old"). On the right are benchmarks on a Xeon
      Gold 5120 using the AVX-512 codepath. The new implementation is faster
      on all benchmarks.
      
              AVX-2                  AVX-512
            ---------              -----------
      
          size    old     new      size   old     new
          ----    ----    ----     ----   ----    ----
          0       70      68       0      74      70
          16      92      90       16     96      92
          32      134     104      32     136     106
          48      172     120      48     184     124
          64      218     136      64     218     138
          80      254     158      80     260     160
          96      298     174      96     300     176
          112     342     192      112    342     194
          128     388     212      128    384     212
          144     428     228      144    420     226
          160     466     246      160    464     248
          176     510     264      176    504     264
          192     550     282      192    544     282
          208     594     302      208    582     300
          224     628     316      224    624     318
          240     676     334      240    662     338
          256     716     354      256    708     358
          272     764     374      272    748     372
          288     802     352      288    788     358
          304     420     366      304    422     370
          320     428     360      320    432     364
          336     484     378      336    486     380
          352     426     384      352    434     390
          368     478     400      368    480     408
          384     488     394      384    490     398
          400     542     408      400    542     412
          416     486     416      416    492     426
          432     534     430      432    538     436
          448     544     422      448    546     432
          464     600     438      464    600     448
          480     540     448      480    548     456
          496     594     464      496    594     476
          512     602     456      512    606     470
          528     656     476      528    656     480
          544     600     480      544    606     498
          560     650     494      560    652     512
          576     664     490      576    662     508
          592     714     508      592    716     522
          608     656     514      608    664     538
          624     708     532      624    710     552
          640     716     524      640    720     516
          656     770     536      656    772     526
          672     716     548      672    722     544
          688     770     562      688    768     556
          704     774     552      704    778     556
          720     826     568      720    832     568
          736     768     574      736    780     584
          752     822     592      752    826     600
          768     830     584      768    836     560
          784     884     602      784    888     572
          800     828     610      800    838     588
          816     884     628      816    884     604
          832     888     618      832    894     598
          848     942     632      848    946     612
          864     884     644      864    896     628
          880     936     660      880    942     644
          896     948     652      896    952     608
          912     1000    664      912    1004    616
          928     942     676      928    954     634
          944     994     690      944    1000    646
          960     1002    680      960    1008    646
          976     1054    694      976    1062    658
          992     1002    706      992    1012    674
          1008    1052    720      1008   1058    690
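
One way to read the table is as an old/new cycle ratio per message size. The sketch below is only an illustration of that arithmetic: `speedup` and `avx2_rows` are hypothetical names, and the three rows are copied from the AVX-2 columns above.

```c
#include <assert.h>

/* Ratio of old to new cycle counts; values above 1.0 mean the new code is faster. */
static double speedup(unsigned int old_cycles, unsigned int new_cycles)
{
	return (double)old_cycles / (double)new_cycles;
}

/* Three AVX-2 rows copied from the table: { size, old, new }. */
static const unsigned int avx2_rows[3][3] = {
	{ 16, 92, 90 },
	{ 256, 716, 354 },
	{ 1008, 1052, 720 },
};
```

For these rows the ratio is roughly 1.02 at 16 bytes, 2.02 at 256 bytes, and 1.46 at 1008 bytes.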
      
      This commit wires in the prior implementation from Andy, and makes the
      following changes to be suitable for kernel land.
      
        - Some cosmetic and structural changes, like renaming labels to
          .Lname, constants, and other Linux conventions, as well as making
          the code easy for us to maintain moving forward.
      
        - CPU feature checking is done in C by the glue code.
      
        - We avoid jumping into the middle of functions, to appease objtool,
          and instead parameterize shared code.
      
        - We maintain frame pointers so that stack traces make sense.
      
        - We remove the dependency on the perl xlate code, which translates
          the output into syntax for assemblers we don't care about.
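
The feature-checking item in the list above can be pictured as a small selection function in the C glue. This is a sketch under stated assumptions, not the kernel's glue code: `struct cpu_caps`, `enum poly1305_impl`, and `pick_impl` are hypothetical names, the flags would in practice come from the kernel's own CPU feature queries, and the scalar fallback is assumed. The `avx512_throttles` flag models the commit's statement that AVX-512F is disabled on Skylake.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability flags; real glue code queries the CPU itself. */
struct cpu_caps {
	bool avx;
	bool avx2;
	bool avx512f;
	bool avx512_throttles;	/* e.g. Skylake, per the commit message */
};

enum poly1305_impl { IMPL_SCALAR, IMPL_AVX, IMPL_AVX2, IMPL_AVX512 };

/* Pick the widest usable code path, skipping AVX-512F where it throttles. */
static enum poly1305_impl pick_impl(const struct cpu_caps *caps)
{
	if (caps->avx512f && !caps->avx512_throttles)
		return IMPL_AVX512;
	if (caps->avx2)
		return IMPL_AVX2;
	if (caps->avx)
		return IMPL_AVX;
	return IMPL_SCALAR;	/* assumed fallback */
}
```

Keeping this decision in C means the assembly only has to provide the individual code paths, which fits the stated goal of leaving the arithmetic untouched.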
      
      Importantly, none of our changes affect the arithmetic or core code, but
      just involve the differing environment of kernel space.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
      Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: x86/poly1305 - import unmodified cryptogams implementation · 0896ca2a
      Jason A. Donenfeld authored
      These x86_64 vectorized implementations come from Andy Polyakov's
      CRYPTOGAMS implementation, and are included here in raw form without
      modification, so that subsequent commits that fix them up for the
      kernel show exactly what has changed.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: poly1305 - add new 32 and 64-bit generic versions · 1c08a104
      Jason A. Donenfeld authored
      These two C implementations from Zinc -- a 32x32 one and a 64x64 one,
      depending on the platform -- come from Andrew Moon's public domain
      poly1305-donna portable code, modified for usage in the kernel. The
      precomputation in the 32-bit version and the use of 64x64 multiplies in
      the 64-bit version make these perform better than the code they replace.
      Moon's code is also very widespread and has received many eyeballs of
      scrutiny.
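
To illustrate why 64x64 multiplies help, the sketch below computes the same 128-bit product two ways: from four 32x32 multiplies, and from one wide multiply. It does not reproduce the limb layout of poly1305-donna; the function names are hypothetical, and `unsigned __int128` is assumed to be available (a GCC/Clang extension on 64-bit targets).

```c
#include <assert.h>
#include <stdint.h>

/* 64x64 -> 128-bit product built from four 32x32 -> 64-bit multiplies. */
static void mul64_via_32(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
	uint64_t a0 = (uint32_t)a, a1 = a >> 32;
	uint64_t b0 = (uint32_t)b, b1 = b >> 32;
	uint64_t p00 = a0 * b0, p01 = a0 * b1;
	uint64_t p10 = a1 * b0, p11 = a1 * b1;
	/* Middle column: three values below 2^32, so the sum cannot overflow. */
	uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

	*lo = (mid << 32) | (uint32_t)p00;
	*hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* The same product with a single wide multiply. */
static void mul64_wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
	unsigned __int128 p = (unsigned __int128)a * b;

	*lo = (uint64_t)p;
	*hi = (uint64_t)(p >> 64);
}

/* Returns 1 when both paths produce the same 128-bit result. */
static int mul_paths_agree(uint64_t a, uint64_t b)
{
	uint64_t h1, l1, h2, l2;

	mul64_via_32(a, b, &h1, &l1);
	mul64_wide(a, b, &h2, &l2);
	return h1 == h2 && l1 == l2;
}
```

On a 64-bit CPU the second form typically maps to a single multiply instruction, while the first needs four multiplies plus carry handling.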
      
      There's a bit of interference with the x86 implementation, which
      relies on internal details of the old scalar implementation. In the next
      commit, the x86 implementation will be replaced with a faster one that
      doesn't rely on this, so none of this matters much. But for now, to keep
      this passing the tests, we inline the bits of the old implementation
      that the x86 implementation relied on. Also, since we now support a
      slightly larger key space, via the union, some offsets had to be fixed
      up.
      
      Nonce calculation was folded in with the emit function, to take
      advantage of 64x64 arithmetic. However, Adiantum appeared to rely on no
      nonce handling in emit, so this path was conditionalized. We also
      introduced a new struct, poly1305_core_key, to represent the precise
      amount of space that particular implementation uses.
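
The conditional nonce handling described above can be sketched as a 128-bit addition carried out in two 64-bit halves, skipped when no nonce is supplied. This is a simplified illustration, not the kernel's emit function: `emit_sketch` and its signature are hypothetical, the accumulator is assumed to be fully reduced already, and a `NULL` nonce models callers such as Adiantum that want no nonce added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * h:     fully reduced 128-bit accumulator as two 64-bit halves, low first.
 * nonce: 128-bit value to add modulo 2^128, or NULL to skip the addition.
 */
static void emit_sketch(const uint64_t h[2], const uint64_t *nonce,
			uint64_t out[2])
{
	uint64_t lo = h[0], hi = h[1];

	if (nonce) {
		lo += nonce[0];
		/* (lo < nonce[0]) is 1 exactly when the low half wrapped. */
		hi += nonce[1] + (lo < nonce[0]);
	}
	out[0] = lo;
	out[1] = hi;
}
```

With `h = { UINT64_MAX, 0 }` and a nonce of `{ 1, 0 }`, the low half wraps to 0 and the carry moves into the high half, giving `{ 0, 1 }`; with a `NULL` nonce the accumulator is returned unchanged.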
      
      Testing with kbench9000, depending on the CPU, the update function for
      the 32x32 version has been improved by 4%-7%, and for the 64x64 by
      19%-30%. The 32x32 gains are small, but I think there's great value in
      having a parallel implementation to the 64x64 one so that the two can be
      compared side-by-side as nice stand-alone units.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · e3419426
      Herbert Xu authored
      Merge crypto tree to pick up hisilicon patch.
  2. 09 Jan, 2020 35 commits