1. 11 Nov, 2022 2 commits
    • crypto: lib/gf128mul - make gf128mul_lle time invariant · b67ce439
      Ard Biesheuvel authored
      The gf128mul library has different variants with different
      memory/performance tradeoffs, where the faster ones use 4k or 64k lookup
      tables precomputed at runtime, which are based on one of the
      multiplication factors, which is commonly the key for keyed hash
      algorithms such as GHASH.
      
      The slowest variant is gf128mul_lle() [and its bbe/ble counterparts],
      which does not use precomputed lookup tables, but it still relies on a
      single u16[256] lookup table whose contents are input independent. The
      use of such a table may cause the execution time of gf128mul_lle() to
      correlate with the value of the inputs, which is generally something
      that must be avoided for cryptographic algorithms. On top of that, the
      function uses a sequence of if () statements that conditionally invoke
      be128_xor() based on which bits are set in the second argument of the
      function, which is usually a pointer to the multiplication factor that
      represents the key.
      
      In order to remove the correlation between the execution time of
      gf128mul_lle() and the value of its inputs, let's address the
      identified shortcomings:
      - add a time invariant version of gf128mul_x8_lle() that replaces the
        table lookup with the expression that is used at compile time to
        populate the lookup table;
      - make the invocations of be128_xor() unconditional, but pass a zero
        vector as the third argument if the associated bit in the key is
        cleared.
      
      The resulting code is likely to be significantly slower. However, given
      that this is the slowest version already, making it even slower in order
      to make it more secure is assumed to be justified.
      
      The bbe and ble counterparts could receive the same treatment, but the
      former is never used anywhere in the kernel, and the latter is only
      used in the driver for an asynchronous crypto h/w accelerator (Chelsio),
      where timing variances are unlikely to matter.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
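The two changes described in the gf128mul_lle entry above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the kernel's code: `struct u128`, the function names and the `bit_const` values are all hypothetical stand-ins. It shows a conditional XOR replaced by an unconditional XOR with a masked operand, and a table lookup replaced by evaluating the expression that would otherwise populate the table.

```c
#include <stdint.h>

/* Hypothetical 128-bit value, standing in for the kernel's be128. */
struct u128 {
	uint64_t a, b;
};

/* Branchy variant: whether the XOR runs depends on a (secret) key bit. */
static void xor_if_bit_branchy(struct u128 *r, const struct u128 *p,
			       unsigned int bit)
{
	if (bit & 1) {
		r->a ^= p->a;
		r->b ^= p->b;
	}
}

/* Masked variant: the XOR always runs; a cleared bit zeroes the operand. */
static void xor_if_bit_masked(struct u128 *r, const struct u128 *p,
			      unsigned int bit)
{
	uint64_t mask = 0 - (uint64_t)(bit & 1); /* all ones or all zeros */

	r->a ^= p->a & mask;
	r->b ^= p->b & mask;
}

/* Illustrative per-bit constants (assumed values, index = bit number). */
static const uint16_t bit_const[8] = {
	0x01c2, 0x0384, 0x0708, 0x0e10, 0x1c20, 0x3840, 0x7080, 0xe100,
};

/*
 * Compute what a 256-entry table built from bit_const would hold at
 * index i, without indexing memory by i and without branching on i.
 */
static uint16_t reduce_computed(uint8_t i)
{
	uint16_t v = 0;

	for (int k = 0; k < 8; k++) {
		uint16_t mask = (uint16_t)(0 - ((i >> k) & 1));

		v ^= bit_const[k] & mask;
	}
	return v;
}
```

Both masked forms do the same amount of work for every input value, which is the property the commit message is after. Whether a given compiler keeps them branch-free is a separate question that this sketch does not address.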
    • crypto: move gf128mul library into lib/crypto · 61c581a4
      Ard Biesheuvel authored
      The gf128mul library does not depend on the crypto API at all, so it can
      be moved into lib/crypto. This will allow us to use it in other library
      code in a subsequent patch without having to depend on CONFIG_CRYPTO.
      
      While at it, change the Kconfig symbol name to align with other crypto
      library implementations. However, the source file name is retained, as
      it is reflected in the module .ko filename, and changing this might
      break things for users.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  2. 04 Nov, 2022 18 commits
    • crypto: doc - use correct function name · 329cfa42
      Ralph Siemsen authored
      The hashing API does not have a function called .finish()
      Signed-off-by: Ralph Siemsen <ralph.siemsen@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - add CE implementation for GCM mode · ae1b83c7
      Tianjia Zhang authored
      This patch adds a CE-optimized assembly implementation of GCM mode.
      
      Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt modes 224 and
      226, comparing performance before and after this patch (the driver used
      before this patch is gcm_base(ctr-sm4-ce,ghash-generic)). Columns are
      block sizes in bytes; values are throughput in Mb/s:
      
      Before (gcm_base(ctr-sm4-ce,ghash-generic)):
      
      gcm(sm4)     |     16      64      256      512     1024     1420     4096     8192
      -------------+---------------------------------------------------------------------
        GCM enc    |  25.24   64.65   104.66   116.69   123.81   125.12   129.67   130.62
        GCM dec    |  25.40   64.80   104.74   116.70   123.81   125.21   129.68   130.59
        GCM mb enc |  24.95   64.06   104.20   116.38   123.55   124.97   129.63   130.61
        GCM mb dec |  24.92   64.00   104.13   116.34   123.55   124.98   129.56   130.48
      
      After:
      
      gcm-sm4-ce   |     16      64      256      512     1024     1420     4096     8192
      -------------+---------------------------------------------------------------------
        GCM enc    | 108.62  397.18   971.60  1283.92  1522.77  1513.39  1777.00  1806.96
        GCM dec    | 116.36  398.14  1004.27  1319.11  1624.21  1635.43  1932.54  1974.20
        GCM mb enc | 107.13  391.79   962.05  1274.94  1514.76  1508.57  1769.07  1801.58
        GCM mb dec | 113.40  389.36   988.51  1307.68  1619.10  1631.55  1931.70  1970.86
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - add CE implementation for CCM mode · 67fa3a7f
      Tianjia Zhang authored
      This patch adds a CE-optimized assembly implementation of CCM mode.
      
      Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt modes 223 and
      225, comparing performance before and after this patch (the driver used
      before this patch is ccm_base(ctr-sm4-ce,cbcmac-sm4-ce)). Columns are
      block sizes in bytes; values are throughput in Mb/s:
      
      Before (rfc4309(ccm_base(ctr-sm4-ce,cbcmac-sm4-ce))):
      
      ccm(sm4)     |     16      64     256     512    1024    1420    4096    8192
      -------------+---------------------------------------------------------------
        CCM enc    |  35.07  125.40  336.47  468.17  581.97  619.18  712.56  736.01
        CCM dec    |  34.87  124.40  335.08  466.75  581.04  618.81  712.25  735.89
        CCM mb enc |  34.71  123.96  333.92  465.39  579.91  617.49  711.45  734.92
        CCM mb dec |  34.42  122.80  331.02  462.81  578.28  616.42  709.88  734.19
      
      After (rfc4309(ccm-sm4-ce)):
      
      ccm-sm4-ce   |     16      64     256     512    1024    1420    4096    8192
      -------------+---------------------------------------------------------------
        CCM enc    |  77.12  249.82  569.94  725.17  839.27  867.71  952.87  969.89
        CCM dec    |  75.90  247.26  566.29  722.12  836.90  865.95  951.74  968.57
        CCM mb enc |  75.98  245.25  562.91  718.99  834.76  864.70  950.17  967.90
        CCM mb dec |  75.06  243.78  560.58  717.13  833.68  862.70  949.35  967.11
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - add CE implementation for cmac/xcbc/cbcmac · 6b5360a5
      Tianjia Zhang authored
      This patch adds a CE-optimized assembly implementation of cmac/xcbc/cbcmac.
      
      Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 300,
      comparing performance before and after this patch (the driver used
      before this patch is XXXmac(sm4-ce)). Columns are update sizes in bytes;
      values are throughput in Mb/s:
      
      Before:
      
      update-size    |      16      64     256    1024    2048    4096    8192
      ---------------+--------------------------------------------------------
      cmac(sm4-ce)   |  293.33  403.69  503.76  527.78  531.10  535.46  535.81
      xcbc(sm4-ce)   |  292.83  402.50  504.02  529.08  529.87  536.55  538.24
      cbcmac(sm4-ce) |  318.42  415.79  497.12  515.05  523.15  521.19  523.01
      
      After:
      
      update-size    |      16      64     256    1024    2048    4096    8192
      ---------------+--------------------------------------------------------
      cmac-sm4-ce    |  371.99  675.28  903.56  971.65  980.57  990.40  991.04
      xcbc-sm4-ce    |  372.11  674.55  903.47  971.61  980.96  990.42  991.10
      cbcmac-sm4-ce  |  371.63  675.33  903.23  972.07  981.42  990.93  991.45
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - add CE implementation for XTS mode · 01f63311
      Tianjia Zhang authored
      This patch adds a CE-optimized assembly implementation of XTS mode.
      
      Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 218,
      comparing performance before and after this patch (the driver used
      before this patch is xts(ecb-sm4-ce)). Columns are block sizes in bytes;
      values are throughput in Mb/s:
      
      Before:
      
      xts(ecb-sm4-ce) |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
              XTS enc |  117.17   430.56   732.92  1134.98  2007.03  2136.23  2347.20
              XTS dec |  116.89   429.02   733.40  1132.96  2006.13  2130.50  2347.92
      
      After:
      
      xts-sm4-ce      |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
              XTS enc |  224.68   798.91  1248.08  1714.60  2413.73  2467.84  2612.62
              XTS dec |  229.85   791.34  1237.79  1720.00  2413.30  2473.84  2611.95
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - add CE implementation for CTS-CBC mode · b1863fd0
      Tianjia Zhang authored
      This patch adds a CE-optimized assembly implementation of CTS-CBC mode.
      
      Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 218,
      comparing performance before and after this patch (the driver used
      before this patch is cts(cbc-sm4-ce)). Columns are block sizes in bytes;
      values are throughput in Mb/s:
      
      Before:
      
      cts(cbc-sm4-ce) |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
          CTS-CBC enc |  286.09   297.17   457.97   627.75   868.58   900.80   957.69
          CTS-CBC dec |  286.67   285.63   538.35   947.08  2241.03  2577.32  3391.14
      
      After:
      
      cts-cbc-sm4-ce  |      16       64      128      256     1024     1420     4096
      ----------------+--------------------------------------------------------------
          CTS-CBC enc |  288.19   428.80   593.57   741.04   911.73   931.80   950.00
          CTS-CBC dec |  292.22   468.99   838.23  1380.76  2741.17  3036.42  3409.62
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - export reusable CE acceleration functions · 45089dbe
      Tianjia Zhang authored
      The SM4 implementation accelerated with the Crypto Extension
      instructions contains functions that can be reused by the upcoming
      accelerated implementations of the GCM/CCM modes, and its CBC/CFB
      encryption is reused by the optimized SVESM4 implementation. Export
      these functions.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - simplify sm4_ce_expand_key() of CE implementation · cb9ba02b
      Tianjia Zhang authored
      Use a 128-bit swap mask and the tbl instruction to simplify the
      generation of the SM4 rkey_dec.
      
      Also fix the calls to sm4_ce_expand_key() that were not wrapped in
      kernel_neon_begin/end().
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - refactor and simplify CE implementation · ce41fefd
      Tianjia Zhang authored
      This patch does not add new features; it only refactors and simplifies
      the Crypto Extension accelerated implementation of the SM4 algorithm:
      
      Extract the SM4 Crypto Extension macros so that they can be reused in the
      subsequent optimization of CCM/GCM modes.
      
      Encryption in CBC and CFB modes processes four blocks at a time instead of
      one, allowing the ld1 instruction to load 64 bytes of data at a time, which
      reduces unnecessary memory accesses.
      
      CBC/CFB/CTR make full use of free registers to reduce redundant memory
      accesses, and some instructions are rearranged to improve out-of-order
      execution.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: tcrypt - add SM4 cts-cbc/xts/xcbc test · 3c383637
      Tianjia Zhang authored
      Add CTS-CBC/XTS/XCBC tests for the SM4 algorithm, along with the
      corresponding speed tests, to exercise the performance-optimized
      implementations of these modes.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: testmgr - add SM4 cts-cbc/xts/xcbc test vectors · c24ee936
      Tianjia Zhang authored
      Add test vectors for the CTS-CBC/XTS/XCBC modes of the SM4 algorithm,
      along with some additional test vectors for SM4 GCM/CCM.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm4 - refactor and simplify NEON implementation · 62508017
      Tianjia Zhang authored
      This patch does not add new features. The main work is to refactor and
      simplify the implementation of SM4 NEON, which is reflected in the
      following aspects:
      
      The accelerated implementation supports an arbitrary number of blocks,
      not just multiples of 8, which simplifies the implementation and speeds
      up processing of data that is not a multiple of 8 blocks.
      
      When loading the input data, use the ld4 instruction to replace the
      original ld1 instruction as much as possible, which will save the cost
      of matrix transposition of the input data.
      
      Use 8-block parallelism whenever possible to speed up matrix transpose
      and rotation operations, instead of up to 4-block parallelism.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
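The NEON refactoring entry above notes that loading with ld4 instead of ld1 saves a matrix transposition. A plain-C model can show why. It assumes 32-bit lanes and four 16-byte blocks, and the function names are hypothetical. A four-register structure load de-interleaves as it loads, so the registers already hold the transposed layout that a linear load followed by an explicit transpose would produce.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a linear load: register r simply holds block r. */
static void load_linear(const uint32_t in[16], uint32_t reg[4][4])
{
	for (size_t r = 0; r < 4; r++)
		for (size_t lane = 0; lane < 4; lane++)
			reg[r][lane] = in[4 * r + lane];
}

/* Explicit 4x4 transpose, as needed after a linear load. */
static void transpose4x4(uint32_t reg[4][4])
{
	for (size_t i = 0; i < 4; i++) {
		for (size_t j = i + 1; j < 4; j++) {
			uint32_t tmp = reg[i][j];

			reg[i][j] = reg[j][i];
			reg[j][i] = tmp;
		}
	}
}

/*
 * Model of a de-interleaving four-register load: word w of block b
 * lands in lane b of register w, i.e. the data arrives transposed.
 */
static void load_deinterleaved(const uint32_t in[16], uint32_t reg[4][4])
{
	for (size_t blk = 0; blk < 4; blk++)
		for (size_t word = 0; word < 4; word++)
			reg[word][blk] = in[4 * blk + word];
}
```

In this model `load_linear()` followed by `transpose4x4()` yields the same register contents as `load_deinterleaved()` alone, which is the saving the commit message refers to.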
    • crypto: arm64/sm3 - add NEON assembly implementation · a41b2129
      Tianjia Zhang authored
      This patch adds a NEON-accelerated implementation of the SM3 hash
      algorithm. The main algorithm is based on the SM3 NEON acceleration work
      of the libgcrypt project.
      
      Benchmarked on a T-Head Yitian-710 at 2.75 GHz using tcrypt mode 326,
      comparing sm3-generic, sm3-neon and sm3-ce. Columns are update sizes in
      bytes; values are throughput in Mb/s:
      
      update-size    |      16      64     256    1024    2048    4096    8192
      ---------------+--------------------------------------------------------
      sm3-generic    |  185.24  221.28  301.26  307.43  300.83  308.82  308.91
      sm3-neon       |  171.81  220.20  322.94  339.28  334.09  343.61  343.87
      sm3-ce         |  227.48  333.48  502.62  527.87  520.45  534.91  535.40
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm64/sm3 - raise the priority of the CE implementation · e1fa51aa
      Tianjia Zhang authored
      Raise the priority of the sm3-ce algorithm from 200 to 400 to make
      room for the implementation of sm3-neon.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
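To illustrate why the sm3-ce priority bump above matters, here is a minimal user-space sketch of priority-based selection among implementations that share an algorithm name. The `struct alg` type and `pick_alg()` are hypothetical, and the priorities used for sm3-generic and sm3-neon in the usage note are assumed values for illustration. Only the 200 to 400 change for sm3-ce comes from the commit message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical registration record: algorithm name, driver name, priority. */
struct alg {
	const char *name;
	const char *driver;
	int priority;
};

/* Return the highest-priority implementation of 'name', or NULL if none. */
static const struct alg *pick_alg(const struct alg *algs, size_t n,
				  const char *name)
{
	const struct alg *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (strcmp(algs[i].name, name) != 0)
			continue;
		if (!best || algs[i].priority > best->priority)
			best = &algs[i];
	}
	return best;
}
```

With entries such as `{"sm3", "sm3-generic", 100}`, `{"sm3", "sm3-neon", 200}` and `{"sm3", "sm3-ce", 400}`, a request for "sm3" resolves to sm3-ce. Had sm3-ce stayed at 200, it would no longer be strictly preferred over a NEON driver registered at the same value.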
    • crypto: tcrypt - Drop leading newlines from prints · 3513828c
      Anirudh Venkataramanan authored
      The top level print banners have a leading newline. It's not entirely
      clear why this exists, but it makes it harder to parse tcrypt test output
      using a script. Drop said newlines.
      
      tcrypt output before this patch:
      
      [...]
            testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] test 0 (160 bit key, 16 byte blocks): 1 operation in 2320 cycles (16 bytes)
      
      tcrypt output with this patch:
      
      [...] testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] test 0 (160 bit key, 16 byte blocks): 1 operation in 2320 cycles (16 bytes)
      Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: tcrypt - Drop module name from print string · a2ef5630
      Anirudh Venkataramanan authored
      The pr_fmt() define includes KBUILD_MODNAME, and so there's no need
      for pr_err() to also print it. Drop module name from the print string.
      Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: tcrypt - Use pr_info/pr_err · 837a99f5
      Anirudh Venkataramanan authored
      Currently, there's mixed use of printk() and pr_info()/pr_err(). The latter
      prints the module name (because pr_fmt() is defined so) but the former does
      not. As a result there's inconsistency in the printed output. For example:
      
      modprobe tcrypt mode=211:
      
      [...] test 0 (160 bit key, 16 byte blocks): 1 operation in 2320 cycles (16 bytes)
      [...] test 1 (160 bit key, 64 byte blocks): 1 operation in 2336 cycles (64 bytes)
      
      modprobe tcrypt mode=215:
      
      [...] tcrypt: test 0 (160 bit key, 16 byte blocks): 1 operation in 2173 cycles (16 bytes)
      [...] tcrypt: test 1 (160 bit key, 64 byte blocks): 1 operation in 2241 cycles (64 bytes)
      
      Replace all instances of printk() with pr_info()/pr_err() so that the
      module name is printed consistently.
      Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
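The prefix inconsistency described in the pr_info/pr_err entry above can be reproduced with a small user-space imitation. The `fake_printk()` and `fake_pr_info()` macros are hypothetical stand-ins that format into a buffer instead of the kernel log. The `pr_fmt()` definition follows the common kernel idiom of prepending KBUILD_MODNAME, which is assumed here to be "tcrypt".

```c
#include <stdio.h>

/* Assumed module name; in a kernel build this comes from the build system. */
#define KBUILD_MODNAME "tcrypt"

/* Common idiom: every pr_*() format string gets the module name prefix. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/* Stand-in for a raw printk(): the format string is used unchanged. */
#define fake_printk(buf, len, fmt, ...) \
	snprintf(buf, len, fmt, __VA_ARGS__)

/* Stand-in for pr_info(): the format string goes through pr_fmt() first. */
#define fake_pr_info(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), __VA_ARGS__)
```

Formatting the same message through both macros gives one line without the `tcrypt: ` prefix and one line with it, matching the mode=211 versus mode=215 difference shown in the commit message.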
    • crypto: tcrypt - Use pr_cont to print test results · fdaeb224
      Anirudh Venkataramanan authored
      For some test cases, a line break gets inserted between the test banner
      and the results. For example, with mode=211 this is the output:
      
      [...]
            testing speed of rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] test 0 (160 bit key, 16 byte blocks):
      [...] 1 operation in 2373 cycles (16 bytes)
      
      --snip--
      
      [...]
            testing speed of gcm(aes) (generic-gcm-aesni) encryption
      [...] test 0 (128 bit key, 16 byte blocks):
      [...] 1 operation in 2338 cycles (16 bytes)
      
      Similar behavior is seen in the following cases as well:
      
        modprobe tcrypt mode=212
        modprobe tcrypt mode=213
        modprobe tcrypt mode=221
        modprobe tcrypt mode=300 sec=1
        modprobe tcrypt mode=400 sec=1
      
      This doesn't happen with mode=215:
      
      [...] tcrypt:
                    testing speed of multibuffer rfc4106(gcm(aes)) (rfc4106-gcm-aesni) encryption
      [...] tcrypt: test 0 (160 bit key, 16 byte blocks): 1 operation in 2215 cycles (16 bytes)
      
      --snip--
      
      [...] tcrypt:
                    testing speed of multibuffer gcm(aes) (generic-gcm-aesni) encryption
      [...] tcrypt: test 0 (128 bit key, 16 byte blocks): 1 operation in 2191 cycles (16 bytes)
      
      This print inconsistency is because printk() is used instead of pr_cont()
      in a few places. Change these to be pr_cont().
      
      checkpatch warns that pr_cont() shouldn't be used. This can be ignored in
      this context as tcrypt already uses pr_cont().
      Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
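The stray line break described in the pr_cont entry above can be modelled with a toy log in user space. This is a simplified sketch, not the kernel's printk implementation. `struct toy_log` and its helpers are hypothetical, and the model assumes that a new message always starts a fresh line while a continuation is appended to the current one.

```c
#include <stddef.h>
#include <string.h>

/* Toy log buffer; holds a NUL-terminated transcript of what was "printed". */
struct toy_log {
	char buf[256];
	size_t len;
};

/* Append s to the log, truncating if the buffer is full. */
static void toy_append(struct toy_log *log, const char *s)
{
	size_t n = strlen(s);

	if (log->len + n >= sizeof(log->buf))
		n = sizeof(log->buf) - 1 - log->len;
	memcpy(log->buf + log->len, s, n);
	log->len += n;
	log->buf[log->len] = '\0';
}

/* New message: terminate any unfinished line first, then print. */
static void toy_printk(struct toy_log *log, const char *s)
{
	if (log->len && log->buf[log->len - 1] != '\n')
		toy_append(log, "\n");
	toy_append(log, s);
}

/* Continuation: keep writing on the current line. */
static void toy_pr_cont(struct toy_log *log, const char *s)
{
	toy_append(log, s);
}
```

Printing the banner `"test 0 (...): "` and then the result with `toy_printk()` splits the output over two lines, whereas printing the result with `toy_pr_cont()` keeps it on the banner's line, which is the behaviour the patch restores.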
  3. 28 Oct, 2022 20 commits