    crypto: api - Fix generic algorithm self-test races · e7a4142b
    Herbert Xu authored
    On Fri, Aug 30, 2024 at 10:51:54AM -0700, Eric Biggers wrote:
    >
    > Given below in defconfig form, use 'make olddefconfig' to apply.  The failures
    > are nondeterministic and sometimes there are different ones, for example:
    >
    > [    0.358017] alg: skcipher: failed to allocate transform for cbc(twofish-generic): -2
    > [    0.358365] alg: self-tests for cbc(twofish) using cbc(twofish-generic) failed (rc=-2)
    > [    0.358535] alg: skcipher: failed to allocate transform for cbc(camellia-generic): -2
    > [    0.358918] alg: self-tests for cbc(camellia) using cbc(camellia-generic) failed (rc=-2)
    > [    0.371533] alg: skcipher: failed to allocate transform for xts(ecb(aes-generic)): -2
    > [    0.371922] alg: self-tests for xts(aes) using xts(ecb(aes-generic)) failed (rc=-2)
    >
    > Modules are not enabled, maybe that matters (I haven't checked yet).
    
    Yes I think that was the key.  This triggers a massive self-test
    run which executes in parallel and reveals a few race conditions
    in the system.  I think it boils down to the following scenario:
    
    Base algorithm X-generic, X-optimised
    Template Y
    Optimised algorithm Y-X-optimised
    
    Everything gets registered, and then the self-tests are started.
    When Y-X-optimised gets tested, it requests the creation of the
    generic Y(X-generic), which then itself undergoes testing.
    
    The race is that after Y(X-generic) gets registered, but just
    before it gets tested, X-optimised finally finishes self-testing
    which then causes all spawns of X-generic to be destroyed.  So
    by the time the self-test for Y(X-generic) comes along, it can
    no longer find the algorithm.  This error then bubbles all the
    way up to the self-test of Y-X-optimised which then fails.
    
    Note that there is some complexity that I've omitted here because
    when the generic self-test fails to find Y(X-generic) it actually
    triggers the construction of it again which then fails for various
    other reasons (these are not important because the construction
    should *not* be triggered at this point).
    
    So in a way the error is expected, and we should probably remove
    the pr_err for the case where ENOENT is returned for the algorithm
    that we're currently testing.
    
    The solution is two-fold.  First when an algorithm undergoes
    self-testing it should not trigger its construction.  Secondly
    if an instance larval fails to materialise due to it being destroyed
    by a more optimised algorithm coming along, it should obviously
    retry the construction.
    
    Remove the check in __crypto_alg_lookup that stops a larval from
    matching new requests based on differences in the mask.  It is better
    to block new requests even if it is wrong and then simply retry the
    lookup.  If this ends up being the wrong larval it will sort itself
    out during the retry.
    
    Reduce the CRYPTO_ALG_TYPE_MASK bits in type during larval creation
    as otherwise LSKCIPHER algorithms may not match SKCIPHER larvals.
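
    As a rough illustration of the type reduction described above, the
    toy model below keeps a type bit only where the request mask selects
    it.  This is a sketch and not the kernel code: the toy_* names are
    hypothetical, and the constant values are assumptions standing in
    for CRYPTO_ALG_TYPE_MASK and the SKCIPHER/LSKCIPHER type codes.

```c
/* Toy model only. All values below are assumed for illustration;
 * consult the kernel headers for the real constants. */
#define TOY_TYPE_MASK       0x0fu  /* bits that encode the algorithm type */
#define TOY_TYPE_LSKCIPHER  0x04u  /* assumed LSKCIPHER type code */
#define TOY_TYPE_SKCIPHER   0x05u  /* assumed SKCIPHER type code */
#define TOY_SKCIPHER_MASK   0x0eu  /* ignores the bit separating the two */

/* Keep a type bit only where the request mask actually selects it.
 * Bits outside TOY_TYPE_MASK are passed through unchanged. */
static unsigned int toy_reduce_type(unsigned int type, unsigned int mask)
{
	return type & (~TOY_TYPE_MASK | mask);
}

/* Nonzero when a and b agree on every bit selected by mask. */
static int toy_types_match(unsigned int a, unsigned int b, unsigned int mask)
{
	return ((a ^ b) & mask) == 0;
}
```

    Under these assumed values, a stored type of 0x05 does not match
    0x04 when all type bits are compared, whereas the reduced type
    (0x05 with mask 0x0e gives 0x04) does.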
    
    Also block the instance creation during self-testing in the function
    crypto_larval_lookup by checking for CRYPTO_ALG_TESTED in the mask
    field.
    
    Finally change the return value when crypto_alg_lookup fails in
    crypto_larval_wait to EAGAIN to redo the lookup.
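
    The last two changes can be sketched with a small user-space model.
    Everything here is hypothetical (the toy_* names, the registry
    struct, the flag value, the retry limit) and it only mirrors the
    control flow described above: a lookup made on behalf of a self-test
    never constructs an instance, and a larval that vanishes while being
    waited on yields -EAGAIN so that the caller redoes the lookup.

```c
#include <errno.h>

/* Hypothetical stand-in for CRYPTO_ALG_TESTED; the value is assumed. */
#define TOY_ALG_TESTED 0x400u

struct toy_registry {
	int present;        /* 1 while the instance is registered */
	int vanish_budget;  /* how many waits will find it destroyed */
	int constructions;  /* how many times an instance was built */
};

/* Wait on the larval; report -EAGAIN if it was torn down meanwhile,
 * for example by a more optimised algorithm finishing its test. */
static int toy_larval_wait(struct toy_registry *r)
{
	if (r->vanish_budget > 0) {
		r->vanish_budget--;
		r->present = 0;
		return -EAGAIN;
	}
	return 0;
}

/* One lookup attempt. With TOY_ALG_TESTED set in mask (the self-test
 * case in this model), a missing instance is never constructed. */
static int toy_lookup_once(struct toy_registry *r, unsigned int mask)
{
	if (!r->present) {
		if (mask & TOY_ALG_TESTED)
			return -ENOENT;
		r->constructions++;
		r->present = 1;
	}
	return toy_larval_wait(r);
}

/* Redo the lookup for as long as it reports -EAGAIN, up to max_tries. */
static int toy_lookup(struct toy_registry *r, unsigned int mask, int max_tries)
{
	int err = -EAGAIN;

	while (max_tries-- > 0 && err == -EAGAIN)
		err = toy_lookup_once(r, mask);
	return err;
}
```

    In this model an ordinary lookup whose instance is destroyed once
    succeeds on the second attempt after rebuilding it, while a
    self-test lookup for a missing instance returns -ENOENT without
    constructing anything.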
    
    Fixes: 37da5d0f ("crypto: api - Do not wait for tests during registration")
    Reported-by: Eric Biggers <ebiggers@kernel.org>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
api.c 16.2 KB