    raid6: Add LoongArch SIMD recovery implementation · f2091321
    WANG Xuerui authored
    Similar to the syndrome calculation, the recovery algorithms also work
    on 64 bytes at a time to align with the L1 cache line size of current
    and future LoongArch cores (that we care about), which means
    unrolled-by-4 LSX and unrolled-by-2 LASX code.
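
The unroll factors follow from the register widths: LSX vectors are 128 bits (16 bytes) wide and LASX vectors are 256 bits (32 bytes) wide, so one 64-byte block takes four LSX or two LASX registers. A minimal sketch of that arithmetic; the constant and function names are illustrative and do not come from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants; these names are not taken from the patch. */
enum {
	CACHE_LINE_BYTES = 64,	/* bytes handled per loop iteration */
	LSX_VEC_BYTES    = 16,	/* LSX registers are 128 bits wide */
	LASX_VEC_BYTES   = 32,	/* LASX registers are 256 bits wide */
};

/* Number of vector registers needed to cover one 64-byte block. */
static size_t unroll_factor(size_t vec_bytes)
{
	/* The block must split evenly into whole vectors. */
	assert(vec_bytes != 0 && CACHE_LINE_BYTES % vec_bytes == 0);
	return CACHE_LINE_BYTES / vec_bytes;
}
```

Calling `unroll_factor(LSX_VEC_BYTES)` gives the LSX factor and `unroll_factor(LASX_VEC_BYTES)` the LASX factor.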
    
    The assembly is originally based on the x86 SSSE3/AVX2 ports, but
    register allocation has been redone to take advantage of LSX/LASX's 32
    vector registers, and the instruction sequences have been optimized to
    suit (e.g. LoongArch can perform per-byte srl and andi on vectors, but
    x86 cannot).
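
To show where the per-byte srl and andi come in, here is a scalar C model of a single byte lane. It assumes the usual split-nibble table approach to multiplying by a constant in GF(2^8) with the RAID-6 polynomial 0x11d. The names `gf_mul`, `nibble_tables`, `build_tables` and `mul_by_tables` are hypothetical, and this is a sketch of the technique rather than the assembly in this patch:

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		/* Multiply a by x, reducing when the top bit falls off. */
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

/* Two 16-entry tables for multiplying by the constant c (hypothetical layout). */
struct nibble_tables {
	uint8_t lo[16];	/* c * n        for n = 0..15 */
	uint8_t hi[16];	/* c * (n << 4) for n = 0..15 */
};

static void build_tables(uint8_t c, struct nibble_tables *t)
{
	for (unsigned int n = 0; n < 16; n++) {
		t->lo[n] = gf_mul(c, (uint8_t)n);
		t->hi[n] = gf_mul(c, (uint8_t)(n << 4));
	}
}

/* Scalar model of one byte lane: mask, shift, two lookups, xor. */
static uint8_t mul_by_tables(const struct nibble_tables *t, uint8_t x)
{
	uint8_t lo = x & 0x0f;	/* the per-byte "andi" step */
	uint8_t hi = x >> 4;	/* the per-byte "srl" step */

	return t->lo[lo] ^ t->hi[hi];
}
```

Because multiplication by a constant is linear over XOR, the two lookups combine to the full product. A vector implementation applies the same mask, shift and lookups to every byte of a register at once.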
    
    Performance numbers measured by instrumenting the raid6test code, on a
    3A5000 system clocked at 2.5GHz:
    
    > lasx  2data: 354.987 MiB/s
    > lasx  datap: 350.430 MiB/s
    > lsx   2data: 340.026 MiB/s
    > lsx   datap: 337.318 MiB/s
    > intx1 2data: 164.280 MiB/s
    > intx1 datap: 187.966 MiB/s
    
    Because recovery algorithms are chosen solely based on priority and
    availability, lasx is marked as priority 2 and lsx priority 1. At least
    for the current generation of LoongArch micro-architectures, LASX should
    always be faster than LSX whenever supported, and have similar power
    consumption characteristics (because the only known LASX-capable uarch,
    the LA464, always computes the full 256-bit result for vector ops).
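
The selection rule described above (the highest priority among the available algorithms wins) can be sketched as follows. The `recov_algo` struct, its `available` flag and `choose_recov` are hypothetical stand-ins for illustration, not the kernel's actual data structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor for one recovery implementation. */
struct recov_algo {
	const char *name;
	int priority;	/* higher value wins */
	bool available;	/* stand-in for a runtime capability check */
};

/*
 * Return the available algorithm with the highest priority,
 * or NULL if none is available. No benchmarking is involved.
 */
static const struct recov_algo *choose_recov(const struct recov_algo *algos,
					     size_t n)
{
	const struct recov_algo *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!algos[i].available)
			continue;
		if (!best || algos[i].priority > best->priority)
			best = &algos[i];
	}
	return best;
}
```

With lasx at priority 2 and lsx at priority 1, a machine supporting both picks lasx, and a machine with only LSX falls back to lsx.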

    Acked-by: Song Liu <song@kernel.org>
    Signed-off-by: WANG Xuerui <git@xen0n.name>
    Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
algos.c 6.4 KB