LoongArch: vDSO: Tune chacha implementation

As Christophe pointed out, tuning the chacha implementation by scheduling the instructions the way GCC does can improve the performance.

The tuning does not introduce too much complexity (basically it's just reordering some instructions). And the tuning does not hurt readability too much: the tuned code actually looks even more similar to a textbook-style implementation based on 128-bit vectors. So overall it's a good deal to me.

Tested with vdso_test_getchacha and benched with vdso_test_getrandom. On a LA664 the speedup is 5%, and I expect a larger speedup on LA[2-4]64 with a lower issue rate.

Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/all/77655d9e-fc05-4300-8f0d-7b2ad840d091@csgroup.eu/
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
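
For readers unfamiliar with the "textbook-style implementation based on 128-bit vectors" mentioned above, here is a minimal C sketch of the idea. It is not the LoongArch assembly touched by this commit. It only illustrates how the four quarter rounds of a ChaCha column round can be interleaved step by step, with each step applied to a whole row of four 32-bit words, which is presumably the kind of ordering the message refers to. The helper names rotl32, row_step and column_round_rows are hypothetical; the quarter round itself follows RFC 8439.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
	return (x << n) | (x >> (32 - n));
}

/* Scalar ChaCha quarter round as specified in RFC 8439, section 2.1. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d ^= *a; *d = rotl32(*d, 16);
	*c += *d; *b ^= *c; *b = rotl32(*b, 12);
	*a += *b; *d ^= *a; *d = rotl32(*d, 8);
	*c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/*
 * Hypothetical helper: apply one quarter-round step
 * "x += y; z ^= x; z <<<= n" to all four lanes of a row.
 * With real 128-bit vectors this would be three instructions;
 * with scalar registers it is four independent instruction
 * chains that can be issued back to back.
 */
static void row_step(uint32_t x[4], const uint32_t y[4], uint32_t z[4],
		     unsigned n)
{
	for (int i = 0; i < 4; i++) {
		x[i] += y[i];
		z[i] ^= x[i];
		z[i] = rotl32(z[i], n);
	}
}

/*
 * Hypothetical helper: a column round over the state rows a, b, c, d,
 * i.e. four quarter rounds performed in lock step, one per lane.
 */
static void column_round_rows(uint32_t a[4], uint32_t b[4],
			      uint32_t c[4], uint32_t d[4])
{
	row_step(a, b, d, 16);
	row_step(c, d, b, 12);
	row_step(a, b, d, 8);
	row_step(c, d, b, 7);
}
```

Lane i of column_round_rows computes the same result as quarter_round on (a[i], b[i], c[i], d[i]). The only difference is the order in which the independent operations appear, which is what gives an in-order or narrow-issue core more work to overlap.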