Commit 1aa00ff3 authored by Michał Górny, committed by Benjamin Peterson

fixes bpo-31834: Use optimized code for BLAKE2 only with SSSE3+ (#4066)

Rework the selection of BLAKE2 code paths: instead of using the optimized
variant on every x86_64 machine, use it only when SSSE3 or a newer
supported instruction set is available.

First, this avoids the pure SSE2 code path on x86_64 machines. As
reported in the bug, that code is slower than the reference code on all
tested x86_64 machines. On an Athlon64, which lacks SSSE3, it is as much
as 2.5 times slower than the reference code. Checking for SSSE3
therefore ensures that the optimized implementation is used only when it
has a chance of performing better.

Second, this makes it possible to use SSSE3+ optimizations on 32-bit
x86 systems, giving up to a 2x speedup on modern 32-bit x86 systems
(tested in a 32-bit chroot).
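
As a minimal sketch of the compile-time check described above, the snippet
below reports which variant the same preprocessor condition would select for
a given set of compiler flags. BLAKE2_VARIANT, blake2_selected_variant() and
blake2_uses_optimized() are hypothetical names used only for illustration and
are not part of the patch; only the macro test itself is taken from the diff.

```c
/* Sketch: mirror the patch's preprocessor condition and expose the result.
 * All names defined here are hypothetical, for illustration only. */

#if defined(__SSSE3__) || defined(__SSE4_1__) || defined(__AVX__) || defined(__XOP__)
#define BLAKE2_VARIANT "optimized"   /* the patch includes impl/blake2b.c here */
#else
#define BLAKE2_VARIANT "reference"   /* the patch includes impl/blake2b-ref.c here */
#endif

/* Name of the variant chosen at compile time. */
const char *blake2_selected_variant(void)
{
    return BLAKE2_VARIANT;
}

/* Non-zero when the SSSE3+ condition held at compile time. */
int blake2_uses_optimized(void)
{
    return BLAKE2_VARIANT[0] == 'o';
}
```

With GCC or Clang, these macros are predefined only when the matching
instruction set is enabled, for example through -mssse3 or a -march value
that implies it, so the result depends on the build flags rather than on the
machine running the build.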
parent 3b66ebe7
Use optimized code for BLAKE2 only with SSSE3+. The pure SSE2 implementation
is slower than the pure C reference implementation.
@@ -26,7 +26,9 @@
 #include "impl/blake2.h"
 #include "impl/blake2-impl.h" /* for secure_zero_memory() and store48() */
-#ifdef BLAKE2_USE_SSE
+/* pure SSE2 implementation is very slow, so only use the more optimized SSSE3+
+ * https://bugs.python.org/issue31834 */
+#if defined(__SSSE3__) || defined(__SSE4_1__) || defined(__AVX__) || defined(__XOP__)
 #include "impl/blake2b.c"
 #else
 #include "impl/blake2b-ref.c"
......
@@ -26,7 +26,9 @@
 #include "impl/blake2.h"
 #include "impl/blake2-impl.h" /* for secure_zero_memory() and store48() */
-#ifdef BLAKE2_USE_SSE
+/* pure SSE2 implementation is very slow, so only use the more optimized SSSE3+
+ * https://bugs.python.org/issue31834 */
+#if defined(__SSSE3__) || defined(__SSE4_1__) || defined(__AVX__) || defined(__XOP__)
 #include "impl/blake2s.c"
 #else
 #include "impl/blake2s-ref.c"
......
@@ -922,19 +922,10 @@ class PyBuildExt(build_ext):
                                         'Modules/_blake2/impl/*'))
         blake2_deps.append('hashlib.h')
-        blake2_macros = []
-        if (not cross_compiling and
-                os.uname().machine == "x86_64" and
-                sys.maxsize > 2**32):
-            # Every x86_64 machine has at least SSE2. Check for sys.maxsize
-            # in case that kernel is 64-bit but userspace is 32-bit.
-            blake2_macros.append(('BLAKE2_USE_SSE', '1'))
         exts.append( Extension('_blake2',
                                ['_blake2/blake2module.c',
                                 '_blake2/blake2b_impl.c',
                                 '_blake2/blake2s_impl.c'],
-                               define_macros=blake2_macros,
                                depends=blake2_deps) )
         sha3_deps = glob(os.path.join(os.getcwd(), srcdir,
......