    x86-64: word-at-a-time: improve byte count calculations · 4b8fa117
    Linus Torvalds authored
    This switches x86-64 over to using 'tzcount' instead of the integer
    multiply trick to turn the bytemask information into actual byte counts.
    
    We even had a comment saying that a fast bit count instruction is better
    than a multiply, but x86 bit counting has traditionally been
    "questionably fast", and so avoiding it was the right thing back in the
    day.
    
    Now, on any halfway modern core, using bit counting is cheaper and
    smaller than the large constant multiply, so let's just switch over.
    
    Note that as part of switching over to counting bits, we also do it at a
    different point.  We used to create the byte count from the final byte
    mask, but once you use the 'tzcount' instruction (aka 'bsf' on older
    CPUs), you can actually count the trailing zero bits (on little-endian
    x86, these correspond to the bytes that come first in memory) using a
    value we have available earlier.
    
    In fact, we can just use the very first mask of bits that tells us
    whether we have any zero bytes at all.  In that mask, the high bit of
    the first zero byte is the lowest set bit, so just doing 'tzcount' on
    that value and
    dividing by 8 will give the number of bytes that precede the first NUL
    character, which is exactly what we want.
    
    Note also that the input value to the tzcount is by definition not zero,
    since that is the condition that we already used to check the whole "do
    we have any zero bytes at all".  So we don't need to worry about the
    legacy instruction behavior of pre-tzcount days, when 'bsf' left its
    result undefined for a zero input.
    
    The 32-bit code continues to use the simple bit op trick, which is
    faster even on newer cores, and especially on the older 32-bit-only
    ones.

    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>