• Linus Torvalds's avatar
    x86: explicitly align IO accesses in memcpy_{to,from}io · c228d294
    Linus Torvalds authored
    In commit 170d13ca ("x86: re-introduce non-generic memcpy_{to,from}io")
    I made our copy from IO space use a separate copy routine rather than
    rely on the generic memcpy.  I did that because our generic memory copy
    isn't actually well-defined when it comes to internal access ordering or
    alignment, and will in fact depend on various CPUID flags.
    
    In particular, the default memcpy() for a modern Intel CPU will
    generally be just a "rep movsb", which works reasonably well for
    medium-sized memory copies of regular RAM, since the CPU will turn it
    into fairly optimized microcode.
    
    However, for non-cached memory and IO, "rep movs" ends up being
    horrendously slow and will just do the architectural "one byte at a
    time" accesses implied by the movsb.
    
    At the other end of the spectrum, if you _don't_ end up using the "rep
    movsb" code, you'd likely fall back to the software copy, which does
    overlapping accesses for the tail, and may copy things backwards.
    Again, for regular memory that's fine, for IO memory not so much.
    
    The thinking was that clearly nobody really cared (because things
    worked), but some people had seen horrible performance due to the byte
    accesses, so let's just revert back to our long ago version that dod
    "rep movsl" for the bulk of the copy, and then fixed up the potentially
    last few bytes of the tail with "movsw/b".
    
    Interestingly (and perhaps not entirely surprisingly), while that was
    our original memory copy implementation, and had been used before for
    IO, in the meantime many new users of memcpy_*io() had come about.  And
    while the access patterns for the memory copy weren't well-defined (so
    arguably _any_ access pattern should work), in practice the "rep movsb"
    case had been very common for the last several years.
    
    In particular Jarkko Sakkinen reported that the memcpy_*io() change
    resuled in weird errors from his Geminilake NUC TPM module.
    
    And it turns out that the TPM TCG accesses according to spec require
    that the accesses be
    
     (a) done strictly sequentially
    
     (b) be naturally aligned
    
    otherwise the TPM chip will abort the PCI transaction.
    
    And, in fact, the tpm_crb.c driver did this:
    
    	memcpy_fromio(buf, priv->rsp, 6);
    	...
    	memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);
    
    which really should never have worked in the first place, but back
    before commit 170d13ca it *happened* to work, because the
    memcpy_fromio() would be expanded to a regular memcpy, and
    
     (a) gcc would expand the first memcpy in-line, and turn it into a
         4-byte and a 2-byte read, and they happened to be in the right
         order, and the alignment was right.
    
     (b) gcc would call "memcpy()" for the second one, and the machines that
         had this TPM chip also apparently ended up always having ERMS
         ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep
         movbs" for that copy.
    
    In other words, basically by pure luck, the code happened to use the
    right access sizes in the (two different!) memcpy() implementations to
    make it all work.
    
    But after commit 170d13ca, both of the memcpy_fromio() calls
    resulted in a call to the routine with the consistent memory accesses,
    and in both cases it started out transferring with 4-byte accesses.
    Which worked for the first copy, but resulted in the second copy doing a
    32-bit read at an address that was only 2-byte aligned.
    
    Jarkko is actually fixing the fragile code in the TPM driver, but since
    this is an excellent example of why we absolutely must not use a generic
    memcpy for IO accesses, _and_ an IO-specific one really should strive to
    align the IO accesses, let's do exactly that.
    
    Side note: Jarkko also noted that the driver had been used on ARM
    platforms, and had worked.  That was because on 32-bit ARM, memcpy_*io()
    ends up always doing byte accesses, and on 64-bit ARM it first does byte
    accesses to align to 8-byte boundaries, and then does 8-byte accesses
    for the bulk.
    
    So ARM actually worked by design, and the x86 case worked by pure luck.
    
    We *might* want to make x86-64 do the 8-byte case too.  That should be a
    pretty straightforward extension, but let's do one thing at a time.  And
    generally MMIO accesses aren't really all that performance-critical, as
    shown by the fact that for a long time we just did them a byte at a
    time, and very few people ever noticed.
    Reported-and-tested-by: default avatarJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
    Tested-by: default avatarJerry Snitselaar <jsnitsel@redhat.com>
    Cc: David Laight <David.Laight@aculab.com>
    Fixes: 170d13ca ("x86: re-introduce non-generic memcpy_{to,from}io")
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    c228d294
iomem.c 1.58 KB