    NFSv3: handle out-of-order write replies. · 3db63daa
    NeilBrown authored
    NFSv3 includes pre/post wcc attributes which allow the client to
    determine if all changes to the file have been made by the client
    itself, or if any might have been made by some other client.
    
    If there are gaps in the pre/post ctime sequence it must be assumed that
    some other client changed the file in that gap and the local cache must
    be suspect.  The next time the file is opened the cache should be
    invalidated.
    
    Since commit 1c341b77 ("NFS: Add deferred cache invalidation for
    close-to-open consistency violations") in Linux 5.3 the Linux client has
    been triggering this invalidation.  The chunk in nfs_update_inode() in
    particular triggers it.
    
    Unfortunately Linux NFS assumes that all replies will be processed in
    the order sent, and will arrive in the order processed.  This is not
    true in general.  Consequently Linux NFS might ignore the wcc info in a
    WRITE reply because the reply is in response to a WRITE that was sent
    before some other request for which a reply has already been seen.  This
    is detected by Linux using the gencount tests in nfs_inode_attr_cmp().
    
    Also, when the gencount tests pass it is still possible that the requests
    were processed on the server in a different order, and a gap seen in
    the ctime sequence might be filled in by a subsequent reply, so gaps
    should not immediately trigger delayed invalidation.
    
    The net result is that writing to a server and then reading the file
    back can result in going to the server for the read rather than serving
    it from cache - all because a couple of replies arrived out-of-order.
    This is a performance regression over kernels before 5.3, though the
    change in 5.3 is a correctness improvement.
    
    This has been seen with Linux writing to a Netapp server which
    occasionally re-orders requests.  In testing the majority of requests
    were in-order, but a few (maybe two or three at a time) could be
    re-ordered.
    
    This patch addresses the problem by recording any gaps seen in the
    pre/post ctime sequence and not triggering invalidation until either
    there are too many gaps to fit in the table, or until there are no more
    active writes and the remaining gaps cannot be resolved.
    
    We allocate a table of 16 gaps on demand.  If the allocation fails we
    revert to current behaviour which is of little cost as we are unlikely
    to be able to cache the writes anyway.
    
    In the table we store "start->end" pairs when iversion is updated and
    "end<-start" pairs for the pre/post pairs reported by the server.  Usually these
    exactly cancel out and so nothing is stored.  When there are
    out-of-order replies we do store gaps and these will eventually be
    cancelled against later replies when this client is the only writer.
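    
    The cancelling can be illustrated with a small userspace model.  This is
    only a sketch under assumptions, not the code in this patch: the names
    gap_table and gap_table_merge() are hypothetical, and the model assumes
    that an iversion update from A to B is recorded as merge(A, B) while a
    server pre/post pair P/Q is recorded reversed, as merge(Q, P).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GAP_TABLE_SIZE 16	/* table size named in the text */

/* Hypothetical userspace model; not the kernel's data structure. */
struct gap {
	uint64_t start, end;
};

struct gap_table {
	size_t cnt;
	struct gap gap[GAP_TABLE_SIZE];
};

/*
 * Add the directed range start->end, chaining it with any stored range
 * that shares an endpoint.  A range whose two ends meet has cancelled
 * out and is dropped.  Returns false when the table is full, in which
 * case the caller is expected to fall back to invalidation.
 */
static bool gap_table_merge(struct gap_table *t, uint64_t start, uint64_t end)
{
	size_t i = 0;

	assert(t->cnt <= GAP_TABLE_SIZE);
	while (i < t->cnt) {
		if (end == t->gap[i].start) {
			end = t->gap[i].end;
		} else if (start == t->gap[i].end) {
			start = t->gap[i].start;
		} else {
			i++;
			continue;
		}
		/* Entry i was absorbed: remove it and rescan from the top. */
		t->gap[i] = t->gap[--t->cnt];
		i = 0;
	}
	if (start == end)
		return true;	/* fully cancelled, nothing to store */
	if (t->cnt == GAP_TABLE_SIZE)
		return false;	/* too many gaps to track */
	t->gap[t->cnt].start = start;
	t->gap[t->cnt].end = end;
	t->cnt++;
	return true;
}
```

    In this model, if iversion jumps from 10 to 12 because the reply for the
    second of two writes (pre 11, post 12) is seen first, merge(10, 12)
    followed by merge(12, 11) leaves the single gap 10->11.  The late reply
    (pre 10, post 11) is then recorded as merge(11, 10) and empties the
    table.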
    
    If the final write is out-of-order there may be one gap remaining when
    the file is closed.  This will be noticed and if there is precisely one
    gap and if the iversion can be advanced to match it, then we do so.
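    
    The close-time check might look roughly like the following userspace
    sketch.  The struct wcc_gap and resolve_gaps_at_close() names are
    hypothetical, and the sketch assumes a leftover gap is held as a plain
    pre->post pair; the patch itself may represent it differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of an unresolved server-reported change: pre -> post. */
struct wcc_gap {
	uint64_t pre, post;
};

/*
 * Called when no writes remain in flight.  Returns true if the cache can
 * still be trusted, advancing *iversion across the single leftover gap
 * when that gap begins exactly at the current value.
 */
static bool resolve_gaps_at_close(uint64_t *iversion,
				  const struct wcc_gap *gaps, size_t cnt)
{
	if (cnt == 0)
		return true;
	if (cnt == 1 && gaps[0].pre == *iversion) {
		*iversion = gaps[0].post;
		return true;
	}
	return false;	/* unresolved gaps: another writer must be assumed */
}
```

    A false return corresponds to the case where the remaining gaps cannot
    be resolved and invalidation has to be triggered.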
    
    This patch makes no attempt to handle directories correctly.  The same
    problem potentially exists in that out-of-order replies to create/unlink
    requests can cause future lookup requests to be sent to the server
    unnecessarily.  A similar scheme using the same primitives could be used
    to notice and handle out-of-order replies.
    
    Signed-off-by: NeilBrown <neilb@suse.de>
    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>