    NFSv4 client live hangs after live data migration recovery · 0f90be13
    Bill Baker authored
    After a live data migration event at the NFS server, the client may send
    I/O requests to the wrong server, causing a live hang due to repeated
    recovery events.  On the wire, this will appear as an I/O request failing
    with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly.
    NFS4ERR_BADSESSION is returned because the session ID being used was
    issued by the other server and is not valid at the old server.
    
    The failure is caused by async worker threads having cached the transport
    (xprt) in the rpc_task structure.  After the migration recovery completes,
    the task is redispatched and the task resends the request to the wrong
    server based on the old value still present in tk_xprt.
    
    The solution is to recompute the tk_xprt field of the rpc_task structure
    so that the request goes to the correct server.
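
The stale-transport problem and the fix described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel patch itself: the `toy_*` types and functions are hypothetical stand-ins for `rpc_task`, the RPC client, and the transport, and only the field name `tk_xprt` comes from the commit message. It contrasts a task that keeps its cached transport on redispatch with one that recomputes it from the client.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the kernel structures; every toy_* name is hypothetical. */
struct toy_xprt { const char *server; };
struct toy_clnt { struct toy_xprt *cl_xprt; };  /* client's current transport */
struct toy_task {
    struct toy_clnt *tk_client;
    struct toy_xprt *tk_xprt;                   /* transport cached in the task */
};

/* Stale behaviour: choose a transport only if none is cached yet. */
static void toy_set_transport_once(struct toy_task *task)
{
    if (task->tk_xprt == NULL)
        task->tk_xprt = task->tk_client->cl_xprt;
}

/* Fixed behaviour: always recompute the transport from the client. */
static void toy_recompute_transport(struct toy_task *task)
{
    task->tk_xprt = task->tk_client->cl_xprt;
}

/* Model migration recovery: the client now points at the destination server. */
static void toy_migrate(struct toy_clnt *clnt, struct toy_xprt *dest)
{
    clnt->cl_xprt = dest;
}

/* Return the name of the server a redispatched request would be sent to. */
static const char *toy_retry_target(int recompute)
{
    struct toy_xprt old_srv = { "old-server" };
    struct toy_xprt new_srv = { "new-server" };
    struct toy_clnt clnt = { &old_srv };
    struct toy_task task = { &clnt, NULL };

    toy_set_transport_once(&task);  /* first dispatch caches the old server */
    toy_migrate(&clnt, &new_srv);   /* migration recovery completes */

    if (recompute)
        toy_recompute_transport(&task);  /* redispatch with the fix */
    else
        toy_set_transport_once(&task);   /* redispatch keeps the stale value */

    return task.tk_xprt->server;         /* points at a string literal */
}
```

In this model, `toy_retry_target(0)` yields the old server's name, mirroring the repeated NFS4ERR_BADSESSION loop, while `toy_retry_target(1)` yields the destination server's name.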
    
    Signed-off-by: Bill Baker <bill.baker@oracle.com>
    Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
    Tested-by: Helen Chao <helen.chao@oracle.com>
    Fixes: fb43d172 ("SUNRPC: Use the multipath iterator to assign a ...")
    Cc: stable@vger.kernel.org # v4.9+
    Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
clnt.c 69.1 KB