• Sagi Grimberg's avatar
    nvme-tcp: fix possible use-after-completion · 825619b0
    Sagi Grimberg authored
    Commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq
    context") added a second context that may perform a network send.
    This means that now RX and TX are not serialized in nvme_tcp_io_work
    and can run concurrently.
    
    While there is correct mutual exclusion in the TX path (where
    the send_mutex protect the queue socket send activity) RX activity,
    and more specifically request completion may run concurrently.
    
    This means we must guarantee that any mutation of the request state
    related to its lifetime, bytes sent must not be accessed when a completion
    may have possibly arrived back (and processed).
    
    The race may trigger when a request completion arrives, processed
    _and_ reused as a fresh new request, exactly in the (relatively short)
    window between the last data payload sent and before the request iov_iter
    is advanced.
    
    Consider the following race:
    1. 16K write request is queued
    2. The nvme command and the data is sent to the controller (in-capsule
       or solicited by r2t)
    3. After the last payload is sent but before the req.iter is advanced,
       the controller sends back a completion.
    4. The completion is processed, the request is completed, and reused
       to transfer a new request (write or read)
    5. The new request is queued, and the driver reset the request parameters
       (nvme_tcp_setup_cmd_pdu).
    6. Now context in (2) resumes execution and advances the req.iter
    
    ==> use-after-completion as this is already a new request.
    
    Fix this by making sure the request is not advanced after the last
    data payload send, knowing that a completion may have arrived already.
    
    An alternative solution would have been to delay the request completion
    or state change waiting for reference counting on the TX path, but besides
    adding atomic operations to the hot-path, it may present challenges in
    multi-stage R2T scenarios where a r2t handler needs to be deferred to
    an async execution.
    Reported-by: default avatarNarayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
    Tested-by: default avatarAnil Mishra <anil.mishra@wdc.com>
    Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
    Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
    825619b0
tcp.c 64.9 KB