Commits · edcf9725150e42beeca42d085149f4c88fa97afd · Kirill Smelkov / linux

24 Jan, 2024 1 commit

NeilBrown authored Jan 22, 2024

The test on so_count in nfsd4_release_lockowner() is nonsense and
harmful.  Revert to using check_for_locks(), changing that to not sleep.

First: harmful.
As is documented in the kdoc comment for nfsd4_release_lockowner(), the
test on so_count can transiently return a false positive resulting in a
return of NFS4ERR_LOCKS_HELD when in fact no locks are held.  This is
clearly a protocol violation and with the Linux NFS client it can cause
incorrect behaviour.

If RELEASE_LOCKOWNER is sent while some other thread is still
processing a LOCK request which failed because, at the time that request
was received, the given owner held a conflicting lock, then the nfsd
thread processing that LOCK request can hold a reference (conflock) to
the lock owner that causes nfsd4_release_lockowner() to return an
incorrect error.

The Linux NFS client ignores that NFS4ERR_LOCKS_HELD error because it
never sends NFS4_RELEASE_LOCKOWNER without first releasing any locks, so
it knows that the error is impossible.  It assumes the lock owner was in
fact released so it feels free to use the same lock owner identifier in
some later locking request.

When it does reuse a lock owner identifier for which a previous RELEASE
failed, it will naturally use a lock_seqid of zero.  However the server,
which didn't release the lock owner, will expect a larger lock_seqid and
so will respond with NFS4ERR_BAD_SEQID.

So clearly it is harmful to allow a false positive, which testing
so_count allows.

The test is nonsense because ... well... it doesn't mean anything.

so_count is the sum of three different counts.
1/ the set of states listed on so_stateids
2/ the set of active vfs locks owned by any of those states
3/ various transient counts such as for conflicting locks.

When it is tested against '2' it is clear that one of these is the
transient reference obtained by find_lockowner_str_locked().  It is not
clear what the other one is expected to be.

In practice, the count is often 2 because there is precisely one state
on so_stateids.  If there were more, this would fail.

In my testing I see two circumstances when RELEASE_LOCKOWNER is called.
In one case, CLOSE is called before RELEASE_LOCKOWNER.  That results in
all the lock states being removed, and so the lockowner being discarded
(it is removed when there are no more references which usually happens
when the lock state is discarded).  When nfsd4_release_lockowner() finds
that the lock owner doesn't exist, it returns success.

The other case shows an so_count of '2' and precisely one state listed
on so_stateids.  It appears that the Linux client uses a separate lock
owner for each file resulting in one lock state per lock owner, so this
test on '2' is safe.  For another client it might not be safe.

So this patch changes check_for_locks() to use the (newish)
find_any_file_locked() so that it doesn't take a reference on the
nfs4_file and so never calls nfsd_file_put(), and so never sleeps.  With
this check it is safe to restore the use of check_for_locks() rather
than testing so_count against the mysterious '2'.

Fixes: ce3c4ad7 ("NFSD: Fix possible sleep during nfsd4_release_lockowner()")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

edcf9725
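The false positive described in the message above can be illustrated with a toy model. This is only a sketch under stated assumptions: the struct, the enum and every `toy_` name are hypothetical and do not reproduce the kernel's data structures. It assumes so_count is the plain sum of the three counts listed in the message, and that the old test compared that sum (plus the lookup's own reference) against 2.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the kernel's structures. */
enum toy_status { TOY_OK, TOY_LOCKS_HELD };

struct toy_lockowner {
	int nr_states;     /* 1/ states listed on so_stateids */
	int nr_vfs_locks;  /* 2/ active vfs locks owned by those states */
	int nr_transient;  /* 3/ transient references, e.g. a conflock */
};

/* so_count as the message describes it: the sum of three counts. */
static int toy_so_count(const struct toy_lockowner *lo)
{
	assert(lo != NULL);
	return lo->nr_states + lo->nr_vfs_locks + lo->nr_transient;
}

/* Count-based test: add the lookup's own reference, compare with 2. */
static enum toy_status toy_release_by_count(const struct toy_lockowner *lo)
{
	int count = toy_so_count(lo) + 1; /* +1 for the lookup reference */

	return count != 2 ? TOY_LOCKS_HELD : TOY_OK;
}

/* Lock-based test: ask directly whether any lock is held. */
static enum toy_status toy_release_by_locks(const struct toy_lockowner *lo)
{
	assert(lo != NULL);
	return lo->nr_vfs_locks > 0 ? TOY_LOCKS_HELD : TOY_OK;
}
```

With one state, no locks and one extra transient reference, the count-based test reports TOY_LOCKS_HELD even though nothing is locked, while the lock-based test reports TOY_OK.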

17 Jan, 2024 1 commit

SUNRPC: use request size to initialize bio_vec in svc_udp_sendto() · 1d9cabe2

Lucas Stach authored Jan 17, 2024

Use the proper size when setting up the bio_vec, as otherwise only
zero-length UDP packets will be sent.

Fixes: baabf59c ("SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array")
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

1d9cabe2
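A user-space sketch of why the size matters, assuming the bug amounted to describing the segment array with a total byte count of zero. The types and `toy_` functions below are hypothetical stand-ins, not the kernel's bio_vec or iterator API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a segment array and its iterator. */
struct toy_bvec {
	const void *base;
	size_t len;
};

struct toy_iter {
	const struct toy_bvec *vec;
	size_t nr_segs;
	size_t count;   /* total bytes the consumer may take */
};

static void toy_iter_init(struct toy_iter *it, const struct toy_bvec *vec,
			  size_t nr_segs, size_t count)
{
	assert(it != NULL && vec != NULL);
	it->vec = vec;
	it->nr_segs = nr_segs;
	it->count = count;
}

/* Bytes a sender would transmit: capped by the segments and by count. */
static size_t toy_iter_bytes(const struct toy_iter *it)
{
	size_t avail = 0;

	assert(it != NULL);
	for (size_t i = 0; i < it->nr_segs; i++)
		avail += it->vec[i].len;
	return avail < it->count ? avail : it->count;
}
```

In this model, initializing the iterator with a count of 0 yields a zero-length send regardless of how much data the segments hold; passing the request size yields the full payload.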

07 Jan, 2024 38 commits

nfsd: rename nfsd_last_thread() to nfsd_destroy_serv() · 17419aef

NeilBrown authored Dec 15, 2023

As this function now destroys the svc_serv, this is a better name.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

17419aef

SUNRPC: discard sv_refcnt, and svc_get/svc_put · 1e3577a4

NeilBrown authored Dec 15, 2023

sv_refcnt is no longer useful.
lockd and nfs-cb only ever have the svc active when there are a non-zero
number of threads, so sv_refcnt mirrors sv_nrthreads.

nfsd also keeps the svc active between when a socket is added and when
the first thread is started, but we don't really need a refcount for
that.  We can simply not destroy the svc while there are any permanent
sockets attached.

So remove sv_refcnt and the get/put functions.
Replace the final call to svc_put() with a call to svc_destroy().
svc_destroy() is changed to also store NULL in the passed-in pointer to
make it easier to avoid use-after-free situations.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

1e3577a4
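The "store NULL in the passed-in pointer" idiom mentioned above can be sketched in user-space C. The `toy_serv` type and functions are hypothetical and only illustrate the pattern of a destructor that takes a pointer to the caller's pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical service object; not the kernel's struct svc_serv. */
struct toy_serv {
	int nr_threads;
};

static struct toy_serv *toy_serv_create(void)
{
	return calloc(1, sizeof(struct toy_serv));
}

/*
 * Free the object and clear the caller's pointer, so a stale pointer
 * cannot be dereferenced or freed a second time by accident.
 */
static void toy_serv_destroy(struct toy_serv **servp)
{
	assert(servp != NULL);
	free(*servp);   /* free(NULL) is a no-op */
	*servp = NULL;
}
```

Because the caller's pointer is cleared, a second call to `toy_serv_destroy()` with the same variable is harmless in this sketch.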

svc: don't hold reference for poolstats, only mutex. · 7b207ccd

NeilBrown authored Dec 15, 2023

A future patch will remove refcounting on svc_serv as it is of little
use.
It is currently used to keep the svc around while the pool_stats file is
open.
Change this to get the pointer, protected by the mutex, only in
seq_start, and then release the mutex in seq_stop.
This means that if the nfsd server is stopped and restarted while the
pool_stats file is open, then some pool stats info could be from the
first instance and some from the second.  This might appear odd, but is
unlikely to be a problem in practice.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

7b207ccd

SUNRPC: remove printk when back channel request not found · 05a4b583

Dai Ngo authored Dec 15, 2023

If the client interface is down, or there is a network partition between
the client and server that prevents the callback request from reaching the
client, TCP on the server will keep re-transmitting the callback for about
9 minutes before giving up and closing the connection.

If the connection between the client and the server is re-established
before the connection is closed and after the callback timed out (9 secs)
then the re-transmitted callback request will arrive at the client. When
the server receives the reply of the callback, receive_cb_reply prints the
"Got unrecognized reply..." message in the system log since the callback
request was already removed from the server xprt's recv_queue.

Even though this scenario has no effect on the server operation, a
malfunctioning or malicious client can fill up the server's system log.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

05a4b583

svcrdma: Implement multi-stage Read completion again · d3dba534

Chuck Lever authored Dec 18, 2023

Having an nfsd thread waiting for an RDMA Read completion is
problematic if the Read responder (ie, the client) stops responding.
We need to go back to handling RDMA Reads by getting the svc scheduler
to call svc_rdma_recvfrom() a second time to finish building an RPC
message after a Read completion.

This is the final patch, and makes several changes that have to
happen concurrently:

1. svc_rdma_process_read_list no longer waits for a completion, but
   simply builds and posts the Read WRs.

2. svc_rdma_read_done() now queues a completed Read on
   sc_read_complete_q for later processing rather than calling
   complete().

3. The completed RPC message is no longer built in the
   svc_rdma_process_read_list() path. Finishing the message is now
   done in svc_rdma_recvfrom() when it notices work on the
   sc_read_complete_q. The "finish building this RPC message" code
   is removed from the svc_rdma_process_read_list() path.

This arrangement avoids the need for an nfsd thread to wait for an
RDMA Read non-interruptibly without a timeout. It's basically the
same code structure that Tom Tucker used for Read chunks along with
some clean-up and modernization.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

d3dba534
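The three changes listed above amount to a two-stage flow: post the Read and return, queue the completion, then finish the message on a later receive call. A minimal single-threaded sketch follows. All `toy_` names are hypothetical, and the locking that a real completion queue would need is deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>

struct toy_read_ctxt {
	size_t hdr_len;      /* bytes of RPC header already received */
	size_t payload_len;  /* bytes pulled in by the Read */
	struct toy_read_ctxt *next;
};

struct toy_xprt {
	struct toy_read_ctxt *complete_head;  /* FIFO of completed Reads */
	struct toy_read_ctxt *complete_tail;
};

/* Stage 1: post the Read and return at once; nothing waits here. */
static size_t toy_process_read_list(struct toy_xprt *xprt,
				    struct toy_read_ctxt *ctxt)
{
	assert(xprt != NULL && ctxt != NULL);
	ctxt->next = NULL;
	return 0;  /* no complete message yet */
}

/* Completion handler: queue the finished Read for later processing. */
static void toy_read_done(struct toy_xprt *xprt, struct toy_read_ctxt *ctxt)
{
	assert(xprt != NULL && ctxt != NULL);
	ctxt->next = NULL;
	if (xprt->complete_tail)
		xprt->complete_tail->next = ctxt;
	else
		xprt->complete_head = ctxt;
	xprt->complete_tail = ctxt;
}

/* Stage 2: a later receive call finishes a queued message, if any. */
static size_t toy_recvfrom(struct toy_xprt *xprt)
{
	struct toy_read_ctxt *ctxt;

	assert(xprt != NULL);
	ctxt = xprt->complete_head;
	if (!ctxt)
		return 0;
	xprt->complete_head = ctxt->next;
	if (!xprt->complete_head)
		xprt->complete_tail = NULL;
	return ctxt->hdr_len + ctxt->payload_len;  /* full message length */
}
```

The point of the arrangement in this sketch is that no caller ever blocks: a receive call that finds the queue empty simply returns 0, and the message is assembled only after the completion handler has queued it.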

svcrdma: Copy construction of svc_rqst::rq_arg to rdma_read_complete() · ecba85e9

Chuck Lever authored Dec 18, 2023

Once a set of RDMA Reads are complete, the Read completion handler
will poke the transport to trigger a second call to
svc_rdma_recvfrom(). recvfrom() will then merge the RDMA Read
payloads with the previously received RPC header to form a completed
RPC Call message.

The new code is copied from the svc_rdma_process_read_list() path.
A subsequent patch will make use of this code and remove the code
that this was copied from (svc_rdma_rw.c).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ecba85e9

svcrdma: Add back svcxprt_rdma::sc_read_complete_q · a937693a

Chuck Lever authored Dec 18, 2023

Having an nfsd thread waiting for an RDMA Read completion is
problematic if the Read responder (ie, the client) stops responding.
We need to go back to handling RDMA Reads by allowing the nfsd
thread to return to the svc scheduler, then waking a second thread
to finish the RPC message once the Read completion fires.

As a next step, add a list_head upon which completed Reads are queued.
A subsequent patch will make use of this queue.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a937693a

svcrdma: Add back svc_rdma_recv_ctxt::rc_pages · 4d9d69db

Chuck Lever authored Dec 18, 2023

Having an nfsd thread waiting for an RDMA Read completion is
problematic if the Read responder (the client) stops responding. We
need to go back to handling RDMA Reads by allowing the nfsd thread
to return to the svc scheduler, then waking a second thread to finish
the RPC message once the Read completion fires.

To start with, restore the rc_pages field so that RDMA Read pages
can be managed across calls to svc_rdma_recvfrom().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

4d9d69db

svcrdma: Clean up comment in svc_rdma_accept() · fc2e69db

Chuck Lever authored Dec 11, 2023

The comment that starts "Qualify ..." applies to only some of the
following code paragraph. Re-arrange the lines so the comment makes
more sense.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

fc2e69db

svcrdma: Remove queue-shortening warnings · b918bfcf

Chuck Lever authored Dec 11, 2023

These won't have much diagnostic value for site administrators.
Since they can't be disabled, they become noise.

What's more, the subsequent rdma_create_qp() call adjusts the Send
Queue size (possibly downward) without warning, making the size
reported by these pr_warns inaccurate.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b918bfcf

svcrdma: Remove pointer addresses shown in dprintk() · 913cd766

Chuck Lever authored Dec 11, 2023

There are a couple of dprintk() call sites in svc_rdma_accept()
that show pointer addresses. These days, displayed pointer addresses
are hashed and thus have little or no diagnostic value, especially
for site administrators.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

913cd766

svcrdma: Optimize svc_rdma_cc_init() · 2a95ce47

Chuck Lever authored Dec 11, 2023

The atomic_inc_return() in svc_rdma_send_cid_init() is expensive.

Some svc_rdma_chunk_ctxt's now reside in long-lived container
structures. They don't need a fresh completion ID for every I/O
operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2a95ce47
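A hedged sketch of the idea, using C11 atomics in user space: take the completion ID from the shared atomic counter once, when the long-lived container is set up, and reset only per-operation state for each I/O. The `toy_` names and the `nr_ops` field are hypothetical and do not mirror the kernel's helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared ID counter; static storage starts at zero. */
static atomic_uint toy_next_cid;

struct toy_chunk_ctxt {
	unsigned int cid;     /* completion ID, stable for the ctxt's life */
	unsigned int nr_ops;  /* per-I/O state (hypothetical field) */
};

/* Called once, when the long-lived container is set up. */
static void toy_cc_init_once(struct toy_chunk_ctxt *cc)
{
	assert(cc != NULL);
	/* fetch_add returns the old value; +1 mimics increment-and-return */
	cc->cid = atomic_fetch_add(&toy_next_cid, 1) + 1;
	cc->nr_ops = 0;
}

/* Called for every I/O: no atomic operation on the shared counter. */
static void toy_cc_reset(struct toy_chunk_ctxt *cc)
{
	assert(cc != NULL);
	cc->nr_ops = 0;
}
```

In this sketch the contended atomic increment is paid once per container rather than once per I/O operation.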

svcrdma: De-duplicate completion ID initialization helpers · 28ee0ec8

Chuck Lever authored Dec 11, 2023

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

28ee0ec8

svcrdma: Move the svc_rdma_cc_init() call · 018f3405

Chuck Lever authored Dec 04, 2023

Now that the chunk_ctxt for Reads is no longer dynamically allocated
it can be initialized once for the life of the object that contains
it (struct svc_rdma_recv_ctxt).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

018f3405

svcrdma: Remove struct svc_rdma_read_info · 57666bbb

Chuck Lever authored Dec 04, 2023

The remaining fields of struct svc_rdma_read_info are no longer
referenced.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

57666bbb

svcrdma: Update the synopsis of svc_rdma_read_special() · efd02cb0

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_read_special() can use that recv_ctxt to derive the
read_info rather than the other way around. This removes another
usage of the ri_readctxt field, enabling its removal in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

efd02cb0

svcrdma: Update the synopsis of svc_rdma_read_call_chunk() · 23bab3b2

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_read_call_chunk() can use that recv_ctxt to derive the
read_info rather than the other way around. This removes another
usage of the ri_readctxt field, enabling its removal in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

23bab3b2

svcrdma: Update synopsis of svc_rdma_read_multiple_chunks() · 740a3c89

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_read_multiple_chunks() can use that recv_ctxt to derive the
read_info rather than the other way around. This removes another
usage of the ri_readctxt field, enabling its removal in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

740a3c89

svcrdma: Update synopsis of svc_rdma_copy_inline_range() · 6518204d

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_copy_inline_range() can use that recv_ctxt to derive the
read_info rather than the other way around. This removes another
usage of the ri_readctxt field, enabling its removal in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6518204d

svcrdma: Update the synopsis of svc_rdma_read_data_item() · 6e4b9b86

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_build_read_data_item() can use that recv_ctxt to derive
that information rather than the other way around. This removes
another usage of the ri_readctxt field, enabling its removal in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6e4b9b86

svcrdma: Update synopsis of svc_rdma_read_chunk_range() · c7eb4feb

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_build_read_chunk_range() can use that recv_ctxt to derive
that information rather than the other way around. This removes
another usage of the ri_readctxt field, enabling its removal in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c7eb4feb

svcrdma: Update synopsis of svc_rdma_build_read_chunk() · 02e8fe1e

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_build_read_chunk() can use that recv_ctxt to derive that
information rather than the other way around. This removes another
usage of the ri_readctxt field, enabling its removal in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

02e8fe1e

svcrdma: Update synopsis of svc_rdma_build_read_segment() · fc20f19b

Chuck Lever authored Dec 04, 2023

Since the RDMA Read I/O state is now contained in the recv_ctxt,
svc_rdma_build_read_segment() can use the recv_ctxt to derive that
information rather than the other way around. This removes one usage
of the ri_readctxt field, enabling its removal in a subsequent
patch.

At the same time, the use of ri_rqst can similarly be replaced with
a passed-in function parameter.

Start with build_read_segment() because it is a common utility
function at the bottom of the Read chunk path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

fc20f19b

svcrdma: Move read_info::ri_pageoff into struct svc_rdma_recv_ctxt · 919f6e79

Chuck Lever authored Dec 04, 2023

Further clean up: move the starting byte offset field into
svc_rdma_recv_ctxt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

919f6e79

svcrdma: Move svc_rdma_read_info::ri_pageno to struct svc_rdma_recv_ctxt · 8e122582

Chuck Lever authored Dec 04, 2023

Further clean up: move the page index field into svc_rdma_recv_ctxt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

8e122582

svcrdma: Start moving fields out of struct svc_rdma_read_info · b1818412

Chuck Lever authored Dec 04, 2023

Since the request's svc_rdma_recv_ctxt will stay around for the
duration of the RDMA Read operation, the contents of struct
svc_rdma_read_info can reside in the request's svc_rdma_recv_ctxt
rather than being allocated separately. This will eventually save a
call to kmalloc() in a hot path.

Start this clean-up by moving the Read chunk's svc_rdma_chunk_ctxt.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b1818412

svcrdma: Move struct svc_rdma_chunk_ctxt to svc_rdma.h · 6a04a434

Chuck Lever authored Dec 04, 2023

Prepare for nestling these into the send and recv ctxts so they
no longer have to be allocated dynamically.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6a04a434

svcrdma: Remove the svc_rdma_chunk_ctxt::cc_rdma field · 2cc0f23b

Chuck Lever authored Dec 04, 2023

In every instance, the pointer address in that field is now
available by other means.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2cc0f23b

svcrdma: Pass a pointer to the transport to svc_rdma_cc_release() · bc8fd4e9

Chuck Lever authored Dec 04, 2023

Enable the eventual removal of the svc_rdma_chunk_ctxt::cc_rdma
field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

bc8fd4e9

svcrdma: Explicitly pass the transport to svc_rdma_post_chunk_ctxt() · 83fe6dd6

Chuck Lever authored Dec 04, 2023

Enable the eventual removal of the svc_rdma_chunk_ctxt::cc_rdma
field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

83fe6dd6

svcrdma: Explicitly pass the transport into Read chunk I/O paths · 4a68edd9

Chuck Lever authored Dec 04, 2023

Enable the eventual removal of the svc_rdma_chunk_ctxt::cc_rdma
field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

4a68edd9

svcrdma: Explicitly pass the transport into Write chunk I/O paths · c3899b71

Chuck Lever authored Dec 04, 2023

Enable the eventual removal of the svc_rdma_chunk_ctxt::cc_rdma
field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c3899b71

svcrdma: Acquire the svcxprt_rdma pointer from the CQ context · c4fd9f45

Chuck Lever authored Dec 04, 2023

Enable the removal of the svc_rdma_chunk_ctxt::cc_rdma field in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c4fd9f45

svcrdma: Reduce size of struct svc_rdma_rw_ctxt · 5ef6c666

Chuck Lever authored Dec 04, 2023

SG_CHUNK_SIZE is 128, making struct svc_rdma_rw_ctxt + the first
SGL array more than 4200 bytes in length, pushing the memory
allocation well into order 1.

Even so, the RDMA rw core doesn't seem to use more than max_send_sge
entries in that array (typically 32 or less), so that is all wasted
space.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5ef6c666
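The size argument above can be checked with a little arithmetic. The sketch below assumes 32 bytes per scatterlist entry and a 4096-byte page (typical on 64-bit builds, but configuration dependent), plus a made-up 112-byte header chosen only so the total exceeds 4200 bytes as the message states. The `toy_` helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= bytes. */
static unsigned int toy_alloc_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;

	assert(page_size > 0);
	while ((page_size << order) < bytes)
		order++;
	return order;
}

/* Bytes for a context header plus an inline array of nents entries. */
static size_t toy_ctxt_bytes(size_t hdr, size_t entry, size_t nents)
{
	return hdr + entry * nents;
}
```

Under those assumptions, 128 entries give 112 + 32 * 128 = 4208 bytes, which needs an order-1 allocation, while 32 entries give 112 + 32 * 32 = 1136 bytes, which fits in a single page.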

svcrdma: Update some svcrdma DMA-related tracepoints · 2dd6e29a

Chuck Lever authored Nov 27, 2023

A send/recv_ctxt already records transport-related information
in the cq.id, thus there is no need to record the IP addresses of
the transport endpoints.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2dd6e29a

svcrdma: DMA error tracepoints should report completion IDs · 848760a9

Chuck Lever authored Nov 27, 2023

Update the DMA error flow tracepoints to report the completion ID of
the failing context. This ties the wait/failure to a particular
operation or request, which is more useful than knowing only the
failing transport.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

848760a9

svcrdma: SQ error tracepoints should report completion IDs · ad3656bd

Chuck Lever authored Nov 27, 2023

Update the Send Queue's error flow tracepoints to report the
completion ID of the waiting or failing context. This ties the
wait/failure to a particular operation or request, which is a little
more useful than knowing only the transport that is about to close.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ad3656bd

rpcrdma: Introduce a simple cid tracepoint class · be2acb10

Chuck Lever authored Nov 27, 2023

De-duplicate some code, making it easier to add new tracepoints that
report only a completion ID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

be2acb10