- 05 Apr, 2024 1 commit
-
-
Jeff Layton authored
Currently the CB_RECALL_ANY job takes a cl_rpc_users reference to the client. While a callback job is technically an RPC, that counter is really for client-driven RPCs, and holding it has the effect of preventing the client from being unhashed until the callback completes.

If nfsd decides to send a CB_RECALL_ANY just as the client reboots, we can end up in a situation where the callback can't complete on the (now dead) callback channel, but the new client can't connect because the old client can't be unhashed. This usually manifests as an NFS4ERR_DELAY return on the CREATE_SESSION operation.

The job is only holding a reference to the client so it can clear a flag after the RPC completes. Fix this by having CB_RECALL_ANY instead hold a reference to the cl_nfsdfs.cl_ref. Typically we only take that sort of reference when dealing with the nfsdfs info files, but it should work appropriately here to ensure that the nfs4_client doesn't disappear.

Fixes: 44df6f43 ("NFSD: add delegation reaper to react to low memory condition")
Reported-by: Vladimir Benes <vbenes@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
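A minimal sketch of the two counters involved, using hypothetical names rather than NFSD's actual fields:

    #include <linux/atomic.h>
    #include <linux/kref.h>
    #include <linux/slab.h>

    /* Hypothetical stand-in for struct nfs4_client, showing the two
     * different counters described above. */
    struct demo_client {
    	atomic_t	rpc_users;	/* non-zero blocks unhashing */
    	struct kref	ref;		/* controls memory lifetime only */
    };

    static void demo_client_free(struct kref *ref)
    {
    	kfree(container_of(ref, struct demo_client, ref));
    }

    /* Old scheme: pinning rpc_users kept a rebooting client hashed,
     * which stalled its own CREATE_SESSION with NFS4ERR_DELAY. */
    static bool demo_can_unhash(struct demo_client *clp)
    {
    	return atomic_read(&clp->rpc_users) == 0;
    }

    /* New scheme: the callback job pins only the memory, so the client
     * can be unhashed while CB_RECALL_ANY is still in flight. */
    static void demo_queue_recall_any(struct demo_client *clp)
    {
    	kref_get(&clp->ref);
    	/* ... queue the callback; its release path clears the flag and
    	 * then does kref_put(&clp->ref, demo_client_free) ... */
    }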
-
- 04 Apr, 2024 1 commit
-
-
Chuck Lever authored
Jan Schunk reports that his small NFS servers suffer from memory exhaustion after just a few days. A bisect shows that commit e18e157b ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call") is the first bad commit.

That commit assumed that sock_sendmsg() releases all the pages in the underlying bio_vec array, but the reality is that it doesn't. svc_xprt_release() releases the rqst's response pages, but the record marker page fragment isn't one of those, so it is never released.

This is a narrow fix that can be applied to stable kernels. A more extensive fix is in the works.

Reported-by: Jan Schunk <scpcom@gmx.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218671
Fixes: e18e157b ("SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call")
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
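To illustrate the lifetime rule at issue (a sketch, not the actual patch), using the kernel's page_frag_alloc()/page_frag_free() helpers: sock_sendmsg() only transmits the bytes, so whoever allocated the record-marker fragment must free it afterwards:

    #include <linux/gfp.h>
    #include <linux/types.h>

    /* Sketch only: the send step is a placeholder. The point is the
     * pairing - the marker fragment is not among the rqst's response
     * pages, so svc_xprt_release() will never free it; the sender must
     * drop its own reference. */
    static int demo_send_record_marker(struct page_frag_cache *cache)
    {
    	__be32 *marker;

    	marker = page_frag_alloc(cache, sizeof(*marker), GFP_KERNEL);
    	if (!marker)
    		return -ENOMEM;
    	/* ... fill in the marker and hand it to sock_sendmsg() ... */

    	page_frag_free(marker);	/* the fix: release our reference */
    	return 0;
    }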
-
- 27 Mar, 2024 1 commit
-
-
Chuck Lever authored
There are one or two cases where CREATE_SESSION returns NFS4ERR_DELAY in order to force the client to wait a bit and try CREATE_SESSION again. However, after commit e4469c6c ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation"), NFSD caches that response in the CREATE_SESSION slot. Thus, when the client resends the CREATE_SESSION, the server always returns the cached NFS4ERR_DELAY response rather than actually executing the request and properly recording its outcome. This blocks the client from making further progress.

RFC 8881 Section 15.1.1.3 says:

> If NFS4ERR_DELAY is returned on an operation other than SEQUENCE
> that validly appears as the first operation of a request ... [t]he
> request can be retried in full without modification. In this case
> as well, the replier MUST avoid returning a response containing
> NFS4ERR_DELAY as the response to an initial operation of a request
> solely on the basis of its presence in the reply cache.

Neither the original NFSD code nor the discussion in Section 18.36.4 refers explicitly to this important requirement, so I missed it.

Note also that not only must the server not cache NFS4ERR_DELAY, it must also not advance the CREATE_SESSION slot sequence number, so that it can properly recognize and accept the client's retry.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Fixes: e4469c6c ("NFSD: Fix the NFSv4.1 CREATE_SESSION operation")
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
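A sketch of the slot rule this implies; the structure and helper below are hypothetical, not NFSD's actual code:

    #include <linux/types.h>

    #define NFS4ERR_DELAY	10008	/* RFC 8881 value */

    struct demo_cs_slot {
    	u32	seqid;		/* sequence ID expected from the client */
    	int	cached_status;	/* reply cached for retransmit replay */
    	bool	have_cached;
    };

    static void demo_record_cs_result(struct demo_cs_slot *slot, int status)
    {
    	if (status == NFS4ERR_DELAY) {
    		/* Do not cache the reply and do not advance the slot
    		 * seqid: the client retries with the same seqid, and
    		 * that retry must be executed, not replayed. */
    		return;
    	}
    	slot->cached_status = status;	/* other errors are cached too */
    	slot->have_cached = true;
    	slot->seqid++;			/* accept the next request */
    }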
-
- 22 Mar, 2024 2 commits
-
-
Chuck Lever authored
Scott reports an occasional scatterlist BUG that is triggered by the RFC 8009 KUnit test, then says:

> Looking through the git history of the auth_gss code, there are various
> places where static buffers were replaced by dynamically allocated ones
> because they're being used with scatterlists.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Fixes: 561141dd ("SUNRPC: Use a static buffer for the checksum initialization vector")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Jan Kara authored
Commit a8b00268 ("rename(): avoid a deadlock in the case of parents having no common ancestor") added an error bail-out path. However, this path does not drop the remount protection that has been acquired. Fix the cleanup path to properly drop the remount protection.

Fixes: a8b00268 ("rename(): avoid a deadlock in the case of parents having no common ancestor")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
- 09 Mar, 2024 1 commit
-
-
Chuck Lever authored
Replace open-coded encoding logic with the use of conventional XDR utility functions. Add a tracepoint to make replays observable in field troubleshooting situations. The WARN_ON is removed. A stack trace is of little use, as there is only one call site for nfsd4_encode_replay(), and a buffer length shortage here is unlikely. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
- 05 Mar, 2024 2 commits
-
-
Dai Ngo authored
The NFS server should ask clients to voluntarily return unused delegations when the number of granted delegations reaches max_delegations, so that the server can continue to grant delegations for new requests.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Add an explanation to prevent the future removal of the fill-attribute call sites in nfsd_setattr(). Some NFSv3 client implementations don't behave correctly if wcc data is not present in an NFSv3 SETATTR reply.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
- 01 Mar, 2024 32 commits
-
-
Trond Myklebust authored
The main point of the guarded SETATTR is to prevent races with other WRITE and SETATTR calls. That requires that the check of the guard time against the inode ctime be done after taking the inode lock. Furthermore, we need to take into account the 32-bit nature of timestamps in NFSv3, and the possibility that files may change at a faster rate than once a second.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
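A minimal sketch of the ordering constraint, with hypothetical names and simplified error handling; inode_get_ctime() is the recent-kernel helper for reading ctime:

    #include <linux/fs.h>

    /* Sketch only: guard_sec/guard_nsec are the nfstime3 from the
     * SETATTR arguments (both 32-bit on the wire). */
    static int demo_guarded_setattr(struct inode *inode,
    				u32 guard_sec, u32 guard_nsec)
    {
    	struct timespec64 ctime;
    	int err = 0;

    	inode_lock(inode);	/* serialize against WRITE/SETATTR */

    	ctime = inode_get_ctime(inode);
    	/* Compare under the lock; truncate seconds to 32 bits to match
    	 * NFSv3, and include nanoseconds because files can change more
    	 * than once per second. */
    	if ((u32)ctime.tv_sec != guard_sec ||
    	    (u32)ctime.tv_nsec != guard_nsec) {
    		err = -EINVAL;	/* stands in for NFS3ERR_NOT_SYNC */
    	} else {
    		/* ... apply the requested attribute changes ... */
    	}

    	inode_unlock(inode);
    	return err;
    }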
-
Trond Myklebust authored
Commit bb4d53d6 ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations") broke the NFSv3 pre/post op attributes behaviour when doing a SETATTR rpc call by stripping out the calls to fh_fill_pre_attrs() and fh_fill_post_attrs().

Fixes: bb4d53d6 ("NFSD: use (un)lock_inode instead of fh_(un)lock for file operations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Message-ID: <20240216012451.22725-1-trondmy@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Dai Ngo authored
Add RCA4_TYPE_MASK_WDATA_DLG to the ra_bmval bitmask of OP_CB_RECALL_ANY.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Dai Ngo authored
If the GETATTR request is on a file that has a write delegation in effect, and the requested attributes include the change info and size attributes, then the request is handled as follows:

The server sends CB_GETATTR to the client to get the latest change info and file size. If these values are the same as the server's cached values, the GETATTR proceeds as normal.

If either the change info or the file size differs from the server's cached values, or the file was already marked as modified, then:

. update time_modify and time_metadata in the file's metadata with the current time
. encode the GETATTR as normal, except the file size is encoded with the value returned from CB_GETATTR
. mark the file as modified

If the CB_GETATTR fails for any reason, the delegation is recalled and NFS4ERR_DELAY is returned for the GETATTR.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
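The flow above, sketched with hypothetical names and stub callbacks standing in for the real callback machinery:

    #include <linux/types.h>

    /* Stubs standing in for the real CB machinery. */
    static int demo_send_cb_getattr(u64 *change, u64 *size)
    {
    	*change = 0;
    	*size = 0;
    	return 0;	/* pretend the client replied */
    }

    static void demo_recall_delegation(void)
    {
    }

    /* Hypothetical cache of what the server believes about the file. */
    struct demo_deleg_attrs {
    	u64	change;		/* cached change attribute */
    	u64	size;		/* cached file size */
    	bool	modified;	/* file already marked modified */
    };

    static int demo_getattr_with_wdeleg(struct demo_deleg_attrs *cached)
    {
    	u64 cb_change, cb_size;
    	int err;

    	err = demo_send_cb_getattr(&cb_change, &cb_size);
    	if (err) {
    		demo_recall_delegation();
    		return -EAGAIN;	/* stands in for NFS4ERR_DELAY */
    	}

    	if (cb_change != cached->change || cb_size != cached->size ||
    	    cached->modified) {
    		/* ... set time_modify/time_metadata to the current
    		 * time and encode the CB_GETATTR-reported size ... */
    		cached->size = cb_size;
    		cached->modified = true;
    	}
    	/* otherwise encode the GETATTR reply as normal */
    	return 0;
    }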
-
Dai Ngo authored
Includes:

. CB_GETATTR proc for nfs4_cb_procedures[]
. XDR encoding and decoding functions for the CB_GETATTR request/reply
. add nfs4_cb_fattr to nfs4_delegation for sending CB_GETATTR and storing the file attributes from the client's reply

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
As described in RFC 8881 Section 18.36.4, CREATE_SESSION can be split into four phases. NFSD's implementation now follows that description.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
RFC 8881 Section 18.36.4 discusses the implementation of the NFSv4.1 CREATE_SESSION operation. The section defines four phases of operation. Phase 2 processes the CREATE_SESSION sequence ID. As a separate step, Phase 3 evaluates the CREATE_SESSION arguments.

The problem we are concerned with is when phase 2 is successful but phase 3 fails. The spec language in this case is "No changes are made to any client records on the server." RFC 8881 Section 18.35.4 defines a "client record", and it does /not/ contain any details related to the special CREATE_SESSION slot. Therefore NFSD is incorrect to skip incrementing the CREATE_SESSION sequence id when phase 3 (see Section 18.36.4) of CREATE_SESSION processing fails. In other words, even though NFSD happens to store the cs_slot in a client record, in terms of the protocol the slot is logically separate from the client record.

Three complications:

1. The world has moved on since commit 86c3e16c ("nfsd4: confirm only on succesful create_session") broke this. So we can't simply revert that commit.

2. NFSD's CREATE_SESSION implementation does not cleanly delineate the logic of phases 2 and 3. So this won't be a surgical fix.

3. Because of the way it currently handles the CREATE_SESSION slot sequence number, nfsd4_create_session() isn't caching error responses in the CREATE_SESSION slot. Instead of replaying the response cache in those cases, it's executing the transaction again.

Reorganize the CREATE_SESSION slot sequence number accounting. This requires that error responses are appropriately cached in the CREATE_SESSION slot (once it is found).

Reported-by: Connor Smith <connor.smith@hitachivantara.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218382
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chen Hanxiao authored
nfsd fault injection has been deprecated since commit 9d60d931 ("Deprecate nfsd fault injection") and removed by commit e56dc9e2 ("nfsd: remove fault injection code"). So remove the outdated parts about fault injection.

Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Chain RDMA Writes that convey Write chunks onto the local Send chain. This means all WRs for an RPC Reply are now posted with a single ib_post_send() call, and there is a single Send completion when all of these are done. That reduces both the per-transport doorbell rate and completion rate. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor to eventually enable svcrdma to post the Write WRs for each RPC response using the same ib_post_send() as the Send WR (ie, as a single WR chain).

svc_rdma_result_payload (originally svc_rdma_read_payload) was added so that the upper layer XDR encoder could identify a range of bytes to be possibly conveyed by RDMA (if a Write chunk was provided by the client).

The purpose of commit f6ad7759 ("svcrdma: Post RDMA Writes while XDR encoding replies") was to post as much of the result payload outside of svc_rdma_sendto() as possible, because svc_rdma_sendto() used to be called with the xpt_mutex held. However, since commit ca4faf54 ("SUNRPC: Move xpt_mutex into socket xpo_sendto methods"), the xpt_mutex is no longer held when calling svc_rdma_sendto(). Thus that benefit no longer applies.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Reduce the doorbell and Send completion rates when sending RPC/RDMA replies that have Reply chunks. NFS READDIR procedures typically return their result in a Reply chunk, for example.

Instead of calling ib_post_send() to post the Write WRs for the Reply chunk, and then calling it again to post the Send WR that conveys the transport header, chain the Write WRs to the Send WR and call ib_post_send() only once.

Thanks to the Send Queue completion ordering rules, when the Send WR completes, that guarantees that Write WRs posted before it have also completed successfully. Thus all Write WRs for the Reply chunk can remain unsignaled. Instead of handling a Write completion and then a Send completion, only the Send completion is seen, and it handles clean up for both the Writes and the Send.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
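A sketch of the chaining idea using the kernel's verbs API; the WR setup around it is elided:

    #include <rdma/ib_verbs.h>

    /* Sketch: link N unsignaled Write WRs ahead of one signaled Send WR
     * and post the whole chain with a single doorbell. */
    static int demo_post_writes_then_send(struct ib_qp *qp,
    				      struct ib_rdma_wr *writes,
    				      unsigned int num_writes,
    				      struct ib_send_wr *send)
    {
    	unsigned int i;

    	for (i = 0; i < num_writes; i++) {
    		/* Writes stay unsignaled: no completion per Write. */
    		writes[i].wr.send_flags &= ~IB_SEND_SIGNALED;
    		writes[i].wr.next = (i + 1 < num_writes) ?
    				    &writes[i + 1].wr : send;
    	}
    	/* Only the Send generates a completion; SQ ordering rules
    	 * guarantee the earlier Writes completed successfully. */
    	send->send_flags |= IB_SEND_SIGNALED;
    	send->next = NULL;

    	return ib_post_send(qp, num_writes ? &writes[0].wr : send, NULL);
    }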
-
Chuck Lever authored
Since the RPC transaction's svc_rdma_send_ctxt will stay around for the duration of the RDMA Write operation, the write_info structure for the Reply chunk can reside in the request's svc_rdma_send_ctxt instead of being allocated separately. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Eventually I'd like the server to post the reply's Send WR along with any Write WRs using only a single call to ib_post_send(), in order to reduce the NIC's doorbell rate. To do this, add an anchor for a WR chain to svc_rdma_send_ctxt, and refactor svc_rdma_send() to post this WR chain to the Send Queue. For the moment, the posted chain will continue to contain a single Send WR. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Don't call ib_post_send() at all if the transport is already shutting down. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
In some error flow cases, svc_rdma_wc_send() releases @ctxt. Copy the sc_cid field in @ctxt to a stack variable in order to guarantee that the value is available after the ib_post_send() call. In case the new comment looks a little strange, this will be done with at least one more field in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
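A sketch of the use-after-free pattern being avoided, with hypothetical names: after ib_post_send(), the completion handler may free the context on another CPU, so anything needed afterwards must already be on the stack:

    #include <rdma/ib_verbs.h>
    #include <linux/types.h>

    static void demo_trace_post_err(u64 cid, int err)
    {
    	/* stand-in for a tracepoint */
    }

    /* Hypothetical context; the real svc_rdma_send_ctxt differs. */
    struct demo_send_ctxt {
    	u64			cid;	/* completion ID for tracing */
    	struct ib_send_wr	wr;
    };

    static int demo_post(struct ib_qp *qp, struct demo_send_ctxt *ctxt)
    {
    	u64 cid = ctxt->cid;	/* copy before posting */
    	int ret;

    	ret = ib_post_send(qp, &ctxt->wr, NULL);
    	/* From here on, @ctxt may already have been released by the
    	 * Send completion handler; touch only the stack copy. */
    	if (ret)
    		demo_trace_post_err(cid, ret);
    	return ret;
    }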
-
Chuck Lever authored
Ensure there is a wake-up when increasing sc_sq_avail. Likewise, if a wake-up is done, sc_sq_avail needs to be updated, otherwise the wait_event() conditional is never going to be met. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
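The general pattern at stake, sketched with hypothetical names: update the shared counter before wake_up(), otherwise a waiter can re-check the wait_event() condition, still see it false, and sleep forever:

    #include <linux/atomic.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(demo_sq_wait);
    static atomic_t demo_sq_avail = ATOMIC_INIT(0);

    /* Consumer: wait until SQ space is available, then claim it.
     * Simplified - the real code reserves space atomically and retries;
     * the point here is only the update-then-wake ordering below. */
    static void demo_reserve_sqe(void)
    {
    	wait_event(demo_sq_wait, atomic_read(&demo_sq_avail) > 0);
    	atomic_dec(&demo_sq_avail);
    }

    /* Completion path: release SQ space, *then* wake waiters. Calling
     * wake_up() without updating demo_sq_avail leaves the waiter's
     * condition false, and the wake-up is lost. */
    static void demo_release_sqe(void)
    {
    	atomic_inc(&demo_sq_avail);
    	wake_up(&demo_sq_wait);
    }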
-
Chuck Lever authored
rdma_rw_mr_factor() returns the smallest number of MRs needed to move a particular number of pages. svcrdma currently asks for the number of MRs needed to move RPCSVC_MAXPAGES (a little over one megabyte), as that is the number of pages in the largest r/wsize the server supports.

This call assumes that the client's NIC can bundle a full one megabyte payload in a single rdma_segment. In fact, most NICs cannot handle a full megabyte with a single rkey / rdma_segment. Clients will typically split even a single Read chunk into many segments.

The server needs one MR to read each rdma_segment in a Read chunk, and thus each one needs an rw_ctx. svcrdma has been vastly underestimating the number of rw_ctxs needed to handle 64 RPC requests with large Read chunks using small rdma_segments.

Unfortunately there doesn't seem to be a good way to estimate this number without knowing the client NIC's capabilities. Even then, the client RPC/RDMA implementation is still free to split a chunk into smaller segments (for example, it might be using physical registration, which needs an rdma_segment per page).

The best we can do for now is choose a number that will guarantee forward progress in the worst case (one page per segment). At some later point, we could add some mechanisms to make this much less of a problem:

- Add a core API to add more rw_ctxs to an already-established QP
- svcrdma could treat rw_ctx exhaustion as a temporary error and try again
- Limit the number of Reads in flight

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
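Sketched as arithmetic, with hypothetical names: in the worst case every rdma_segment covers a single page, so each page of a Read chunk needs its own MR (and rw_ctx), and the per-transport budget scales with concurrent RPCs:

    /* Hypothetical worst-case rw_ctx estimate: one MR per page means a
     * maximally fragmented Read chunk of N pages needs N rw_ctxs. */
    static unsigned int demo_rwctx_budget(unsigned int max_requests,
    				      unsigned int max_pages_per_rpc)
    {
    	return max_requests * max_pages_per_rpc;
    }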
-
Chuck Lever authored
rdma_create_qp() can modify cap.max_send_sges. Copy the new value to the svcrdma transport so it is bound by the new limit instead of the requested one. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
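A sketch of the pattern with a hypothetical transport structure: read back the capability the core actually granted rather than trusting the requested value:

    #include <rdma/rdma_cm.h>

    /* Hypothetical transport; the real svcxprt_rdma differs. */
    struct demo_rdma_xprt {
    	u32 sc_max_send_sges;
    };

    static int demo_create_qp(struct demo_rdma_xprt *xprt,
    			  struct rdma_cm_id *cm_id, struct ib_pd *pd,
    			  struct ib_qp_init_attr *qp_attr)
    {
    	int ret;

    	qp_attr->cap.max_send_sge = xprt->sc_max_send_sges;
    	ret = rdma_create_qp(cm_id, pd, qp_attr);
    	if (ret)
    		return ret;

    	/* The core may have lowered cap.max_send_sge; adopt the value
    	 * it actually granted. */
    	xprt->sc_max_send_sges = qp_attr->cap.max_send_sge;
    	return 0;
    }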
-
Chuck Lever authored
Check that svc_rdma_accept() is allocating an appropriate number of CQEs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Do as other ULPs already do: ensure there is an extra Receive WQE reserved for the tear-down drain WR. I haven't heard reports of problems but it can't hurt. Note that rq_depth is used to compute the Send Queue depth as well, so this fix should affect both the SQ and RQ. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Jeff Layton authored
Alex helps co-maintain the DLM code and did some recent work to fix up how lockd and GFS2 work together. Add him as a Reviewer for file locking changes. Acked-by: Alexander Aring <aahringo@redhat.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Kunwu Chan authored
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Make the code cleaner and more readable. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
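For reference, KMEM_CACHE() derives the cache name, size, and alignment from the struct itself; a before/after sketch on a hypothetical struct:

    #include <linux/errno.h>
    #include <linux/slab.h>

    struct demo_item {
    	int value;
    };

    static struct kmem_cache *demo_cache;

    static int demo_caches(void)
    {
    	/* Before: every parameter spelled out by hand. */
    	demo_cache = kmem_cache_create("demo_item",
    				       sizeof(struct demo_item),
    				       0, 0, NULL);
    	if (!demo_cache)
    		return -ENOMEM;
    	kmem_cache_destroy(demo_cache);

    	/* After: an equivalent cache; name, size, and alignment all
    	 * come from the struct definition. */
    	demo_cache = KMEM_CACHE(demo_item, 0);
    	return demo_cache ? 0 : -ENOMEM;
    }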
-
Kunwu Chan authored
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches, and change the cache name from 'nfsd_drc' to 'nfsd_cacherep'.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Kunwu Chan authored
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
Kunwu Chan authored
Commit 0a31bd5f ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
NeilBrown authored
It is possible for free_blocked_lock() to be called twice concurrently, once from nfsd4_lock() and once from nfsd4_release_lockowner() calling remove_blocked_locks(). This is why a kref was added.

It is perfectly safe for locks_delete_block() and kref_put() to be called in parallel as they use locking or atomicity respectively as protection. However, locks_release_private() has no locking. It is safe for it to be called twice sequentially, but not concurrently.

This patch moves that call from free_blocked_lock(), where it could race with itself, to free_nbl(), where it cannot. This will slightly delay the freeing of private info or release of the owner - but not by much. It is arguably more natural for this freeing to happen in free_nbl(), where the structure itself is freed.

This bug was found by code inspection - it has not been seen in practice.

Fixes: 47446d74 ("nfsd4: add refcount for nfsd4_blocked_lock")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
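A sketch of the fix's shape, with hypothetical names: the non-reentrant cleanup moves into the kref release function, which kref_put() guarantees runs exactly once however many paths drop references concurrently:

    #include <linux/kref.h>
    #include <linux/slab.h>

    struct demo_blocked_lock {
    	struct kref ref;
    	/* ... a file_lock whose private data needs releasing ... */
    };

    /* Runs exactly once, when the last reference is dropped - the safe
     * place for cleanup that must never race with itself. */
    static void demo_free_nbl(struct kref *ref)
    {
    	struct demo_blocked_lock *nbl =
    		container_of(ref, struct demo_blocked_lock, ref);

    	/* ... locks_release_private()-style cleanup goes here ... */
    	kfree(nbl);
    }

    /* Both racing callers simply drop their reference; kref_put()
     * ensures demo_free_nbl() cannot be entered twice. */
    static void demo_free_blocked_lock(struct demo_blocked_lock *nbl)
    {
    	kref_put(&nbl->ref, demo_free_nbl);
    }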
-
NeilBrown authored
When there is layout state on a filesystem that is being "unlocked", that state is now revoked, which involves closing the nfsd_file and releasing the vfs lease.

To avoid races, ->ls_file can now be accessed either:
- under ->fi_lock for the state's sc_file, or
- under rcu_read_lock() if nfsd_file_get() is used.

To support this, ->fence_client and nfsd4_cb_layout_fail() now take a second argument being the nfsd_file.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
NeilBrown authored
Revoking state through 'unlock_filesystem' now revokes any delegation states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. As there is already support for revoking delegations, we build on that for admin-revoking. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
NeilBrown authored
Revoking state through 'unlock_filesystem' now revokes any open states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Possibly the related lock states should be revoked too, but a subsequent patch will do that for all lock state on the superblock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
NeilBrown authored
Revoking state through 'unlock_filesystem' now revokes any lock states found. When the stateids are then freed by the client, the revoked stateids will be cleaned up correctly. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
NeilBrown authored
For NFSv4.1 and later the client easily discovers if there is any admin-revoked state and will then find and explicitly free it.

For NFSv4.0 there is no such mechanism. The client can only find that state is admin-revoked if it tries to use that state, and there is no way for it to explicitly free the state. So the server must hold on to the stateid (at least) for an indefinite amount of time. A RELEASE_LOCKOWNER request might justify forgetting some of these stateids, as would the whole client's lease lapsing, but these are not reliable.

This patch takes two approaches. Whenever a client uses a revoked stateid, that stateid is then discarded and will not be recognised again. This might confuse a client which expects to get NFS4ERR_ADMIN_REVOKED consistently once it gets it at all, but should mostly work. Hopefully one error will lead to other resources being closed (e.g. process exits), which will result in more stateids being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED. Also, any admin-revoked stateids that have been that way for more than one lease time are periodically revoked.

No actual freeing of state happens in this patch. That will come in future patches which handle the different sorts of revoked state.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-
NeilBrown authored
Add "admin-revoked" to the status information for any states that have been admin-revoked. This can be useful for confirming correct behaviour. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-