Commits · 10e6fc1054d900a205e5233f186ebce9c50e1d1d · Kirill Smelkov / linux

01 Mar, 2024 40 commits

svcrdma: Post the Reply chunk and Send WR together · 10e6fc10

Chuck Lever authored Feb 04, 2024

Reduce the doorbell and Send completion rates when sending RPC/RDMA
replies that have Reply chunks. NFS READDIR procedures typically
return their result in a Reply chunk, for example.

Instead of calling ib_post_send() to post the Write WRs for the
Reply chunk, and then calling it again to post the Send WR that
conveys the transport header, chain the Write WRs to the Send WR
and call ib_post_send() only once.

Thanks to the Send Queue completion ordering rules, when the Send
WR completes, that guarantees that Write WRs posted before it have
also completed successfully. Thus all Write WRs for the Reply chunk
can remain unsignaled. Instead of handling a Write completion and
then a Send completion, only the Send completion is seen, and it
handles clean up for both the Writes and the Send.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

10e6fc10

svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt · a1f5788a

Chuck Lever authored Feb 04, 2024

Since the RPC transaction's svc_rdma_send_ctxt will stay around for
the duration of the RDMA Write operation, the write_info structure
for the Reply chunk can reside in the request's svc_rdma_send_ctxt
instead of being allocated separately.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a1f5788a

svcrdma: Post Send WR chain · 71b43531

Chuck Lever authored Feb 04, 2024

Eventually I'd like the server to post the reply's Send WR along
with any Write WRs using only a single call to ib_post_send(), in
order to reduce the NIC's doorbell rate.

To do this, add an anchor for a WR chain to svc_rdma_send_ctxt, and
refactor svc_rdma_send() to post this WR chain to the Send Queue. For
the moment, the posted chain will continue to contain a single Send
WR.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

71b43531

svcrdma: Fix retry loop in svc_rdma_send() · fc709d82

Chuck Lever authored Feb 04, 2024

Don't call ib_post_send() at all if the transport is already
shutting down.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

fc709d82

svcrdma: Prevent a UAF in svc_rdma_send() · 773f6c5b

Chuck Lever authored Feb 04, 2024

In some error flow cases, svc_rdma_wc_send() releases @ctxt. Copy
the sc_cid field in @ctxt to a stack variable in order to guarantee
that the value is available after the ib_post_send() call.

In case the new comment looks a little strange, this will be done
with at least one more field in a subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

773f6c5b

svcrdma: Fix SQ wake-ups · 5b9a8589

Chuck Lever authored Feb 04, 2024

Ensure there is a wake-up when increasing sc_sq_avail.

Likewise, if a wake-up is done, sc_sq_avail needs to be updated,
otherwise the wait_event() conditional is never going to be met.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5b9a8589

svcrdma: Increase the per-transport rw_ctx count · 2da0f610

Chuck Lever authored Feb 04, 2024

rdma_rw_mr_factor() returns the smallest number of MRs needed to
move a particular number of pages. svcrdma currently asks for the
number of MRs needed to move RPCSVC_MAXPAGES (a little over one
megabyte), as that is the number of pages in the largest r/wsize
the server supports.

This call assumes that the client's NIC can bundle a full one
megabyte payload in a single rdma_segment. In fact, most NICs cannot
handle a full megabyte with a single rkey / rdma_segment. Clients
will typically split even a single Read chunk into many segments.

The server needs one MR to read each rdma_segment in a Read chunk,
and thus each one needs an rw_ctx.

svcrdma has been vastly underestimating the number of rw_ctxs needed
to handle 64 RPC requests with large Read chunks using small
rdma_segments.

Unfortunately there doesn't seem to be a good way to estimate this
number without knowing the client NIC's capabilities. Even then,
the client RPC/RDMA implementation is still free to split a chunk
into smaller segments (for example, it might be using physical
registration, which needs an rdma_segment per page).

The best we can do for now is choose a number that will guarantee
forward progress in the worst case (one page per segment).

At some later point, we could add some mechanisms to make this
much less of a problem:
- Add a core API to add more rw_ctxs to an already-established QP
- svcrdma could treat rw_ctx exhaustion as a temporary error and
  try again
- Limit the number of Reads in flight
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2da0f610

svcrdma: Update max_send_sges after QP is created · 4c8c0fa0

Chuck Lever authored Feb 04, 2024

rdma_create_qp() can modify cap.max_send_sges. Copy the new value
to the svcrdma transport so it is bound by the new limit instead
of the requested one.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

4c8c0fa0

svcrdma: Report CQ depths in debugging output · 5485d6dd

Chuck Lever authored Feb 04, 2024

Check that svc_rdma_accept() is allocating an appropriate number of
CQEs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5485d6dd

svcrdma: Reserve an extra WQE for ib_drain_rq() · e67792cc

Chuck Lever authored Feb 04, 2024

Do as other ULPs already do: ensure there is an extra Receive WQE
reserved for the tear-down drain WR. I haven't heard reports of
problems but it can't hurt.

Note that rq_depth is used to compute the Send Queue depth as well,
so this fix should affect both the SQ and RQ.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

e67792cc

MAINTAINERS: add Alex Aring as Reviewer for file locking code · c8004c1c

Jeff Layton authored Feb 06, 2024

Alex helps co-maintain the DLM code and did some recent work to fix up
how lockd and GFS2 work together. Add him as a Reviewer for file locking
changes.
Acked-by: Alexander Aring <aahringo@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c8004c1c

nfsd: Simplify the allocation of slab caches in nfsd4_init_slabs · 649e58d5

Kunwu Chan authored Feb 01, 2024

Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Make the code cleaner and more readable.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

649e58d5

nfsd: Simplify the allocation of slab caches in nfsd_drc_slab_create · 192d80cd

Kunwu Chan authored Feb 04, 2024

Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
And change cache name from 'nfsd_drc' to 'nfsd_cacherep'.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

192d80cd

nfsd: Simplify the allocation of slab caches in nfsd_file_cache_init · 2f74991a

Kunwu Chan authored Jan 31, 2024

Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2f74991a

nfsd: Simplify the allocation of slab caches in nfsd4_init_pnfs · 10bcc2f1

Kunwu Chan authored Jan 31, 2024

commit 0a31bd5f ("KMEM_CACHE(): simplify slab cache creation")
introduces a new macro.
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

10bcc2f1

nfsd: don't call locks_release_private() twice concurrently · 05eda6e7

NeilBrown authored Jan 31, 2024

It is possible for free_blocked_lock() to be called twice concurrently,
once from nfsd4_lock() and once from nfsd4_release_lockowner() calling
remove_blocked_locks(). This is why a kref was added.

It is perfectly safe for locks_delete_block() and kref_put() to be
called in parallel as they use locking or atomicity respectively as
protection. However locks_release_private() has no locking. It is
safe for it to be called twice sequentially, but not concurrently.

This patch moves that call from free_blocked_lock() where it could race
with itself, to free_nbl() where it cannot. This will slightly delay
the freeing of private info or release of the owner - but not by much.
It is arguably more natural for this freeing to happen in free_nbl()
where the structure itself is freed.

This bug was found by code inspection - it has not been seen in practice.

Fixes: 47446d74 ("nfsd4: add refcount for nfsd4_blocked_lock")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>