Commits · c20106944eb679fa3ab7e686fe5f6ba30fbc51e5 · Kirill Smelkov / linux

06 Oct, 2021 1 commit

NFSD: Keep existing listeners on portlist error · c2010694

Benjamin Coddington authored Oct 06, 2021

If nfsd has existing listening sockets without any processes, then an error
returned from svc_create_xprt() for an additional transport will remove
those existing listeners. We're seeing this in practice when userspace
attempts to create rpcrdma transports without having the rpcrdma modules
present before creating nfsd kernel processes. Fix this by checking for
existing sockets before calling nfsd_destroy().
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c2010694

01 Oct, 2021 2 commits

SUNRPC: fix sign error causing rpcsec_gss drops · 2ba5acfb

J. Bruce Fields authored Oct 01, 2021

If sd_max is unsigned, then sd_max - GSS_SEQ_WIN is a very large number
whenever sd_max is less than GSS_SEQ_WIN, and the comparison:

	seq_num <= sd->sd_max - GSS_SEQ_WIN

in gss_check_seq_num is pretty much always true, even when that's
clearly not what was intended.

This was causing pynfs to hang when using krb5, because pynfs uses zero
as the initial gss sequence number.  That's perfectly legal, but this
logic error causes knfsd to drop the rpc in that case.  Out-of-order
sequence IDs in the first GSS_SEQ_WIN (128) calls will also cause this.

Fixes: 10b9d99a ("SUNRPC: Augment server-side rpcgss tracepoints")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2ba5acfb

nfsd: Fix a warning for nfsd_file_close_inode · 19598141

Trond Myklebust authored Sep 30, 2021

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

19598141

30 Sep, 2021 2 commits

nfsd4: Handle the NFSv4 READDIR 'dircount' hint being zero · f2e717d6

Trond Myklebust authored Sep 30, 2021

RFC3530 notes that the 'dircount' field may be zero, in which case the
recommendation is to ignore it, and only enforce the 'maxcount' field.
In RFC5661, this recommendation to ignore a zero valued field becomes a
requirement.

Fixes: aee37764 ("nfsd4: fix rd_dircount enforcement")
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f2e717d6

nfsd: fix error handling of register_pernet_subsys() in init_nfsd() · 1d625050

Patrick Ho authored Aug 21, 2021

init_nfsd() should not unregister pernet subsys if the register fails
but should instead unwind from the last successful operation which is
register_filesystem().

Unregistering a failed register_pernet_subsys() call can result in
a kernel GPF as revealed by programmatically injecting an error in
register_pernet_subsys().

Verified the fix handled failure gracefully with no lingering nfsd
entry in /proc/filesystems.  This change was introduced by the commit
bd5ae928 ("nfsd: register pernet ops last, unregister first"),
the original error handling logic was correct.

Fixes: bd5ae928 ("nfsd: register pernet ops last, unregister first")
Cc: stable@vger.kernel.org
Signed-off-by: Patrick Ho <Patrick.Ho@netapp.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

1d625050

17 Sep, 2021 2 commits

nfsd: back channel stuck in SEQ4_STATUS_CB_PATH_DOWN · 02579b2f

Dai Ngo authored Sep 16, 2021

When the back channel enters SEQ4_STATUS_CB_PATH_DOWN state, the client
recovers by sending BIND_CONN_TO_SESSION but the server fails to recover
the back channel and leaves it as NFSD4_CB_DOWN.

Fix by enhancing nfsd4_bind_conn_to_session to probe the back channel
by calling nfsd4_probe_callback.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

02579b2f

NLM: Fix svcxdr_encode_owner() · 89c485c7

Chuck Lever authored Sep 16, 2021

Dai Ngo reports that, since the XDR overhaul, the NLM server crashes
when the TEST procedure wants to return NLM_DENIED. There is a bug
in svcxdr_encode_owner() that none of our standard test cases found.

Replace the open-coded function with a call to an appropriate
pre-fabricated XDR helper.
Reported-by: Dai Ngo <Dai.Ngo@oracle.com>
Fixes: a6a63ca5 ("lockd: Common NLM XDR helpers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

89c485c7

03 Sep, 2021 1 commit

SUNRPC: improve error response to over-size gss credential · 0c217d50

NeilBrown authored Sep 02, 2021

When the NFS server receives a large gss (kerberos) credential and tries
to pass it up to rpc.svcgssd (which is deprecated), it triggers an
infinite loop in cache_read().

cache_request() always returns -EAGAIN, and this causes a "goto again".

This patch:
 - changes the error to -E2BIG to avoid the infinite loop, and
 - generates a WARN_ONCE when rsi_request first sees an over-sized
   credential.  The warning suggests switching to gssproxy.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196583Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

0c217d50

01 Sep, 2021 1 commit

SUNRPC: don't pause on incomplete allocation · e38b3f20

NeilBrown authored Aug 30, 2021

alloc_pages_bulk_array() attempts to allocate at least one page based on
the provided pages, and then opportunistically allocates more if that
can be done without dropping the spinlock.

So if it returns fewer than requested, that could just mean that it
needed to drop the lock.  In that case, try again immediately.

Only pause for a time if no progress could be made.
Reported-and-tested-by: Mike Javorski <mike.javorski@gmail.com>
Reported-and-tested-by: Lothar Paltins <lopa@mailbox.org>
Fixes: f6e70aab ("SUNRPC: refresh rq_pages using a bulk page allocator")
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Mel Gorman <mgorman@suse.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

e38b3f20

26 Aug, 2021 4 commits

nfsd: fix crash on LOCKT on reexported NFSv3 · 0bcc7ca4

J. Bruce Fields authored Aug 24, 2021

Unlike other filesystems, NFSv3 tries to use fl_file in the GETLK case.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

0bcc7ca4

nfs: don't allow reexport reclaims · bb0a55bb

J. Bruce Fields authored Aug 20, 2021

In the reexport case, nfsd is currently passing along locks with the
reclaim bit set.  The client sends a new lock request, which is granted
if there's currently no conflict--even if it's possible a conflicting
lock could have been briefly held in the interim.

We don't currently have any way to safely grant reclaim, so for now
let's just deny them all.

I'm doing this by passing the reclaim bit to nfs and letting it fail the
call, with the idea that eventually the client might be able to do
something more forgiving here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

bb0a55bb

lockd: don't attempt blocking locks on nfs reexports · b840be2f

J. Bruce Fields authored Aug 20, 2021

As in the v4 case, it doesn't work well to block waiting for a lock on
an nfs filesystem.

As in the v4 case, that means we're depending on the client to poll.
It's probably incorrect to depend on that, but I *think* clients do poll
in practice.  In any case, it's an improvement over hanging the lockd
thread indefinitely as we currently are.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b840be2f

nfs: don't atempt blocking locks on nfs reexports · f657f8ee

J. Bruce Fields authored Aug 20, 2021

NFS implements blocking locks by blocking inside its lock method. In
the reexport case, this blocks the nfs server thread, which could lead
to deadlocks since an nfs server thread might be required to unlock the
conflicting lock. It also causes a crash, since the nfs server thread
assumes it can free the lock when its lm_notify lock callback is called.

Ideal would be to make the nfs lock method return without blocking in
this case, but for now it works just not to attempt blocking locks. The
difference is just that the original client will have to poll (as it
does in the v4.0 case) instead of getting a callback when the lock's
available.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f657f8ee

23 Aug, 2021 4 commits

Keep read and write fds with each nlm_file · 7f024fcd

J. Bruce Fields authored Aug 23, 2021

We shouldn't really be using a read-only file descriptor to take a write
lock.

Most filesystems will put up with it.  But NFS, for example, won't.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

7f024fcd

lockd: update nlm_lookup_file reexport comment · b661601a

J. Bruce Fields authored Aug 20, 2021

Update comment to reflect that we *do* allow reexport, whether it's a
good idea or not....
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b661601a

nlm: minor refactoring · a81041b7

J. Bruce Fields authored Aug 23, 2021

Make this lookup slightly more concise, and prepare for changing how we
look this up in a following patch.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a81041b7

nlm: minor nlm_lookup_file argument change · 2dc6f19e

J. Bruce Fields authored Aug 23, 2021

It'll come in handy to get the whole nlm_lock.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2dc6f19e

21 Aug, 2021 1 commit

lockd: lockd server-side shouldn't set fl_ops · 7de875b2

J. Bruce Fields authored Aug 20, 2021

Locks have two sets of op arrays, fl_lmops for the lock manager (lockd
or nfsd), fl_ops for the filesystem.  The server-side lockd code has
been setting its own fl_ops, which leads to confusion (and crashes) in
the reexport case, where the filesystem expects to be the only one
setting fl_ops.

And there's no reason for it that I can see-the lm_get/put_owner ops do
the same job.
Reported-by: Daire Byrne <daire@dneg.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

7de875b2

20 Aug, 2021 4 commits

SUNRPC: Add documentation for the fail_sunrpc/ directory · 400edd8c
Chuck Lever authored Aug 09, 2021
```
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
```
400edd8c

SUNRPC: Server-side disconnect injection · 3a126180

Chuck Lever authored Aug 03, 2021

Disconnect injection stress-tests the ability for both client and
server implementations to behave resiliently in the face of network
instability.

A file called /sys/kernel/debug/fail_sunrpc/ignore-server-disconnect
enables administrators to turn off server-side disconnect injection
while allowing other types of sunrpc errors to be injected. The
default setting is that server-side disconnect injection is enabled
(ignore=false).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

3a126180

SUNRPC: Move client-side disconnect injection · a4ae3081

Chuck Lever authored Aug 05, 2021

Disconnect injection stress-tests the ability for both client and
server implementations to behave resiliently in the face of network
instability.

Convert the existing client-side disconnect injection infrastructure
to use the kernel's generic error injection facility. The generic
facility has a richer set of injection criteria.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a4ae3081

SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory · c782af25

Chuck Lever authored Aug 03, 2021

This directory will contain a set of administrative controls for
enabling error injection for kernel RPC consumers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c782af25

19 Aug, 2021 1 commit

svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free() · 729580dd

Chuck Lever authored Aug 18, 2021

svc_xprt_free() already "puts" the bc_xprt before calling the
transport's "free" method. No need to do it twice.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

729580dd

17 Aug, 2021 17 commits

nfsd4: Fix forced-expiry locking · f7104cc1

J. Bruce Fields authored Aug 12, 2021

This should use the network-namespace-wide client_lock, not the
per-client cl_lock.

You shouldn't see any bugs unless you're actually using the
forced-expiry interface introduced by 89c905be.

Fixes: 89c905be "nfsd: allow forced expiration of NFSv4 clients"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f7104cc1

rpc: fix gss_svc_init cleanup on failure · 5a475344

J. Bruce Fields authored Aug 12, 2021

The failure case here should be rare, but it's obviously wrong.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5a475344

SUNRPC: Add RPC_AUTH_TLS protocol numbers · b4ab2fea

Chuck Lever authored Jul 30, 2021

Shared by client and server. See:

https://www.iana.org/assignments/rpc-authentication-numbers/rpc-authentication-numbers.xhtmlSigned-off-by: Chuck Lever <chuck.lever@oracle.com>

b4ab2fea

lockd: change the proc_handler for nsm_use_hostnames · d02a3a2c

Jia He authored Aug 03, 2021

nsm_use_hostnames is a module parameter and it will be exported to sysctl
procfs. This is to let user sometimes change it from userspace. But the
minimal unit for sysctl procfs read/write it sizeof(int).
In big endian system, the converting from/to  bool to/from int will cause
error for proc items.

This patch use a new proc_handler proc_dobool to fix it.
Signed-off-by: Jia He <hejianet@gmail.com>
Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
[thuth: Fix typo in commit message]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

d02a3a2c

sysctl: introduce new proc handler proc_dobool · a2071573

Jia He authored Aug 03, 2021

This is to let bool variable could be correctly displayed in
big/little endian sysctl procfs. sizeof(bool) is arch dependent,
proc_dobool should work in all arches.
Suggested-by: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Signed-off-by: Jia He <hejianet@gmail.com>
[thuth: rebased the patch to the current kernel version]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

a2071573

SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency() · 5c117207

Chuck Lever authored Aug 05, 2021

Some paths through svc_process() leave rqst->rq_procinfo set to
NULL, which triggers a crash if tracing happens to be enabled.

Fixes: 89ff8749 ("SUNRPC: Display RPC procedure names instead of proc numbers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5c117207

NFSD: remove vanity comments · ea49dc79

NeilBrown authored Jul 28, 2021

Including one's name in copyright claims is appropriate.  Including it
in random comments is just vanity.  After 2 decades, it is time for
these to be gone.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ea49dc79

svcrdma: Convert rdma->sc_rw_ctxts to llist · 07a92d00

Chuck Lever authored Feb 08, 2021

Relieve contention on sc_rw_ctxt_lock by converting rdma->sc_rw_ctxts
to an llist.

The goal is to reduce the average overhead of Send completions,
because a transport's completion handlers are single-threaded on
one CPU core. This change reduces CPU utilization of each Send
completion by 2-3% on my server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>

07a92d00

svcrdma: Relieve contention on sc_send_lock. · b6c2bfea

Chuck Lever authored Feb 09, 2021

/proc/lock_stat indicates the the sc_send_lock is heavily
contended when the server is under load from a single client.

To address this, convert the send_ctxt free list to an llist.
Returning an item to the send_ctxt cache is now waitless, which
reduces the instruction path length in the single-threaded Send
handler (svc_rdma_wc_send).

The goal is to enable the ib_comp_wq worker to handle a higher
RPC/RDMA Send completion rate given the same CPU resources. This
change reduces CPU utilization of Send completion by 2-3% on my
server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>

b6c2bfea

svcrdma: Fewer calls to wake_up() in Send completion handler · 6c8c84f5

Chuck Lever authored Jul 07, 2021

Because wake_up() takes an IRQ-safe lock, it can be expensive,
especially to call inside of a single-threaded completion handler.
What's more, the Send wait queue almost never has waiters, so
most of the time, this is an expensive no-op.

As always, the goal is to reduce the average overhead of each
completion, because a transport's completion handlers are single-
threaded on one CPU core. This change reduces CPU utilization of
the Send completion thread by 2-3% on my server.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Tom Talpey <tom@talpey.com>

6c8c84f5

lockd: Fix invalid lockowner cast after vfs_test_lock · cd2d644d

Benjamin Coddington authored Jul 26, 2021

After calling vfs_test_lock() the pointer to a conflicting lock can be
returned, and that lock is not guarunteed to be owned by nlm.  In that
case, we cannot cast it to struct nlm_lockowner.  Instead return the pid
of that conflicting lock.

Fixes: 646d73e9 ("lockd: Show pid of lockd for remote locks")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

cd2d644d

NFSD: Use new __string_len C macros for nfsd_clid_class · d27b74a8
Chuck Lever authored May 14, 2021
```
Clean up.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
```
d27b74a8
NFSD: Use new __string_len C macros for the nfs_dirent tracepoint · 408c0de7
Chuck Lever authored May 12, 2021
```
Clean up.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
```
408c0de7

tracing: Add trace_event helper macros __string_len() and __assign_str_len() · 883b4aee

Steven Rostedt (VMware) authored Jul 16, 2021

There's a few cases that a string that is to be recorded in a trace event,
does not have a terminating 'nul' character, and instead, the tracepoint
passes in the length of the string to record.

Add two helper macros to the trace event code that lets this work easier,
than tricks with "%.*s" logic.

  __string_len() which is similar to __string() for declaration, but takes a
                 length argument.

  __assign_str_len() which is similar to __assign_str() for assiging the
                 string, but it too takes a length argument.

Note, the TRACE_EVENT() macro will allocate the location on the ring
buffer to 'len + 1', that will be used to store the string into. It is a
requirement that the 'len' used for this is a most the length of the
string being recorded.

This string can still use __get_str() just like strings created with
__string() can use to retrieve the string.

Link: https://lore.kernel.org/linux-nfs/20210513105018.7539996a@gandalf.local.home/Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

883b4aee

NFSD: Batch release pages during splice read · 496d83cf

Chuck Lever authored Jun 28, 2021

Large splice reads call put_page() repeatedly. put_page() is
relatively expensive to call, so replace it with the new
svc_rqst_replace_page() helper to help amortize that cost.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>

496d83cf

SUNRPC: Add svc_rqst_replace_page() API · 2f0f88f4

Chuck Lever authored Jul 01, 2021

Replacing a page in rq_pages[] requires a get_page(), which is a
bus-locked operation, and a put_page(), which can be even more
costly.

To reduce the cost of replacing a page in rq_pages[], batch the
put_page() operations by collecting "freed" pages in a pagevec,
and then release those pages when the pagevec is full. This
pagevec is also emptied when each RPC completes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2f0f88f4

NFSD: Clean up splice actor · c7e0b781

Chuck Lever authored Jun 28, 2021

A few useful observations:

 - The value in @size is never modified.

 - splice_desc.len is an unsigned int, and so is xdr_buf.page_len.
   An implicit cast to size_t is unnecessary.

 - The computation of .page_len is the same in all three arms
   of the "if" statement, so hoist it out to make it clear that
   the operation is an unconditional invariant.

The resulting function is 18 bytes shorter on my system (-Os).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>

c7e0b781