Commits · b1691bc03d4eddb959234409167bef9be9e62d74 · nexedi / linux

09 Dec, 2014 13 commits

sunrpc: convert to lockless lookup of queued server threads · b1691bc0

Jeff Layton authored Nov 21, 2014

Testing has shown that the pool->sp_lock can be a bottleneck on a busy
server. Every time data is received on a socket, the server must take
that lock in order to dequeue a thread from the sp_threads list.

Address this problem by eliminating the sp_threads list (which contains
threads that are currently idle) and replacing it with a RQ_BUSY flag in
svc_rqst. This allows us to walk the sp_all_threads list under the
rcu_read_lock and find a suitable thread for the xprt by doing a
test_and_set_bit.

Note that we do still have a potential atomicity problem however with
this approach. We don't want svc_xprt_do_enqueue to set the
rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
zero (which indicates that the thread was idle). But, by the time we
check that, the bit could be flipped by a waking thread.

To address this, we acquire a new per-rqst spinlock (rq_lock) and take
that before doing the test_and_set_bit. If that returns false, then we
can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
it must set the bit under the same spinlock and can trust that if it was
already set then the rq_xprt is also properly set.

With this scheme, the case where we have an idle thread no longer needs
to take the highly contended pool->sp_lock at all, and that removes the
bottleneck.

That still leaves one issue: What of the case where we walk the whole
sp_all_threads list and don't find an idle thread? Because the search is
lockess, it's possible for the queueing to race with a thread that is
going to sleep. To address that, we queue the xprt and then search again.

If we find an idle thread at that point, we can't attach the xprt to it
directly since that might race with a different thread waking up and
finding it. All we can do is wake the idle thread back up and let it
attempt to find the now-queued xprt.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

b1691bc0

sunrpc: fix potential races in pool_stats collection · 403c7b44

Jeff Layton authored Nov 21, 2014

In a later patch, we'll be removing some spinlocking around the socket
and thread queueing code in order to fix some contention problems. At
that point, the stats counters will no longer be protected by the
sp_lock.

Change the counters to atomic_long_t fields, except for the
"sockets_queued" counter which will still be manipulated under a
spinlock.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

403c7b44

sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it · 81244386

Jeff Layton authored Nov 21, 2014

...also make the manipulation of sp_all_threads list use RCU-friendly
functions.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

81244386

sunrpc: require svc_create callers to pass in meaningful shutdown routine · 0b5707e4

Jeff Layton authored Nov 19, 2014

Currently all svc_create callers pass in NULL for the shutdown parm,
which then gets fixed up to be svc_rpcb_cleanup if the service uses
rpcbind.

Simplify this by instead having the the only caller that requires it
(lockd) pass in svc_rpcb_cleanup and get rid of the special casing.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

0b5707e4

sunrpc: have svc_wake_up only deal with pool 0 · ceff739c

Jeff Layton authored Nov 19, 2014

The way that svc_wake_up works is a bit inefficient. It walks all of the
available pools for a service and either wakes up a task in each one or
sets the SP_TASK_PENDING flag in each one.

When svc_wake_up is called, there is no need to wake up more than one
thread to do this work. In practice, only lockd currently uses this
function and it's single threaded anyway. Thus, this just boils down to
doing a wake up of a thread in pool 0 or setting a single flag.

Eliminate the for loop in this function and change it to just operate on
pool 0. Also update the comments that sit above it and get rid of some
code that has been commented out for years now.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

ceff739c

sunrpc: convert sp_task_pending flag to use atomic bitops · 4d5db3f5

Jeff Layton authored Nov 19, 2014

In a later patch, we'll want to be able to handle this flag without
holding the sp_lock. Change this field to an unsigned long flags
field, and declare a new flag in it that can be managed with atomic
bitops.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

4d5db3f5

sunrpc: move rq_cachetype field to better optimize space · 62978b3c

Jeff Layton authored Nov 19, 2014

There are a couple of holes in the svc_rqst field on x86_64. Move the
rq_cachetype to a different location to eliminate both of them.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

62978b3c

sunrpc: move rq_splice_ok flag into rq_flags · 779fb0f3

Jeff Layton authored Nov 19, 2014

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

779fb0f3

sunrpc: move rq_dropme flag into rq_flags · 78b65eb3

Jeff Layton authored Nov 19, 2014

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

78b65eb3

sunrpc: move rq_usedeferral flag to rq_flags · 30660e04

Jeff Layton authored Nov 19, 2014

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

30660e04

sunrpc: move rq_local field to rq_flags · 7501cc2b

Jeff Layton authored Nov 19, 2014

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

7501cc2b

sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it · 4d152e2c

Jeff Layton authored Nov 19, 2014

In a later patch, we're going to need some atomic bit flags. Since that
field will need to be an unsigned long, we mitigate that space
consumption by migrating some other bitflags to the new field. Start
with the rq_secure flag.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

4d152e2c

Merge tag 'nfs-for-3.19-1' into nfsd for-3.19 branch · 2941b0e9

J. Bruce Fields authored Dec 09, 2014

Mainly what I need is 860a0d9e "sunrpc: add some tracepoints in
svc_rqst handling functions", which subsequent server rpc patches from
jlayton depend on. I'm merging this later tag on the assumption that's
more likely to be a tested and stable point.

2941b0e9

01 Dec, 2014 3 commits

nfsd: minor off by one checks in __write_versions() · 818f2f57

Dan Carpenter authored Nov 27, 2014

My static checker complains that if "len == remaining" then it means we
have truncated the last character off the version string.

The intent of the code is that we print as many versions as we can
without truncating a version.  Then we put a newline at the end.  If the
newline can't fit we return -EINVAL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

818f2f57

sunrpc: release svc_pool_map reference when serv allocation fails · 067f96ef

Jeff Layton authored Nov 19, 2014

Currently, it leaks when the allocation fails.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

067f96ef

sunrpc: eliminate the XPT_DETACHED flag · 8d65ef76

Jeff Layton authored Nov 17, 2014

All it does is indicate whether a xprt has already been deleted from
a list or not, which is unnecessary since we use list_del_init and it's
always set and checked under the sv_lock anyway.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

8d65ef76

27 Nov, 2014 2 commits

sunrpc: add a debugfs rpc_xprt directory with an info file in it · 388f0c77

Jeff Layton authored Nov 26, 2014

Add a new directory heirarchy under the debugfs sunrpc/ directory:

    sunrpc/
        rpc_xprt/
            <xprt id>/

Within that directory, we can put files that give info about the
xprts. We do have the (minor) problem that there is no succinct,
unique identifier for rpc_xprts. So we generate them synthetically
with a static atomic_t counter.

For now, this directory just holds an "info" file, but we may add
other files to it in the future.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

388f0c77

sunrpc: add debugfs file for displaying client rpc_task queue · b4b9d2cc

Jeff Layton authored Nov 26, 2014

It's possible to get a dump of the RPC task queue by writing a value to
/proc/sys/sunrpc/rpc_debug. If you write any value to that file, you get
a dump of the RPC client task list into the log buffer. This is a rather
inconvenient interface however, and makes it hard to get immediate info
about the task queue.

Add a new directory hierarchy under debugfs:

    sunrpc/
        rpc_clnt/
            <clientid>/

Within each clientid directory we create a new "tasks" file that will
dump info similar to what shows up in the log buffer, but with a few
small differences -- we avoid printing raw kernel addresses in favor of
symbolic names and the XID is also displayed.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

b4b9d2cc

26 Nov, 2014 2 commits

Merge tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next · ea526413

Trond Myklebust authored Nov 26, 2014

Pull NFS client RDMA changes for 3.19 from Anna Schumaker:
 "NFS: Client side changes for RDMA

  These patches various bugfixes and cleanups for using NFS over RDMA, including
  better error handling and performance improvements by using pad optimization.

  Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"

* tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  xprtrdma: Display async errors
  xprtrdma: Enable pad optimization
  xprtrdma: Re-write rpcrdma_flush_cqs()
  xprtrdma: Refactor tasklet scheduling
  xprtrdma: unmap all FMRs during transport disconnect
  xprtrdma: Cap req_cqinit
  xprtrdma: Return an errno from rpcrdma_register_external()

ea526413

Merge tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next · 1702562d

Trond Myklebust authored Nov 26, 2014

Pull pull additional NFS client changes for 3.19 from Anna Schumaker:
  "NFS: Generic client side changes from Chuck

  These patches fixes for iostats and SETCLIENTID in addition to cleaning
  up the nfs4_init_callback() function.

  Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"

* tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  NFS: Clean up nfs4_init_callback()
  NFS: SETCLIENTID XDR buffer sizes are incorrect
  SUNRPC: serialize iostats updates

1702562d

25 Nov, 2014 15 commits

nfs: Add DEALLOCATE support · 624bd5b7

Anna Schumaker authored Nov 25, 2014

This patch adds support for using the NFS v4.2 operation DEALLOCATE to
punch holes in a file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

624bd5b7

nfs: Add ALLOCATE support · f4ac1674

Anna Schumaker authored Nov 25, 2014

This patch adds support for using the NFS v4.2 operation ALLOCATE to
preallocate data in a file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f4ac1674

NFS: Clean up nfs4_init_callback() · c2ef47b7

Chuck Lever authored Nov 08, 2014

nfs4_init_callback() is never invoked for NFS versions other than 4.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c2ef47b7

NFS: SETCLIENTID XDR buffer sizes are incorrect · 6dd3436b

Chuck Lever authored Nov 08, 2014

Use the correct calculation of the maximum size of a clientaddr4
when encoding and decoding SETCLIENTID operations. clientaddr4 is
defined in section 2.2.10 of RFC3530bis-31.

The usage in encode_setclientid_maxsz is missing the 4-byte length
in both strings, but is otherwise correct. decode_setclientid_maxsz
simply asks for a page of receive buffer space, which is
unnecessarily large (more than 4KB).

Note that a SETCLIENTID reply is either clientid+verifier, or
clientaddr4, depending on the returned NFS status. It doesn't
hurt to allocate enough space for both.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

6dd3436b

SUNRPC: serialize iostats updates · edef1297

Chuck Lever authored Nov 08, 2014

Occasionally mountstats reports a negative retransmission rate.
Ensure that two RPCs completing concurrently don't confuse the sums
in the transport's op_metrics array.

Since pNFS filelayout can invoke rpc_count_iostats() on another
transport from xprt_release(), we can't rely on simply holding the
transport_lock in xprt_release(). There's nothing for it but hard
serialization. One spin lock per RPC operation should make this as
painless as it can be.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

edef1297

xprtrdma: Display async errors · 7ff11de1

Chuck Lever authored Nov 08, 2014

An async error upcall is a hard error, and should be reported in
the system log.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

7ff11de1

xprtrdma: Enable pad optimization · d5440e27

Chuck Lever authored Nov 08, 2014

The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
pad optimization was enabled. That bug was fixed by commit
e560e3b5 ("svcrdma: Add zero padding if the client doesn't send
it").

We can now enable pad optimization on the client, which helps
performance and is supported now by both Linux and Solaris servers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

d5440e27

xprtrdma: Re-write rpcrdma_flush_cqs() · 5c166bef

Chuck Lever authored Nov 08, 2014

Currently rpcrdma_flush_cqs() attempts to avoid code duplication,
and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall.

1. rpcrdma_flush_cqs() can run concurrently with provider upcalls.
   Both flush_cqs() and the upcalls were invoking ib_poll_cq() in
   different threads using the same wc buffers (ep->rep_recv_wcs
   and ep->rep_send_wcs), added by commit 1c00dd07 ("xprtrmda:
   Reduce calls to ib_poll_cq() in completion handlers").

   During transport disconnect processing, this sometimes resulted
   in the same reply getting added to the rpcrdma_tasklets_g list
   more than once, which corrupted the list.

2. The upcall functions drain only a limited number of CQEs,
   thanks to the poll budget added by commit 8301a2c0
   ("xprtrdma: Limit work done by completion handler").

Fixes: a7bc211a ("xprtrdma: On disconnect, don't ignore ... ")
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

5c166bef

xprtrdma: Refactor tasklet scheduling · f1a03b76

Chuck Lever authored Nov 08, 2014

Restore the separate function that schedules the reply handling
tasklet. I need to call it from two different paths.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

f1a03b76

xprtrdma: unmap all FMRs during transport disconnect · 467c9674

Chuck Lever authored Nov 08, 2014

When using RPCRDMA_MTHCAFMR memory registration, after a few
transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
return EINVAL because the provider has exhausted its map pool.

Make sure that all FMRs are unmapped during transport disconnect,
and that ->send_request remarshals them during an RPC retransmit.
This resets the transport's MRs to ensure that none are leaked
during a disconnect.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

467c9674

xprtrdma: Cap req_cqinit · e7104a2a

Chuck Lever authored Nov 08, 2014

Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.

Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.

The number of Work Requests that are allowed to remain unsignaled is
determined by the value of req_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.

For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).

For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.

Some providers hold back dispatching Work Requests until a CQE is
generated.  If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.

I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.

Cap the rep_cqinit setting so completions are not left turned off
for too long.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

e7104a2a

xprtrdma: Return an errno from rpcrdma_register_external() · 92b98361

Chuck Lever authored Nov 08, 2014

The RPC/RDMA send_request method and the chunk registration code
expects an errno from the registration function. This allows
the upper layers to distinguish between a recoverable failure
(for example, temporary memory exhaustion) and a hard failure
(for example, a bug in the registration logic).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

92b98361

nfs: define nfs_inc_fscache_stats and using it as possible · e9f456ca

Li RongQing authored Nov 23, 2014

Define and use nfs_inc_fscache_stats when plus one, which can save to
pass one parameter.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

e9f456ca

nfs: replace nfs_add_stats with nfs_inc_stats when add one · 5a254d08

Li RongQing authored Nov 23, 2014

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

5a254d08

NFS: Deletion of unnecessary checks before the function call "nfs_put_client" · fe0bf118

Markus Elfring authored Nov 18, 2014

The nfs_put_client() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

fe0bf118

24 Nov, 2014 5 commits

sunrpc: eliminate RPC_TRACEPOINTS · 1306729b

Jeff Layton authored Nov 17, 2014

It's always set to the same value as CONFIG_TRACEPOINTS, so we can just
use that instead.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

1306729b

sunrpc: eliminate RPC_DEBUG · f895b252

Jeff Layton authored Nov 17, 2014

It's always set to whatever CONFIG_SUNRPC_DEBUG is, so just use that.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f895b252

lockd: eliminate LOCKD_DEBUG · 10b89567

Jeff Layton authored Nov 17, 2014

LOCKD_DEBUG is always the same value as CONFIG_SUNRPC_DEBUG, so we can
just use it instead.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

10b89567

nfs41: fix nfs4_proc_layoutget error handling · 4bd5a980

Peng Tao authored Nov 17, 2014

nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
early so that it is safe to call .release in case nfs4_alloc_pages
fails.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Fixes: a47970ff ("NFSv4.1: Hold reference to layout hdr in layoutget")
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

4bd5a980

NFS: fix subtle change in COMMIT behavior · cb1410c7

Weston Andros Adamson authored Nov 12, 2014

Recent work in the pgio layer made it possible for there to be more than one
request per page. This caused a subtle change in commit behavior, because
write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
writeback against the number of requests on a commit list to choose when to
send a COMMIT in a non-blocking flush.

This is probably hard to hit in normal operation - you have to be using
rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
aligned to have a noticeable change in behavior.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

cb1410c7