Commits · 8224b2734ab1da4996b851e1e5d3047e7a0df499 · Kirill Smelkov / linux

12 Sep, 2017 2 commits

NFS: Add static NFS I/O tracepoints · 8224b273

Chuck Lever authored Aug 21, 2017

Tools like tcpdump and rpcdebug can be very useful. But there are
plenty of environments where they are difficult or impossible to
use. For example, we've had customers report I/O failures during
workloads so heavy that collecting network traffic or enabling
RPC debugging is itself onerous.

The kernel's static tracepoints are lightweight (less likely to
introduce timing changes) and efficient (the trace data is compact).
They also work in scenarios where capturing network traffic is not
possible due to lack of hardware support (some InfiniBand HCAs) or
where data or network privacy is a concern.

Introduce tracepoints that show when an NFS READ, WRITE, or COMMIT
is initiated, and when it completes. Record the arguments and
results of each operation, which are not shown by the existing sunrpc
module's tracepoints.

For instance, the recorded offset and count can be used to match an
"initiate" event to a "done" event. If an NFS READ result returns
fewer bytes than requested or zero, seeing the EOF flag can be
probative. Seeing an NFS4ERR_BAD_STATEID result is also an indication
of a particular class of problems. The timing information attached
to each event record can often be useful as well.

Usage example:

[root@manet tmp]# trace-cmd record -e nfs:*initiate* -e nfs:*done
/sys/kernel/debug/tracing/events/nfs/*initiate*/filter
/sys/kernel/debug/tracing/events/nfs/*done/filter
Hit Ctrl^C to stop recording
^CKernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 3680
oldest event ts:    78.367422
now ts:   100.124419
dropped events: 0
read events: 74

... and so on.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
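
As a rough illustration of the matching described above, the C sketch below pairs an "initiate" record with its "done" record by comparing offset and count. The struct io_event layout and the find_done() helper are hypothetical and much simpler than the real trace records.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one trace record. The real
 * tracepoints record more fields than this. */
struct io_event {
	int is_done;		/* 0 = "initiate", 1 = "done" */
	uint64_t offset;	/* file offset of the request */
	uint32_t count;		/* number of bytes requested */
	double ts;		/* event timestamp in seconds */
};

/* Return the index of the first "done" event after 'start' whose
 * offset and count match the "initiate" event at 'start', or -1 if
 * there is none. */
long find_done(const struct io_event *ev, size_t n, size_t start)
{
	assert(start < n && !ev[start].is_done);
	for (size_t i = start + 1; i < n; i++) {
		if (ev[i].is_done &&
		    ev[i].offset == ev[start].offset &&
		    ev[i].count == ev[start].count)
			return (long)i;
	}
	return -1;
}
```

Subtracting the two timestamps of a matched pair then gives the latency of that operation.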

pNFS: Use the standard I/O stateid when calling LAYOUTGET · 70d2f7b1

Trond Myklebust authored Sep 11, 2017

Instead of having a private method for copying the open/delegation stateid,
use the same call that is used for standard I/O through the MDS.

Note that this means we transmit the stateid with a zero seqid, avoiding
issues with NFS4ERR_OLD_STATEID.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

09 Sep, 2017 5 commits

NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests() · 1bd5d6d0

Trond Myklebust authored Sep 09, 2017

If we skip a subrequest due to a zero refcount, we should still count
the byte range that it covered so that we accurately reconstruct the
original request size.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
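
A minimal sketch of the accounting rule in this commit, assuming a hypothetical struct subreq rather than the kernel's struct nfs_page: a subrequest with a zero refcount is not joined, but its bytes still count toward the original request size.

```c
#include <stddef.h>

/* Hypothetical stand-in for a subrequest. */
struct subreq {
	unsigned int bytes;	/* length of the byte range covered */
	int refcount;		/* zero means it is being torn down */
};

/* Reconstruct the size of the original request. Subrequests with a
 * zero refcount are skipped for joining, yet their byte range is
 * still added to the total. '*joined' receives the number of
 * subrequests that were not skipped. */
unsigned int total_bytes(const struct subreq *subs, size_t n, size_t *joined)
{
	unsigned int total = 0;

	*joined = 0;
	for (size_t i = 0; i < n; i++) {
		total += subs[i].bytes;	/* counted even when skipped */
		if (subs[i].refcount == 0)
			continue;	/* cannot take a reference: skip */
		(*joined)++;
	}
	return total;
}
```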

NFS: Don't hold the group lock when calling nfs_release_request() · 8b77484f

Trond Myklebust authored Sep 09, 2017

That can deadlock if this is the last reference since
nfs_page_group_destroy() calls nfs_page_group_sync_on_bit().
Note that even if the page was removed from the subpage list,
the req->wb_head could still be pointing to the old head.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Remove pnfs_generic_transfer_commit_list() · 5d2a9d9d

Trond Myklebust authored Sep 09, 2017

It's pretty much a duplicate of nfs_scan_commit_list() that also
clears the PG_COMMIT_TO_DS flag.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock · 137da553

Trond Myklebust authored Sep 09, 2017

Since the commit list is not ordered, it is possible for nfs_scan_commit_list
to hold a request that nfs_lock_and_join_requests() is waiting for, while
at the same time trying to grab a request that nfs_lock_and_join_requests
already holds.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Fix 2 use after free issues in the I/O code · 196639eb

Trond Myklebust authored Sep 08, 2017

The writeback code wants to send a commit after processing the pages,
which is why we want to delay releasing the struct path until after
that's done.

Also, the layout code expects that we do not free the inode before
we've put the layout segments in pnfs_writehdr_free() and
pnfs_readhdr_free().

Fixes: 919e3bd9 ("NFS: Ensure we commit after writeback is complete")
Fixes: 4714fb51 ("nfs: remove pgio_header refcount, related cleanup")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

07 Sep, 2017 1 commit

NFS: Sync the correct byte range during synchronous writes · e973b1a5

tarangg@amazon.com authored Sep 07, 2017

Since commit 18290650 ("NFS: Move buffered I/O locking into
nfs_file_write()") nfs_file_write() has not flushed the correct byte
range during synchronous writes.  generic_write_sync() expects that
iocb->ki_pos points to the right edge of the range rather than the
left edge.

To replicate the problem, open a file with O_DSYNC, have the client
write at increasing offsets, and then print the successful offsets.
Block port 2049 partway through that sequence, and observe that the
client application indicates successful writes in advance of what the
server received.

Fixes: 18290650 ("NFS: Move buffered I/O locking into nfs_file_write()")
Signed-off-by: Jacob Strauss <jsstraus@amazon.com>
Signed-off-by: Tarang Gupta <tarangg@amazon.com>
Tested-by: Tarang Gupta <tarangg@amazon.com>
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
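
The "right edge" convention can be sketched as follows. This is only an illustration of the arithmetic, assuming the flush covers the bytes that end at the current file position; struct sync_range and range_to_sync() are hypothetical names.

```c
#include <assert.h>

/* Inclusive byte range to flush. */
struct sync_range {
	long long start;
	long long end;
};

/* 'pos_after' is the file position after the write (the right edge
 * of the written data) and 'written' is the number of bytes written.
 * The range to sync ends one byte before 'pos_after'. */
struct sync_range range_to_sync(long long pos_after, long long written)
{
	struct sync_range r;

	assert(written > 0 && pos_after >= written);
	r.start = pos_after - written;
	r.end = pos_after - 1;
	return r;
}
```

If the position passed in were still the left edge of the write, the same arithmetic would select the bytes before the new data, which is the kind of mismatch this commit describes.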

06 Sep, 2017 5 commits

lockd: Delete an error message for a failed memory allocation in reclaimer() · 58a69893

Markus Elfring authored Aug 16, 2017

Omit an extra message for a memory allocation failure in this function.

This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: remove jiffies field from access cache · 03c6f7d6

NeilBrown authored Aug 16, 2017

This field hasn't been used since commit 57b69181 ("NFS: Cache
access checks more aggressively").
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: flush data when locking a file to ensure cache coherence for mmap. · 779eafab

NeilBrown authored Aug 18, 2017

When a byte range lock (or flock) is taken out on an NFS file, the
validity of the cached data is checked and the inode is marked
NFS_INODE_INVALID_DATA.  However the cached data isn't flushed from
the page cache.

This is sufficient for future read() requests or mmap() requests as
they call nfs_revalidate_mapping() which performs the flush if
necessary.

However an existing mapping is not affected.  Accessing data through
that mapping will continue to return old data even though the inode is
marked NFS_INODE_INVALID_DATA.

This can easily be confirmed using the 'nfs' tool in
  git://github.com/okirch/twopence-nfs.git
and running

   nfs coherence FILENAME

on one client, and

   nfs coherence -r FILENAME

on another client.

It appears that prior to Linux 2.6.0 this worked correctly.

However commit:

http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0

removed the call to inode_invalidate_pages() from nfs_zap_caches().  I
haven't tested this code, but inspection suggests that prior to this
commit, file locking would invalidate all inode pages.

This patch adds a call to nfs_revalidate_mapping() after a
successful SETLK so that invalid data is flushed.  With this patch the
above test passes.  To minimize impact (and possibly avoid a GETATTR
call) this only happens if the mapping might be mapped into
userspace.

Cc: Olaf Kirch <okir@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

SUNRPC: remove some dead code. · f1ecbc21

NeilBrown authored Aug 18, 2017

RPC_TASK_NO_RETRANS_TIMEOUT is set when cl_noretranstimeo
is set, which happens when  RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT is set,
which happens when NFS_CS_NO_RETRANS_TIMEOUT is set.

This flag means "don't resend on a timeout, only resend if the
connection gets broken for some reason".

cl_discrtry is set when RPC_CLNT_CREATE_DISCRTRY is set, which
happens when NFS_CS_DISCRTRY is set.

This flag means "always disconnect before resending".

NFS_CS_NO_RETRANS_TIMEOUT and NFS_CS_DISCRTRY are both only set
in nfs4_init_client(), and it always sets both.

So we will never have a situation where only one of the flags is set.
So this code, which tests if timeout retransmits are allowed, and
disconnection is required, will never run.

So it makes sense to remove this code as it cannot be tested and
could confuse people reading the code (like me).

(Alternatively, we could leave it there with a comment saying
 it is never actually used.)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: don't expect errors from mempool_alloc(). · 237f8306

NeilBrown authored Aug 18, 2017

Commit fbe77c30 ("NFS: move rw_mode to nfs_pageio_header")
reintroduced some pointless code that commit 518662e0 ("NFS: fix
usage of mempools.") had recently removed.

Remove it again.

Cc: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

05 Sep, 2017 2 commits

xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler · 9590d083

Chuck Lever authored Aug 23, 2017

Adopt the use of xprt_pin_rqst to eliminate contention between
Call-side users of rb_lock and the use of rb_lock in
rpcrdma_reply_handler.

This replaces the mechanism introduced in 431af645 ("xprtrdma:
Fix client lock-up after application signal fires").

Use recv_lock to quickly find the completing rqst, pin it, then
drop the lock. At that point invalidation and pull-up of the Reply
XDR can be done. Both are often expensive operations.

Finally, take recv_lock again to signal completion to the RPC
layer. It also protects adjustment of "cwnd".

This greatly reduces the amount of time a lock is held by the
reply handler. Comparing lock_stat results shows a marked decrease
in contention on rb_lock and recv_lock.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from
   the "out_norqst:" path in rpcrdma_reply_handler.]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

Merge tag 'nfs-rdma-for-4.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next · f9773b22

Trond Myklebust authored Sep 05, 2017

NFS-over-RDMA client updates for Linux 4.14

Bugfixes and cleanups:
- Constify rpc_xprt_ops
- Harden RPC call encoding and decoding
- Clean up rpc call decoding to use xdr_streams
- Remove unused variables from various structures
- Refactor code to remove imul instructions
- Rearrange rx_stats structure for better cacheline sharing

22 Aug, 2017 1 commit

xprtrdma: Re-arrange struct rx_stats · 67af6f65

Chuck Lever authored Aug 22, 2017

To reduce false cacheline sharing, separate counters that are likely
to be accessed in the Call path from those accessed in the Reply
path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
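
To show why the grouping matters, here is a small C11 sketch with hypothetical counters. The commit itself only reorders fields; the explicit alignas() and the 64-byte line size are assumptions added here so that the separation is visible and checkable.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed cache line size */

/* Hypothetical counters, grouped by the path that updates them so
 * that the Call path and the Reply path do not dirty the same line. */
struct xfer_stats {
	/* Updated while sending a Call */
	alignas(CACHELINE) unsigned long calls_sent;
	unsigned long bytes_sent;

	/* Updated while handling a Reply */
	alignas(CACHELINE) unsigned long replies_received;
	unsigned long bytes_received;
};
```

With this layout, offsetof(struct xfer_stats, replies_received) is a multiple of CACHELINE, so the two groups never share a line.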

20 Aug, 2017 6 commits

Merge branch 'bugfixes' · 7af7a596

Trond Myklebust authored Aug 20, 2017

NFS: Fix NFSv2 security settings · 53a75f22

Chuck Lever authored Aug 10, 2017

For a while now any NFSv2 mount where sec= is specified uses
AUTH_NULL. If sec= is not specified, the mount uses AUTH_UNIX.
Commit e68fd7c8 ("mount: use sec= that was specified on the
command line") attempted to address a very similar problem with
NFSv3, and should have fixed this too, but it has a bug.

The MNTv1 MNT procedure does not return a list of security flavors,
so our client makes up a list containing just AUTH_NULL. This should
enable nfs_verify_authflavors() to assign the sec= specified flavor,
but instead, it incorrectly sets it to AUTH_NULL.

I expect this would also be a problem for any NFSv3 server whose
MNTv3 MNT procedure returned a security flavor list containing only
AUTH_NULL.

Fixes: e68fd7c8 ("mount: use sec= that was specified on ... ")
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=310
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
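
The intended selection behavior can be sketched like this. It is a simplified model, not the body of nfs_verify_authflavors(): select_flavor() is a hypothetical helper, and it assumes that an AUTH_NULL entry in the server's list means the requested flavor may be used.

```c
#include <stddef.h>

#define AUTH_NULL 0	/* ONC RPC flavor numbers */
#define AUTH_UNIX 1

/* Pick the flavor to use, given the list the server returned and the
 * flavor requested with sec=. Returns -1 if nothing is acceptable. */
int select_flavor(const int *server, size_t n, int requested)
{
	int saw_null = 0;

	for (size_t i = 0; i < n; i++) {
		if (server[i] == requested)
			return requested;
		if (server[i] == AUTH_NULL)
			saw_null = 1;
	}
	/* Assumption: AUTH_NULL in the list lets the client keep the
	 * flavor the user asked for. Returning AUTH_NULL here instead
	 * would reproduce the bug described above. */
	return saw_null ? requested : -1;
}
```

For an NFSv2 mount the made-up list is just { AUTH_NULL }, so select_flavor() returns whatever sec= named.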

NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys' · b79e87e0

NeilBrown authored Aug 18, 2017

An NFSv4.1 client might close a file after the user who opened it has
logged off.  In this case the user's credentials may no longer be
valid, if they are e.g. kerberos credentials that have expired.

NFSv4.1 has a mechanism to allow the client to use machine credentials
to close a file.  However, due to a shortcoming in the RFC, a CLOSE
with those credentials may not be possible if the file in question
isn't exported to the same security flavor - the required PUTFH must
be rejected when this is the case.

Specifically if a server and client support kerberos in general and
have used it to form a machine credential, but the file is only
exported to "sec=sys", a PUTFH with the machine credentials will fail,
so CLOSE is not possible.

As RPC_AUTH_UNIX (used by sec=sys) credentials can never expire, there
is no value in using the machine credential in place of them.
So in that case, just use the user's credentials for CLOSE etc., as you would
in NFSv4.0.
Signed-off-by: Neil Brown <neilb@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

SUNRPC: ECONNREFUSED should cause a rebind. · fd01b259

NeilBrown authored Aug 18, 2017

If you
 - mount an NFSv3 filesystem
 - do some file locking which requires the server
   to make a GRANT call back
 - unmount
 - mount again and do the same locking

then the second attempt at locking suffers a 30 second delay.
Unmounting and remounting causes lockd to stop and restart,
which causes it to bind to a new port.
The server still thinks the old port is valid and gets ECONNREFUSED
when trying to contact it.
ECONNREFUSED should be seen as a hard error that is not worth
retrying.  Rebinding is the only reasonable response.

This patch forces a rebind if that makes sense.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Remove unused parameter gfp_flags from nfs_pageio_init() · 3bde7afd

Trond Myklebust authored Aug 20, 2017

Now that the mirror allocation has been moved, the parameter can go.
Also remove the redundant symbol export.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4: Fix up mirror allocation · 14abcb0b

Trond Myklebust authored Aug 19, 2017

There are a number of callers of nfs_pageio_complete() that want to
continue using the nfs_pageio_descriptor without needing to call
nfs_pageio_init() again. Examples include nfs_pageio_resend() and
nfs_pageio_cond_complete().

The problem is that nfs_pageio_complete() also calls
nfs_pageio_cleanup_mirroring(), which frees up the array of mirrors.
This can lead to writeback errors, in the next call to
nfs_pageio_setup_mirroring().

Fix by simply moving the allocation of the mirrors to
nfs_pageio_setup_mirroring().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196709
Reported-by: JianhongYin <yin-jianhong@163.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
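
The lifecycle problem and its fix can be modeled in userspace as below. All names (struct pageio_desc, desc_init(), desc_setup_mirroring(), desc_complete()) are hypothetical stand-ins; the point is only that allocating in the setup step lets a descriptor be reused after completion has freed the array.

```c
#include <assert.h>
#include <stdlib.h>

struct mirror {
	unsigned long bytes_queued;
};

struct pageio_desc {
	struct mirror *mirrors;
	unsigned int mirror_count;
};

/* Initialization no longer allocates anything. */
void desc_init(struct pageio_desc *d)
{
	d->mirrors = NULL;
	d->mirror_count = 0;
}

/* Setup (re)allocates the array whenever it is missing or the wrong
 * size. Returns 0 on success, -1 on allocation failure. */
int desc_setup_mirroring(struct pageio_desc *d, unsigned int count)
{
	assert(count > 0);
	if (d->mirrors && d->mirror_count == count)
		return 0;
	free(d->mirrors);
	d->mirrors = calloc(count, sizeof(*d->mirrors));
	if (!d->mirrors) {
		d->mirror_count = 0;
		return -1;
	}
	d->mirror_count = count;
	return 0;
}

/* Completion frees the array, as the cleanup step does. */
void desc_complete(struct pageio_desc *d)
{
	free(d->mirrors);
	d->mirrors = NULL;
	d->mirror_count = 0;
}
```

A caller can now run setup, complete, and setup again on the same descriptor without calling desc_init() in between.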

18 Aug, 2017 2 commits

Merge branch 'writeback' · b7561e51

Trond Myklebust authored Aug 18, 2017

SUNRPC: Add a separate spinlock to protect the RPC request receive list · ce7c252a

Trond Myklebust authored Aug 16, 2017

This further reduces contention with the transport_lock, and allows us
to convert to using a non-bh-safe spinlock, since the list is now never
accessed from a bh context.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

16 Aug, 2017 4 commits

SUNRPC: Cleanup xs_tcp_read_common() · 040249df

Trond Myklebust authored Aug 13, 2017

Simplify the code to avoid a full copy of the struct xdr_skb_reader.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

SUNRPC: Don't loop forever in xs_tcp_data_receive() · 8d6f97d6

Trond Myklebust authored Aug 12, 2017

Ensure that we don't hog the workqueue thread by requeuing the job
every 64 loops.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
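
A bounded receive loop of this kind might look like the sketch below. The struct rx_queue type and receive_batch() are hypothetical; only the limit of 64 comes from the commit message.

```c
#include <stddef.h>

#define MAX_LOOPS 64	/* batch size named in the commit message */

/* Hypothetical queue: 'pending' counts units of work left. */
struct rx_queue {
	size_t pending;
};

/* Process at most MAX_LOOPS units. Returns 1 if work remains and the
 * caller should requeue the job, 0 if the queue has been drained. */
int receive_batch(struct rx_queue *q)
{
	for (int loops = 0; loops < MAX_LOOPS; loops++) {
		if (q->pending == 0)
			return 0;
		q->pending--;	/* stands in for reading one chunk */
	}
	return q->pending != 0;
}
```

Returning to the caller between batches gives other work items a chance to run on the same worker thread.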

SUNRPC: Don't hold the transport lock when receiving backchannel data · c89091c8

Trond Myklebust authored Aug 16, 2017

The backchannel request has no associated task, so it is going nowhere
until we call xprt_complete_bc_request().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

SUNRPC: Don't hold the transport lock across socket copy operations · 729749bb

Trond Myklebust authored Aug 13, 2017

Instead add a mechanism to ensure that the request doesn't disappear
from underneath us while copying from the socket. We do this by
preventing xprt_release() from freeing the XDR buffers until the
flag RPC_TASK_MSG_RECV has been cleared from the request.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
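
The idea of keeping buffers alive while a receiver is copying into them can be modeled as follows. This is a single-threaded sketch with hypothetical names (struct request, pin(), unpin(), release()); it defers the free until the flag is cleared, whereas a real multi-threaded implementation would need locking and would typically make the releasing side wait.

```c
#include <stdlib.h>

struct request {
	int receiving;		/* set while data is copied into 'buf' */
	int release_pending;	/* release was requested while pinned */
	char *buf;
};

/* Mark the request as in use by the receive path. */
void pin(struct request *r)
{
	r->receiving = 1;
}

/* Free the buffers, unless a receiver still has the request pinned. */
void release(struct request *r)
{
	if (r->receiving) {
		r->release_pending = 1;	/* finish the job in unpin() */
		return;
	}
	free(r->buf);
	r->buf = NULL;
}

/* Clear the flag and complete any release that was held back. */
void unpin(struct request *r)
{
	r->receiving = 0;
	if (r->release_pending) {
		r->release_pending = 0;
		release(r);
	}
}
```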

15 Aug, 2017 12 commits

xprtrdma: Remove imul instructions from chunk list encoders · 6748b0ca

Chuck Lever authored Aug 14, 2017

Re-arrange the pointer arithmetic in the chunk list encoders to
eliminate several more integer multiplication instructions during
Transport Header encoding.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove imul instructions from rpcrdma_convert_iovs() · 28d9d56f

Chuck Lever authored Aug 14, 2017

Re-arrange the pointer arithmetic in rpcrdma_convert_iovs() to
eliminate several integer multiplication instructions during
Transport Header encoding.

Also, array overflow does not occur outside development
environments, so replace overflow checking with one spot check
at the end. This reduces the number of conditional branches in
the common case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFS: Wait for requests that are locked on the commit list · 2ce209c4

Trond Myklebust authored Aug 01, 2017

If a request is on the commit list, but is locked, we will currently skip
it, which can lead to livelocking when the commit count doesn't reduce
to zero.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4/pnfs: Replace pnfs_put_lseg_locked() with pnfs_put_lseg() · 8205b9ce

Trond Myklebust authored Aug 01, 2017

Now that we no longer hold the inode->i_lock when manipulating the
commit lists, it is safe to call pnfs_put_lseg() again.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Switch to using mapping->private_lock for page writeback lookups. · 4b9bb25b

Trond Myklebust authored Aug 01, 2017

Switch from using the inode->i_lock for this to avoid contention with
other metadata manipulation.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Use an atomic_long_t to count the number of commits · 5cb953d4

Trond Myklebust authored Aug 01, 2017

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Use an atomic_long_t to count the number of requests · a6b6d5b8

Trond Myklebust authored Aug 01, 2017

Rather than forcing us to take the inode->i_lock just in order to bump
the number.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
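
In userspace C11 the same pattern looks roughly like this. Note that atomic_long from <stdatomic.h> is only an analogue of the kernel's atomic_long_t, and struct inode_counters and its helpers are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical per-inode counters. */
struct inode_counters {
	atomic_long nrequests;
};

/* Bump the counter without taking any lock. */
void request_added(struct inode_counters *c)
{
	atomic_fetch_add(&c->nrequests, 1);
}

void request_removed(struct inode_counters *c)
{
	atomic_fetch_sub(&c->nrequests, 1);
}

long request_count(struct inode_counters *c)
{
	return atomic_load(&c->nrequests);
}
```

Because each update is a single atomic operation, callers that only need to adjust or read the count no longer contend on a lock that also protects other metadata.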

NFSv4: Use a mutex to protect the per-inode commit lists · e824f99a

Trond Myklebust authored Aug 01, 2017

The commit lists can get very large, so using the inode->i_lock can
end up affecting general metadata performance.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Refactor nfs_page_find_head_request() · b30d2f04

Trond Myklebust authored Aug 01, 2017

Split out the 2 cases so that we can treat the locking differently.
The issue is that the locking in the pageswapcache cache is highly
linked to the commit list locking.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request() · bd37d6fc

Trond Myklebust authored Aug 01, 2017

Hide the locking from nfs_lock_and_join_requests() so that we can
separate out the requirements for swapcache pages.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

NFS: Fix up nfs_page_group_covers_page() · 7e8a30f8

Trond Myklebust authored Jul 17, 2017

Fix up the test in nfs_page_group_covers_page(). The simplest implementation
is to check that we have a set of intersecting or contiguous subrequests
that connect page offset 0 to nfs_page_length(req->wb_page).
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
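
The coverage test described above can be sketched on plain byte ranges. This is not the kernel function: struct range and ranges_cover() are hypothetical, and the ranges are passed as an array rather than walked through a page group.

```c
#include <stddef.h>

struct range {
	unsigned int offset;
	unsigned int len;
};

/* Return 1 if the ranges, taken together, connect offset 0 to
 * 'page_len' with no gap, and 0 otherwise. Ranges may overlap and
 * may be given in any order. */
int ranges_cover(const struct range *r, size_t n, unsigned int page_len)
{
	unsigned int pos = 0;	/* everything in [0, pos) is covered */
	int progressed = 1;

	while (pos < page_len && progressed) {
		progressed = 0;
		for (size_t i = 0; i < n; i++) {
			unsigned int end = r[i].offset + r[i].len;

			/* Extend coverage with any range that touches
			 * or overlaps the covered prefix. */
			if (r[i].offset <= pos && end > pos) {
				pos = end;
				progressed = 1;
			}
		}
	}
	return pos >= page_len;
}
```

The outer loop repeats until a full pass makes no progress, so unordered input is handled at the cost of extra passes.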

NFS: Remove unused parameter from nfs_page_group_lock() · 1344b7ea

Trond Myklebust authored Jul 17, 2017

nfs_page_group_lock() is now always called with the 'nonblock'
parameter set to 'false'.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
