Commits · 7fe5c398fc2186ed586db11106a6692d871d0d58 · Kirill Smelkov / linux

19 Mar, 2009 8 commits

Trond Myklebust authored Mar 19, 2009

Close-to-open cache consistency rules really only require us to flush out
writes on calls to close(), and require us to revalidate attributes on the
very last close of the file.

Currently we appear to be doing a lot of extra attribute revalidation
and cache flushes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

7fe5c398

NFS: Fix the notifications when renaming onto an existing file · b1e4adf4

Trond Myklebust authored Mar 19, 2009

NFS appears to be returning an unnecessary "delete" notification when
we're doing an atomic rename. See

  http://bugzilla.gnome.org/show_bug.cgi?id=575684

The fix is to get rid of the redundant call to d_delete().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

b1e4adf4

NFS: Fix up a mismerged patch · 47c62564

Trond Myklebust authored Mar 16, 2009

Move the definition of nfs_need_commit() into the #ifdef CONFIG_NFS_V3
section as originally intended in the patch "NFS: cleanup - remove
struct nfs_inode->ncommit"
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

47c62564

SVCRDMA: fix recent printk format warnings. · 2e3c230b

Tom Talpey authored Mar 12, 2009

printk formats in prior commit were reversed/incorrect.
Compiled without warning on x86 and x86_64, but detected on ppc.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

2e3c230b

SUNRPC: Ensure we close the socket on EPIPE errors too... · 55420c24

Trond Myklebust authored Mar 11, 2009

As long as one task is holding the socket lock, then calls to
xprt_force_disconnect(xprt) will not succeed in shutting down the socket.
In particular, this would mean that a server initiated shutdown will not
succeed until the lock is relinquished.
In order to avoid the deadlock, we should ensure that xs_tcp_send_request()
closes the socket on EPIPE errors too.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

55420c24

SUNRPC: xs_tcp_connect_worker{4,6}: merge common code · b61d59ff

Trond Myklebust authored Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

b61d59ff

SUNRPC: Add a sysctl to control the duration of the socket linger timeout · 25fe6142

Trond Myklebust authored Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

25fe6142

SUNRPC: Add the equivalent of the linger and linger2 timeouts to RPC sockets · 7d1e8255

Trond Myklebust authored Mar 11, 2009

This fixes a regression against FreeBSD servers as reported by Tomas
Kasparek. Apparently when using RPC over a TCP socket, the FreeBSD servers
don't ever react to the client closing the socket, and so commit
e06799f9 (SUNRPC: Use shutdown() instead of
close() when disconnecting a TCP socket) causes the setup to hang forever
whenever the client attempts to close and then reconnect.

We break the deadlock by adding a 'linger2' style timeout to the socket,
after which, the client will abort the connection using a TCP 'RST'.

The default timeout is set to 15 seconds. A subsequent patch will put it
under user control by means of a sysctl.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

7d1e8255

11 Mar, 2009 32 commits

SUNRPC: Ensure that xs_nospace return values are propagated · 5e3771ce

Trond Myklebust authored Mar 11, 2009

If xs_nospace() finds that the socket has disconnected, it attempts to
return ENOTCONN, however that value is then squashed by the callers.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

5e3771ce

SUNRPC: Delay, then retry on connection errors. · 8a2cec29

Trond Myklebust authored Mar 11, 2009

Enforce the comment in xs_tcp_connect_worker4/xs_tcp_connect_worker6 that
we should delay, then retry on certain connection errors.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

8a2cec29

SUNRPC: Return EAGAIN instead of ENOTCONN when waking up xprt->pending · 2a491991

Trond Myklebust authored Mar 11, 2009

While we should definitely return socket errors to the task that is
currently trying to send data, there is no need to propagate the same error
to all the other tasks on xprt->pending. Doing so actually slows down
recovery, since it causes more than one task to attempt socket recovery.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

2a491991

SUNRPC: Handle socket errors correctly · 482f32e6

Trond Myklebust authored Mar 11, 2009

Ensure that we pick up and handle socket errors as they occur.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

482f32e6

SUNRPC: Handle ECONNREFUSED correctly in xprt_transmit() · c8485e4d

Trond Myklebust authored Mar 11, 2009

If we get an ECONNREFUSED error, we currently go to sleep on the
'xprt->sending' wait queue. The problem is that no timeout is set there,
and there is nothing else that will wake the task up later.

We should deal with ECONNREFUSED in call_status, given that is where we
also deal with -EHOSTDOWN, and friends.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

c8485e4d

SUNRPC: Don't disconnect if a connection is still in progress. · 40d2549d

Trond Myklebust authored Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

40d2549d

SUNRPC: Ensure we set XPRT_CLOSING only after we've sent a tcp FIN... · 670f9457

Trond Myklebust authored Mar 11, 2009

...so that we can distinguish between when we need to shut down and when we
don't. Also remove the call to xs_tcp_shutdown() from xs_tcp_connect(),
since xprt_connect() makes the same test.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

670f9457

SUNRPC: Avoid an unnecessary task reschedule on ENOTCONN · 15f081ca

Trond Myklebust authored Mar 11, 2009

If the socket is unconnected, and xprt_transmit() returns ENOTCONN, we
currently give up the lock on the transport channel. Doing so means that
the lock automatically gets assigned to the next task in the xprt->sending
queue, and so that task needs to be woken up to do the actual connect.

The following patch aims to avoid that unnecessary task switch.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

15f081ca

NFS: load the rpc/rdma transport module automatically · a67d18f8

Tom Talpey authored Mar 11, 2009

When mounting an NFS/RDMA server with the "-o proto=rdma" or
"-o rdma" options, attempt to dynamically load the necessary
"xprtrdma" client transport module. Doing so improves usability,
while avoiding a static module dependency and any unnecessary
resources.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

a67d18f8

SUNRPC: dynamically load RPC transport modules on-demand · 441e3e24

Tom Talpey authored Mar 11, 2009

Provide an api to attempt to load any necessary kernel RPC
client transport module automatically. By convention, the
desired module name is "xprt"+"transport name". For example,
when NFS mounting with "-o proto=rdma", attempt to load the
"xprtrdma" module.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

441e3e24
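
The naming convention described in the commit above ("xprt" + transport name) can be illustrated with a small user-space sketch. The helper name `sketch_xprt_module_name` is hypothetical and is not taken from the patch; in the kernel the resulting name would presumably be handed to the module loader (for example through a printf-style call such as `request_module()`), which this sketch does not attempt to do.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "xprt" + transport name into buf.
 * Returns 0 on success, -1 if the name does not fit in the buffer.
 */
static int sketch_xprt_module_name(char *buf, size_t len,
                                   const char *transport)
{
    int n = snprintf(buf, len, "xprt%s", transport);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a transport name of "rdma", as used by "-o proto=rdma", the helper produces "xprtrdma", which matches the module name quoted in the commit message.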

XPRTRDMA: correct an rpc/rdma inline send marshaling error · b38ab40a

Tom Talpey authored Mar 11, 2009

Certain client rpc's which contain both lengthy page-contained
metadata and a non-empty xdr_tail buffer require careful handling
to avoid overlapped memory copying. Rearranging of existing rpcrdma
marshaling code avoids it; this fixes an NFSv4 symlink creation error
detected with connectathon basic/test8 to multiple servers.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

b38ab40a

SVCRDMA: remove faulty assertions in rpc/rdma chunk validation. · b1e1e158

Tom Talpey authored Mar 11, 2009

Certain client-provided RPCRDMA chunk alignments result in an
additional scatter/gather entry, which triggered nfs/rdma server
assertions incorrectly. OpenSolaris nfs/rdma client connectathon
testing was blocked by these in the special/locking section.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Cc: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

b1e1e158

NFS: Kill the "defined but not used" compile error on nommu machines · e1ebfd33

Trond Myklebust authored Mar 11, 2009

Bryan Wu reports that when compiling NFS on nommu machines he gets a
"defined but not used" error on nfs_file_mmap().

The easiest fix is simply to get rid of the special casing in NFS, and
just always call generic_file_mmap() to set up the file.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

e1ebfd33

NFS: Throttle page dirtying while we're flushing to disk · 72cb77f4

Trond Myklebust authored Mar 11, 2009

The following patch is a combination of a patch by myself and Peter
Staubach.

Trond: If we allow other processes to dirty pages while a process is doing
a consistency sync to disk, we can end up never making progress.

Peter: Attached is a patch which addresses a continuing problem with
the NFS client generating out of order WRITE requests.  While
this is compliant with all of the current protocol
specifications, there are servers in the market which can not
handle out of order WRITE requests very well.  Also, this may
lead to sub-optimal block allocations in the underlying file
system on the server.  This may cause the read throughputs to
be reduced when reading the file from the server.

Peter: There has been a lot of work recently done to address out of
order issues on a systemic level.  However, the NFS client is
still susceptible to the problem.  Out of order WRITE
requests can occur when pdflush is in the middle of writing
out pages while the process dirtying the pages calls
generic_file_buffered_write which calls
generic_perform_write which calls
balance_dirty_pages_rate_limited which ends up calling
writeback_inodes which ends up calling back into the NFS
client to write out dirty pages for the same file that
pdflush happens to be working with.
Signed-off-by: Peter Staubach <staubach@redhat.com>
[modification by Trond to merge the two similar patches]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

72cb77f4

NFS: cleanup - remove struct nfs_inode->ncommit · fb8a1f11

Trond Myklebust authored Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

fb8a1f11

NFSv4: Simplify some cache consistency post-op GETATTRs · a65318bf

Trond Myklebust authored Mar 11, 2009

Certain asynchronous operations such as write() do not expect
(or care) that other metadata such as the file owner, mode, acls, ...
change. All they want to do is update and/or check the change attribute,
ctime, and mtime.
By skipping the file owner and group update, we also avoid having to do a
potential idmapper upcall for these asynchronous RPC calls.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

a65318bf

NFSv4: A referral is assumed to always point to a directory. · 69aaaae1

Trond Myklebust authored Mar 11, 2009

Fix a bug whereby we would fail to create a mount point for a referral.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

69aaaae1

NFSv4: Make decode_getfattr() set fattr->valid to reflect what was decoded · 409924e4

Trond Myklebust authored Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

409924e4

NFSv4: Clean up decode_getfattr() · f26c7a78

Trond Myklebust authored Mar 11, 2009

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

f26c7a78

NFS: Fix the type of struct nfs_fattr->mode · bca79478

Trond Myklebust authored Mar 11, 2009

There is no point in using anything other than umode_t, since we copy the
content pretty much directly into inode->i_mode.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

bca79478

NFS: Shrink the struct nfs_fattr · 1ca277d8

Trond Myklebust authored Mar 11, 2009

We don't need the bitmap[] field anymore, since the 'valid' field tells us
all we need to know about which attributes were filled in...
Also move the pre-op attributes in order to improve the structure packing.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

1ca277d8

NFSv4: Support NFSv4 optional attributes in the struct nfs_fattr · 9e6e70f8

Trond Myklebust authored Mar 11, 2009

Currently, filling struct nfs_fattr is more or less an all or nothing
operation, since NFSv2 and NFSv3 have only mandatory attributes.
In NFSv4, some attributes are optional, and so we may simply not be able to
fill in those fields. Furthermore, NFSv4 allows you to specify which
attributes you are interested in retrieving, thus permitting you to
optimise away retrieval of attributes that you know will not change...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

9e6e70f8
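
The idea in the commit above, that a 'valid' field records which attributes were actually filled in, might look roughly like the sketch below. All names here (`SKETCH_ATTR_*`, `struct sketch_fattr`, `struct sketch_inode`, `sketch_update_inode`) are hypothetical stand-ins, and the real struct nfs_fattr layout and flag names are not reproduced.

```c
/* Hypothetical flag names; the kernel's real constants may differ. */
#define SKETCH_ATTR_MODE  (1U << 0)
#define SKETCH_ATTR_SIZE  (1U << 1)
#define SKETCH_ATTR_MTIME (1U << 2)

/* Attributes as decoded from a reply; only fields flagged in 'valid' count. */
struct sketch_fattr {
    unsigned int valid;
    unsigned int mode;
    unsigned long long size;
    long long mtime;
};

/* Simplified stand-in for the cached inode state. */
struct sketch_inode {
    unsigned int mode;
    unsigned long long size;
    long long mtime;
};

/* Copy only the attributes the server actually returned. */
static void sketch_update_inode(struct sketch_inode *inode,
                                const struct sketch_fattr *fattr)
{
    if (fattr->valid & SKETCH_ATTR_MODE)
        inode->mode = fattr->mode;
    if (fattr->valid & SKETCH_ATTR_SIZE)
        inode->size = fattr->size;
    if (fattr->valid & SKETCH_ATTR_MTIME)
        inode->mtime = fattr->mtime;
}
```

A reply that carries only the size leaves the cached mode and mtime untouched, which is the partial-update behaviour that an all-or-nothing structure cannot express.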

NFSv4: Ignore errors on the post-op attributes in SETATTR calls · 78f945f8

Trond Myklebust authored Mar 11, 2009

There is no need to fail or retry a SETATTR call just because the post-op
GETATTR failed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

78f945f8

NFS: flush cached directory information slightly more readily. · 37d9d76d

NeilBrown authored Mar 11, 2009

If cached directory contents become incorrect, there is no way to
flush the contents.  This contrasts with files where file locking is
the recommended way to ensure cache consistency between multiple
applications (a read-lock always flushes the cache).

Also while changes to files often change the size of the file (thus
triggering a cache flush), changes to directories often do not change
the apparent size (as the size is often rounded to a block size).

So it is particularly important with directories to avoid the
possibility of an incorrect cache wherever possible.

When the link count on a directory changes it implies a change in the
number of child directories, and so a change in the contents of this
directory.  So use that as a trigger to flush cached contents.

When the ctime changes but the mtime does not, there are two possible
reasons.
 1/ The owner/mode information has been changed.
 2/ utimes has been used to set the mtime backwards.

In the first case, a data-cache flush is not required.
In the second case it is.

So on the basis that correctness trumps performance, flush the
directory contents cache in this case also.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

37d9d76d
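
The flush triggers described in the commit above can be summarised as a single predicate. This is a minimal sketch under stated assumptions: the struct and function names are hypothetical, and the mtime check is assumed to be the pre-existing rule rather than something this commit introduces.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical snapshot of the directory attributes the text mentions. */
struct sketch_dir_attr {
    unsigned int nlink;
    time_t ctime;
    time_t mtime;
};

/* Decide whether cached directory contents should be discarded. */
static bool sketch_dir_needs_flush(const struct sketch_dir_attr *cached,
                                   const struct sketch_dir_attr *fresh)
{
    /* A link-count change implies a child directory was added or removed. */
    if (cached->nlink != fresh->nlink)
        return true;
    /* Assumed pre-existing rule: an mtime change means new contents. */
    if (cached->mtime != fresh->mtime)
        return true;
    /*
     * ctime moved but mtime did not: either owner/mode changed (flush not
     * needed) or utimes() set mtime backwards (flush needed). The two
     * cases cannot be told apart, so flush in both.
     */
    if (cached->ctime != fresh->ctime)
        return true;
    return false;
}
```

The last branch trades some unnecessary flushes after a chmod or chown for correctness in the utimes case, which is the trade-off the commit message argues for.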

NFS: Minor __nfs_revalidate_inode cleanup · 2b57dc6c

Suresh Jayaraman authored Mar 11, 2009

Remove redundant NFS_STALE() check, a leftover due to the commit
691beb13
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

2b57dc6c

SUNRPC: Avoid spurious wake-up during UDP connect processing · fe315e76

Chuck Lever authored Mar 11, 2009

To clear out old state, the UDP connect workers unconditionally invoke
xs_close() before proceeding with a new connect. Nowadays this causes
a spurious wake-up of the task waiting for the connect to complete.

This is a little racy, but usually harmless. The waiting task
immediately retries the connect via a call_bind/call_connect sequence,
which usually finds the transport already in the connected state
because the connect worker has finished in the background.

To avoid a spurious wake-up, factor the xs_close() logic that resets
the underlying socket into a helper, and have the UDP connect workers
call that helper instead of xs_close().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

fe315e76

SUNRPC: xprt_connect() don't abort the task if the transport isn't bound · 01d37c42

Trond Myklebust authored Mar 11, 2009

If the transport isn't bound, then we should just return ENOTCONN, letting
call_connect_status() and/or call_status() deal with retrying. Currently,
we appear to abort all pending tasks with an EIO error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

01d37c42

SUNRPC: Fix an Oops due to socket not set up yet... · fba91afb

Trond Myklebust authored Mar 11, 2009

We can Oops in both xs_udp_send_request() and xs_tcp_send_request() if the
call to xs_sendpages() returns an error due to the socket not yet being
set up.
Deal with that situation by returning a new error: ENOTSOCK, so that we
know to avoid dereferencing transport->sock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

fba91afb
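
The pattern in the commit above, returning ENOTSOCK so that the caller knows not to dereference the socket, can be sketched in user space as follows. The types and function names are hypothetical stand-ins for the kernel's, and mapping the error to -ENOTCONN in the caller is an assumption made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's socket and transport types. */
struct sketch_sock {
    int bytes_queued;
};

struct sketch_transport {
    struct sketch_sock *sock;   /* NULL until the socket is set up */
};

/* Send path: report -ENOTSOCK instead of touching a missing socket. */
static int sketch_sendpages(struct sketch_sock *sock, int len)
{
    if (sock == NULL)
        return -ENOTSOCK;
    sock->bytes_queued += len;
    return len;
}

/* Caller: only dereference t->sock when the status shows it exists. */
static int sketch_send_request(struct sketch_transport *t, int len)
{
    int status = sketch_sendpages(t->sock, len);

    if (status == -ENOTSOCK)
        return -ENOTCONN;       /* assumption: report "not connected" */
    if (status < 0)
        return status;
    return t->sock->bytes_queued;   /* safe: the socket is known to exist */
}
```

The dedicated error code lets the caller separate "no socket yet" from errors raised by a live socket, so the post-send bookkeeping that reads the socket is skipped in exactly the case that used to Oops.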

Bug 11061, NFS mounts dropped · d7371c41

Ian Dall authored Mar 10, 2009

Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=11061

sockaddr structures can't be reliably compared using memcmp() because
there are padding bytes in the structure which can't be guaranteed to
be the same even when the sockaddr structures refer to the same
socket. Instead compare all the relevant fields. In the case of IPv6
sin6_flowinfo is not compared because it only affects QoS and
sin6_scope_id is only compared if the address is "link local" because
"link local" addresses need only be unique to a specific link.
Signed-off-by: Ian Dall <ian@beware.dropbear.id.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

d7371c41
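
To illustrate why the commit above compares fields instead of calling memcmp(), here is a minimal user-space sketch. The function names are hypothetical, and including the port in the comparison is an assumption about which fields count as "relevant". It uses the standard sockaddr_in and sockaddr_in6 layouts from <netinet/in.h>.

```c
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* IPv4: compare address and port only; padding bytes are ignored. */
static bool sketch_cmp_in4(const struct sockaddr_in *a,
                           const struct sockaddr_in *b)
{
    return a->sin_addr.s_addr == b->sin_addr.s_addr &&
           a->sin_port == b->sin_port;
}

/* IPv6: ignore sin6_flowinfo; check sin6_scope_id only for link-local. */
static bool sketch_cmp_in6(const struct sockaddr_in6 *a,
                           const struct sockaddr_in6 *b)
{
    if (memcmp(&a->sin6_addr, &b->sin6_addr, sizeof(a->sin6_addr)) != 0)
        return false;
    if (a->sin6_port != b->sin6_port)
        return false;
    if (IN6_IS_ADDR_LINKLOCAL(&a->sin6_addr) &&
        a->sin6_scope_id != b->sin6_scope_id)
        return false;
    return true;
}

/* Dispatch on address family; unknown families never match. */
static bool sketch_sockaddr_equal(const struct sockaddr *a,
                                  const struct sockaddr *b)
{
    if (a->sa_family != b->sa_family)
        return false;
    switch (a->sa_family) {
    case AF_INET:
        return sketch_cmp_in4((const struct sockaddr_in *)a,
                              (const struct sockaddr_in *)b);
    case AF_INET6:
        return sketch_cmp_in6((const struct sockaddr_in6 *)a,
                              (const struct sockaddr_in6 *)b);
    }
    return false;
}
```

Two sockaddr_in structures with the same address and port but different padding bytes compare equal here, while a whole-structure memcmp() would report a mismatch.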

NFS: Handle -ESTALE error in access() · a71ee337

Suresh Jayaraman authored Mar 10, 2009

Hi Trond,

I have been looking at a bug report where opening applications in KDE
on an NFS-mounted home directory fails temporarily. There have been multiple reports on
different kernel versions pointing to this common issue:
http://bugzilla.kernel.org/show_bug.cgi?id=12557
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html

This issue can be reproduced consistently by doing the following on an
NFS-mounted home (KDE):
1. Open 2 xterm sessions
2. From one of the xterm sessions, do "ssh -X <remote host>"
3. "stat ~/.Xauthority" on the remote SSH session
4. Close the two xterm sessions
5. On the server do a "stat ~/.Xauthority"
6. Now on the client, try to open xterm
This will fail.

Even if the filehandle has become stale, the NFS client should invalidate
the cache/inode and repeat the LOOKUP. The packet capture taken when the
failure occurs shows two successive ACCESS() calls with the same
filehandle, both of which fail with an -ESTALE error.

I have tested the fix below. Now the client issues a LOOKUP after the
ACCESS() call fails with -ESTALE. If all this makes sense to you, can you
consider this for inclusion?

Thanks,


If the server returns an -ESTALE error due to a stale filehandle in response to
an ACCESS() call, we need to invalidate the cache and inode so that LOOKUP()
can be retried. Without this change, the NFS client retries ACCESS() with the
same filehandle, fails again, and can cause temporary failures of
applications running on an NFS-mounted home.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

a71ee337

NLM: Fix GRANT callback address comparison when IPv6 is enabled · 57df675c

Chuck Lever authored Mar 10, 2009

The NFS mount command may pass an AF_INET server address to lockd.  If
lockd happens to be using a PF_INET6 listener, the nlm_cmp_addr() in
nlmclnt_grant() will fail to match requests from that host because they
will all have a mapped IPv4 AF_INET6 address.

Adopt the same solution used in nfs_sockaddr_match_ipaddr() for NFSv4
callbacks: if either address is AF_INET, map it to an AF_INET6 address
before doing the comparison.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

57df675c
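
The approach described in the commit above, mapping an AF_INET address to its IPv4-mapped AF_INET6 form before comparing, can be sketched in user space as follows. The function names are hypothetical and this is not the kernel's nlm_cmp_addr(); it only shows the mapping (::ffff:a.b.c.d) and the comparison that follows it.

```c
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Build the IPv4-mapped IPv6 address ::ffff:a.b.c.d from an IPv4 address. */
static void sketch_map_v4_to_v6(const struct in_addr *v4, struct in6_addr *v6)
{
    memset(v6, 0, sizeof(*v6));
    v6->s6_addr[10] = 0xff;
    v6->s6_addr[11] = 0xff;
    memcpy(&v6->s6_addr[12], &v4->s_addr, sizeof(v4->s_addr));
}

/* Express either address family as an IPv6 address; false if unsupported. */
static bool sketch_addr_as_v6(const struct sockaddr *sa, struct in6_addr *out)
{
    switch (sa->sa_family) {
    case AF_INET:
        sketch_map_v4_to_v6(&((const struct sockaddr_in *)sa)->sin_addr, out);
        return true;
    case AF_INET6:
        *out = ((const struct sockaddr_in6 *)sa)->sin6_addr;
        return true;
    }
    return false;
}

/* Compare two addresses after bringing both into IPv6 form. */
static bool sketch_cmp_addr_mapped(const struct sockaddr *a,
                                   const struct sockaddr *b)
{
    struct in6_addr a6, b6;

    if (!sketch_addr_as_v6(a, &a6) || !sketch_addr_as_v6(b, &b6))
        return false;
    return memcmp(&a6, &b6, sizeof(a6)) == 0;
}
```

With this mapping, an AF_INET address passed in by the mount command matches the mapped IPv4 AF_INET6 address seen on a PF_INET6 listener, which is the mismatch the commit message describes.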

NLM: Shrink the IPv4-only version of nlm_cmp_addr() · 78851e1a

Chuck Lever authored Mar 10, 2009

Clean up/micro-optimization:  Make the AF_INET-only version of
nlm_cmp_addr() smaller.  This matches the style of
nlm_privileged_requester(), and makes the AF_INET-only version of
nlm_cmp_addr() nearly the same size as it was before IPv6 support.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

78851e1a