1. 09 Feb, 2022 4 commits
    • Chuck Lever's avatar
      NFSD: Clamp WRITE offsets · 6260d9a5
      Chuck Lever authored
      Ensure that a client cannot specify a WRITE range that falls in a
      byte range outside what the kernel's internal types (such as loff_t,
      which is signed) can represent. The kiocb iterators, invoked in
      nfsd_vfs_write(), should properly limit write operations to within
      the underlying file system's s_maxbytes.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6260d9a5
    • Chuck Lever's avatar
      NFSD: Fix NFSv3 SETATTR/CREATE's handling of large file sizes · a648fdeb
      Chuck Lever authored
      iattr::ia_size is a loff_t, so these NFSv3 procedures must be
      careful to deal with incoming client size values that are larger
      than s64_max without corrupting the value.
      
      Silently capping the value results in storing a different value
      than the client passed in which is unexpected behavior, so remove
      the min_t() check in decode_sattr3().
      
      Note that RFC 1813 permits only the WRITE procedure to return
      NFS3ERR_FBIG. We believe that NFSv3 reference implementations
      also return NFS3ERR_FBIG when ia_size is too large.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      a648fdeb
    • Chuck Lever's avatar
      NFSD: Fix ia_size underflow · e6faac3f
      Chuck Lever authored
      iattr::ia_size is a loff_t, which is a signed 64-bit type. NFSv3 and
      NFSv4 both define file size as an unsigned 64-bit type. Thus there
      is a range of valid file size values an NFS client can send that is
      already larger than Linux can handle.
      
      Currently decode_fattr4() dumps a full u64 value into ia_size. If
      that value happens to be larger than S64_MAX, then ia_size
      underflows. I'm about to fix up the NFSv3 behavior as well, so let's
      catch the underflow in the common code path: nfsd_setattr().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      e6faac3f
    • Chuck Lever's avatar
      NFSD: Fix the behavior of READ near OFFSET_MAX · 0cb4d23a
      Chuck Lever authored
      Dan Aloni reports:
      > Due to commit 8cfb9015 ("NFS: Always provide aligned buffers to
      > the RPC read layers") on the client, a read of 0xfff is aligned up
      > to server rsize of 0x1000.
      >
      > As a result, in a test where the server has a file of size
      > 0x7fffffffffffffff, and the client tries to read from the offset
      > 0x7ffffffffffff000, the read causes loff_t overflow in the server
      > and it returns an NFS code of EINVAL to the client. The client as
      > a result indefinitely retries the request.
      
      The Linux NFS client does not handle NFS?ERR_INVAL, even though all
      NFS specifications permit servers to return that status code for a
      READ.
      
      Instead of NFS?ERR_INVAL, have out-of-range READ requests succeed
      and return a short result. Set the EOF flag in the result to prevent
      the client from retrying the READ request. This behavior appears to
      be consistent with Solaris NFS servers.
      
      Note that NFSv3 and NFSv4 use u64 offset values on the wire. These
      must be converted to loff_t internally before use -- an implicit
      type cast is not adequate for this purpose. Otherwise VFS checks
      against sb->s_maxbytes do not work properly.
      Reported-by: default avatarDan Aloni <dan.aloni@vastdata.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      0cb4d23a
  2. 28 Jan, 2022 1 commit
    • Dai Ngo's avatar
      nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client. · ab451ea9
      Dai Ngo authored
      From RFC 7530 Section 16.34.5:
      
      o  The server has not recorded an unconfirmed { v, x, c, *, * } and
         has recorded a confirmed { v, x, c, *, s }.  If the principals of
         the record and of SETCLIENTID_CONFIRM do not match, the server
         returns NFS4ERR_CLID_INUSE without removing any relevant leased
         client state, and without changing recorded callback and
         callback_ident values for client { x }.
      
      The current code intends to do what the spec describes above but
      it forgot to set 'old' to NULL resulting to the confirmed client
      to be expired.
      
      Fixes: 2b634821 ("nfsd: fix clid_inuse on mount with security change")
      Signed-off-by: default avatarDai Ngo <dai.ngo@oracle.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: default avatarBruce Fields <bfields@fieldses.org>
      ab451ea9
  3. 27 Jan, 2022 1 commit
    • J. Bruce Fields's avatar
      lockd: fix failure to cleanup client locks · d19a7af7
      J. Bruce Fields authored
      In my testing, we're sometimes hitting the request->fl_flags & FL_EXISTS
      case in posix_lock_inode, presumably just by random luck since we're not
      actually initializing fl_flags here.
      
      This probably didn't matter before commit 7f024fcd ("Keep read and
      write fds with each nlm_file") since we wouldn't previously unlock
      unless we knew there were locks.
      
      But now it causes lockd to give up on removing more locks.
      
      We could just initialize fl_flags, but really it seems dubious to be
      calling vfs_lock_file with random values in some of the fields.
      
      Fixes: 7f024fcd ("Keep read and write fds with each nlm_file")
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      [ cel: fixed checkpatch.pl nit ]
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      d19a7af7
  4. 18 Jan, 2022 1 commit
  5. 10 Jan, 2022 2 commits
    • Chuck Lever's avatar
      SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points · 16720861
      Chuck Lever authored
      Avoid potentially hazardous memory copying and the needless use of
      "%pIS" -- in the kernel, an RPC service listener is always bound to
      ANYADDR. Having the network namespace is helpful when recording
      errors, though.
      
      Fixes: a0469f46 ("SUNRPC: Replace dprintk call sites in TCP state change callouts")
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      16720861
    • Chuck Lever's avatar
      SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point · dc6c6fb3
      Chuck Lever authored
      While testing, I got an unexpected KASAN splat:
      
      Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_svc_xprt_create_err+0x190/0x210 [sunrpc]
      Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: Read of size 28 at addr ffffc9000008f728 by task mount.nfs/4628
      
      The memcpy() in the TP_fast_assign section of this trace point
      copies the size of the destination buffer in order that the buffer
      won't be overrun.
      
      In other similar trace points, the source buffer for this memcpy is
      a "struct sockaddr_storage" so the actual length of the source
      buffer is always long enough to prevent the memcpy from reading
      uninitialized or unallocated memory.
      
      However, for this trace point, the source buffer can be as small as
      a "struct sockaddr_in". For AF_INET sockaddrs, the memcpy() reads
      memory that follows the source buffer, which is not always valid
      memory.
      
      To avoid copying past the end of the passed-in sockaddr, make the
      source address's length available to the memcpy(). It would be a
      little nicer if the tracing infrastructure was more friendly about
      storing socket addresses that are not AF_INET, but I could not find
      a way to make printk("%pIS") work with a dynamic array.
      
      Reported-by: KASAN
      Fixes: 4b8f380e ("SUNRPC: Tracepoint to record errors in svc_xpo_create()")
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      dc6c6fb3
  6. 09 Jan, 2022 1 commit
    • Arnd Bergmann's avatar
      fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes · 0ea9fc15
      Arnd Bergmann authored
      My patch to rework oabi fcntl64() introduced a harmless
      sparse warning when file locking is disabled:
      
         arch/arm/kernel/sys_oabi-compat.c:251:51: sparse: sparse: incorrect type in argument 3 (different address spaces) @@     expected struct flock64 [noderef] __user *user @@     got struct flock64 * @@
         arch/arm/kernel/sys_oabi-compat.c:251:51: sparse:     expected struct flock64 [noderef] __user *user
         arch/arm/kernel/sys_oabi-compat.c:251:51: sparse:     got struct flock64 *
         arch/arm/kernel/sys_oabi-compat.c:265:55: sparse: sparse: incorrect type in argument 4 (different address spaces) @@     expected struct flock64 [noderef] __user *user @@     got struct flock64 * @@
         arch/arm/kernel/sys_oabi-compat.c:265:55: sparse:     expected struct flock64 [noderef] __user *user
         arch/arm/kernel/sys_oabi-compat.c:265:55: sparse:     got struct flock64 *
      
      When file locking is enabled, everything works correctly and the
      right data gets passed, but the stub declarations in linux/fs.h
      did not get modified when the calling conventions changed in an
      earlier patch.
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Fixes: 7e2d8c29 ("ARM: 9111/1: oabi-compat: rework fcntl64() emulation")
      Fixes: a75d30c7 ("fs/locks: pass kernel struct flock to fcntl_getlk/setlk")
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      0ea9fc15
  7. 08 Jan, 2022 23 commits
    • J. Bruce Fields's avatar
      nfsd: fix crash on COPY_NOTIFY with special stateid · 074b07d9
      J. Bruce Fields authored
      RTM says "If the special ONE stateid is passed to
      nfs4_preprocess_stateid_op(), it returns status=0 but does not set
      *cstid. nfsd4_copy_notify() depends on stid being set if status=0, and
      thus can crash if the client sends the right COPY_NOTIFY RPC."
      
      RFC 7862 says "The cna_src_stateid MUST refer to either open or locking
      states provided earlier by the server.  If it is invalid, then the
      operation MUST fail."
      
      The RFC doesn't specify an error, and the choice doesn't matter much as
      this is clearly illegal client behavior, but bad_stateid seems
      reasonable.
      
      Simplest is just to guarantee that nfs4_preprocess_stateid_op, called
      with non-NULL cstid, errors out if it can't return a stateid.
      
      Reported-by: rtm@csail.mit.edu
      Fixes: 624322f1 ("NFSD add COPY_NOTIFY operation")
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: default avatarOlga Kornievskaia <kolga@netapp.com>
      Tested-by: default avatarOlga Kornievskaia <kolga@netapp.com>
      074b07d9
    • J. Bruce Fields's avatar
      MAINTAINERS: remove bfields · 7f4f5d70
      J. Bruce Fields authored
      I'm cutting back on my responsibilities.  The NFS server and file
      locking code are in good hands.
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      7f4f5d70
    • Chuck Lever's avatar
      NFSD: Move fill_pre_wcc() and fill_post_wcc() · fcb5e3fa
      Chuck Lever authored
      These functions are related to file handle processing and have
      nothing to do with XDR encoding or decoding. Also they are no longer
      NFSv3-specific. As a clean-up, move their definitions to a more
      appropriate location. WCC is also an NFSv3-specific term, so rename
      them as general-purpose helpers.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      fcb5e3fa
    • Chuck Lever's avatar
      Revert "nfsd: skip some unnecessary stats in the v4 case" · 58f258f6
      Chuck Lever authored
      On the wire, I observed NFSv4 OPEN(CREATE) operations sometimes
      returning a reasonable-looking value in the cinfo.before field and
      zero in the cinfo.after field.
      
      RFC 8881 Section 10.8.1 says:
      > When a client is making changes to a given directory, it needs to
      > determine whether there have been changes made to the directory by
      > other clients.  It does this by using the change attribute as
      > reported before and after the directory operation in the associated
      > change_info4 value returned for the operation.
      
      and
      
      > ... The post-operation change
      > value needs to be saved as the basis for future change_info4
      > comparisons.
      
      A good quality client implementation therefore saves the zero
      cinfo.after value. During a subsequent OPEN operation, it will
      receive a different non-zero value in the cinfo.before field for
      that directory, and it will incorrectly believe the directory has
      changed, triggering an undesirable directory cache invalidation.
      
      There are filesystem types where fs_supports_change_attribute()
      returns false, tmpfs being one. On NFSv4 mounts, this means the
      fh_getattr() call site in fill_pre_wcc() and fill_post_wcc() is
      never invoked. Subsequently, nfsd4_change_attribute() is invoked
      with an uninitialized @stat argument.
      
      In fill_pre_wcc(), @stat contains stale stack garbage, which is
      then placed on the wire. In fill_post_wcc(), ->fh_post_wc is all
      zeroes, so zero is placed on the wire. Both of these values are
      meaningless.
      
      This fix can be applied immediately to stable kernels. Once there
      are more regression tests in this area, this optimization can be
      attempted again.
      
      Fixes: 428a23d2 ("nfsd: skip some unnecessary stats in the v4 case")
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      58f258f6
    • Chuck Lever's avatar
      NFSD: Trace boot verifier resets · 75acacb6
      Chuck Lever authored
      According to commit bbf2f098 ("nfsd: Reset the boot verifier on
      all write I/O errors"), the Linux NFS server forces all clients to
      resend pending unstable writes if any server-side write or commit
      operation encounters an error (say, ENOSPC). This is a rare and
      quite exceptional event that could require administrative recovery
      action, so it should be made trace-able. Example trace event:
      
      nfsd-938   [002]  7174.945558: nfsd_writeverf_reset: boot_time=        61cc920d xid=0xdcd62036 error=-28 new verifier=0x08aecc6142515904
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      75acacb6
    • Chuck Lever's avatar
      NFSD: Rename boot verifier functions · 3988a578
      Chuck Lever authored
      Clean up: These functions handle what the specs call a write
      verifier, which in the Linux NFS server implementation is now
      divorced from the server's boot instance
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      3988a578
    • Chuck Lever's avatar
      NFSD: Clean up the nfsd_net::nfssvc_boot field · 91d2e9b5
      Chuck Lever authored
      There are two boot-time fields in struct nfsd_net: one called
      boot_time and one called nfssvc_boot. The latter is used only to
      form write verifiers, but its documenting comment declares:
      
              /* Time of server startup */
      
      Since commit 27c438f5 ("nfsd: Support the server resetting the
      boot verifier"), this field can be reset at any time; it's no
      longer tied to server restart. So that comment is stale.
      
      Also, according to pahole, struct timespec64 is 16 bytes long on
      x86_64. The nfssvc_boot field is used only to form a write verifier,
      which is 8 bytes long.
      
      Let's clarify this situation by manufacturing an 8-byte verifier
      in nfs_reset_boot_verifier() and storing only that in struct
      nfsd_net.
      
      We're grabbing 128 bits of time, so compress all of those into a
      64-bit verifier instead of throwing out the high-order bits.
      In the future, the siphash_key can be re-used for other hashed
      objects per-nfsd_net.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      91d2e9b5
    • Chuck Lever's avatar
      NFSD: Write verifier might go backwards · cdc55660
      Chuck Lever authored
      When vfs_iter_write() starts to fail because a file system is full,
      a bunch of writes can fail at once with ENOSPC. These writes
      repeatedly invoke nfsd_reset_boot_verifier() in quick succession.
      
      Ensure that the time it grabs doesn't go backwards due to an ntp
      adjustment going on at the same time.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      cdc55660
    • Trond Myklebust's avatar
      nfsd: Add a tracepoint for errors in nfsd4_clone_file_range() · a2f4c3fa
      Trond Myklebust authored
      Since a clone error commit can cause the boot verifier to change,
      we should trace those errors.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      [ cel: Addressed a checkpatch.pl splat in fs/nfsd/vfs.h ]
      a2f4c3fa
    • Chuck Lever's avatar
      NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id) · 2c445a0e
      Chuck Lever authored
      Since this pointer is used repeatedly, move it to a stack variable.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      2c445a0e
    • Chuck Lever's avatar
      NFSD: De-duplicate net_generic(SVC_NET(rqstp), nfsd_net_id) · fb7622c2
      Chuck Lever authored
      Since this pointer is used repeatedly, move it to a stack variable.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      fb7622c2
    • Chuck Lever's avatar
      NFSD: Clean up nfsd_vfs_write() · 33388b3a
      Chuck Lever authored
      The RWF_SYNC and !RWF_SYNC arms are now exactly alike except that
      the RWF_SYNC arm resets the boot verifier twice in a row. Fix that
      redundancy and de-duplicate the code.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      33388b3a
    • Trond Myklebust's avatar
      nfsd: Replace use of rwsem with errseq_t · 555dbf1a
      Trond Myklebust authored
      The nfsd_file nf_rwsem is currently being used to separate file write
      and commit instances to ensure that we catch errors and apply them to
      the correct write/commit.
      We can improve scalability at the expense of a little accuracy (some
      extra false positives) by replacing the nf_rwsem with more careful
      use of the errseq_t mechanism to track errors across the different
      operations.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      [ cel: rebased on zero-verifier fix ]
      555dbf1a
    • Chuck Lever's avatar
      NFSD: Fix verifier returned in stable WRITEs · f11ad7aa
      Chuck Lever authored
      RFC 8881 explains the purpose of the write verifier this way:
      
      > The final portion of the result is the field writeverf. This field
      > is the write verifier and is a cookie that the client can use to
      > determine whether a server has changed instance state (e.g., server
      > restart) between a call to WRITE and a subsequent call to either
      > WRITE or COMMIT.
      
      But then it says:
      
      > This cookie MUST be unchanged during a single instance of the
      > NFSv4.1 server and MUST be unique between instances of the NFSv4.1
      > server. If the cookie changes, then the client MUST assume that
      > any data written with an UNSTABLE4 value for committed and an old
      > writeverf in the reply has been lost and will need to be
      > recovered.
      
      RFC 1813 has similar language for NFSv3. NFSv2 does not have a write
      verifier since it doesn't implement the COMMIT procedure.
      
      Since commit 19e0663f ("nfsd: Ensure sampling of the write
      verifier is atomic with the write"), the Linux NFS server has
      returned a boot-time-based verifier for UNSTABLE WRITEs, but a zero
      verifier for FILE_SYNC and DATA_SYNC WRITEs. FILE_SYNC and DATA_SYNC
      WRITEs are not followed up with a COMMIT, so there's no need for
      clients to compare verifiers for stable writes.
      
      However, by returning a different verifier for stable and unstable
      writes, the above commit puts the Linux NFS server a step farther
      out of compliance with the first MUST above. At least one NFS client
      (FreeBSD) noticed the difference, making this a potential
      regression.
      Reported-by: default avatarRick Macklem <rmacklem@uoguelph.ca>
      Link: https://lore.kernel.org/linux-nfs/YQXPR0101MB096857EEACF04A6DF1FC6D9BDD749@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM/T/
      Fixes: 19e0663f ("nfsd: Ensure sampling of the write verifier is atomic with the write")
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      f11ad7aa
    • Jeff Layton's avatar
      nfsd: Retry once in nfsd_open on an -EOPENSTALE return · 12bcbd40
      Jeff Layton authored
      If we get back -EOPENSTALE from an NFSv4 open, then we either got some
      unhandled error or the inode we got back was not the same as the one
      associated with the dentry.
      
      We really have no recourse in that situation other than to retry the
      open, and if it fails to just return nfserr_stale back to the client.
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarLance Shelton <lance.shelton@hammerspace.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      12bcbd40
    • Jeff Layton's avatar
      nfsd: Add errno mapping for EREMOTEIO · a2694e51
      Jeff Layton authored
      The NFS client can occasionally return EREMOTEIO when signalling issues
      with the server.  ...map to NFSERR_IO.
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: default avatarLance Shelton <lance.shelton@hammerspace.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      a2694e51
    • Peng Tao's avatar
      nfsd: map EBADF · b3d0db70
      Peng Tao authored
      Now that we have open file cache, it is possible that another client
      deletes the file and DP will not know about it. Then IO to MDS would
      fail with BADSTATEID and knfsd would start state recovery, which
      should fail as well and then nfs read/write will fail with EBADF.
      And it triggers a WARN() in nfserrno().
      
      -----------[ cut here ]------------
      WARNING: CPU: 0 PID: 13529 at fs/nfsd/nfsproc.c:758 nfserrno+0x58/0x70 [nfsd]()
      nfsd: non-standard errno: -9
      modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_connt
      pata_acpi floppy
      CPU: 0 PID: 13529 Comm: nfsd Tainted: G        W       4.1.5-00307-g6e6579b #7
      Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
       0000000000000000 00000000464e6c9c ffff88079085fba8 ffffffff81789936
       0000000000000000 ffff88079085fc00 ffff88079085fbe8 ffffffff810a08ea
       ffff88079085fbe8 ffff88080f45c900 ffff88080f627d50 ffff880790c46a48
       all Trace:
       [<ffffffff81789936>] dump_stack+0x45/0x57
       [<ffffffff810a08ea>] warn_slowpath_common+0x8a/0xc0
       [<ffffffff810a0975>] warn_slowpath_fmt+0x55/0x70
       [<ffffffff81252908>] ? splice_direct_to_actor+0x148/0x230
       [<ffffffffa02fb8c0>] ? fsid_source+0x60/0x60 [nfsd]
       [<ffffffffa02f9918>] nfserrno+0x58/0x70 [nfsd]
       [<ffffffffa02fba57>] nfsd_finish_read+0x97/0xb0 [nfsd]
       [<ffffffffa02fc7a6>] nfsd_splice_read+0x76/0xa0 [nfsd]
       [<ffffffffa02fcca1>] nfsd_read+0xc1/0xd0 [nfsd]
       [<ffffffffa0233af2>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
       [<ffffffffa03073da>] nfsd3_proc_read+0xba/0x150 [nfsd]
       [<ffffffffa02f7a03>] nfsd_dispatch+0xc3/0x210 [nfsd]
       [<ffffffffa0233af2>] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
       [<ffffffffa0232913>] svc_process_common+0x453/0x6f0 [sunrpc]
       [<ffffffffa0232cc3>] svc_process+0x113/0x1b0 [sunrpc]
       [<ffffffffa02f740f>] nfsd+0xff/0x170 [nfsd]
       [<ffffffffa02f7310>] ? nfsd_destroy+0x80/0x80 [nfsd]
       [<ffffffff810bf3a8>] kthread+0xd8/0xf0
       [<ffffffff810bf2d0>] ? kthread_create_on_node+0x1b0/0x1b0
       [<ffffffff817912a2>] ret_from_fork+0x42/0x70
       [<ffffffff810bf2d0>] ? kthread_create_on_node+0x1b0/0x1b0
      Signed-off-by: default avatarPeng Tao <tao.peng@primarydata.com>
      Signed-off-by: default avatarLance Shelton <lance.shelton@hammerspace.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      b3d0db70
    • Chuck Lever's avatar
      NFSD: Fix zero-length NFSv3 WRITEs · 6a2f7744
      Chuck Lever authored
      The Linux NFS server currently responds to a zero-length NFSv3 WRITE
      request with NFS3ERR_IO. It responds to a zero-length NFSv4 WRITE
      with NFS4_OK and count of zero.
      
      RFC 1813 says of the WRITE procedure's @count argument:
      
      count
               The number of bytes of data to be written. If count is
               0, the WRITE will succeed and return a count of 0,
               barring errors due to permissions checking.
      
      RFC 8881 has similar language for NFSv4, though NFSv4 removed the
      explicit @count argument because that value is already contained in
      the opaque payload array.
      
      The synthetic client pynfs's WRT4 and WRT15 tests do emit zero-
      length WRITEs to exercise this spec requirement. Commit fdec6114
      ("nfsd4: zero-length WRITE should succeed") addressed the same
      problem there with the same fix.
      
      But interestingly the Linux NFS client does not appear to emit zero-
      length WRITEs, instead squelching them. I'm not aware of a test that
      can generate such WRITEs for NFSv3, so I wrote a naive C program to
      generate a zero-length WRITE and test this fix.
      
      Fixes: 8154ef27 ("NFSD: Clean up legacy NFS WRITE argument XDR decoders")
      Reported-by: default avatarTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6a2f7744
    • Vasily Averin's avatar
      nfsd4: add refcount for nfsd4_blocked_lock · 47446d74
      Vasily Averin authored
      nbl allocated in nfsd4_lock can be released by a several ways:
      directly in nfsd4_lock(), via nfs4_laundromat(), via another nfs
      command RELEASE_LOCKOWNER or via nfsd4_callback.
      This structure should be refcounted to be used and released correctly
      in all these cases.
      
      Refcount is initialized to 1 during allocation and is incremented
      when nbl is added into nbl_list/nbl_lru lists.
      
      Usually nbl is linked into both lists together, so only one refcount
      is used for both lists.
      
      However nfsd4_lock() should keep in mind that nbl can be present
      in one of lists only. This can happen if nbl was handled already
      by nfs4_laundromat/nfsd4_callback/etc.
      
      Refcount is decremented if vfs_lock_file() returns FILE_LOCK_DEFERRED,
      because nbl can be handled already by nfs4_laundromat/nfsd4_callback/etc.
      
      Refcount is not changed in find_blocked_lock() because of it reuses counter
      released after removing nbl from lists.
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      47446d74
    • J. Bruce Fields's avatar
      nfs: block notification on fs with its own ->lock · 40595cdc
      J. Bruce Fields authored
      NFSv4.1 supports an optional lock notification feature which notifies
      the client when a lock comes available.  (Normally NFSv4 clients just
      poll for locks if necessary.)  To make that work, we need to request a
      blocking lock from the filesystem.
      
      We turned that off for NFS in commit f657f8ee ("nfs: don't atempt
      blocking locks on nfs reexports") [sic] because it actually blocks the
      nfsd thread while waiting for the lock.
      
      Thanks to Vasily Averin for pointing out that NFS isn't the only
      filesystem with that problem.
      
      Any filesystem that leaves ->lock NULL will use posix_lock_file(), which
      does the right thing.  Simplest is just to assume that any filesystem
      that defines its own ->lock is not safe to request a blocking lock from.
      
      So, this patch mostly reverts commit f657f8ee ("nfs: don't atempt
      blocking locks on nfs reexports") [sic] and commit b840be2f ("lockd:
      don't attempt blocking locks on nfs reexports"), and instead uses a
      check of ->lock (Vasily's suggestion) to decide whether to support
      blocking lock notifications on a given filesystem.  Also add a little
      documentation.
      
      Perhaps someday we could add back an export flag later to allow
      filesystems with "good" ->lock methods to support blocking lock
      notifications.
      Reported-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      [ cel: Description rewritten to address checkpatch nits ]
      [ cel: Fixed warning when SUNRPC debugging is disabled ]
      [ cel: Fixed NULL check ]
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: default avatarVasily Averin <vvs@virtuozzo.com>
      40595cdc
    • Chuck Lever's avatar
      NFSD: De-duplicate nfsd4_decode_bitmap4() · cd2e999c
      Chuck Lever authored
      Clean up. Trond points out that xdr_stream_decode_uint32_array()
      does the same thing as nfsd4_decode_bitmap4().
      Suggested-by: default avatarTrond Myklebust <trondmy@hammerspace.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      cd2e999c
    • J. Bruce Fields's avatar
      nfsd: improve stateid access bitmask documentation · 3dcd1d8a
      J. Bruce Fields authored
      The use of the bitmaps is confusing.  Add a cross-reference to make it
      easier to find the existing comment.  Add an updated reference with URL
      to make it quicker to look up.  And a bit more editorializing about the
      value of this.
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      3dcd1d8a
    • Chuck Lever's avatar
      NFSD: Combine XDR error tracepoints · 70e94d75
      Chuck Lever authored
      Clean up: The garbage_args and cant_encode tracepoints report the
      same information as each other, so combine them into a single
      tracepoint class to reduce code duplication and slightly reduce the
      size of trace.o.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      70e94d75
  8. 13 Dec, 2021 7 commits
    • Chuck Lever's avatar
      SUNRPC: Remove low signal-to-noise tracepoints · 5089f3d9
      Chuck Lever authored
      I'm about to add more information to the server-side SUNRPC
      tracepoints, so I'm going to offset the increased trace log
      consumption by getting rid of some tracepoints that fire frequently
      but don't offer much value.
      
      trace_svc_xprt_received() was useful for debugging, perhaps, but
      is not generally informative.
      
      trace_svc_handle_xprt() reports largely the same information as
      trace_svc_xdr_recvfrom().
      
      As a clean-up, rename trace_svc_xprt_do_enqueue() to match
      svc_xprt_dequeue().
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      5089f3d9
    • NeilBrown's avatar
      NFSD: simplify per-net file cache management · 1463b38e
      NeilBrown authored
      We currently have a 'laundrette' for closing cached files - a different
      work-item for each network-namespace.
      
      These 'laundrettes' (aka struct nfsd_fcache_disposal) are currently on a
      list, and are freed using rcu.
      
      The list is not necessary as we have a per-namespace structure (struct
      nfsd_net) which can hold a link to the nfsd_fcache_disposal.
      The use of kfree_rcu is also unnecessary as the cache is cleaned of all
      files associated with a given namespace, and no new files can be added,
      before the nfsd_fcache_disposal is freed.
      
      So add a '->fcache_disposal' link to nfsd_net, and discard the list
      management and rcu usage.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      1463b38e
    • Jiapeng Chong's avatar
      NFSD: Fix inconsistent indenting · 1e37d0e5
      Jiapeng Chong authored
      Eliminate the follow smatch warning:
      
      fs/nfsd/nfs4xdr.c:4766 nfsd4_encode_read_plus_hole() warn: inconsistent
      indenting.
      Reported-by: default avatarAbaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: default avatarJiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      1e37d0e5
    • Chuck Lever's avatar
      NFSD: Remove be32_to_cpu() from DRC hash function · 7578b2f6
      Chuck Lever authored
      Commit 7142b98d ("nfsd: Clean up drc cache in preparation for
      global spinlock elimination"), billed as a clean-up, added
      be32_to_cpu() to the DRC hash function without explanation. That
      commit removed two comments that state that byte-swapping in the
      hash function is unnecessary without explaining whether there was
      a need for that change.
      
      On some Intel CPUs, the swab32 instruction is known to cause a CPU
      pipeline stall. be32_to_cpu() does not add extra randomness, since
      the hash multiplication is done /before/ shifting to the high-order
      bits of the result.
      
      As a micro-optimization, remove the unnecessary transform from the
      DRC hash function.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      7578b2f6
    • NeilBrown's avatar
      NFS: switch the callback service back to non-pooled. · 23a1a573
      NeilBrown authored
      Now that thread management is consistent there is no need for
      nfs-callback to use svc_create_pooled() as introduced in Commit
      df807fff ("NFSv4.x/callback: Create the callback service through
      svc_create_pooled").  So switch back to svc_create().
      
      If service pools were configured, but the number of threads were left at
      '1', nfs callback may not work reliably when svc_create_pooled() is used.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      23a1a573
    • NeilBrown's avatar
      lockd: use svc_set_num_threads() for thread start and stop · 6b044fba
      NeilBrown authored
      svc_set_num_threads() does everything that lockd_start_svc() does, except
      set sv_maxconn.  It also (when passed 0) finds the threads and
      stops them with kthread_stop().
      
      So move the setting for sv_maxconn, and use svc_set_num_thread()
      
      We now don't need nlmsvc_task.
      
      Now that we use svc_set_num_threads() it makes sense to set svo_module.
      This request that the thread exists with module_put_and_exit().
      Also fix the documentation for svo_module to make this explicit.
      
      svc_prepare_thread is now only used where it is defined, so it can be
      made static.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6b044fba
    • NeilBrown's avatar
      SUNRPC: always treat sv_nrpools==1 as "not pooled" · 93aa619e
      NeilBrown authored
      Currently 'pooled' services hold a reference on the pool_map, and
      'unpooled' services do not.
      svc_destroy() uses the presence of ->svo_function (via
      svc_serv_is_pooled()) to determine if the reference should be dropped.
      There is no direct correlation between being pooled and the use of
      svo_function, though in practice, lockd is the only non-pooled service,
      and the only one not to use svo_function.
      
      This is untidy and would cause problems if we changed lockd to use
      svc_set_num_threads(), which requires the use of ->svo_function.
      
      So change the test for "is the service pooled" to "is sv_nrpools > 1".
      
      This means that when svc_pool_map_get() returns 1, it must NOT take a
      reference to the pool.
      
      We discard svc_serv_is_pooled(), and test sv_nrpools directly.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      93aa619e