1. 27 Apr, 2017 1 commit
    • NFSv4.x/callback: Create the callback service through svc_create_pooled · df807fff
      Kinglong Mee authored
      As the comments for svc_set_num_threads() said,
      " Destroying threads relies on the service threads filling in
      rqstp->rq_task, which only the nfs ones do.  Assumes the serv
      has been created using svc_create_pooled()."
      
      If the service is created through svc_create(), svc_destroy() still
      calls svc_pool_map_put() even though the pool map isn't used.
      This drops a reference on the pool map, so the next user of the
      pool map sees npools equal to zero.
      
      [  137.992130] divide error: 0000 [#1] SMP
      [  137.992148] Modules linked in: nfsd(E) nfsv4 nfs fscache fuse tun bridge stp llc ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event vmw_balloon coretemp crct10dif_pclmul crc32_pclmul ppdev ghash_clmulni_intel intel_rapl_perf joydev snd_ens1371 gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore parport_pc parport nfit acpi_cpufreq tpm_tis tpm_tis_core tpm vmw_vmci i2c_piix4 shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm crc32c_intel drm e1000 mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
      [  137.992336] CPU: 0 PID: 4514 Comm: rpc.nfsd Tainted: G            E   4.11.0-rc8+ #536
      [  137.992777] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [  137.993757] task: ffff955984101d00 task.stack: ffff9873c2604000
      [  137.994231] RIP: 0010:svc_pool_for_cpu+0x2b/0x80 [sunrpc]
      [  137.994768] RSP: 0018:ffff9873c2607c18 EFLAGS: 00010246
      [  137.995227] RAX: 0000000000000000 RBX: ffff95598376f000 RCX: 0000000000000002
      [  137.995673] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9559944aec00
      [  137.996156] RBP: ffff9873c2607c18 R08: ffff9559944aec28 R09: 0000000000000000
      [  137.996609] R10: 0000000001080002 R11: 0000000000000000 R12: ffff95598376f010
      [  137.997063] R13: ffff95598376f018 R14: ffff9559944aec28 R15: ffff9559944aec00
      [  137.997584] FS:  00007f755529eb40(0000) GS:ffff9559bb600000(0000) knlGS:0000000000000000
      [  137.998048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  137.998548] CR2: 000055f3aecd9660 CR3: 0000000084290000 CR4: 00000000001406f0
      [  137.999052] Call Trace:
      [  137.999517]  svc_xprt_do_enqueue+0xef/0x260 [sunrpc]
      [  138.000028]  svc_xprt_received+0x47/0x90 [sunrpc]
      [  138.000487]  svc_add_new_perm_xprt+0x76/0x90 [sunrpc]
      [  138.000981]  svc_addsock+0x14b/0x200 [sunrpc]
      [  138.001424]  ? recalc_sigpending+0x1b/0x50
      [  138.001860]  ? __getnstimeofday64+0x41/0xd0
      [  138.002346]  ? do_gettimeofday+0x29/0x90
      [  138.002779]  write_ports+0x255/0x2c0 [nfsd]
      [  138.003202]  ? _copy_from_user+0x4e/0x80
      [  138.003676]  ? write_recoverydir+0x100/0x100 [nfsd]
      [  138.004098]  nfsctl_transaction_write+0x48/0x80 [nfsd]
      [  138.004544]  __vfs_write+0x37/0x160
      [  138.004982]  ? selinux_file_permission+0xd7/0x110
      [  138.005401]  ? security_file_permission+0x3b/0xc0
      [  138.005865]  vfs_write+0xb5/0x1a0
      [  138.006267]  SyS_write+0x55/0xc0
      [  138.006654]  entry_SYSCALL_64_fastpath+0x1a/0xa9
      [  138.007071] RIP: 0033:0x7f7554b9dc30
      [  138.007437] RSP: 002b:00007ffc9f92c788 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  138.007807] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f7554b9dc30
      [  138.008168] RDX: 0000000000000002 RSI: 00005640cd536640 RDI: 0000000000000003
      [  138.008573] RBP: 00007ffc9f92c780 R08: 0000000000000001 R09: 0000000000000002
      [  138.008918] R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000004
      [  138.009254] R13: 00005640cdbf77a0 R14: 00005640cdbf7720 R15: 00007ffc9f92c238
      [  138.009610] Code: 0f 1f 44 00 00 48 8b 87 98 00 00 00 55 48 89 e5 48 83 78 08 00 74 10 8b 05 07 42 02 00 83 f8 01 74 40 83 f8 02 74 19 31 c0 31 d2 <f7> b7 88 00 00 00 5d 89 d0 48 c1 e0 07 48 03 87 90 00 00 00 c3
      [  138.010664] RIP: svc_pool_for_cpu+0x2b/0x80 [sunrpc] RSP: ffff9873c2607c18
      [  138.011061] ---[ end trace b3468224cafa7d11 ]---
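
      The failure mode can be sketched in userspace C, assuming a shared
      pool map whose pool count is reset when its last reference is dropped.
      The struct and function names below are hypothetical; they only model
      the reference counting described above and are not the sunrpc code.

```c
/* Hypothetical model of a shared, reference-counted pool map. */
struct pool_map {
    int count;            /* number of users holding a reference */
    unsigned int npools;  /* number of pools; 0 once the map is torn down */
};

/* The first user sets up the map (assumption for this sketch). */
void pool_map_get(struct pool_map *m, unsigned int npools)
{
    if (m->count++ == 0)
        m->npools = npools;
}

/* The last put tears the map down (assumption for this sketch). */
void pool_map_put(struct pool_map *m)
{
    if (--m->count == 0)
        m->npools = 0;
}

/*
 * Pick a pool for a CPU. Returns -1 when the map has no pools; this
 * guard exists only in the sketch, where an unguarded modulo would be
 * a division by zero.
 */
int pool_for_cpu(const struct pool_map *m, unsigned int cpu)
{
    if (m->npools == 0)
        return -1;
    return (int)(cpu % m->npools);
}
```

      An unbalanced pool_map_put() from a service that never called
      pool_map_get() leaves the remaining user with npools == 0, which
      is the condition behind the divide error in the trace.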
      Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  2. 25 Apr, 2017 22 commits
    • lockd: remove redundant check on block · e56efe93
      Colin Ian King authored
      A null check followed by a return is being performed already, so block
      is always non-null at the second check on block, hence we can remove
      this redundant null-check (Detected by PVS-Studio).  Also re-work
      comment to clean up a check-patch warning.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Clean out old XDR encoders · dadf3e43
      Chuck Lever authored
      Clean up: These have been replaced and are no longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove the req_map cache · 2cf32924
      Chuck Lever authored
      req_maps are no longer used by the send path and can thus be removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove unused RDMA Write completion handler · 68cc4636
      Chuck Lever authored
      Clean up. All RDMA Write completions are now handled by
      svc_rdma_wc_write_ctx.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt · ded8d196
      Chuck Lever authored
      The sge array in struct svc_rdma_op_ctxt is no longer used for
      sending RDMA Write WRs. It need only accommodate the construction of
      Send and Receive WRs. The maximum inline size is the largest payload
      it needs to handle now.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Clean up RPC-over-RDMA backchannel reply processing · f5821c76
      Chuck Lever authored
      Replace C structure-based XDR decoding with pointer arithmetic.
      Pointer arithmetic is considered more portable.
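
      As an illustration of the pointer-arithmetic style, here is a minimal
      sketch that reads big-endian 32-bit XDR words byte by byte, so it does
      not depend on struct padding, alignment, or host byte order. The
      function name is hypothetical and is not taken from the svcrdma code.

```c
#include <stdint.h>

/* Read one big-endian 32-bit XDR word at *p and advance *p past it. */
uint32_t xdr_read_u32(const uint8_t **p)
{
    const uint8_t *b = *p;
    uint32_t v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
                 ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];

    *p = b + 4;  /* step the cursor to the next XDR word */
    return v;
}
```

      Successive calls walk the header one field at a time, which is the
      pattern that replaces overlaying a C struct on the wire data.
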
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Report Write/Reply chunk overruns · 4757d90b
      Chuck Lever authored
      Observed at Connectathon 2017.
      
      If a client has underestimated the size of a Write or Reply chunk,
      the Linux server writes as much payload data as it can, then it
      recognizes there was a problem and closes the connection without
      sending the transport header.
      
      This creates a couple of problems:
      
      <> The client never receives indication of the server-side failure,
         so it continues to retransmit the bad RPC. Forward progress on
         the transport is blocked.
      
      <> The reply payload pages are not moved out of the svc_rqst, thus
         they can be released by the RPC server before the RDMA Writes
         have completed.
      
      The new rdma_rw-ized helpers return a distinct error code when a
      Write/Reply chunk overrun occurs, so it's now easy for the caller
      (svc_rdma_sendto) to recognize this case.
      
      Instead of dropping the connection, post an RDMA_ERROR message. The
      client now sees an RDMA_ERROR and can properly terminate the RPC
      transaction.
      
      As part of the new logic, set up the same delayed release for these
      payload pages as would have occurred in the normal case.
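
      The decision described above can be sketched as a small dispatch on
      the send result. All enum and function names are hypothetical, and
      treating every other failure as "close the connection" is an
      assumption of this sketch, not something stated above.

```c
/* Hypothetical result codes from the send helpers. */
enum send_status {
    SEND_OK = 0,
    SEND_CHUNK_OVERRUN,   /* client's Write/Reply chunk was too small */
    SEND_OTHER_FAILURE
};

/* What the reply path does next. */
enum reply_action {
    ACTION_SEND_REPLY,
    ACTION_SEND_RDMA_ERROR,
    ACTION_CLOSE_CONNECTION
};

/* Map a send result to the follow-up action. */
enum reply_action choose_reply_action(enum send_status status)
{
    switch (status) {
    case SEND_OK:
        return ACTION_SEND_REPLY;
    case SEND_CHUNK_OVERRUN:
        /* Tell the client, so it can terminate the RPC. */
        return ACTION_SEND_RDMA_ERROR;
    default:
        return ACTION_CLOSE_CONNECTION;  /* assumption of this sketch */
    }
}
```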
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Clean up RDMA_ERROR path · 6b19cc5c
      Chuck Lever authored
      Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can
      be refactored to reduce code duplication and remove C
      structure-based XDR encoding. It is also relocated to the source file
      that contains its only caller.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Use rdma_rw API in RPC reply path · 9a6a180b
      Chuck Lever authored
      The current svcrdma sendto code path posts one RDMA Write WR at a
      time. Each of these Writes typically carries a small number of pages
      (for instance, up to 30 pages for mlx4 devices). That means a 1MB
      NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
      and one for the Send WR carrying the actual RPC Reply message.
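
      The count of 9 can be checked with a short calculation, assuming
      4 KiB pages (the page size is not stated above). The function name
      is hypothetical.

```c
/*
 * Number of RDMA Write posts needed when each Write carries at most
 * pages_per_write pages. Both divisions round up.
 */
unsigned long writes_needed(unsigned long payload_bytes,
                            unsigned long page_size,
                            unsigned long pages_per_write)
{
    unsigned long pages = (payload_bytes + page_size - 1) / page_size;

    return (pages + pages_per_write - 1) / pages_per_write;
}
```

      With a 1 MiB payload, 4096-byte pages and 30 pages per Write, this
      is 256 pages and therefore 9 Writes, plus one Send for the Reply.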
      
      Instead, use the new rdma_rw API. The details of Write WR chain
      construction and memory registration are taken care of in the RDMA
      core. svcrdma can focus on the details of the RPC-over-RDMA
      protocol. This gives three main benefits:
      
      1. All Write WRs for one RDMA segment are posted in a single chain.
      As few as one ib_post_send() for each Write chunk.
      
      2. The Write path can now use FRWR to register the Write buffers.
      If the device's maximum page list depth is large, this means a
      single Write WR is needed for each RPC's Write chunk data.
      
      3. The new code introduces support for RPCs that carry both a Write
      list and a Reply chunk. This combination can be used for an NFSv4
      READ where the data payload is large, and thus is removed from the
      Payload Stream, but the Payload Stream is still larger than the
      inline threshold.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Introduce local rdma_rw API helpers · f13193f5
      Chuck Lever authored
      The plan is to replace the local bespoke code that constructs and
      posts RDMA Read and Write Work Requests with calls to the rdma_rw
      API. This shares code with other RDMA-enabled ULPs that manages the
      gory details of buffer registration and posting Work Requests.
      
      Some design notes:
      
       o The structure of RPC-over-RDMA transport headers is flexible,
         allowing multiple segments per Reply with arbitrary alignment,
         each with a unique R_key. Write and Send WRs continue to be
         built and posted in separate code paths. However, one whole
         chunk (with one or more RDMA segments apiece) gets exactly
         one ib_post_send and one work completion.
      
       o svc_xprt reference counting is modified, since a chain of
         rdma_rw_ctx structs generates one completion, no matter how
         many Write WRs are posted.
      
       o The current code builds the transport header as it is
         constructing Write WRs. I've replaced that with marshaling of transport
         header data items in a separate step. This is because the exact
         structure of client-provided segments may not align with the
         components of the server's reply xdr_buf, or the pages in the
         page list. Thus parts of each client-provided segment may be
         written at different points in the send path.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Clean up svc_rdma_get_inv_rkey() · c238c4c0
      Chuck Lever authored
      Replace C structure-based XDR decoding with more portable code that
      instead uses pointer arithmetic.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Add helper to save pages under I/O · c55ab070
      Chuck Lever authored
      Clean up: extract the logic to save pages under I/O into a helper to
      add a big documenting comment without adding clutter in the send
      path.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT · b623589d
      Chuck Lever authored
      The Send Queue depth is temporarily reduced to 1 SQE per credit. The
      new rdma_rw API does an internal computation, during QP creation, to
      increase the depth of the Send Queue to handle RDMA Read and Write
      operations.
      
      This change has to come before the NFSD code paths are updated to
      use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
      the size of the SQ too much, resulting in memory allocation failures
      during QP creation.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Add svc_rdma_map_reply_hdr() · 6e6092ca
      Chuck Lever authored
      Introduce a helper to DMA-map a reply's transport header before
      sending it. This will in part replace the map vector cache.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Move send_wr to svc_rdma_op_ctxt · 17f5f7f5
      Chuck Lever authored
      Clean up: Move the ib_send_wr off the stack, and move common code
      to post a Send Work Request into a helper.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • NFS: don't try to cross a mountpount when there isn't one there. · 99bbf6ec
      NeilBrown authored
      Consider the sequence of commands:
       mkdir -p /import/nfs /import/bind /import/etc
       mount --bind / /import/bind
       mount --make-private /import/bind
       mount --bind /import/etc /import/bind/etc
      
       exportfs -o rw,no_root_squash,crossmnt,async,no_subtree_check localhost:/
       mount -o vers=4 localhost:/ /import/nfs
       ls -l /import/nfs/etc
      
      You would not expect this to report a stale file handle.
      Yet it does.
      
      The manipulations under /import/bind cause the dentry for
      /etc to get the DCACHE_MOUNTED flag set, even though nothing
      is mounted on /etc.  This causes nfsd to call
      nfsd_cross_mnt() even though there is no mountpoint.  So an
      upcall to mountd for "/etc" is performed.
      
      The 'crossmnt' flag on the export of / causes mountd to
      report that /etc is exported as it is a descendant of /.  It
      assumes the kernel wouldn't ask about something that wasn't
      a mountpoint.  The filehandle returned identifies the
      filesystem and the inode number of /etc.
      
      When this filehandle is presented to rpc.mountd, via
      "nfsd.fh", the inode cannot be found associated with any
      name in /etc/exports, or with any mountpoint listed by
      getmntent().  So rpc.mountd says the filehandle doesn't
      exist. Hence ESTALE.
      
      This is fixed by teaching nfsd not to trust DCACHE_MOUNTED
      too much.  It is just a hint, not a guarantee.
      Change nfsd_mountpoint() to return '1' for a certain mountpoint,
      '2' for a possible mountpoint, and 0 otherwise.
      
      Then change nfsd_cross_mnt() to check if follow_down()
      actually found a mountpoint and, if not, to avoid performing
      a lookup if the location is not known to certainly require
      an export-point.
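
      A minimal sketch of that decision, using the three return values
      described above (1 certain, 2 possible, 0 otherwise). The enum and
      function names are hypothetical and this is not the nfsd code.

```c
#include <stdbool.h>

/* Hint values as described in the text above. */
enum mnt_hint {
    MNT_NONE     = 0,  /* not a mountpoint */
    MNT_CERTAIN  = 1,  /* certainly a mountpoint */
    MNT_POSSIBLE = 2   /* flagged as mounted, but only as a hint */
};

/*
 * Decide whether to look up an export for this location.
 * crossed_mount says whether following the mount actually moved to a
 * different filesystem.
 */
bool should_lookup_export(enum mnt_hint hint, bool crossed_mount)
{
    if (hint == MNT_NONE)
        return false;
    if (crossed_mount)
        return true;
    /* Nothing was mounted here: only a certain hint justifies a lookup. */
    return hint == MNT_CERTAIN;
}
```

      In the /etc example, the hint is only "possible" and no mount is
      crossed, so the lookup is skipped.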
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd4: remove pointless strdup_if_nonnull · 2f10fdcb
      NeilBrown authored
      kstrdup() already checks for NULL.
      
      (Brought to our attention by Jason Yan noticing (from sparse output)
      that it should have been declared static.)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Reported-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • uapi: fix linux/nfsd/cld.h userspace compilation errors · 16719199
      Dmitry V. Levin authored
      Include <linux/types.h> and consistently use types it provides
      to fix the following linux/nfsd/cld.h userspace compilation errors:
      
      /usr/include/linux/nfsd/cld.h:40:2: error: unknown type name 'uint16_t'
        uint16_t cn_len;    /* length of cm_id */
      /usr/include/linux/nfsd/cld.h:46:2: error: unknown type name 'uint8_t'
        uint8_t  cm_vers;  /* upcall version */
      /usr/include/linux/nfsd/cld.h:47:2: error: unknown type name 'uint8_t'
        uint8_t  cm_cmd;   /* upcall command */
      /usr/include/linux/nfsd/cld.h:48:2: error: unknown type name 'int16_t'
        int16_t  cm_status;  /* return code */
      /usr/include/linux/nfsd/cld.h:49:2: error: unknown type name 'uint32_t'
        uint32_t cm_xid;   /* transaction id */
      /usr/include/linux/nfsd/cld.h:51:3: error: unknown type name 'int64_t'
         int64_t  cm_gracetime; /* grace period start time */
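
      A minimal sketch of the approach, assuming a Linux build environment
      with the kernel UAPI headers installed. The struct below is
      hypothetical: it borrows field names from the errors above but does
      not reproduce the real cld.h layout.

```c
#include <linux/types.h>

/*
 * Hypothetical message layout. The fixed-width types come from
 * <linux/types.h>, so no <stdint.h> include is required.
 */
struct example_cld_msg {
    __u8  cm_vers;       /* upcall version */
    __u8  cm_cmd;        /* upcall command */
    __s16 cm_status;     /* return code */
    __u32 cm_xid;        /* transaction id */
    __s64 cm_gracetime;  /* grace period start time */
};
```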
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd: check for oversized NFSv2/v3 arguments · 51f56777
      J. Bruce Fields authored
      A client can append random data to the end of an NFSv2 or NFSv3 RPC call
      without our complaining; we'll just stop parsing at the end of the
      expected data and ignore the rest.
      
      Encoded arguments and replies are stored together in an array of pages,
      and if a call is too large it could leave inadequate space for the
      reply.  This is normally OK because NFS RPC's typically have either
      short arguments and long replies (like READ) or long arguments and short
      replies (like WRITE).  But a client that sends an incorrectly long call
      can violate those assumptions.  This was observed to cause crashes.
      
      So, insist that the argument not be any longer than we expect.
      
      Also, several operations increment rq_next_page in the decode routine
      before checking the argument size, which can leave rq_next_page pointing
      well past the end of the page array, causing trouble later in
      svc_free_pages.
      
      As followup we may also want to rewrite the encoding routines to check
      more carefully that they aren't running off the end of the page array.
      Reported-by: Tuomas Haanpää <thaan@synopsys.com>
      Reported-by: Ari Kauppi <ari@synopsys.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd: stricter decoding of write-like NFSv2/v3 ops · 13bf9fbf
      J. Bruce Fields authored
      The NFSv2/v3 code does not systematically check whether we decode past
      the end of the buffer.  This generally appears to be harmless, but there
      are a few places where we do arithmetic on the pointers involved and
      don't account for the possibility that a length could be negative.  Add
      checks to catch these.
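
      The kind of check involved can be sketched as follows. The function
      and parameter names are hypothetical; the point is to compare before
      subtracting, so that a header longer than the buffer cannot produce
      a negative (or wrapped) remaining length.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if a payload of claimed_len bytes, padded to a 4-byte
 * XDR boundary, fits in what is left of the buffer after the header.
 */
bool write_payload_fits(size_t buf_len, size_t hdr_len, uint32_t claimed_len)
{
    uint64_t padded;

    if (hdr_len > buf_len)
        return false;  /* header alone overruns the buffer */

    /* Widen before rounding up so the padding cannot overflow. */
    padded = ((uint64_t)claimed_len + 3) & ~(uint64_t)3;
    return padded <= (uint64_t)(buf_len - hdr_len);
}
```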
      Reported-by: Tuomas Haanpää <thaan@synopsys.com>
      Reported-by: Ari Kauppi <ari@synopsys.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd4: minor NFSv2/v3 write decoding cleanup · db44bac4
      J. Bruce Fields authored
      Use a couple shortcuts that will simplify a following bugfix.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • nfsd: check for oversized NFSv2/v3 arguments · e6838a29
      J. Bruce Fields authored
      A client can append random data to the end of an NFSv2 or NFSv3 RPC call
      without our complaining; we'll just stop parsing at the end of the
      expected data and ignore the rest.
      
      Encoded arguments and replies are stored together in an array of pages,
      and if a call is too large it could leave inadequate space for the
      reply.  This is normally OK because NFS RPC's typically have either
      short arguments and long replies (like READ) or long arguments and short
      replies (like WRITE).  But a client that sends an incorrectly long call
      can violate those assumptions.  This was observed to cause crashes.
      
      Also, several operations increment rq_next_page in the decode routine
      before checking the argument size, which can leave rq_next_page pointing
      well past the end of the page array, causing trouble later in
      svc_free_pages.
      
      So, following a suggestion from Neil Brown, add a central check to
      enforce our expectation that no NFSv2/v3 call has both a large call and
      a large reply.
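
      One way such a central check could look, as a sketch only. The
      one-page threshold, the convention that a maximum reply length of 0
      means "unknown", and all names here are assumptions of this example.

```c
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_PAGE_SIZE 4096u  /* assumed page size */

/*
 * Reject a call only when both sides may be large: the argument is
 * bigger than a page and the reply is not known to fit in a page.
 */
bool request_too_big(size_t arg_len, size_t max_reply_len)
{
    /* Reply known to be small: a long argument (like WRITE) is fine. */
    if (max_reply_len > 0 && max_reply_len < EXAMPLE_PAGE_SIZE)
        return false;

    /* Reply may be large (like READ): the argument must be short. */
    return arg_len > EXAMPLE_PAGE_SIZE;
}
```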
      
      As followup we may also want to rewrite the encoding routines to check
      more carefully that they aren't running off the end of the page array.
      
      We may also consider rejecting calls that have any extra garbage
      appended.  That would be safer, and within our rights by spec, but given
      the age of our server and the NFS protocol, and the fact that we've
      never enforced this before, we may need to balance that against the
      possibility of breaking some oddball client.
      Reported-by: Tuomas Haanpää <thaan@synopsys.com>
      Reported-by: Ari Kauppi <ari@synopsys.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  3. 23 Apr, 2017 4 commits
  4. 21 Apr, 2017 13 commits
    • Merge tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux · 94836ecf
      Linus Torvalds authored
      Pull nfsd bugfix from Bruce Fields:
       "Fix a 4.11 regression that triggers a BUG() on an attempt to use an
        unsupported NFSv4 compound op"
      
      * tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux:
        nfsd: fix oops on unsupported operation
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 057a650b
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Don't race in IPSEC dumps, from Yuejie Shi.
      
       2) Verify lengths properly in IPSEC requests, from Herbert Xu.
      
       3) Fix out of bounds access in ipv6 segment routing code, from David
          Lebrun.
      
       4) Don't write into the header of cloned SKBs in smsc95xx driver, from
          James Hughes.
      
       5) Several other drivers have this bug too, fix them. From Eric
          Dumazet.
      
       6) Fix access to uninitialized data in TC action cookie code, from
          Wolfgang Bumiller.
      
       7) Fix double free in IPV6 segment routing, again from David Lebrun.
      
       8) Don't let userspace set the RTF_PCPU flag, oops. From David Ahern.
      
       9) Fix use after free in qrtr code, from Dan Carpenter.
      
      10) Don't double-destroy devices in ip6mr code, from Nikolay
          Aleksandrov.
      
      11) Don't pass out-of-range TX queue indices into drivers, from Tushar
          Dave.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
        netpoll: Check for skb->queue_mapping
        ip6mr: fix notification device destruction
        bpf, doc: update bpf maintainers entry
        net: qrtr: potential use after free in qrtr_sendmsg()
        bpf: Fix values type used in test_maps
        net: ipv6: RTF_PCPU should not be settable from userspace
        gso: Validate assumption of frag_list segementation
        kaweth: use skb_cow_head() to deal with cloned skbs
        ch9200: use skb_cow_head() to deal with cloned skbs
        lan78xx: use skb_cow_head() to deal with cloned skbs
        sr9700: use skb_cow_head() to deal with cloned skbs
        cx82310_eth: use skb_cow_head() to deal with cloned skbs
        smsc75xx: use skb_cow_head() to deal with cloned skbs
        ipv6: sr: fix double free of skb after handling invalid SRH
        MAINTAINERS: Add "B:" field for networking.
        net sched actions: allocate act cookie early
        qed: Fix issue in populating the PFC config paramters.
        qed: Fix possible system hang in the dcbnl-getdcbx() path.
        qed: Fix sending an invalid PFC error mask to MFW.
        qed: Fix possible error in populating max_tc field.
        ...
    • netpoll: Check for skb->queue_mapping · c70b17b7
      Tushar Dave authored
      Reducing real_num_tx_queues needs to be in sync with skb queue_mapping;
      otherwise skbs with queue_mapping greater than real_num_tx_queues
      can be sent to the underlying driver and can result in a kernel panic.

      One such event is running netconsole and enabling VF on the same
      device, or running netconsole and changing the number of tx queues via
      ethtool on the same device.
      
      e.g.
      Unable to handle kernel NULL pointer dereference
      tsk->{mm,active_mm}->context = 0000000000001525
      tsk->{mm,active_mm}->pgd = fff800130ff9a000
                    \|/ ____ \|/
                    "@'/ .. \`@"
                    /_| \__/ |_\
                       \__U_/
      kworker/48:1(475): Oops [#1]
      CPU: 48 PID: 475 Comm: kworker/48:1 Tainted: G           OE
      4.11.0-rc3-davem-net+ #7
      Workqueue: events queue_process
      task: fff80013113299c0 task.stack: fff800131132c000
      TSTATE: 0000004480e01600 TPC: 00000000103f9e3c TNPC: 00000000103f9e40 Y:
      00000000    Tainted: G           OE
      TPC: <ixgbe_xmit_frame_ring+0x7c/0x6c0 [ixgbe]>
      g0: 0000000000000000 g1: 0000000000003fff g2: 0000000000000000 g3:
      0000000000000001
      g4: fff80013113299c0 g5: fff8001fa6808000 g6: fff800131132c000 g7:
      00000000000000c0
      o0: fff8001fa760c460 o1: fff8001311329a50 o2: fff8001fa7607504 o3:
      0000000000000003
      o4: fff8001f96e63a40 o5: fff8001311d77ec0 sp: fff800131132f0e1 ret_pc:
      000000000049ed94
      RPC: <set_next_entity+0x34/0xb80>
      l0: 0000000000000000 l1: 0000000000000800 l2: 0000000000000000 l3:
      0000000000000000
      l4: 000b2aa30e34b10d l5: 0000000000000000 l6: 0000000000000000 l7:
      fff8001fa7605028
      i0: fff80013111a8a00 i1: fff80013155a0780 i2: 0000000000000000 i3:
      0000000000000000
      i4: 0000000000000000 i5: 0000000000100000 i6: fff800131132f1a1 i7:
      00000000103fa4b0
      I7: <ixgbe_xmit_frame+0x30/0xa0 [ixgbe]>
      Call Trace:
       [00000000103fa4b0] ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
       [0000000000998c74] netpoll_start_xmit+0xf4/0x200
       [0000000000998e10] queue_process+0x90/0x160
       [0000000000485fa8] process_one_work+0x188/0x480
       [0000000000486410] worker_thread+0x170/0x4c0
       [000000000048c6b8] kthread+0xd8/0x120
       [0000000000406064] ret_from_fork+0x1c/0x2c
       [0000000000000000]           (null)
      Disabling lock debugging due to kernel taint
      Caller[00000000103fa4b0]: ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
      Caller[0000000000998c74]: netpoll_start_xmit+0xf4/0x200
      Caller[0000000000998e10]: queue_process+0x90/0x160
      Caller[0000000000485fa8]: process_one_work+0x188/0x480
      Caller[0000000000486410]: worker_thread+0x170/0x4c0
      Caller[000000000048c6b8]: kthread+0xd8/0x120
      Caller[0000000000406064]: ret_from_fork+0x1c/0x2c
      Caller[0000000000000000]:           (null)
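
      A minimal sketch of a queue_mapping validity check before transmit.
      Folding an out-of-range index back into range with a modulo is one
      possible policy and is an assumption of this example; the function
      name is hypothetical.

```c
/*
 * Return a tx queue index that is valid for the current number of
 * real tx queues. An index that became stale after the queue count
 * was reduced is folded back into range.
 */
unsigned int valid_tx_queue(unsigned int q_index,
                            unsigned int real_num_tx_queues)
{
    if (real_num_tx_queues == 0)
        return 0;  /* nothing sensible to pick; avoid modulo by zero */

    if (q_index >= real_num_tx_queues)
        q_index %= real_num_tx_queues;

    return q_index;
}
```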
      Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6mr: fix notification device destruction · 723b929c
      Nikolay Aleksandrov authored
      Andrey Konovalov reported a BUG in the ip6mr code, caused by
      calling unregister_netdevice_many for a device that is already
      being destroyed. In IPv4's ipmr that was resolved by two commits
      a long time ago by introducing the "notify" parameter to the delete
      function and avoiding the unregister when called from a notifier, so
      let's do the same for ip6mr.
      
      The trace from Andrey:
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:6813!
      invalid opcode: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 1165 Comm: kworker/u4:3 Not tainted 4.11.0-rc7+ #251
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
      Workqueue: netns cleanup_net
      task: ffff880069208000 task.stack: ffff8800692d8000
      RIP: 0010:rollback_registered_many+0x348/0xeb0 net/core/dev.c:6813
      RSP: 0018:ffff8800692de7f0 EFLAGS: 00010297
      RAX: ffff880069208000 RBX: 0000000000000002 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88006af90569
      RBP: ffff8800692de9f0 R08: ffff8800692dec60 R09: 0000000000000000
      R10: 0000000000000006 R11: 0000000000000000 R12: ffff88006af90070
      R13: ffff8800692debf0 R14: dffffc0000000000 R15: ffff88006af90000
      FS:  0000000000000000(0000) GS:ffff88006cb00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe7e897d870 CR3: 00000000657e7000 CR4: 00000000000006e0
      Call Trace:
       unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
       unregister_netdevice_many+0xc8/0x120 net/core/dev.c:7880
       ip6mr_device_event+0x362/0x3f0 net/ipv6/ip6mr.c:1346
       notifier_call_chain+0x145/0x2f0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x51/0x90 net/core/dev.c:1647
       call_netdevice_notifiers net/core/dev.c:1663
       rollback_registered_many+0x919/0xeb0 net/core/dev.c:6841
       unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
       unregister_netdevice_many net/core/dev.c:7880
       default_device_exit_batch+0x4fa/0x640 net/core/dev.c:8333
       ops_exit_list.isra.4+0x100/0x150 net/core/net_namespace.c:144
       cleanup_net+0x5a8/0xb40 net/core/net_namespace.c:463
       process_one_work+0xc04/0x1c10 kernel/workqueue.c:2097
       worker_thread+0x223/0x19c0 kernel/workqueue.c:2231
       kthread+0x35e/0x430 kernel/kthread.c:231
       ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
      Code: 3c 32 00 0f 85 70 0b 00 00 48 b8 00 02 00 00 00 00 ad de 49 89
      47 78 e9 93 fe ff ff 49 8d 57 70 49 8d 5f 78 eb 9e e8 88 7a 14 fe <0f>
      0b 48 8b 9d 28 fe ff ff e8 7a 7a 14 fe 48 b8 00 00 00 00 00
      RIP: rollback_registered_many+0x348/0xeb0 RSP: ffff8800692de7f0
      ---[ end trace e0b29c57e9b3292c ]---
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Daniel Borkmann's avatar
      bpf, doc: update bpf maintainers entry · cdb90499
      Daniel Borkmann authored
      Add various related files that have been missing under the
      BPF entry, covering essential parts of its infrastructure,
      and also add myself as co-maintainer.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Dan Carpenter's avatar
      net: qrtr: potential use after free in qrtr_sendmsg() · 6f60f438
      Dan Carpenter authored
      If skb_pad() fails then it frees the skb so we should check for errors.
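
The ownership rule stated above (the padding helper frees the buffer when it fails, so the caller must check the return value and must not touch the buffer afterwards) can be sketched in user-space C. Everything here is a hypothetical analogue: `struct buf`, `buf_pad()` and `send_padded()` are illustration names, not kernel APIs, and `freed_count` exists only to make the single free observable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical heap buffer standing in for an skb. */
struct buf {
    unsigned char *data;
    size_t len; /* bytes in use */
    size_t cap; /* bytes allocated */
};

static int freed_count; /* illustration only: counts buf_free() calls */

static struct buf *buf_new(size_t len, size_t cap)
{
    struct buf *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->data = calloc(cap, 1);
    if (!b->data) {
        free(b);
        return NULL;
    }
    b->len = len;
    b->cap = cap;
    return b;
}

static void buf_free(struct buf *b)
{
    free(b->data);
    free(b);
    freed_count++;
}

/* Zero the next `pad` bytes. On failure the buffer is FREED and -1 returned. */
static int buf_pad(struct buf *b, size_t pad)
{
    if (b->len + pad > b->cap) {
        buf_free(b);
        return -1;
    }
    memset(b->data + b->len, 0, pad);
    return 0;
}

/* Pad to a 4-byte boundary, then consume the buffer. */
static int send_padded(struct buf *b)
{
    size_t pad = (4 - (b->len & 3)) & 3;
    int rc = buf_pad(b, pad);

    if (rc)
        return rc;  /* b is already gone: no access, no second free */
    b->len += pad;
    buf_free(b);    /* stands in for handing the buffer off */
    return 0;
}
```

Without the `if (rc)` check, the failure path would go on to write `b->len` and free `b` again, which is the kind of use after free the commit title warns about.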
      
      Fixes: bdabad3e ("net: Add Qualcomm IPC router")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David Miller's avatar
      bpf: Fix values type used in test_maps · 89087c45
      David Miller authored
      Maps of per-cpu type have their value element size adjusted to 8 if it
      is specified smaller during various map operations.
      
      This makes test_maps fail when built as a 32-bit binary; in fact, the
      kernel writes past the end of the value's array on the user's stack.
      
      To be quite honest, I think the kernel should reject creation of a
      per-cpu map that doesn't have a value size of at least 8 if that's
      what the kernel is going to silently adjust to later.
      
      If the user passed something smaller, it is a sizeof() calculation
      based upon the type they will actually use (just like in this testcase
      code) in later calls to the map operations.
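
The size mismatch described above can be made concrete with a little arithmetic. The sketch assumes the kernel's adjustment amounts to rounding each per-CPU value slot up to a multiple of 8 bytes; `round_up8()` and `percpu_user_buf_size()` are hypothetical helper names, not kernel or libbpf functions.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of 8 (assumed per-CPU slot size rule). */
static size_t round_up8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/* Bytes a user buffer needs if the kernel copies one rounded slot per CPU. */
static size_t percpu_user_buf_size(size_t value_size, size_t ncpus)
{
    return round_up8(value_size) * ncpus;
}
```

With a 4-byte value and 4 CPUs, an array sized from `sizeof()` of the 4-byte type holds 16 bytes, while `percpu_user_buf_size(4, 4)` gives 32, so a copy of that size would run past the end of the smaller array.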
      
      Fixes: df570f57 ("samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_ARRAY")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
    • David Ahern's avatar
      net: ipv6: RTF_PCPU should not be settable from userspace · 557c44be
      David Ahern authored
      Andrey reported a fault in the IPv6 route code:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff880069809600 task.stack: ffff880062dc8000
      RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
      RSP: 0018:ffff880062dced30 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: ffff8800670561c0 RCX: 0000000000000006
      RDX: 0000000000000003 RSI: ffff880062dcfb28 RDI: 0000000000000018
      RBP: ffff880062dced68 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff880062dcfb28 R14: dffffc0000000000 R15: 0000000000000000
      FS:  00007feebe37e7c0(0000) GS:ffff88006cb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000205a0fe4 CR3: 000000006b5c9000 CR4: 00000000000006e0
      Call Trace:
       ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
       ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
      ...
      
      Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
      set. Flags passed to the kernel are blindly copied to the allocated
      rt6_info by ip6_route_info_create making a newly inserted route appear
      as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
      and expects rt->dst.from to be set - which it is not since it is not
      really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
      generates the fault.
      
      Fix by checking for the flag and failing with EINVAL.
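
A minimal sketch of such a check, written as stand-alone C rather than the actual patch. `validate_route_flags()` is a hypothetical helper name, and the numeric value given to `RTF_PCPU` here is an assumption for illustration; real code should take it from the kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed value for illustration; use the definition from the uapi headers. */
#define RTF_PCPU 0x40000000u

/*
 * Reject flags that only the kernel may set internally. Returns 0 when
 * the flags are acceptable and -EINVAL when userspace passed RTF_PCPU.
 */
static int validate_route_flags(uint32_t flags)
{
    if (flags & RTF_PCPU)
        return -EINVAL;
    return 0;
}
```

Rejecting the flag at the point where user-supplied flags are validated means a route inserted from userspace can never be mistaken for a per-cpu copy later on.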
      
      Fixes: d52d3997 ("ipv6: Create percpu rt6_info")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ilan Tayari's avatar
      gso: Validate assumption of frag_list segmentation · 43170c4e
      Ilan Tayari authored
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it is not true.
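
The assumption being validated can be sketched as a simple predicate over payload sizes. This is a user-space illustration, not the code in skb_segment(): `frag_list_is_uniform()` is a hypothetical name, and the frag_list is modelled as a plain array of per-SKB payload lengths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when every entry except possibly the last carries exactly
 * seg_len bytes of payload. A false result means the caller should use
 * the slower, fully general segmentation path instead.
 */
static bool frag_list_is_uniform(const size_t *payload, size_t n, size_t seg_len)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (payload[i] != seg_len)
            return false;
    }
    return true;
}
```

A list such as 1448, 1448, 1448, 600 passes for a segment length of 1448, while 1448, 2896, 1448, 600 (a middle entry holding a different amount, as can happen after GRO) does not.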
      
      Change-Id: Ia03983f4a47b6534dd987d7a2aad96d54d46d212
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: Ilan Tayari <ilant@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'skb_cow_head' · 918b7024
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: use skb_cow_head() to deal with cloned skbs
      
      James Hughes found an issue with the smsc95xx driver. The same
      problematic code is found in other drivers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      kaweth: use skb_cow_head() to deal with cloned skbs · 39fba783
      Eric Dumazet authored
      We can use skb_cow_head() to properly deal with clones,
      especially the ones coming from the TCP stack that allow their head
      to be modified. This avoids a copy.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      ch9200: use skb_cow_head() to deal with cloned skbs · 6bc6895b
      Eric Dumazet authored
      We need to ensure there is enough headroom to push an extra header,
      but we also need to check whether we are allowed to change headers.
      
      skb_cow_head() is the proper helper to deal with this.
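
The two conditions named above (enough headroom, and permission to modify the header) can be modelled in a few lines of user-space C. This is a rough sketch and not the kernel helper: `struct pkt`, `pkt_cow_head()` and `pkt_push_header()` are hypothetical names, the private copy is modelled as a counter, and the model never fails, whereas a real allocation could.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet model (not struct sk_buff). */
struct pkt {
    size_t headroom;      /* free bytes in front of the data */
    bool   header_shared; /* header area is shared with a clone */
    int    copies;        /* how many private header copies were made */
};

/*
 * Make the header area private and at least `needed` bytes deep.
 * A copy happens only if the header is shared or the headroom is short.
 * This model always returns 0; real code must handle allocation failure.
 */
static int pkt_cow_head(struct pkt *p, size_t needed)
{
    if (p->header_shared || p->headroom < needed) {
        if (p->headroom < needed)
            p->headroom = needed;
        p->header_shared = false;
        p->copies++;
    }
    return 0;
}

/* Push a header of hdr_len bytes, checking the result of the COW step. */
static int pkt_push_header(struct pkt *p, size_t hdr_len)
{
    int rc = pkt_cow_head(p, hdr_len);

    if (rc)
        return rc;
    p->headroom -= hdr_len; /* the header now occupies this space */
    return 0;
}
```

A driver that checks headroom alone would write into a shared header in the second case exercised here (shared header, ample headroom); checking both conditions in one helper is the point the commit message makes.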
      
      Fixes: 4a476bd6 ("usbnet: New driver for QinHeng CH9200 devices")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      lan78xx: use skb_cow_head() to deal with cloned skbs · d4ca7359
      Eric Dumazet authored
      We need to ensure there is enough headroom to push an extra header,
      but we also need to check whether we are allowed to change headers.
      
      skb_cow_head() is the proper helper to deal with this.
      
      Fixes: 55d7de9d ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Cc: Woojung Huh <woojung.huh@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>