1. 17 May, 2016 11 commits
    • xprtrdma: Avoid using Write list for small NFS READ requests · cce6deeb
      Chuck Lever authored
      Avoid the latency and interrupt overhead of registering a Write
      chunk when handling NFS READ requests of a few hundred bytes or
      less.
      
      This change does not interoperate with Linux NFS/RDMA servers
      that do not have commit 9d11b51c ('svcrdma: Fix send_reply()
      scatter/gather set-up'). Commit 9d11b51c was introduced in v4.3,
      and is included in 4.2.y, 4.1.y, and 3.18.y.
      
      Oracle bug 22925946 has been filed to request that the above fix
      be included in the Oracle Linux UEK4 NFS/RDMA server.
      
      Red Hat bugzillas 1327280 and 1327554 have been filed to request
      that RHEL NFS/RDMA server backports include the above fix.
      
      Workaround: Replace the "proto=rdma,port=20049" mount options
      with "proto=tcp" until commit 9d11b51c is applied to your
      NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Prevent inline overflow · 302d3deb
      Chuck Lever authored
      When deciding whether to send a Call inline, rpcrdma_marshal_req
      doesn't take into account header bytes consumed by chunk lists.
      This results in Call messages on the wire that are sometimes larger
      than the inline threshold.
      
      Likewise, when a Write list or Reply chunk is in play, the server's
      reply has to emit an RDMA Send that includes a larger-than-minimal
      RPC-over-RDMA header.
      
      The actual size of a Call message cannot be estimated until after
      the chunk lists have been registered. Thus the size of each
      RPC-over-RDMA header can be estimated only after chunks are
      registered; but the decision to register chunks is based on the size
      of that header. Chicken, meet egg.
      
      The best a client can do is estimate header size based on the
      largest header that might occur, and then ensure that inline content
      is always smaller than that.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
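The worst-case estimate described in the commit above can be sketched as follows. This is a minimal illustration, not the kernel's code: the function names are hypothetical, the byte sizes are taken from the RPC-over-RDMA version 1 XDR layout in RFC 5666, and only Read list entries are counted, on the assumption that they are the largest per-segment encoding.

```python
# Sizes follow the RPC-over-RDMA version 1 XDR layout (RFC 5666).
XDR_WORD = 4
FIXED_HEADER = 4 * XDR_WORD          # xid, version, credits, procedure
LIST_TERMINATORS = 3 * XDR_WORD      # Read list, Write list, Reply chunk
SEGMENT = 4 * XDR_WORD               # handle (4) + length (4) + offset (8)
READ_ENTRY = 2 * XDR_WORD + SEGMENT  # list discriminator + position + segment


def max_header_size(max_segments: int) -> int:
    """Upper bound for a header whose Read list holds max_segments entries."""
    return FIXED_HEADER + LIST_TERMINATORS + max_segments * READ_ENTRY


def fits_inline(rpc_message_len: int, inline_threshold: int,
                max_segments: int) -> bool:
    """True if the RPC message fits beside a worst-case header (hypothetical helper)."""
    return rpc_message_len <= inline_threshold - max_header_size(max_segments)
```

Because the decision uses the largest header that might occur rather than the header actually built, it can be made before any chunk is registered, which sidesteps the circular dependency the commit message describes.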
    • xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers · 94931746
      Chuck Lever authored
      Send buffer space is shared between the RPC-over-RDMA header and
      an RPC message. A large RPC-over-RDMA header means less space is
      available for the associated RPC message, which then has to be
      moved via an RDMA Read or Write.
      
      As more segments are added to the chunk lists, the header increases
      in size.  Typical modern hardware needs only a few segments to
      convey the maximum payload size, but some devices and registration
      modes may need a lot of segments to convey data payload. Sometimes
      so many are needed that the remaining space in the Send buffer is
      not enough for the RPC message. Sending such a message usually
      fails.
      
      To ensure a transport can always make forward progress, cap the
      number of RDMA segments that are allowed in chunk lists. This
      prevents less-capable devices and memory registrations from
      consuming a large portion of the Send buffer by reducing the
      maximum data payload that can be conveyed with such devices.
      
      For now I choose an arbitrary maximum of 8 RDMA segments. This
      allows a maximum size RPC-over-RDMA header to fit nicely in the
      current 1024 byte inline threshold with over 700 bytes remaining
      for an inline RPC message.
      
      The current maximum data payload of NFS READ or WRITE requests is
      one megabyte. To convey that payload on a client with 4KB pages,
      each chunk segment would need to handle 32 or more data pages. This
      is well within the capabilities of FMR. For physical registration,
      the maximum payload size on platforms with 4KB pages is reduced to
      32KB.
      
      For FRWR, a device's maximum page list depth would need to be at
      least 34 to support the maximum 1MB payload. A device with a smaller
      maximum page list depth means the maximum data payload is reduced
      when using that device.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
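The payload arithmetic in the commit above can be checked with a short calculation. This is a sketch under stated assumptions: the helper names are hypothetical, payloads are assumed to be page-aligned, and the code does not attempt to reproduce the figure of 34 quoted for FRWR.

```python
PAGE_SIZE = 4096            # 4KB pages, as in the commit message
MAX_SEGMENTS = 8            # cap chosen in the commit message
MAX_PAYLOAD = 1024 * 1024   # largest NFS READ or WRITE payload (1MB)


def pages_per_segment_needed(payload=MAX_PAYLOAD, segments=MAX_SEGMENTS,
                             page_size=PAGE_SIZE):
    """Pages each segment must cover to convey payload in the given segments."""
    pages = -(-payload // page_size)   # ceiling division
    return -(-pages // segments)


def max_payload(pages_per_segment, segments=MAX_SEGMENTS, page_size=PAGE_SIZE):
    """Largest payload a registration mode can convey, capped at MAX_PAYLOAD."""
    return min(MAX_PAYLOAD, pages_per_segment * segments * page_size)
```

With these numbers, `pages_per_segment_needed()` gives 32 pages per segment for a 1MB payload, and `max_payload(1)` gives 32KB, the physical-registration limit mentioned in the message, since that mode covers one page per segment.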
    • xprtrdma: Bound the inline threshold values · 29c55422
      Chuck Lever authored
      Currently the sysctls that allow setting the inline threshold allow
      any value to be set.
      
      Small values only make the transport run slower. The default 1KB
      setting is as low as is reasonable. And the logic that decides how
      to divide a Send buffer between RPC-over-RDMA header and RPC message
      assumes (but does not check) that the lower bound is not crazy (say,
      57 bytes).
      
      Send and receive buffers share a page with some control information.
      Values larger than about 3KB can't be supported, currently.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
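Bounding a tunable of this kind amounts to a clamp. The sketch below is illustrative only: the function name is hypothetical, the lower bound is the 1KB default named in the commit message, and the upper bound of 3072 is a placeholder for "about 3KB" rather than the exact limit the kernel enforces.

```python
MIN_INLINE = 1024   # default and lowest reasonable value per the commit message
MAX_INLINE = 3072   # assumption: stands in for "about 3KB"


def clamp_inline_threshold(requested: int) -> int:
    """Force a requested inline threshold into the supported range."""
    return max(MIN_INLINE, min(requested, MAX_INLINE))
```

A request for 57 bytes, the "crazy" value mentioned above, would come back as 1024, and an oversized request would come back as the upper bound.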
    • sunrpc: Advertise maximum backchannel payload size · 6b26cc8c
      Chuck Lever authored
      RPC-over-RDMA transports have a limit on how large a backward
      direction (backchannel) RPC message can be. Ensure that the NFSv4.x
      CREATE_SESSION operation advertises this limit to servers.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • sunrpc: Update RPCBIND_MAXNETIDLEN · 4b9c7f9d
      Chuck Lever authored
      Commit 176e21ee ("SUNRPC: Support for RPC over AF_LOCAL
      transports") added a 5-character netid, but did not bump
      RPCBIND_MAXNETIDLEN from 4 to 5.
      
      Fixes: 176e21ee ("SUNRPC: Support for RPC over AF_LOCAL ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
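The constant in the commit above has to cover the longest netid string in use. A quick check, assuming an illustrative (not exhaustive) list of netids and a hypothetical helper name:

```python
# Illustrative netid strings; "local" is the AF_LOCAL netid.
NETIDS = ("udp", "tcp", "udp6", "tcp6", "rdma", "rdma6", "local")


def longest_netid(netids=NETIDS) -> int:
    """Length of the longest netid, which a maximum-length constant must cover."""
    return max(len(netid) for netid in netids)
```

For this list the result is 5, which is why a limit of 4 is too small once a 5-character netid exists.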
    • xprtrdma: Add rdma6 option to support NFS/RDMA IPv6 · 181342c5
      Shirley Ma authored
      RFC 5666: The "rdma" netid is to be used when IPv4 addressing
      is employed by the underlying transport, and "rdma6" for IPv6
      addressing.
      
      Add mount -o proto=rdma6 option to support NFS/RDMA IPv6 addressing.
      
      Changes from v2:
       - Integrated comments from Chuck Lever, Anna Schumaker, Trond Myklebust
       - Add a little more to the patch description to describe NFS/RDMA
         IPv6 suggested by Chuck Lever and Anna Schumaker
       - Removed duplicated rdma6 define
       - Remove Opt_xprt_rdma mountfamily since it doesn't support
      Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
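The netid rule quoted from RFC 5666 in the commit above can be illustrated with a small helper. This is a sketch, not the kernel's mount-option parsing: the function name is hypothetical and the address family is taken from a literal IP address string.

```python
import ipaddress


def rdma_netid(server_address: str) -> str:
    """Return "rdma" for an IPv4 address and "rdma6" for an IPv6 address."""
    addr = ipaddress.ip_address(server_address)
    return "rdma6" if addr.version == 6 else "rdma"
```

For example, `rdma_netid("192.0.2.10")` selects "rdma" and `rdma_netid("2001:db8::1")` selects "rdma6".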
    • nfs4: client: do not send empty SETATTR after OPEN_CREATE · a1d1c4f1
      Tigran Mkrtchyan authored
      OPEN_CREATE with EXCLUSIVE4_1 sends the initial file permissions.
      Ignoring the fact that the server has indicated that the file mode is
      set, the client sends yet another SETATTR request; because the mode is
      already set, the new SETATTR is empty. This is not a correctness
      problem, but it costs an extra round trip and slows opens on
      high-latency networks.

      This change aims to skip the extra SETATTR after open if there are
      no attributes to be set.
      Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: Add COPY nfs operation · 2e72448b
      Anna Schumaker authored
      This adds the copy_file_range file_operations function pointer used by
      the copy_file_range() system call.  This patch only implements sync copies,
      so if an async copy happens we decode the stateid and ignore it.
      Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
    • NFS: Add nfs_commit_file() · 67911c8f
      Anna Schumaker authored
      Copy will use this to set up a commit request for a generic range.  I
      don't want to allocate a new pagecache entry for the file, so I needed
      to change parts of the commit path to handle requests with a null
      wb_page.
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • Fixing oops in callback path · c2985d00
      Olga Kornievskaia authored
      Commit 80f96427 ("NFSv4.x: Enforce the ca_maxreponsesize_cached
      on the back channel") causes an oops when the client receives a
      callback with cachethis=yes.
      
      [  109.667378] BUG: unable to handle kernel NULL pointer dereference at 00000000000002c8
      [  109.669476] IP: [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
      [  109.671216] PGD 0
      [  109.671736] Oops: 0000 [#1] SMP
      [  109.705427] CPU: 1 PID: 3579 Comm: nfsv4.1-svc Not tainted 4.5.0-rc1+ #1
      [  109.706987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
      [  109.709468] task: ffff8800b4408000 ti: ffff88008448c000 task.ti: ffff88008448c000
      [  109.711207] RIP: 0010:[<ffffffffa08a3e68>]  [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
      [  109.713521] RSP: 0018:ffff88008448fca0  EFLAGS: 00010286
      [  109.714762] RAX: ffff880081ee202c RBX: ffff8800b7b5b600 RCX: 0000000000000001
      [  109.716427] RDX: 0000000000000008 RSI: 0000000000000008 RDI: 0000000000000000
      [  109.718091] RBP: ffff88008448fda8 R08: 0000000000000000 R09: 000000000b000000
      [  109.719757] R10: ffff880137786000 R11: ffff8800b7b5b600 R12: 0000000001000000
      [  109.721415] R13: 0000000000000002 R14: 0000000053270000 R15: 000000000000000b
      [  109.723061] FS:  0000000000000000(0000) GS:ffff880139640000(0000) knlGS:0000000000000000
      [  109.724931] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  109.726278] CR2: 00000000000002c8 CR3: 0000000034d50000 CR4: 00000000001406e0
      [  109.727972] Stack:
      [  109.728465]  ffff880081ee202c ffff880081ee201c 000000008448fcc0 ffff8800baccb800
      [  109.730349]  ffff8800baccc800 ffffffffa08d0380 0000000000000000 0000000000000000
      [  109.732211]  ffff8800b7b5b600 0000000000000001 ffffffff81d073c0 ffff880081ee3090
      [  109.734056] Call Trace:
      [  109.734657]  [<ffffffffa03795d4>] svc_process_common+0x5c4/0x6c0 [sunrpc]
      [  109.736267]  [<ffffffffa0379a4c>] bc_svc_process+0x1fc/0x360 [sunrpc]
      [  109.737775]  [<ffffffffa08a2c2c>] nfs41_callback_svc+0x10c/0x1d0 [nfsv4]
      [  109.739335]  [<ffffffff810cb380>] ? prepare_to_wait_event+0xf0/0xf0
      [  109.740799]  [<ffffffffa08a2b20>] ? nfs4_callback_svc+0x50/0x50 [nfsv4]
      [  109.742349]  [<ffffffff810a6998>] kthread+0xd8/0xf0
      [  109.743495]  [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
      [  109.744776]  [<ffffffff816abc4f>] ret_from_fork+0x3f/0x70
      [  109.746037]  [<ffffffff810a68c0>] ? kthread_park+0x60/0x60
      [  109.747324] Code: cc 45 31 f6 48 8b 85 00 ff ff ff 44 89 30 48 8b 85 f8 fe ff ff 44 89 20 48 8b 9d 38 ff ff ff 48 8b bd 30 ff ff ff 48 85 db 74 4c <4c> 8b af c8 02 00 00 4d 8d a5 08 02 00 00 49 81 c5 98 02 00 00
      [  109.754361] RIP  [<ffffffffa08a3e68>] nfs4_callback_compound+0x4f8/0x690 [nfsv4]
      [  109.756123]  RSP <ffff88008448fca0>
      [  109.756951] CR2: 00000000000002c8
      [  109.757738] ---[ end trace 2b8555511ab5dfb4 ]---
      [  109.758819] Kernel panic - not syncing: Fatal exception
      [  109.760126] Kernel Offset: disabled
      [  118.938934] ---[ end Kernel panic - not syncing: Fatal exception
      
      That commit neither unlocks the table nor sets the cps->clp pointer,
      which is later needed by nfs4_cb_free_slot().
      
      Fixes: 80f96427 ("NFSv4.x: Enforce the ca_maxresponsesize_cached ...")
      CC: stable@vger.kernel.org
      Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 09 May, 2016 13 commits
  3. 08 May, 2016 1 commit
  4. 07 May, 2016 7 commits
  5. 06 May, 2016 8 commits
    • Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 07837831
      Linus Torvalds authored
      Pull writeback fix from Jens Axboe:
       "Just a single fix for domain aware writeback, fixing a regression that
        can cause balance_dirty_pages() to keep looping while not getting any
        work done"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        writeback: Fix performance regression in wb_over_bg_thresh()
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3f86ba5d
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "This contains two fixes: a boot fix for older SGI/UV systems, and an
        APIC calibration fix"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO
        x86/platform/UV: Bring back the call to map_low_mmrs in uv_system_init
    • Merge tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 01ec7167
      Linus Torvalds authored
      Pull power management and ACPI fixes from Rafael Wysocki:
       "Fixes for problems introduced or discovered recently (intel_pstate,
        sti-cpufreq, ARM64 cpuidle, Operating Performance Points framework,
        generic device properties framework) and one fix for a hotplug-related
        deadlock in ACPICA that's been there forever, but is nasty enough.
      
        Specifics:
      
         - Fix for a recent regression in the intel_pstate driver causing it
           to fail to restore the HWP (HW-managed P-states) configuration of
           the boot CPU after suspend-to-RAM (Rafael Wysocki).
      
         - Fix for two recent regressions in the intel_pstate driver, one that
           can trigger a divide by zero if the driver is accessed via sysfs
           before it manages to take the first sample and one causing it to
           fail to update a structure field used in a trace point, so the
           information coming from it is less useful (Rafael Wysocki).
      
         - Fix for a problem in the sti-cpufreq driver introduced during the
           4.5 cycle that causes it to break CPU PM in multi-platform kernels
           by registering cpufreq-dt (which subsequently doesn't work)
           unconditionally and preventing the driver that would actually work
           from registering (Sudeep Holla).
      
         - Stable-candidate fix for an ARM64 cpuidle issue causing idle state
           usage counters to be incorrectly updated for idle states that were
           not entered due to errors (James Morse).
      
         - Fix for a recently introduced issue in the OPP (Operating
           Performance Points) framework causing it to print bogus error
           messages for missing optional regulators (Viresh Kumar).
      
         - Fix for a recently introduced issue in the generic device
           properties framework that may cause it to attempt to dereference an
           invalid pointer in some cases (Heikki Krogerus).
      
         - Fix for a deadlock in the ACPICA core that may be triggered by
           device (eg Thunderbolt) hotplug (Prarit Bhargava)"
      
      * tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM / OPP: Remove useless check
        ACPICA: Dispatcher: Update thread ID for recursive method calls
        intel_pstate: Fix intel_pstate_get()
        cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
        cpufreq: st: enable selective initialization based on the platform
        ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
        device property: Avoid potential dereferences of invalid pointers
    • Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 17d25a33
      Linus Torvalds authored
      Pull scheduler fix from Ingo Molnar:
       "This contains a single fix that fixes a nohz tick stopping bug when
        mixed-policy SCHED_FIFO and SCHED_RR tasks are present on a runqueue"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()
    • Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 18fb92c3
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "This tree contains two fixes: new Intel CPU model numbers and an
        AMD/iommu uncore PMU driver fix"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/amd/iommu: Do not register a task ctx for uncore like PMUs
        perf/x86: Add model numbers for Kabylake CPUs
    • Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · cade8184
      Linus Torvalds authored
      Pull EFI fixes from Ingo Molnar:
       "This tree contains three fixes: a console spam fix, a file pattern fix
        and a sysfb_efi fix for a bug that triggered on older ThinkPads"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sysfb_efi: Fix valid BAR address range check
        x86/efi-bgrt: Switch all pr_err() to pr_notice() for invalid BGRT
        MAINTAINERS: Remove asterisk from EFI directory names
    • Merge branch 'parisc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 83a395d3
      Linus Torvalds authored
      Pull parisc fix from Helge Deller:
       "Patch from Dmitry V Levin to fix a kernel crash when a straced process
        calls the (invalid) syscall which is equal to the value of __NR_Linux_syscalls"
      
      * 'parisc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: fix a bug when syscall number of tracee is __NR_Linux_syscalls
    • Merge tag 'arc-4.6-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · dd287690
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
       "Late in the cycle, but this has fixes for a couple of issues: a PAE40
        boot crash and Arnd spotting a lack of barriers in BE io-accessors.

        The 3rd patch for enabling highmem in low physical mem ;-) honestly is
        more than a "fix" but it's been in the works for some time, seems to be
        stable in testing and enables 2 of our customers to go forward with
        the 4.6 kernel.
      
         - Fix for PTE truncation in PAE40 builds
         - Fix for big endian IO accessors lacking IO barrier
         - Allow HIGHMEM to work with low physical addresses"
      
      * tag 'arc-4.6-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: support HIGHMEM even without PAE40
        ARC: Fix PAE40 boot failures due to PTE truncation
        ARC: Add missing io barriers to io{read,write}{16,32}be()