1. 05 Oct, 2017 31 commits
    • Andreas Gruenbacher's avatar
      gfs2: Fix debugfs glocks dump · ddf25aea
      Andreas Gruenbacher authored
      commit 10201655 upstream.
      
      The switch to rhashtables (commit 88ffbf3e) broke the debugfs glock
      dump (/sys/kernel/debug/gfs2/<device>/glocks) for dumps bigger than a
      single buffer: the right function for restarting an rhashtable iteration
      from the beginning of the hash table is rhashtable_walk_enter;
      rhashtable_walk_stop + rhashtable_walk_start will just resume from the
      current position.
      
      The upstream commit doesn't directly apply to 4.4.y because 4.4.y
      doesn't have rhashtable_walk_enter and the following mainline commits:
      
        92ecd73a  
          gfs2: Deduplicate gfs2_{glocks,glstats}_open
        cc37a627  
          gfs2: Replace rhashtable_walk_init with rhashtable_walk_enter
      
      Other than rhashtable_walk_enter, rhashtable_walk_init can fail.  To
      handle the failure case in gfs2_glock_seq_stop, we check if
      rhashtable_walk_init has initialized iter->walker; if it has not, we
      must not call rhashtable_walk_stop or rhashtable_walk_exit.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ddf25aea
    • Eric Biggers's avatar
      x86/fpu: Don't let userspace set bogus xcomp_bv · d25fea06
      Eric Biggers authored
      commit 814fb7bb upstream.
      
      [Please apply to 4.4-stable.  Note: the backport includes the
      fpstate_init() call in xstateregs_set(), since fix is useless without
      it.  It was added by commit 91c3dba7 ("x86/fpu/xstate: Fix PTRACE
      frames for XSAVES"), but it doesn't make sense to backport that whole
      commit.]
      
      On x86, userspace can use the ptrace() or rt_sigreturn() system calls to
      set a task's extended state (xstate) or "FPU" registers.  ptrace() can
      set them for another task using the PTRACE_SETREGSET request with
      NT_X86_XSTATE, while rt_sigreturn() can set them for the current task.
      In either case, registers can be set to any value, but the kernel
      assumes that the XSAVE area itself remains valid in the sense that the
      CPU can restore it.
      
      However, in the case where the kernel is using the uncompacted xstate
      format (which it does whenever the XSAVES instruction is unavailable),
      it was possible for userspace to set the xcomp_bv field in the
      xstate_header to an arbitrary value.  However, all bits in that field
      are reserved in the uncompacted case, so when switching to a task with
      nonzero xcomp_bv, the XRSTOR instruction failed with a #GP fault.  This
      caused the WARN_ON_FPU(err) in copy_kernel_to_xregs() to be hit.  In
      addition, since the error is otherwise ignored, the FPU registers from
      the task previously executing on the CPU were leaked.
      
      Fix the bug by checking that the user-supplied value of xcomp_bv is 0 in
      the uncompacted case, and returning an error otherwise.
      
      The reason for validating xcomp_bv rather than simply overwriting it
      with 0 is that we want userspace to see an error if it (incorrectly)
      provides an XSAVE area in compacted format rather than in uncompacted
      format.
      
      Note that as before, in case of error we clear the task's FPU state.
      This is perhaps non-ideal, especially for PTRACE_SETREGSET; it might be
      better to return an error before changing anything.  But it seems the
      "clear on error" behavior is fine for now, and it's a little tricky to
      do otherwise because it would mean we couldn't simply copy the full
      userspace state into kernel memory in one __copy_from_user().
      
      This bug was found by syzkaller, which hit the above-mentioned
      WARN_ON_FPU():
      
          WARNING: CPU: 1 PID: 0 at ./arch/x86/include/asm/fpu/internal.h:373 __switch_to+0x5b5/0x5d0
          CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.13.0 #453
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          task: ffff9ba2bc8e42c0 task.stack: ffffa78cc036c000
          RIP: 0010:__switch_to+0x5b5/0x5d0
          RSP: 0000:ffffa78cc08bbb88 EFLAGS: 00010082
          RAX: 00000000fffffffe RBX: ffff9ba2b8bf2180 RCX: 00000000c0000100
          RDX: 00000000ffffffff RSI: 000000005cb10700 RDI: ffff9ba2b8bf36c0
          RBP: ffffa78cc08bbbd0 R08: 00000000929fdf46 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ba2bc8e42c0
          R13: 0000000000000000 R14: ffff9ba2b8bf3680 R15: ffff9ba2bf5d7b40
          FS:  00007f7e5cb10700(0000) GS:ffff9ba2bf400000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00000000004005cc CR3: 0000000079fd5000 CR4: 00000000001406e0
          Call Trace:
          Code: 84 00 00 00 00 00 e9 11 fd ff ff 0f ff 66 0f 1f 84 00 00 00 00 00 e9 e7 fa ff ff 0f ff 66 0f 1f 84 00 00 00 00 00 e9 c2 fa ff ff <0f> ff 66 0f 1f 84 00 00 00 00 00 e9 d4 fc ff ff 66 66 2e 0f 1f
      
      Here is a C reproducer.  The expected behavior is that the program spin
      forever with no output.  However, on a buggy kernel running on a
      processor with the "xsave" feature but without the "xsaves" feature
      (e.g. Sandy Bridge through Broadwell for Intel), within a second or two
      the program reports that the xmm registers were corrupted, i.e. were not
      restored correctly.  With CONFIG_X86_DEBUG_FPU=y it also hits the above
      kernel warning.
      
          #define _GNU_SOURCE
          #include <stdbool.h>
          #include <inttypes.h>
          #include <linux/elf.h>
          #include <stdio.h>
          #include <sys/ptrace.h>
          #include <sys/uio.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          int main(void)
          {
              int pid = fork();
              uint64_t xstate[512];
              struct iovec iov = { .iov_base = xstate, .iov_len = sizeof(xstate) };
      
              if (pid == 0) {
                  bool tracee = true;
                  for (int i = 0; i < sysconf(_SC_NPROCESSORS_ONLN) && tracee; i++)
                      tracee = (fork() != 0);
                  uint32_t xmm0[4] = { [0 ... 3] = tracee ? 0x00000000 : 0xDEADBEEF };
                  asm volatile("   movdqu %0, %%xmm0\n"
                               "   mov %0, %%rbx\n"
                               "1: movdqu %%xmm0, %0\n"
                               "   mov %0, %%rax\n"
                               "   cmp %%rax, %%rbx\n"
                               "   je 1b\n"
                               : "+m" (xmm0) : : "rax", "rbx", "xmm0");
                  printf("BUG: xmm registers corrupted!  tracee=%d, xmm0=%08X%08X%08X%08X\n",
                         tracee, xmm0[0], xmm0[1], xmm0[2], xmm0[3]);
              } else {
                  usleep(100000);
                  ptrace(PTRACE_ATTACH, pid, 0, 0);
                  wait(NULL);
                  ptrace(PTRACE_GETREGSET, pid, NT_X86_XSTATE, &iov);
                  xstate[65] = -1;
                  ptrace(PTRACE_SETREGSET, pid, NT_X86_XSTATE, &iov);
                  ptrace(PTRACE_CONT, pid, 0, 0);
                  wait(NULL);
              }
              return 1;
          }
      
      Note: the program only tests for the bug using the ptrace() system call.
      The bug can also be reproduced using the rt_sigreturn() system call, but
      only when called from a 32-bit program, since for 64-bit programs the
      kernel restores the FPU state from the signal frame by doing XRSTOR
      directly from userspace memory (with proper error checking).
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Kevin Hao <haokexin@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Halcrow <mhalcrow@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Cc: kernel-hardening@lists.openwall.com
      Fixes: 0b29643a ("x86/xsaves: Change compacted format xsave area header")
      Link: http://lkml.kernel.org/r/20170922174156.16780-2-ebiggers3@gmail.com
      Link: http://lkml.kernel.org/r/20170923130016.21448-25-mingo@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d25fea06
    • satoru takeuchi's avatar
      btrfs: prevent to set invalid default subvolid · 4c16afac
      satoru takeuchi authored
      commit 6d6d2829 upstream.
      
      `btrfs sub set-default` succeeds to set an ID which isn't corresponding to any
      fs/file tree. If such the bad ID is set to a filesystem, we can't mount this
      filesystem without specifying `subvol` or `subvolid` mount options.
      
      Fixes: 6ef5ed0d ("Btrfs: add ioctl and incompat flag to set the default mount subvol")
      Signed-off-by: default avatarSatoru Takeuchi <satoru.takeuchi@gmail.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo.btrfs@gmx.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4c16afac
    • Naohiro Aota's avatar
      btrfs: propagate error to btrfs_cmp_data_prepare caller · 0efde435
      Naohiro Aota authored
      commit 78ad4ce0 upstream.
      
      btrfs_cmp_data_prepare() (almost) always returns 0 i.e. ignoring errors
      from gather_extent_pages(). While the pages are freed by
      btrfs_cmp_data_free(), cmp->num_pages still has > 0. Then,
      btrfs_extent_same() try to access the already freed pages causing faults
      (or violates PageLocked assertion).
      
      This patch just return the error as is so that the caller stop the process.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Fixes: f4414602 ("btrfs: fix deadlock with extent-same and readpage")
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0efde435
    • Naohiro Aota's avatar
      btrfs: fix NULL pointer dereference from free_reloc_roots() · 9a7d93dd
      Naohiro Aota authored
      commit bb166d72 upstream.
      
      __del_reloc_root should be called before freeing up reloc_root->node.
      If not, calling __del_reloc_root() dereference reloc_root->node, causing
      the system BUG.
      
      Fixes: 6bdf131f ("Btrfs: don't leak reloc root nodes on error")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a7d93dd
    • Nicolai Stange's avatar
      PCI: Fix race condition with driver_override · b08dc7d4
      Nicolai Stange authored
      commit 9561475d upstream.
      
      The driver_override implementation is susceptible to a race condition when
      different threads are reading vs. storing a different driver override.  Add
      locking to avoid the race condition.
      
      This is in close analogy to commit 62655397 ("driver core: platform:
      fix race condition with driver_override") from Adrian Salido.
      
      Fixes: 782a985d ("PCI: Introduce new device binding path using pci_dev.driver_override")
      Signed-off-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b08dc7d4
    • Jim Mattson's avatar
      kvm: nVMX: Don't allow L2 to access the hardware CR8 · 21a638c5
      Jim Mattson authored
      commit 51aa68e7 upstream.
      
      If L1 does not specify the "use TPR shadow" VM-execution control in
      vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store
      exiting" VM-execution controls in vmcs02. Failure to do so will give
      the L2 VM unrestricted read/write access to the hardware CR8.
      
      This fixes CVE-2017-12154.
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21a638c5
    • Jan H. Schönherr's avatar
      KVM: VMX: Do not BUG() on out-of-bounds guest IRQ · 7520be6a
      Jan H. Schönherr authored
      commit 3a8b0677 upstream.
      
      The value of the guest_irq argument to vmx_update_pi_irte() is
      ultimately coming from a KVM_IRQFD API call. Do not BUG() in
      vmx_update_pi_irte() if the value is out-of bounds. (Especially,
      since KVM as a whole seems to hang after that.)
      
      Instead, print a message only once if we find that we don't have a
      route for a certain IRQ (which can be out-of-bounds or within the
      array).
      
      This fixes CVE-2017-1000252.
      
      Fixes: efc64404 ("KVM: x86: Update IRTE for posted-interrupts")
      Signed-off-by: default avatarJan H. Schönherr <jschoenh@amazon.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7520be6a
    • Will Deacon's avatar
      arm64: fault: Route pte translation faults via do_translation_fault · e726c30c
      Will Deacon authored
      commit 760bfb47 upstream.
      
      We currently route pte translation faults via do_page_fault, which elides
      the address check against TASK_SIZE before invoking the mm fault handling
      code. However, this can cause issues with the path walking code in
      conjunction with our word-at-a-time implementation because
      load_unaligned_zeropad can end up faulting in kernel space if it reads
      across a page boundary and runs into a page fault (e.g. by attempting to
      read from a guard region).
      
      In the case of such a fault, load_unaligned_zeropad has registered a
      fixup to shift the valid data and pad with zeroes, however the abort is
      reported as a level 3 translation fault and we dispatch it straight to
      do_page_fault, despite it being a kernel address. This results in calling
      a sleeping function from atomic context:
      
        BUG: sleeping function called from invalid context at arch/arm64/mm/fault.c:313
        in_atomic(): 0, irqs_disabled(): 0, pid: 10290
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        [...]
        [<ffffff8e016cd0cc>] ___might_sleep+0x134/0x144
        [<ffffff8e016cd158>] __might_sleep+0x7c/0x8c
        [<ffffff8e016977f0>] do_page_fault+0x140/0x330
        [<ffffff8e01681328>] do_mem_abort+0x54/0xb0
        Exception stack(0xfffffffb20247a70 to 0xfffffffb20247ba0)
        [...]
        [<ffffff8e016844fc>] el1_da+0x18/0x78
        [<ffffff8e017f399c>] path_parentat+0x44/0x88
        [<ffffff8e017f4c9c>] filename_parentat+0x5c/0xd8
        [<ffffff8e017f5044>] filename_create+0x4c/0x128
        [<ffffff8e017f59e4>] SyS_mkdirat+0x50/0xc8
        [<ffffff8e01684e30>] el0_svc_naked+0x24/0x28
        Code: 36380080 d5384100 f9400800 9402566d (d4210000)
        ---[ end trace 2d01889f2bca9b9f ]---
      
      Fix this by dispatching all translation faults to do_translation_faults,
      which avoids invoking the page fault logic for faults on kernel addresses.
      Reported-by: default avatarAnkit Jain <ankijain@codeaurora.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e726c30c
    • Marc Zyngier's avatar
      arm64: Make sure SPsel is always set · 638e7874
      Marc Zyngier authored
      commit 5371513f upstream.
      
      When the kernel is entered at EL2 on an ARMv8.0 system, we construct
      the EL1 pstate and make sure this uses the the EL1 stack pointer
      (we perform an exception return to EL1h).
      
      But if the kernel is either entered at EL1 or stays at EL2 (because
      we're on a VHE-capable system), we fail to set SPsel, and use whatever
      stack selection the higher exception level has choosen for us.
      
      Let's not take any chance, and make sure that SPsel is set to one
      before we decide the mode we're going to run in.
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      638e7874
    • Oleg Nesterov's avatar
      seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter() · 9237605e
      Oleg Nesterov authored
      commit 66a733ea upstream.
      
      As Chris explains, get_seccomp_filter() and put_seccomp_filter() can end
      up using different filters. Once we drop ->siglock it is possible for
      task->seccomp.filter to have been replaced by SECCOMP_FILTER_FLAG_TSYNC.
      
      Fixes: f8e529ed ("seccomp, ptrace: add support for dumping seccomp filters")
      Reported-by: default avatarChris Salls <chrissalls5@gmail.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      [tycho: add __get_seccomp_filter vs. open coding refcount_inc()]
      Signed-off-by: default avatarTycho Andersen <tycho@docker.com>
      [kees: tweak commit log]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9237605e
    • Christoph Hellwig's avatar
      bsg-lib: don't free job in bsg_prepare_job · 668cee82
      Christoph Hellwig authored
      commit f507b54d upstream.
      
      The job structure is allocated as part of the request, so we should not
      free it in the error path of bsg_prepare_job.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      668cee82
    • Vladis Dronov's avatar
      nl80211: check for the required netlink attributes presence · 9d74367d
      Vladis Dronov authored
      commit e785fa0a upstream.
      
      nl80211_set_rekey_data() does not check if the required attributes
      NL80211_REKEY_DATA_{REPLAY_CTR,KEK,KCK} are present when processing
      NL80211_CMD_SET_REKEY_OFFLOAD request. This request can be issued by
      users with CAP_NET_ADMIN privilege and may result in NULL dereference
      and a system crash. Add a check for the required attributes presence.
      This patch is based on the patch by bo Zhang.
      
      This fixes CVE-2017-12153.
      
      References: https://bugzilla.redhat.com/show_bug.cgi?id=1491046
      Fixes: e5497d76 ("cfg80211/nl80211: support GTK rekey offload")
      Reported-by: default avatarbo Zhang <zhangbo5891001@gmail.com>
      Signed-off-by: default avatarVladis Dronov <vdronov@redhat.com>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d74367d
    • Andreas Gruenbacher's avatar
      vfs: Return -ENXIO for negative SEEK_HOLE / SEEK_DATA offsets · 3393445e
      Andreas Gruenbacher authored
      commit fc46820b upstream.
      
      In generic_file_llseek_size, return -ENXIO for negative offsets as well
      as offsets beyond EOF.  This affects filesystems which don't implement
      SEEK_HOLE / SEEK_DATA internally, possibly because they don't support
      holes.
      
      Fixes xfstest generic/448.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3393445e
    • Steve French's avatar
    • Steve French's avatar
      SMB: Validate negotiate (to protect against downgrade) even if signing off · 02ef29f9
      Steve French authored
      commit 0603c96f upstream.
      
      As long as signing is supported (ie not a guest user connection) and
      connection is SMB3 or SMB3.02, then validate negotiate (protect
      against man in the middle downgrade attacks).  We had been doing this
      only when signing was required, not when signing was just enabled,
      but this more closely matches recommended SMB3 behavior and is
      better security.  Suggested by Metze.
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Reviewed-by: default avatarJeremy Allison <jra@samba.org>
      Acked-by: default avatarStefan Metzmacher <metze@samba.org>
      Reviewed-by: default avatarRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      02ef29f9
    • Steve French's avatar
      Fix SMB3.1.1 guest authentication to Samba · c096b31f
      Steve French authored
      commit 23586b66 upstream.
      
      Samba rejects SMB3.1.1 dialect (vers=3.1.1) negotiate requests from
      the kernel client due to the two byte pad at the end of the negotiate
      contexts.
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Reviewed-by: default avatarRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c096b31f
    • Tyrel Datwyler's avatar
      powerpc/pseries: Fix parent_dn reference leak in add_dt_node() · fe37a445
      Tyrel Datwyler authored
      commit b537ca6f upstream.
      
      A reference to the parent device node is held by add_dt_node() for the
      node to be added. If the call to dlpar_configure_connector() fails
      add_dt_node() returns ENOENT and that reference is not freed.
      
      Add a call to of_node_put(parent_dn) prior to bailing out after a
      failed dlpar_configure_connector() call.
      
      Fixes: 8d5ff320 ("powerpc/pseries: Make dlpar_configure_connector parent node aware")
      Signed-off-by: default avatarTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fe37a445
    • Eric Biggers's avatar
      KEYS: prevent KEYCTL_READ on negative key · 638b3850
      Eric Biggers authored
      commit 37863c43 upstream.
      
      Because keyctl_read_key() looks up the key with no permissions
      requested, it may find a negatively instantiated key.  If the key is
      also possessed, we went ahead and called ->read() on the key.  But the
      key payload will actually contain the ->reject_error rather than the
      normal payload.  Thus, the kernel oopses trying to read the
      user_key_payload from memory address (int)-ENOKEY = 0x00000000ffffff82.
      
      Fortunately the payload data is stored inline, so it shouldn't be
      possible to abuse this as an arbitrary memory read primitive...
      
      Reproducer:
          keyctl new_session
          keyctl request2 user desc '' @s
          keyctl read $(keyctl show | awk '/user: desc/ {print $1}')
      
      It causes a crash like the following:
           BUG: unable to handle kernel paging request at 00000000ffffff92
           IP: user_read+0x33/0xa0
           PGD 36a54067 P4D 36a54067 PUD 0
           Oops: 0000 [#1] SMP
           CPU: 0 PID: 211 Comm: keyctl Not tainted 4.14.0-rc1 #337
           Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
           task: ffff90aa3b74c3c0 task.stack: ffff9878c0478000
           RIP: 0010:user_read+0x33/0xa0
           RSP: 0018:ffff9878c047bee8 EFLAGS: 00010246
           RAX: 0000000000000001 RBX: ffff90aa3d7da340 RCX: 0000000000000017
           RDX: 0000000000000000 RSI: 00000000ffffff82 RDI: ffff90aa3d7da340
           RBP: ffff9878c047bf00 R08: 00000024f95da94f R09: 0000000000000000
           R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
           R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
           FS:  00007f58ece69740(0000) GS:ffff90aa3e200000(0000) knlGS:0000000000000000
           CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
           CR2: 00000000ffffff92 CR3: 0000000036adc001 CR4: 00000000003606f0
           Call Trace:
            keyctl_read_key+0xac/0xe0
            SyS_keyctl+0x99/0x120
            entry_SYSCALL_64_fastpath+0x1f/0xbe
           RIP: 0033:0x7f58ec787bb9
           RSP: 002b:00007ffc8d401678 EFLAGS: 00000206 ORIG_RAX: 00000000000000fa
           RAX: ffffffffffffffda RBX: 00007ffc8d402800 RCX: 00007f58ec787bb9
           RDX: 0000000000000000 RSI: 00000000174a63ac RDI: 000000000000000b
           RBP: 0000000000000004 R08: 00007ffc8d402809 R09: 0000000000000020
           R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc8d402800
           R13: 00007ffc8d4016e0 R14: 0000000000000000 R15: 0000000000000000
           Code: e5 41 55 49 89 f5 41 54 49 89 d4 53 48 89 fb e8 a4 b4 ad ff 85 c0 74 09 80 3d b9 4c 96 00 00 74 43 48 8b b3 20 01 00 00 4d 85 ed <0f> b7 5e 10 74 29 4d 85 e4 74 24 4c 39 e3 4c 89 e2 4c 89 ef 48
           RIP: user_read+0x33/0xa0 RSP: ffff9878c047bee8
           CR2: 00000000ffffff92
      
      Fixes: 61ea0c0b ("KEYS: Skip key state checks when checking for possession")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      638b3850
    • Eric Biggers's avatar
      KEYS: prevent creating a different user's keyrings · 539255ae
      Eric Biggers authored
      commit 237bbd29 upstream.
      
      It was possible for an unprivileged user to create the user and user
      session keyrings for another user.  For example:
      
          sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u
                                 keyctl add keyring _uid_ses.4000 "" @u
                                 sleep 15' &
          sleep 1
          sudo -u '#4000' keyctl describe @u
          sudo -u '#4000' keyctl describe @us
      
      This is problematic because these "fake" keyrings won't have the right
      permissions.  In particular, the user who created them first will own
      them and will have full access to them via the possessor permissions,
      which can be used to compromise the security of a user's keys:
      
          -4: alswrv-----v------------  3000     0 keyring: _uid.4000
          -5: alswrv-----v------------  3000     0 keyring: _uid_ses.4000
      
      Fix it by marking user and user session keyrings with a flag
      KEY_FLAG_UID_KEYRING.  Then, when searching for a user or user session
      keyring by name, skip all keyrings that don't have the flag set.
      
      Fixes: 69664cf1 ("keys: don't generate user and user session keyrings unless they're accessed")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      539255ae
    • Eric Biggers's avatar
      KEYS: fix writing past end of user-supplied buffer in keyring_read() · af24e9d8
      Eric Biggers authored
      commit e645016a upstream.
      
      Userspace can call keyctl_read() on a keyring to get the list of IDs of
      keys in the keyring.  But if the user-supplied buffer is too small, the
      kernel would write the full list anyway --- which will corrupt whatever
      userspace memory happened to be past the end of the buffer.  Fix it by
      only filling the space that is available.
      
      Fixes: b2a4df20 ("KEYS: Expand the capacity of a keyring")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      af24e9d8
    • LEROY Christophe's avatar
      crypto: talitos - fix sha224 · 362711d5
      LEROY Christophe authored
      commit afd62fa2 upstream.
      
      Kernel crypto tests report the following error at startup
      
      [    2.752626] alg: hash: Test 4 failed for sha224-talitos
      [    2.757907] 00000000: 30 e2 86 e2 e7 8a dd 0d d7 eb 9f d5 83 fe f1 b0
      00000010: 2d 5a 6c a5 f9 55 ea fd 0e 72 05 22
      
      This patch fixes it
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      362711d5
    • LEROY Christophe's avatar
      crypto: talitos - Don't provide setkey for non hmac hashing algs. · 231c4f64
      LEROY Christophe authored
      commit 56136631 upstream.
      
      Today, md5sum fails with error -ENOKEY because a setkey
      function is set for non hmac hashing algs, see strace output below:
      
      mmap(NULL, 378880, PROT_READ, MAP_SHARED, 6, 0) = 0x77f50000
      accept(3, 0, NULL)                      = 7
      vmsplice(5, [{"bin/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 378880}], 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 262144
      splice(4, NULL, 7, NULL, 262144, SPLICE_F_MORE) = -1 ENOKEY (Required key not available)
      write(2, "Generation of hash for file kcap"..., 50) = 50
      munmap(0x77f50000, 378880)              = 0
      
      This patch ensures that setkey() function is set only
      for hmac hashing.
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      231c4f64
    • Xin Long's avatar
      scsi: scsi_transport_iscsi: fix the issue that iscsi_if_rx doesn't parse nlmsg properly · 9d253491
      Xin Long authored
      commit c88f0e6b upstream.
      
      ChunYu found a kernel crash by syzkaller:
      
      [  651.617875] kasan: CONFIG_KASAN_INLINE enabled
      [  651.618217] kasan: GPF could be caused by NULL-ptr deref or user memory access
      [  651.618731] general protection fault: 0000 [#1] SMP KASAN
      [  651.621543] CPU: 1 PID: 9539 Comm: scsi Not tainted 4.11.0.cov #32
      [  651.621938] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  651.622309] task: ffff880117780000 task.stack: ffff8800a3188000
      [  651.622762] RIP: 0010:skb_release_data+0x26c/0x590
      [...]
      [  651.627260] Call Trace:
      [  651.629156]  skb_release_all+0x4f/0x60
      [  651.629450]  consume_skb+0x1a5/0x600
      [  651.630705]  netlink_unicast+0x505/0x720
      [  651.632345]  netlink_sendmsg+0xab2/0xe70
      [  651.633704]  sock_sendmsg+0xcf/0x110
      [  651.633942]  ___sys_sendmsg+0x833/0x980
      [  651.637117]  __sys_sendmsg+0xf3/0x240
      [  651.638820]  SyS_sendmsg+0x32/0x50
      [  651.639048]  entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      It's caused by skb_shared_info at the end of sk_buff was overwritten by
      ISCSI_KEVENT_IF_ERROR when parsing nlmsg info from skb in iscsi_if_rx.
      
      During the loop if skb->len == nlh->nlmsg_len and both are sizeof(*nlh),
      ev = nlmsg_data(nlh) will acutally get skb_shinfo(SKB) instead and set a
      new value to skb_shinfo(SKB)->nr_frags by ev->type.
      
      This patch is to fix it by checking nlh->nlmsg_len properly there to
      avoid over accessing sk_buff.
      Reported-by: default avatarChunYu Wang <chunwang@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarChris Leech <cleech@redhat.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d253491
    • Dennis Yang's avatar
      md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list · 29854a77
      Dennis Yang authored
      commit 184a09eb upstream.
      
      In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
      set, it indicates that this stripe_head is already in the raid5_plug_cb
      list and release_stripe() would be called instead to drop a reference
      count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
      stripe_head and it will get queued into the raid5_plug_cb list.
      
      Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
      A stripe could be re-added to plug list while it is still on that list
      in the following situation. If stripe_head A is added to another
      stripe_head B's batch list, in this case A will have its
      batch_head != NULL and be added into the plug list. After that,
      stripe_head B gets handled and called break_stripe_batch_list() to
      reset all the batched stripe_head(including A which is still on
      the plug list)'s state and reset their batch_head to NULL.
      Before the plug list gets processed, if there is another write request
      comes in and get stripe_head A, A will have its batch_head == NULL
      (cleared by calling break_stripe_batch_list() on B) and be added to
      plug list once again.
      Signed-off-by: default avatarDennis Yang <dennisyang@qnap.com>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      29854a77
    • Shaohua Li's avatar
      md/raid5: fix a race condition in stripe batch · d03d1567
      Shaohua Li authored
      commit 3664847d upstream.
      
      We have a race condition in below scenario, say have 3 continuous stripes, sh1,
      sh2 and sh3, sh1 is the stripe_head of sh2 and sh3:
      
      CPU1				CPU2				CPU3
      handle_stripe(sh3)
      				stripe_add_to_batch_list(sh3)
      				-> lock(sh2, sh3)
      				-> lock batch_lock(sh1)
      				-> add sh3 to batch_list of sh1
      				-> unlock batch_lock(sh1)
      								clear_batch_ready(sh1)
      								-> lock(sh1) and batch_lock(sh1)
      								-> clear STRIPE_BATCH_READY for all stripes in batch_list
      								-> unlock(sh1) and batch_lock(sh1)
      ->clear_batch_ready(sh3)
      -->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
      --->return 0 as sh->batch == NULL
      				-> sh3->batch_head = sh1
      				-> unlock (sh2, sh3)
      
      In CPU1, handle_stripe will continue handle sh3 even it's in batch stripe list
      of sh1. By moving sh3->batch_head assignment in to batch_lock, we make it
      impossible to clear STRIPE_BATCH_READY before batch_head is set.
      
      Thanks Stephane for helping debug this tricky issue.
      Reported-and-tested-by: default avatarStephane Thiell <sthiell@stanford.edu>
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d03d1567
    • Bo Yan's avatar
      tracing: Erase irqsoff trace with empty write · 68a4a528
      Bo Yan authored
      commit 8dd33bcb upstream.
      
      One convenient way to erase trace is "echo > trace". However, this
      is currently broken if the current tracer is irqsoff tracer. This
      is because irqsoff tracer use max_buffer as the default trace
      buffer.
      
      Set the max_buffer as the one to be cleared when it's the trace
      buffer currently in use.
      
      Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com
      
      Cc: <mingo@redhat.com>
      Fixes: 4acd4d00 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: default avatarBo Yan <byan@nvidia.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68a4a528
    • Tahsin Erdogan's avatar
      tracing: Fix trace_pipe behavior for instance traces · 9c5afa72
      Tahsin Erdogan authored
      commit 75df6e68 upstream.
      
      When reading data from trace_pipe, tracing_wait_pipe() performs a
      check to see if tracing has been turned off after some data was read.
      Currently, this check always looks at global trace state, but it
      should be checking the trace instance where trace_pipe is located at.
      
      Because of this bug, cat instances/i1/trace_pipe in the following
      script will immediately exit instead of waiting for data:
      
      cd /sys/kernel/debug/tracing
      echo 0 > tracing_on
      mkdir -p instances/i1
      echo 1 > instances/i1/tracing_on
      echo 1 > instances/i1/events/sched/sched_process_exec/enable
      cat instances/i1/trace_pipe
      
      Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com
      
      Fixes: 10246fa3 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: default avatarTahsin Erdogan <tahsin@google.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c5afa72
    • Paul Mackerras's avatar
      KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce() · f75c0042
      Paul Mackerras authored
      commit 47c5310a upstream, with part
      of commit edd03602 folded in.
      
      Nixiaoming pointed out that there is a memory leak in
      kvm_vm_ioctl_create_spapr_tce() if the call to anon_inode_getfd()
      fails; the memory allocated for the kvmppc_spapr_tce_table struct
      is not freed, and nor are the pages allocated for the iommu
      tables.
      
      David Hildenbrand pointed out that there is a race in that the
      function checks early on that there is not already an entry in the
      stt->iommu_tables list with the same LIOBN, but an entry with the
      same LIOBN could get added between then and when the new entry is
      added to the list.
      
      This fixes both problems.  To simplify things, we now call
      anon_inode_getfd() before placing the new entry in the list.  The
      check for an existing entry is done while holding the kvm->lock
      mutex, immediately before adding the new entry to the list.
      
      [paulus@ozlabs.org - folded in that part of edd03602 ("KVM:
       PPC: Book3S HV: Protect updates to spapr_tce_tables list", 2017-08-28)
       which restructured the code that 47c5310a modified, to avoid
       a build failure caused by the absence of put_unused_fd().
       Also removed the locked memory accounting, since it doesn't exist
       in this version, and adjusted the commit message.]
      
      Fixes: 54738c09 ("KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode")
      Reported-by: default avatarNixiaoming <nixiaoming@huawei.com>
      Reported-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f75c0042
    • Avraham Stern's avatar
      mac80211: flush hw_roc_start work before cancelling the ROC · 7d8fbf3d
      Avraham Stern authored
      commit 6e46d8ce upstream.
      
      When HW ROC is supported it is possible that after the HW notified
      that the ROC has started, the ROC was cancelled and another ROC was
      added while the hw_roc_start worker is waiting on the mutex (since
      cancelling the ROC and adding another one also holds the same mutex).
      As a result, the hw_roc_start worker will continue to run after the
      new ROC is added but before it is actually started by the HW.
      This may result in notifying userspace that the ROC has started before
      it actually does, or in case of management tx ROC, in an attempt to
      tx while not on the right channel.
      
      In addition, when the driver will notify mac80211 that the second ROC
      has started, mac80211 will warn that this ROC has already been
      notified.
      
      Fix this by flushing the hw_roc_start work before cancelling an ROC.
      Signed-off-by: default avatarAvraham Stern <avraham.stern@intel.com>
      Signed-off-by: default avatarLuca Coelho <luciano.coelho@intel.com>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7d8fbf3d
    • Shu Wang's avatar
      cifs: release auth_key.response for reconnect. · fcc949a4
      Shu Wang authored
      commit f5c4ba81 upstream.
      
      There is a race that cause cifs reconnect in cifs_mount,
      - cifs_mount
        - cifs_get_tcp_session
          - [ start thread cifs_demultiplex_thread
            - cifs_read_from_socket: -ECONNABORTED
              - DELAY_WORK smb2_reconnect_server ]
        - cifs_setup_session
        - [ smb2_reconnect_server ]
      
      auth_key.response was allocated in cifs_setup_session, and
      will release when the session destoried. So when session re-
      connect, auth_key.response should be check and released.
      
      Tested with my system:
      CIFS VFS: Free previous auth_key.response = ffff8800320bbf80
      
      A simple auth_key.response allocation call trace:
      - cifs_setup_session
      - SMB2_sess_setup
      - SMB2_sess_auth_rawntlmssp_authenticate
      - build_ntlmssp_auth_blob
      - setup_ntlmv2_rsp
      Signed-off-by: default avatarShu Wang <shuwang@redhat.com>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Reviewed-by: default avatarRonnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcc949a4
  2. 27 Sep, 2017 9 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.4.89 · 10def3a6
      Greg Kroah-Hartman authored
      10def3a6
    • Steven Rostedt (VMware)'s avatar
      ftrace: Fix memleak when unregistering dynamic ops when tracing disabled · ed1bf439
      Steven Rostedt (VMware) authored
      commit edb096e0 upstream.
      
      If function tracing is disabled by the user via the function-trace option or
      the proc sysctl file, and a ftrace_ops that was allocated on the heap is
      unregistered, then the shutdown code exits out without doing the proper
      clean up. This was found via kmemleak and running the ftrace selftests, as
      one of the tests unregisters with function tracing disabled.
      
       # cat kmemleak
      unreferenced object 0xffffffffa0020000 (size 4096):
        comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s)
        hex dump (first 32 bytes):
          55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89  U.t$.UH...t$.UH.
          e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c  .H......H.D$PH.L
        backtrace:
          [<ffffffff81d64665>] kmemleak_vmalloc+0x85/0xf0
          [<ffffffff81355631>] __vmalloc_node_range+0x281/0x3e0
          [<ffffffff8109697f>] module_alloc+0x4f/0x90
          [<ffffffff81091170>] arch_ftrace_update_trampoline+0x160/0x420
          [<ffffffff81249947>] ftrace_startup+0xe7/0x300
          [<ffffffff81249bd2>] register_ftrace_function+0x72/0x90
          [<ffffffff81263786>] trace_selftest_ops+0x204/0x397
          [<ffffffff82bb8971>] trace_selftest_startup_function+0x394/0x624
          [<ffffffff81263a75>] run_tracer_selftest+0x15c/0x1d7
          [<ffffffff82bb83f1>] init_trace_selftests+0x75/0x192
          [<ffffffff81002230>] do_one_initcall+0x90/0x1e2
          [<ffffffff82b7d620>] kernel_init_freeable+0x350/0x3fe
          [<ffffffff81d61ec3>] kernel_init+0x13/0x122
          [<ffffffff81d72c6a>] ret_from_fork+0x2a/0x40
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Fixes: 12cce594 ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed1bf439
    • Michael Lyle's avatar
      bcache: fix bch_hprint crash and improve output · a069d0a4
      Michael Lyle authored
      commit 9276717b upstream.
      
      Most importantly, solve a crash where %llu was used to format signed
      numbers.  This would cause a buffer overflow when reading sysfs
      writeback_rate_debug, as only 20 bytes were allocated for this and
      %llu writes 20 characters plus a null.
      
      Always use the units mechanism rather than having different output
      paths for simplicity.
      
      Also, correct problems with display output where 1.10 was a larger
      number than 1.09, by multiplying by 10 and then dividing by 1024 instead
      of dividing by 100.  (Remainders of >= 1000 would print as .10).
      
      Minor changes: Always display the decimal point instead of trying to
      omit it based on number of digits shown.  Decide what units to use
      based on 1000 as a threshold, not 1024 (in other words, always print
      at most 3 digits before the decimal point).
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reported-by: default avatarDmitry Yu Okunev <dyokunev@ut.mephi.ru>
      Acked-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a069d0a4
    • Tang Junhui's avatar
      bcache: fix for gc and write-back race · f522051a
      Tang Junhui authored
      commit 9baf3097 upstream.
      
      gc and write-back get raced (see the email "bcache get stucked" I sended
      before):
      gc thread                               write-back thread
      |                                       |bch_writeback_thread()
      |bch_gc_thread()                        |
      |                                       |==>read_dirty()
      |==>bch_btree_gc()                      |
      |==>btree_root() //get btree root       |
      |                //node write locker    |
      |==>bch_btree_gc_root()                 |
      |                                       |==>read_dirty_submit()
      |                                       |==>write_dirty()
      |                                       |==>continue_at(cl,
      |                                       |               write_dirty_finish,
      |                                       |               system_wq);
      |                                       |==>write_dirty_finish()//excute
      |                                       |               //in system_wq
      |                                       |==>bch_btree_insert()
      |                                       |==>bch_btree_map_leaf_nodes()
      |                                       |==>__bch_btree_map_nodes()
      |                                       |==>btree_root //try to get btree
      |                                       |              //root node read
      |                                       |              //lock
      |                                       |-----stuck here
      |==>bch_btree_set_root()
      |==>bch_journal_meta()
      |==>bch_journal()
      |==>journal_try_write()
      |==>journal_write_unlocked() //journal_full(&c->journal)
      |                            //condition satisfied
      |==>continue_at(cl, journal_write, system_wq); //try to excute
      |                               //journal_write in system_wq
      |                               //but work queue is excuting
      |                               //write_dirty_finish()
      |==>closure_sync(); //wait journal_write execute
      |                   //over and wake up gc,
      |-------------stuck here
      |==>release root node write locker
      
      This patch alloc a separate work-queue for write-back thread to avoid such
      race.
      
      (Commit log re-organized by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f522051a
    • Tony Asleson's avatar
      bcache: Correct return value for sysfs attach errors · a6c5e7a0
      Tony Asleson authored
      commit 77fa100f upstream.
      
      If you encounter any errors in bch_cached_dev_attach it will return
      a negative error code.  The variable 'v' which stores the result is
      unsigned, thus user space sees a very large value returned for bytes
      written which can cause incorrect user space behavior.  Utilize 1
      signed variable to use throughout the function to preserve error return
      capability.
      Signed-off-by: default avatarTony Asleson <tasleson@redhat.com>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6c5e7a0
    • Tang Junhui's avatar
      bcache: correct cache_dirty_target in __update_writeback_rate() · d9c6a28a
      Tang Junhui authored
      commit a8394090 upstream.
      
      __update_write_rate() uses a Proportion-Differentiation Controller
      algorithm to control writeback rate. A dirty target number is used in
      this PD controller to control writeback rate. A larger target number
      will make the writeback rate smaller, on the versus, a smaller target
      number will make the writeback rate larger.
      
      bcache uses the following steps to calculate the target number,
      1) cache_sectors = all-buckets-of-cache-set * buckets-size
      2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
      3) target = cache_dirty_target *
      (sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)
      
      The calculation at step 1) for cache_sectors is incorrect, which does
      not consider dirty blocks occupied by flash only volume.
      
      A flash only volume can be took as a bcache device without cached
      device. All data sectors allocated for it are persistent on cache device
      and marked dirty, they are not touched by bcache writeback and garbage
      collection code. So data blocks of flash only volume should be ignore
      when calculating cache_sectors of cache set.
      
      Current code does not subtract dirty sectors of flash only volume, which
      results a larger target number from the above 3 steps. And in sequence
      the cache device's writeback rate is smaller then a correct value,
      writeback speed is slower on all cached devices.
      
      This patch fixes the incorrect slower writeback rate by subtracting
      dirty sectors of flash only volumes in __update_writeback_rate().
      
      (Commit log composed by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9c6a28a
    • Tang Junhui's avatar
      bcache: do not subtract sectors_to_gc for bypassed IO · 0471f58e
      Tang Junhui authored
      commit 69daf03a upstream.
      
      Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to
      trigger gc thread.
      Signed-off-by: default avatartang.junhui <tang.junhui@zte.com.cn>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarEric Wheeler <bcache@linux.ewheeler.net>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0471f58e
    • Jan Kara's avatar
      bcache: Fix leak of bdev reference · 093457f2
      Jan Kara authored
      commit 4b758df2 upstream.
      
      If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
      block device using lookup_bdev() to detect which situation we are in to
      properly report error. However we never drop the reference returned to
      us from lookup_bdev(). Fix that.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      093457f2
    • Tang Junhui's avatar
      bcache: initialize dirty stripes in flash_dev_run() · 5025da3b
      Tang Junhui authored
      commit 175206cf upstream.
      
      bcache uses a Proportion-Differentiation Controller algorithm to control
      writeback rate to cached devices. In the PD controller algorithm, dirty
      stripes of thin flash device should not be counted in, because flash only
      volumes never write back dirty data.
      
      Currently dirty stripe counter for thin flash device is not initialized
      when the thin flash device starts. Which means the following calculation
      in PD controller will reference an undefined dirty stripes number, and
      all cached devices attached to the same cache set where the thin flash
      device lies on may have an inaccurate writeback rate.
      
      This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
      correctly initialize dirty stripe counter when the thin flash device
      starts to run. This patch also does following parameter data type change,
       -void bch_sectors_dirty_init(struct cached_dev *dc);
       +void bch_sectors_dirty_init(struct bcache_device *);
      to call this function conveniently in flash_dev_run().
      
      (Commit log is composed by Coly Li)
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5025da3b