1. 27 Jul, 2016 40 commits
    • Jeff Mahoney's avatar
      btrfs: account for non-CoW'd blocks in btrfs_abort_transaction · 13226e1e
      Jeff Mahoney authored
      commit 64c12921 upstream.
      
      The test for !trans->blocks_used in btrfs_abort_transaction is
      insufficient to determine whether it's safe to drop the transaction
      handle on the floor.  btrfs_cow_block, informed by should_cow_block,
      can return blocks that have already been CoW'd in the current
      transaction.  trans->blocks_used is only incremented for new block
      allocations. If an operation overlaps the blocks in the current
      transaction entirely and must abort the transaction, we'll happily
      let it clean up the trans handle even though it may have modified
      the blocks and will commit an incomplete operation.
      
      In the long-term, I'd like to do closer tracking of when the fs
      is actually modified so we can still recover as gracefully as possible,
      but that approach will need some discussion.  In the short term,
      since this is the only code using trans->blocks_used, let's just
      switch it to a bool indicating whether any blocks were used and set
      it when should_cow_block returns false.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      13226e1e
    • Tejun Heo's avatar
      percpu: fix synchronization between synchronous map extension and chunk destruction · 3bb1138e
      Tejun Heo authored
      commit 6710e594 upstream.
      
      For non-atomic allocations, pcpu_alloc() can try to extend the area
      map synchronously after dropping pcpu_lock; however, the extension
      wasn't synchronized against chunk destruction and the chunk might get
      freed while extension is in progress.
      
      This patch fixes the bug by putting most of non-atomic allocations
      under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
      is responsible for async chunk management including destruction.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Fixes: 1a4d7607 ("percpu: implement asynchronous chunk population")
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3bb1138e
    • Tejun Heo's avatar
      percpu: fix synchronization between chunk->map_extend_work and chunk destruction · c26ae537
      Tejun Heo authored
      commit 4f996e23 upstream.
      
      Atomic allocations can trigger async map extensions which is serviced
      by chunk->map_extend_work.  pcpu_balance_work which is responsible for
      destroying idle chunks wasn't synchronizing properly against
      chunk->map_extend_work and may end up freeing the chunk while the work
      item is still in flight.
      
      This patch fixes the bug by rolling async map extension operations
      into pcpu_balance_work.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Reported-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Fixes: 9c824b6a ("percpu: make sure chunk->map array has available space")
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c26ae537
    • Miklos Szeredi's avatar
      af_unix: fix hard linked sockets on overlay · 0da3127a
      Miklos Szeredi authored
      commit eb0a4a47 upstream.
      
      Overlayfs uses separate inodes even in the case of hard links on the
      underlying filesystems.  This is a problem for AF_UNIX socket
      implementation which indexes sockets based on the inode.  This resulted in
      hard linked sockets not working.
      
      The fix is to use the real, underlying inode.
      
      Test case follows:
      
      -- ovl-sock-test.c --
      #include <unistd.h>
      #include <err.h>
      #include <sys/socket.h>
      #include <sys/un.h>
      
      #define SOCK "test-sock"
      #define SOCK2 "test-sock2"
      
      int main(void)
      {
      	int fd, fd2;
      	struct sockaddr_un addr = {
      		.sun_family = AF_UNIX,
      		.sun_path = SOCK,
      	};
      	struct sockaddr_un addr2 = {
      		.sun_family = AF_UNIX,
      		.sun_path = SOCK2,
      	};
      
      	unlink(SOCK);
      	unlink(SOCK2);
      	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
      		err(1, "socket");
      	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
      		err(1, "bind");
      	if (listen(fd, 0) == -1)
      		err(1, "listen");
      	if (link(SOCK, SOCK2) == -1)
      		err(1, "link");
      	if ((fd2 = socket(AF_UNIX, SOCK_STREAM, 0)) == -1)
      		err(1, "socket");
      	if (connect(fd2, (struct sockaddr *) &addr2, sizeof(addr2)) == -1)
      		err (1, "connect");
      	return 0;
      }
      ----
      Reported-by: default avatarAlexander Morozov <alexandr.morozov@docker.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0da3127a
    • Miklos Szeredi's avatar
      vfs: add d_real_inode() helper · c6517079
      Miklos Szeredi authored
      commit a1180844 upstream.
      
      Needed by the following fix.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6517079
    • Mark Rutland's avatar
      arm64: Rework valid_user_regs · cff5b236
      Mark Rutland authored
      commit dbd4d7ca upstream.
      
      We validate pstate using PSR_MODE32_BIT, which is part of the
      user-provided pstate (and cannot be trusted). Also, we conflate
      validation of AArch32 and AArch64 pstate values, making the code
      difficult to reason about.
      
      Instead, validate the pstate value based on the associated task. The
      task may or may not be current (e.g. when using ptrace), so this must be
      passed explicitly by callers. To avoid circular header dependencies via
      sched.h, is_compat_task is pulled out of asm/ptrace.h.
      
      To make the code possible to reason about, the AArch64 and AArch32
      validation is split into separate functions. Software must respect the
      RES0 policy for SPSR bits, and thus the kernel mirrors the hardware
      policy (RAZ/WI) for bits as-yet unallocated. When these acquire an
      architected meaning writes may be permitted (potentially with additional
      validation).
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Peter Maydell <peter.maydell@linaro.org>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      [ rebased for v4.1+
        This avoids a user-triggerable Oops() if a task is switched to a mode
        not supported by the kernel (e.g. switching a 64-bit task to AArch32).
      ]
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com> [backport]
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      cff5b236
    • Junichi Nomura's avatar
      ipmi: Remove smi_msg from waiting_rcv_msgs list before handle_one_recv_msg() · de0f9fa7
      Junichi Nomura authored
      commit ae4ea9a2 upstream.
      
      Commit 7ea0ed2b ("ipmi: Make the message handler easier to use for
      SMI interfaces") changed handle_new_recv_msgs() to call handle_one_recv_msg()
      for a smi_msg while the smi_msg is still connected to waiting_rcv_msgs list.
      That could lead to following list corruption problems:
      
      1) low-level function treats smi_msg as not connected to list
      
        handle_one_recv_msg() could end up calling smi_send(), which
        assumes the msg is not connected to list.
      
        For example, the following sequence could corrupt list by
        doing list_add_tail() for the entry still connected to other list.
      
          handle_new_recv_msgs()
            msg = list_entry(waiting_rcv_msgs)
            handle_one_recv_msg(msg)
              handle_ipmb_get_msg_cmd(msg)
                smi_send(msg)
                  spin_lock(xmit_msgs_lock)
                  list_add_tail(msg)
                  spin_unlock(xmit_msgs_lock)
      
      2) race between multiple handle_new_recv_msgs() instances
      
        handle_new_recv_msgs() once releases waiting_rcv_msgs_lock before calling
        handle_one_recv_msg() then retakes the lock and list_del() it.
      
        If others call handle_new_recv_msgs() during the window shown below
        list_del() will be done twice for the same smi_msg.
      
        handle_new_recv_msgs()
          spin_lock(waiting_rcv_msgs_lock)
          msg = list_entry(waiting_rcv_msgs)
          spin_unlock(waiting_rcv_msgs_lock)
        |
        | handle_one_recv_msg(msg)
        |
          spin_lock(waiting_rcv_msgs_lock)
          list_del(msg)
          spin_unlock(waiting_rcv_msgs_lock)
      
      Fixes: 7ea0ed2b ("ipmi: Make the message handler easier to use for SMI interfaces")
      Signed-off-by: default avatarJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      [Added a comment to describe why this works.]
      Signed-off-by: default avatarCorey Minyard <cminyard@mvista.com>
      Tested-by: default avatarYe Feng <yefeng.yl@alibaba-inc.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de0f9fa7
    • Mathieu Larouche's avatar
      drm/mgag200: Black screen fix for G200e rev 4 · 084ad7f7
      Mathieu Larouche authored
      commit d3922b69 upstream.
      
      - Fixed black screen for some resolutions of G200e rev4
      - Fixed testm & testn which had predetermined value.
      Reported-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarMathieu Larouche <mathieu.larouche@matrox.com>
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      084ad7f7
    • Joerg Roedel's avatar
      iommu/amd: Fix unity mapping initialization race · e205592b
      Joerg Roedel authored
      commit 522e5cb7 upstream.
      
      There is a race condition in the AMD IOMMU init code that
      causes requested unity mappings to be blocked by the IOMMU
      for a short period of time. This results on boot failures
      and IO_PAGE_FAULTs on some machines.
      
      Fix this by making sure the unity mappings are installed
      before all other DMA is blocked.
      
      Fixes: aafd8ba0 ('iommu/amd: Implement add_device and remove_device')
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e205592b
    • Joerg Roedel's avatar
      iommu/vt-d: Enable QI on all IOMMUs before setting root entry · 72803a72
      Joerg Roedel authored
      commit a4c34ff1 upstream.
      
      This seems to be required on some X58 chipsets on systems
      with more than one IOMMU. QI does not work until it is
      enabled on all IOMMUs in the system.
      Reported-by: default avatarDheeraj CVR <cvr.dheeraj@gmail.com>
      Tested-by: default avatarDheeraj CVR <cvr.dheeraj@gmail.com>
      Fixes: 5f0a7f76 ('iommu/vt-d: Make root entry visible for hardware right after allocation')
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      72803a72
    • Jean-Philippe Brucker's avatar
      iommu/arm-smmu: Wire up map_sg for arm-smmu-v3 · c9566f67
      Jean-Philippe Brucker authored
      commit 9aeb26cf upstream.
      
      The map_sg callback is missing from arm_smmu_ops, but is required by
      iommu.h. Similarly to most other IOMMU drivers, connect it to
      default_iommu_map_sg.
      Signed-off-by: default avatarJean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarJoerg Roedel <jroedel@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9566f67
    • Jiri Slaby's avatar
      base: make module_create_drivers_dir race-free · c705db27
      Jiri Slaby authored
      commit 7e1b1fc4 upstream.
      
      Modules which register drivers via standard path (driver_register) in
      parallel can cause a warning:
      WARNING: CPU: 2 PID: 3492 at ../fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
      sysfs: cannot create duplicate filename '/module/saa7146/drivers'
      Modules linked in: hexium_gemini(+) mxb(+) ...
      ...
      Call Trace:
      ...
       [<ffffffff812e63a2>] sysfs_warn_dup+0x62/0x80
       [<ffffffff812e6487>] sysfs_create_dir_ns+0x77/0x90
       [<ffffffff8140f2c4>] kobject_add_internal+0xb4/0x340
       [<ffffffff8140f5b8>] kobject_add+0x68/0xb0
       [<ffffffff8140f631>] kobject_create_and_add+0x31/0x70
       [<ffffffff8157a703>] module_add_driver+0xc3/0xd0
       [<ffffffff8155e5d4>] bus_add_driver+0x154/0x280
       [<ffffffff815604c0>] driver_register+0x60/0xe0
       [<ffffffff8145bed0>] __pci_register_driver+0x60/0x70
       [<ffffffffa0273e14>] saa7146_register_extension+0x64/0x90 [saa7146]
       [<ffffffffa0033011>] hexium_init_module+0x11/0x1000 [hexium_gemini]
      ...
      
      As can be (mostly) seen, driver_register causes this call sequence:
        -> bus_add_driver
          -> module_add_driver
            -> module_create_drivers_dir
      The last one creates "drivers" directory in /sys/module/<...>. When
      this is done in parallel, the directory is attempted to be created
      twice at the same time.
      
      This can be easily reproduced by loading mxb and hexium_gemini in
      parallel:
      while :; do
        modprobe mxb &
        modprobe hexium_gemini
        wait
        rmmod mxb hexium_gemini saa7146_vv saa7146
      done
      
      saa7146 calls pci_register_driver for both mxb and hexium_gemini,
      which means /sys/module/saa7146/drivers is to be created for both of
      them.
      
      Fix this by a new mutex in module_create_drivers_dir which makes the
      test-and-create "drivers" dir atomic.
      
      I inverted the condition and removed 'return' to avoid multiple
      unlocks or a goto.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Fixes: fe480a26 (Modules: only add drivers/ direcory if needed)
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c705db27
    • Steven Rostedt (Red Hat)'s avatar
      tracing: Handle NULL formats in hold_module_trace_bprintk_format() · bc64a839
      Steven Rostedt (Red Hat) authored
      commit 70c8217a upstream.
      
      If a task uses a non constant string for the format parameter in
      trace_printk(), then the trace_printk_fmt variable is set to NULL. This
      variable is then saved in the __trace_printk_fmt section.
      
      The function hold_module_trace_bprintk_format() checks to see if duplicate
      formats are used by modules, and reuses them if so (saves them to the list
      if it is new). But this function calls lookup_format() that does a strcmp()
      to the value (which is now NULL) and can cause a kernel oops.
      
      This wasn't an issue till 3debb0a9 ("tracing: Fix trace_printk() to print
      when not using bprintk()") which added "__used" to the trace_printk_fmt
      variable, and before that, the kernel simply optimized it out (no NULL value
      was saved).
      
      The fix is simply to handle the NULL pointer in lookup_format() and have the
      caller ignore the value if it was NULL.
      
      Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.comReported-by: default avatarxingzhen <zhengjun.xing@intel.com>
      Acked-by: default avatarNamhyung Kim <namhyung@kernel.org>
      Fixes: 3debb0a9 ("tracing: Fix trace_printk() to print when not using bprintk()")
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc64a839
    • Allen Hung's avatar
      HID: multitouch: enable palm rejection for Windows Precision Touchpad · 2f839c95
      Allen Hung authored
      commit 6dd2e27a upstream.
      
      The usage Confidence is mandary to Windows Precision Touchpad devices. If
      it is examined in input_mapping on a WIndows Precision Touchpad, a new add
      quirk MT_QUIRK_CONFIDENCE desgned for such devices will be applied to the
      device. A touch with the confidence bit is not set is determined as
      invalid.
      
      Tested on Dell XPS13 9343
      Reviewed-by: default avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Tested-by: Andy Lutomirski <luto@kernel.org> # XPS 13 9350, BIOS 1.4.3
      Signed-off-by: default avatarAllen Hung <allen_hung@dell.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f839c95
    • Scott Bauer's avatar
      HID: hiddev: validate num_values for HIDIOCGUSAGES, HIDIOCSUSAGES commands · 300851ff
      Scott Bauer authored
      commit 93a2001b upstream.
      
      This patch validates the num_values parameter from userland during the
      HIDIOCGUSAGES and HIDIOCSUSAGES commands. Previously, if the report id was set
      to HID_REPORT_ID_UNKNOWN, we would fail to validate the num_values parameter
      leading to a heap overflow.
      Signed-off-by: default avatarScott Bauer <sbauer@plzdonthack.me>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      300851ff
    • Oliver Neukum's avatar
      HID: elo: kill not flush the work · 2d7a2ff1
      Oliver Neukum authored
      commit ed596a4a upstream.
      
      Flushing a work that reschedules itself is not a sensible operation. It needs
      to be killed. Failure to do so leads to a kernel panic in the timer code.
      Signed-off-by: default avatarOliver Neukum <ONeukum@suse.com>
      Reviewed-by: default avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d7a2ff1
    • Quentin Casasnovas's avatar
      KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode. · 44dd5cec
      Quentin Casasnovas authored
      commit ff30ef40 upstream.
      
      I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was
      getting a GP(0) on a rather unspecial vmread from Xen:
      
           (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=n  Not tainted ]----
           (XEN) CPU:    1
           (XEN) RIP:    e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
           (XEN) rax: ffff82d0801e6288   rbx: ffff83003ffbfb7c   rcx: fffffffffffab928
           (XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff83000bdd0000
           (XEN) rbp: ffff83000bdd0000   rsp: ffff83003ffbfab0   r8:  ffff830038813910
           (XEN) r9:  ffff83003faf3958   r10: 0000000a3b9f7640   r11: ffff83003f82d418
           (XEN) r12: 0000000000000000   r13: ffff83003ffbffff   r14: 0000000000004802
           (XEN) r15: 0000000000000008   cr0: 0000000080050033   cr4: 00000000001526e0
           (XEN) cr3: 000000003fc79000   cr2: 0000000000000000
           (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
           (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450):
           (XEN)  00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
           (XEN) Xen stack trace from rsp=ffff83003ffbfab0:
      
           ...
      
           (XEN) Xen call trace:
           (XEN)    [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN)    [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300
           (XEN)    [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60
           (XEN)    [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70
           (XEN)    [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0
           (XEN)    [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0
           (XEN)    [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0
           (XEN)    [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270
           (XEN)    [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0
           (XEN)    [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90
           (XEN)    [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140
           (XEN)    [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140
           (XEN)    [<ffff82d080164aeb>] context_switch+0x13b/0xe40
           (XEN)    [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570
           (XEN)    [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90
           (XEN)    [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50
           (XEN)
           (XEN)
           (XEN) ****************************************
           (XEN) Panic on CPU 1:
           (XEN) GENERAL PROTECTION FAULT
           (XEN) [error_code=0000]
           (XEN) ****************************************
      
      Tracing my host KVM showed it was the one injecting the GP(0) when
      emulating the VMREAD and checking the destination segment permissions in
      get_vmx_mem_address():
      
           3)               |    vmx_handle_exit() {
           3)               |      handle_vmread() {
           3)               |        nested_vmx_check_permission() {
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.065 us    |            vmx_read_guest_seg_selector();
           3)   0.066 us    |            vmx_read_guest_seg_ar();
           3)   1.636 us    |          }
           3)   0.058 us    |          vmx_get_rflags();
           3)   0.062 us    |          vmx_read_guest_seg_ar();
           3)   3.469 us    |        }
           3)               |        vmx_get_cs_db_l_bits() {
           3)   0.058 us    |          vmx_read_guest_seg_ar();
           3)   0.662 us    |        }
           3)               |        get_vmx_mem_address() {
           3)   0.068 us    |          vmx_cache_reg();
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.068 us    |            vmx_read_guest_seg_selector();
           3)   0.071 us    |            vmx_read_guest_seg_ar();
           3)   1.756 us    |          }
           3)               |          kvm_queue_exception_e() {
           3)   0.066 us    |            kvm_multiple_exception();
           3)   0.684 us    |          }
           3)   4.085 us    |        }
           3)   9.833 us    |      }
           3) + 10.366 us   |    }
      
      Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
      Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
      Control Structure", I found that we're enforcing that the destination
      operand is NOT located in a read-only data segment or any code segment when
      the L1 is in long mode - BUT that check should only happen when it is in
      protected mode.
      
      Shuffling the code a bit to make our emulation follow the specification
      allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
      without problems.
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Signed-off-by: default avatarQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Eugene Korenevsky <ekorenevsky@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44dd5cec
    • Xiubo Li's avatar
      kvm: Fix irq route entries exceeding KVM_MAX_IRQ_ROUTES · 54f87e16
      Xiubo Li authored
      commit caf1ff26 upstream.
      
      These days, we experienced one guest crash with 8 cores and 3 disks,
      with qemu error logs as bellow:
      
      qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984:
      kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
      
      And then we found one patch(bdf026317d) in qemu tree, which said
      could fix this bug.
      
      Execute the following script will reproduce the BUG quickly:
      
      irq_affinity.sh
      ========================================================================
      
      vda_irq_num=25
      vdb_irq_num=27
      while [ 1 ]
      do
          for irq in {1,2,4,8,10,20,40,80}
              do
                  echo $irq > /proc/irq/$vda_irq_num/smp_affinity
                  echo $irq > /proc/irq/$vdb_irq_num/smp_affinity
                  dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct
                  dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct
              done
      done
      ========================================================================
      
      The following qemu log is added in the qemu code and is displayed when
      this bug reproduced:
      
      kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024,
      irq_routes->nr: 1024, gsi_count: 1024.
      
      That's to say when irq_routes->nr == 1024, there are 1024 routing entries,
      but in the kernel code when routes->nr >= 1024, will just return -EINVAL;
      
      The nr is the number of the routing entries which is in of
      [1 ~ KVM_MAX_IRQ_ROUTES], not the index in [0 ~ KVM_MAX_IRQ_ROUTES - 1].
      
      This patch fix the BUG above.
      Signed-off-by: default avatarXiubo Li <lixiubo@cmss.chinamobile.com>
      Signed-off-by: default avatarWei Tang <tangwei@cmss.chinamobile.com>
      Signed-off-by: default avatarZhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54f87e16
    • Dan Carpenter's avatar
      KEYS: potential uninitialized variable · 398051f2
      Dan Carpenter authored
      commit 38327424 upstream.
      
      If __key_link_begin() failed then "edit" would be uninitialized.  I've
      added a check to fix that.
      
      This allows a random user to crash the kernel, though it's quite
      difficult to achieve.  There are three ways it can be done as the user
      would have to cause an error to occur in __key_link():
      
       (1) Cause the kernel to run out of memory.  In practice, this is difficult
           to achieve without ENOMEM cropping up elsewhere and aborting the
           attempt.
      
       (2) Revoke the destination keyring between the keyring ID being looked up
           and it being tested for revocation.  In practice, this is difficult to
           time correctly because the KEYCTL_REJECT function can only be used
           from the request-key upcall process.  Further, users can only make use
           of what's in /sbin/request-key.conf, though this does including a
           rejection debugging test - which means that the destination keyring
           has to be the caller's session keyring in practice.
      
       (3) Have just enough key quota available to create a key, a new session
           keyring for the upcall and a link in the session keyring, but not then
           sufficient quota to create a link in the nominated destination keyring
           so that it fails with EDQUOT.
      
      The bug can be triggered using option (3) above using something like the
      following:
      
      	echo 80 >/proc/sys/kernel/keys/root_maxbytes
      	keyctl request2 user debug:fred negate @t
      
      The above sets the quota to something much lower (80) to make the bug
      easier to trigger, but this is dependent on the system.  Note also that
      the name of the keyring created contains a random number that may be
      between 1 and 10 characters in size, so may throw the test off by
      changing the amount of quota used.
      
      Assuming the failure occurs, something like the following will be seen:
      
      	kfree_debugcheck: out of range ptr 6b6b6b6b6b6b6b68h
      	------------[ cut here ]------------
      	kernel BUG at ../mm/slab.c:2821!
      	...
      	RIP: 0010:[<ffffffff811600f9>] kfree_debugcheck+0x20/0x25
      	RSP: 0018:ffff8804014a7de8  EFLAGS: 00010092
      	RAX: 0000000000000034 RBX: 6b6b6b6b6b6b6b68 RCX: 0000000000000000
      	RDX: 0000000000040001 RSI: 00000000000000f6 RDI: 0000000000000300
      	RBP: ffff8804014a7df0 R08: 0000000000000001 R09: 0000000000000000
      	R10: ffff8804014a7e68 R11: 0000000000000054 R12: 0000000000000202
      	R13: ffffffff81318a66 R14: 0000000000000000 R15: 0000000000000001
      	...
      	Call Trace:
      	  kfree+0xde/0x1bc
      	  assoc_array_cancel_edit+0x1f/0x36
      	  __key_link_end+0x55/0x63
      	  key_reject_and_link+0x124/0x155
      	  keyctl_reject_key+0xb6/0xe0
      	  keyctl_negate_key+0x10/0x12
      	  SyS_keyctl+0x9f/0xe7
      	  do_syscall_64+0x63/0x13a
      	  entry_SYSCALL64_slow_path+0x25/0x25
      
      Fixes: f70e2e06 ('KEYS: Do preallocation for __key_link()')
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      398051f2
    • Vineet Gupta's avatar
    • Vineet Gupta's avatar
      f06a5a01
    • Martin KaFai Lau's avatar
      ipv6: Fix mem leak in rt6i_pcpu · 9c458a86
      Martin KaFai Lau authored
      [ Upstream commit 903ce4ab ]
      
      It was first reported and reproduced by Petr (thanks!) in
      https://bugzilla.kernel.org/show_bug.cgi?id=119581
      
      free_percpu(rt->rt6i_pcpu) used to always happen in ip6_dst_destroy().
      
      However, after fixing a deadlock bug in
      commit 9c7370a1 ("ipv6: Fix a potential deadlock when creating pcpu rt"),
      free_percpu() is not called before setting non_pcpu_rt->rt6i_pcpu to NULL.
      
      It is worth to note that rt6i_pcpu is protected by table->tb6_lock.
      
      kmemleak somehow did not report it.  We nailed it down by
      observing the pcpu entries in /proc/vmallocinfo (first suggested
      by Hannes, thanks!).
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Fixes: 9c7370a1 ("ipv6: Fix a potential deadlock when creating pcpu rt")
      Reported-by: default avatarPetr Novopashenniy <pety@rusnet.ru>
      Tested-by: default avatarPetr Novopashenniy <pety@rusnet.ru>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Petr Novopashenniy <pety@rusnet.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c458a86
    • Bjørn Mork's avatar
      cdc_ncm: workaround for EM7455 "silent" data interface · 61f602d8
      Bjørn Mork authored
      [ Upstream commit c086e709 ]
      
      Several Lenovo users have reported problems with their Sierra
      Wireless EM7455 modem. The driver has loaded successfully and
      the MBIM management channel has appeared to work, including
      establishing a connection to the mobile network. But no frames
      have been received over the data interface.
      
      The problem affects all EM7455 and MC7455, and is assumed to
      affect other modems based on the same Qualcomm chipset and
      baseband firmware.
      
      Testing narrowed the problem down to what seems to be a
      firmware timing bug during initialization. Adding a short sleep
      while probing is sufficient to make the problem disappear.
      Experiments have shown that 1-2 ms is too little to have any
      effect, while 10-20 ms is enough to reliably succeed.
      Reported-by: default avatarStefan Armbruster <ml001@armbruster-it.de>
      Reported-by: default avatarRalph Plawetzki <ralph@purejava.org>
      Reported-by: default avatarAndreas Fett <andreas.fett@secunet.com>
      Reported-by: default avatarRasmus Lerdorf <rasmus@lerdorf.com>
      Reported-by: default avatarSamo Ratnik <samo.ratnik@gmail.com>
      Reported-and-tested-by: default avatarAleksander Morgado <aleksander@aleksander.es>
      Signed-off-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61f602d8
    • WANG Cong's avatar
      net_sched: fix mirrored packets checksum · 2832302f
      WANG Cong authored
      [ Upstream commit 82a31b92 ]
      
      Similar to commit 9b368814 ("net: fix bridge multicast packet checksum validation")
      we need to fixup the checksum for CHECKSUM_COMPLETE when
      pushing skb on RX path. Otherwise we get similar splats.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2832302f
    • David S. Miller's avatar
      packet: Use symmetric hash for PACKET_FANOUT_HASH. · 424848bd
      David S. Miller authored
      [ Upstream commit eb70db87 ]
      
      People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
      they want packets going in both directions on a flow to hash to the
      same bucket.
      
      The core kernel SKB hash became non-symmetric when the ipv6 flow label
      and other entities were incorporated into the standard flow hash order
      to increase entropy.
      
      But there are no users of PACKET_FANOUT_HASH who want an assymetric
      hash, they all want a symmetric one.
      
      Therefore, use the flow dissector to compute a flat symmetric hash
      over only the protocol, addresses and ports.  This hash does not get
      installed into and override the normal skb hash, so this change has
      no effect whatsoever on the rest of the stack.
      Reported-by: default avatarEric Leblond <eric@regit.org>
      Tested-by: default avatarEric Leblond <eric@regit.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      424848bd
    • Peter Zijlstra's avatar
      sched/fair: Fix cfs_rq avg tracking underflow · 43b1bfec
      Peter Zijlstra authored
      commit 89741892 upstream.
      
      As per commit:
      
        b7fa30c9 ("sched/fair: Fix post_init_entity_util_avg() serialization")
      
      > the code generated from update_cfs_rq_load_avg():
      >
      > 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      > 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
      > 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      > 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      > 		removed_load = 1;
      > 	}
      >
      > turns into:
      >
      > ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
      > ffffffff8108706b:       48 85 c0                test   %rax,%rax
      > ffffffff8108706e:       74 40                   je     ffffffff810870b0 <update_blocked_averages+0xc0>
      > ffffffff81087070:       4c 89 f8                mov    %r15,%rax
      > ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
      > ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
      > ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
      > ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
      > ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
      > ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
      >
      > Which you'll note ends up with sa->load_avg -= r in memory at
      > ffffffff8108707a.
      
      So I _should_ have looked at other unserialized users of ->load_avg,
      but alas. Luckily nikbor reported a similar /0 from task_h_load() which
      instantly triggered recollection of this here problem.
      
      Aside from the intermediate value hitting memory and causing problems,
      there's another problem: the underflow detection relies on the signed
      bit. This reduces the effective width of the variables, IOW its
      effectively the same as having these variables be of signed type.
      
      This patch changes to a different means of unsigned underflow
      detection to not rely on the signed bit. This allows the variables to
      use the 'full' unsigned range. And it does so with explicit LOAD -
      STORE to ensure any intermediate value will never be visible in
      memory, allowing these unserialized loads.
      
      Note: GCC generates crap code for this, might warrant a look later.
      
      Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
             maybe we should do clamping on add too.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: bsegall@google.com
      Cc: kernel@kyup.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Fixes: 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      43b1bfec
    • Kirill A. Shutemov's avatar
      UBIFS: Implement ->migratepage() · 1e1f4ff7
      Kirill A. Shutemov authored
      commit 4ac1c17b upstream.
      
      During page migrations UBIFS might get confused
      and the following assert triggers:
      [  213.480000] UBIFS assert failed in ubifs_set_page_dirty at 1451 (pid 436)
      [  213.490000] CPU: 0 PID: 436 Comm: drm-stress-test Not tainted 4.4.4-00176-geaa802524636-dirty #1008
      [  213.490000] Hardware name: Allwinner sun4i/sun5i Families
      [  213.490000] [<c0015e70>] (unwind_backtrace) from [<c0012cdc>] (show_stack+0x10/0x14)
      [  213.490000] [<c0012cdc>] (show_stack) from [<c02ad834>] (dump_stack+0x8c/0xa0)
      [  213.490000] [<c02ad834>] (dump_stack) from [<c0236ee8>] (ubifs_set_page_dirty+0x44/0x50)
      [  213.490000] [<c0236ee8>] (ubifs_set_page_dirty) from [<c00fa0bc>] (try_to_unmap_one+0x10c/0x3a8)
      [  213.490000] [<c00fa0bc>] (try_to_unmap_one) from [<c00fadb4>] (rmap_walk+0xb4/0x290)
      [  213.490000] [<c00fadb4>] (rmap_walk) from [<c00fb1bc>] (try_to_unmap+0x64/0x80)
      [  213.490000] [<c00fb1bc>] (try_to_unmap) from [<c010dc28>] (migrate_pages+0x328/0x7a0)
      [  213.490000] [<c010dc28>] (migrate_pages) from [<c00d0cb0>] (alloc_contig_range+0x168/0x2f4)
      [  213.490000] [<c00d0cb0>] (alloc_contig_range) from [<c010ec00>] (cma_alloc+0x170/0x2c0)
      [  213.490000] [<c010ec00>] (cma_alloc) from [<c001a958>] (__alloc_from_contiguous+0x38/0xd8)
      [  213.490000] [<c001a958>] (__alloc_from_contiguous) from [<c001ad44>] (__dma_alloc+0x23c/0x274)
      [  213.490000] [<c001ad44>] (__dma_alloc) from [<c001ae08>] (arm_dma_alloc+0x54/0x5c)
      [  213.490000] [<c001ae08>] (arm_dma_alloc) from [<c035cecc>] (drm_gem_cma_create+0xb8/0xf0)
      [  213.490000] [<c035cecc>] (drm_gem_cma_create) from [<c035cf20>] (drm_gem_cma_create_with_handle+0x1c/0xe8)
      [  213.490000] [<c035cf20>] (drm_gem_cma_create_with_handle) from [<c035d088>] (drm_gem_cma_dumb_create+0x3c/0x48)
      [  213.490000] [<c035d088>] (drm_gem_cma_dumb_create) from [<c0341ed8>] (drm_ioctl+0x12c/0x444)
      [  213.490000] [<c0341ed8>] (drm_ioctl) from [<c0121adc>] (do_vfs_ioctl+0x3f4/0x614)
      [  213.490000] [<c0121adc>] (do_vfs_ioctl) from [<c0121d30>] (SyS_ioctl+0x34/0x5c)
      [  213.490000] [<c0121d30>] (SyS_ioctl) from [<c000f2c0>] (ret_fast_syscall+0x0/0x34)
      
      UBIFS is using PagePrivate() which can have different meanings across
      filesystems. Therefore the generic page migration code cannot handle this
      case correctly.
      We have to implement our own migration function which basically does a
      plain copy but also duplicates the page private flag.
      UBIFS is not a block device filesystem and cannot use buffer_migrate_page().
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [rw: Massaged changelog, build fixes, etc...]
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1e1f4ff7
    • Richard Weinberger's avatar
      mm: Export migrate_page_move_mapping and migrate_page_copy · 4b1cb3c8
      Richard Weinberger authored
      commit 1118dce7 upstream.
      
      Export these symbols such that UBIFS can implement
      ->migratepage.
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b1cb3c8
    • James Hogan's avatar
      MIPS: KVM: Fix modular KVM under QEMU · 7ad26023
      James Hogan authored
      commit 797179bc upstream.
      
      Copy __kvm_mips_vcpu_run() into unmapped memory, so that we can never
      get a TLB refill exception in it when KVM is built as a module.
      
      This was observed to happen with the host MIPS kernel running under
      QEMU, due to a not entirely transparent optimisation in the QEMU TLB
      handling where TLB entries replaced with TLBWR are copied to a separate
      part of the TLB array. Code in those pages continue to be executable,
      but those mappings persist only until the next ASID switch, even if they
      are marked global.
      
      An ASID switch happens in __kvm_mips_vcpu_run() at exception level after
      switching to the guest exception base. Subsequent TLB mapped kernel
      instructions just prior to switching to the guest trigger a TLB refill
      exception, which enters the guest exception handlers without updating
      EPC. This appears as a guest triggered TLB refill on a host kernel
      mapped (host KSeg2) address, which is not handled correctly as user
      (guest) mode accesses to kernel (host) segments always generate address
      error exceptions.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: kvm@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ad26023
    • Steve Capper's avatar
      ARM: 8579/1: mm: Fix definition of pmd_mknotpresent · 490a71c5
      Steve Capper authored
      commit 56530f5d upstream.
      
      Currently pmd_mknotpresent will use a zero entry to respresent an
      invalidated pmd.
      
      Unfortunately this definition clashes with pmd_none, thus it is
      possible for a race condition to occur if zap_pmd_range sees pmd_none
      whilst __split_huge_pmd_locked is running too with pmdp_invalidate
      just called.
      
      This patch fixes the race condition by modifying pmd_mknotpresent to
      create non-zero faulting entries (as is done in other architectures),
      removing the ambiguity with pmd_none.
      
      [catalin.marinas@arm.com: using L_PMD_SECT_VALID instead of PMD_TYPE_SECT]
      
      Fixes: 8d962507 ("ARM: mm: Transparent huge page support for LPAE systems.")
      Reported-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Signed-off-by: default avatarSteve Capper <steve.capper@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      490a71c5
    • Will Deacon's avatar
      ARM: 8578/1: mm: ensure pmd_present only checks the valid bit · 54cf0dde
      Will Deacon authored
      commit 62453188 upstream.
      
      In a subsequent patch, pmd_mknotpresent will clear the valid bit of the
      pmd entry, resulting in a not-present entry from the hardware's
      perspective. Unfortunately, pmd_present simply checks for a non-zero pmd
      value and will therefore continue to return true even after a
      pmd_mknotpresent operation. Since pmd_mknotpresent is only used for
      managing huge entries, this is only an issue for the 3-level case.
      
      This patch fixes the 3-level pmd_present implementation to take into
      account the valid bit. For bisectability, the change is made before the
      fix to pmd_mknotpresent.
      
      [catalin.marinas@arm.com: comment update regarding pmd_mknotpresent patch]
      
      Fixes: 8d962507 ("ARM: mm: Transparent huge page support for LPAE systems.")
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steve Capper <Steve.Capper@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54cf0dde
    • Fabio Estevam's avatar
      ARM: imx6ul: Fix Micrel PHY mask · 91ac7387
      Fabio Estevam authored
      commit 20c15226 upstream.
      
      The value used for Micrel PHY mask is not correct. Use the
      MICREL_PHY_ID_MASK definition instead.
      
      Thanks to Jiri Luznicky for proposing the fix at
      https://community.freescale.com/thread/387739
      
      Fixes: 709bc065 ("ARM: imx6ul: add fec MAC refrence clock and phy fixup init")
      Signed-off-by: default avatarFabio Estevam <fabio.estevam@nxp.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarShawn Guo <shawnguo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      91ac7387
    • Trond Myklebust's avatar
      NFS: Fix another OPEN_DOWNGRADE bug · b5d4a793
      Trond Myklebust authored
      commit e547f262 upstream.
      
      Olga Kornievskaia reports that the following test fails to trigger
      an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE.
      
      	fd0 = open(foo, RDRW)   -- should be open on the wire for "both"
      	fd1 = open(foo, RDONLY)  -- should be open on the wire for "read"
      	close(fd0) -- should trigger an open_downgrade
      	read(fd1)
      	close(fd1)
      
      The issue is that we're missing a check for whether or not the current
      state transitioned from an O_RDWR state as opposed to having transitioned
      from a combination of O_RDONLY and O_WRONLY.
      Reported-by: default avatarOlga Kornievskaia <aglo@umich.edu>
      Fixes: cd9288ff ("NFSv4: Fix another bug in the close/open_downgrade code")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b5d4a793
    • Al Viro's avatar
      make nfs_atomic_open() call d_drop() on all ->open_context() errors. · 44d86dbf
      Al Viro authored
      commit d20cb71d upstream.
      
      In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code"
      unconditional d_drop() after the ->open_context() had been removed.  It had
      been correct for success cases (there ->open_context() itself had been doing
      dcache manipulations), but not for error ones.  Only one of those (ENOENT)
      got a compensatory d_drop() added in that commit, but in fact it should've
      been done for all errors.  As it is, the case of O_CREAT non-exclusive open
      on a hashed negative dentry racing with e.g. symlink creation from another
      client ended up with ->open_context() getting an error and proceeding to
      call nfs_lookup().  On a hashed dentry, which would've instantly triggered
      BUG_ON() in d_materialise_unique() (or, these days, its equivalent in
      d_splice_alias()).
      Tested-by: default avatarOleg Drokin <green@linuxhacker.ru>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      44d86dbf
    • Ben Hutchings's avatar
      nfsd: check permissions when setting ACLs · 412cfeec
      Ben Hutchings authored
      commit 99965378 upstream.
      
      Use set_posix_acl, which includes proper permission checks, instead of
      calling ->set_acl directly.  Without this anyone may be able to grant
      themselves permissions to a file by setting the ACL.
      
      Lock the inode to make the new checks atomic with respect to set_acl.
      (Also, nfsd was the only caller of set_acl not locking the inode, so I
      suspect this may fix other races.)
      
      This also simplifies the code, and ensures our ACLs are checked by
      posix_acl_valid.
      
      The permission checks and the inode locking were lost with commit
      4ac7249e, which changed nfsd to use the set_acl inode operation directly
      instead of going through xattr handlers.
      Reported-by: default avatarDavid Sinquin <david@sinquin.eu>
      [agreunba@redhat.com: use set_posix_acl]
      Fixes: 4ac7249e
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      412cfeec
    • Andreas Gruenbacher's avatar
      posix_acl: Add set_posix_acl · c3fa141c
      Andreas Gruenbacher authored
      commit 485e71e8 upstream.
      
      Factor out part of posix_acl_xattr_set into a common function that takes
      a posix_acl, which nfsd can also call.
      
      The prototype already exists in include/linux/posix_acl.h.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c3fa141c
    • Oleg Drokin's avatar
      nfsd: Extend the mutex holding region around in nfsd4_process_open2() · f78ffdc2
      Oleg Drokin authored
      commit 5cc1fb2a upstream.
      
      To avoid racing entry into nfs4_get_vfs_file().
      Make init_open_stateid() return with locked stateid to be unlocked
      by the caller.
      Signed-off-by: default avatarOleg Drokin <green@linuxhacker.ru>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f78ffdc2
    • Oleg Drokin's avatar
      nfsd: Always lock state exclusively. · 087f8fe0
      Oleg Drokin authored
      commit feb9dad5 upstream.
      
      It used to be the case that state had an rwlock that was locked for write
      by downgrades, but for read for upgrades (opens). Well, the problem is
      if there are two competing opens for the same state, they step on
      each other toes potentially leading to leaking file descriptors
      from the state structure, since access mode is a bitmap only set once.
      Signed-off-by: default avatarOleg Drokin <green@linuxhacker.ru>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      087f8fe0
    • J. Bruce Fields's avatar
      nfsd4/rpc: move backchannel create logic into rpc code · 58e9e70a
      J. Bruce Fields authored
      commit d50039ea upstream.
      
      Also simplify the logic a bit.
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Acked-by: default avatarTrond Myklebust <trondmy@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58e9e70a
    • Tejun Heo's avatar
      writeback: use higher precision calculation in domain_dirty_limits() · 400850bd
      Tejun Heo authored
      commit 62a584fe upstream.
      
      As vm.dirty_[background_]bytes can't be applied verbatim to multiple
      cgroup writeback domains, they get converted to percentages in
      domain_dirty_limits() and applied the same way as
      vm.dirty_[background]ratio.  However, if the specified bytes is lower
      than 1% of available memory, the calculated ratios become zero and the
      writeback domain gets throttled constantly.
      
      Fix it by using per-PAGE_SIZE instead of percentage for ratio
      calculations.  Also, the updated DIV_ROUND_UP() usages now should
      yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
      bytes are above zero.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarMiao Xie <miaoxie@huawei.com>
      Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
      Fixes: 9fc3a43e ("writeback: separate out domain_dirty_limits()")
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Adjusted comment based on Jan's suggestion.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      400850bd