1. 05 Sep, 2014 40 commits
    • Dmitry Monakhov's avatar
    • Alexander Usyskin's avatar
      mei: nfc: fix memory leak in error path · 3099fab5
      Alexander Usyskin authored
      commit 8e8248b1 upstream.
      
      NFC will leak buffer if send failed.
      Use single exit point that does the freeing
      Signed-off-by: default avatarAlexander Usyskin <alexander.usyskin@intel.com>
      Signed-off-by: default avatarTomas Winkler <tomas.winkler@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3099fab5
    • Alexander Usyskin's avatar
      mei: reset client state on queued connect request · c079d35b
      Alexander Usyskin authored
      commit 73ab4232 upstream.
      
      If connect request is queued (e.g. device in pg) set client state
      to initializing, thus avoid preliminary exit in wait if current
      state is disconnected.
      
      This is regression from:
      
      commit e4d8270e
      Author: Alexander Usyskin <alexander.usyskin@intel.com>
      mei: set connecting state just upon connection request is sent to the fw
      Signed-off-by: default avatarAlexander Usyskin <alexander.usyskin@intel.com>
      Signed-off-by: default avatarTomas Winkler <tomas.winkler@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c079d35b
    • Liu Bo's avatar
      Btrfs: fix task hang under heavy compressed write · 8b7ac9fc
      Liu Bo authored
      commit 9e0af237 upstream.
      
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
      
      Btrfs has a kind of work queued as an ordered way, which means that its
      ordered_func() must be processed in the way of FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case, first when we find free space, we get an uncached block
      group, then we go to read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      As if work A has a high priority in wq->ordered_list and there are more ordered
      works queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that kernel workqueue
      code worker_thread() still has worker->current_work pointer to be work
      A->normal_work's, ie. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, work C->normal_work
      and work A->normal_work are likely to share the same address(I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds our
      kthread is processing it(see find_worker_executing_work()), it'll think
      work C as a collision and skip then, which ends up nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there're other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueue shares one work->func, -- normal_work_helper,
      so this makes each workqueue to have its own helper function, but only a
      wraper pf normal_work_helper.
      
      With this patch, I no long hit the above hang.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b7ac9fc
    • Chris Mason's avatar
      Btrfs: fix filemap_flush call in btrfs_file_release · a1bbac07
      Chris Mason authored
      commit f6dc45c7 upstream.
      
      We should only be flushing on close if the file was flagged as needing
      it during truncate.  I broke this with my ordered data vs transaction
      commit deadlock fix.
      
      Thanks to Miao Xie for catching this.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reported-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1bbac07
    • Liu Bo's avatar
      Btrfs: fix crash on endio of reading corrupted block · ddef1474
      Liu Bo authored
      commit 38c1c2e4 upstream.
      
      The crash is
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/extent_io.c:2124!
      [...]
      Workqueue: btrfs-endio normal_work_helper [btrfs]
      RIP: 0010:[<ffffffffa02d6055>]  [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]
      
      This is in fact a regression.
      
      It is because we forgot to increase @offset properly in reading corrupted block,
      so that the @offset remains, and this leads to checksum errors while reading
      left blocks queued up in the same bio, and then ends up with hiting the above
      BUG_ON.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ddef1474
    • Chris Mason's avatar
      btrfs: disable strict file flushes for renames and truncates · 5792fa6b
      Chris Mason authored
      commit 8d875f95 upstream.
      
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5792fa6b
    • Liu Bo's avatar
      Btrfs: fix compressed write corruption on enospc · 8e46c5dc
      Liu Bo authored
      commit ce62003f upstream.
      
      When failing to allocate space for the whole compressed extent, we'll
      fallback to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent, and these 'clean' pages will
      simply skip 'submit' part and go to endio directly, at last we got data
      corruption as we write nothing.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Tested-By: default avatarMartin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e46c5dc
    • Filipe Manana's avatar
      Btrfs: read lock extent buffer while walking backrefs · 480f1ea5
      Filipe Manana authored
      commit 6f7ff6d7 upstream.
      
      Before processing the extent buffer, acquire a read lock on it, so
      that we're safe against concurrent updates on the extent buffer.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      480f1ea5
    • Filipe Manana's avatar
      Btrfs: fix csum tree corruption, duplicate and outdated checksums · 4dc6ec45
      Filipe Manana authored
      commit 27b9a812 upstream.
      
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      
      For example, consider the following scenario, where we have:
      
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
      
        Leaf N:
      
          slot = 0                           slot = btrfs_header_nritems() - 1
        |-------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        |-------------------------------------------------------------------|
      
        Leaf N + 1:
      
            slot = 0                          slot = btrfs_header_nritems() - 1
        |--------------------------------------------------------------------|
        | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
        |--------------------------------------------------------------------|
      
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will returns us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
      
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        |----------------------------------------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
        |----------------------------------------------------------------------------------------------------|
      
      And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
      into the "insert:" label, which will set tmp to:
      
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
      
      and
      
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
      
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4dc6ec45
    • Takashi Iwai's avatar
      Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 3173e852
      Takashi Iwai authored
      commit 4eb1f66d upstream.
      
      We've got bug reports that btrfs crashes when quota is enabled on
      32bit kernel, typically with the Oops like below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
       Stack:
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      
      This indicates a NULL corruption in prefs_delayed list.  The further
      investigation and bisection pointed that the call of ulist_add_merge()
      results in the corruption.
      
      ulist_add_merge() takes u64 as aux and writes a 64bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer of a pointer to old_aux.  That is, the function overwrites
      64bit value on 32bit pointer.  This caused a NULL in the adjacent
      variable, in this case, prefs_delayed.
      
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to pass/store properly a pointer
      value instead of u64.  There are still ugly void ** cast remaining
      in the callers because void ** cannot be taken implicitly.  But, it's
      safer than explicit cast to u64, anyway.
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3173e852
    • Stephen M. Cameron's avatar
      hpsa: fix bad -ENOMEM return value in hpsa_big_passthru_ioctl · 13ca262a
      Stephen M. Cameron authored
      commit 0758f4f7 upstream.
      
      When copy_from_user fails, return -EFAULT, not -ENOMEM
      Signed-off-by: default avatarStephen M. Cameron <scameron@beardog.cce.hp.com>
      Reported-by: default avatarRobert Elliott <elliott@hp.com>
      Reviewed-by: default avatarJoe Handzik <joseph.t.handzik@hp.com>
      Reviewed-by: default avatarScott Teel <scott.teel@hp.com>
      Reviewed by: Mike MIller <michael.miller@canonical.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      13ca262a
    • Hugh Dickins's avatar
      x86,mm: fix pte_special versus pte_numa · d3e12331
      Hugh Dickins authored
      commit b38af472 upstream.
      
      Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at
      mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels:
      where zap_pte_range() checks page->mapping to see if PageAnon(page).
      
      Those addresses fit struct pages for pfns d2001 and d2000, and in each
      dump a register or a stack slot showed d2001730 or d2000730: pte flags
      0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has
      a hole between cfffffff and 100000000, which would need special access.
      
      Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on
      the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL
      pte no longer passes the pte_special() test, so zap_pte_range() goes on
      to try to access a non-existent struct page.
      
      Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to
      complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE).  A
      hint that this was a problem was that c46a7c81 added pte_numa() test
      to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast
      path: This was papering over a pte_special() snag when the zero page was
      encountered during zap.  This patch reverts vm_normal_page() to how it
      was before, relying on pte_special().
      
      It still appears that this patch may be incomplete: aren't there other
      places which need to be handling PROTNONE along with PRESENT?  For
      example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a
      PROT_NONE area, that would make it pte_special().  This is side-stepped
      by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there
      are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be
      interesting.
      
      Fixes: c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Tested-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d3e12331
    • David Vrabel's avatar
      x86/xen: resume timer irqs early · d752dc49
      David Vrabel authored
      commit 8d5999df upstream.
      
      If the timer irqs are resumed during device resume it is possible in
      certain circumstances for the resume to hang early on, before device
      interrupts are resumed.  For an Ubuntu 14.04 PVHVM guest this would
      occur in ~0.5% of resume attempts.
      
      It is not entirely clear what is occuring the point of the hang but I
      think a task necessary for the resume calls schedule_timeout(),
      waiting for a timer interrupt (which never arrives).  This failure may
      require specific tasks to be running on the other VCPUs to trigger
      (processes are not frozen during a suspend/resume if PREEMPT is
      disabled).
      
      Add IRQF_EARLY_RESUME to the timer interrupts so they are resumed in
      syscore_resume().
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d752dc49
    • David Vrabel's avatar
      x86/xen: use vmap() to map grant table pages in PVH guests · 434d59e0
      David Vrabel authored
      commit 7d951f3c upstream.
      
      Commit b7dd0e35 (x86/xen: safely map and unmap grant frames when
      in atomic context) causes PVH guests to crash in
      arch_gnttab_map_shared() when they attempted to map the pages for the
      grant table.
      
      This use of a PV-specific function during the PVH grant table setup is
      non-obvious and not needed.  The standard vmap() function does the
      right thing.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Reported-by: default avatarMukesh Rathor <mukesh.rathor@oracle.com>
      Tested-by: default avatarMukesh Rathor <mukesh.rathor@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      434d59e0
    • Matt Fleming's avatar
      x86/efi: Enforce CONFIG_RELOCATABLE for EFI boot stub · ae37cf19
      Matt Fleming authored
      commit 7b2a583a upstream.
      
      Without CONFIG_RELOCATABLE the early boot code will decompress the
      kernel to LOAD_PHYSICAL_ADDR. While this may have been fine in the BIOS
      days, that isn't going to fly with UEFI since parts of the firmware
      code/data may be located at LOAD_PHYSICAL_ADDR.
      
      Straying outside of the bounds of the regions we've explicitly requested
      from the firmware will cause all sorts of trouble. Bruno reports that
      his machine resets while trying to decompress the kernel image.
      
      We already go to great pains to ensure the kernel is loaded into a
      suitably aligned buffer, it's just that the address isn't necessarily
      LOAD_PHYSICAL_ADDR, because we can't guarantee that address isn't in-use
      by the firmware.
      
      Explicitly enforce CONFIG_RELOCATABLE for the EFI boot stub, so that we
      can load the kernel at any address with the correct alignment.
      Reported-by: default avatarBruno Prémont <bonbons@linux-vserver.org>
      Tested-by: default avatarBruno Prémont <bonbons@linux-vserver.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae37cf19
    • David Vrabel's avatar
      xen/events/fifo: ensure all bitops are properly aligned even on x86 · 31c05d1c
      David Vrabel authored
      commit dcecb8fd upstream.
      
      When using the FIFO-based ABI on x86_64, if the last port is at the
      end of an event array page then sync_test_bit() on this port's event
      word will read beyond the end of the page and in certain circumstances
      this may fault.
      
      The fault requires the following page in the kernel's direct mapping
      to be not present, which would mean:
      
      a) the array page is the last page of RAM; or
      
      b) the following page is ballooned out /and/ it has been used for a
         foreign mapping by a kernel driver (such as netback or blkback)
         /and/ the grant has been unmapped.
      
      Use the infrastructure added for arm64 to ensure that all bitops
      operating on event words are unsigned long aligned.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      31c05d1c
    • Thomas Gleixner's avatar
      x86: MCE: Add raw_lock conversion again · c5abe803
      Thomas Gleixner authored
      commit ed5c41d3 upstream.
      
      Commit ea431643 ("x86/mce: Fix CMCI preemption bugs") breaks RT by
      the completely unrelated conversion of the cmci_discover_lock to a
      regular (non raw) spinlock.  This lock was annotated in commit
      59d958d2 ("locking, x86: mce: Annotate cmci_discover_lock as raw")
      with a proper explanation why.
      
      The argument for converting the lock back to a regular spinlock was:
      
       - it does percpu ops without disabling preemption. Preemption is not
         disabled due to the mistaken use of a raw spinlock.
      
      Which is complete nonsense.  The raw_spinlock is disabling preemption in
      the same way as a regular spinlock.  In mainline spinlock maps to
      raw_spinlock, in RT spinlock becomes a "sleeping" lock.
      
      raw_spinlock has on RT exactly the same semantics as in mainline.  And
      because this lock is taken in non preemptible context it must be raw on
      RT.
      
      Undo the locking brainfart.
      Reported-by: default avatarClark Williams <williams@redhat.com>
      Reported-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c5abe803
    • Arnd Bergmann's avatar
      hpsa: fix non-x86 builds · e6ff6f13
      Arnd Bergmann authored
      commit 0b9e7b74 upstream.
      
      commit 28e13446 "[SCSI] hpsa: enable unit attention reporting"
      turns on unit attention notifications, but got the change wrong for
      all architectures other than x86, which now store an uninitialized
      value into the device register.
      
      Gcc helpfully warns about this:
      
      ../drivers/scsi/hpsa.c: In function 'hpsa_set_driver_support_bits':
      ../drivers/scsi/hpsa.c:6373:17: warning: 'driver_support' is used uninitialized in this function [-Wuninitialized]
        driver_support |= ENABLE_UNIT_ATTN;
                       ^
      
      This moves the #ifdef so only the prefetch-enable is conditional
      on x86, not also reading the initial register contents.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Fixes: 28e13446 "[SCSI] hpsa: enable unit attention reporting"
      Acked-by: default avatarStephen M. Cameron <scameron@beardog.cce.hp.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6ff6f13
    • Andy Lutomirski's avatar
      x86_64/vsyscall: Fix warn_bad_vsyscall log output · 5b345374
      Andy Lutomirski authored
      commit 53b884ac upstream.
      
      This commit in Linux 3.6:
      
          commit c767a54b
          Author: Joe Perches <joe@perches.com>
          Date:   Mon May 21 19:50:07 2012 -0700
      
              x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>
      
      caused warn_bad_vsyscall to output garbage in the middle of the
      line.  Revert the bad part of it.
      
      The printk in question isn't actually bare; the level is "%s".
      
      The bug this fixes is purely cosmetic; backports are optional.
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Link: http://lkml.kernel.org/r/03eac1f24110bbe496ecc12a4df467e0d88466d4.1406330947.git.luto@amacapital.netSigned-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b345374
    • Brian W Hart's avatar
      powerpc/powernv: Update dev->dma_mask in pci_set_dma_mask() path · bc3c1662
      Brian W Hart authored
      commit a32305bf upstream.
      
      powerpc defines various machine-specific routines for handling
      pci_set_dma_mask().  The routines for machine "PowerNV" may neglect
      to set dev->dma_mask.  This could confuse anyone (e.g. drivers) that
      consult dev->dma_mask to find the current mask.  Set the dma_mask in
      the PowerNV leaf routine.
      Signed-off-by: default avatarBrian W. Hart <hartb@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc3c1662
    • Tyrel Datwyler's avatar
      powerpc/pci: Reorder pci bus/bridge unregistration during PHB removal · 6f519e90
      Tyrel Datwyler authored
      commit 73400565 upstream.
      
      Commit bcdde7e2 made __sysfs_remove_dir() recursive and introduced a BUG_ON
      during PHB removal while attempting to delete the power managment attribute
      group of the bus. This is a result of tearing the bridge and bus devices down
      out of order in remove_phb_dynamic. Since, the the bus resides below the bridge
      in the sysfs device tree it should be torn down first.
      
      This patch simply moves the device_unregister call for the PHB bridge device
      after the device_unregister call for the PHB bus.
      
      Fixes: bcdde7e2 ("sysfs: make __sysfs_remove_dir() recursive")
      Signed-off-by: default avatarTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f519e90
    • Mike Qiu's avatar
      powerpc/eeh: Wrong place to call pci_get_slot() · 3aa12318
      Mike Qiu authored
      commit 9e5c6e5a upstream.
      
      pci_get_slot() is called with hold of PCI bus semaphore and it's not
      safe to be called in interrupt context. However, we possibly checks
      EEH error and calls the function in interrupt context. To avoid using
      pci_get_slot(), we turn into device tree for fetching location code.
      Otherwise, we might run into WARN_ON() as following messages indicate:
      
       WARNING: at drivers/pci/search.c:223
       CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc3+ #72
       task: c000000001367af0 ti: c000000001444000 task.ti: c000000001444000
       NIP: c000000000497b70 LR: c000000000037530 CTR: 000000003003d114
       REGS: c000000001446fa0 TRAP: 0700   Not tainted  (3.16.0-rc3+)
       MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 48002422  XER: 20000000
       CFAR: c00000000003752c SOFTE: 0
         :
       NIP [c000000000497b70] .pci_get_slot+0x40/0x110
       LR [c000000000037530] .eeh_pe_loc_get+0x150/0x190
       Call Trace:
         .of_get_property+0x30/0x60 (unreliable)
         .eeh_pe_loc_get+0x150/0x190
         .eeh_dev_check_failure+0x1b4/0x550
         .eeh_check_failure+0x90/0xf0
         .lpfc_sli_check_eratt+0x504/0x7c0 [lpfc]
         .lpfc_poll_eratt+0x64/0x100 [lpfc]
         .call_timer_fn+0x64/0x190
         .run_timer_softirq+0x2cc/0x3e0
      Signed-off-by: default avatarMike Qiu <qiudayu@linux.vnet.ibm.com>
      Acked-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3aa12318
    • Christoph Schulz's avatar
      x86: don't exclude low BIOS area when allocating address space for non-PCI cards · 42dbd25a
      Christoph Schulz authored
      commit cbace46a upstream.
      
      Commit 30919b0b ("x86: avoid low BIOS area when allocating address
      space") moved the test for resource allocations that fall within the first
      1MB of address space from the PCI-specific path to a generic path, such
      that all resource allocations will avoid this area.  However, this breaks
      ISA cards which need to allocate a memory region within the first 1MB.  An
      example is the i82365 PCMCIA controller and derivatives like the Ricoh
      RF5C296/396 which map part of the PCMCIA socket memory address space into
      the first 1MB of system memory address space.  They do not work anymore as
      no usable memory region exists due to this change:
      
        Intel ISA PCIC probe: Ricoh RF5C296/396 ISA-to-PCMCIA at port 0x3e0 ofs 0x00, 2 sockets
        host opts [0]: none
        host opts [1]: none
        ISA irqs (scanned) = 3,4,5,9,10 status change on irq 10
        pcmcia_socket pcmcia_socket1: pccard: PCMCIA card inserted into slot 1
        pcmcia_socket pcmcia_socket0: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
        pcmcia_socket pcmcia_socket0: cs: IO port probe 0xa00-0xaff: clean.
        pcmcia_socket pcmcia_socket0: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
        pcmcia_socket pcmcia_socket0: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
        pcmcia_socket pcmcia_socket0: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
        pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
        pcmcia_socket pcmcia_socket0: cs: memory probe 0x0d0000-0x0dffff: clean.
        pcmcia_socket pcmcia_socket0: cs: memory probe 0x0e0000-0x0effff: clean.
        pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: clean.
        pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: clean.
        pcmcia_socket pcmcia_socket1: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
        pcmcia_socket pcmcia_socket1: cs: IO port probe 0xa00-0xaff: clean.
        pcmcia_socket pcmcia_socket1: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
        pcmcia_socket pcmcia_socket1: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
        pcmcia_socket pcmcia_socket1: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
        pcmcia_socket pcmcia_socket1: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
        pcmcia_socket pcmcia_socket1: cs: memory probe 0x0d0000-0x0dffff: clean.
        pcmcia_socket pcmcia_socket1: cs: memory probe 0x0e0000-0x0effff: clean.
        pcmcia_socket pcmcia_socket1: cs: memory probe 0x60000000-0x60ffffff: clean.
        pcmcia_socket pcmcia_socket1: cs: memory probe 0xa0000000-0xa0ffffff: clean.
        pcmcia_socket pcmcia_socket1: cs: memory probe 0x0cc000-0x0effff: excluding 0xe0000-0xeffff
        pcmcia_socket pcmcia_socket1: cs: unable to map card memory!
      
      If filtering out the first 1MB is reverted, everything works as expected.
      Tested-by: default avatarRobert Resch <fli4l@robert.reschpara.de>
      Signed-off-by: default avatarChristoph Schulz <develop@kristov.de>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      42dbd25a
    • Simone Gotti's avatar
      ACPI / PCI: Fix sysfs acpi_index and label errors · 3b7d7354
      Simone Gotti authored
      commit dcfa9be8 upstream.
      
      Fix errors in handling "device label" _DSM return values.
      
      If _DSM returns a Unicode string, the ACPI type is ACPI_TYPE_BUFFER, not
      ACPI_TYPE_STRING.  Fix dsm_label_utf16s_to_utf8s() to convert UTF-16 from
      acpi_object->buffer instead of acpi_object->string.
      
      Prior to v3.14, we accepted Unicode labels (ACPI_TYPE_BUFFER return
      values).  But after 1d0fcef7, we accepted only ASCII (ACPI_TYPE_STRING)
      (and we incorrectly tried to convert those ASCII labels from UTF-16 to
      UTF-8).
      
      Rejecting Unicode labels made us return -EPERM when reading sysfs
      "acpi_index" or "label" files, which in turn caused on-board network
      interfaces on a Dell PowerEdge E420 to be renamed (by udev net_id internal)
      from eno1/eno2 to enp2s0f0/enp2s0f1.
      
      Fix this by accepting either ACPI_TYPE_STRING (and treating it as ASCII) or
      ACPI_TYPE_BUFFER (and converting from UTF-16 to UTF-8).
      
      [bhelgaas: changelog]
      Fixes: 1d0fcef7 ("ACPI / PCI: replace open-coded _DSM code with helper functions")
      Signed-off-by: default avatarSimone Gotti <simone.gotti@gmail.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarJiang Liu <jiang.liu@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b7d7354
    • Myron Stowe's avatar
      PCI: pciehp: Clear Data Link Layer State Changed during init · 5de9628c
      Myron Stowe authored
      commit 0d25d35c upstream.
      
      During PCIe hot-plug initialization - pciehp_probe() - data structures
      related to slot capabilities are set up.  As part of this set up, ISRs are
      put in place to handle slot events and all event bits are cleared out.
      
      This patch adds the Data Link Layer State Changed (PCI_EXP_SLTSTA_DLLSC)
      Slot Status bit to the event bits that are cleared out during
      initialization.
      
      If the BIOS doesn't clear DLLSC before handoff to the OS, pciehp notices
      that it's set and interprets it as a new Link Up event, which results in
      spurious messages:
      
        pciehp 0000:82:04.0:pcie24: slot(4): Link Up event
        pciehp 0000:82:04.0:pcie24: Device 0000:83:00.0 already exists at 0000:83:00, cannot hot-add
        pciehp 0000:82:04.0:pcie24: Cannot add device at 0000:83:00
      
      Prior to e48f1b67 ("PCI: pciehp: Use link change notifications for
      hot-plug and removal"), pciehp ignored DLLSC.
      
      Reference:
        PCI-SIG.  PCI Express Base Specification Revision 4.0 Version 0.3
        (PCI-SIG, 2014): 7.8.11. Slot Status Register (Offset 1Ah).
      
      [bhelgaas: add e48f1b67 ref and stable tag]
      Fixes: e48f1b67 ("PCI: pciehp: Use link change notifications for hot-plug and removal")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=79611Signed-off-by: default avatarMyron Stowe <myron.stowe@redhat.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5de9628c
    • Guo Chao's avatar
      PCI: Keep original resource if we fail to expand it · 36004973
      Guo Chao authored
      commit c3337708 upstream.
      
      If we have space assigned to a resource, we try to expand the resource
      (e.g., to accommodate SR-IOV resources), and the expansion attempt fails,
      we should keep the original assignment.
      
      After bd064f0a ("PCI: Mark resources as IORESOURCE_UNSET if we can't
      assign them"), we left the resource marked IORESOURCE_UNSET when the
      expansion failed, even if it had originally been set.  That caused errors
      like this:
      
        pci 0003:00:00.0: can't enable device: BAR 15 [mem size 0x0c000000 64bit pref] not assigned
        pci 0003:00:00.0: Error enabling bridge (-22), continuing
      
      Fix this by restoring the original flags when reassignment fails.
      
      [bhelgaas: reworked to simplify, changelog]
      Fixes: bd064f0a ("PCI: Mark resources as IORESOURCE_UNSET if we can't assign them")
      Signed-off-by: default avatarGuo Chao <yan@linux.vnet.ibm.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      36004973
    • Vidya Sagar's avatar
      PCI: Configure ASPM when enabling device · 3ade630f
      Vidya Sagar authored
      commit 1f6ae47e upstream.
      
      We can't do ASPM configuration at enumeration-time because enabling it
      makes some defective hardware unresponsive, even if ASPM is disabled later
      (see 41cd766b ("PCI: Don't enable aspm before drivers have had a chance
      to veto it").  Therefore, we have to do it after a driver claims the
      device.
      
      We previously configured ASPM in pci_set_power_state(), but that's not a
      very good place because it's not really related to setting the PCI device
      power state, and doing it there means:
      
        - We incorrectly skipped ASPM config when setting a device that's
          already in D0 to D0.
      
        - We unnecessarily configured ASPM when setting a device to a low-power
          state (the ASPM feature only applies when the device is in D0).
      
        - We unnecessarily configured ASPM when called from a .resume() method
          (ASPM configuration needs to be restored during resume, but
          pci_restore_pcie_state() should already do this).
      
      Move ASPM configuration from pci_set_power_state() to
      do_pci_enable_device() so we do it when a driver enables a device.
      
      [bhelgaas: changelog]
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=79621
      Fixes: db288c9c ("PCI / PM: restore the original behavior of pci_set_power_state()")
      Suggested-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarVidya Sagar <sagar.tv@gmail.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ade630f
    • Alex Deucher's avatar
      drm/radeon: add additional SI pci ids · 8912590e
      Alex Deucher authored
      commit 37dbeab7 upstream.
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8912590e
    • Alex Deucher's avatar
      drm/radeon: add new bonaire pci ids · b7561c9e
      Alex Deucher authored
      commit 5fc540ed upstream.
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b7561c9e
    • Alex Deucher's avatar
    • Theodore Ts'o's avatar
      ext4: fix BUG_ON in mb_free_blocks() · 08b9d6b9
      Theodore Ts'o authored
      commit c99d1e6e upstream.
      
      If we suffer a block allocation failure (for example due to a memory
      allocation failure), it's possible that we will call
      ext4_discard_allocated_blocks() before we've actually allocated any
      blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
      be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
      triggering the BUG_ON on mb_free_blocks():
      
      	BUG_ON(last >= (sb->s_blocksize << 3));
      
      Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
      is zero.
      
      Also fix a missing ext4_mb_unload_buddy() call in
      ext4_discard_allocated_blocks().
      
      Google-Bug-Id: 16844242
      
      Fixes: 86f0afd4Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08b9d6b9
    • Michael S. Tsirkin's avatar
      kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601) · 35df08d6
      Michael S. Tsirkin authored
      commit 350b8bdd upstream.
      
      The third parameter of kvm_iommu_put_pages is wrong,
      It should be 'gfn - slot->base_gfn'.
      
      By making gfn very large, malicious guest or userspace can cause kvm to
      go to this error path, and subsequently to pass a huge value as size.
      Alternatively if gfn is small, then pages would be pinned but never
      unpinned, causing host memory leak and local DOS.
      
      Passing a reasonable but large value could be the most dangerous case,
      because it would unpin a page that should have stayed pinned, and thus
      allow the device to DMA into arbitrary memory.  However, this cannot
      happen because of the condition that can trigger the error:
      
      - out of memory (where you can't allocate even a single page)
        should not be possible for the attacker to trigger
      
      - when exceeding the iommu's address space, guest pages after gfn
        will also exceed the iommu's address space, and inside
        kvm_iommu_put_pages() the iommu_iova_to_phys() will fail.  The
        page thus would not be unpinned at all.
      Reported-by: default avatarJack Morgenstein <jackm@mellanox.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      35df08d6
    • Paolo Bonzini's avatar
      Revert "KVM: x86: Increase the number of fixed MTRR regs to 10" · 836251bf
      Paolo Bonzini authored
      commit 0d234daf upstream.
      
      This reverts commit 682367c4,
      which causes 32-bit SMP Windows 7 guests to panic.
      
      SeaBIOS has a limit on the number of MTRRs that it can handle,
      and this patch exceeded the limit.  Better revert it.
      Thanks to Nadav Amit for debugging the cause.
      Reported-by: default avatarWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      836251bf
    • Wanpeng Li's avatar
      KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use · f5c694cf
      Wanpeng Li authored
      commit 56cc2406 upstream.
      
      After commit 77b0f5d6 (KVM: nVMX: Ack and write vector info to intr_info
      if L1 asks us to), "Acknowledge interrupt on exit" behavior can be
      emulated. To do so, KVM will ask the APIC for the interrupt vector if
      during a nested vmexit if VM_EXIT_ACK_INTR_ON_EXIT is set.  With APICv,
      kvm_get_apic_interrupt would return -1 and give the following WARNING:
      
      Call Trace:
       [<ffffffff81493563>] dump_stack+0x49/0x5e
       [<ffffffff8103f0eb>] warn_slowpath_common+0x7c/0x96
       [<ffffffffa059709a>] ? nested_vmx_vmexit+0xa4/0x233 [kvm_intel]
       [<ffffffff8103f11a>] warn_slowpath_null+0x15/0x17
       [<ffffffffa059709a>] nested_vmx_vmexit+0xa4/0x233 [kvm_intel]
       [<ffffffffa0594295>] ? nested_vmx_exit_handled+0x6a/0x39e [kvm_intel]
       [<ffffffffa0537931>] ? kvm_apic_has_interrupt+0x80/0xd5 [kvm]
       [<ffffffffa05972ec>] vmx_check_nested_events+0xc3/0xd3 [kvm_intel]
       [<ffffffffa051ebe9>] inject_pending_event+0xd0/0x16e [kvm]
       [<ffffffffa051efa0>] vcpu_enter_guest+0x319/0x704 [kvm]
      
      To fix this, we cannot rely on the processor's virtual interrupt delivery,
      because "acknowledge interrupt on exit" must only update the virtual
      ISR/PPR/IRR registers (and SVI, which is just a cache of the virtual ISR)
      but it should not deliver the interrupt through the IDT.  Thus, KVM has
      to deliver the interrupt "by hand", similar to the treatment of EOI in
      commit fc57ac2c (KVM: lapic: sync highest ISR to hardware apic on
      EOI, 2014-05-14).
      
      The patch modifies kvm_cpu_get_interrupt to always acknowledge an
      interrupt; there are only two callers, and the other is not affected
      because it is never reached with kvm_apic_vid_enabled() == true.  Then it
      modifies apic_set_isr and apic_clear_irr to update SVI and RVI in addition
      to the registers.
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Suggested-by: default avatar"Zhang, Yang Z" <yang.z.zhang@intel.com>
      Tested-by: default avatarLiu, RongrongX <rongrongx.liu@intel.com>
      Tested-by: default avatarFelipe Reyes <freyes@suse.com>
      Fixes: 77b0f5d6Signed-off-by: default avatarWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5c694cf
    • Alexey Kardashevskiy's avatar
      KVM: PPC: Book3S: Fix LPCR one_reg interface · d376ef46
      Alexey Kardashevskiy authored
      commit a0840240 upstream.
      
      Unfortunately, the LPCR got defined as a 32-bit register in the
      one_reg interface.  This is unfortunate because KVM allows userspace
      to control the DPFD (default prefetch depth) field, which is in the
      upper 32 bits.  The result is that DPFD always get set to 0, which
      reduces performance in the guest.
      
      We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID,
      since that would break existing userspace binaries.  Instead we define
      a new KVM_REG_PPC_LPCR_64 id which is 64-bit.  Userspace can still use
      the old KVM_REG_PPC_LPCR id, but it now only modifies those fields in
      the bottom 32 bits that userspace can modify (ILE, TC and AIL).
      If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD
      as well.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d376ef46
    • Christian Borntraeger's avatar
      KVM: s390/mm: Fix page table locking vs. split pmd lock · dd713bdf
      Christian Borntraeger authored
      commit 55e4283c upstream.
      
      commit ec66ad66 (s390/mm: enable
      split page table lock for PMD level) activated the split pmd lock
      for s390. Turns out that we missed one place: We also have to take
      the pmd lock instead of the page table lock when we reallocate the
      page tables (==> changing entries in the PMD) during sie enablement.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd713bdf
    • Paolo Bonzini's avatar
      KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table · e54e2690
      Paolo Bonzini authored
      commit 0f6c0a74 upstream.
      
      Currently, the EOI exit bitmap (used for APICv) does not include
      interrupts that are masked.  However, this can cause a bug that manifests
      as an interrupt storm inside the guest.  Alex Williamson reported the
      bug and is the one who really debugged this; I only wrote the patch. :)
      
      The scenario involves a multi-function PCI device with OHCI and EHCI
      USB functions and an audio function, all assigned to the guest, where
      both USB functions use legacy INTx interrupts.
      
      As soon as the guest boots, interrupts for these devices turn into an
      interrupt storm in the guest; the host does not see the interrupt storm.
      Basically the EOI path does not work, and the guest continues to see the
      interrupt over and over, even after it attempts to mask it at the APIC.
      The bug is only visible with older kernels (RHEL6.5, based on 2.6.32
      with not many changes in the area of APIC/IOAPIC handling).
      
      Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ)
      on in the eoi_exit_bitmap and TMR, and things then work.  What happens
      is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap.
      It does not have set bit 59 because the RTE was masked, so the IOAPIC
      never sees the EOI and the interrupt continues to fire in the guest.
      
      My guess was that the guest is masking the interrupt in the redirection
      table in the interrupt routine, i.e. while the interrupt is set in a
      LAPIC's ISR, The simplest fix is to ignore the masking state, we would
      rather have an unnecessary exit rather than a missed IRQ ACK and anyway
      IOAPIC interrupts are not as performance-sensitive as for example MSIs.
      Alex tested this patch and it fixed his bug.
      
      [Thanks to Alex for his precise description of the problem
       and initial debugging effort.  A lot of the text above is
       based on emails exchanged with him.]
      Reported-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Tested-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e54e2690
    • Nadav Amit's avatar
      KVM: x86: Inter-privilege level ret emulation is not implemeneted · 540b55f9
      Nadav Amit authored
      commit 9e8919ae upstream.
      
      Return unhandlable error on inter-privilege level ret instruction.  This is
      since the current emulation does not check the privilege level correctly when
      loading the CS, and does not pop RSP/SS as needed.
      Signed-off-by: default avatarNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      540b55f9
    • Steven Rostedt's avatar
      debugfs: Fix corrupted loop in debugfs_remove_recursive · 306ccdfd
      Steven Rostedt authored
      commit 485d4402 upstream.
      
      [ I'm currently running my tests on it now, and so far, after a few
       hours it has yet to blow up. I'll run it for 24 hours which it never
       succeeded in the past. ]
      
      The tracing code has a way to make directories within the debugfs file
      system as well as deleting them using mkdir/rmdir in the instance
      directory. This is very limited in functionality, such as there is
      no renames, and the parent directory "instance" can not be modified.
      The tracing code creates the instance directory from the debugfs code
      and then replaces the dentry->d_inode->i_op with its own to allow
      for mkdir/rmdir to work.
      
      When these are called, the d_entry and inode locks need to be released
      to call the instance creation and deletion code. That code has its own
      accounting and locking to serialize everything to prevent multiple
      users from causing harm. As the parent "instance" directory can not
      be modified this simplifies things.
      
      I created a stress test that creates several threads that randomly
      creates and deletes directories thousands of times a second. The code
      stood up to this test and I submitted it a while ago.
      
      Recently I added a new test that adds readers to the mix. While the
      instance directories were being added and deleted, readers would read
      from these directories and even enable tracing within them. This test
      was able to trigger a bug:
      
       general protection fault: 0000 [#1] PREEMPT SMP
       Modules linked in: ...
       CPU: 3 PID: 17789 Comm: rmdir Tainted: G        W     3.15.0-rc2-test+ #41
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
       task: ffff88003786ca60 ti: ffff880077018000 task.ti: ffff880077018000
       RIP: 0010:[<ffffffff811ed5eb>]  [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
       RSP: 0018:ffff880077019df8  EFLAGS: 00010246
       RAX: 0000000000000002 RBX: ffff88006f0fe490 RCX: 0000000000000000
       RDX: dead000000100058 RSI: 0000000000000246 RDI: ffff88003786d454
       RBP: ffff88006f0fe640 R08: 0000000000000628 R09: 0000000000000000
       R10: 0000000000000628 R11: ffff8800795110a0 R12: ffff88006f0fe640
       R13: ffff88006f0fe640 R14: ffffffff81817d0b R15: ffffffff818188b7
       FS:  00007ff13ae24700(0000) GS:ffff88007d580000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000003054ec7be0 CR3: 0000000076d51000 CR4: 00000000000007e0
       Stack:
        ffff88007a41ebe0 dead000000100058 00000000fffffffe ffff88006f0fe640
        0000000000000000 ffff88006f0fe678 ffff88007a41ebe0 ffff88003793a000
        00000000fffffffe ffffffff810bde82 ffff88006f0fe640 ffff88007a41eb28
       Call Trace:
        [<ffffffff810bde82>] ? instance_rmdir+0x15b/0x1de
        [<ffffffff81132e2d>] ? vfs_rmdir+0x80/0xd3
        [<ffffffff81132f51>] ? do_rmdir+0xd1/0x139
        [<ffffffff8124ad9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
        [<ffffffff814fea62>] ? system_call_fastpath+0x16/0x1b
       Code: fe ff ff 48 8d 75 30 48 89 df e8 c9 fd ff ff 85 c0 75 13 48 c7 c6 b8 cc d2 81 48 c7 c7 b0 cc d2 81 e8 8c 7a f5 ff 48 8b 54 24 08 <48> 8b 82 a8 00 00 00 48 89 d3 48 2d a8 00 00 00 48 89 44 24 08
       RIP  [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
        RSP <ffff880077019df8>
      
      It took a while, but every time it triggered, it was always in the
      same place:
      
      	list_for_each_entry_safe(child, next, &parent->d_subdirs, d_u.d_child) {
      
      Where the child->d_u.d_child seemed to be corrupted.  I added lots of
      trace_printk()s to see what was wrong, and sure enough, it was always
      the child's d_u.d_child field. I looked around to see what touches
      it and noticed that in __dentry_kill() which calls dentry_free():
      
      static void dentry_free(struct dentry *dentry)
      {
      	/* if dentry was never visible to RCU, immediate free is OK */
      	if (!(dentry->d_flags & DCACHE_RCUACCESS))
      		__d_free(&dentry->d_u.d_rcu);
      	else
      		call_rcu(&dentry->d_u.d_rcu, __d_free);
      }
      
      I also noticed that __dentry_kill() unlinks the child->d_u.child
      under the parent->d_lock spin_lock.
      
      Looking back at the loop in debugfs_remove_recursive() it never takes the
      parent->d_lock to do the list walk. Adding more tracing, I was able to
      prove this was the issue:
      
       ftrace-t-15385   1.... 246662024us : dentry_kill <ffffffff81138b91>: free ffff88006d573600
          rmdir-15409   2.... 246662024us : debugfs_remove_recursive <ffffffff811ec7e5>: child=ffff88006d573600 next=dead000000100058
      
      The dentry_kill freed ffff88006d573600 just as the remove recursive was walking
      it.
      
      In order to fix this, the list walk needs to be modified a bit to take
      the parent->d_lock. The safe version is no longer necessary, as every
      time we remove a child, the parent->d_lock must be released and the
      list walk must start over. Each time a child is removed, even though it
      may still be on the list, it should be skipped by the first check
      in the loop:
      
      		if (!debugfs_positive(child))
      			continue;
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      306ccdfd