1. 11 Apr, 2016 38 commits
    • Dave Jones's avatar
      x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt() · f4790180
      Dave Jones authored
      commit 7834c103 upstream.
      
      Since 4.4, I've been able to trigger this occasionally:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.5.0-rc7-think+ #3 Not tainted
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/20160315012054.GA17765@codemonkey.org.ukSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      
      -------------------------------
      ./arch/x86/include/asm/msr-trace.h:47 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from idle CPU!
      rcu_scheduler_active = 1, debug_locks = 1
      RCU used illegally from extended quiescent state!
      no locks held by swapper/3/0.
      
      stack backtrace:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.5.0-rc7-think+ #3
       ffffffff92f821e0 1f3e5c340597d7fc ffff880468e07f10 ffffffff92560c2a
       ffff880462145280 0000000000000001 ffff880468e07f40 ffffffff921376a6
       ffffffff93665ea0 0000cc7c876d28da 0000000000000005 ffffffff9383dd60
      Call Trace:
       <IRQ>  [<ffffffff92560c2a>] dump_stack+0x67/0x9d
       [<ffffffff921376a6>] lockdep_rcu_suspicious+0xe6/0x100
       [<ffffffff925ae7a7>] do_trace_write_msr+0x127/0x1a0
       [<ffffffff92061c83>] native_apic_msr_eoi_write+0x23/0x30
       [<ffffffff92054408>] smp_trace_call_function_interrupt+0x38/0x360
       [<ffffffff92d1ca60>] trace_call_function_interrupt+0x90/0xa0
       <EOI>  [<ffffffff92ac5124>] ? cpuidle_enter_state+0x1b4/0x520
      
      Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
      tells the RCU susbstems to end the extended quiescent state, so that the
      following trace call in ack_APIC_irq() works correctly.
      Suggested-by: default avatarAndi Kleen <ak@linux.intel.com>
      Fixes: 4787c368 "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()"
      Signed-off-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      f4790180
    • Bjorn Helgaas's avatar
      PCI: Disable IO/MEM decoding for devices with non-compliant BARs · b82df119
      Bjorn Helgaas authored
      commit b84106b4 upstream.
      
      The PCI config header (first 64 bytes of each device's config space) is
      defined by the PCI spec so generic software can identify the device and
      manage its usage of I/O, memory, and IRQ resources.
      
      Some non-spec-compliant devices put registers other than BARs where the
      BARs should be.  When the PCI core sizes these "BARs", the reads and writes
      it does may have unwanted side effects, and the "BAR" may appear to
      describe non-sensical address space.
      
      Add a flag bit to mark non-compliant devices so we don't touch their BARs.
      Turn off IO/MEM decoding to prevent the devices from consuming address
      space, since we can't read the BARs to find out what that address space
      would be.
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Tested-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b82df119
    • Dan Carpenter's avatar
      EDAC, amd64_edac: Shift wrapping issue in f1x_get_norm_dct_addr() · 4f9a76cd
      Dan Carpenter authored
      commit 6f3508f6 upstream.
      
      dct_sel_base_off is declared as a u64 but we're only using the lower 32
      bits because of a shift wrapping bug. This can possibly truncate the
      upper 16 bits of DctSelBaseOffset[47:26], causing us to misdecode the CS
      row.
      
      Fixes: c8e518d5 ('amd64_edac: Sanitize f10_get_base_addr_offset')
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20160120095451.GB19898@mwandaSigned-off-by: default avatarBorislav Petkov <bp@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4f9a76cd
    • Paolo Bonzini's avatar
      KVM: VMX: avoid guest hang on invalid invept instruction · 8ef8a749
      Paolo Bonzini authored
      commit 2849eb4f upstream.
      
      A guest executing an invalid invept instruction would hang
      because the instruction pointer was not updated.
      
      Fixes: bfd0a56bReviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8ef8a749
    • Paolo Bonzini's avatar
      KVM: fix spin_lock_init order on x86 · 9a9bec2c
      Paolo Bonzini authored
      commit e9ad4ec8 upstream.
      
      Moving the initialization earlier is needed in 4.6 because
      kvm_arch_init_vm is now using mmu_lock, causing lockdep to
      complain:
      
      [  284.440294] INFO: trying to register non-static key.
      [  284.445259] the code is fine but needs lockdep annotation.
      [  284.450736] turning off the locking correctness validator.
      ...
      [  284.528318]  [<ffffffff810aecc3>] lock_acquire+0xd3/0x240
      [  284.533733]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.541467]  [<ffffffff81715581>] _raw_spin_lock+0x41/0x80
      [  284.546960]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.554707]  [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.562281]  [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm]
      [  284.568381]  [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm]
      [  284.574740]  [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm]
      
      However, it also helps fixing a preexisting problem, which is why this
      patch is also good for stable kernels: kvm_create_vm was incrementing
      current->mm->mm_count but not decrementing it at the out_err label (in
      case kvm_init_mmu_notifier failed).  The new initialization order makes
      it possible to add the required mmdrop without adding a new error label.
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9a9bec2c
    • Radim Krčmář's avatar
      KVM: i8254: change PIT discard tick policy · 0466b503
      Radim Krčmář authored
      commit 7dd0fdff upstream.
      
      Discard policy uses ack_notifiers to prevent injection of PIT interrupts
      before EOI from the last one.
      
      This patch changes the policy to always try to deliver the interrupt,
      which makes a difference when its vector is in ISR.
      Old implementation would drop the interrupt, but proposed one injects to
      IRR, like real hardware would.
      
      The old policy breaks legacy NMI watchdogs, where PIT is used through
      virtual wire (LVT0): PIT never sends an interrupt before receiving EOI,
      thus a guest deadlock with disabled interrupts will stop NMIs.
      
      Note that NMI doesn't do EOI, so PIT also had to send a normal interrupt
      through IOAPIC.  (KVM's PIT is deeply rotten and luckily not used much
      in modern systems.)
      
      Even though there is a chance of regressions, I think we can fix the
      LVT0 NMI bug without introducing a new tick policy.
      Reported-by: default avatarYuki Shibuya <shibuya.yk@ncos.nec.co.jp>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0466b503
    • Andy Lutomirski's avatar
      x86/iopl/64: Properly context-switch IOPL on Xen PV · a2a4370a
      Andy Lutomirski authored
      commit b7a58459 upstream.
      
      On Xen PV, regs->flags doesn't reliably reflect IOPL and the
      exit-to-userspace code doesn't change IOPL.  We need to context
      switch it manually.
      
      I'm doing this without going through paravirt because this is
      specific to Xen PV.  After the dust settles, we can merge this with
      the 32-bit code, tidy up the iopl syscall implementation, and remove
      the set_iopl pvop entirely.
      
      Fixes XSA-171.
      Reviewewd-by: default avatarJan Beulich <JBeulich@suse.com>
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/693c3bd7aeb4d3c27c92c622b7d0f554a458173c.1458162709.git.luto@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      [ kamal: backport to 3.19-stable: no X86_FEATURE_XENPV so just call
        xen_pv_domain() directly ]
      Acked-by: default avatarAndy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a2a4370a
    • Charles_Rose@Dell.com's avatar
      ahci: Add Device ID for Intel Sunrise Point PCH · ebb69171
      Charles_Rose@Dell.com authored
      commit c5967b79 upstream.
      
      This patch adds missing AHCI RAID SATA Device IDs for the Intel Sunrise
      Point PCH.
      Signed-off-by: default avatarNanda Kishore Chinna <nanda_kishore_chinna@dell.com>
      Signed-off-by: default avatarCharles Rose <charles_rose@dell.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      ebb69171
    • Benjamin Poirier's avatar
      mld, igmp: Fix reserved tailroom calculation · 6af792bb
      Benjamin Poirier authored
      commit 1837b2e2 upstream.
      
      The current reserved_tailroom calculation fails to take hlen and tlen into
      account.
      
      skb:
      [__hlen__|__data____________|__tlen___|__extra__]
      ^                                               ^
      head                                            skb_end_offset
      
      In this representation, hlen + data + tlen is the size passed to alloc_skb.
      "extra" is the extra space made available in __alloc_skb because of
      rounding up by kmalloc. We can reorder the representation like so:
      
      [__hlen__|__data____________|__extra__|__tlen___]
      ^                                               ^
      head                                            skb_end_offset
      
      The maximum space available for ip headers and payload without
      fragmentation is min(mtu, data + extra). Therefore,
      reserved_tailroom
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
      
      Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
      and we can see that hlen and tlen are not taken into account.
      
      The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
      	reserved_tailroom = skb_tailroom - mtu
      else:
      	reserved_tailroom = tlen
      
      Depending on hlen, tlen, mtu and the number of multicast address records,
      the current code may output skbs that have less tailroom than
      dev->needed_tailroom or it may output more skbs than needed because not all
      space available is used.
      
      Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
      Signed-off-by: default avatarBenjamin Poirier <bpoirier@suse.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6af792bb
    • Daniel Borkmann's avatar
      ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · e98ac086
      Daniel Borkmann authored
      commit 4c672e4b upstream.
      
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size, since commit 72e09ad1 ("ipv6: avoid high order allocations")
      we have changed the limit in order to be less likely to fail.
      
      However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
      macros, which determine if we may end up doing an skb_put() for
      adding another record. To avoid possible fragmentation, we check
      the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
      assumption as the actual max allocation size can be much smaller.
      
      The IGMP case doesn't have this issue as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make it fit into the MTU and use
      skb_availroom() helper instead. This also allows to get rid of
      igmp_skb_size().
      Reported-by: default avatarWei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e98ac086
    • Jiri Slaby's avatar
      net/ipv6: fix DEVCONF_ constants · 76f4d42f
      Jiri Slaby authored
      In 3.12 commit e16f5378 (net/ipv6: add
      sysctl option accept_ra_min_hop_limit), upstream commit
      8013d1d7, we added
      DEVCONF_USE_OIF_ADDRS_ONLY and DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT
      constants into <linux/ipv6.h>. But they have different values to
      upstream because some values were added in upstream and we did not
      backport them.
      
      So we have:
              DEVCONF_SUPPRESS_FRAG_NDISC,
      +       DEVCONF_USE_OIF_ADDRS_ONLY,
      +       DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT,
              DEVCONF_MAX
      And upstream has:
              DEVCONF_SUPPRESS_FRAG_NDISC,
      +       DEVCONF_ACCEPT_RA_FROM_LOCAL,
      +       DEVCONF_USE_OPTIMISTIC,
      +       DEVCONF_ACCEPT_RA_MTU,
      +       DEVCONF_STABLE_SECRET,
      +       DEVCONF_USE_OIF_ADDRS_ONLY,
      +       DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT,
              DEVCONF_MAX
      
      Now, our DEVCONF_USE_OIF_ADDRS_ONLY corresponds to
      DEVCONF_USE_OIF_ADDRS_ONLY-4 == DEVCONF_ACCEPT_RA_FROM_LOCAL from
      upstream. Similarly the other constant.
      
      Fix that by simply defining the missing constants to make the values
      equal.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Reported-by: default avatarYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Cc: Luis Henriques <luis.henriques@canonical.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      76f4d42f
    • Jeff Layton's avatar
      nfs: fix high load average due to callback thread sleeping · 2e9b2422
      Jeff Layton authored
      commit 5d05e54a upstream.
      
      Chuck pointed out a problem that crept in with commit 6ffa30d3 (nfs:
      don't call blocking operations while !TASK_RUNNING). Linux counts tasks
      in uninterruptible sleep against the load average, so this caused the
      system's load average to be pinned at at least 1 when there was a
      NFSv4.1+ mount active.
      
      Not a huge problem, but it's probably worth fixing before we get too
      many complaints about it. This patch converts the code back to use
      TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop
      iteration. In practice no one should really be signalling this thread at
      all, so I think this is reasonably safe.
      
      With this change, there's also no need to game the hung task watchdog so
      we can also convert the schedule_timeout call back to a normal schedule.
      Reported-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Tested-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Fixes: commit 6ffa30d3 (“nfs: don't call blocking . . .”)
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2e9b2422
    • Ian Mitchell's avatar
      Fix kmalloc overflow in LPFC driver at large core count · d93cbb44
      Ian Mitchell authored
      commit c0365c06 upstream.
      
      This patch allows the LPFC to start up without a fatal kernel bug based
      on an exceeded KMALLOC_MAX_SIZE and a too large NR_CPU-based maskbits
      field. The bug was based on the number of CPU cores in a system.
      Using the get_cpu_mask() function declared in kernel/cpu.c allows the
      driver to load on the community kernel 4.2 RC1.
      
      Below is the kernel bug reproduced:
      
      8<--------------------------------------------------------------------
      2199382.828437 (    0.005216)| lpfc 0003:02:00.0: enabling device (0140 -> 0142)
      2199382.999272 (    0.170835)| ------------[ cut here ]------------
      2199382.999337 (    0.000065)| WARNING: CPU: 84 PID: 404 at mm/slab_common.c:653 kmalloc_slab+0x2f/0x89()
      2199383.004534 (    0.005197)| Modules linked in: lpfc(+) usbcore(+) mptctl scsi_transport_fc sg lpc_ich i2c_i801 usb_common tpm_tis mfd_core tpm acpi_cpufreq button scsi_dh_alua scsi_dh_rdacusbcore: registered new device driver usb
      2199383.020568 (    0.016034)|
      2199383.020581 (    0.000013)|  scsi_dh_hp_sw scsi_dh_emc scsi_dh gru thermal sata_nv processor piix fan thermal_sysehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
      2199383.035288 (    0.014707)|
      2199383.035306 (    0.000018)|  hwmon ata_piix
      2199383.035336 (    0.000030)| CPU: 84 PID: 404 Comm: kworker/84:0 Not tainted 3.18.0-rc2-gat-00106-ga7ca10f2-dirty #178
      2199383.047077 (    0.011741)| ehci-pci: EHCI PCI platform driver
      2199383.047134 (    0.000057)| Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
      2199383.056245 (    0.009111)| Workqueue: events work_for_cpu_fn
      2199383.066174 (    0.009929)|  000000000000028d ffff88eef827bbe8 ffffffff815a542f 000000000000028d
      2199383.069545 (    0.003371)|  ffffffff810ea142 ffff88eef827bc28 ffffffff8104365c ffff88eefe4006c8
      2199383.076214 (    0.006669)|  0000000000000000 00000000000080d0 0000000000000000 0000000000000004
      2199383.079213 (    0.002999)| Call Trace:
      2199383.084084 (    0.004871)|  [<ffffffff815a542f>] dump_stack+0x49/0x62
      2199383.087283 (    0.003199)|  [<ffffffff810ea142>] ? kmalloc_slab+0x2f/0x89
      2199383.091415 (    0.004132)|  [<ffffffff8104365c>] warn_slowpath_common+0x77/0x92
      2199383.095197 (    0.003782)|  [<ffffffff8104368c>] warn_slowpath_null+0x15/0x17
      2199383.103336 (    0.008139)|  [<ffffffff810ea142>] kmalloc_slab+0x2f/0x89
      2199383.107082 (    0.003746)|  [<ffffffff8110fd9e>] __kmalloc+0x13/0x16a
      2199383.112531 (    0.005449)|  [<ffffffffa01a8ed9>] lpfc_pci_probe_one_s4+0x105b/0x1644 [lpfc]
      2199383.115316 (    0.002785)|  [<ffffffff81302b92>] ? pci_bus_read_config_dword+0x75/0x87
      2199383.123431 (    0.008115)|  [<ffffffffa01a951f>] lpfc_pci_probe_one+0x5d/0xcb5 [lpfc]
      2199383.127364 (    0.003933)|  [<ffffffff81497119>] ? dbs_check_cpu+0x168/0x177
      2199383.136438 (    0.009074)|  [<ffffffff81496fa5>] ? gov_queue_work+0xb4/0xc0
      2199383.140407 (    0.003969)|  [<ffffffff8130b2a1>] local_pci_probe+0x1e/0x52
      2199383.143105 (    0.002698)|  [<ffffffff81052c47>] work_for_cpu_fn+0x13/0x1b
      2199383.147315 (    0.004210)|  [<ffffffff81054965>] process_one_work+0x222/0x35e
      2199383.151379 (    0.004064)|  [<ffffffff81054e76>] worker_thread+0x3d5/0x46e
      2199383.159402 (    0.008023)|  [<ffffffff81054aa1>] ? process_one_work+0x35e/0x35e
      2199383.163097 (    0.003695)|  [<ffffffff810599c6>] kthread+0xc8/0xd2
      2199383.167476 (    0.004379)|  [<ffffffff810598fe>] ? kthread_freezable_should_stop+0x5b/0x5b
      2199383.176434 (    0.008958)|  [<ffffffff815a8cac>] ret_from_fork+0x7c/0xb0
      2199383.180086 (    0.003652)|  [<ffffffff810598fe>] ? kthread_freezable_should_stop+0x5b/0x5b
      2199383.192333 (    0.012247)| ehci-pci 0000:00:1a.0: EHCI Host Controller
      -------------------------------------------------------------------->8
      
      The proposed solution was approved by James Smart at Emulex and tested
      on a UV2 machine with 6144 cores. With the fix, the LPFC module loads
      with no unwanted effects on the system.
      Signed-off-by: default avatarIan Mitchell <imitchell@sgi.com>
      Signed-off-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Suggested-by: default avatarRobert Elliot <elliott@hp.com>
      [james.smart: resolve unused variable warning]
      Signed-off-by: default avatarJames Smart <james.smart@avagotech.com>
      Signed-off-by: default avatarJames Bottomley <JBottomley@Odin.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d93cbb44
    • Markus Metzger's avatar
      perf, nmi: Fix unknown NMI warning · 71f4162a
      Markus Metzger authored
      commit a3ef2229 upstream.
      
      When using BTS on Core i7-4*, I get the below kernel warning.
      
      $ perf record -c 1 -e branches:u ls
      Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
       kernel:[  438.317893] Uhhuh. NMI received for unknown reason 31 on CPU 2.
      
      Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
       kernel:[  438.317920] Do you have a strange power saving mode enabled?
      
      Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
       kernel:[  438.317945] Dazed and confused, but trying to continue
      
      Make intel_pmu_handle_irq() take the full exit path when returning early.
      
      Cc: eranian@google.com
      Cc: peterz@infradead.org
      Cc: mingo@kernel.org
      Signed-off-by: default avatarMarkus Metzger <markus.t.metzger@intel.com>
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1392425048-5309-1-git-send-email-andi@firstfloor.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      71f4162a
    • Lukasz Odzioba's avatar
      hwmon: (coretemp) Increase limit of maximum core ID from 32 to 128. · 2b4faae2
      Lukasz Odzioba authored
      commit cc904f9c upstream.
      
      A new limit selected arbitrarily as power of two greater than
      required minimum for Xeon Phi processor (72 for Knights Landing).
      
      Currently driver is not able to handle cores with core ID greater than 32.
      Such attempt ends up with the following error in dmesg:
      coretemp coretemp.0: Adding Core XXX failed
      Signed-off-by: default avatarLukasz Odzioba <lukasz.odzioba@intel.com>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2b4faae2
    • Martin Schwidefsky's avatar
      s390/mm: four page table levels vs. fork · bf06b31b
      Martin Schwidefsky authored
      commit 3446c13b upstream.
      
      The fork of a process with four page table levels is broken since
      git commit 6252d702 "[S390] dynamic page tables."
      
      All new mm contexts are created with three page table levels and
      an asce limit of 4TB. If the parent has four levels dup_mmap will
      add vmas to the new context which are outside of the asce limit.
      The subsequent call to copy_page_range will walk the three level
      page table structure of the new process with non-zero pgd and pud
      indexes. This leads to memory clobbers as the pgd_index *and* the
      pud_index is added to the mm->pgd pointer without a pgd_deref
      in between.
      
      The init_new_context() function is selecting the number of page
      table levels for a new context. The function is used by mm_init()
      which in turn is called by dup_mm() and mm_alloc(). These two are
      used by fork() and exec(). The init_new_context() function can
      distinguish the two cases by looking at mm->context.asce_limit,
      for fork() the mm struct has been copied and the number of page
      table levels may not change. For exec() the mm_alloc() function
      set the new mm structure to zero, in this case a three-level page
      table is created as the temporary stack space is located at
      STACK_TOP_MAX = 4TB.
      
      This fixes CVE-2016-2143.
      Reported-by: default avatarMarcin Kościelnicki <koriakin@0x04.net>
      Reviewed-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bf06b31b
    • Johan Hovold's avatar
      USB: visor: fix null-deref at probe · d53a0262
      Johan Hovold authored
      commit cac9b50b upstream.
      
      Fix null-pointer dereference at probe should a (malicious) Treo device
      lack the expected endpoints.
      
      Specifically, the Treo port-setup hack was dereferencing the bulk-in and
      interrupt-in urbs without first making sure they had been allocated by
      core.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d53a0262
    • Wei Huang's avatar
      KVM: SVM: add rdmsr support for AMD event registers · 9112aa76
      Wei Huang authored
      commit dc9b2d93 upstream.
      
      Current KVM only supports RDMSR for K7_EVNTSEL0 and K7_PERFCTR0
      MSRs. Reading the rest MSRs will trigger KVM to inject #GP into
      guest VM. This causes a warning message "Failed to access perfctr
      msr (MSR c0010001 is ffffffffffffffff)" on AMD host. This patch
      adds RDMSR support for all K7_EVNTSELn and K7_PERFCTRn registers
      and thus supresses the warning message.
      Signed-off-by: default avatarWei Huang <wehuang@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9112aa76
    • Dirk Brandewie's avatar
      intel_pstate: Use del_timer_sync in intel_pstate_cpu_stop · b8c77bd5
      Dirk Brandewie authored
      commit c2294a2f upstream.
      
      Ensure that no timer callback is running since we are about to free
      the timer structure.  We cannot guarantee that the call back is called
      on the CPU where the timer is running.
      Reported-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarDirk Brandewie <dirk.j.brandewie@intel.com>
      Reviewed-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b8c77bd5
    • Alan Stern's avatar
      USB: fix invalid memory access in hub_activate() · a706ac40
      Alan Stern authored
      commit e50293ef upstream.
      
      Commit 8520f380 ("USB: change hub initialization sleeps to
      delayed_work") changed the hub_activate() routine to make part of it
      run in a workqueue.  However, the commit failed to take a reference to
      the usb_hub structure or to lock the hub interface while doing so.  As
      a result, if a hub is plugged in and quickly unplugged before the work
      routine can run, the routine will try to access memory that has been
      deallocated.  Or, if the hub is unplugged while the routine is
      running, the memory may be deallocated while it is in active use.
      
      This patch fixes the problem by taking a reference to the usb_hub at
      the start of hub_activate() and releasing it at the end (when the work
      is finished), and by locking the hub interface while the work routine
      is running.  It also adds a check at the start of the routine to see
      if the hub has already been disconnected, in which nothing should be
      done.
      Signed-off-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Reported-by: default avatarAlexandru Cornea <alexandru.cornea@intel.com>
      Tested-by: default avatarAlexandru Cornea <alexandru.cornea@intel.com>
      Fixes: 8520f380 ("USB: change hub initialization sleeps to delayed_work")
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a706ac40
    • Michal Hocko's avatar
      memcg: do not hang on OOM when killed by userspace OOM access to memory reserves · b6f358ab
      Michal Hocko authored
      commit d8dc595c upstream.
      
      Eric has reported that he can see task(s) stuck in memcg OOM handler
      regularly.  The only way out is to
      
      	echo 0 > $GROUP/memory.oom_control
      
      His usecase is:
      
      - Setup a hierarchy with memory and the freezer (disable kernel oom and
        have a process watch for oom).
      
      - In that memory cgroup add a process with one thread per cpu.
      
      - In one thread slowly allocate once per second I think it is 16M of ram
        and mlock and dirty it (just to force the pages into ram and stay
        there).
      
      - When oom is achieved loop:
        * attempt to freeze all of the tasks.
        * if frozen send every task SIGKILL, unfreeze, remove the directory in
          cgroupfs.
      
      Eric has then pinpointed the issue to be memcg specific.
      
      All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
      Those that have received fatal signal will bypass the charge and should
      continue on their way out.  The tricky part is that the exit path might
      trigger a page fault (e.g.  exit_robust_list), thus the memcg charge,
      while its memcg is still under OOM because nobody has released any charges
      yet.
      
      Unlike with the in-kernel OOM handler the exiting task doesn't get
      TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
      and falls to the memcg OOM again without any way out of it as there are no
      fatal signals pending anymore.
      
      This patch fixes the issue by checking PF_EXITING early in
      mem_cgroup_try_charge and bypass the charge same as if it had fatal
      signal pending or TIF_MEMDIE set.
      
      Normally exiting tasks (aka not killed) will bypass the charge now but
      this should be OK as the task is leaving and will release memory and
      increasing the memory pressure just to release it in a moment seems
      dubious wasting of cycles.  Besides that charges after exit_signals should
      be rare.
      
      I am bringing this patch again (rebased on the current mmotm tree). I
      hope we can move forward finally. If there is still an opposition then
      I would really appreciate a concurrent approach so that we can discuss
      alternatives.
      
      http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
      to the followup discussion when the patch has been dropped from the mmotm
      last time.
      Reported-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b6f358ab
    • Takashi Iwai's avatar
      ALSA: seq: Fix leak of pool buffer at concurrent writes · 448da408
      Takashi Iwai authored
      commit d99a36f4 upstream.
      
      When multiple concurrent writes happen on the ALSA sequencer device
      right after the open, it may try to allocate vmalloc buffer for each
      write and leak some of them.  It's because the presence check and the
      assignment of the buffer is done outside the spinlock for the pool.
      
      The fix is to move the check and the assignment into the spinlock.
      
      (The current implementation is suboptimal, as there can be multiple
       unnecessary vmallocs because the allocation is done before the check
       in the spinlock.  But the pool size is already checked beforehand, so
       this isn't a big problem; that is, the only possible path is the
       multiple writes before any pool assignment, and practically seen, the
       current coverage should be "good enough".)
      
      The issue was triggered by syzkaller fuzzer.
      
      Buglink: http://lkml.kernel.org/r/CACT4Y+bSzazpXNvtAr=WXaL8hptqjHwqEyFA+VN2AWEx=aurkg@mail.gmail.comReported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      448da408
    • Takashi Iwai's avatar
      ALSA: rawmidi: Make snd_rawmidi_transmit() race-free · d077a244
      Takashi Iwai authored
      commit 06ab3003 upstream.
      
      A kernel WARNING in snd_rawmidi_transmit_ack() is triggered by
      syzkaller fuzzer:
        WARNING: CPU: 1 PID: 20739 at sound/core/rawmidi.c:1136
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82999e2d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
       [<ffffffff81352089>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
       [<ffffffff813522b9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
       [<ffffffff84f80bd5>] snd_rawmidi_transmit_ack+0x275/0x400 sound/core/rawmidi.c:1136
       [<ffffffff84fdb3c1>] snd_virmidi_output_trigger+0x4b1/0x5a0 sound/core/seq/seq_virmidi.c:163
       [<     inline     >] snd_rawmidi_output_trigger sound/core/rawmidi.c:150
       [<ffffffff84f87ed9>] snd_rawmidi_kernel_write1+0x549/0x780 sound/core/rawmidi.c:1223
       [<ffffffff84f89fd3>] snd_rawmidi_write+0x543/0xb30 sound/core/rawmidi.c:1273
       [<ffffffff817b0323>] __vfs_write+0x113/0x480 fs/read_write.c:528
       [<ffffffff817b1db7>] vfs_write+0x167/0x4a0 fs/read_write.c:577
       [<     inline     >] SYSC_write fs/read_write.c:624
       [<ffffffff817b50a1>] SyS_write+0x111/0x220 fs/read_write.c:616
       [<ffffffff86336c36>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185
      
      Also a similar warning is found but in another path:
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82be2c0d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
       [<ffffffff81355139>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
       [<ffffffff81355369>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
       [<ffffffff8527e69a>] rawmidi_transmit_ack+0x24a/0x3b0 sound/core/rawmidi.c:1133
       [<ffffffff8527e851>] snd_rawmidi_transmit_ack+0x51/0x80 sound/core/rawmidi.c:1163
       [<ffffffff852d9046>] snd_virmidi_output_trigger+0x2b6/0x570 sound/core/seq/seq_virmidi.c:185
       [<     inline     >] snd_rawmidi_output_trigger sound/core/rawmidi.c:150
       [<ffffffff85285a0b>] snd_rawmidi_kernel_write1+0x4bb/0x760 sound/core/rawmidi.c:1252
       [<ffffffff85287b73>] snd_rawmidi_write+0x543/0xb30 sound/core/rawmidi.c:1302
       [<ffffffff817ba5f3>] __vfs_write+0x113/0x480 fs/read_write.c:528
       [<ffffffff817bc087>] vfs_write+0x167/0x4a0 fs/read_write.c:577
       [<     inline     >] SYSC_write fs/read_write.c:624
       [<ffffffff817bf371>] SyS_write+0x111/0x220 fs/read_write.c:616
       [<ffffffff86660276>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185
      
      In the former case, the reason is that virmidi has an open code
      calling snd_rawmidi_transmit_ack() with the value calculated outside
      the spinlock.   We may use snd_rawmidi_transmit() in a loop just for
      consuming the input data, but even there, there is a race between
      snd_rawmidi_transmit_peek() and snd_rawmidi_tranmit_ack().
      
      Similarly in the latter case, it calls snd_rawmidi_transmit_peek() and
      snd_rawmidi_tranmit_ack() separately without protection, so they are
      racy as well.
      
      The patch tries to address these issues by the following ways:
      - Introduce the unlocked versions of snd_rawmidi_transmit_peek() and
        snd_rawmidi_transmit_ack() to be called inside the explicit lock.
      - Rewrite snd_rawmidi_transmit() to be race-free (the former case).
      - Make the split calls (the latter case) protected in the rawmidi spin
        lock.
      
      Buglink: http://lkml.kernel.org/r/CACT4Y+YPq1+cYLkadwjWa5XjzF1_Vki1eHnVn-Lm0hzhSpu5PA@mail.gmail.com
      Buglink: http://lkml.kernel.org/r/CACT4Y+acG4iyphdOZx47Nyq_VHGbpJQK-6xNpiqUjaZYqsXOGw@mail.gmail.comReported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d077a244
    • John Allen's avatar
      drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64 · a95562da
      John Allen authored
      commit cb5490a5 upstream.
      
      Fix a bug where a kernel warning is triggered when performing a memory
      hotplug on ppc64.  This warning may also occur on any architecture that
      uses the memory_probe_store interface.
      
        WARNING: at drivers/base/memory.c:200
        CPU: 9 PID: 13042 Comm: systemd-udevd Not tainted 4.4.0-rc4-00113-g0bd0f1e6-dirty #7
        NIP [c00000000055e034] pages_correctly_reserved+0x134/0x1b0
        LR [c00000000055e7f8] memory_subsys_online+0x68/0x140
        Call Trace:
          memory_subsys_online+0x68/0x140
          device_online+0xb4/0x120
          store_mem_state+0xb0/0x180
          dev_attr_store+0x34/0x60
          sysfs_kf_write+0x64/0xa0
          kernfs_fop_write+0x17c/0x1e0
          __vfs_write+0x40/0x160
          vfs_write+0xb8/0x200
          SyS_write+0x60/0x110
          system_call+0x38/0xd0
      
      The warning is triggered because there is a udev rule that automatically
      tries to online memory after it has been added.  The udev rule varies
      from distro to distro, but will generally look something like:
      
        SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
      
      On any architecture that uses memory_probe_store to reserve memory, the
      udev rule will be triggered after the first section of the block is
      reserved and will subsequently attempt to online the entire block,
      interrupting the memory reservation process and causing the warning.
      This patch modifies memory_probe_store to add a block of memory with a
      single call to add_memory as opposed to looping through and adding each
      section individually.  A single call to add_memory is protected by the
      mem_hotplug mutex which will prevent the udev rule from onlining memory
      until the reservation of the entire block is complete.
      Signed-off-by: default avatarJohn Allen <jallen@linux.vnet.ibm.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a95562da
    • Yuval Mintz's avatar
      bnx2x: Add new device ids under the Qlogic vendor · 3edd2164
      Yuval Mintz authored
      commit 9c9a6524 upstream.
      
      This adds support for 3 new PCI device combinations -
      1077:16a1, 1077:16a4 and 1077:16ad.
      Signed-off-by: default avatarYuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3edd2164
    • Wang Shilong's avatar
      Btrfs: skip locking when searching commit root · a8a83b61
      Wang Shilong authored
      commit e84752d4 upstream.
      
      We won't change commit root, skip locking dance with commit root
      when walking backrefs, this can speed up btrfs send operations.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a8a83b61
    • Kirill Tkhai's avatar
      sched: Fix race between task_group and sched_task_group · 5f328493
      Kirill Tkhai authored
      commit eeb61e53 upstream.
      
      The race may happen when somebody is changing task_group of a forking task.
      Child's cgroup is the same as parent's after dup_task_struct() (there just
      memory copying). Also, cfs_rq and rt_rq are the same as parent's.
      
      But if parent changes its task_group before it's called cgroup_post_fork(),
      we do not reflect this situation on child. Child's cfs_rq and rt_rq remain
      the same, while child's task_group changes in cgroup_post_fork().
      
      To fix this we introduce fork() method, which calls sched_move_task() directly.
      This function changes sched_task_group on appropriate (also its logic has
      no problem with freshly created tasks, so we shouldn't introduce something
      special; we are able just to use it).
      
      Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456Signed-off-by: default avatarKirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhaiSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5f328493
    • Eric Sandeen's avatar
      xfs: allow inode allocations in post-growfs disk space · d1f8a33f
      Eric Sandeen authored
      commit 9de67c3b upstream.
      
      Today, if we perform an xfs_growfs which adds allocation groups,
      mp->m_maxagi is not properly updated when the growfs is complete.
      
      Therefore inodes will continue to be allocated only in the
      AGs which existed prior to the growfs, and the new space
      won't be utilized.
      
      This is because of this path in xfs_growfs_data_private():
      
      xfs_growfs_data_private
      	xfs_initialize_perag(mp, nagcount, &nagimax);
      		if (mp->m_flags & XFS_MOUNT_32BITINODES)
      			index = xfs_set_inode32(mp);
      		else
      			index = xfs_set_inode64(mp);
      
      		if (maxagi)
      			*maxagi = index;
      
      where xfs_set_inode* iterates over the (old) agcount in
      mp->m_sb.sb_agblocks, which has not yet been updated
      in the growfs path.  So "index" will be returned based on
      the old agcount, not the new one, and new AGs are not available
      for inode allocation.
      
      Fix this by explicitly passing the proper AG count (which
      xfs_initialize_perag() already has) down another level,
      so that xfs_set_inode* can make the proper decision about
      acceptable AGs for inode allocation in the potentially
      newly-added AGs.
      
      This has been broken since 3.7, when these two
      xfs_set_inode* functions were added in commit 2d2194f6.
      Prior to that, we looped over "agcount" not sb_agblocks
      in these calculations.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d1f8a33f
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Save the number of MSI-X entries to be copied later. · 8d676cb5
      Konrad Rzeszutek Wilk authored
      commit d159457b upstream.
      
      Commit 8135cf8b (xen/pciback: Save
      xen_pci_op commands before processing it) broke enabling MSI-X because
      it would never copy the resulting vectors into the response.  The
      number of vectors requested was being overwritten by the return value
      (typically zero for success).
      
      Save the number of vectors before processing the op, so the correct
      number of vectors are copied afterwards.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8d676cb5
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Save xen_pci_op commands before processing it · 16d5d05d
      Konrad Rzeszutek Wilk authored
      commit 8135cf8b upstream.
      
      Double fetch vulnerabilities that happen when a variable is
      fetched twice from shared memory but a security check is only
      performed the first time.
      
      The xen_pcibk_do_op function performs a switch statements on the op->cmd
      value which is stored in shared memory. Interestingly this can result
      in a double fetch vulnerability depending on the performed compiler
      optimization.
      
      This patch fixes it by saving the xen_pci_op command before
      processing it. We also use 'barrier' to make sure that the
      compiler does not perform any optimization.
      
      This is part of XSA155.
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJan Beulich <JBeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      16d5d05d
    • Roger Pau Monné's avatar
      xen-blkback: read from indirect descriptors only once · d529da06
      Roger Pau Monné authored
      commit 18779149 upstream.
      
      Since indirect descriptors are in memory shared with the frontend, the
      frontend could alter the first_sect and last_sect values after they have
      been validated but before they are recorded in the request.  This may
      result in I/O requests that overflow the foreign page, possibly
      overwriting local pages when the I/O request is executed.
      
      When parsing indirect descriptors, only read first_sect and last_sect
      once.
      
      This is part of XSA155.
      Signed-off-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d529da06
    • Roger Pau Monné's avatar
      xen-blkback: only read request operation from shared ring once · a0a2c1c9
      Roger Pau Monné authored
      commit 1f13d75c upstream.
      
      A compiler may load a switch statement value multiple times, which could
      be bad when the value is in memory shared with the frontend.
      
      When converting a non-native request to a native one, ensure that
      src->operation is only loaded once by using READ_ONCE().
      
      This is part of XSA155.
      Signed-off-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a0a2c1c9
    • David Vrabel's avatar
      xen-netback: use RING_COPY_REQUEST() throughout · 1b7f294a
      David Vrabel authored
      commit 68a33bfd upstream.
      
      Instead of open-coding memcpy()s and directly accessing Tx and Rx
      requests, use the new RING_COPY_REQUEST() that ensures the local copy
      is correct.
      
      This is more than is strictly necessary for guest Rx requests since
      only the id and gref fields are used and it is harmless if the
      frontend modifies these.
      
      This is part of XSA155.
      Reviewed-by: default avatarWei Liu <wei.liu2@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1b7f294a
    • David Vrabel's avatar
      xen-netback: don't use last request to determine minimum Tx credit · 0327f4b7
      David Vrabel authored
      commit 0f589967 upstream.
      
      The last from guest transmitted request gives no indication about the
      minimum amount of credit that the guest might need to send a packet
      since the last packet might have been a small one.
      
      Instead allow for the worst case 128 KiB packet.
      
      This is part of XSA155.
      Reviewed-by: default avatarWei Liu <wei.liu2@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0327f4b7
    • David Vrabel's avatar
      xen: Add RING_COPY_REQUEST() · 120b649b
      David Vrabel authored
      commit 454d5d88 upstream.
      
      Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly
      (i.e., by not considering that the other end may alter the data in the
      shared ring while it is being inspected).  Safe usage of a request
      generally requires taking a local copy.
      
      Provide a RING_COPY_REQUEST() macro to use instead of
      RING_GET_REQUEST() and an open-coded memcpy().  This takes care of
      ensuring that the copy is done correctly regardless of any possible
      compiler optimizations.
      
      Use a volatile source to prevent the compiler from reordering or
      omitting the copy.
      
      This is part of XSA155.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      120b649b
    • Peter Zijlstra's avatar
      locking: Remove atomicy checks from {READ,WRITE}_ONCE · fa712550
      Peter Zijlstra authored
      commit 7bd3e239 upstream.
      
      The fact that volatile allows for atomic load/stores is a special case
      not a requirement for {READ,WRITE}_ONCE(). Their primary purpose is to
      force the compiler to emit load/stores _once_.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fa712550
    • Linus Torvalds's avatar
      kernel: make READ_ONCE() valid on const arguments · d9c401f4
      Linus Torvalds authored
      commit dd369297 upstream.
      
      The use of READ_ONCE() causes lots of warnings witht he pending paravirt
      spinlock fixes, because those ends up having passing a member to a
      'const' structure to READ_ONCE().
      
      There should certainly be nothing wrong with using READ_ONCE() with a
      const source, but the helper function __read_once_size() would cause
      warnings because it would drop the 'const' qualifier, but also because
      the destination would be marked 'const' too due to the use of 'typeof'.
      
      Use a union of types in READ_ONCE() to avoid this issue.
      
      Also make sure to use parenthesis around the macro arguments to avoid
      possible operator precedence issues.
      Tested-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d9c401f4
    • Christian Borntraeger's avatar
      kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val) · dda458f0
      Christian Borntraeger authored
      commit 43239cbe upstream.
      
      Feedback has shown that WRITE_ONCE(x, val) is easier to use than
      ASSIGN_ONCE(val,x).
      There are no in-tree users yet, so lets change it for 3.19.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarDavidlohr Bueso <dave@stgolabs.net>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      dda458f0
  2. 30 Mar, 2016 2 commits
    • Christian Borntraeger's avatar
      kernel: Provide READ_ONCE and ASSIGN_ONCE · b5be8baf
      Christian Borntraeger authored
      commit 230fa253 upstream.
      
      ACCESS_ONCE does not work reliably on non-scalar types. For
      example gcc 4.6 and 4.7 might remove the volatile tag for such
      accesses during the SRA (scalar replacement of aggregates) step
      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
      
      Let's provide READ_ONCE/ASSIGN_ONCE that will do all accesses via
      scalar types as suggested by Linus Torvalds. Accesses larger than
      the machines word size cannot be guaranteed to be atomic. These
      macros will use memcpy and emit a build warning.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b5be8baf
    • Eric W. Biederman's avatar
      umount: Do not allow unmounting rootfs. · e1ce59d2
      Eric W. Biederman authored
      commit da362b09 upstream.
      
      Andrew Vagin <avagin@parallels.com> writes:
      
      > #define _GNU_SOURCE
      > #include <sys/types.h>
      > #include <sys/stat.h>
      > #include <fcntl.h>
      > #include <sched.h>
      > #include <unistd.h>
      > #include <sys/mount.h>
      >
      > int main(int argc, char **argv)
      > {
      > 	int fd;
      >
      > 	fd = open("/proc/self/ns/mnt", O_RDONLY);
      > 	if (fd < 0)
      > 	   return 1;
      > 	   while (1) {
      > 	   	 if (umount2("/", MNT_DETACH) ||
      > 		        setns(fd, CLONE_NEWNS))
      > 					break;
      > 					}
      >
      > 					return 0;
      > }
      >
      > root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
      > root@ubuntu:/home/avagin# strace ./nsenter
      > execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
      > ...
      > open("/proc/self/ns/mnt", O_RDONLY)     = 3
      > umount("/", MNT_DETACH)                 = 0
      > setns(3, 131072)                        = 0
      > umount("/", MNT_DETACH
      >
      causes:
      
      > [  260.548301] ------------[ cut here ]------------
      > [  260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
      > [  260.552068] invalid opcode: 0000 [#1] SMP
      > [  260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
      > [  260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
      > [  260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      > [  260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
      > [  260.552068] RIP: 0010:[<ffffffff811e9483>]  [<ffffffff811e9483>] propagate_umount+0x123/0x130
      > [  260.552068] RSP: 0018:ffff880074825e98  EFLAGS: 00010246
      > [  260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
      > [  260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
      > [  260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
      > [  260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
      > [  260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
      > [  260.552068] FS:  00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      > [  260.552068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > [  260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
      > [  260.552068] Stack:
      > [  260.552068]  0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
      > [  260.552068]  ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
      > [  260.552068]  ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
      > [  260.552068] Call Trace:
      > [  260.552068]  [<ffffffff811dcfac>] umount_tree+0x20c/0x260
      > [  260.552068]  [<ffffffff811dd12b>] do_umount+0x12b/0x300
      > [  260.552068]  [<ffffffff811cc642>] ? final_putname+0x22/0x50
      > [  260.552068]  [<ffffffff811cc849>] ? putname+0x29/0x40
      > [  260.552068]  [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
      > [  260.552068]  [<ffffffff8172aeff>] tracesys+0xe1/0xe6
      > [  260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
      > [  260.552068] RIP  [<ffffffff811e9483>] propagate_umount+0x123/0x130
      > [  260.552068]  RSP <ffff880074825e98>
      > [  260.611451] ---[ end trace 11c33d85f1d4c652 ]--
      
      Which in practice is totally uninteresting.  Only the global root user can
      do it, and it is just a stupid thing to do.
      
      However that is no excuse to allow a silly way to oops the kernel.
      
      We can avoid this silly problem by setting MNT_LOCKED on the rootfs
      mount point and thus avoid needing any special cases in the unmount
      code.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e1ce59d2