1. 11 Apr, 2016 30 commits
    • Benjamin Poirier's avatar
      mld, igmp: Fix reserved tailroom calculation · 6af792bb
      Benjamin Poirier authored
      commit 1837b2e2 upstream.
      
      The current reserved_tailroom calculation fails to take hlen and tlen into
      account.
      
      skb:
      [__hlen__|__data____________|__tlen___|__extra__]
      ^                                               ^
      head                                            skb_end_offset
      
      In this representation, hlen + data + tlen is the size passed to alloc_skb.
      "extra" is the extra space made available in __alloc_skb because of
      rounding up by kmalloc. We can reorder the representation like so:
      
      [__hlen__|__data____________|__extra__|__tlen___]
      ^                                               ^
      head                                            skb_end_offset
      
      The maximum space available for ip headers and payload without
      fragmentation is min(mtu, data + extra). Therefore,
      reserved_tailroom
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
      
      Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
      and we can see that hlen and tlen are not taken into account.
      
      The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
      	reserved_tailroom = skb_tailroom - mtu
      else:
      	reserved_tailroom = tlen
      
      Depending on hlen, tlen, mtu and the number of multicast address records,
      the current code may output skbs that have less tailroom than
      dev->needed_tailroom or it may output more skbs than needed because not all
      space available is used.
      
      Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
      Signed-off-by: default avatarBenjamin Poirier <bpoirier@suse.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      6af792bb
    • Daniel Borkmann's avatar
      ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs · e98ac086
      Daniel Borkmann authored
      commit 4c672e4b upstream.
      
      It has been reported that generating an MLD listener report on
      devices with large MTUs (e.g. 9000) and a high number of IPv6
      addresses can trigger a skb_over_panic():
      
      skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20
      head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0
      dev:port1
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:100!
      invalid opcode: 0000 [#1] SMP
      Modules linked in: ixgbe(O)
      CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff80578226>] ? skb_put+0x3a/0x3b
       [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e
       [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4
       [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d
       [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45
       [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68
       [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182
       [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d
       [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3
       [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46
       [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70
      
      mld_newpack() skb allocations are usually requested with dev->mtu
      in size, since commit 72e09ad1 ("ipv6: avoid high order allocations")
      we have changed the limit in order to be less likely to fail.
      
      However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb)
      macros, which determine if we may end up doing an skb_put() for
      adding another record. To avoid possible fragmentation, we check
      the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong
      assumption as the actual max allocation size can be much smaller.
      
      The IGMP case doesn't have this issue as commit 57e1ab6e
      ("igmp: refine skb allocations") stores the allocation size in
      the cb[].
      
      Set a reserved_tailroom to make it fit into the MTU and use
      skb_availroom() helper instead. This also allows to get rid of
      igmp_skb_size().
      Reported-by: default avatarWei Liu <lw1a2.jing@gmail.com>
      Fixes: 72e09ad1 ("ipv6: avoid high order allocations")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: David L Stevens <david.stevens@oracle.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e98ac086
    • Jiri Slaby's avatar
      net/ipv6: fix DEVCONF_ constants · 76f4d42f
      Jiri Slaby authored
      In 3.12 commit e16f5378 (net/ipv6: add
      sysctl option accept_ra_min_hop_limit), upstream commit
      8013d1d7, we added
      DEVCONF_USE_OIF_ADDRS_ONLY and DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT
      constants into <linux/ipv6.h>. But they have different values to
      upstream because some values were added in upstream and we did not
      backport them.
      
      So we have:
              DEVCONF_SUPPRESS_FRAG_NDISC,
      +       DEVCONF_USE_OIF_ADDRS_ONLY,
      +       DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT,
              DEVCONF_MAX
      And upstream has:
              DEVCONF_SUPPRESS_FRAG_NDISC,
      +       DEVCONF_ACCEPT_RA_FROM_LOCAL,
      +       DEVCONF_USE_OPTIMISTIC,
      +       DEVCONF_ACCEPT_RA_MTU,
      +       DEVCONF_STABLE_SECRET,
      +       DEVCONF_USE_OIF_ADDRS_ONLY,
      +       DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT,
              DEVCONF_MAX
      
      Now, our DEVCONF_USE_OIF_ADDRS_ONLY corresponds to
      DEVCONF_USE_OIF_ADDRS_ONLY-4 == DEVCONF_ACCEPT_RA_FROM_LOCAL from
      upstream. Similarly the other constant.
      
      Fix that by simply defining the missing constants to make the values
      equal.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Reported-by: default avatarYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Cc: Luis Henriques <luis.henriques@canonical.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      76f4d42f
    • Jeff Layton's avatar
      nfs: fix high load average due to callback thread sleeping · 2e9b2422
      Jeff Layton authored
      commit 5d05e54a upstream.
      
      Chuck pointed out a problem that crept in with commit 6ffa30d3 (nfs:
      don't call blocking operations while !TASK_RUNNING). Linux counts tasks
      in uninterruptible sleep against the load average, so this caused the
      system's load average to be pinned at at least 1 when there was a
      NFSv4.1+ mount active.
      
      Not a huge problem, but it's probably worth fixing before we get too
      many complaints about it. This patch converts the code back to use
      TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop
      iteration. In practice no one should really be signalling this thread at
      all, so I think this is reasonably safe.
      
      With this change, there's also no need to game the hung task watchdog so
      we can also convert the schedule_timeout call back to a normal schedule.
      Reported-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: default avatarJeff Layton <jeff.layton@primarydata.com>
      Tested-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Fixes: commit 6ffa30d3 (“nfs: don't call blocking . . .”)
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2e9b2422
    • Ian Mitchell's avatar
      Fix kmalloc overflow in LPFC driver at large core count · d93cbb44
      Ian Mitchell authored
      commit c0365c06 upstream.
      
      This patch allows the LPFC to start up without a fatal kernel bug based
      on an exceeded KMALLOC_MAX_SIZE and a too large NR_CPU-based maskbits
      field. The bug was based on the number of CPU cores in a system.
      Using the get_cpu_mask() function declared in kernel/cpu.c allows the
      driver to load on the community kernel 4.2 RC1.
      
      Below is the kernel bug reproduced:
      
      8<--------------------------------------------------------------------
      2199382.828437 (    0.005216)| lpfc 0003:02:00.0: enabling device (0140 -> 0142)
      2199382.999272 (    0.170835)| ------------[ cut here ]------------
      2199382.999337 (    0.000065)| WARNING: CPU: 84 PID: 404 at mm/slab_common.c:653 kmalloc_slab+0x2f/0x89()
      2199383.004534 (    0.005197)| Modules linked in: lpfc(+) usbcore(+) mptctl scsi_transport_fc sg lpc_ich i2c_i801 usb_common tpm_tis mfd_core tpm acpi_cpufreq button scsi_dh_alua scsi_dh_rdacusbcore: registered new device driver usb
      2199383.020568 (    0.016034)|
      2199383.020581 (    0.000013)|  scsi_dh_hp_sw scsi_dh_emc scsi_dh gru thermal sata_nv processor piix fan thermal_sysehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
      2199383.035288 (    0.014707)|
      2199383.035306 (    0.000018)|  hwmon ata_piix
      2199383.035336 (    0.000030)| CPU: 84 PID: 404 Comm: kworker/84:0 Not tainted 3.18.0-rc2-gat-00106-ga7ca10f2-dirty #178
      2199383.047077 (    0.011741)| ehci-pci: EHCI PCI platform driver
      2199383.047134 (    0.000057)| Hardware name: SGI UV2000/ROMLEY, BIOS SGI UV 2000/3000 series BIOS 01/15/2013
      2199383.056245 (    0.009111)| Workqueue: events work_for_cpu_fn
      2199383.066174 (    0.009929)|  000000000000028d ffff88eef827bbe8 ffffffff815a542f 000000000000028d
      2199383.069545 (    0.003371)|  ffffffff810ea142 ffff88eef827bc28 ffffffff8104365c ffff88eefe4006c8
      2199383.076214 (    0.006669)|  0000000000000000 00000000000080d0 0000000000000000 0000000000000004
      2199383.079213 (    0.002999)| Call Trace:
      2199383.084084 (    0.004871)|  [<ffffffff815a542f>] dump_stack+0x49/0x62
      2199383.087283 (    0.003199)|  [<ffffffff810ea142>] ? kmalloc_slab+0x2f/0x89
      2199383.091415 (    0.004132)|  [<ffffffff8104365c>] warn_slowpath_common+0x77/0x92
      2199383.095197 (    0.003782)|  [<ffffffff8104368c>] warn_slowpath_null+0x15/0x17
      2199383.103336 (    0.008139)|  [<ffffffff810ea142>] kmalloc_slab+0x2f/0x89
      2199383.107082 (    0.003746)|  [<ffffffff8110fd9e>] __kmalloc+0x13/0x16a
      2199383.112531 (    0.005449)|  [<ffffffffa01a8ed9>] lpfc_pci_probe_one_s4+0x105b/0x1644 [lpfc]
      2199383.115316 (    0.002785)|  [<ffffffff81302b92>] ? pci_bus_read_config_dword+0x75/0x87
      2199383.123431 (    0.008115)|  [<ffffffffa01a951f>] lpfc_pci_probe_one+0x5d/0xcb5 [lpfc]
      2199383.127364 (    0.003933)|  [<ffffffff81497119>] ? dbs_check_cpu+0x168/0x177
      2199383.136438 (    0.009074)|  [<ffffffff81496fa5>] ? gov_queue_work+0xb4/0xc0
      2199383.140407 (    0.003969)|  [<ffffffff8130b2a1>] local_pci_probe+0x1e/0x52
      2199383.143105 (    0.002698)|  [<ffffffff81052c47>] work_for_cpu_fn+0x13/0x1b
      2199383.147315 (    0.004210)|  [<ffffffff81054965>] process_one_work+0x222/0x35e
      2199383.151379 (    0.004064)|  [<ffffffff81054e76>] worker_thread+0x3d5/0x46e
      2199383.159402 (    0.008023)|  [<ffffffff81054aa1>] ? process_one_work+0x35e/0x35e
      2199383.163097 (    0.003695)|  [<ffffffff810599c6>] kthread+0xc8/0xd2
      2199383.167476 (    0.004379)|  [<ffffffff810598fe>] ? kthread_freezable_should_stop+0x5b/0x5b
      2199383.176434 (    0.008958)|  [<ffffffff815a8cac>] ret_from_fork+0x7c/0xb0
      2199383.180086 (    0.003652)|  [<ffffffff810598fe>] ? kthread_freezable_should_stop+0x5b/0x5b
      2199383.192333 (    0.012247)| ehci-pci 0000:00:1a.0: EHCI Host Controller
      -------------------------------------------------------------------->8
      
      The proposed solution was approved by James Smart at Emulex and tested
      on a UV2 machine with 6144 cores. With the fix, the LPFC module loads
      with no unwanted effects on the system.
      Signed-off-by: default avatarIan Mitchell <imitchell@sgi.com>
      Signed-off-by: default avatarAlex Thorlton <athorlton@sgi.com>
      Suggested-by: default avatarRobert Elliot <elliott@hp.com>
      [james.smart: resolve unused variable warning]
      Signed-off-by: default avatarJames Smart <james.smart@avagotech.com>
      Signed-off-by: default avatarJames Bottomley <JBottomley@Odin.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d93cbb44
    • Markus Metzger's avatar
      perf, nmi: Fix unknown NMI warning · 71f4162a
      Markus Metzger authored
      commit a3ef2229 upstream.
      
      When using BTS on Core i7-4*, I get the below kernel warning.
      
      $ perf record -c 1 -e branches:u ls
      Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
       kernel:[  438.317893] Uhhuh. NMI received for unknown reason 31 on CPU 2.
      
      Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
       kernel:[  438.317920] Do you have a strange power saving mode enabled?
      
      Message from syslogd@labpc1501 at Nov 11 15:49:25 ...
       kernel:[  438.317945] Dazed and confused, but trying to continue
      
      Make intel_pmu_handle_irq() take the full exit path when returning early.
      
      Cc: eranian@google.com
      Cc: peterz@infradead.org
      Cc: mingo@kernel.org
      Signed-off-by: default avatarMarkus Metzger <markus.t.metzger@intel.com>
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1392425048-5309-1-git-send-email-andi@firstfloor.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      71f4162a
    • Lukasz Odzioba's avatar
      hwmon: (coretemp) Increase limit of maximum core ID from 32 to 128. · 2b4faae2
      Lukasz Odzioba authored
      commit cc904f9c upstream.
      
      A new limit selected arbitrarily as power of two greater than
      required minimum for Xeon Phi processor (72 for Knights Landing).
      
      Currently driver is not able to handle cores with core ID greater than 32.
      Such attempt ends up with the following error in dmesg:
      coretemp coretemp.0: Adding Core XXX failed
      Signed-off-by: default avatarLukasz Odzioba <lukasz.odzioba@intel.com>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2b4faae2
    • Martin Schwidefsky's avatar
      s390/mm: four page table levels vs. fork · bf06b31b
      Martin Schwidefsky authored
      commit 3446c13b upstream.
      
      The fork of a process with four page table levels is broken since
      git commit 6252d702 "[S390] dynamic page tables."
      
      All new mm contexts are created with three page table levels and
      an asce limit of 4TB. If the parent has four levels dup_mmap will
      add vmas to the new context which are outside of the asce limit.
      The subsequent call to copy_page_range will walk the three level
      page table structure of the new process with non-zero pgd and pud
      indexes. This leads to memory clobbers as the pgd_index *and* the
      pud_index is added to the mm->pgd pointer without a pgd_deref
      in between.
      
      The init_new_context() function is selecting the number of page
      table levels for a new context. The function is used by mm_init()
      which in turn is called by dup_mm() and mm_alloc(). These two are
      used by fork() and exec(). The init_new_context() function can
      distinguish the two cases by looking at mm->context.asce_limit,
      for fork() the mm struct has been copied and the number of page
      table levels may not change. For exec() the mm_alloc() function
      set the new mm structure to zero, in this case a three-level page
      table is created as the temporary stack space is located at
      STACK_TOP_MAX = 4TB.
      
      This fixes CVE-2016-2143.
      Reported-by: default avatarMarcin Kościelnicki <koriakin@0x04.net>
      Reviewed-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bf06b31b
    • Johan Hovold's avatar
      USB: visor: fix null-deref at probe · d53a0262
      Johan Hovold authored
      commit cac9b50b upstream.
      
      Fix null-pointer dereference at probe should a (malicious) Treo device
      lack the expected endpoints.
      
      Specifically, the Treo port-setup hack was dereferencing the bulk-in and
      interrupt-in urbs without first making sure they had been allocated by
      core.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d53a0262
    • Wei Huang's avatar
      KVM: SVM: add rdmsr support for AMD event registers · 9112aa76
      Wei Huang authored
      commit dc9b2d93 upstream.
      
      Current KVM only supports RDMSR for K7_EVNTSEL0 and K7_PERFCTR0
      MSRs. Reading the rest MSRs will trigger KVM to inject #GP into
      guest VM. This causes a warning message "Failed to access perfctr
      msr (MSR c0010001 is ffffffffffffffff)" on AMD host. This patch
      adds RDMSR support for all K7_EVNTSELn and K7_PERFCTRn registers
      and thus supresses the warning message.
      Signed-off-by: default avatarWei Huang <wehuang@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9112aa76
    • Dirk Brandewie's avatar
      intel_pstate: Use del_timer_sync in intel_pstate_cpu_stop · b8c77bd5
      Dirk Brandewie authored
      commit c2294a2f upstream.
      
      Ensure that no timer callback is running since we are about to free
      the timer structure.  We cannot guarantee that the call back is called
      on the CPU where the timer is running.
      Reported-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarDirk Brandewie <dirk.j.brandewie@intel.com>
      Reviewed-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b8c77bd5
    • Alan Stern's avatar
      USB: fix invalid memory access in hub_activate() · a706ac40
      Alan Stern authored
      commit e50293ef upstream.
      
      Commit 8520f380 ("USB: change hub initialization sleeps to
      delayed_work") changed the hub_activate() routine to make part of it
      run in a workqueue.  However, the commit failed to take a reference to
      the usb_hub structure or to lock the hub interface while doing so.  As
      a result, if a hub is plugged in and quickly unplugged before the work
      routine can run, the routine will try to access memory that has been
      deallocated.  Or, if the hub is unplugged while the routine is
      running, the memory may be deallocated while it is in active use.
      
      This patch fixes the problem by taking a reference to the usb_hub at
      the start of hub_activate() and releasing it at the end (when the work
      is finished), and by locking the hub interface while the work routine
      is running.  It also adds a check at the start of the routine to see
      if the hub has already been disconnected, in which nothing should be
      done.
      Signed-off-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Reported-by: default avatarAlexandru Cornea <alexandru.cornea@intel.com>
      Tested-by: default avatarAlexandru Cornea <alexandru.cornea@intel.com>
      Fixes: 8520f380 ("USB: change hub initialization sleeps to delayed_work")
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a706ac40
    • Michal Hocko's avatar
      memcg: do not hang on OOM when killed by userspace OOM access to memory reserves · b6f358ab
      Michal Hocko authored
      commit d8dc595c upstream.
      
      Eric has reported that he can see task(s) stuck in memcg OOM handler
      regularly.  The only way out is to
      
      	echo 0 > $GROUP/memory.oom_control
      
      His usecase is:
      
      - Setup a hierarchy with memory and the freezer (disable kernel oom and
        have a process watch for oom).
      
      - In that memory cgroup add a process with one thread per cpu.
      
      - In one thread slowly allocate once per second I think it is 16M of ram
        and mlock and dirty it (just to force the pages into ram and stay
        there).
      
      - When oom is achieved loop:
        * attempt to freeze all of the tasks.
        * if frozen send every task SIGKILL, unfreeze, remove the directory in
          cgroupfs.
      
      Eric has then pinpointed the issue to be memcg specific.
      
      All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
      Those that have received fatal signal will bypass the charge and should
      continue on their way out.  The tricky part is that the exit path might
      trigger a page fault (e.g.  exit_robust_list), thus the memcg charge,
      while its memcg is still under OOM because nobody has released any charges
      yet.
      
      Unlike with the in-kernel OOM handler the exiting task doesn't get
      TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
      and falls to the memcg OOM again without any way out of it as there are no
      fatal signals pending anymore.
      
      This patch fixes the issue by checking PF_EXITING early in
      mem_cgroup_try_charge and bypass the charge same as if it had fatal
      signal pending or TIF_MEMDIE set.
      
      Normally exiting tasks (aka not killed) will bypass the charge now but
      this should be OK as the task is leaving and will release memory and
      increasing the memory pressure just to release it in a moment seems
      dubious wasting of cycles.  Besides that charges after exit_signals should
      be rare.
      
      I am bringing this patch again (rebased on the current mmotm tree). I
      hope we can move forward finally. If there is still an opposition then
      I would really appreciate a concurrent approach so that we can discuss
      alternatives.
      
      http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
      to the followup discussion when the patch has been dropped from the mmotm
      last time.
      Reported-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b6f358ab
    • Takashi Iwai's avatar
      ALSA: seq: Fix leak of pool buffer at concurrent writes · 448da408
      Takashi Iwai authored
      commit d99a36f4 upstream.
      
      When multiple concurrent writes happen on the ALSA sequencer device
      right after the open, it may try to allocate vmalloc buffer for each
      write and leak some of them.  It's because the presence check and the
      assignment of the buffer is done outside the spinlock for the pool.
      
      The fix is to move the check and the assignment into the spinlock.
      
      (The current implementation is suboptimal, as there can be multiple
       unnecessary vmallocs because the allocation is done before the check
       in the spinlock.  But the pool size is already checked beforehand, so
       this isn't a big problem; that is, the only possible path is the
       multiple writes before any pool assignment, and practically seen, the
       current coverage should be "good enough".)
      
      The issue was triggered by syzkaller fuzzer.
      
      Buglink: http://lkml.kernel.org/r/CACT4Y+bSzazpXNvtAr=WXaL8hptqjHwqEyFA+VN2AWEx=aurkg@mail.gmail.comReported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      448da408
    • Takashi Iwai's avatar
      ALSA: rawmidi: Make snd_rawmidi_transmit() race-free · d077a244
      Takashi Iwai authored
      commit 06ab3003 upstream.
      
      A kernel WARNING in snd_rawmidi_transmit_ack() is triggered by
      syzkaller fuzzer:
        WARNING: CPU: 1 PID: 20739 at sound/core/rawmidi.c:1136
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82999e2d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
       [<ffffffff81352089>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
       [<ffffffff813522b9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
       [<ffffffff84f80bd5>] snd_rawmidi_transmit_ack+0x275/0x400 sound/core/rawmidi.c:1136
       [<ffffffff84fdb3c1>] snd_virmidi_output_trigger+0x4b1/0x5a0 sound/core/seq/seq_virmidi.c:163
       [<     inline     >] snd_rawmidi_output_trigger sound/core/rawmidi.c:150
       [<ffffffff84f87ed9>] snd_rawmidi_kernel_write1+0x549/0x780 sound/core/rawmidi.c:1223
       [<ffffffff84f89fd3>] snd_rawmidi_write+0x543/0xb30 sound/core/rawmidi.c:1273
       [<ffffffff817b0323>] __vfs_write+0x113/0x480 fs/read_write.c:528
       [<ffffffff817b1db7>] vfs_write+0x167/0x4a0 fs/read_write.c:577
       [<     inline     >] SYSC_write fs/read_write.c:624
       [<ffffffff817b50a1>] SyS_write+0x111/0x220 fs/read_write.c:616
       [<ffffffff86336c36>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185
      
      Also a similar warning is found but in another path:
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff82be2c0d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
       [<ffffffff81355139>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
       [<ffffffff81355369>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
       [<ffffffff8527e69a>] rawmidi_transmit_ack+0x24a/0x3b0 sound/core/rawmidi.c:1133
       [<ffffffff8527e851>] snd_rawmidi_transmit_ack+0x51/0x80 sound/core/rawmidi.c:1163
       [<ffffffff852d9046>] snd_virmidi_output_trigger+0x2b6/0x570 sound/core/seq/seq_virmidi.c:185
       [<     inline     >] snd_rawmidi_output_trigger sound/core/rawmidi.c:150
       [<ffffffff85285a0b>] snd_rawmidi_kernel_write1+0x4bb/0x760 sound/core/rawmidi.c:1252
       [<ffffffff85287b73>] snd_rawmidi_write+0x543/0xb30 sound/core/rawmidi.c:1302
       [<ffffffff817ba5f3>] __vfs_write+0x113/0x480 fs/read_write.c:528
       [<ffffffff817bc087>] vfs_write+0x167/0x4a0 fs/read_write.c:577
       [<     inline     >] SYSC_write fs/read_write.c:624
       [<ffffffff817bf371>] SyS_write+0x111/0x220 fs/read_write.c:616
       [<ffffffff86660276>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185
      
      In the former case, the reason is that virmidi has an open code
      calling snd_rawmidi_transmit_ack() with the value calculated outside
      the spinlock.   We may use snd_rawmidi_transmit() in a loop just for
      consuming the input data, but even there, there is a race between
      snd_rawmidi_transmit_peek() and snd_rawmidi_tranmit_ack().
      
      Similarly in the latter case, it calls snd_rawmidi_transmit_peek() and
      snd_rawmidi_tranmit_ack() separately without protection, so they are
      racy as well.
      
      The patch tries to address these issues by the following ways:
      - Introduce the unlocked versions of snd_rawmidi_transmit_peek() and
        snd_rawmidi_transmit_ack() to be called inside the explicit lock.
      - Rewrite snd_rawmidi_transmit() to be race-free (the former case).
      - Make the split calls (the latter case) protected in the rawmidi spin
        lock.
      
      Buglink: http://lkml.kernel.org/r/CACT4Y+YPq1+cYLkadwjWa5XjzF1_Vki1eHnVn-Lm0hzhSpu5PA@mail.gmail.com
      Buglink: http://lkml.kernel.org/r/CACT4Y+acG4iyphdOZx47Nyq_VHGbpJQK-6xNpiqUjaZYqsXOGw@mail.gmail.comReported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d077a244
    • John Allen's avatar
      drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64 · a95562da
      John Allen authored
      commit cb5490a5 upstream.
      
      Fix a bug where a kernel warning is triggered when performing a memory
      hotplug on ppc64.  This warning may also occur on any architecture that
      uses the memory_probe_store interface.
      
        WARNING: at drivers/base/memory.c:200
        CPU: 9 PID: 13042 Comm: systemd-udevd Not tainted 4.4.0-rc4-00113-g0bd0f1e6-dirty #7
        NIP [c00000000055e034] pages_correctly_reserved+0x134/0x1b0
        LR [c00000000055e7f8] memory_subsys_online+0x68/0x140
        Call Trace:
          memory_subsys_online+0x68/0x140
          device_online+0xb4/0x120
          store_mem_state+0xb0/0x180
          dev_attr_store+0x34/0x60
          sysfs_kf_write+0x64/0xa0
          kernfs_fop_write+0x17c/0x1e0
          __vfs_write+0x40/0x160
          vfs_write+0xb8/0x200
          SyS_write+0x60/0x110
          system_call+0x38/0xd0
      
      The warning is triggered because there is a udev rule that automatically
      tries to online memory after it has been added.  The udev rule varies
      from distro to distro, but will generally look something like:
      
        SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
      
      On any architecture that uses memory_probe_store to reserve memory, the
      udev rule will be triggered after the first section of the block is
      reserved and will subsequently attempt to online the entire block,
      interrupting the memory reservation process and causing the warning.
      This patch modifies memory_probe_store to add a block of memory with a
      single call to add_memory as opposed to looping through and adding each
      section individually.  A single call to add_memory is protected by the
      mem_hotplug mutex which will prevent the udev rule from onlining memory
      until the reservation of the entire block is complete.
      Signed-off-by: default avatarJohn Allen <jallen@linux.vnet.ibm.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a95562da
    • Yuval Mintz's avatar
      bnx2x: Add new device ids under the Qlogic vendor · 3edd2164
      Yuval Mintz authored
      commit 9c9a6524 upstream.
      
      This adds support for 3 new PCI device combinations -
      1077:16a1, 1077:16a4 and 1077:16ad.
      Signed-off-by: default avatarYuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3edd2164
    • Wang Shilong's avatar
      Btrfs: skip locking when searching commit root · a8a83b61
      Wang Shilong authored
      commit e84752d4 upstream.
      
      We won't change commit root, skip locking dance with commit root
      when walking backrefs, this can speed up btrfs send operations.
      Signed-off-by: default avatarWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a8a83b61
    • Kirill Tkhai's avatar
      sched: Fix race between task_group and sched_task_group · 5f328493
      Kirill Tkhai authored
      commit eeb61e53 upstream.
      
      The race may happen when somebody is changing task_group of a forking task.
      Child's cgroup is the same as parent's after dup_task_struct() (there just
      memory copying). Also, cfs_rq and rt_rq are the same as parent's.
      
      But if parent changes its task_group before it's called cgroup_post_fork(),
      we do not reflect this situation on child. Child's cfs_rq and rt_rq remain
      the same, while child's task_group changes in cgroup_post_fork().
      
      To fix this we introduce fork() method, which calls sched_move_task() directly.
      This function changes sched_task_group on appropriate (also its logic has
      no problem with freshly created tasks, so we shouldn't introduce something
      special; we are able just to use it).
      
      Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456Signed-off-by: default avatarKirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhaiSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5f328493
    • Eric Sandeen's avatar
      xfs: allow inode allocations in post-growfs disk space · d1f8a33f
      Eric Sandeen authored
      commit 9de67c3b upstream.
      
      Today, if we perform an xfs_growfs which adds allocation groups,
      mp->m_maxagi is not properly updated when the growfs is complete.
      
      Therefore inodes will continue to be allocated only in the
      AGs which existed prior to the growfs, and the new space
      won't be utilized.
      
      This is because of this path in xfs_growfs_data_private():
      
      xfs_growfs_data_private
      	xfs_initialize_perag(mp, nagcount, &nagimax);
      		if (mp->m_flags & XFS_MOUNT_32BITINODES)
      			index = xfs_set_inode32(mp);
      		else
      			index = xfs_set_inode64(mp);
      
      		if (maxagi)
      			*maxagi = index;
      
      where xfs_set_inode* iterates over the (old) agcount in
      mp->m_sb.sb_agblocks, which has not yet been updated
      in the growfs path.  So "index" will be returned based on
      the old agcount, not the new one, and new AGs are not available
      for inode allocation.
      
      Fix this by explicitly passing the proper AG count (which
      xfs_initialize_perag() already has) down another level,
      so that xfs_set_inode* can make the proper decision about
      acceptable AGs for inode allocation in the potentially
      newly-added AGs.
      
      This has been broken since 3.7, when these two
      xfs_set_inode* functions were added in commit 2d2194f6.
      Prior to that, we looped over "agcount" not sb_agblocks
      in these calculations.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d1f8a33f
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Save the number of MSI-X entries to be copied later. · 8d676cb5
      Konrad Rzeszutek Wilk authored
      commit d159457b upstream.
      
      Commit 8135cf8b (xen/pciback: Save
      xen_pci_op commands before processing it) broke enabling MSI-X because
      it would never copy the resulting vectors into the response.  The
      number of vectors requested was being overwritten by the return value
      (typically zero for success).
      
      Save the number of vectors before processing the op, so the correct
      number of vectors are copied afterwards.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8d676cb5
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Save xen_pci_op commands before processing it · 16d5d05d
      Konrad Rzeszutek Wilk authored
      commit 8135cf8b upstream.
      
      Double fetch vulnerabilities that happen when a variable is
      fetched twice from shared memory but a security check is only
      performed the first time.
      
      The xen_pcibk_do_op function performs a switch statements on the op->cmd
      value which is stored in shared memory. Interestingly this can result
      in a double fetch vulnerability depending on the performed compiler
      optimization.
      
      This patch fixes it by saving the xen_pci_op command before
      processing it. We also use 'barrier' to make sure that the
      compiler does not perform any optimization.
      
      This is part of XSA155.
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJan Beulich <JBeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      16d5d05d
    • Roger Pau Monné's avatar
      xen-blkback: read from indirect descriptors only once · d529da06
      Roger Pau Monné authored
      commit 18779149 upstream.
      
      Since indirect descriptors are in memory shared with the frontend, the
      frontend could alter the first_sect and last_sect values after they have
      been validated but before they are recorded in the request.  This may
      result in I/O requests that overflow the foreign page, possibly
      overwriting local pages when the I/O request is executed.
      
      When parsing indirect descriptors, only read first_sect and last_sect
      once.
      
      This is part of XSA155.
      Signed-off-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d529da06
    • Roger Pau Monné's avatar
      xen-blkback: only read request operation from shared ring once · a0a2c1c9
      Roger Pau Monné authored
      commit 1f13d75c upstream.
      
      A compiler may load a switch statement value multiple times, which could
      be bad when the value is in memory shared with the frontend.
      
      When converting a non-native request to a native one, ensure that
      src->operation is only loaded once by using READ_ONCE().
      
      This is part of XSA155.
      Signed-off-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a0a2c1c9
    • David Vrabel's avatar
      xen-netback: use RING_COPY_REQUEST() throughout · 1b7f294a
      David Vrabel authored
      commit 68a33bfd upstream.
      
      Instead of open-coding memcpy()s and directly accessing Tx and Rx
      requests, use the new RING_COPY_REQUEST() that ensures the local copy
      is correct.
      
      This is more than is strictly necessary for guest Rx requests since
      only the id and gref fields are used and it is harmless if the
      frontend modifies these.
      
      This is part of XSA155.
      Reviewed-by: default avatarWei Liu <wei.liu2@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1b7f294a
    • David Vrabel's avatar
      xen-netback: don't use last request to determine minimum Tx credit · 0327f4b7
      David Vrabel authored
      commit 0f589967 upstream.
      
      The last from guest transmitted request gives no indication about the
      minimum amount of credit that the guest might need to send a packet
      since the last packet might have been a small one.
      
      Instead allow for the worst case 128 KiB packet.
      
      This is part of XSA155.
      Reviewed-by: default avatarWei Liu <wei.liu2@citrix.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0327f4b7
    • David Vrabel's avatar
      xen: Add RING_COPY_REQUEST() · 120b649b
      David Vrabel authored
      commit 454d5d88 upstream.
      
      Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly
      (i.e., by not considering that the other end may alter the data in the
      shared ring while it is being inspected).  Safe usage of a request
      generally requires taking a local copy.
      
      Provide a RING_COPY_REQUEST() macro to use instead of
      RING_GET_REQUEST() and an open-coded memcpy().  This takes care of
      ensuring that the copy is done correctly regardless of any possible
      compiler optimizations.
      
      Use a volatile source to prevent the compiler from reordering or
      omitting the copy.
      
      This is part of XSA155.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      120b649b
    • Peter Zijlstra's avatar
      locking: Remove atomicy checks from {READ,WRITE}_ONCE · fa712550
      Peter Zijlstra authored
      commit 7bd3e239 upstream.
      
      The fact that volatile allows for atomic load/stores is a special case
      not a requirement for {READ,WRITE}_ONCE(). Their primary purpose is to
      force the compiler to emit load/stores _once_.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fa712550
    • Linus Torvalds's avatar
      kernel: make READ_ONCE() valid on const arguments · d9c401f4
      Linus Torvalds authored
      commit dd369297 upstream.
      
      The use of READ_ONCE() causes lots of warnings witht he pending paravirt
      spinlock fixes, because those ends up having passing a member to a
      'const' structure to READ_ONCE().
      
      There should certainly be nothing wrong with using READ_ONCE() with a
      const source, but the helper function __read_once_size() would cause
      warnings because it would drop the 'const' qualifier, but also because
      the destination would be marked 'const' too due to the use of 'typeof'.
      
      Use a union of types in READ_ONCE() to avoid this issue.
      
      Also make sure to use parenthesis around the macro arguments to avoid
      possible operator precedence issues.
      Tested-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      d9c401f4
    • Christian Borntraeger's avatar
      kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val) · dda458f0
      Christian Borntraeger authored
      commit 43239cbe upstream.
      
      Feedback has shown that WRITE_ONCE(x, val) is easier to use than
      ASSIGN_ONCE(val,x).
      There are no in-tree users yet, so lets change it for 3.19.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarDavidlohr Bueso <dave@stgolabs.net>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      dda458f0
  2. 30 Mar, 2016 5 commits
    • Christian Borntraeger's avatar
      kernel: Provide READ_ONCE and ASSIGN_ONCE · b5be8baf
      Christian Borntraeger authored
      commit 230fa253 upstream.
      
      ACCESS_ONCE does not work reliably on non-scalar types. For
      example gcc 4.6 and 4.7 might remove the volatile tag for such
      accesses during the SRA (scalar replacement of aggregates) step
      https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)
      
      Let's provide READ_ONCE/ASSIGN_ONCE that will do all accesses via
      scalar types as suggested by Linus Torvalds. Accesses larger than
      the machines word size cannot be guaranteed to be atomic. These
      macros will use memcpy and emit a build warning.
      Signed-off-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b5be8baf
    • Eric W. Biederman's avatar
      umount: Do not allow unmounting rootfs. · e1ce59d2
      Eric W. Biederman authored
      commit da362b09 upstream.
      
      Andrew Vagin <avagin@parallels.com> writes:
      
      > #define _GNU_SOURCE
      > #include <sys/types.h>
      > #include <sys/stat.h>
      > #include <fcntl.h>
      > #include <sched.h>
      > #include <unistd.h>
      > #include <sys/mount.h>
      >
      > int main(int argc, char **argv)
      > {
      > 	int fd;
      >
      > 	fd = open("/proc/self/ns/mnt", O_RDONLY);
      > 	if (fd < 0)
      > 	   return 1;
      > 	   while (1) {
      > 	   	 if (umount2("/", MNT_DETACH) ||
      > 		        setns(fd, CLONE_NEWNS))
      > 					break;
      > 					}
      >
      > 					return 0;
      > }
      >
      > root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
      > root@ubuntu:/home/avagin# strace ./nsenter
      > execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
      > ...
      > open("/proc/self/ns/mnt", O_RDONLY)     = 3
      > umount("/", MNT_DETACH)                 = 0
      > setns(3, 131072)                        = 0
      > umount("/", MNT_DETACH
      >
      causes:
      
      > [  260.548301] ------------[ cut here ]------------
      > [  260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
      > [  260.552068] invalid opcode: 0000 [#1] SMP
      > [  260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
      > [  260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
      > [  260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      > [  260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
      > [  260.552068] RIP: 0010:[<ffffffff811e9483>]  [<ffffffff811e9483>] propagate_umount+0x123/0x130
      > [  260.552068] RSP: 0018:ffff880074825e98  EFLAGS: 00010246
      > [  260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
      > [  260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
      > [  260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
      > [  260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
      > [  260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
      > [  260.552068] FS:  00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      > [  260.552068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > [  260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
      > [  260.552068] Stack:
      > [  260.552068]  0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
      > [  260.552068]  ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
      > [  260.552068]  ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
      > [  260.552068] Call Trace:
      > [  260.552068]  [<ffffffff811dcfac>] umount_tree+0x20c/0x260
      > [  260.552068]  [<ffffffff811dd12b>] do_umount+0x12b/0x300
      > [  260.552068]  [<ffffffff811cc642>] ? final_putname+0x22/0x50
      > [  260.552068]  [<ffffffff811cc849>] ? putname+0x29/0x40
      > [  260.552068]  [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
      > [  260.552068]  [<ffffffff8172aeff>] tracesys+0xe1/0xe6
      > [  260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
      > [  260.552068] RIP  [<ffffffff811e9483>] propagate_umount+0x123/0x130
      > [  260.552068]  RSP <ffff880074825e98>
      > [  260.611451] ---[ end trace 11c33d85f1d4c652 ]--
      
      Which in practice is totally uninteresting.  Only the global root user can
      do it, and it is just a stupid thing to do.
      
      However that is no excuse to allow a silly way to oops the kernel.
      
      We can avoid this silly problem by setting MNT_LOCKED on the rootfs
      mount point and thus avoid needing any special cases in the unmount
      code.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e1ce59d2
    • David S. Miller's avatar
      ipv4: Don't do expensive useless work during inetdev destroy. · 5cc4ff31
      David S. Miller authored
      commit fbd40ea0 upstream.
      
      When an inetdev is destroyed, every address assigned to the interface
      is removed.  And in this scenerio we do two pointless things which can
      be very expensive if the number of assigned interfaces is large:
      
      1) Address promotion.  We are deleting all addresses, so there is no
         point in doing this.
      
      2) A full nf conntrack table purge for every address.  We only need to
         do this once, as is already caught by the existing
         masq_dev_notifier so masq_inet_event() can skip this.
      
      [mk] 3.12.*: The change in masq_inet_event() needs to be duplicated in
      both IPv4 and IPv6 version of the function, these two were merged in
      3.18.
      Reported-by: default avatarSolar Designer <solar@openwall.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Tested-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5cc4ff31
    • Gabriel Krisman Bertazi's avatar
      ipr: Fix regression when loading firmware · 7514ee10
      Gabriel Krisman Bertazi authored
      commit 21b81716 upstream.
      
      Commit d63c7dd5 ("ipr: Fix out-of-bounds null overwrite") removed
      the end of line handling when storing the update_fw sysfs attribute.
      This changed the userpace API because it started refusing writes
      terminated by a line feed, which broke the update tools we already have.
      
      This patch re-adds that handling, so both a write terminated by a line
      feed or not can make it through with the update.
      
      Fixes: d63c7dd5 ("ipr: Fix out-of-bounds null overwrite")
      Signed-off-by: default avatarGabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Insu Yun <wuninsu@gmail.com>
      Acked-by: default avatarBrian King <brking@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7514ee10
    • Insu Yun's avatar
      ipr: Fix out-of-bounds null overwrite · def91dea
      Insu Yun authored
      commit d63c7dd5 upstream.
      
      Return value of snprintf is not bound by size value, 2nd argument.
      (https://www.kernel.org/doc/htmldocs/kernel-api/API-snprintf.html).
      Return value is number of printed chars, can be larger than 2nd
      argument.  Therefore, it can write null byte out of bounds ofbuffer.
      Since snprintf puts null, it does not need to put additional null byte.
      Signed-off-by: default avatarInsu Yun <wuninsu@gmail.com>
      Reviewed-by: default avatarShane Seymour <shane.seymour@hpe.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      def91dea
  3. 16 Mar, 2016 5 commits
    • Jiri Slaby's avatar
      Linux 3.12.57 · d9d35182
      Jiri Slaby authored
      d9d35182
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Check PF instead of VF for PCI_COMMAND_MEMORY · 372e0615
      Konrad Rzeszutek Wilk authored
      commit 8d47065f upstream.
      
      Commit 408fb0e5 (xen/pciback: Don't
      allow MSI-X ops if PCI_COMMAND_MEMORY is not set) prevented enabling
      MSI-X on passed-through virtual functions, because it checked the VF
      for PCI_COMMAND_MEMORY but this is not a valid bit for VFs.
      
      Instead, check the physical function for PCI_COMMAND_MEMORY.
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      372e0615
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set. · bb7aa305
      Konrad Rzeszutek Wilk authored
      commit 408fb0e5 upstream.
      
      commit f598282f ("PCI: Fix the NIU MSI-X problem in a better way")
      teaches us that dealing with MSI-X can be troublesome.
      
      Further checks in the MSI-X architecture shows that if the
      PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we
      may not be able to access the BAR (since they are memory regions).
      
      Since the MSI-X tables are located in there.. that can lead
      to us causing PCIe errors. Inhibit us performing any
      operation on the MSI-X unless the MEMORY bit is set.
      
      Note that Xen hypervisor with:
      "x86/MSI-X: access MSI-X table only after having enabled MSI-X"
      will return:
      xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!
      
      When the generic MSI code tries to setup the PIRQ without
      MEMORY bit set. Which means with later versions of Xen
      (4.6) this patch is not neccessary.
      
      This is part of XSA-157
      Reviewed-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bb7aa305
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled. · 388a8005
      Konrad Rzeszutek Wilk authored
      commit 7cfb905b upstream.
      
      Otherwise just continue on, returning the same values as
      previously (return of 0, and op->result has the PIRQ value).
      
      This does not change the behavior of XEN_PCI_OP_disable_msi[|x].
      
      The pci_disable_msi or pci_disable_msix have the checks for
      msi_enabled or msix_enabled so they will error out immediately.
      
      However the guest can still call these operations and cause
      us to disable the 'ack_intr'. That means the backend IRQ handler
      for the legacy interrupt will not respond to interrupts anymore.
      
      This will lead to (if the device is causing an interrupt storm)
      for the Linux generic code to disable the interrupt line.
      
      Naturally this will only happen if the device in question
      is plugged in on the motherboard on shared level interrupt GSI.
      
      This is part of XSA-157
      Reviewed-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      388a8005
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Do not install an IRQ handler for MSI interrupts. · 4113b288
      Konrad Rzeszutek Wilk authored
      commit a396f3a2 upstream.
      
      Otherwise an guest can subvert the generic MSI code to trigger
      an BUG_ON condition during MSI interrupt freeing:
      
       for (i = 0; i < entry->nvec_used; i++)
              BUG_ON(irq_has_action(entry->irq + i));
      
      Xen PCI backed installs an IRQ handler (request_irq) for
      the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
      (or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
      done in case the device has legacy interrupts the GSI line
      is shared by the backend devices.
      
      To subvert the backend the guest needs to make the backend
      to change the dev->irq from the GSI to the MSI interrupt line,
      make the backend allocate an interrupt handler, and then command
      the backend to free the MSI interrupt and hit the BUG_ON.
      
      Since the backend only calls 'request_irq' when the guest
      writes to the PCI_COMMAND register the guest needs to call
      XEN_PCI_OP_enable_msi before any other operation. This will
      cause the generic MSI code to setup an MSI entry and
      populate dev->irq with the new PIRQ value.
      
      Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
      and cause the backend to setup an IRQ handler for dev->irq
      (which instead of the GSI value has the MSI pirq). See
      'xen_pcibk_control_isr'.
      
      Then the guest disables the MSI: XEN_PCI_OP_disable_msi
      which ends up triggering the BUG_ON condition in 'free_msi_irqs'
      as there is an IRQ handler for the entry->irq (dev->irq).
      
      Note that this cannot be done using MSI-X as the generic
      code does not over-write dev->irq with the MSI-X PIRQ values.
      
      The patch inhibits setting up the IRQ handler if MSI or
      MSI-X (for symmetry reasons) code had been called successfully.
      
      P.S.
      Xen PCIBack when it sets up the device for the guest consumption
      ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device).
      XSA-120 addendum patch removed that - however when upstreaming said
      addendum we found that it caused issues with qemu upstream. That
      has now been fixed in qemu upstream.
      
      This is part of XSA-157
      Reviewed-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4113b288