- 11 Apr, 2016 21 commits
-
-
Wei Huang authored
commit dc9b2d93 upstream. Current KVM only supports RDMSR for K7_EVNTSEL0 and K7_PERFCTR0 MSRs. Reading the rest MSRs will trigger KVM to inject #GP into guest VM. This causes a warning message "Failed to access perfctr msr (MSR c0010001 is ffffffffffffffff)" on AMD host. This patch adds RDMSR support for all K7_EVNTSELn and K7_PERFCTRn registers and thus supresses the warning message. Signed-off-by: Wei Huang <wehuang@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Dirk Brandewie authored
commit c2294a2f upstream. Ensure that no timer callback is running since we are about to free the timer structure. We cannot guarantee that the call back is called on the CPU where the timer is running. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Alan Stern authored
commit e50293ef upstream. Commit 8520f380 ("USB: change hub initialization sleeps to delayed_work") changed the hub_activate() routine to make part of it run in a workqueue. However, the commit failed to take a reference to the usb_hub structure or to lock the hub interface while doing so. As a result, if a hub is plugged in and quickly unplugged before the work routine can run, the routine will try to access memory that has been deallocated. Or, if the hub is unplugged while the routine is running, the memory may be deallocated while it is in active use. This patch fixes the problem by taking a reference to the usb_hub at the start of hub_activate() and releasing it at the end (when the work is finished), and by locking the hub interface while the work routine is running. It also adds a check at the start of the routine to see if the hub has already been disconnected, in which nothing should be done. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Alexandru Cornea <alexandru.cornea@intel.com> Tested-by: Alexandru Cornea <alexandru.cornea@intel.com> Fixes: 8520f380 ("USB: change hub initialization sleeps to delayed_work") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Michal Hocko authored
commit d8dc595c upstream. Eric has reported that he can see task(s) stuck in memcg OOM handler regularly. The only way out is to echo 0 > $GROUP/memory.oom_control His usecase is: - Setup a hierarchy with memory and the freezer (disable kernel oom and have a process watch for oom). - In that memory cgroup add a process with one thread per cpu. - In one thread slowly allocate once per second I think it is 16M of ram and mlock and dirty it (just to force the pages into ram and stay there). - When oom is achieved loop: * attempt to freeze all of the tasks. * if frozen send every task SIGKILL, unfreeze, remove the directory in cgroupfs. Eric has then pinpointed the issue to be memcg specific. All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled. Those that have received fatal signal will bypass the charge and should continue on their way out. The tricky part is that the exit path might trigger a page fault (e.g. exit_robust_list), thus the memcg charge, while its memcg is still under OOM because nobody has released any charges yet. Unlike with the in-kernel OOM handler the exiting task doesn't get TIF_MEMDIE set so it doesn't shortcut further charges of the killed task and falls to the memcg OOM again without any way out of it as there are no fatal signals pending anymore. This patch fixes the issue by checking PF_EXITING early in mem_cgroup_try_charge and bypass the charge same as if it had fatal signal pending or TIF_MEMDIE set. Normally exiting tasks (aka not killed) will bypass the charge now but this should be OK as the task is leaving and will release memory and increasing the memory pressure just to release it in a moment seems dubious wasting of cycles. Besides that charges after exit_signals should be rare. I am bringing this patch again (rebased on the current mmotm tree). I hope we can move forward finally. If there is still an opposition then I would really appreciate a concurrent approach so that we can discuss alternatives. http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference to the followup discussion when the patch has been dropped from the mmotm last time. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Takashi Iwai authored
commit d99a36f4 upstream. When multiple concurrent writes happen on the ALSA sequencer device right after the open, it may try to allocate vmalloc buffer for each write and leak some of them. It's because the presence check and the assignment of the buffer is done outside the spinlock for the pool. The fix is to move the check and the assignment into the spinlock. (The current implementation is suboptimal, as there can be multiple unnecessary vmallocs because the allocation is done before the check in the spinlock. But the pool size is already checked beforehand, so this isn't a big problem; that is, the only possible path is the multiple writes before any pool assignment, and practically seen, the current coverage should be "good enough".) The issue was triggered by syzkaller fuzzer. Buglink: http://lkml.kernel.org/r/CACT4Y+bSzazpXNvtAr=WXaL8hptqjHwqEyFA+VN2AWEx=aurkg@mail.gmail.comReported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Takashi Iwai authored
commit 06ab3003 upstream. A kernel WARNING in snd_rawmidi_transmit_ack() is triggered by syzkaller fuzzer: WARNING: CPU: 1 PID: 20739 at sound/core/rawmidi.c:1136 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff82999e2d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50 [<ffffffff81352089>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482 [<ffffffff813522b9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515 [<ffffffff84f80bd5>] snd_rawmidi_transmit_ack+0x275/0x400 sound/core/rawmidi.c:1136 [<ffffffff84fdb3c1>] snd_virmidi_output_trigger+0x4b1/0x5a0 sound/core/seq/seq_virmidi.c:163 [< inline >] snd_rawmidi_output_trigger sound/core/rawmidi.c:150 [<ffffffff84f87ed9>] snd_rawmidi_kernel_write1+0x549/0x780 sound/core/rawmidi.c:1223 [<ffffffff84f89fd3>] snd_rawmidi_write+0x543/0xb30 sound/core/rawmidi.c:1273 [<ffffffff817b0323>] __vfs_write+0x113/0x480 fs/read_write.c:528 [<ffffffff817b1db7>] vfs_write+0x167/0x4a0 fs/read_write.c:577 [< inline >] SYSC_write fs/read_write.c:624 [<ffffffff817b50a1>] SyS_write+0x111/0x220 fs/read_write.c:616 [<ffffffff86336c36>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185 Also a similar warning is found but in another path: Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff82be2c0d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50 [<ffffffff81355139>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482 [<ffffffff81355369>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515 [<ffffffff8527e69a>] rawmidi_transmit_ack+0x24a/0x3b0 sound/core/rawmidi.c:1133 [<ffffffff8527e851>] snd_rawmidi_transmit_ack+0x51/0x80 sound/core/rawmidi.c:1163 [<ffffffff852d9046>] snd_virmidi_output_trigger+0x2b6/0x570 sound/core/seq/seq_virmidi.c:185 [< inline >] snd_rawmidi_output_trigger sound/core/rawmidi.c:150 [<ffffffff85285a0b>] snd_rawmidi_kernel_write1+0x4bb/0x760 sound/core/rawmidi.c:1252 [<ffffffff85287b73>] snd_rawmidi_write+0x543/0xb30 sound/core/rawmidi.c:1302 [<ffffffff817ba5f3>] __vfs_write+0x113/0x480 fs/read_write.c:528 [<ffffffff817bc087>] vfs_write+0x167/0x4a0 fs/read_write.c:577 [< inline >] SYSC_write fs/read_write.c:624 [<ffffffff817bf371>] SyS_write+0x111/0x220 fs/read_write.c:616 [<ffffffff86660276>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185 In the former case, the reason is that virmidi has an open code calling snd_rawmidi_transmit_ack() with the value calculated outside the spinlock. We may use snd_rawmidi_transmit() in a loop just for consuming the input data, but even there, there is a race between snd_rawmidi_transmit_peek() and snd_rawmidi_tranmit_ack(). Similarly in the latter case, it calls snd_rawmidi_transmit_peek() and snd_rawmidi_tranmit_ack() separately without protection, so they are racy as well. The patch tries to address these issues by the following ways: - Introduce the unlocked versions of snd_rawmidi_transmit_peek() and snd_rawmidi_transmit_ack() to be called inside the explicit lock. - Rewrite snd_rawmidi_transmit() to be race-free (the former case). - Make the split calls (the latter case) protected in the rawmidi spin lock. Buglink: http://lkml.kernel.org/r/CACT4Y+YPq1+cYLkadwjWa5XjzF1_Vki1eHnVn-Lm0hzhSpu5PA@mail.gmail.com Buglink: http://lkml.kernel.org/r/CACT4Y+acG4iyphdOZx47Nyq_VHGbpJQK-6xNpiqUjaZYqsXOGw@mail.gmail.comReported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
John Allen authored
commit cb5490a5 upstream. Fix a bug where a kernel warning is triggered when performing a memory hotplug on ppc64. This warning may also occur on any architecture that uses the memory_probe_store interface. WARNING: at drivers/base/memory.c:200 CPU: 9 PID: 13042 Comm: systemd-udevd Not tainted 4.4.0-rc4-00113-g0bd0f1e6-dirty #7 NIP [c00000000055e034] pages_correctly_reserved+0x134/0x1b0 LR [c00000000055e7f8] memory_subsys_online+0x68/0x140 Call Trace: memory_subsys_online+0x68/0x140 device_online+0xb4/0x120 store_mem_state+0xb0/0x180 dev_attr_store+0x34/0x60 sysfs_kf_write+0x64/0xa0 kernfs_fop_write+0x17c/0x1e0 __vfs_write+0x40/0x160 vfs_write+0xb8/0x200 SyS_write+0x60/0x110 system_call+0x38/0xd0 The warning is triggered because there is a udev rule that automatically tries to online memory after it has been added. The udev rule varies from distro to distro, but will generally look something like: SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online" On any architecture that uses memory_probe_store to reserve memory, the udev rule will be triggered after the first section of the block is reserved and will subsequently attempt to online the entire block, interrupting the memory reservation process and causing the warning. This patch modifies memory_probe_store to add a block of memory with a single call to add_memory as opposed to looping through and adding each section individually. A single call to add_memory is protected by the mem_hotplug mutex which will prevent the udev rule from onlining memory until the reservation of the entire block is complete. Signed-off-by: John Allen <jallen@linux.vnet.ibm.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Yuval Mintz authored
commit 9c9a6524 upstream. This adds support for 3 new PCI device combinations - 1077:16a1, 1077:16a4 and 1077:16ad. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Wang Shilong authored
commit e84752d4 upstream. We won't change commit root, skip locking dance with commit root when walking backrefs, this can speed up btrfs send operations. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Kirill Tkhai authored
commit eeb61e53 upstream. The race may happen when somebody is changing task_group of a forking task. Child's cgroup is the same as parent's after dup_task_struct() (there just memory copying). Also, cfs_rq and rt_rq are the same as parent's. But if parent changes its task_group before it's called cgroup_post_fork(), we do not reflect this situation on child. Child's cfs_rq and rt_rq remain the same, while child's task_group changes in cgroup_post_fork(). To fix this we introduce fork() method, which calls sched_move_task() directly. This function changes sched_task_group on appropriate (also its logic has no problem with freshly created tasks, so we shouldn't introduce something special; we are able just to use it). Possibly, this decides the Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhaiSigned-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Eric Sandeen authored
commit 9de67c3b upstream. Today, if we perform an xfs_growfs which adds allocation groups, mp->m_maxagi is not properly updated when the growfs is complete. Therefore inodes will continue to be allocated only in the AGs which existed prior to the growfs, and the new space won't be utilized. This is because of this path in xfs_growfs_data_private(): xfs_growfs_data_private xfs_initialize_perag(mp, nagcount, &nagimax); if (mp->m_flags & XFS_MOUNT_32BITINODES) index = xfs_set_inode32(mp); else index = xfs_set_inode64(mp); if (maxagi) *maxagi = index; where xfs_set_inode* iterates over the (old) agcount in mp->m_sb.sb_agblocks, which has not yet been updated in the growfs path. So "index" will be returned based on the old agcount, not the new one, and new AGs are not available for inode allocation. Fix this by explicitly passing the proper AG count (which xfs_initialize_perag() already has) down another level, so that xfs_set_inode* can make the proper decision about acceptable AGs for inode allocation in the potentially newly-added AGs. This has been broken since 3.7, when these two xfs_set_inode* functions were added in commit 2d2194f6. Prior to that, we looped over "agcount" not sb_agblocks in these calculations. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Konrad Rzeszutek Wilk authored
commit d159457b upstream. Commit 8135cf8b (xen/pciback: Save xen_pci_op commands before processing it) broke enabling MSI-X because it would never copy the resulting vectors into the response. The number of vectors requested was being overwritten by the return value (typically zero for success). Save the number of vectors before processing the op, so the correct number of vectors are copied afterwards. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: "Jan Beulich" <JBeulich@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Konrad Rzeszutek Wilk authored
commit 8135cf8b upstream. Double fetch vulnerabilities that happen when a variable is fetched twice from shared memory but a security check is only performed the first time. The xen_pcibk_do_op function performs a switch statements on the op->cmd value which is stored in shared memory. Interestingly this can result in a double fetch vulnerability depending on the performed compiler optimization. This patch fixes it by saving the xen_pci_op command before processing it. We also use 'barrier' to make sure that the compiler does not perform any optimization. This is part of XSA155. Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: "Jan Beulich" <JBeulich@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Roger Pau Monné authored
commit 18779149 upstream. Since indirect descriptors are in memory shared with the frontend, the frontend could alter the first_sect and last_sect values after they have been validated but before they are recorded in the request. This may result in I/O requests that overflow the foreign page, possibly overwriting local pages when the I/O request is executed. When parsing indirect descriptors, only read first_sect and last_sect once. This is part of XSA155. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jan Beulich <JBeulich@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Roger Pau Monné authored
commit 1f13d75c upstream. A compiler may load a switch statement value multiple times, which could be bad when the value is in memory shared with the frontend. When converting a non-native request to a native one, ensure that src->operation is only loaded once by using READ_ONCE(). This is part of XSA155. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: "Jan Beulich" <JBeulich@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
David Vrabel authored
commit 68a33bfd upstream. Instead of open-coding memcpy()s and directly accessing Tx and Rx requests, use the new RING_COPY_REQUEST() that ensures the local copy is correct. This is more than is strictly necessary for guest Rx requests since only the id and gref fields are used and it is harmless if the frontend modifies these. This is part of XSA155. Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
David Vrabel authored
commit 0f589967 upstream. The last from guest transmitted request gives no indication about the minimum amount of credit that the guest might need to send a packet since the last packet might have been a small one. Instead allow for the worst case 128 KiB packet. This is part of XSA155. Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
David Vrabel authored
commit 454d5d88 upstream. Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly (i.e., by not considering that the other end may alter the data in the shared ring while it is being inspected). Safe usage of a request generally requires taking a local copy. Provide a RING_COPY_REQUEST() macro to use instead of RING_GET_REQUEST() and an open-coded memcpy(). This takes care of ensuring that the copy is done correctly regardless of any possible compiler optimizations. Use a volatile source to prevent the compiler from reordering or omitting the copy. This is part of XSA155. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Peter Zijlstra authored
commit 7bd3e239 upstream. The fact that volatile allows for atomic load/stores is a special case not a requirement for {READ,WRITE}_ONCE(). Their primary purpose is to force the compiler to emit load/stores _once_. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Linus Torvalds authored
commit dd369297 upstream. The use of READ_ONCE() causes lots of warnings witht he pending paravirt spinlock fixes, because those ends up having passing a member to a 'const' structure to READ_ONCE(). There should certainly be nothing wrong with using READ_ONCE() with a const source, but the helper function __read_once_size() would cause warnings because it would drop the 'const' qualifier, but also because the destination would be marked 'const' too due to the use of 'typeof'. Use a union of types in READ_ONCE() to avoid this issue. Also make sure to use parenthesis around the macro arguments to avoid possible operator precedence issues. Tested-by: Ingo Molnar <mingo@kernel.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Christian Borntraeger authored
commit 43239cbe upstream. Feedback has shown that WRITE_ONCE(x, val) is easier to use than ASSIGN_ONCE(val,x). There are no in-tree users yet, so lets change it for 3.19. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
- 30 Mar, 2016 5 commits
-
-
Christian Borntraeger authored
commit 230fa253 upstream. ACCESS_ONCE does not work reliably on non-scalar types. For example gcc 4.6 and 4.7 might remove the volatile tag for such accesses during the SRA (scalar replacement of aggregates) step https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145) Let's provide READ_ONCE/ASSIGN_ONCE that will do all accesses via scalar types as suggested by Linus Torvalds. Accesses larger than the machines word size cannot be guaranteed to be atomic. These macros will use memcpy and emit a build warning. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Eric W. Biederman authored
commit da362b09 upstream. Andrew Vagin <avagin@parallels.com> writes: > #define _GNU_SOURCE > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <sched.h> > #include <unistd.h> > #include <sys/mount.h> > > int main(int argc, char **argv) > { > int fd; > > fd = open("/proc/self/ns/mnt", O_RDONLY); > if (fd < 0) > return 1; > while (1) { > if (umount2("/", MNT_DETACH) || > setns(fd, CLONE_NEWNS)) > break; > } > > return 0; > } > > root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter > root@ubuntu:/home/avagin# strace ./nsenter > execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0 > ... > open("/proc/self/ns/mnt", O_RDONLY) = 3 > umount("/", MNT_DETACH) = 0 > setns(3, 131072) = 0 > umount("/", MNT_DETACH > causes: > [ 260.548301] ------------[ cut here ]------------ > [ 260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372! > [ 260.552068] invalid opcode: 0000 [#1] SMP > [ 260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy > [ 260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu > [ 260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000 > [ 260.552068] RIP: 0010:[<ffffffff811e9483>] [<ffffffff811e9483>] propagate_umount+0x123/0x130 > [ 260.552068] RSP: 0018:ffff880074825e98 EFLAGS: 00010246 > [ 260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190 > [ 260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0 > [ 260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0 > [ 260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140 > [ 260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000 > [ 260.552068] FS: 00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 > [ 260.552068] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0 > [ 260.552068] Stack: > [ 260.552068] 0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8 > [ 260.552068] ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160 > [ 260.552068] ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000 > [ 260.552068] Call Trace: > [ 260.552068] [<ffffffff811dcfac>] umount_tree+0x20c/0x260 > [ 260.552068] [<ffffffff811dd12b>] do_umount+0x12b/0x300 > [ 260.552068] [<ffffffff811cc642>] ? final_putname+0x22/0x50 > [ 260.552068] [<ffffffff811cc849>] ? putname+0x29/0x40 > [ 260.552068] [<ffffffff811dd88c>] SyS_umount+0xdc/0x100 > [ 260.552068] [<ffffffff8172aeff>] tracesys+0xe1/0xe6 > [ 260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01 > [ 260.552068] RIP [<ffffffff811e9483>] propagate_umount+0x123/0x130 > [ 260.552068] RSP <ffff880074825e98> > [ 260.611451] ---[ end trace 11c33d85f1d4c652 ]-- Which in practice is totally uninteresting. Only the global root user can do it, and it is just a stupid thing to do. However that is no excuse to allow a silly way to oops the kernel. We can avoid this silly problem by setting MNT_LOCKED on the rootfs mount point and thus avoid needing any special cases in the unmount code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
David S. Miller authored
commit fbd40ea0 upstream. When an inetdev is destroyed, every address assigned to the interface is removed. And in this scenerio we do two pointless things which can be very expensive if the number of assigned interfaces is large: 1) Address promotion. We are deleting all addresses, so there is no point in doing this. 2) A full nf conntrack table purge for every address. We only need to do this once, as is already caught by the existing masq_dev_notifier so masq_inet_event() can skip this. [mk] 3.12.*: The change in masq_inet_event() needs to be duplicated in both IPv4 and IPv6 version of the function, these two were merged in 3.18. Reported-by: Solar Designer <solar@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Gabriel Krisman Bertazi authored
commit 21b81716 upstream. Commit d63c7dd5 ("ipr: Fix out-of-bounds null overwrite") removed the end of line handling when storing the update_fw sysfs attribute. This changed the userpace API because it started refusing writes terminated by a line feed, which broke the update tools we already have. This patch re-adds that handling, so both a write terminated by a line feed or not can make it through with the update. Fixes: d63c7dd5 ("ipr: Fix out-of-bounds null overwrite") Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Cc: Insu Yun <wuninsu@gmail.com> Acked-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Insu Yun authored
commit d63c7dd5 upstream. Return value of snprintf is not bound by size value, 2nd argument. (https://www.kernel.org/doc/htmldocs/kernel-api/API-snprintf.html). Return value is number of printed chars, can be larger than 2nd argument. Therefore, it can write null byte out of bounds ofbuffer. Since snprintf puts null, it does not need to put additional null byte. Signed-off-by: Insu Yun <wuninsu@gmail.com> Reviewed-by: Shane Seymour <shane.seymour@hpe.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
- 16 Mar, 2016 7 commits
-
-
Jiri Slaby authored
-
Konrad Rzeszutek Wilk authored
commit 8d47065f upstream. Commit 408fb0e5 (xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set) prevented enabling MSI-X on passed-through virtual functions, because it checked the VF for PCI_COMMAND_MEMORY but this is not a valid bit for VFs. Instead, check the physical function for PCI_COMMAND_MEMORY. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Konrad Rzeszutek Wilk authored
commit 408fb0e5 upstream. commit f598282f ("PCI: Fix the NIU MSI-X problem in a better way") teaches us that dealing with MSI-X can be troublesome. Further checks in the MSI-X architecture shows that if the PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we may not be able to access the BAR (since they are memory regions). Since the MSI-X tables are located in there.. that can lead to us causing PCIe errors. Inhibit us performing any operation on the MSI-X unless the MEMORY bit is set. Note that Xen hypervisor with: "x86/MSI-X: access MSI-X table only after having enabled MSI-X" will return: xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3! When the generic MSI code tries to setup the PIRQ without MEMORY bit set. Which means with later versions of Xen (4.6) this patch is not neccessary. This is part of XSA-157 Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Konrad Rzeszutek Wilk authored
commit 7cfb905b upstream. Otherwise just continue on, returning the same values as previously (return of 0, and op->result has the PIRQ value). This does not change the behavior of XEN_PCI_OP_disable_msi[|x]. The pci_disable_msi or pci_disable_msix have the checks for msi_enabled or msix_enabled so they will error out immediately. However the guest can still call these operations and cause us to disable the 'ack_intr'. That means the backend IRQ handler for the legacy interrupt will not respond to interrupts anymore. This will lead to (if the device is causing an interrupt storm) for the Linux generic code to disable the interrupt line. Naturally this will only happen if the device in question is plugged in on the motherboard on shared level interrupt GSI. This is part of XSA-157 Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Konrad Rzeszutek Wilk authored
commit a396f3a2 upstream. Otherwise an guest can subvert the generic MSI code to trigger an BUG_ON condition during MSI interrupt freeing: for (i = 0; i < entry->nvec_used; i++) BUG_ON(irq_has_action(entry->irq + i)); Xen PCI backed installs an IRQ handler (request_irq) for the dev->irq whenever the guest writes PCI_COMMAND_MEMORY (or PCI_COMMAND_IO) to the PCI_COMMAND register. This is done in case the device has legacy interrupts the GSI line is shared by the backend devices. To subvert the backend the guest needs to make the backend to change the dev->irq from the GSI to the MSI interrupt line, make the backend allocate an interrupt handler, and then command the backend to free the MSI interrupt and hit the BUG_ON. Since the backend only calls 'request_irq' when the guest writes to the PCI_COMMAND register the guest needs to call XEN_PCI_OP_enable_msi before any other operation. This will cause the generic MSI code to setup an MSI entry and populate dev->irq with the new PIRQ value. Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY and cause the backend to setup an IRQ handler for dev->irq (which instead of the GSI value has the MSI pirq). See 'xen_pcibk_control_isr'. Then the guest disables the MSI: XEN_PCI_OP_disable_msi which ends up triggering the BUG_ON condition in 'free_msi_irqs' as there is an IRQ handler for the entry->irq (dev->irq). Note that this cannot be done using MSI-X as the generic code does not over-write dev->irq with the MSI-X PIRQ values. The patch inhibits setting up the IRQ handler if MSI or MSI-X (for symmetry reasons) code had been called successfully. P.S. Xen PCIBack when it sets up the device for the guest consumption ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device). XSA-120 addendum patch removed that - however when upstreaming said addendum we found that it caused issues with qemu upstream. That has now been fixed in qemu upstream. This is part of XSA-157 Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Konrad Rzeszutek Wilk authored
commit 5e0ce145 upstream. The guest sequence of: a) XEN_PCI_OP_enable_msix b) XEN_PCI_OP_enable_msix results in hitting an NULL pointer due to using freed pointers. The device passed in the guest MUST have MSI-X capability. The a) constructs and SysFS representation of MSI and MSI groups. The b) adds a second set of them but adding in to SysFS fails (duplicate entry). 'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that in a) pdev->msi_irq_groups is still set) and also free's ALL of the MSI-X entries of the device (the ones allocated in step a) and b)). The unwind code: 'free_msi_irqs' deletes all the entries and tries to delete the pdev->msi_irq_groups (which hasn't been set to NULL). However the pointers in the SysFS are already freed and we hit an NULL pointer further on when 'strlen' is attempted on a freed pointer. The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard against that. The check for msi_enabled is not stricly neccessary. This is part of XSA-157 Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Konrad Rzeszutek Wilk authored
commit 56441f3c upstream. The guest sequence of: a) XEN_PCI_OP_enable_msi b) XEN_PCI_OP_enable_msi c) XEN_PCI_OP_disable_msi results in hitting an BUG_ON condition in the msi.c code. The MSI code uses an dev->msi_list to which it adds MSI entries. Under the above conditions an BUG_ON() can be hit. The device passed in the guest MUST have MSI capability. The a) adds the entry to the dev->msi_list and sets msi_enabled. The b) adds a second entry but adding in to SysFS fails (duplicate entry) and deletes all of the entries from msi_list and returns (with msi_enabled is still set). c) pci_disable_msi passes the msi_enabled checks and hits: BUG_ON(list_empty(dev_to_msi_list(&dev->dev))); and blows up. The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard against that. The check for msix_enabled is not stricly neccessary. This is part of XSA-157. Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
- 14 Mar, 2016 7 commits
-
-
Rusty Russell authored
commit 8244062e upstream. For CONFIG_KALLSYMS, we keep two symbol tables and two string tables. There's one full copy, marked SHF_ALLOC and laid out at the end of the module's init section. There's also a cut-down version that only contains core symbols and strings, and lives in the module's core section. After module init (and before we free the module memory), we switch the mod->symtab, mod->num_symtab and mod->strtab to point to the core versions. We do this under the module_mutex. However, kallsyms doesn't take the module_mutex: it uses preempt_disable() and rcu tricks to walk through the modules, because it's used in the oops path. It's also used in /proc/kallsyms. There's nothing atomic about the change of these variables, so we can get the old (larger!) num_symtab and the new symtab pointer; in fact this is what I saw when trying to reproduce. By grouping these variables together, we can use a carefully-dereferenced pointer to ensure we always get one or the other (the free of the module init section is already done in an RCU callback, so that's safe). We allocate the init one at the end of the module init section, and keep the core one inside the struct module itself (it could also have been allocated at the end of the module core, but that's probably overkill). [ Rebased for 4.4-stable and older, because the following changes aren't in the older trees: - e0224418: adds arg to is_core_symbol - 7523e4dc: module_init/module_core/init_size/core_size become init_layout.base/core_layout.base/init_layout.size/core_layout.size. Original commit: 8244062e ] Reported-by: Weilong Chen <chenweilong@huawei.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Jason Andryuk authored
commit a6807590 upstream. The comparisons should be >= since 0x800 and 0x80 require an additional bit to store. For the 3 byte case, the existing shift would drop off 2 more bits than intended. For the 2 byte case, there should be 5 bits bits in byte 1, and 6 bits in byte 2. Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Laszlo Ersek <lersek@redhat.com> Cc: Peter Jones <pjones@redhat.com> Cc: Matthew Garrett <mjg59@coreos.com> Cc: "Lee, Chun-Yi" <jlee@suse.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Matt Fleming authored
commit e246eb56 upstream. Laszlo explains why this is a good idea, 'This is because the pstore filesystem can be backed by UEFI variables, and (for example) a crash might dump the last kilobytes of the dmesg into a number of pstore entries, each entry backed by a separate UEFI variable in the above GUID namespace, and with a variable name according to the above pattern. Please see "drivers/firmware/efi/efi-pstore.c". While this patch series will not prevent the user from deleting those UEFI variables via the pstore filesystem (i.e., deleting a pstore fs entry will continue to delete the backing UEFI variable), I think it would be nice to preserve the possibility for the sysadmin to delete Linux-created UEFI variables that carry portions of the crash log, *without* having to mount the pstore filesystem.' There's also no chance of causing machines to become bricked by deleting these variables, which is the whole purpose of excluding things from the whitelist. Use the LINUX_EFI_CRASH_GUID guid and a wildcard '*' for the match so that we don't have to update the string in the future if new variable name formats are created for crash dump variables. Reported-by: Laszlo Ersek <lersek@redhat.com> Acked-by: Peter Jones <pjones@redhat.com> Tested-by: Peter Jones <pjones@redhat.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: "Lee, Chun-Yi" <jlee@suse.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Peter Jones authored
commit ed8b0de5 upstream. "rm -rf" is bricking some peoples' laptops because of variables being used to store non-reinitializable firmware driver data that's required to POST the hardware. These are 100% bugs, and they need to be fixed, but in the mean time it shouldn't be easy to *accidentally* brick machines. We have to have delete working, and picking which variables do and don't work for deletion is quite intractable, so instead make everything immutable by default (except for a whitelist), and make tools that aren't quite so broad-spectrum unset the immutable flag. Signed-off-by: Peter Jones <pjones@redhat.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Peter Jones authored
commit 8282f5d9 upstream. All the variables in this list so far are defined to be in the global namespace in the UEFI spec, so this just further ensures we're validating the variables we think we are. Including the guid for entries will become more important in future patches when we decide whether or not to allow deletion of variables based on presence in this list. Signed-off-by: Peter Jones <pjones@redhat.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Peter Jones authored
commit 3dcb1f55 upstream. Actually translate from ucs2 to utf8 before doing the test, and then test against our other utf8 data, instead of fudging it. Signed-off-by: Peter Jones <pjones@redhat.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-
Peter Jones authored
commit e0d64e6a upstream. Translate EFI's UCS-2 variable names to UTF-8 instead of just assuming all variable names fit in ASCII. Signed-off-by: Peter Jones <pjones@redhat.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Tested-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
-