- 26 Sep, 2014 40 commits
-
-
Thomas Hellstrom authored
Commit "drm/vmwgfx: correct fb_fix_screeninfo.line_length", while fixing a vmwgfx fbdev bug, also writes the pitch to a supposedly read-only register: SVGA_REG_BYTES_PER_LINE, while it should be (and also in fact is) written to SVGA_REG_PITCHLOCK. This patch is Cc'd stable because of the unknown effects writing to this register might have, particularly on older device versions. v2: Updated log message. Cc: stable@vger.kernel.org Cc: Christopher Friedt <chrisfriedt@gmail.com> Tested-by:
Christopher Friedt <chrisfriedt@gmail.com> Signed-off-by:
Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by:
Jakob Bornecrantz <jakob@vmware.com> (cherry picked from commit 4e578080) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Jaganath Kanakkassery authored
The length check is invalid since the length varies with type of info response. This was introduced by the commit cb3b3152 Because of this, l2cap info rsp is not handled and command reject is sent. > ACL data: handle 11 flags 0x02 dlen 16 L2CAP(s): Info rsp: type 2 result 0 Extended feature mask 0x00b8 Enhanced Retransmission mode Streaming mode FCS Option Fixed Channels < ACL data: handle 11 flags 0x00 dlen 10 L2CAP(s): Command rej: reason 0 Command not understood Cc: stable@vger.kernel.org Signed-off-by:
Jaganath Kanakkassery <jaganath.k@samsung.com> Signed-off-by:
Chan-Yeol Park <chanyeol.park@samsung.com> Acked-by:
Johan Hedberg <johan.hedberg@intel.com> Signed-off-by:
Gustavo Padovan <gustavo.padovan@collabora.co.uk> ath9k_htc: Handle IDLE state transition properly Make sure that a chip reset is done when IDLE is turned off - this fixes authentication timeouts. Cc: stable@vger.kernel.org Reported-by:
Ignacy Gawedzki <i@lri.fr> Signed-off-by:
Sujith Manoharan <c_manoha@qca.qualcomm.com> Signed-off-by:
John W. Linville <linville@tuxdriver.com> ath9k: fix an RCU issue in calling ieee80211_get_tx_rates ath_txq_schedule is called outside of the drv_tx call, so it needs RCU protection. Signed-off-by:
Felix Fietkau <nbd@openwrt.org> Signed-off-by:
John W. Linville <linville@tuxdriver.com> Bluetooth: Fix invalid length check in l2cap_information_rsp() The length check is invalid since the length varies with type of info response. This was introduced by the commit cb3b3152 Because of this, l2cap info rsp is not handled and command reject is sent. > ACL data: handle 11 flags 0x02 dlen 16 L2CAP(s): Info rsp: type 2 result 0 Extended feature mask 0x00b8 Enhanced Retransmission mode Streaming mode FCS Option Fixed Channels < ACL data: handle 11 flags 0x00 dlen 10 L2CAP(s): Command rej: reason 0 Command not understood Cc: stable@vger.kernel.org Signed-off-by:
Jaganath Kanakkassery <jaganath.k@samsung.com> Signed-off-by:
Chan-Yeol Park <chanyeol.park@samsung.com> Acked-by:
Johan Hedberg <johan.hedberg@intel.com> Signed-off-by:
Gustavo Padovan <gustavo.padovan@collabora.co.uk> Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 nl80211: fix attrbuf access race by allocating a separate one Since my commit 3713b4e3 ("nl80211: allow splitting wiphy information in dumps"), nl80211_dump_wiphy() uses the global nl80211_fam.attrbuf for parsing the incoming data. This wouldn't be a problem if it only did so on the first dump iteration which is locked against other commands in generic netlink, but due to space constraints in cb->args (the needed state doesn't fit) I decided to always parse the original message. That's racy though since nl80211_fam.attrbuf could be used by some other parsing in generic netlink concurrently. For now, fix this by allocating a separate parse buffer (it's a bit too big for the stack, currently 1448 bytes on 64-bit). For -next, I'll change the code to parse into the global buffer in the first round only and then allocate a smaller buffer to keep the data in cb->args. Reported-by:
Linus Torvalds <torvalds@linux-foundation.org> Acked-by:
David S. Miller <davem@davemloft.net> Acked-by:
John W. Linville <linville@tuxdriver.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com> (cherry picked from commit da9910ac 3f6fa3d4) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Konrad Rzeszutek Wilk authored
The 'intel_idle_probe' probes the CPU and sets the CPU notifier. But if later on during the module initialization we fail (say in cpuidle_register_driver), we stop loading, but we neglect to unregister the CPU notifier. This means that during CPU hotplug events the system will fail: calling intel_idle_init+0x0/0x326 @ 1 intel_idle: MWAIT substates: 0x1120 intel_idle: v0.4 model 0x2A intel_idle: lapic_timer_reliable_states 0xffffffff intel_idle: intel_idle yielding to none initcall intel_idle_init+0x0/0x326 returned -19 after 14 usecs ... some time later, offlining and onlining a CPU: cpu 3 spinlock event irq 62 BUG: unable to ] __cpuidle_register_device+0x1c/0x120 PGD 99b8b067 PUD 99b95067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: xen_evtchn nouveau mxm_wmi wmi radeon ttm i915 fbcon tileblit font atl1c bitblit softcursor drm_kms_helper video xen_blkfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea xenfs xen_privcmd mperf CPU 0 Pid: 2302, comm: udevd Not tainted 3.8.0-rc3upstream-00249-g09ad159 #1 MSI MS-7680/H61M-P23 (MS-7680) RIP: e030:[<ffffffff814d956c>] [<ffffffff814d956c>] __cpuidle_register_device+0x1c/0x120 RSP: e02b:ffff88009dacfcb8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff880105380000 RCX: 000000000000001c RDX: 0000000000000000 RSI: 0000000000000055 RDI: ffff880105380000 RBP: ffff88009dacfce8 R08: ffffffff81a4f048 R09: 0000000000000008 R10: 0000000000000008 R11: 0000000000000000 R12: ffff880105380000 R13: 00000000ffffffdd R14: 0000000000000000 R15: ffffffff81a523d0 FS: 00007f37bd83b7a0(0000) GS:ffff880105200000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 00000000a09ea000 CR4: 0000000000042660 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process udevd (pid: 2302, threadinfo ffff88009dace000, task ffff88009afb47f0) Stack: ffffffff8107f2d0 ffffffff810c2fb7 ffff88009dacfce8 00000000ffffffea ffff880105380000 00000000ffffffdd ffff88009dacfd08 ffffffff814d9882 0000000000000003 ffff880105380000 ffff88009dacfd28 ffffffff81340afd Call Trace: [<ffffffff8107f2d0>] ? collect_cpu_info_local+0x30/0x30 [<ffffffff810c2fb7>] ? __might_sleep+0xe7/0x100 [<ffffffff814d9882>] cpuidle_register_device+0x32/0x70 [<ffffffff81340afd>] intel_idle_cpu_init+0xad/0x110 [<ffffffff81340bc8>] cpu_hotplug_notify+0x68/0x80 [<ffffffff8166023d>] notifier_call_chain+0x4d/0x70 [<ffffffff810bc369>] __raw_notifier_call_chain+0x9/0x10 [<ffffffff81094a4b>] __cpu_notify+0x1b/0x30 [<ffffffff81652cf7>] _cpu_up+0x103/0x14b [<ffffffff81652e18>] cpu_up+0xd9/0xec [<ffffffff8164a254>] store_online+0x94/0xd0 [<ffffffff814122fb>] dev_attr_store+0x1b/0x20 [<ffffffff81216404>] sysfs_write_file+0xf4/0x170 [<ffffffff811a1024>] vfs_write+0xb4/0x130 [<ffffffff811a17ea>] sys_write+0x5a/0xa0 [<ffffffff816643a9>] system_call_fastpath+0x16/0x1b Code: 03 18 00 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 30 48 89 5d e8 4c 89 65 f0 48 89 fb 4c 89 6d f8 e8 84 08 00 00 <48> 8b 78 08 49 89 c4 e8 f8 7f c1 ff 89 c2 b8 ea ff ff ff 84 d2 RIP [<ffffffff814d956c>] __cpuidle_register_device+0x1c/0x120 RSP <ffff88009dacfcb8> This patch fixes that by moving the CPU notifier registration as the last item to be done by the module. Signed-off-by:
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by:
Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: 3.6+ <stable@vger.kernel.org> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 6f8c2e79) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Florian Westphal authored
local_df means 'ignore DF bit if set', so if its set we're allowed to perform ip fragmentation. This wasn't noticed earlier because the output path also drops such skbs (and emits needed icmp error) and because netfilter ip defrag did not set local_df until couple of days ago. Only difference is that DF-packets-larger-than MTU now discarded earlier (f.e. we avoid pointless netfilter postrouting trip). While at it, drop the repeated test ip_exceeds_mtu, checking it once is enough... Fixes: fe6cc55f ("net: ip, ipv6: handle gso skbs in forwarding path") Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net> (cherry picked from commit ca6c5d4a) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Christopher Friedt authored
commit aa6de142 upstream. Previously, the vmwgfx_fb driver would allow users to call FBIOSET_VINFO, but it would not adjust the FINFO properly, resulting in distorted screen rendering. The patch corrects that behaviour. See https://bugs.gentoo.org/show_bug.cgi?id=494794 for examples. Signed-off-by:
Christopher Friedt <chrisfriedt@gmail.com> Reviewed-by:
Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit b8a0ddef)
-
Roger Quadros authored
OMAP3 doesn't contain "l3_init_clkdm" clock domain. Use the proper clock domains for USB Host and USB TLL modules. Gets rid of the following warnings during boot omap_hwmod: usb_host_hs: could not associate to clkdm l3_init_clkdm omap_hwmod: usb_tll_hs: could not associate to clkdm l3_init_clkdm Reported-by:
Nishanth Menon <nm@ti.com> Cc: Paul Walmsley <paul@pwsan.com> Signed-off-by:
Roger Quadros <rogerq@ti.com> Fixes: de231388 ("ARM: OMAP: USB: EHCI and OHCI hwmod structures for OMAP3") Cc: Keshava Munegowda <keshava_mgowda@ti.com> Cc: Partha Basak <parthab@india.ti.com> Signed-off-by:
Paul Walmsley <paul@pwsan.com> (cherry picked from commit c6c56697) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Nicholas Santos authored
Patch to add the Formosa Industrial Computing, Inc. Infrared Receiver [IR605A/Q] to hid-ids.h and hid-quirks.c. This IR receiver causes about a 10 second timeout when the usbhid driver attempts to initialze the device. Adding this device to the quirks list with HID_QUIRK_NO_INIT_REPORTS removes the delay. Signed-off-by:
Nicholas Santos <nicholas.santos@gmail.com> [jkosina@suse.cz: fix ordering] Signed-off-by:
Jiri Kosina <jkosina@suse.cz> (cherry picked from commit 320cde19) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Lai Jiangshan authored
When a kworker should die, the kworkre is notified through WORKER_DIE flag instead of kthread_should_stop(). This, IIRC, is primarily to keep the test synchronized inside worker_pool lock. WORKER_DIE is first set while holding pool->lock, the lock is dropped and kthread_stop() is called. Unfortunately, this means that there's a slight chance that the target kworker may see WORKER_DIE before kthread_stop() finishes and exits and frees the target task before or during kthread_stop(). Fix it by pinning the target task before setting WORKER_DIE and putting it after kthread_stop() is done. tj: Improved patch description and comment. Moved pinning above WORKER_DIE for better signify what it's protecting. CC: stable@vger.kernel.org Signed-off-by:
Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by:
Tejun Heo <tj@kernel.org> (cherry picked from commit 5bdfff96) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Jason Gunthorpe authored
Some Atmel TPMs provide completely wrong timeouts from their TPM_CAP_PROP_TIS_TIMEOUT query. This patch detects that and returns new correct values via a DID/VID table in the TIS driver. Tested on ARM using an AT97SC3204T FW version 37.16 Cc: <stable@vger.kernel.org> [PHuewe: without this fix these 'broken' Atmel TPMs won't function on older kernels] Signed-off-by:
"Berg, Christopher" <Christopher.Berg@atmel.com> Signed-off-by:
Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by:
Peter Huewe <peterhuewe@gmx.de> (cherry picked from commit 8e54caf4) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Linus Torvalds authored
The performance regression that Josef Bacik reported in the pathname lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made me look at performance stability of the dcache code, just to verify that the problem was actually fixed. That turned up a few other problems in this area. There are a few cases where we exit RCU lookup mode and go to the slow serializing case when we shouldn't, Al has fixed those and they'll come in with the next VFS pull. But my performance verification also shows that link_path_walk() turns out to have a very unfortunate 32-bit store of the length and hash of the name we look up, followed by a 64-bit read of the combined hash_len field. That screws up the processor store to load forwarding, causing an unnecessary hickup in this critical routine. It's caused by the ugly calling convention for the "hash_name()" function, and easily fixed by just making hash_name() fill in the whole 'struct qstr' rather than passing it a pointer to just the hash value. With that, the profile for this function looks much smoother. Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Merge branch 'parisc-3.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc updates from Helge Deller: "The most important patch is a new Light Weigth Syscall (LWS) for 8, 16, 32 and 64 bit atomic CAS operations which is required in order to be able to implement the atomic gcc builtins on our platform. Other than that, we wire up the seccomp, getrandom and memfd_create syscalls, fixes a minor off-by-one bug and a wrong printk string" * 'parisc-3.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Implement new LWS CAS supporting 64 bit operations. parisc: Wire up seccomp, getrandom and memfd_create syscalls parisc: dino: fix %d confusingly prefixed with 0x in format string parisc: sys_hpux: NUL terminator is one past the end Merge tag 'ntb-3.17' of git://github.com/jonmason/ntb Pull ntb driver bugfixes from Jon Mason: "NTB driver fixes for queue spread and buffer alignment. Also, update to MAINTAINERS to reflect new e-mail address" * tag 'ntb-3.17' of git://github.com/jonmason/ntb: ntb: Add alignment check to meet hardware requirement MAINTAINERS: update NTB info NTB: correct the spread of queues over mw's Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull ARM irq chip fixes from Thomas Gleixner: "Another pile of ARM specific irq chip fixlets: - off by one bugs in the crossbar driver - missing annotations - a bunch of "make it compile" updates I pulled the lot today from Jason, but it has been in -next for at least a week" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip: gic-v3: Declare rdist as __percpu pointer to __iomem pointer irqchip: gic: Make gic_default_routable_irq_domain_ops static irqchip: exynos-combiner: Fix compilation error on ARM64 irqchip: crossbar: Off by one bugs in init irqchip: gic-v3: Tag all low level accessors __maybe_unused irqchip: gic-v3: Only define gic_peek_irq() when building SMP Merge tag 'irqchip-urgent-3.17' of git://git.infradead.org/users/jcooper/linux into irq/urgent irqchip fixes for v3.17 from Jason Cooper - GIC/GICV3: Various fixlets - crossbar: Fix off-by-one bug - exynos-combiner: Fix arm64 build error ntb: Add alignment check to meet hardware requirement The NTB translate register must have the value to be BAR size aligned. This alignment check make sure that the DMA memory allocated has the proper alignment. Another requirement for NTB to function properly with memory window BAR size greater or equal to 4M is to use the CMA feature in 3.16 kernel with the appropriate CONFIG_CMA_ALIGNMENT and CONFIG_CMA_SIZE_MBYTES set. Signed-off-by:
Dave Jiang <dave.jiang@intel.com> Signed-off-by:
Jon Mason <jdmason@kudzu.us> MAINTAINERS: update NTB info Update my contact info to my personal email address and add Dave Jiang. Signed-off-by:
Jon Mason <jon.mason@intel.com> Signed-off-by:
Dave Jiang <dave.jiang@intel.com> NTB: correct the spread of queues over mw's The detection of an uneven number of queues on the given memory windows was not correct. The mw_num is zero based and the mod should be division to spread them evenly over the mw's. Signed-off-by:
Jon Mason <jon.mason@intel.com> Merge branches 'locking-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex and timer fixes from Thomas Gleixner: "A oneliner bugfix for the jinxed futex code: - Drop hash bucket lock in the error exit path. I really could slap myself for intruducing that bug while fixing all the other horror in that code three month ago ... and the timer department is not too proud about the following fixes: - Deal with a long standing rounding bug in the timeval to jiffies conversion. It's a real issue and this fix fell through the cracks for quite some time. - Another round of alarmtimer fixes. Finally this code gets used more widely and the subtle issues hidden for quite some time are noticed and fixed. Nothing really exciting, just the itty bitty details which bite the serious users here and there" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unlock hb->lock in futex_wait_requeue_pi() error path * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Lock k_itimer during timer callback alarmtimer: Do not signal SIGEV_NONE timers alarmtimer: Return relative times in timer_gettime jiffies: Fix timeval conversion to jiffies parisc: Implement new LWS CAS supporting 64 bit operations. The current LWS cas only works correctly for 32bit. The new LWS allows for CAS operations of variable size. Signed-off-by:
Guy Martin <gmsoft@tuxicoman.be> Cc: <stable@vger.kernel.org> # 3.13+ Signed-off-by:
Helge Deller <deller@gmx.de> vfs: fix bad hashing of dentries Josef Bacik found a performance regression between 3.2 and 3.10 and narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long' accesses for dcache name comparison and hashing"). He reports: "The test case is essentially for (i = 0; i < 1000000; i++) mkdir("a$i"); On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k dir/sec with 3.10. This is because we spend waaaaay more time in __d_lookup on 3.10 than in 3.2. The new hashing function for strings is suboptimal for < sizeof(unsigned long) string names (and hell even > sizeof(unsigned long) string names that I've tested). I broke out the old hashing function and the new one into a userspace helper to get real numbers and this is what I'm getting: Old hash table had 1000000 entries, 0 dupes, 0 max dupes New hash table had 12628 entries, 987372 dupes, 900 max dupes We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash My test does the hash, and then does the d_hash into a integer pointer array the same size as the dentry hash table on my system, and then just increments the value at the address we got to see how many entries we overlap with. As you can see the old hash function ended up with all 1 million entries in their own bucket, whereas the new one they are only distributed among ~12.5k buckets, which is why we're using so much more CPU in __d_lookup". The reason for this hash regression is two-fold: - On 64-bit architectures the down-mixing of the original 64-bit word-at-a-time hash into the final 32-bit hash value is very simplistic and suboptimal, and just adds the two 32-bit parts together. In particular, because there is no bit shuffling and the mixing boundary is also a byte boundary, similar character patterns in the low and high word easily end up just canceling each other out. - the old byte-at-a-time hash mixed each byte into the final hash as it hashed the path component name, resulting in the low bits of the hash generally being a good source of hash data. That is not true for the word-at-a-time case, and the hash data is distributed among all the bits. The fix is the same in both cases: do a better job of mixing the bits up and using as much of the hash data as possible. We already have the "hash_32|64()" functions to do that. Reported-by:
Josef Bacik <jbacik@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> alarmtimer: Lock k_itimer during timer callback Locks the k_itimer's it_lock member when handling the alarm timer's expiry callback. The regular posix timers defined in posix-timers.c have this lock held during timout processing because their callbacks are routed through posix_timer_fn(). The alarm timers follow a different path, so they ought to grab the lock somewhere else. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by:
Richard Larocque <rlarocque@google.com> Signed-off-by:
John Stultz <john.stultz@linaro.org> alarmtimer: Do not signal SIGEV_NONE timers Avoids sending a signal to alarm timers created with sigev_notify set to SIGEV_NONE by checking for that special case in the timeout callback. The regular posix timers avoid sending signals to SIGEV_NONE timers by not scheduling any callbacks for them in the first place. Although it would be possible to do something similar for alarm timers, it's simpler to handle this as a special case in the timeout. Prior to this patch, the alarm timer would ignore the sigev_notify value and try to deliver signals to the process anyway. Even worse, the sanity check for the value of sigev_signo is skipped when SIGEV_NONE was specified, so the signal number could be bogus. If sigev_signo was an unitialized value (as it often would be if SIGEV_NONE is used), then it's hard to predict which signal will be sent. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by:
Richard Larocque <rlarocque@google.com> Signed-off-by:
John Stultz <john.stultz@linaro.org> alarmtimer: Return relative times in timer_gettime Returns the time remaining for an alarm timer, rather than the time at which it is scheduled to expire. If the timer has already expired or it is not currently scheduled, the it_value's members are set to zero. This new behavior matches that of the other posix-timers and the POSIX specifications. This is a change in user-visible behavior, and may break existing applications. Hopefully, few users rely on the old incorrect behavior. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by:
Richard Larocque <rlarocque@google.com> [jstultz: minor style tweak] Signed-off-by:
John Stultz <john.stultz@linaro.org> jiffies: Fix timeval conversion to jiffies timeval_to_jiffies tried to round a timeval up to an integral number of jiffies, but the logic for doing so was incorrect: intervals corresponding to exactly N jiffies would become N+1. This manifested itself particularly repeatedly stopping/starting an itimer: setitimer(ITIMER_PROF, &val, NULL); setitimer(ITIMER_PROF, NULL, &val); would add a full tick to val, _even if it was exactly representable in terms of jiffies_ (say, the result of a previous rounding.) Doing this repeatedly would cause unbounded growth in val. So fix the math. Here's what was wrong with the conversion: we essentially computed (eliding seconds) jiffies = usec * (NSEC_PER_USEC/TICK_NSEC) by using scaling arithmetic, which took the best approximation of NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC = x/(2^USEC_JIFFIE_SC), and computed: jiffies = (usec * x) >> USEC_JIFFIE_SC and rounded this calculation up in the intermediate form (since we can't necessarily exactly represent TICK_NSEC in usec.) But the scaling arithmetic is a (very slight) *over*approximation of the true value; that is, instead of dividing by (1 usec/ 1 jiffie), we effectively divided by (1 usec/1 jiffie)-epsilon (rounding down). This would normally be fine, but we want to round timeouts up, and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this would be fine if our division was exact, but dividing this by the slightly smaller factor was equivalent to adding just _over_ 1 to the final result (instead of just _under_ 1, as desired.) In particular, with HZ=1000, we consistently computed that 10000 usec was 11 jiffies; the same was true for any exact multiple of TICK_NSEC. We could possibly still round in the intermediate form, adding something less than 2^USEC_JIFFIE_SC - 1, but easier still is to convert usec->nsec, round in nanoseconds, and then convert using time*spec*_to_jiffies. This adds one constant multiplication, and is not observably slower in microbenchmarks on recent x86 hardware. Tested: the following program: int main() { struct itimerval zero = {{0, 0}, {0, 0}}; /* Initially set to 10 ms. */ struct itimerval initial = zero; initial.it_interval.tv_usec = 10000; setitimer(ITIMER_PROF, &initial, NULL); /* Save and restore several times. */ for (size_t i = 0; i < 10; ++i) { struct itimerval prev; setitimer(ITIMER_PROF, &zero, &prev); /* on old kernels, this goes up by TICK_USEC every iteration */ printf("previous value: %ld %ld %ld %ld\n", prev.it_interval.tv_sec, prev.it_interval.tv_usec, prev.it_value.tv_sec, prev.it_value.tv_usec); setitimer(ITIMER_PROF, &prev, NULL); } return 0; } Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paul Turner <pjt@google.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Reviewed-by:
Paul Turner <pjt@google.com> Reported-by:
Aaron Jacobs <jacobsa@google.com> Signed-off-by:
Andrew Hunter <ahh@google.com> [jstultz: Tweaked to apply to 3.17-rc] Signed-off-by:
John Stultz <john.stultz@linaro.org> futex: Unlock hb->lock in futex_wait_requeue_pi() error path futex_wait_requeue_pi() calls futex_wait_setup(). If futex_wait_setup() succeeds it returns with hb->lock held and preemption disabled. Now the sanity check after this does: if (match_futex(&q.key, &key2)) { ret = -EINVAL; goto out_put_keys; } which releases the keys but does not release hb->lock. So we happily return to user space with hb->lock held and therefor preemption disabled. Unlock hb->lock before taking the exit route. Reported-by:
Dave "Trinity" Jones <davej@redhat.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Reviewed-by:
Darren Hart <dvhart@linux.intel.com> Reviewed-by:
Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanosSigned-off-by:
Thomas Gleixner <tglx@linutronix.de> irqchip: gic-v3: Declare rdist as __percpu pointer to __iomem pointer The __percpu __iomem annotations on the rdist base are contradictory and confuse static checkers such as sparse. This patch fixes the anotations so that rdist is described as a __percpu pointer to an __iomem pointer. Cc: Jason Cooper <jason@lakedaemon.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com> Acked-by:
Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/1409062410-25891-9-git-send-email-will.deacon@arm.comSigned-off-by:
Jason Cooper <jason@lakedaemon.net> irqchip: gic: Make gic_default_routable_irq_domain_ops static The internal irq domain ops for the GIC are not used directly anywhere else, so make them static. This gets rid of a sparse warning on the file. Cc: Jason Cooper <jason@lakedaemon.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by:
Will Deacon <will.deacon@arm.com> Acked-by:
Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/1409062410-25891-8-git-send-email-will.deacon@arm.comSigned-off-by:
Jason Cooper <jason@lakedaemon.net> irqchip: exynos-combiner: Fix compilation error on ARM64 The following compilation error occurs on 64-bit Exynos7 SoC: drivers/irqchip/exynos-combiner.c: In function ‘combiner_irq_domain_map’: drivers/irqchip/exynos-combiner.c:162:2: error: implicit declaration of function ‘set_irq_flags’ [-Werror=implicit-function-declaration] set_irq_flags(irq, IRQF_VALID | IRQF_PROBE); ^ drivers/irqchip/exynos-combiner.c:162:21: error: ‘IRQF_VALID’ undeclared (first use in this function) set_irq_flags(irq, IRQF_VALID | IRQF_PROBE); ^ drivers/irqchip/exynos-combiner.c:162:21: note: each undeclared identifier is reported only once for each function it appears in drivers/irqchip/exynos-combiner.c:162:34: error: ‘IRQF_PROBE’ undeclared (first use in this function) set_irq_flags(irq, IRQF_VALID | IRQF_PROBE); Fix the build error by including linux/interrupt.h. Signed-off-by:
Naveen Krishna Chatradhi <ch.naveen@samsung.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Sudeep Holla <sudeep.holla@arm.com> Link: https://lkml.kernel.org/r/1409722329-18309-1-git-send-email-ch.naveen@samsung.comSigned-off-by:
Jason Cooper <jason@lakedaemon.net> parisc: Wire up seccomp, getrandom and memfd_create syscalls With secure computing we only support the SECCOMP_MODE_STRICT mode for now. Signed-off-by:
Helge Deller <deller@gmx.de> parisc: dino: fix %d confusingly prefixed with 0x in format string Signed-off-by:
Hans Wennborg <hans@hanshq.net> Signed-off-by:
Helge Deller <deller@gmx.de> parisc: sys_hpux: NUL terminator is one past the end We allocate "len" number of chars so we should put the NUL at "len - 1" to avoid corrupting memory. Btw, strlen_user() is different from the normal strlen() function because it includes NUL terminator in the count. Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Helge Deller <deller@gmx.de> irqchip: crossbar: Off by one bugs in init My static checker complains that the ">" should be ">=" or else we go beyond the end of the cb->irq_map[] array on the next line. Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Jason Cooper <jason@lakedaemon.net> irqchip: gic-v3: Tag all low level accessors __maybe_unused This is only really needed for gic_write_sgi1r in the !SMP case since it is only referenced in the SMP initialisation code but it seems better to have these functions all next to each other and declared consistently. Signed-off-by:
Mark Brown <broonie@linaro.org> Link: https://lkml.kernel.org/r/1406748194-21094-1-git-send-email-broonie@kernel.orgSigned-off-by:
Jason Cooper <jason@lakedaemon.net> irqchip: gic-v3: Only define gic_peek_irq() when building SMP If building with CONFIG_SMP disbled (for example, with allnoconfig) then GCC complains that the static function gic_peek_irq() is defined but not used since the only reference is in the SMP initialisation code. Fix this by moving the function definition inside the ifdef. Signed-off-by:
Mark Brown <broonie@linaro.org> Acked-by:
Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/1406480224-24628-1-git-send-email-broonie@kernel.orgSigned-off-by:
Jason Cooper <jason@lakedaemon.net> (cherry picked from commit 9226b5b4 99d263d4) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Al Viro authored
D_HASH{MASK,BITS} are used once each, both in the same function (d_hash()). At this point they are actively misguiding - they imply that values are compiler constants, which is no longer true. Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk> (cherry picked from commit 482db906) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Tejun Heo authored
While a queue is being destroyed, all the blkgs are destroyed and its ->root_blkg pointer is set to NULL. If someone else starts to drain while the queue is in this state, the following oops happens. NULL pointer dereference at 0000000000000028 IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 PGD e4a1067 PUD b773067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched] CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000 RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001 R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450 R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28 FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0 Stack: ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80 ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58 ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450 Call Trace: [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60 [<ffffffff81427641>] __blk_drain_queue+0x71/0x180 [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0 [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120 [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50 [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40 [<ffffffff8142d476>] blk_release_queue+0x26/0xd0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff81427505>] blk_put_queue+0x15/0x20 [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0 [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0 [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20 [<ffffffff817930e2>] device_release+0x32/0xa0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff817934d7>] put_device+0x17/0x20 [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0 [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40 [<ffffffff817d1257>] sdev_store_delete+0x27/0x30 [<ffffffff81792ca8>] dev_attr_store+0x18/0x30 [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50 [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0 [<ffffffff811f69bd>] SyS_write+0x4d/0xc0 [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b 776687bc ("block, blk-mq: draining can't be skipped even if bypass_depth was non-zero") made it easier to trigger this bug by making blk_queue_bypass_start() drain even when it loses the first bypass test to blk_cleanup_queue(); however, the bug has always been there even before the commit as blk_queue_bypass_start() could race against queue destruction, win the initial bypass test but perform the actual draining after blk_cleanup_queue() already destroyed all blkgs. Fix it by skippping calling into policy draining if all the blkgs are already gone. Signed-off-by:
Tejun Heo <tj@kernel.org> Reported-by:
Shirish Pargaonkar <spargaonkar@suse.com> Reported-by:
Sasha Levin <sasha.levin@oracle.com> Reported-by:
Jet Chen <jet.chen@intel.com> Cc: stable@vger.kernel.org Tested-by:
Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com> Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment" This reverts commit 254c4407. It causes crashes with cryptsetup, even after a few iterations and updates. Drop it for now. blkcg: don't call into policy draining if root_blkg is already gone While a queue is being destroyed, all the blkgs are destroyed and its ->root_blkg pointer is set to NULL. If someone else starts to drain while the queue is in this state, the following oops happens. NULL pointer dereference at 0000000000000028 IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 PGD e4a1067 PUD b773067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched] CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000 RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001 R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450 R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28 FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0 Stack: ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80 ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58 ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450 Call Trace: [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60 [<ffffffff81427641>] __blk_drain_queue+0x71/0x180 [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0 [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120 [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50 [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40 [<ffffffff8142d476>] blk_release_queue+0x26/0xd0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff81427505>] blk_put_queue+0x15/0x20 [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0 [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0 [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20 [<ffffffff817930e2>] device_release+0x32/0xa0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff817934d7>] put_device+0x17/0x20 [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0 [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40 [<ffffffff817d1257>] sdev_store_delete+0x27/0x30 [<ffffffff81792ca8>] dev_attr_store+0x18/0x30 [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50 [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0 [<ffffffff811f69bd>] SyS_write+0x4d/0xc0 [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b 776687bc ("block, blk-mq: draining can't be skipped even if bypass_depth was non-zero") made it easier to trigger this bug by making blk_queue_bypass_start() drain even when it loses the first bypass test to blk_cleanup_queue(); however, the bug has always been there even before the commit as blk_queue_bypass_start() could race against queue destruction, win the initial bypass test but perform the actual draining after blk_cleanup_queue() already destroyed all blkgs. Fix it by skippping calling into policy draining if all the blkgs are already gone. Signed-off-by:
Tejun Heo <tj@kernel.org> Reported-by:
Shirish Pargaonkar <spargaonkar@suse.com> Reported-by:
Sasha Levin <sasha.levin@oracle.com> Reported-by:
Jet Chen <jet.chen@intel.com> Cc: stable@vger.kernel.org Tested-by:
Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com> bio: modify __bio_add_page() to accept pages that don't start a new segment The original behaviour is to refuse to add a new page if the maximum number of segments has been reached, regardless of the fact the page we are going to add can be merged into the last segment or not. Unfortunately, when the system runs under heavy memory fragmentation conditions, a driver may try to add multiple pages to the last segment. The original code won't accept them and EBUSY will be reported to userspace. This patch modifies the function so it refuses to add a page only in case the latter starts a new segment and the maximum number of segments has already been reached. The bug can be easily reproduced with the st driver: 1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16 2) modprobe st buffer_kbs=1024 3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10 dd: error writing `/dev/st0': Device or resource busy [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments] Signed-off-by:
Maurizio Lombardi <mlombard@redhat.com> Signed-off-by:
Ming Lei <ming.lei@canonical.com> Tested-by:
Dongsu Park <dongsu.park@profitbricks.com> Tested-by:
Jet Chen <jet.chen@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Jens Axboe <axboe@fb.com> block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved buffer in bytes as int type. The value needs to be capped at the request queue's max_sectors. But integer overflow is not correctly handled in the calculation when converting max_sectors from sectors to bytes. Signed-off-by:
Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@fb.com> block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX BLKSECTGET ioctl loads the request queue's max_sectors as unsigned short value to the argument pointer. So if the max_sector is greater than USHRT_MAX, the upper 16 bits of that is just discarded. In such case, USHRT_MAX is more preferable than the lower 16 bits of max_sectors. Signed-off-by:
Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@fb.com> block/partitions/efi.c: kerneldoc fixing Adding function documentation and fixing kerneldoc warnings ('field: description' uniformization). Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Jens Axboe <axboe@fb.com> block/partitions/msdos.c: code clean-up checkpatch fixing: WARNING: Missing a blank line after declarations WARNING: space prohibited between function name and open parenthesis '(' ERROR: spaces required around that '<' (ctx:VxV) Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Jens Axboe <axboe@fb.com> block/partitions/amiga.c: replace nolevel printk by pr_err Also add no prefix pr_fmt to avoid any future default format update Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Jens Axboe <axboe@fb.com> block/partitions/aix.c: replace count*size kzalloc by kcalloc kcalloc manages count*sizeof overflow. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
Jens Axboe <axboe@fb.com> bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload Commit 08778795 ("block: Fix nr_vecs for inline integrity vectors") from Martin introduces the function bip_integrity_vecs(get the useful vectors) to fix the issue about nr_vecs for inline integrity vectors that reported by David Milburn. But it seems that bip_integrity_vecs() will return the wrong number if the bio is not based on any bio_set for some reason(bio->bi_pool == NULL), because in that case, the bip_inline_vecs[0] is malloced directly. So here we add the bip_max_vcnt to record the count of vector slots, and cleanup the function bip_integrity_vecs(). Signed-off-by:
Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by:
Jens Axboe <axboe@fb.com> blk-mq: use percpu_ref for mq usage count Currently, blk-mq uses a percpu_counter to keep track of how many usages are in flight. The percpu_counter is drained while freezing to ensure that no usage is left in-flight after freezing is complete. blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this per-cpu gating mechanism. This type of code has relatively high chance of subtle bugs which are extremely difficult to trigger and it's way too hairy to be open coded in blk-mq. percpu_ref can serve the same purpose after the recent changes. This patch replaces the open-coded per-cpu usage counting and draining mechanism with percpu_ref. blk_mq_queue_enter() performs tryget_live on the ref and exit() performs put. blk_mq_freeze_queue() kills the ref and waits until the reference count reaches zero. blk_mq_unfreeze_queue() revives the ref and wakes up the waiters. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by:
Jens Axboe <axboe@fb.com> blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue() Keeping __blk_mq_drain_queue() as a separate function doesn't buy us anything and it's gonna be further simplified. Let's flatten it into its caller. This patch doesn't make any functional change. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by:
Jens Axboe <axboe@fb.com> blk-mq: decouble blk-mq freezing from generic bypassing blk_mq freezing is entangled with generic bypassing which bypasses blkcg and io scheduler and lets IO requests fall through the block layer to the drivers in FIFO order. This allows forward progress on IOs with the advanced features disabled so that those features can be configured or altered without worrying about stalling IO which may lead to deadlock through memory allocation. However, generic bypassing doesn't quite fit blk-mq. blk-mq currently doesn't make use of blkcg or ioscheds and it maps bypssing to freezing, which blocks request processing and drains all the in-flight ones. This causes problems as bypassing assumes that request processing is online. blk-mq works around this by conditionally allowing request processing for the problem case - during queue initialization. Another weirdity is that except for during queue cleanup, bypassing started on the generic side prevents blk-mq from processing new requests but doesn't drain the in-flight ones. This shouldn't break anything but again highlights that something isn't quite right here. The root cause is conflating blk-mq freezing and generic bypassing which are two different mechanisms. The only intersecting purpose that they serve is during queue cleanup. Let's properly separate blk-mq freezing from generic bypassing and simply use it where necessary. * request_queue->mq_freeze_depth is added and blk_mq_[un]freeze_queue() now operate on this counter instead of ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added but the counter is tested directly. This will be further updated by later changes. * blk_mq_drain_queue() is dropped and "__" prefix is dropped from blk_mq_freeze_queue(). Queue cleanup path now calls blk_mq_freeze_queue() directly. * blk_queue_enter()'s fast path condition is simplified to simply check @q->mq_freeze_depth. Previously, the condition was !blk_queue_dying(q) && (!blk_queue_bypass(q) || !blk_queue_init_done(q)) mq_freeze_depth is incremented right after dying is set and blk_queue_init_done() exception isn't necessary as blk-mq doesn't start frozen, which only leaves the blk_queue_bypass() test which can be replaced by @q->mq_freeze_depth test. This change simplifies the code and reduces confusion in the area. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by:
Jens Axboe <axboe@fb.com> block, blk-mq: draining can't be skipped even if bypass_depth was non-zero Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue() skip queue draining if bypass_depth was already above zero. The assumption is that the one which bumped the bypass_depth should have performed draining already; however, there's nothing which prevents a new instance of bypassing/freezing from starting before the previous one finishes draining. The current code may allow the later bypassing/freezing instances to complete while there still are in-flight requests which haven't finished draining. Fix it by draining regardless of bypass_depth. We still skip draining from blk_queue_bypass_start() while the queue is initializing to avoid introducing excessive delays during boot. INIT_DONE setting is moved above the initial blk_queue_bypass_end() so that bypassing attempts can't slip inbetween. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by:
Jens Axboe <axboe@fb.com> blk-mq: fix a memory ordering bug in blk_mq_queue_enter() blk-mq uses a percpu_counter to keep track of how many usages are in flight. The percpu_counter is drained while freezing to ensure that no usage is left in-flight after freezing is complete. blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this per-cpu gating mechanism; unfortunately, it contains a subtle bug - smp_wmb() in blk_mq_queue_enter() doesn't prevent prevent the cpu from fetching @q->bypass_depth before incrementing @q->mq_usage_counter and if freezing happens inbetween the caller can slip through and freezing can be complete while there are active users. Use smp_mb() instead so that bypass_depth and mq_usage_counter modifications and tests are properly interlocked. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by:
Jens Axboe <axboe@fb.com> Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.17/core Merge the percpu_ref changes from Tejun, he says they are stable now. percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero() Now that explicit invocation of percpu_ref_exit() is necessary to free the percpu counter, we can implement percpu_ref_reinit() which reinitializes a released percpu_ref. This can be used implement scalable gating switch which can be drained and then re-opened without worrying about memory allocation failures. percpu_ref_is_zero() is added to be used in a sanity check in percpu_ref_exit(). As this function will be useful for other purposes too, make it a public interface. v2: Use smp_read_barrier_depends() instead of smp_load_acquire(). We only need data dep barrier and smp_load_acquire() is stronger and heavier on some archs. Spotted by Lai Jiangshan. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> percpu-refcount: require percpu_ref to be exited explicitly Currently, a percpu_ref undoes percpu_ref_init() automatically by freeing the allocated percpu area when the percpu_ref is killed. While seemingly convenient, this has the following niggles. * It's impossible to re-init a released reference counter without going through re-allocation. * In the similar vein, it's impossible to initialize a percpu_ref count with static percpu variables. * We need and have an explicit destructor anyway for failure paths - percpu_ref_cancel_init(). This patch removes the automatic percpu counter freeing in percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a generic destructor now named percpu_ref_exit(). percpu_ref_destroy() is considered but it gets confusing with percpu_ref_kill() while "exit" clearly indicates that it's the counterpart of percpu_ref_init(). All percpu_ref_cancel_init() users are updated to invoke percpu_ref_exit() instead and explicit percpu_ref_exit() calls are added to the destruction path of all percpu_ref users. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Li Zefan <lizefan@huawei.com> percpu-refcount: use unsigned long for pcpu_count pointer percpu_ref->pcpu_count is a percpu pointer with a status flag in its lowest bit. As such, it always goes through arithmetic operations which is very cumbersome to do on a pointer. It has to be first casted to unsigned long and then back. Let's just make the field unsigned long so that we can skip the first casts. While at it, rename it to pcpu_counter_ptr to clarify that it's a pointer value. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> percpu-refcount: add helpers for ->percpu_count accesses * All four percpu_ref_*() operations implemented in the header file perform the same operation to determine whether the percpu_ref is alive and extract the percpu pointer. Factor out the common logic into __pcpu_ref_alive(). This doesn't change the generated code. * There are a couple places in percpu-refcount.c which masks out PCPU_REF_DEAD to obtain the percpu pointer. Factor it out into pcpu_count_ptr(). * The above changes make the WARN_ON_ONCE() conditional at the top of percpu_ref_kill_and_confirm() the only user of REF_STATUS(). Test PCPU_REF_DEAD directly and remove REF_STATUS(). This patch doesn't introduce any functional change. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> percpu-refcount: one bit is enough for REF_STATUS percpu-refcount currently reserves two lowest bits of its percpu pointer to indicate its state; however, only one bit is used for PCPU_REF_DEAD. Simplify it by removing PCPU_STATUS_BITS/MASK and testing PCPU_REF_DEAD directly. This also allows the compiler to choose a more efficient instruction depending on the architecture. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() ioctx_alloc() reaches inside percpu_ref and directly frees ->pcpu_count in its failure path, which is quite gross. percpu_ref has been providing a proper interface to do this, percpu_ref_cancel_init(), for quite some time now. Let's use that instead. This patch doesn't introduce any behavior changes. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com> workqueue: stronger test in process_one_work() After the recent changes, when POOL_DISASSOCIATED is cleared, the running worker's local CPU should be the same as pool->cpu without any exception even during cpu-hotplug. Update the sanity check in process_one_work() accordingly. This patch changes "(proposition_A && proposition_B && proposition_C)" to "(proposition_B && proposition_C)", so if the old compound proposition is true, the new one must be true too. so this will not hide any possible bug which can be caught by the old test. tj: Minor updates to the description. CC: Jason J. Herne <jjherne@linux.vnet.ibm.com> CC: Sasha Levin <sasha.levin@oracle.com> Signed-off-by:
Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by:
Tejun Heo <tj@kernel.org> workqueue: clear POOL_DISASSOCIATED in rebind_workers() The commit a9ab775b ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") moved the pool->lock into rebind_workers() without also moving "pool->flags &= ~POOL_DISASSOCIATED". There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not being moved together, but there isn't any benefit either. We move it into rebind_workers() and achieve these benefits: 1) Better readability. POOL_DISASSOCIATED is cleared in rebind_workers() as expected. 2) When POOL_DISASSOCIATED is cleared, we can ensure that all the running workers of the pool are on the local CPU (pool->cpu). tj: Cosmetic updates to the code and description. Signed-off-by:
Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by:
Tejun Heo <tj@kernel.org> percpu: Use ALIGN macro instead of hand coding alignment calculation Signed-off-by:
Christoph Lameter <cl@linux.com> Signed-off-by:
Tejun Heo <tj@kernel.org> percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations __verify_pcpu_ptr() is used to verify that a specified parameter is actually an percpu pointer by percpu accessor and operation implementations. Currently, where it's called isn't clearly defined and we just ensure that it's invoked at least once for all accessors and operations. The lack of clarity on when it should be called isn't nice and given that this is a completely generic issue, there's no reason to make archs worry about it. This patch updates __verify_pcpu_ptr() invocations such that it's always invoked from the final generic wrapper once per access or operation. As this is already the case for {raw|this}_cpu_*() definitions through __pcpu_size_*(), only the {raw|per|this}_cpu_ptr() accessors need to be updated. This change makes it unnecessary for archs to worry about __verify_pcpu_ptr(). x86's arch_raw_cpu_ptr() is updated accordingly. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> percpu: preffity percpu header files percpu macros are difficult to read. It's partly because they're fairly complex but also because they simply lack visual and conventional consistency to an unusual degree. The preceding patches tried to organize macro definitions consistently by their roles. This patch makes the following cosmetic changes to improve overall readability. * Use consistent convention for multi-line macro definitions - "do {" or "({" are now put on their own lines and the line continuing '\' are all put on the same column. * Temp variables used inside macro are consistently given "__" prefix. * When a macro argument is passed to another macro or a function, putting extra parenthses around it doesn't help anything. Don't put them. * _this_cpu_generic_*() are renamed to this_cpu_generic_*() so that they're consistent with raw_cpu_generic_*(). * Reorganize raw_cpu_*() and this_cpu_*() definitions so that trivial wrappers are collected in one place after actual operation definitions. * Other misc cleanups including reorganizing comments. All changes in this patch are cosmetic and cause no functional difference. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: use raw_cpu_*() to define __this_cpu_*() __this_cpu_*() operations are the same as raw_cpu_*() operations except for the added __this_cpu_preempt_check(). Curiously, these were defined using __pcu_size_call_*() instead of being layered on top of raw_cpu_*(). Let's layer them so that __this_cpu_*() are defined in terms of raw_cpu_*(). It's simpler and less error-prone this way. This patch doesn't introduce any functional difference. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: reorder macros in percpu header files * In include/asm-generic/percpu.h, collect {raw|_this}_cpu_generic*() macros into one place. They were dispersed through {raw|this}_cpu_*_N() definitions and the visiual inconsistency was making following the code unnecessarily difficult. * In include/linux/percpu-defs.h, move __verify_pcpu_ptr() later in the file so that it's right above accessor definitions where it's actually used. This is pure reorganization. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h We're in the process of moving all percpu accessors and operations to include/linux/percpu-defs.h so that they're available to arch headers without having to include full include/linux/percpu.h which may cause cyclic inclusion dependency. This patch moves {raw|this}_cpu_*() definitions from include/linux/percpu.h to include/linux/percpu-defs.h. The code is moved mostly verbatim; however, raw_cpu_*() are placed above this_cpu_*() which is more conventional as the raw operations may be used to defined other variants. This is pure reorganization. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h {raw|this}_cpu_*_N() operations are expected to be provided by archs and the generic definitions are provided as fallbacks. As such, these firmly belong to include/asm-generic/percpu.h. Move the generic definitions to include/asm-generic/percpu.h. The code is moved mostly verbatim; however, raw_cpu_*_N() are placed above this_cpu_*_N() which is more conventional as the raw operations may be used to defined other variants. This is pure reorganization. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops Currently, percpu allows two separate methods for overriding {raw|this}_cpu_*() ops - for a given operation, an arch can provide whole replacement or sized sub operations to override specific parts of it. e.g. arch either can provide this_cpu_add() or this_cpu_add_4() to override only the 4 byte operation. While quite flexible on a glance, the dual-overriding scheme complicates the code path for no actual gain. It compilcates the already complex operation definitions and if an arch wants to override all sizes, it can easily provide all variants anyway. In fact, no arch is actually making use of whole operation override. Another oddity is that __this_cpu_*() operations are defined in the same way as raw_cpu_*() but ignores full overrides of the raw_cpu_*() and doesn't allow full operation override, so if an arch provides whole overrides for raw_cpu_*() operations __this_cpu_*() ends up using the generic implementations. More importantly, it takes away the layering between arch-specific and generic parts making it impossible for the generic part to implement arch-independent features on top of arch-specific overrides. This patch removes the support for whole operation overrides. As no arch is using it, this doesn't cause any actual difference. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: reorganize include/linux/percpu-defs.h Reorganize for better readability. * Accessor definitions are collected into one place and SMP and UP now define them in the same order. * Definitions are layered when possible - e.g. per_cpu() is now defined in terms of this_cpu_ptr(). * Rather pointless comment dropped. * per_cpu(), __raw_get_cpu_var() and __get_cpu_var() are defined in a way which can be shared between SMP and UP and moved out of CONFIG_SMP blocks. This patch doesn't introduce any functional difference. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> percpu: move accessors from include/linux/percpu.h to percpu-defs.h include/linux/percpu-defs.h is gonna host all accessors and operations so that arch headers can make use of them too without worrying about circular dependency through include/linux/percpu.h. This patch moves the following accessors from include/linux/percpu.h to include/linux/percpu-defs.h. * get/put_cpu_var() * get/put_cpu_ptr() * per_cpu_ptr() This is pure reorgniazation. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: include/asm-generic/percpu.h should contain only arch-overridable parts The roles of the various percpu header files has become unclear. There are four header files involved. include/linux/percpu-defs.h include/linux/percpu.h include/asm-generic/percpu.h arch/*/include/asm/percpu.h The original intention for include/asm-generic/percpu.h is providing generic definitions for arch-overridable parts; however, it now hosts various stuff which can't be overridden by archs. Also, include/linux/percpu-defs.h was initially added to contain section and percpu variable definition macros so that arch header files can make use of them without worrying about introducing cyclic inclusion dependency by including include/linux/percpu.h; however, arch headers sometimes need to access percpu variables too and this is one of the reasons why some accessors were implemented in include/linux/asm-generic/percpu.h. Let's clear up the situation by making include/asm-generic/percpu.h contain only arch-overridable parts and moving accessors and operations into include/linux/percpu-defs. Note that this patch only moves things from include/asm-generic/percpu.h. include/linux/percpu.h will be taken care of by later patches. This patch moves the followings. * SHIFT_PERCPU_PTR() / VERIFY_PERCPU_PTR() * per_cpu() * raw_cpu_ptr() * this_cpu_ptr() * __get_cpu_var() * __raw_get_cpu_var() * __this_cpu_ptr() * PER_CPU_[SHARED_]ALIGNED_SECTION * PER_CPU_[SHARED_]ALIGNED_SECTION * PER_CPU_FIRST_SECTION This patch is pure reorganization. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> percpu: introduce arch_raw_cpu_ptr() Currently, archs can override raw_cpu_ptr() directly; however, we wanna build a layer of indirection in the generic part of percpu so that we can implement generic features there without affecting archs. Introduce arch_raw_cpu_ptr() which is used to define raw_cpu_ptr() by generic percpu code. The two are identical for now. x86 is currently the only arch which overrides raw_cpu_ptr() and is converted to define arch_raw_cpu_ptr() instead. This doesn't introduce any functional difference. Signed-off-by:
Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> percpu: disallow archs from overriding SHIFT_PERCPU_PTR() It has been about half a decade since all archs started using the dynamic percpu allocator and thus the same SHIFT_PERCPU_PTR() implementation. There's no benefit in overriding SHIFT_PERCPU_PTR() anymore. Remove #ifndef around it to clarify that this is identical regardless of the arch. This patch doesn't cause any functional difference. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Christoph Lameter <cl@linux.com> (cherry picked from commit 2a1b4cf2 0b462c89) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
NeilBrown authored
Currently we don't abort recovery on a write error if the write error to the recovering device was triggerd by normal IO (as opposed to recovery IO). This means that for one bitmap region, the recovery might write to the recovering device for a few sectors, then not bother for subsequent sectors (as it never writes to failed devices). In this case the bitmap bit will be cleared, but it really shouldn't. The result is that if the recovering device fails and is then re-added (after fixing whatever hardware problem triggerred the failure), the second recovery won't redo the region it was in the middle of, so some of the device will not be recovered properly. If we abort the recovery, the region being processes will be cancelled (bit not cleared) and the whole region will be retried. As the bug can result in data corruption the patch is suitable for -stable. For kernels prior to 3.11 there is a conflict in raid10.c which will require care. Original-from: jiao hui <jiaohui@bwstor.com.cn> Reported-and-tested-by:
jiao hui <jiaohui@bwstor.com.cn> Signed-off-by:
NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org (cherry picked from commit 2446dba0) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Eric W. Biederman authored
While invesgiating the issue where in "mount --bind -oremount,ro ..." would result in later "mount --bind -oremount,rw" succeeding even if the mount started off locked I realized that there are several additional mount flags that should be locked and are not. In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime flags in addition to MNT_READONLY should all be locked. These flags are all per superblock, can all be changed with MS_BIND, and should not be changable if set by a more privileged user. The following additions to the current logic are added in this patch. - nosuid may not be clearable by a less privileged user. - nodev may not be clearable by a less privielged user. - noexec may not be clearable by a less privileged user. - atime flags may not be changeable by a less privileged user. The logic with atime is that always setting atime on access is a global policy and backup software and auditing software could break if atime bits are not updated (when they are configured to be updated), and serious performance degradation could result (DOS attack) if atime updates happen when they have been explicitly disabled. Therefore an unprivileged user should not be able to mess with the atime bits set by a more privileged user. The additional restrictions are implemented with the addition of MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME mnt flags. Taken together these changes and the fixes for MNT_LOCK_READONLY should make it safe for an unprivileged user to create a user namespace and to call "mount --bind -o remount,... ..." without the danger of mount flags being changed maliciously. Cc: stable@vger.kernel.org Acked-by:
Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com> (cherry picked from commit 9566d674) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Eric W. Biederman authored
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a read-only bind mount read-only in a user namespace the MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user to the remount a read-only mount read-write. Correct this by replacing the mask of mount flags to preserve with a mask of mount flags that may be changed, and preserve all others. This ensures that any future bugs with this mask and remount will fail in an easy to detect way where new mount flags simply won't change. Cc: stable@vger.kernel.org Acked-by:
Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by:
"Eric W. Biederman" <ebiederm@xmission.com> (cherry picked from commit a6138db8) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Ralf Baechle authored
This fixes the following issue BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/1761 caller is blast_dcache32+0x30/0x254 Call Trace: [<8047f02c>] dump_stack+0x8/0x34 [<802e7e40>] debug_smp_processor_id+0xe0/0xf0 [<80114d94>] blast_dcache32+0x30/0x254 [<80118484>] r4k_dma_cache_wback_inv+0x200/0x288 [<80110ff0>] mips_dma_map_sg+0x108/0x180 [<80355098>] ide_dma_prepare+0xf0/0x1b8 [<8034eaa4>] do_rw_taskfile+0x1e8/0x33c [<8035951c>] ide_do_rw_disk+0x298/0x3e4 [<8034a3c4>] do_ide_request+0x2e0/0x704 [<802bb0dc>] __blk_run_queue+0x44/0x64 [<802be000>] queue_unplugged.isra.36+0x1c/0x54 [<802beb94>] blk_flush_plug_list+0x18c/0x24c [<802bec6c>] blk_finish_plug+0x18/0x48 [<8026554c>] journal_commit_transaction+0x3b8/0x151c [<80269648>] kjournald+0xec/0x238 [<8014ac00>] kthread+0xb8/0xc0 [<8010268c>] ret_from_kernel_thread+0x14/0x1c Caches in most systems are identical - but not always, so we can't avoid the use of smp_call_function() by just looking at the boot CPU's data, have to fiddle with preemption instead. Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/5835 (cherry picked from commit ff522058) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Aaro Koskinen authored
get_system_type() is not thread-safe on OCTEON. It uses static data, also more dangerous issue is that it's calling cvmx_fuse_read_byte() every time without any synchronization. Currently it's possible to get processes stuck looping forever in kernel simply by launching multiple readers of /proc/cpuinfo: (while true; do cat /proc/cpuinfo > /dev/null; done) & (while true; do cat /proc/cpuinfo > /dev/null; done) & ... Fix by initializing the system type string only once during the early boot. Signed-off-by:
Aaro Koskinen <aaro.koskinen@nsn.com> Cc: stable@vger.kernel.org Reviewed-by:
Markos Chandras <markos.chandras@imgtec.com> Patchwork: http://patchwork.linux-mips.org/patch/7437/Signed-off-by:
James Hogan <james.hogan@imgtec.com> MIPS: CPS: Initialize EVA before bringing up VPEs from secondary cores The CPS code is doing several memory loads when configuring the VPEs from secondary cores, so the segmentation control registers must be initialized in time otherwise the kernel will crash with strange TLB exceptions. Reviewed-by:
Paul Burton <paul.burton@imgtec.com> Signed-off-by:
Markos Chandras <markos.chandras@imgtec.com> Patchwork: http://patchwork.linux-mips.org/patch/7424/Signed-off-by:
James Hogan <james.hogan@imgtec.com> MIPS: Malta: EVA: Rename 'eva_entry' to 'platform_eva_init' Rename 'eva_entry' to 'platform_eva_init' as required by the new 'eva_init' macro in the eva.h header. Since this macro is now used in a platform dependent way, it must not depend on its caller so move the t1 register initialization inside this macro. Also set the .reorder assembler option in case the caller may have previously set .noreorder. This may allow a few assembler optimizations. Finally include missing headers and document the register usage for this macro. Reviewed-by:
Paul Burton <paul.burton@imgtec.com> Signed-off-by:
Markos Chandras <markos.chandras@imgtec.com> Patchwork: http://patchwork.linux-mips.org/patch/7423/Signed-off-by:
James Hogan <james.hogan@imgtec.com> MIPS: EVA: Add new EVA header Generic code may need to perform certain operations when EVA is enabled, for example, configure the segmentation registers during boot. In order to avoid using more CONFIG_EVA ifdefs in the arch code, such functions will be added in this header instead. Initially this header contains a macro which will be used by generic code later on during VPEs configuration on secondary cores. All it does is to call the platform specific EVA init code in case EVA is enabled. Reviewed-by:
Paul Burton <paul.burton@imgtec.com> Signed-off-by:
Markos Chandras <markos.chandras@imgtec.com> Patchwork: http://patchwork.linux-mips.org/patch/7422/Signed-off-by:
James Hogan <james.hogan@imgtec.com> MIPS: scall64-o32: Fix indirect syscall detection Commit 4c21b8fd (MIPS: seccomp: Handle indirect system calls (o32)) added indirect syscall detection for O32 processes running on MIPS64 but it did not work as expected. The reason is the the scall64-o32 implementation differs compared to scall32-o32. In the former, the v0 (syscall number) register contains the absolute syscall number (4000 + X) whereas in the latter it contains the relative syscall number (X). Fix the code to avoid doing an extra addition, and load the v0 register directly to the first argument for syscall_trace_enter. Moreover, set the .reorder assembler option in order to have better control on this part of the assembly code. Signed-off-by:
Markos Chandras <markos.chandras@imgtec.com> Patchwork: http://patchwork.linux-mips.org/patch/7481/ Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by:
James Hogan <james.hogan@imgtec.com> MIPS: syscall: Fix AUDIT value for O32 processes on MIPS64 On MIPS64, O32 processes set both TIF_32BIT_ADDR and TIF_32BIT_REGS so the previous condition treated O32 applications as N32 when evaluating seccomp filters. Fix the condition to check both TIF_32BIT_{REGS, ADDR} for the N32 AUDIT flag. Signed-off-by:
Markos Chandras <markos.chandras@imgtec.com> Patchwork: http://patchwork.linux-mips.org/patch/7480/ Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by:
James Hogan <james.hogan@imgtec.com> MIPS: Loongson: Fix COP2 usage for preemptible kernel In preemptible kernel, only TIF_USEDFPU flag is reliable to distinguish whether _init_fpu()/_restore_fp() is needed. Because the value of the CP0_Status.CU1 isn't changed during preemption. V2: Fix coding style. Signed-off-by:
Huacai Chen <chenhc@lemote.com> Cc: John Crispin <john@phrozen.org> Cc: Steven J. Hill <Steven.Hill@imgtec.com> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: linux-mips@linux-mips.org Cc: Fuxin Zhang <zhangfx@lemote.com> Cc: Zhangjin Wu <wuzhangjin@gmail.com> Patchwork: https://patchwork.linux-mips.org/patch/7515/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: NL: Fix nlm_xlp_defconfig build error The nlm_xlp_defconfig build fails with ./arch/mips/include/asm/mach-netlogic/topology.h:15:0: error: "topology_core_id" redefined [-Werror] In file included from include/linux/smp.h:59:0, [ ...] from arch/mips/mm/dma-default.c:12: ./arch/mips/include/asm/smp.h:41:0: note: this is the location of the previous definition and similar errors. This is caused by commit bda4584c ("MIPS: Support CPU topology files in sysfs") which adds the defines to arch/mips/include/asm/smp.h. Remove the defines from arch/mips/include/asm/mach-netlogic/topology.h as no longer necessary. Signed-off-by:
Guenter Roeck <linux@roeck-us.net> Cc: Huacai Chen <chenhc@lemote.com> Cc: Andreas Herrmann <andreas.herrmann@caviumnetworks.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/7513/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: Remove race window in page fault handling Multicore MIPSes without I/D hardware coherency suffered from a race condition in the page fault handler. The page table entry was published before any pending lazy D-cache flush was committed, hence it allowed execution of stale page cache data by other VPEs in the system. To make the cache handling safe we need to perform flushing already in the set_pte_at function. MIPSes without coherent I-caches can get a small increase in flushes due to the unavailability of the execute flag in set_pte_at. [ralf@linux-mips.org: outlining set_pte_at() saves a good k in a test build, so I moved its definition from pgtable.h to cache.c.] Signed-off-by:
Lars Persson <larper@axis.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/7511/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: Malta: Improve system memory detection for '{e, }memsize' >= 2G Using kstrtol to parse the "{e,}memsize" variables was wrong because this parses signed long numbers. In case of '{e,}memsize' >= 2G, the top bit is set, resulting to -ERANGE errors and possibly random system memory boundaries. We fix this by replacing "kstrtol" with "kstrtoul". We also improve the code to check the kstrtoul return value and print a warning if an error was returned. Signed-off-by:
Markos Chandras <markos.chandras@imgtec.com> Cc: <stable@vger.kernel.org> # v3.15+ Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/7543/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: Alchemy: Fix db1200 PSC clock enablement Enable PSC0 (I2C/SPI) clock and leave PSC1 (Audio) alone. This patch restores functionality to both Audio and I2C/SPI. Signed-off-by:
Manuel Lauss <manuel.lauss@gmail.com> Cc: Linux-MIPS <linux-mips@linux-mips.org> Patchwork: https://patchwork.linux-mips.org/patch/7544/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: BCM47XX: Fix reboot problem on BCM4705/BCM4785 This adds some code based on code from the Broadcom GPL tar to fix the reboot problems on BCM4705/BCM4785. I tried rebooting my device for ~10 times and have never seen a problem. This reverts the changes in the previous commit and adds the real fix as suggested by Rafał. Setting bit 22 in Reg 22, sel 4 puts the BIU (Bus Interface Unit) into async mode. The previous commit was 316cad5c [MIPS: BCM47XX: make reboot more relaiable] Signed-off-by:
Hauke Mehrtens <hauke@hauke-m.de> Cc: jogo@openwrt.org Cc: zajec5@gmail.com Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/7545/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: Remove duplicated include from numa.c Signed-off-by:
Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: Huacai Chen <chenhc@lemote.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/7537/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: Add common plat_irq_dispatch declaration Add common declaration to get rid of following sparse warning: "symbol 'plat_irq_dispatch' was not declared. Should it be static?" Signed-off-by:
Sergey Ryazanov <ryazanov.s.a@gmail.com> Cc: Linux MIPS <linux-mips@linux-mips.org> Patchwork: https://patchwork.linux-mips.org/patch/7539/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: MSP71xx: remove unused plat_irq_dispatch() argument Remove unused argument to make the plat_irq_dispatch() function declaration similar to the realization of other platforms. Signed-off-by:
Sergey Ryazanov <ryazanov.s.a@gmail.com> Cc: Linux MIPS <linux-mips@linux-mips.org> Patchwork: https://patchwork.linux-mips.org/patch/7538/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: GIC: Remove useless parens from GICBIS(). Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: perf: Mark pmu interupt IRQF_NO_THREAD In RT kernel, I ran into the following calltrace, so PMU interrupts cannot be threaded in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/0 INFO: lockdep is turned off. Call Trace: [<ffffffff8088595c>] dump_stack+0x1c/0x50 [<ffffffff801a958c>] __might_sleep+0x13c/0x148 [<ffffffff80891c54>] rt_spin_lock+0x3c/0xb0 [<ffffffff801ad29c>] __wake_up+0x3c/0x80 [<ffffffff80243ba4>] perf_event_wakeup+0x8c/0xf8 [<ffffffff80243c50>] perf_pending_event+0x40/0x78 [<ffffffff8023d88c>] irq_work_run+0x74/0xc0 [<ffffffff80152640>] mipsxx_pmu_handle_shared_irq+0x110/0x228 [<ffffffff8015276c>] mipsxx_pmu_handle_irq+0x14/0x30 [<ffffffff801ffda4>] handle_irq_event_percpu+0xbc/0x470 [<ffffffff80204478>] handle_percpu_irq+0x98/0xc8 [<ffffffff801ff284>] generic_handle_irq+0x4c/0x68 [<ffffffff8089748c>] do_IRQ+0x2c/0x48 [<ffffffff80105864>] plat_irq_dispatch+0x64/0xd0 [ralf@linux-mips.org: I don't see why based on this register dump the handler should be marked IRQF_NO_THREAD - but the handler is manipulating per-CPU resources so we don't want it to be rescheduled to another CPU.] Signed-off-by:
Yang Wei <Wei.Yang@windriver.com> Cc: a.p.zijlstra@chello.nl Cc: paulus@samba.org Cc: mingo@redhat.com Cc: acme@kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/7506/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: kdump: Set correct value to kexec_indirection_page variable Since there is not indirection page in crash type, so the vaule of the head field of kimage structure is not equal to the address of indirection page but IND_DONE. so we have to set kexec_indirection_page variable to the address of the head field of image structure. [ralf@linux-mips.org: Don't add pointless empty line, fix trailing whitespace damage.] Signed-off-by:
Yang Wei <Wei.Yang@windriver.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/7499/Signed-off-by:
Ralf Baechle <ralf@linux-mips.org> MIPS: OCTEON: make get_system_type() thread-safe get_system_type() is not thread-safe on OCTEON. It uses static data, also more dangerous issue is that it's calling cvmx_fuse_read_byte() every time without any synchronization. Currently it's possible to get processes stuck looping forever in kernel simply by launching multiple readers of /proc/cpuinfo: (while true; do cat /proc/cpuinfo > /dev/null; done) & (while true; do cat /proc/cpuinfo > /dev/null; done) & ... Fix by initializing the system type string only once during the early boot. Signed-off-by:
Aaro Koskinen <aaro.koskinen@nsn.com> Cc: stable@vger.kernel.org Reviewed-by:
Markos Chandras <markos.chandras@imgtec.com> Patchwork: http://patchwork.linux-mips.org/patch/7437/Signed-off-by:
James Hogan <james.hogan@imgtec.com> (cherry picked from commit 33d9a530 60830868) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Jarkko Sakkinen authored
Regression in 41ab999c. Call to tpm_chip_put is missing. This will cause TPM device driver not to unload if tmp_get_random() is called. Cc: <stable@vger.kernel.org> # 3.7+ Signed-off-by:
Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by:
Peter Huewe <peterhuewe@gmx.de> (cherry picked from commit 3e14d83e) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Paolo Bonzini authored
This reverts commit 682367c4, which causes 32-bit SMP Windows 7 guests to panic. SeaBIOS has a limit on the number of MTRRs that it can handle, and this patch exceeded the limit. Better revert it. Thanks to Nadav Amit for debugging the cause. Cc: stable@nongnu.org Reported-by:
Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit 0d234daf) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
H. Peter Anvin authored
Make espfix64 a hidden Kconfig option. This fixes the x86-64 UML build which had broken due to the non-existence of init_espfix_bsp() in UML: since UML uses its own Kconfig, this option does not appear in the UML build. This also makes it possible to make support for 16-bit segments a configuration option, for the people who want to minimize the size of the kernel. Reported-by:
Ingo Molnar <mingo@kernel.org> Signed-off-by:
H. Peter Anvin <hpa@zytor.com> Cc: Richard Weinberger <richard@nod.at> Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com (cherry picked from commit 197725de) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Bart Van Assche authored
If scsi_remove_host() is invoked after a SCSI device has been blocked, if the fast_io_fail_tmo or dev_loss_tmo work gets scheduled on the workqueue executing srp_remove_work() and if an I/O request is scheduled after the SCSI device had been blocked by e.g. multipathd then the following deadlock can occur: kworker/6:1 D ffff880831f3c460 0 195 2 0x00000000 Call Trace: [<ffffffff814aafd9>] schedule+0x29/0x70 [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0 [<ffffffff8105af6f>] msleep+0x2f/0x40 [<ffffffff8123b0ae>] __blk_drain_queue+0x4e/0x180 [<ffffffff8123d2d5>] blk_cleanup_queue+0x225/0x230 [<ffffffffa0010732>] __scsi_remove_device+0x62/0xe0 [scsi_mod] [<ffffffffa000ed2f>] scsi_forget_host+0x6f/0x80 [scsi_mod] [<ffffffffa0002eba>] scsi_remove_host+0x7a/0x130 [scsi_mod] [<ffffffffa07cf5c5>] srp_remove_work+0x95/0x180 [ib_srp] [<ffffffff8106d7aa>] process_one_work+0x1ea/0x6c0 [<ffffffff8106dd9b>] worker_thread+0x11b/0x3a0 [<ffffffff810758bd>] kthread+0xed/0x110 [<ffffffff814b972c>] ret_from_fork+0x7c/0xb0 multipathd D ffff880096acc460 0 5340 1 0x00000000 Call Trace: [<ffffffff814aafd9>] schedule+0x29/0x70 [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0 [<ffffffff814ab79b>] io_schedule_timeout+0x9b/0xf0 [<ffffffff814abe1c>] wait_for_completion_io_timeout+0xdc/0x110 [<ffffffff81244b9b>] blk_execute_rq+0x9b/0x100 [<ffffffff8124f665>] sg_io+0x1a5/0x450 [<ffffffff8124fd21>] scsi_cmd_ioctl+0x2a1/0x430 [<ffffffff8124fef2>] scsi_cmd_blk_ioctl+0x42/0x50 [<ffffffffa00ec97e>] sd_ioctl+0xbe/0x140 [sd_mod] [<ffffffff8124bd04>] blkdev_ioctl+0x234/0x840 [<ffffffff811cb491>] block_ioctl+0x41/0x50 [<ffffffff811a0df0>] do_vfs_ioctl+0x300/0x520 [<ffffffff811a1051>] SyS_ioctl+0x41/0x80 [<ffffffff814b9962>] tracesys+0xd0/0xd5 Fix this by scheduling removal work on another workqueue than the transport layer timers. Signed-off-by:
Bart Van Assche <bvanassche@acm.org> Reviewed-by:
Sagi Grimberg <sagig@mellanox.com> Reviewed-by:
David Dillow <dave@thedillows.org> Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Roland Dreier <roland@purestorage.com> (cherry picked from commit bcc05910) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Jeff Moyer authored
Commit 05f1dd53 ("block: add queue flag for disabling SG merging") introduced a new queue flag: QUEUE_FLAG_NO_SG_MERGE. This gets set by default in blk_mq_init_queue for mq-enabled devices. The effect of the flag is to bypass the SG segment merging. Instead, the bio->bi_vcnt is used as the number of hardware segments. With a device mapper target on top of a device with QUEUE_FLAG_NO_SG_MERGE set, we can end up sending down more segments than a driver is prepared to handle. I ran into this when backporting the virtio_blk mq support. It triggerred this BUG_ON, in virtio_queue_rq: BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems); The queue's max is set here: blk_queue_max_segments(q, vblk->sg_elems-2); Basically, what happens is that a bio is built up for the dm device (which does not have the QUEUE_FLAG_NO_SG_MERGE flag set) using bio_add_page. That path will call into __blk_recalc_rq_segments, so what you end up with is bi_phys_segments being much smaller than bi_vcnt (and bi_vcnt grows beyond the maximum sg elements). Then, when the bio is submitted, it gets cloned. When the cloned bio is submitted, it will end up in blk_recount_segments, here: if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags)) bio->bi_phys_segments = bio->bi_vcnt; and now we've set bio->bi_phys_segments to a number that is beyond what was registered as queue_max_segments by the driver. The right way to fix this is to propagate the queue flag up the stack. The rules for propagating the flag are simple: - if the flag is set for any underlying device, it must be set for the upper device - consequently, if the flag is not set for any underlying device, it should not be set for the upper device. Signed-off-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.16+ (cherry picked from commit 200612ec) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Roger Quadros authored
commit 65b97cf6 introduced in v3.7 caused a regression by using a reversed CS_MASK thus causing omap_calculate_ecc to always fail. As the NAND base driver never checks for .calculate()'s return value, the zeroed ECC values are used as is without showing any error to the user. However, this won't work and the NAND device won't be guarded by any error code. Fix the issue by using the correct mask. Code was tested on omap3beagle using the following procedure - flash the primary bootloader (MLO) from the kernel to the first NAND partition using nandwrite. - boot the board from NAND. This utilizes OMAP ROM loader that relies on 1-bit Hamming code ECC. Fixes: 65b97cf6 (mtd: nand: omap2: handle nand on gpmc) Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by:
Roger Quadros <rogerq@ti.com> Signed-off-by:
Tony Lindgren <tony@atomide.com> (cherry picked from commit 40ddbf50) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Kevin Hao authored
I got the following panic on my fsl p5020ds board. Unable to handle kernel paging request for data at address 0x7375627379737465 Faulting instruction address: 0xc000000000100778 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=24 CoreNet Generic Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-next-20140613 #145 task: c0000000fe080000 ti: c0000000fe088000 task.ti: c0000000fe088000 NIP: c000000000100778 LR: c00000000010073c CTR: 0000000000000000 REGS: c0000000fe08aa00 TRAP: 0300 Not tainted (3.15.0-next-20140613) MSR: 0000000080029000 <CE,EE,ME> CR: 24ad2e24 XER: 00000000 DEAR: 7375627379737465 ESR: 0000000000000000 SOFTE: 1 GPR00: c0000000000c99b0 c0000000fe08ac80 c0000000009598e0 c0000000fe001d80 GPR04: 00000000000000d0 0000000000000913 c000000007902b20 0000000000000000 GPR08: c0000000feaae888 0000000000000000 0000000007091000 0000000000200200 GPR12: 0000000028ad2e28 c00000000fff4000 c0000000007abe08 0000000000000000 GPR16: c0000000007ab160 c0000000007aaf98 c00000000060ba68 c0000000007abda8 GPR20: c0000000007abde8 c0000000feaea6f8 c0000000feaea708 c0000000007abd10 GPR24: c000000000989370 c0000000008c6228 00000000000041ed c0000000fe00a400 GPR28: c00000000017c1cc 00000000000000d0 7375627379737465 c0000000fe001d80 NIP [c000000000100778] .__kmalloc_track_caller+0x70/0x168 LR [c00000000010073c] .__kmalloc_track_caller+0x34/0x168 Call Trace: [c0000000fe08ac80] [c00000000087e6b8] uevent_sock_list+0x0/0x10 (unreliable) [c0000000fe08ad20] [c0000000000c99b0] .kstrdup+0x44/0x90 [c0000000fe08adc0] [c00000000017c1cc] .__kernfs_new_node+0x4c/0x130 [c0000000fe08ae70] [c00000000017d7e4] .kernfs_new_node+0x2c/0x64 [c0000000fe08aef0] [c00000000017db00] .kernfs_create_dir_ns+0x34/0xc8 [c0000000fe08af80] [c00000000018067c] .sysfs_create_dir_ns+0x58/0xcc [c0000000fe08b010] [c0000000002c711c] .kobject_add_internal+0xc8/0x384 [c0000000fe08b0b0] [c0000000002c7644] .kobject_add+0x64/0xc8 [c0000000fe08b140] [c000000000355ebc] .device_add+0x11c/0x654 [c0000000fe08b200] [c0000000002b5988] .add_disk+0x20c/0x4b4 [c0000000fe08b2c0] [c0000000003a21d4] .add_mtd_blktrans_dev+0x340/0x514 [c0000000fe08b350] [c0000000003a3410] .mtdblock_add_mtd+0x74/0xb4 [c0000000fe08b3e0] [c0000000003a32cc] .blktrans_notify_add+0x64/0x94 [c0000000fe08b470] [c00000000039b5b4] .add_mtd_device+0x1d4/0x368 [c0000000fe08b520] [c00000000039b830] .mtd_device_parse_register+0xe8/0x104 [c0000000fe08b5c0] [c0000000003b8408] .of_flash_probe+0x72c/0x734 [c0000000fe08b750] [c00000000035ba40] .platform_drv_probe+0x38/0x84 [c0000000fe08b7d0] [c0000000003599a4] .really_probe+0xa4/0x29c [c0000000fe08b870] [c000000000359d3c] .__driver_attach+0x100/0x104 [c0000000fe08b900] [c00000000035746c] .bus_for_each_dev+0x84/0xe4 [c0000000fe08b9a0] [c0000000003593c0] .driver_attach+0x24/0x38 [c0000000fe08ba10] [c000000000358f24] .bus_add_driver+0x1c8/0x2ac [c0000000fe08bab0] [c00000000035a3a4] .driver_register+0x8c/0x158 [c0000000fe08bb30] [c00000000035b9f4] .__platform_driver_register+0x6c/0x80 [c0000000fe08bba0] [c00000000084e080] .of_flash_driver_init+0x1c/0x30 [c0000000fe08bc10] [c000000000001864] .do_one_initcall+0xbc/0x238 [c0000000fe08bd00] [c00000000082cdc0] .kernel_init_freeable+0x188/0x268 [c0000000fe08bdb0] [c0000000000020a0] .kernel_init+0x1c/0xf7c [c0000000fe08be30] [c000000000000884] .ret_from_kernel_thread+0x58/0xd4 Instruction dump: 41bd0010 480000c8 4bf04eb5 60000000 e94d0028 e93f0000 7cc95214 e8a60008 7fc9502a 2fbe0000 419e00c8 e93f0022 <7f7e482a> 39200000 88ed06b2 992d06b2 ---[ end trace b4c9a94804a42d40 ]--- It seems that the corrupted partition header on my mtd device triggers a bug in the ftl. In function build_maps() it will allocate the buffers needed by the mtd partition, but if something goes wrong such as kmalloc failure, mtd read error or invalid partition header parameter, it will free all allocated buffers and then return non-zero. In my case, it seems that partition header parameter 'NumTransferUnits' is invalid. And the ftl_freepart() is a function which free all the partition buffers allocated by build_maps(). Given the build_maps() is a self cleaning function, so there is no need to invoke this function even if build_maps() return with error. Otherwise it will causes the buffers to be freed twice and then weird things would happen. Cc: stable@vger.kernel.org Signed-off-by:
Kevin Hao <haokexin@gmail.com> Signed-off-by:
Brian Norris <computersforpeace@gmail.com> (cherry picked from commit a152056c) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Pavel Shilovsky authored
The existing code calls server->ops->close() that is not right. This causes XFS test generic/310 to fail. Fix this by using server->ops->closedir() function. Cc: <stable@vger.kernel.org> # v3.7+ Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by:
Steve French <smfrench@gmail.com> (cherry picked from commit f736906a) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Pavel Shilovsky authored
The existing code uses the old MAX_NAME constant. This causes XFS test generic/013 to fail. Fix it by replacing MAX_NAME with PATH_MAX that SMB1 uses. Also remove an unused MAX_NAME constant definition. Cc: <stable@vger.kernel.org> # v3.7+ Signed-off-by:
Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by:
Steve French <smfrench@gmail.com> (cherry picked from commit 1bbe4997) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Pavel Shilovsky authored
CIFS servers process nlink counts differently for files and directories. In cifs_rename() if we the request fails on the existing target, we try to remove it through cifs_unlink() but this is not what we want to do for directories. As the result the following sequence of commands mkdir {1,2}; mv -T 1 2; rmdir {1,2}; mkdir {1,2}; echo foo > 2/bar and XFS test generic/023 fail with -ENOENT error. That's why the second mkdir reuses the existing inode (target inode of the mv -T command) with S_DEAD flag. Fix this by checking whether the target is directory or not and calling cifs_rmdir() rather than cifs_unlink() for directories. Cc: <stable@vger.kernel.org> Signed-off-by:
Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by:
Steve French <smfrench@gmail.com> (cherry picked from commit a07d3220) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Pavel Shilovsky authored
When we requests rename we also need to update attributes of both source and target parent directories. Not doing it causes generic/309 xfstest to fail on SMB2 mounts. Fix this by marking these directories for force revalidating. Cc: <stable@vger.kernel.org> Signed-off-by:
Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by:
Steve French <smfrench@gmail.com> (cherry picked from commit b46799a8) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Pavel Shilovsky authored
If we get into read_into_pages() from cifs_readv_receive() and then loose a network, we issue cifs_reconnect that moves all mids to a private list and issue their callbacks. The callback of the async read request sets a mid to retry, frees it and wakes up a process that waits on the rdata completion. After the connection is established we return from read_into_pages() with a short read, use the mid that was freed before and try to read the remaining data from the a newly created socket. Both actions are not what we want to do. In reconnect cases (-EAGAIN) we should not mask off the error with a short read but should return the error code instead. Acked-by:
Jeff Layton <jlayton@samba.org> Cc: stable@vger.kernel.org Signed-off-by:
Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by:
Steve French <smfrench@gmail.com> (cherry picked from commit 038bc961) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Pavel Shilovsky authored
The existing mapping causes unlink() call to return error after delete operation. Changing the mapping to -EACCES makes the client process the call like CIFS protocol does - reset dos attributes with ATTR_READONLY flag masked off and retry the operation. Cc: stable@vger.kernel.org Signed-off-by:
Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by:
Steve French <smfrench@gmail.com> (cherry picked from commit 21496687) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Ilya Dryomov authored
We hard code cephx auth ticket buffer size to 256 bytes. This isn't enough for any moderate setups and, in case tickets themselves are not encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but ceph_decode_copy() doesn't - it's just a memcpy() wrapper). Since the buffer is allocated dynamically anyway, allocated it a bit later, at the point where we know how much is going to be needed. Fixes: http://tracker.ceph.com/issues/8979 Cc: stable@vger.kernel.org Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Sage Weil <sage@redhat.com> (cherry picked from commit c27a3e4d) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Ilya Dryomov authored
Add a helper for processing individual cephx auth tickets. Needed for the next commit, which deals with allocating ticket buffers. (Most of the diff here is whitespace - view with git diff -b). Cc: stable@vger.kernel.org Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Sage Weil <sage@redhat.com> (cherry picked from commit 597cda35) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Sage Weil authored
We preallocate a few of the message types we get back from the mon. If we get a larger message than we are expecting, fall back to trying to allocate a new one instead of blindly using the one we have. CC: stable@vger.kernel.org Signed-off-by:
Sage Weil <sage@redhat.com> Reviewed-by:
Ilya Dryomov <ilya.dryomov@inktank.com> (cherry picked from commit 73c3d481) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Chris Mason authored
Similar to direct IO reads, direct IO writes are using truncate_pagecache_range to invalidate the page cache. This is incorrect due to the sub-block zeroing in the page cache that truncate_pagecache_range() triggers. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. cc: stable@vger.kernel.org Signed-off-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Brian Foster <bfoster@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Dave Chinner <david@fromorbit.com> xfs: don't zero partial page cache pages during O_DIRECT writes xfs is using truncate_pagecache_range to invalidate the page cache during DIO reads. This is different from the other filesystems who only invalidate pages during DIO writes. truncate_pagecache_range is meant to be used when we are freeing the underlying data structs from disk, so it will zero any partial ranges in the page. This means a DIO read can zero out part of the page cache page, and it is possible the page will stay in cache. buffered reads will find an up to date page with zeros instead of the data actually on disk. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. [dchinner: catch error and warn if it fails. Comment.] cc: stable@vger.kernel.org Signed-off-by:
Chris Mason <clm@fb.com> Reviewed-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Brian Foster <bfoster@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Dave Chinner <david@fromorbit.com> (cherry picked from commit 834ffca6 85e584da) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Dave Chinner authored
Similar to direct IO reads, direct IO writes are using truncate_pagecache_range to invalidate the page cache. This is incorrect due to the sub-block zeroing in the page cache that truncate_pagecache_range() triggers. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. cc: stable@vger.kernel.org Signed-off-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Brian Foster <bfoster@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Dave Chinner <david@fromorbit.com> xfs: don't zero partial page cache pages during O_DIRECT writes xfs is using truncate_pagecache_range to invalidate the page cache during DIO reads. This is different from the other filesystems who only invalidate pages during DIO writes. truncate_pagecache_range is meant to be used when we are freeing the underlying data structs from disk, so it will zero any partial ranges in the page. This means a DIO read can zero out part of the page cache page, and it is possible the page will stay in cache. buffered reads will find an up to date page with zeros instead of the data actually on disk. This patch fixes things by using invalidate_inode_pages2_range instead. It preserves the page cache invalidation, but won't zero any pages. [dchinner: catch error and warn if it fails. Comment.] cc: stable@vger.kernel.org Signed-off-by:
Chris Mason <clm@fb.com> Reviewed-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Brian Foster <bfoster@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Dave Chinner <david@fromorbit.com> (cherry picked from commit 834ffca6 85e584da) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Dave Chinner authored
generic/263 is failing fsx at this point with a page spanning EOF that cannot be invalidated. The operations are: 1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes) 1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes) 1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes) where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO write attempts to invalidate the cached page over this range, it fails with -EBUSY and so any attempt to do page invalidation fails. The real question is this: Why can't that page be invalidated after it has been written to disk and cleaned? Well, there's data on the first two buffers in the page (1k block size, 4k page), but the third buffer on the page (i.e. beyond EOF) is failing drop_buffers because it's bh->b_state == 0x3, which is BH_Uptodate | BH_Dirty. IOWs, there's dirty buffers beyond EOF. Say what? OK, set_buffer_dirty() is called on all buffers from __set_page_buffers_dirty(), regardless of whether the buffer is beyond EOF or not, which means that when we get to ->writepage, we have buffers marked dirty beyond EOF that we need to clean. So, we need to implement our own .set_page_dirty method that doesn't dirty buffers beyond EOF. This is messy because the buffer code is not meant to be shared and it has interesting locking issues on the buffer dirty bits. So just copy and paste it and then modify it to suit what we need. Note: the solutions the other filesystems and generic block code use of marking the buffers clean in ->writepage does not work for XFS. It still leaves dirty buffers beyond EOF and invalidations still fail. Hence rather than play whack-a-mole, this patch simply prevents those buffers from being dirtied in the first place. cc: <stable@kernel.org> Signed-off-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Brian Foster <bfoster@redhat.com> Signed-off-by:
Dave Chinner <david@fromorbit.com> (cherry picked from commit 22e757a4) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Dave Chinner authored
When running xfs/305, I noticed that quotacheck was flushing dquot buffers that did not have the xfs_dquot_buf_ops verifiers attached: XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8 ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00 DQ....e......... ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001 ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000 ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80 Call Trace: [<ffffffff81cf1cca>] dump_stack+0x45/0x56 [<ffffffff814d50ca>] _xfs_buf_ioapply+0x3ca/0x3d0 [<ffffffff810db520>] ? wake_up_state+0x20/0x20 [<ffffffff814d51f5>] ? xfs_bdstrat_cb+0x55/0xb0 [<ffffffff814d513b>] xfs_buf_iorequest+0x6b/0xd0 [<ffffffff814d51f5>] xfs_bdstrat_cb+0x55/0xb0 [<ffffffff814d53ab>] __xfs_buf_delwri_submit+0x15b/0x220 [<ffffffff814d6040>] ? xfs_buf_delwri_submit+0x30/0x90 [<ffffffff814d6040>] xfs_buf_delwri_submit+0x30/0x90 [<ffffffff8150f89d>] xfs_qm_quotacheck+0x17d/0x3c0 [<ffffffff81510591>] xfs_qm_mount_quotas+0x151/0x1e0 [<ffffffff814ed01c>] xfs_mountfs+0x56c/0x7d0 [<ffffffff814f0f12>] xfs_fs_fill_super+0x2c2/0x340 [<ffffffff811c9fe4>] mount_bdev+0x194/0x1d0 [<ffffffff814f0c50>] ? xfs_finish_flags+0x170/0x170 [<ffffffff814ef0f5>] xfs_fs_mount+0x15/0x20 [<ffffffff811ca8c9>] mount_fs+0x39/0x1b0 [<ffffffff811e4d67>] vfs_kern_mount+0x67/0x120 [<ffffffff811e757e>] do_mount+0x23e/0xad0 [<ffffffff8117abde>] ? __get_free_pages+0xe/0x50 [<ffffffff811e71e6>] ? copy_mount_options+0x36/0x150 [<ffffffff811e8103>] SyS_mount+0x83/0xc0 [<ffffffff81cfd40b>] tracesys+0xdd/0xe2 This was caused by dquot buffer readahead not attaching a verifier structure to the buffer when readahead was issued, resulting in the followup read of the buffer finding a valid buffer and so not attaching new verifiers to the buffer as part of the read. Also, when a verifier failure occurs, we then read the buffer without verifiers. Attach the verifiers manually after this read so that if the buffer is then written it will be verified that the corruption has been repaired. Further, when flushing a dquot we don't ask for a verifier when reading in the dquot buffer the dquot belongs to. Most of the time this isn't an issue because the buffer is still cached, but when it is not cached it will result in writing the dquot buffer without having the verfier attached. Signed-off-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Brian Foster <bfoster@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Dave Chinner <david@fromorbit.com> (cherry picked from commit 5fd364fe) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Doug Ledford authored
added struct sockaddr_storage to rdma_user_cm.h without also adding an include for linux/socket.h to make sure it is defined. Systemtap needs the header files to build standalone and cannot rely on other files to pre-include other headers, so add linux/socket.h to the list of includes in this file. Fixes: ee7aed45 ("RDMA/ucma: Support querying for AF_IB addresses") Signed-off-by:
Doug Ledford <dledford@redhat.com> Signed-off-by:
Roland Dreier <roland@purestorage.com> (cherry picked from commit db1044d4) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
Steve Wise authored
If the user creates a listening cm_id with backlog of 0 the IWCM ends up not allowing any connection requests at all. The correct behavior is for the IWCM to pick a default value if the user backlog parameter is zero. Lustre from version 1.8.8 onward uses a backlog of 0, which breaks iwarp support without this fix. Signed-off-by:
Steve Wise <swise@opengridcomputing.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Roland Dreier <roland@purestorage.com> (cherry picked from commit 2f0304d2) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-
NeilBrown authored
When a raid10 commences a resync/recovery/reshape it allocates some buffer space. When a resync/recovery completes the buffer space is freed. But not when the reshape completes. This can result in a small memory leak. There is a subtle side-effect of this bug. When a RAID10 is reshaped to a larger array (more devices), the reshape is immediately followed by a "resync" of the new space. This "resync" will use the buffer space which was allocated for "reshape". This can cause problems including a "BUG" in the SCSI layer. So this is suitable for -stable. Cc: stable@vger.kernel.org (v3.5+) Fixes: 3ea7daa5Signed-off-by:
NeilBrown <neilb@suse.de> (cherry picked from commit b3968552) Signed-off-by:
Sasha Levin <sasha.levin@oracle.com>
-