Commits · b6c52222f4947ff41cf896b92b20931409c3c689 · nexedi / linux

12 Dec, 2016 13 commits

rcu: Fix soft lockup for rcu_nocb_kthread · b6c52222

Ding Tianhong authored Jun 15, 2016

commit bedc1969 upstream.

Carrying out the following steps results in a softlockup in the
RCU callback-offload (rcuo) kthreads:

1. Connect to ixgbevf, and set the speed to 10Gb/s.
2. Use ifconfig to bring the nic up and down repeatedly.

[  317.005148] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[  368.106005] BUG: soft lockup - CPU#1 stuck for 22s! [rcuos/1:15]
[  368.106005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  368.106005] task: ffff88057dd8a220 ti: ffff88057dd9c000 task.ti: ffff88057dd9c000
[  368.106005] RIP: 0010:[<ffffffff81579e04>]  [<ffffffff81579e04>] fib_table_lookup+0x14/0x390
[  368.106005] RSP: 0018:ffff88061fc83ce8  EFLAGS: 00000286
[  368.106005] RAX: 0000000000000001 RBX: 00000000020155c0 RCX: 0000000000000001
[  368.106005] RDX: ffff88061fc83d50 RSI: ffff88061fc83d70 RDI: ffff880036d11a00
[  368.106005] RBP: ffff88061fc83d08 R08: 0000000000000001 R09: 0000000000000000
[  368.106005] R10: ffff880036d11a00 R11: ffffffff819e0900 R12: ffff88061fc83c58
[  368.106005] R13: ffffffff816154dd R14: ffff88061fc83d08 R15: 00000000020155c0
[  368.106005] FS:  0000000000000000(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000
[  368.106005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.106005] CR2: 00007f8c2aee9c40 CR3: 000000057b222000 CR4: 00000000000407e0
[  368.106005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  368.106005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  368.106005] Stack:
[  368.106005]  00000000010000c0 ffff88057b766000 ffff8802e380b000 ffff88057af03e00
[  368.106005]  ffff88061fc83dc0 ffffffff815349a6 ffff88061fc83d40 ffffffff814ee146
[  368.106005]  ffff8802e380af00 00000000e380af00 ffffffff819e0900 020155c0010000c0
[  368.106005] Call Trace:
[  368.106005]  <IRQ>
[  368.106005]
[  368.106005]  [<ffffffff815349a6>] ip_route_input_noref+0x516/0xbd0
[  368.106005]  [<ffffffff814ee146>] ? skb_release_data+0xd6/0x110
[  368.106005]  [<ffffffff814ee20a>] ? kfree_skb+0x3a/0xa0
[  368.106005]  [<ffffffff8153698f>] ip_rcv_finish+0x29f/0x350
[  368.106005]  [<ffffffff81537034>] ip_rcv+0x234/0x380
[  368.106005]  [<ffffffff814fd656>] __netif_receive_skb_core+0x676/0x870
[  368.106005]  [<ffffffff814fd868>] __netif_receive_skb+0x18/0x60
[  368.106005]  [<ffffffff814fe4de>] process_backlog+0xae/0x180
[  368.106005]  [<ffffffff814fdcb2>] net_rx_action+0x152/0x240
[  368.106005]  [<ffffffff81077b3f>] __do_softirq+0xef/0x280
[  368.106005]  [<ffffffff8161619c>] call_softirq+0x1c/0x30
[  368.106005]  <EOI>
[  368.106005]
[  368.106005]  [<ffffffff81015d95>] do_softirq+0x65/0xa0
[  368.106005]  [<ffffffff81077174>] local_bh_enable+0x94/0xa0
[  368.106005]  [<ffffffff81114922>] rcu_nocb_kthread+0x232/0x370
[  368.106005]  [<ffffffff81098250>] ? wake_up_bit+0x30/0x30
[  368.106005]  [<ffffffff811146f0>] ? rcu_start_gp+0x40/0x40
[  368.106005]  [<ffffffff8109728f>] kthread+0xcf/0xe0
[  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140
[  368.106005]  [<ffffffff816147d8>] ret_from_fork+0x58/0x90
[  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140

==================================cut here==============================

It turns out that the rcuos callback-offload kthread is busy processing
a very large quantity of RCU callbacks, and it is not reliquishing the
CPU while doing so.  This commit therefore adds an cond_resched_rcu_qs()
within the loop to allow other tasks to run.

[js] use onlu cond_resched() in 3.12
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
[ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

b6c52222

x86/traps: Ignore high word of regs->cs in early_fixup_exception() · e59f8eb4

Andy Lutomirski authored Nov 19, 2016

commit fc0e81b2 upstream.

On the 80486 DX, it seems that some exceptions may leave garbage in
the high bits of CS.  This causes sporadic failures in which
early_fixup_exception() refuses to fix up an exception.

As far as I can tell, this has been buggy for a long time, but the
problem seems to have been exacerbated by commits:

  1e02ce4c ("x86: Store a per-cpu shadow copy of CR4")
  e1bfc11c ("x86/init: Fix cr4_init_shadow() on CR4-less machines")

This appears to have broken for as long as we've had early
exception handling.

[ This backport should apply to kernels from 3.4 - 4.5. ]

Fixes: 4c5023a3 ("x86-32: Handle exception table entries during early boot")
Cc: H. Peter Anvin <hpa@zytor.com>
Reported-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

e59f8eb4

drm/radeon: Ensure vblank interrupt is enabled on DPMS transition to on · 78a9c69d

Michel Dänzer authored Nov 29, 2016

NOTE: This patch only applies to 4.5.y or older kernels. With newer
kernels, this problem cannot happen because the driver now uses
drm_crtc_vblank_on/off instead of drm_vblank_pre/post_modeset[0]. I
consider this patch safer for older kernels than backporting the API
change, because drm_crtc_vblank_on/off had various issues in older
kernels, and I'm not sure all fixes for those have been backported to
all stable branches where this patch could be applied.

    ---------------------

Fixes the vblank interrupt being disabled when it should be on, which
can cause at least the following symptoms:

* Hangs when running 'xset dpms force off' in a GNOME session with
  gnome-shell using DRI2.
* RandR 1.4 slave outputs freezing with garbage displayed using
  xf86-video-ati 7.8.0 or newer.

[0] See upstream commit:

commit 777e3cbc
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jan 21 11:08:57 2016 +0100

    drm/radeon: Switch to drm_vblank_on/off
Reported-and-Tested-by: Max Staudt <mstaudt@suse.de>
Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>

78a9c69d

mpi: Fix NULL ptr dereference in mpi_powm() [ver #3] · 18fb7a8f

Andrey Ryabinin authored Nov 24, 2016

commit f5527fff upstream.

This fixes CVE-2016-8650.

If mpi_powm() is given a zero exponent, it wants to immediately return
either 1 or 0, depending on the modulus.  However, if the result was
initalised with zero limb space, no limbs space is allocated and a
NULL-pointer exception ensues.

Fix this by allocating a minimal amount of limb space for the result when
the 0-exponent case when the result is 1 and not touching the limb space
when the result is 0.

This affects the use of RSA keys and X.509 certificates that carry them.

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
PGD 0
Oops: 0002 [#1] SMP
Modules linked in:
CPU: 3 PID: 3014 Comm: keyctl Not tainted 4.9.0-rc6-fscache+ #278
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
task: ffff8804011944c0 task.stack: ffff880401294000
RIP: 0010:[<ffffffff8138ce5d>]  [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
RSP: 0018:ffff880401297ad8  EFLAGS: 00010212
RAX: 0000000000000000 RBX: ffff88040868bec0 RCX: ffff88040868bba0
RDX: ffff88040868b260 RSI: ffff88040868bec0 RDI: ffff88040868bee0
RBP: ffff880401297ba8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000047 R11: ffffffff8183b210 R12: 0000000000000000
R13: ffff8804087c7600 R14: 000000000000001f R15: ffff880401297c50
FS:  00007f7a7918c700(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000401250000 CR4: 00000000001406e0
Stack:
 ffff88040868bec0 0000000000000020 ffff880401297b00 ffffffff81376cd4
 0000000000000100 ffff880401297b10 ffffffff81376d12 ffff880401297b30
 ffffffff81376f37 0000000000000100 0000000000000000 ffff880401297ba8
Call Trace:
 [<ffffffff81376cd4>] ? __sg_page_iter_next+0x43/0x66
 [<ffffffff81376d12>] ? sg_miter_get_next_page+0x1b/0x5d
 [<ffffffff81376f37>] ? sg_miter_next+0x17/0xbd
 [<ffffffff8138ba3a>] ? mpi_read_raw_from_sgl+0xf2/0x146
 [<ffffffff8132a95c>] rsa_verify+0x9d/0xee
 [<ffffffff8132acca>] ? pkcs1pad_sg_set_buf+0x2e/0xbb
 [<ffffffff8132af40>] pkcs1pad_verify+0xc0/0xe1
 [<ffffffff8133cb5e>] public_key_verify_signature+0x1b0/0x228
 [<ffffffff8133d974>] x509_check_for_self_signed+0xa1/0xc4
 [<ffffffff8133cdde>] x509_cert_parse+0x167/0x1a1
 [<ffffffff8133d609>] x509_key_preparse+0x21/0x1a1
 [<ffffffff8133c3d7>] asymmetric_key_preparse+0x34/0x61
 [<ffffffff812fc9f3>] key_create_or_update+0x145/0x399
 [<ffffffff812fe227>] SyS_add_key+0x154/0x19e
 [<ffffffff81001c2b>] do_syscall_64+0x80/0x191
 [<ffffffff816825e4>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 56 41 55 41 54 53 48 81 ec a8 00 00 00 44 8b 71 04 8b 42 04 4c 8b 67 18 45 85 f6 89 45 80 0f 84 b4 06 00 00 85 c0 75 2f 41 ff ce <49> c7 04 24 01 00 00 00 b0 01 75 0b 48 8b 41 18 48 83 38 01 0f
RIP  [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
 RSP <ffff880401297ad8>
CR2: 0000000000000000
---[ end trace d82015255d4a5d8d ]---

Basically, this is a backport of a libgcrypt patch:

	http://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=patch;h=6e1adb05d290aeeb1c230c763970695f4a538526

Fixes: cdec9cb5 ("crypto: GnuPG based MPI lib - source files (part 1)")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
cc: linux-ima-devel@lists.sourceforge.net
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

18fb7a8f

apparmor: fix change_hat not finding hat after policy replacement · 6d7bc8a8

John Johansen authored Aug 31, 2016

commit 3d40658c upstream.

After a policy replacement, the task cred may be out of date and need
to be updated. However change_hat is using the stale profiles from
the out of date cred resulting in either: a stale profile being applied
or, incorrect failure when searching for a hat profile as it has been
migrated to the new parent profile.

Fixes: 01e2b670 (failure to find hat)
Fixes: 898127c3 (stale policy being applied)
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1000287Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

6d7bc8a8

cfg80211: limit scan results cache size · e525af57

Johannes Berg authored Nov 15, 2016

commit 9853a55e upstream.

It's possible to make scanning consume almost arbitrary amounts
of memory, e.g. by sending beacon frames with random BSSIDs at
high rates while somebody is scanning.

Limit the number of BSS table entries we're willing to cache to
1000, limiting maximum memory usage to maybe 4-5MB, but lower
in practice - that would be the case for having both full-sized
beacon and probe response frames for each entry; this seems not
possible in practice, so a limit of 1000 entries will likely be
closer to 0.5 MB.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

e525af57

tile: avoid using clocksource_cyc2ns with absolute cycle count · 2323e1a9

Chris Metcalf authored Nov 16, 2016

commit e658a6f1 upstream.

For large values of "mult" and long uptimes, the intermediate
result of "cycles * mult" can overflow 64 bits.  For example,
the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
we have mult = 853, and after 208.5 days, we overflow 64 bits.

Since clocksource_cyc2ns() is intended to be used for relative
cycle counts, not absolute cycle counts, performance is more
importance than accepting a wider range of cycle values.  So,
just use mult_frac() directly in tile's sched_clock().

Commit 4cecf6d4 ("sched, x86: Avoid unnecessary overflow
in sched_clock") by Salman Qazi results in essentially the same
generated code for x86 as this change does for tile.  In fact,
a follow-on change by Salman introduced mult_frac() and switched
to using it, so the C code was largely identical at that point too.

Peter Zijlstra then added mul_u64_u32_shr() and switched x86
to use it.  This is, in principle, better; by optimizing the
64x64->64 multiplies to be 32x32->64 multiplies we can potentially
save some time.  However, the compiler piplines the 64x64->64
multiplies pretty well, and the conditional branch in the generic
mul_u64_u32_shr() causes some bubbles in execution, with the
result that it's pretty much a wash.  If tilegx provided its own
implementation of mul_u64_u32_shr() without the conditional branch,
we could potentially save 3 cycles, but that seems like small gain
for a fair amount of additional build scaffolding; no other platform
currently provides a mul_u64_u32_shr() override, and tile doesn't
currently have an <asm/div64.h> header to put the override in.

Additionally, gcc currently has an optimization bug that prevents
it from recognizing the opportunity to use a 32x32->64 multiply,
and so the result would be no better than the existing mult_frac()
until such time as the compiler is fixed.

For now, just using mult_frac() seems like the right answer.
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

2323e1a9

scsi: mpt3sas: Fix secure erase premature termination · 940aeba5

Andrey Grodzovsky authored Nov 10, 2016

commit 18f6084a upstream.

This is a work around for a bug with LSI Fusion MPT SAS2 when perfoming
secure erase. Due to the very long time the operation takes, commands
issued during the erase will time out and will trigger execution of the
abort hook. Even though the abort hook is called for the specific
command which timed out, this leads to entire device halt
(scsi_state terminated) and premature termination of the secure erase.

Set device state to busy while ATA passthrough commands are in progress.

[mkp: hand applied to 4.9/scsi-fixes, tweaked patch description]
Signed-off-by: Andrey Grodzovsky <andrey2805@gmail.com>
Acked-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com>
Cc: <linux-scsi@vger.kernel.org>
Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

940aeba5

Fix USB CB/CBI storage devices with CONFIG_VMAP_STACK=y · af5bc71b

Petr Vandrovec authored Nov 10, 2016

commit 2ce9d227 upstream.

Some code (all error handling) submits CDBs that are allocated
on the stack.  This breaks with CB/CBI code that tries to create
URB directly from SCSI command buffer - which happens to be in
vmalloced memory with vmalloced kernel stacks.

Let's make copy of the command in usb_stor_CB_transport.
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

af5bc71b

USB: serial: ftdi_sio: add support for TI CC3200 LaunchPad · 9a64b7b1

Doug Brown authored Nov 04, 2016

commit 9bfef729 upstream.

This patch adds support for the TI CC3200 LaunchPad board, which uses a
custom USB vendor ID and product ID. Channel A is used for JTAG, and
channel B is used for a UART.
Signed-off-by: Doug Brown <doug@schmorgal.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

9a64b7b1

USB: serial: cp210x: add ID for the Zone DPMX · b2f43aa4

Paul Jakma authored Nov 16, 2016

commit 2ab13292 upstream.

The BRIM Brothers Zone DPMX is a bicycle powermeter. This ID is for the USB
serial interface in its charging dock for the control pods, via which some
settings for the pods can be modified.
Signed-off-by: Paul Jakma <paul@jakma.org>
Cc: Barry Redmond <barry@brimbrothers.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

b2f43aa4

usb: chipidea: move the lock initialization to core file · 11a35448

Peter Chen authored Nov 15, 2016

commit a5d906bb upstream.

This can fix below dump when the lock is accessed at host
mode due to it is not initialized.

[   46.119638] INFO: trying to register non-static key.
[   46.124643] the code is fine but needs lockdep annotation.
[   46.130144] turning off the locking correctness validator.
[   46.135659] CPU: 0 PID: 690 Comm: cat Not tainted 4.9.0-rc3-00079-g4b75f1d #1210
[   46.143075] Hardware name: Freescale i.MX6 SoloX (Device Tree)
[   46.148923] Backtrace:
[   46.151448] [<c010c460>] (dump_backtrace) from [<c010c658>] (show_stack+0x18/0x1c)
[   46.159038]  r7:edf52000
[   46.161412]  r6:60000193
[   46.163967]  r5:00000000
[   46.165035]  r4:c0e25c2c

[   46.169109] [<c010c640>] (show_stack) from [<c03f58a4>] (dump_stack+0xb4/0xe8)
[   46.176362] [<c03f57f0>] (dump_stack) from [<c016d690>] (register_lock_class+0x4fc/0x56c)
[   46.184554]  r10:c0e25d24
[   46.187014]  r9:edf53e70
[   46.189569]  r8:c1642444
[   46.190637]  r7:ee9da024
[   46.193191]  r6:00000000
[   46.194258]  r5:00000000
[   46.196812]  r4:00000000
[   46.199185]  r3:00000001

[   46.203259] [<c016d194>] (register_lock_class) from [<c0171294>] (__lock_acquire+0x80/0x10f0)
[   46.211797]  r10:c0e25d24
[   46.214257]  r9:edf53e70
[   46.216813]  r8:ee9da024
[   46.217880]  r7:c1642444
[   46.220435]  r6:edcd1800
[   46.221502]  r5:60000193
[   46.224057]  r4:00000000

[   46.227953] [<c0171214>] (__lock_acquire) from [<c01726c0>] (lock_acquire+0x74/0x94)
[   46.235710]  r10:00000001
[   46.238169]  r9:edf53e70
[   46.240723]  r8:edf53f80
[   46.241790]  r7:00000001
[   46.244344]  r6:00000001
[   46.245412]  r5:60000193
[   46.247966]  r4:00000000

[   46.251866] [<c017264c>] (lock_acquire) from [<c096c8fc>] (_raw_spin_lock_irqsave+0x40/0x54)
[   46.260319]  r7:ee1c6a00
[   46.262691]  r6:c062a570
[   46.265247]  r5:20000113
[   46.266314]  r4:ee9da014

[   46.270393] [<c096c8bc>] (_raw_spin_lock_irqsave) from [<c062a570>] (ci_port_test_show+0x2c/0x70)
[   46.279280]  r6:eebd2000
[   46.281652]  r5:ee9da010
[   46.284207]  r4:ee9da014

[   46.286810] [<c062a544>] (ci_port_test_show) from [<c0248d04>] (seq_read+0x1ac/0x4f8)
[   46.294655]  r9:edf53e70
[   46.297028]  r8:edf53f80
[   46.299583]  r7:ee1c6a00
[   46.300650]  r6:00000001
[   46.303205]  r5:00000000
[   46.304273]  r4:eebd2000
[   46.306850] [<c0248b58>] (seq_read) from [<c039e864>] (full_proxy_read+0x54/0x6c)
[   46.314348]  r10:00000000
[   46.316808]  r9:c0a6ad30
[   46.319363]  r8:edf53f80
[   46.320430]  r7:00020000
[   46.322986]  r6:b6de3000
[   46.324053]  r5:ee1c6a00
[   46.326607]  r4:c0248b58

[   46.330505] [<c039e810>] (full_proxy_read) from [<c021ec98>] (__vfs_read+0x34/0x118)
[   46.338262]  r9:edf52000
[   46.340635]  r8:c0107fc4
[   46.343190]  r7:00020000
[   46.344257]  r6:edf53f80
[   46.346812]  r5:c039e810
[   46.347879]  r4:ee1c6a00
[   46.350447] [<c021ec64>] (__vfs_read) from [<c021fbd0>] (vfs_read+0x8c/0x11c)
[   46.357597]  r9:edf52000
[   46.359969]  r8:c0107fc4
[   46.362524]  r7:edf53f80
[   46.363592]  r6:b6de3000
[   46.366147]  r5:ee1c6a00
[   46.367214]  r4:00020000
[   46.369782] [<c021fb44>] (vfs_read) from [<c0220a4c>] (SyS_read+0x4c/0xa8)
[   46.376672]  r8:c0107fc4
[   46.379045]  r7:00020000
[   46.381600]  r6:b6de3000
[   46.382667]  r5:ee1c6a00
[   46.385222]  r4:ee1c6a00

[   46.387817] [<c0220a00>] (SyS_read) from [<c0107e20>] (ret_fast_syscall+0x0/0x1c)
[   46.395314]  r7:00000003
[   46.397687]  r6:b6de3000
[   46.400243]  r5:00020000
[   46.401310]  r4:00020000

Fixes: 26c696c6 ("USB: Chipidea: rename struct ci13xxx variables from udc to ci")
Signed-off-by: Peter Chen <peter.chen@nxp.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

11a35448

KVM: x86: drop error recovery in em_jmp_far and em_ret_far · 53093fc1

Radim Krčmář authored Nov 23, 2016

commit 2117d539 upstream.

em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.

Found by syzkaller:

  WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] panic+0x1b7/0x3a3 kernel/panic.c:179
   [...] __warn+0x1c4/0x1e0 kernel/panic.c:542
   [...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
   [...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217
   [...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227
   [...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294
   [...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545
   [...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116
   [...] complete_emulated_io arch/x86/kvm/x86.c:6870
   [...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934
   [...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978
   [...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: d1442d85 ("KVM: x86: Handle errors when RIP is set during far jumps")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

53093fc1

01 Dec, 2016 1 commit

Revert "drivers/net: Disable UFO through virtio" · 36d50c23

Vlad Yasevich authored Feb 03, 2015

commit e3e3c423 upstream.

This reverts commit 3d0ad094.

Now that GSO functionality can correctly track if the fragment
id has been selected and select a fragment id if necessary,
we can re-enable UFO on tap/macvap and virtio devices.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

36d50c23

28 Nov, 2016 26 commits

tty: audit: Fix audit source · ef99a35d

Peter Hurley authored Nov 08, 2015

commit 6b2a3d62 upstream.

The data to audit/record is in the 'from' buffer (ie., the input
read buffer).

Fixes: 72586c60 ("n_tty: Fix auditing support for cannonical mode")
Cc: Miloslav Trmač <mitr@redhat.com>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Acked-by: Laura Abbott <labbott@fedoraproject.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

ef99a35d

kernel/panic.c: turn off locks debug before releasing console lock · fd81c458

Vitaly Kuznetsov authored Nov 20, 2015

commit 7625b3a0 upstream.

Commit 08d78658 ("panic: release stale console lock to always get the
logbuf printed out") introduced an unwanted bad unlock balance report when
panic() is called directly and not from OOPS (e.g.  from out_of_memory()).
The difference is that in case of OOPS we disable locks debug in
oops_enter() and on direct panic call nobody does that.

Fixes: 08d78658 ("panic: release stale console lock to always get the logbuf printed out")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

fd81c458

mtd: blkdevs: fix potential deadlock + lockdep warnings · 727cf403

Brian Norris authored Oct 26, 2015

commit f3c63795 upstream.

Commit 073db4a5 ("mtd: fix: avoid race condition when accessing
mtd->usecount") fixed a race condition but due to poor ordering of the
mutex acquisition, introduced a potential deadlock.

The deadlock can occur, for example, when rmmod'ing the m25p80 module, which
will delete one or more MTDs, along with any corresponding mtdblock
devices. This could potentially race with an acquisition of the block
device as follows.

 -> blktrans_open()
    ->  mutex_lock(&dev->lock);
    ->  mutex_lock(&mtd_table_mutex);

 -> del_mtd_device()
    ->  mutex_lock(&mtd_table_mutex);
    ->  blktrans_notify_remove() -> del_mtd_blktrans_dev()
       ->  mutex_lock(&dev->lock);

This is a classic (potential) ABBA deadlock, which can be fixed by
making the A->B ordering consistent everywhere. There was no real
purpose to the ordering in the original patch, AFAIR, so this shouldn't
be a problem. This ordering was actually already present in
del_mtd_blktrans_dev(), for one, where the function tried to ensure that
its caller already held mtd_table_mutex before it acquired &dev->lock:

        if (mutex_trylock(&mtd_table_mutex)) {
                mutex_unlock(&mtd_table_mutex);
                BUG();
        }

So, reverse the ordering of acquisition of &dev->lock and &mtd_table_mutex so
we always acquire mtd_table_mutex first.

Snippets of the lockdep output follow:

  # modprobe -r m25p80
  [   53.419251]
  [   53.420838] ======================================================
  [   53.427300] [ INFO: possible circular locking dependency detected ]
  [   53.433865] 4.3.0-rc6 #96 Not tainted
  [   53.437686] -------------------------------------------------------
  [   53.444220] modprobe/372 is trying to acquire lock:
  [   53.449320]  (&new->lock){+.+...}, at: [<c043fe4c>] del_mtd_blktrans_dev+0x80/0xdc
  [   53.457271]
  [   53.457271] but task is already holding lock:
  [   53.463372]  (mtd_table_mutex){+.+.+.}, at: [<c0439994>] del_mtd_device+0x18/0x100
  [   53.471321]
  [   53.471321] which lock already depends on the new lock.
  [   53.471321]
  [   53.479856]
  [   53.479856] the existing dependency chain (in reverse order) is:
  [   53.487660]
  -> #1 (mtd_table_mutex){+.+.+.}:
  [   53.492331]        [<c043fc5c>] blktrans_open+0x34/0x1a4
  [   53.497879]        [<c01afce0>] __blkdev_get+0xc4/0x3b0
  [   53.503364]        [<c01b0bb8>] blkdev_get+0x108/0x320
  [   53.508743]        [<c01713c0>] do_dentry_open+0x218/0x314
  [   53.514496]        [<c0180454>] path_openat+0x4c0/0xf9c
  [   53.519959]        [<c0182044>] do_filp_open+0x5c/0xc0
  [   53.525336]        [<c0172758>] do_sys_open+0xfc/0x1cc
  [   53.530716]        [<c000f740>] ret_fast_syscall+0x0/0x1c
  [   53.536375]
  -> #0 (&new->lock){+.+...}:
  [   53.540587]        [<c063f124>] mutex_lock_nested+0x38/0x3cc
  [   53.546504]        [<c043fe4c>] del_mtd_blktrans_dev+0x80/0xdc
  [   53.552606]        [<c043f164>] blktrans_notify_remove+0x7c/0x84
  [   53.558891]        [<c04399f0>] del_mtd_device+0x74/0x100
  [   53.564544]        [<c043c670>] del_mtd_partitions+0x80/0xc8
  [   53.570451]        [<c0439aa0>] mtd_device_unregister+0x24/0x48
  [   53.576637]        [<c046ce6c>] spi_drv_remove+0x1c/0x34
  [   53.582207]        [<c03de0f0>] __device_release_driver+0x88/0x114
  [   53.588663]        [<c03de19c>] device_release_driver+0x20/0x2c
  [   53.594843]        [<c03dd9e8>] bus_remove_device+0xd8/0x108
  [   53.600748]        [<c03dacc0>] device_del+0x10c/0x210
  [   53.606127]        [<c03dadd0>] device_unregister+0xc/0x20
  [   53.611849]        [<c046d878>] __unregister+0x10/0x20
  [   53.617211]        [<c03da868>] device_for_each_child+0x50/0x7c
  [   53.623387]        [<c046eae8>] spi_unregister_master+0x58/0x8c
  [   53.629578]        [<c03e12f0>] release_nodes+0x15c/0x1c8
  [   53.635223]        [<c03de0f8>] __device_release_driver+0x90/0x114
  [   53.641689]        [<c03de900>] driver_detach+0xb4/0xb8
  [   53.647147]        [<c03ddc78>] bus_remove_driver+0x4c/0xa0
  [   53.652970]        [<c00cab50>] SyS_delete_module+0x11c/0x1e4
  [   53.658976]        [<c000f740>] ret_fast_syscall+0x0/0x1c
  [   53.664621]
  [   53.664621] other info that might help us debug this:
  [   53.664621]
  [   53.672979]  Possible unsafe locking scenario:
  [   53.672979]
  [   53.679169]        CPU0                    CPU1
  [   53.683900]        ----                    ----
  [   53.688633]   lock(mtd_table_mutex);
  [   53.692383]                                lock(&new->lock);
  [   53.698306]                                lock(mtd_table_mutex);
  [   53.704658]   lock(&new->lock);
  [   53.707946]
  [   53.707946]  *** DEADLOCK ***

Fixes: 073db4a5 ("mtd: fix: avoid race condition when accessing mtd->usecount")
Reported-by: Felipe Balbi <balbi@ti.com>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

727cf403

i2c: at91: fix write transfers by clearing pending interrupt first · 087ff3d1

Cyrille Pitchen authored Oct 21, 2015

commit 6f6ddbb0 upstream.

In some cases a NACK interrupt may be pending in the Status Register (SR)
as a result of a previous transfer. However at91_do_twi_transfer() did not
read the SR to clear pending interruptions before starting a new transfer.
Hence a NACK interrupt rose as soon as it was enabled again at the I2C
controller level, resulting in a wrong sequence of operations and strange
patterns of behaviour on the I2C bus, such as a clock stretch followed by
a restart of the transfer.

This first issue occurred with both DMA and PIO write transfers.

Also when a NACK error was detected during a PIO write transfer, the
interrupt handler used to wrongly start a new transfer by writing into the
Transmit Holding Register (THR). Then the I2C slave was likely to reply
with a second NACK.

This second issue is fixed in atmel_twi_interrupt() by handling the TXRDY
status bit only if both the TXCOMP and NACK status bits are cleared.

Tested with a at24 eeprom on sama5d36ek board running a linux-4.1-at91
kernel image. Adapted to linux-next.
Reported-by: Peter Rosin <peda@lysator.liu.se>
Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com>
Signed-off-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Tested-by: Peter Rosin <peda@lysator.liu.se>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Fixes: 93563a6a ("i2c: at91: fix a race condition when using the DMA controller")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

087ff3d1

PCI: Use function 0 VPD for identical functions, regular VPD for others · 7d1e51af

Alex Williamson authored Sep 15, 2015

commit da2d03ea upstream.

932c435c ("PCI: Add dev_flags bit to access VPD through function 0")
added PCI_DEV_FLAGS_VPD_REF_F0.  Previously, we set the flag on every
non-zero function of quirked devices.  If a function turned out to be
different from function 0, i.e., it had a different class, vendor ID, or
device ID, the flag remained set but we didn't make VPD accessible at all.

Flip this around so we only set PCI_DEV_FLAGS_VPD_REF_F0 for functions that
are identical to function 0, and allow regular VPD access for any other
functions.

[bhelgaas: changelog, stable tag]
Fixes: 932c435c ("PCI: Add dev_flags bit to access VPD through function 0")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
Acked-by: Myron Stowe <myron.stowe@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

7d1e51af

PCI: Fix devfn for VPD access through function 0 · eabe51b6

Alex Williamson authored Sep 15, 2015

commit 9d924075 upstream.

Commit 932c435c ("PCI: Add dev_flags bit to access VPD through function
0") passes PCI_SLOT(devfn) for the devfn parameter of pci_get_slot().
Generally this works because we're fairly well guaranteed that a PCIe
device is at slot address 0, but for the general case, including
conventional PCI, it's incorrect.  We need to get the slot and then convert
it back into a devfn.

Fixes: 932c435c ("PCI: Add dev_flags bit to access VPD through function 0")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
Acked-by: Myron Stowe <myron.stowe@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

eabe51b6

x86/idle: Restore trace_cpu_idle to mwait_idle() calls · 15034b96

Jisheng Zhang authored Aug 20, 2015

commit e43d0189 upstream.

Commit b253149b ("sched/idle/x86: Restore mwait_idle() to fix boot
hangs, to improve power savings and to improve performance") restores
mwait_idle(), but the trace_cpu_idle related calls are missing. This
causes powertop on my old desktop powered by Intel Core2 E6550 to
report zero wakeups and zero events.

Add them back to restore the proper behaviour.

Fixes: b253149b ("sched/idle/x86: Restore mwait_idle() to ...")
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Cc: <len.brown@intel.com>
Link: http://lkml.kernel.org/r/1440046479-4262-1-git-send-email-jszhang@marvell.comSigned-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

15034b96

Linux 3.12.68 · f19260ac
Jiri Slaby authored Nov 25, 2016

f19260ac

ALSA: usb-audio: Fix runtime PM unbalance · dbb41290

Takashi Iwai authored Aug 19, 2015

commit 9003ebb1 upstream.

The fix for deadlock in PM in commit [1ee23fe0: ALSA: usb-audio:
Fix deadlocks at resuming] introduced a new check of in_pm flag.
However, the brainless patch author evaluated it in a wrong way
(logical AND instead of logical OR), thus usb_autopm_get_interface()
is wrongly called at probing, leading to unbalance of runtime PM
refcount.

This patch fixes it by correcting the logic.
Reported-by: Hans Yang <hansy@nvidia.com>
Fixes: 1ee23fe0 ('ALSA: usb-audio: Fix deadlocks at resuming')
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

dbb41290

xen-pciback: Add name prefix to global 'permissive' variable · 6417d6f2

Ben Hutchings authored Apr 13, 2015

commit 8014bcc8 upstream.

The variable for the 'permissive' module parameter used to be static
but was recently changed to be extern.  This puts it in the kernel
global namespace if the driver is built-in, so its name should begin
with a prefix identifying the driver.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: af6fc858 ("xen-pciback: limit guest control of command register")
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

6417d6f2

PCI: Handle read-only BARs on AMD CS553x devices · f183980f

Myron Stowe authored Feb 03, 2015

commit 06cf35f9 upstream.

Some AMD CS553x devices have read-only BARs because of a firmware or
hardware defect.  There's a workaround in quirk_cs5536_vsa(), but it no
longer works after 36e81648 ("PCI: Restore detection of read-only
BARs").  Prior to 36e81648, we filled in res->start; afterwards we
leave it zeroed out.  The quirk only updated the size, so the driver tried
to use a region starting at zero, which didn't work.

Expand quirk_cs5536_vsa() to read the base addresses from the BARs and
hard-code the sizes.

On Nix's system BAR 2's read-only value is 0x6200.  Prior to 36e81648,
we interpret that as a 512-byte BAR based on the lowest-order bit set.  Per
datasheet sec 5.6.1, that BAR (MFGPT) requires only 64 bytes; use that to
avoid clearing any address bits if a platform uses only 64-byte alignment.

[js] pcibios_bus_to_resource takes pdev, not bus in 3.12

[bhelgaas: changelog, reduce BAR 2 size to 64]
Fixes: 36e81648 ("PCI: Restore detection of read-only BARs")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=85991#c4
Link: http://support.amd.com/TechDocs/31506_cs5535_databook.pdf
Link: http://support.amd.com/TechDocs/33238G_cs5536_db.pdfReported-and-tested-by: Nix <nix@esperi.org.uk>
Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

f183980f

perf: Tighten (and fix) the grouping condition · 5e08a111

Peter Zijlstra authored Jan 23, 2015

commit c3c87e77 upstream.

The fix from 9fc81d87 ("perf: Fix events installation during
moving group") was incomplete in that it failed to recognise that
creating a group with events for different CPUs is semantically
broken -- they cannot be co-scheduled.

Furthermore, it leads to real breakage where, when we create an event
for CPU Y and then migrate it to form a group on CPU X, the code gets
confused where the counter is programmed -- triggered in practice
as well by me via the perf fuzzer.

Fix this by tightening the rules for creating groups. Only allow
grouping of counters that can be co-scheduled in the same context.
This means for the same task and/or the same cpu.

Fixes: 9fc81d87 ("perf: Fix events installation during moving group")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.orgSigned-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

5e08a111

usb: musb: musb_cppi41: recognize HS devices in hostmode · 1af405b1

Sebastian Andrzej Siewior authored Nov 13, 2014

commit 1eec34e9 upstream.

There is a poll loop for max 25us for HS devices. Now guess what, I
tested it in gadget mode and forgot about the little detail. Nobody seem
to have it noticed…
This patch adds the missing logic for hostmode so it is recognized in
host and device mode properly.

Fixes: 50aea6fc ("usb: musb: cppi41: fire hrtimer according to
programmed channel length")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

1af405b1

drivers/net: Disable UFO through virtio · a8767147

Ben Hutchings authored Oct 30, 2014

commit 3d0ad094 upstream.

IPv6 does not allow fragmentation by routers, so there is no
fragmentation ID in the fixed header.  UFO for IPv6 requires the ID to
be passed separately, but there is no provision for this in the virtio
net protocol.

Until recently our software implementation of UFO/IPv6 generated a new
ID, but this was a bug.  Now we will use ID=0 for any UFO/IPv6 packet
passed through a tap, which is even worse.

Unfortunately there is no distinction between UFO/IPv4 and v6
features, so disable UFO on taps and virtio_net completely until we
have a proper solution.

We cannot depend on VM managers respecting the tap feature flags, so
keep accepting UFO packets but log a warning the first time we do
this.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf4 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

a8767147

KVM: check for !is_zero_pfn() in kvm_is_mmio_pfn() · 79024d0f

Ard Biesheuvel authored Sep 12, 2014

commit 85c8555f upstream.

Read-only memory ranges may be backed by the zero page, so avoid
misidentifying it a a MMIO pfn.

This fixes another issue I identified when testing QEMU+KVM_UEFI, where
a read to an uninitialized emulated NOR flash brought in the zero page,
but mapped as a read-write device region, because kvm_is_mmio_pfn()
misidentifies it as a MMIO pfn due to its PG_reserved bit being set.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Fixes: b8865767 ("ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

79024d0f

mm: export symbol dependencies of is_zero_pfn() · 571a8e0b

Ard Biesheuvel authored Sep 12, 2014

commit 0b70068e upstream.

In order to make the static inline function is_zero_pfn() callable by
modules, export its symbol dependencies 'zero_pfn' and (for s390 and
mips) 'zero_page_mask'.

We need this for KVM, as CONFIG_KVM is a tristate for all supported
architectures except ARM and arm64, and testing a pfn whether it refers
to the zero page is required to correctly distinguish the zero page
from other special RAM ranges that may also have the PG_reserved bit
set, but need to be treated as MMIO memory.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

571a8e0b

mm: filemap: update find_get_pages_tag() to deal with shadow entries · 820c7391

Johannes Weiner authored May 06, 2014

commit 139b6a6f upstream.

Dave Jones reports the following crash when find_get_pages_tag() runs
into an exceptional entry:

  kernel BUG at mm/filemap.c:1347!
  RIP: find_get_pages_tag+0x1cb/0x220
  Call Trace:
    find_get_pages_tag+0x36/0x220
    pagevec_lookup_tag+0x21/0x30
    filemap_fdatawait_range+0xbe/0x1e0
    filemap_fdatawait+0x27/0x30
    sync_inodes_sb+0x204/0x2a0
    sync_inodes_one_sb+0x19/0x20
    iterate_supers+0xb2/0x110
    sys_sync+0x44/0xb0
    ia32_do_call+0x13/0x13

  1343                         /*
  1344                          * This function is never used on a shmem/tmpfs
  1345                          * mapping, so a swap entry won't be found here.
  1346                          */
  1347                         BUG();

After commit 0cd6144a ("mm + fs: prepare for non-page entries in
page cache radix trees") this comment and BUG() are out of date because
exceptional entries can now appear in all mappings - as shadows of
recently evicted pages.

However, as Hugh Dickins notes,

  "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
   any other PAGECACHE_TAG_*) to appear on an exceptional entry.

   I expect it comes down to an occasional race in RCU lookup of the
   radix_tree: lacking absolute synchronization, we might sometimes
   catch an exceptional entry, with the tag which really belongs with
   the unexceptional entry which was there an instant before."

And indeed, not only is the tree walk lockless, the tags are also read
in chunks, one radix tree node at a time.  There is plenty of time for
page reclaim to swoop in and replace a page that was already looked up
as tagged with a shadow entry.

Remove the BUG() and update the comment.  While reviewing all other
lookup sites for whether they properly deal with shadow entries of
evicted pages, update all the comments and fix memcg file charge moving
to not miss shmem/tmpfs swapcache pages.

Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

820c7391

sparc64: Handle extremely large kernel TLB range flushes more gracefully. · 58c190c8

David S. Miller authored Oct 27, 2016

[ Upstream commit a74ad5e6 ]

When the vmalloc area gets fragmented, and because the firmware
mapping area sits between where modules live and the vmalloc area, we
can sometimes receive requests for enormous kernel TLB range flushes.

When this happens the cpu just spins flushing billions of pages and
this triggers the NMI watchdog and other problems.

We took care of this on the TSB side by doing a linear scan of the
table once we pass a certain threshold.

Do something similar for the TLB flush, however we are limited by
the TLB flush facilities provided by the different chip variants.

First of all we use an (mostly arbitrary) cut-off of 256K which is
about 32 pages.  This can be tuned in the future.

The huge range code path for each chip works as follows:

1) On spitfire we flush all non-locked TLB entries using diagnostic
   acceses.

2) On cheetah we use the "flush all" TLB flush.

3) On sun4v/hypervisor we do a TLB context flush on context 0, which
   unlike previous chips does not remove "permanent" or locked
   entries.

We could probably do something better on spitfire, such as limiting
the flush to kernel TLB entries or even doing range comparisons.
However that probably isn't worth it since those chips are old and
the TLB only had 64 entries.
Reported-by: James Clarke <jrtc27@jrtc27.com>
Tested-by: James Clarke <jrtc27@jrtc27.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

58c190c8

sparc64: Fix illegal relative branches in hypervisor patched TLB cross-call code. · 5d43827d

David S. Miller authored Oct 26, 2016

[ Upstream commit a236441b ]

Just like the non-cross-call TLB flush handlers, the cross-call ones need
to avoid doing PC-relative branches outside of their code blocks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

5d43827d

sparc64: Fix instruction count in comment for __hypervisor_flush_tlb_pending. · 3febdf4c

David S. Miller authored Oct 26, 2016

[ Upstream commit 830cda3f ]

Noticed by James Clarke.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

3febdf4c

sparc64: Fix illegal relative branches in hypervisor patched TLB code. · 6246428d

David S. Miller authored Oct 25, 2016

[ Upstream commit b429ae4d ]

When we copy code over to patch another piece of code, we can only use
PC-relative branches that target code within that piece of code.

Such PC-relative branches cannot be made to external symbols because
the patch moves the location of the code and thus modifies the
relative address of external symbols.

Use an absolute jmpl to fix this problem.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

6246428d

sparc64: Handle extremely large kernel TSB range flushes sanely. · fbc1defa

David S. Miller authored Oct 25, 2016

[ Upstream commit 849c4987 ]

If the number of pages we are flushing is more than twice the number
of entries in the TSB, just scan the TSB table for matches rather
than probing each and every page in the range.

Based upon a patch and report by James Clarke.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

fbc1defa

sparc: Handle negative offsets in arch_jump_label_transform · 6d6262c5

James Clarke authored Oct 24, 2016

[ Upstream commit 9d9fa230 ]

Additionally, if the offset will overflow the immediate for a ba,pt
instruction, fall back on a standard ba to get an extra 3 bits.
Signed-off-by: James Clarke <jrtc27@jrtc27.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

6d6262c5

sparc64 mm: Fix base TSB sizing when hugetlb pages are used · 2f9cb540

Mike Kravetz authored Jul 15, 2016

[ Upstream commit af1b1a9b ]

do_sparc64_fault() calculates both the base and huge page RSS sizes and
uses this information in calls to tsb_grow().  The calculation for base
page TSB size is not correct if the task uses hugetlb pages.  hugetlb
pages are not accounted for in RSS, therefore the call to get_mm_rss(mm)
does not include hugetlb pages.  However, the number of pages based on
huge_pte_count (which does include hugetlb pages) is subtracted from
this value.  This will result in an artificially small and often negative
RSS calculation.  The base TSB size is then often set to max_tsb_size
as the passed RSS is unsigned, so a negative value looks really big.

THP pages are also accounted for in huge_pte_count, and THP pages are
accounted for in RSS so the calculation in do_sparc64_fault() is correct
if a task only uses THP pages.

A single huge_pte_count is not sufficient for TSB sizing if both hugetlb
and THP pages can be used.  Instead of a single counter, use two:  one
for hugetlb and one for THP.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

2f9cb540

sparc: Don't leak context bits into thread->fault_address · 2d5cba50

David S. Miller authored Jul 27, 2016

[ Upstream commit 4f6deb8c ]

On pre-Niagara systems, we fetch the fault address on data TLB
exceptions from the TLB_TAG_ACCESS register.  But this register also
contains the context ID assosciated with the fault in the low 13 bits
of the register value.

This propagates into current_thread_info()->fault_address and can
cause trouble later on.

So clear the low 13-bits out of the TLB_TAG_ACCESS value in the cases
where it matters.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

2d5cba50

tcp: take care of truncations done by sk_filter() · 9edbcfdc

Eric Dumazet authored Nov 10, 2016

[ Upstream commit ac6e7800 ]

With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq

Many thanks to syzkaller team and Marco for giving us a reproducer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>

9edbcfdc