1. 19 Jan, 2017 40 commits
    • Revert "tty: serial: 8250: add CON_CONSDEV to flags" · 1f363639
      Herbert Xu authored
      commit 6741f551 upstream.
      
      This commit needs to be reverted because it prevents people from
      using the serial console as a secondary console with input being
      directed to tty0.
      
      IOW, if you boot with console=ttyS0 console=tty0 then all kernels
      prior to this commit will produce output on both ttyS0 and tty0
      but input will only be taken from tty0.  With this patch the serial
      console will always be the primary console instead of tty0,
      potentially preventing people from getting into their machines in
      emergency situations.
      
      Fixes: d03516df ("tty: serial: 8250: add CON_CONSDEV to flags")
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ASoC: hdmi-codec: use unsigned type to structure members with bit-field · f9cf776b
      Takashi Sakamoto authored
      commit 9e4d59ad upstream.
      
      This is a fix for Linux 4.10-rc1.
      
      Per the C language specification, a bit-field is interpreted as a signed or
      unsigned integer type consisting of the specified number of bits.
      
      According to the GCC manual, the range of a signed bit-field of N bits is
      -(2^N) / 2 to ((2^N) / 2) - 1
      https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Bit-Fields
      
      Therefore, a 1-bit bit-field of signed type can only represent -1 and 0.
      
      The snd-soc-hdmi-codec module includes a structure which has signed type
      members with bit-fields. Code in this module assigns 0 and 1 to these
      members, which seems to result in implementation-dependent behaviour.
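
A minimal sketch of the effect described above, assuming two hypothetical structs (`flags_signed`, `flags_unsigned`) that are not the actual layout in hdmi-codec.h. Storing 1 into a 1-bit signed field cannot be read back as 1; the result is implementation-defined (GCC and Clang typically yield -1), while the unsigned field round-trips.

```c
#include <assert.h>

/* Hypothetical structs for illustration; not the layout in hdmi-codec.h. */
struct flags_signed {
	signed int enabled:1;   /* 1-bit signed field: holds only -1 and 0 */
};

struct flags_unsigned {
	unsigned int enabled:1; /* 1-bit unsigned field: holds 0 and 1 */
};

/* Store v in the signed field and return what is read back. */
int read_back_signed(int v)
{
	struct flags_signed f = { 0 };

	f.enabled = v;          /* v == 1 is out of range: implementation-defined */
	return f.enabled;
}

/* Store v in the unsigned field and return what is read back. */
int read_back_unsigned(int v)
{
	struct flags_unsigned f = { 0 };

	f.enabled = v;
	return f.enabled;
}
```

Code that only tests such a member for truth keeps working either way, but a comparison such as `flag == 1` or a printk of the value would observe the difference.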
      
      As of the v4.10-rc1 merge window, outside of the sound subsystem, this
      structure is referenced by the following GPU modules:
       - tda998x
       - sti-drm
       - mediatek-drm-hdmi
       - msm
      
      As far as I can tell from reviewing the code that uses the structure, its
      members are only used in conditional statements and printk formats.
      My proposed change is a bit intrusive to the printk formats, but this
      may be acceptable.
      
      Overall, it is reasonable to use an unsigned type for the structure members.
      This bug was detected by Sparse, the static code analyzer, with the
      following warnings.
      
      ./include/sound/hdmi-codec.h:39:26: error: dubious one-bit signed bitfield
      ./include/sound/hdmi-codec.h:40:28: error: dubious one-bit signed bitfield
      ./include/sound/hdmi-codec.h:41:29: error: dubious one-bit signed bitfield
      ./include/sound/hdmi-codec.h:42:31: error: dubious one-bit signed bitfield
      
      Fixes: 09184118 ("ASoC: hdmi-codec: Add hdmi-codec for external HDMI-encoders")
      Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
      Acked-by: Arnaud Pouliquen <arnaud.pouliquen@st.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: fix crash when tracepoint arguments are freed by wq callbacks · 28dad9aa
      David Sterba authored
      commit ac0c7cf8 upstream.
      
      Enabling btrfs tracepoints leads to an instant crash, as reported. The wq
      callbacks could free the memory, and the tracepoints then
      dereferenced the freed members to get to fs_info.
      
      The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
      removed the tracepoints but we could preserve them by passing only the
      required data in a safe way.
      
      Fixes: bc074524 ("btrfs: prefix fsid to all trace events")
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xhci: fix deadlock at host remove by running watchdog correctly · 4d0f302b
      Mathias Nyman authored
      commit d6169d04 upstream.
      
      If a URB is killed while the host is removed we can end up in a situation
      where the hub thread takes the roothub device lock, and waits for
      the URB to be given back by xhci-hcd, blocking the host remove code.
      
      xhci-hcd tries to stop the endpoint and give back the urb, but can't
      as the host is removed from PCI bus at the same time, preventing the normal
      way of giving back urb.
      
      Instead we need to rely on the stop command timeout function to give back
      the urb. This xhci_stop_endpoint_command_watchdog() timeout function
      used the XHCI_STATE_DYING flag to indicate that the timeout function is
      already running, but this flag was later also used in other places to
      mark that xhci is dying.
      
      Remove checks for XHCI_STATE_DYING in xhci_urb_dequeue. We are still
      checking that reading from pci state does not return 0xffffffff or that
      host is not halted before trying to stop the endpoint.
      
      This whole area of stopping endpoints, giving back URBs, and the watchdog
      timeout needs rework. This fix focuses on solving a specific deadlock
      issue so that it can be sent to stable before any major rework.
      Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fix a fencepost error in pipe_advance() · d06367ac
      Al Viro authored
      commit b9dc6f65 upstream.
      
      The logic in pipe_advance() used to release all buffers past the new
      position failed when the number of buffers to release was equal
      to pipe->buffers.  If that happened, none of them were released,
      leaving the pipe full.  Worse, it was trivial to trigger, and we ended up
      with a pipe full of uninitialized pages.  IOW, it's an infoleak.
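
The class of error can be illustrated with a small ring of slots. This is a hypothetical sketch, not the pipe_advance() code: `NBUF`, `release_until_index()` and `release_by_count()` are invented names. A loop that stops when the index reaches the end position cannot tell "release nothing" from "release everything", because both leave the end index equal to the start index.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUF 4u  /* ring size, a power of two (hypothetical) */

/* Buggy variant: walks from 'start' until the index reaches 'end'.
 * When count == NBUF, end == start and the loop body never runs. */
unsigned release_until_index(bool used[NBUF], unsigned start, unsigned count)
{
	unsigned end = (start + count) & (NBUF - 1);
	unsigned released = 0;

	for (unsigned i = start; i != end; i = (i + 1) & (NBUF - 1)) {
		used[i] = false;
		released++;
	}
	return released;
}

/* Fixed variant: counts the slots down, so a full ring is handled. */
unsigned release_by_count(bool used[NBUF], unsigned start, unsigned count)
{
	unsigned released = 0;

	for (unsigned i = start; count > 0; i = (i + 1) & (NBUF - 1), count--) {
		used[i] = false;
		released++;
	}
	return released;
}
```

With all four slots in use, asking the first variant to release four slots releases none, while the second releases all of them.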
      Reported-by: "Alan J. Wylie" <alan@wylie.me.uk>
      Tested-by: "Alan J. Wylie" <alan@wylie.me.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • i2c: fix kernel memory disclosure in dev interface · ab895739
      Vlad Tsyrklevich authored
      commit 30f939fe upstream.
      
      i2c_smbus_xfer() does not always fill an entire block, allowing
      kernel stack memory disclosure through the temp variable. Clear
      it before data is read into it.
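
A user-space sketch of the pattern, under the assumption that the whole temporary buffer is later copied out regardless of how much of it was filled. `partial_fill()`, `read_block()` and `BLOCK_MAX` are hypothetical stand-ins, not the i2c-dev code.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_MAX 34  /* hypothetical buffer size */

/* Hypothetical stand-in for a transfer that fills only 'len' bytes. */
void partial_fill(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = 0xAB;
}

/* Copy a result to 'out' (standing in for user space) without leaking
 * whatever was previously on the stack in the unfilled tail. */
void read_block(unsigned char out[BLOCK_MAX], size_t len)
{
	unsigned char temp[BLOCK_MAX];

	if (len > BLOCK_MAX)
		len = BLOCK_MAX;

	memset(temp, 0, sizeof(temp));   /* clear before the partial write */
	partial_fill(temp, len);
	memcpy(out, temp, sizeof(temp)); /* the whole buffer is copied out */
}
```

Without the memset(), the bytes past `len` would carry stale stack contents to the caller.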
      Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • i2c: print correct device invalid address · 93c94ec2
      John Garry authored
      commit 6f724fb3 upstream.
      
      In of_i2c_register_device(), when the check for
      device address validity fails, we print info.addr,
      which has not been assigned properly.
      
      Fix this by printing the actual invalid address.
      Signed-off-by: John Garry <john.garry@huawei.com>
      Reviewed-by: Vladimir Zapolskiy <vz@mleia.com>
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      Fixes: b4e2f6ac ("i2c: apply DT flags when probing")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Input: elants_i2c - avoid divide by 0 errors on bad touchscreen data · 61a8c337
      Guenter Roeck authored
      commit 1c3415a0 upstream.
      
      The following crash may be seen if bad data is received from the
      touchscreen.
      
      [ 2189.425150] elants_i2c i2c-ELAN0001:00: unknown packet ff ff ff ff
      [ 2189.430738] divide error: 0000 [#1] PREEMPT SMP
      [ 2189.434679] gsmi: Log Shutdown Reason 0x03
      [ 2189.434689] Modules linked in: ip6t_REJECT nf_reject_ipv6 rfcomm evdi
      uinput uvcvideo cmac videobuf2_vmalloc videobuf2_memops snd_hda_codec_hdmi
      i2c_dev videobuf2_core snd_soc_sst_cht_bsw_rt5645 snd_hda_intel
      snd_intel_sst_acpi btusb btrtl btbcm btintel bluetooth snd_soc_sst_acpi
      snd_hda_codec snd_intel_sst_core snd_hwdep snd_soc_sst_mfld_platform
      snd_hda_core snd_soc_rt5645 memconsole_x86_legacy memconsole zram snd_soc_rl6231
      fuse ip6table_filter iwlmvm iwlwifi iwl7000_mac80211 cfg80211 iio_trig_sysfs
      joydev cros_ec_sensors cros_ec_sensors_core industrialio_triggered_buffer
      kfifo_buf industrialio snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq
      snd_seq_device ppp_async ppp_generic slhc tun
      [ 2189.434866] CPU: 0 PID: 106 Comm: irq/184-ELAN000 Tainted: G        W
      3.18.0-13101-g57e8190 #1
      [ 2189.434883] Hardware name: GOOGLE Ultima, BIOS Google_Ultima.7287.131.43 07/20/2016
      [ 2189.434898] task: ffff88017a0b6d80 ti: ffff88017a2bc000 task.ti: ffff88017a2bc000
      [ 2189.434913] RIP: 0010:[<ffffffffbecc48d5>]  [<ffffffffbecc48d5>] elants_i2c_irq+0x190/0x200
      [ 2189.434937] RSP: 0018:ffff88017a2bfd98  EFLAGS: 00010293
      [ 2189.434948] RAX: 0000000000000000 RBX: ffff88017a967828 RCX: ffff88017a9678e8
      [ 2189.434962] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000000
      [ 2189.434975] RBP: ffff88017a2bfdd8 R08: 00000000000003e8 R09: 0000000000000000
      [ 2189.434989] R10: 0000000000000000 R11: 000000000044a2bd R12: ffff88017a991800
      [ 2189.435001] R13: ffffffffbe8a2a53 R14: ffff88017a0b6d80 R15: ffff88017a0b6d80
      [ 2189.435011] FS:  0000000000000000(0000) GS:ffff88017fc00000(0000) knlGS:0000000000000000
      [ 2189.435022] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 2189.435030] CR2: 00007f678d94b000 CR3: 000000003f41a000 CR4: 00000000001007f0
      [ 2189.435039] Stack:
      [ 2189.435044]  ffff88017a2bfda8 ffff88017a9678e8 646464647a2bfdd8 0000000006e09574
      [ 2189.435060]  0000000000000000 ffff88017a088b80 ffff88017a921000 ffffffffbe8a2a53
      [ 2189.435074]  ffff88017a2bfe08 ffffffffbe8a2a73 ffff88017a0b6d80 0000000006e09574
      [ 2189.435089] Call Trace:
      [ 2189.435101]  [<ffffffffbe8a2a53>] ? irq_thread_dtor+0xa9/0xa9
      [ 2189.435112]  [<ffffffffbe8a2a73>] irq_thread_fn+0x20/0x40
      [ 2189.435123]  [<ffffffffbe8a2be1>] irq_thread+0x14e/0x222
      [ 2189.435135]  [<ffffffffbee8cbeb>] ? __schedule+0x3b3/0x57a
      [ 2189.435145]  [<ffffffffbe8a29aa>] ? wake_threads_waitq+0x2d/0x2d
      [ 2189.435156]  [<ffffffffbe8a2a93>] ? irq_thread_fn+0x40/0x40
      [ 2189.435168]  [<ffffffffbe87c385>] kthread+0x10e/0x116
      [ 2189.435178]  [<ffffffffbe87c277>] ? __kthread_parkme+0x67/0x67
      [ 2189.435189]  [<ffffffffbee900ac>] ret_from_fork+0x7c/0xb0
      [ 2189.435199]  [<ffffffffbe87c277>] ? __kthread_parkme+0x67/0x67
      [ 2189.435208] Code: ff ff eb 73 0f b6 bb c1 00 00 00 83 ff 03 7e 13 49 8d 7c
      24 20 ba 04 00 00 00 48 c7 c6 8a cd 21 bf eb 4d 0f b6 83 c2 00 00 00 99 <f7> ff
      83 f8 37 75 15 48 6b f7 37 4c 8d a3 c4 00 00 00 4c 8d ac
      [ 2189.435312] RIP  [<ffffffffbecc48d5>] elants_i2c_irq+0x190/0x200
      [ 2189.435323]  RSP <ffff88017a2bfd98>
      [ 2189.435350] ---[ end trace f4945345a75d96dd ]---
      [ 2189.443841] Kernel panic - not syncing: Fatal exception
      [ 2189.444307] Kernel Offset: 0x3d800000 from 0xffffffff81000000
      	(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [ 2189.444519] gsmi: Log Shutdown Reason 0x02
      
      The problem was seen with a 3.18 based kernel, but there is no reason
      to believe that the upstream code is safe.
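
The kind of guard that avoids such a divide error can be sketched as follows. This is an illustration under assumptions, not the driver code: `report_length()` and `MAX_REPORTS` are hypothetical, and the idea is simply that a count taken from device data is validated before it is used as a divisor.

```c
#include <assert.h>

#define MAX_REPORTS 3  /* hypothetical upper bound on reports per packet */

/* Returns the per-report length, or -1 if the header cannot be trusted. */
int report_length(unsigned total_len, unsigned report_count)
{
	if (report_count == 0 || report_count > MAX_REPORTS)
		return -1;      /* reject before dividing */
	return (int)(total_len / report_count);
}
```

A caller would drop the packet when -1 is returned instead of processing it further.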
      
      Fixes: 66aee900 ("Input: add support for Elan eKTH I2C touchscreens")
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • USB: serial: ch341: fix open and resume after B0 · 0556a65e
      Johan Hovold authored
      commit a20047f3 upstream.
      
      The private baud_rate variable is used to configure the port at open and
      reset-resume and must never be set to (and left at) zero or reset-resume
      and all further open attempts will fail.
      
      Fixes: aa91def4 ("USB: ch341: set tty baud speed according to tty struct")
      Fixes: 664d5df9 ("USB: usb-serial ch341: support for DTR/RTS/CTS")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • USB: serial: ch341: fix control-message error handling · 3ed1f6da
      Johan Hovold authored
      commit 2d5a9c72 upstream.
      
      A short control transfer would currently fail to be detected, something
      which could lead to stale buffer data being used as valid input.
      
      Check for short transfers, and make sure to log any transfer errors.
      
      Note that this also avoids leaking heap data to user space (TIOCMGET)
      and the remote device (break control).
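
A minimal sketch of a short-transfer check, assuming a transfer helper that returns a negative errno on failure and otherwise the number of bytes transferred. `check_transfer()` is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <errno.h>

/* 'ret' is what a transfer helper returned: a negative errno on failure,
 * otherwise the number of bytes actually transferred. */
int check_transfer(int ret, int expected)
{
	if (ret < 0)
		return ret;     /* propagate the transfer error */
	if (ret != expected)
		return -EIO;    /* short transfer: buffer is only partly valid */
	return 0;
}
```

Treating only negative values as errors is what lets a short read pass through, leaving the untouched part of the buffer to be interpreted as device data.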
      
      Fixes: 6ce76104 ("USB: Driver for CH341 USB-serial adaptor")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • USB: serial: ch341: fix open error handling · 139556a9
      Johan Hovold authored
      commit f2950b78 upstream.
      
      Make sure to stop the interrupt URB before returning on errors during
      open.
      
      Fixes: 664d5df9 ("USB: usb-serial ch341: support for DTR/RTS/CTS")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • USB: serial: ch341: fix resume after reset · 1685daad
      Johan Hovold authored
      commit ce5e2928 upstream.
      
      Fix reset-resume handling which failed to resubmit the read and
      interrupt URBs, thereby leaving a port that was open before suspend in a
      broken state until closed and reopened.
      
      Fixes: 1ded7ea4 ("USB: ch341 serial: fix port number changed after resume")
      Fixes: 2bfd1c96 ("USB: serial: ch341: remove reset_resume callback")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • USB: serial: ch341: fix initial modem-control state · 4aeab97a
      Johan Hovold authored
      commit 4e2da446 upstream.
      
      DTR and RTS will be asserted by the tty-layer when the port is opened
      and deasserted on close (if HUPCL is set). Make sure the initial state
      is not-asserted before the port is first opened as well.
      
      Fixes: 664d5df9 ("USB: usb-serial ch341: support for DTR/RTS/CTS")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • USB: serial: kl5kusb105: fix line-state error handling · 58ede4be
      Johan Hovold authored
      commit 146cc8a1 upstream.
      
      The current implementation failed to detect short transfers when
      attempting to read the line state, and also, to make things worse,
      logged the content of the uninitialised heap transfer buffer.
      
      Fixes: abf492e7 ("USB: kl5kusb105: fix DMA buffers on stack")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • usb: musb: fix runtime PM in debugfs · dfd48efc
      Bin Liu authored
      commit 7b6c1b4c upstream.
      
      MUSB driver now has runtime PM support, but the debugfs driver misses
      the PM _get/_put() calls, which could cause MUSB register access
      failure.
      Acked-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Bin Liu <b-liu@ti.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • wusbcore: Fix one more crypto-on-the-stack bug · 88d3670a
      Andy Lutomirski authored
      commit 620f1a63 upstream.
      
      The driver put a constant buffer of all zeros on the stack and
      pointed a scatterlist entry at it.  This doesn't work with virtual
      stacks.  Use ZERO_PAGE instead.
      Reported-by: Eric Biggers <ebiggers3@gmail.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/CPU/AMD: Fix Bulldozer topology · 99ff99b8
      Borislav Petkov authored
      commit a33d3317 upstream.
      
      The following commit:
      
        8196dab4 ("x86/cpu: Get rid of compute_unit_id")
      
      ... broke the initial strategy for Bulldozer-based cores' topology,
      where we consider each thread of a compute unit a standalone core
      and not a HT or SMT thread.
      
      Revert to the firmware-supplied core_id numbering and do not make
      them thread siblings as we don't consider them for such even if they
      technically are, more or less.
      Reported-and-tested-by: Brice Goglin <Brice.Goglin@inria.fr>
      Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 8196dab4 ("x86/cpu: Get rid of compute_unit_id")
      Link: http://lkml.kernel.org/r/20170105092638.5247-1-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/bugs: Separate AMD E400 erratum and C1E bug · bd7e7694
      Thomas Gleixner authored
      commit 3344ed30 upstream.
      
      The workaround for the AMD Erratum E400 (Local APIC timer stops in C1E
      state) is a two step process:
      
       - Selection of the E400 aware idle routine
      
       - Detection whether the platform is affected
      
      The idle routine selection happens for possibly affected CPUs depending on
      family/model/stepping information. This range of CPUs is not necessarily
      affected, as the decision whether to enable the C1E feature is made by the
      firmware. Unfortunately there is no way to query this at early boot.
      
      The current implementation polls an MSR in the E400 aware idle routine to
      detect whether the CPU is affected. This is inefficient on unaffected
      CPUs because every idle entry has to do the MSR read.
      
      There is a better way to detect this before going idle for the first time,
      which requires separating the bug flags:
      
        X86_BUG_AMD_E400 	- Selects the E400 aware idle routine and
        			  enables the detection
      
        X86_BUG_AMD_APIC_C1E  - Set when the platform is affected by E400
      
      Replace the current X86_BUG_AMD_APIC_C1E usage by the new X86_BUG_AMD_E400
      bug bit to select the idle routine which currently does an unconditional
      detection poll. X86_BUG_AMD_APIC_C1E is going to be used in later patches
      to remove the MSR polling and simplify the handling of this misfeature.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/20161209182912.2726-3-bp@alien8.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/cpu/AMD: Clean up cpu_llc_id assignment per topology feature · e2d9ad2c
      Yazen Ghannam authored
      commit b6a50cdd upstream.
      
      These changes do not affect current hw - just a cleanup:
      
      Currently, we assume that a system has a single Last Level Cache (LLC)
      per node, and that the cpu_llc_id is thus equal to the node_id. This no
      longer applies since Fam17h can have multiple last level caches within a
      node.
      
      So group the cpu_llc_id assignment by topology feature and family in
      order to make the computation of cpu_llc_id on the different families
      more clear.
      
      Here is how the LLC ID is being computed on the different families:
      
      The NODEID_MSR feature only applies to Fam10h in which case the LLC is
      at the node level.
      
      The TOPOEXT feature is used on families 15h, 16h and 17h. So far we only
      see multiple last level caches if L3 caches are available. Otherwise,
      the cpu_llc_id will default to be the phys_proc_id.
      
      We have L3 caches only on families 15h and 17h:
      
       - on Fam15h, the LLC is at the node level.
      
       - on Fam17h, the LLC is at the core complex level and can be found by
         right shifting the APIC ID. Also, keep the family checks explicit so that
         new families will fall back to the default, which will be node_id for
         TOPOEXT systems.
      
      Single node systems in families 10h and 15h will have a Node ID of 0
      which will be the same as the phys_proc_id, so we don't need to check
      for multiple nodes before using the node_id.
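
The selection described above can be sketched in user-space C. Everything here is an assumption made for illustration: `struct cpu_topo` is not the kernel's struct cpuinfo_x86, `llc_id()` is a hypothetical helper, and the shift amount is left as a field (`ccx_shift`) because the text only says the APIC ID is right-shifted.

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_topo {               /* hypothetical, not struct cpuinfo_x86 */
	unsigned family;
	unsigned apicid;
	unsigned node_id;
	unsigned phys_proc_id;
	unsigned ccx_shift;     /* assumed shift from APIC ID to core complex */
	bool has_topoext;
	bool has_nodeid_msr;
	bool has_l3;
};

unsigned llc_id(const struct cpu_topo *c)
{
	if (c->has_topoext) {
		if (!c->has_l3)
			return c->phys_proc_id;         /* default when no L3 */
		if (c->family == 0x17)
			return c->apicid >> c->ccx_shift; /* core complex level */
		return c->node_id;   /* Fam15h, and fallback for new families */
	}
	if (c->has_nodeid_msr)
		return c->node_id;                      /* Fam10h: node level */
	return c->phys_proc_id;
}
```

Keeping the Fam17h check explicit means an unknown family with TOPOEXT and an L3 cache lands on the node-level branch rather than inheriting the shift.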
      Tested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
      [ Rewrote the commit message. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20161108153054.bs3sajbyevq6a6uu@pd.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bridge: netfilter: Fix dropping packets that moving through bridge interface · 259495a0
      Artur Molchanov authored
      commit 14221cc4 upstream.
      
      Problem:
      br_nf_pre_routing_finish() calls itself instead of
      br_nf_pre_routing_finish_bridge(). Due to this bug, the reverse path filter
      drops packets that go through the bridge interface.
      
      User impact:
      Local Docker containers on a bridge network cannot communicate with each
      other.
      
      Fixes: c5136b15 ("netfilter: bridge: add and use br_nf_hook_thresh")
      Signed-off-by: Artur Molchanov <artur.molchanov@synesis.ru>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfs: Timely free truncated dirty pages · 6ba35da6
      Jan Kara authored
      commit 0a417b8d upstream.
      
      Commit 99579cce "xfs: skip dirty pages in ->releasepage()" started
      to skip dirty pages in xfs_vm_releasepage() which also has the effect
      that if a dirty page is truncated, it does not get freed by
      block_invalidatepage() and is lingering in LRU list waiting for reclaim.
      So a simple loop like:
      
      while true; do
      	dd if=/dev/zero of=file bs=1M count=100
      	rm file
      done
      
      will keep using more and more memory until we hit low watermarks and
      start pagecache reclaim, which will eventually also reclaim the truncated
      pages. Keeping these truncated (and thus never usable) pages in memory
      is just a waste of memory, unnecessarily stresses page cache
      reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
      prematurely.
      
      So instead of just skipping dirty pages in xfs_vm_releasepage(), return
      to old behavior of skipping them only if they have delalloc or unwritten
      buffers and fix the spurious warnings by warning only if the page is
      clean.
      
      CC: Brian Foster <bfoster@redhat.com>
      CC: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
      Fixes: 99579cce
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gpio: Move freeing of GPIO hogs before numbing of the device · 86673e93
      Geert Uytterhoeven authored
      commit 5018ada6 upstream.
      
      When removing a gpiochip that uses GPIO hogging (e.g. by unloading the
      chip's DT overlay), a warning is printed:
      
          gpio gpiochip8: REMOVING GPIOCHIP WITH GPIOS STILL REQUESTED
      
      This happens because gpiochip_free_hogs() is called after the gdev->chip
      pointer is reset to NULL. Hence __gpiod_free() cannot determine the
      chip in use, and cannot clear flags nor call the optional chip-specific
      .free() callback.
      
      Move the call to gpiochip_free_hogs() up to fix this.
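
The ordering problem can be sketched with hypothetical stand-ins (`struct chip`, `struct gdev`, `free_hogs()` are invented for this illustration and are not the gpiolib types). Freeing a hog needs the chip pointer, so resetting the pointer first leaves the hogs requested.

```c
#include <assert.h>
#include <stddef.h>

struct chip { int free_calls; };        /* hypothetical stand-ins */
struct gdev { struct chip *chip; int hogs; };

/* Releasing a hog needs the chip pointer to reach the chip's callback. */
void free_hogs(struct gdev *g)
{
	if (!g->chip)
		return;                 /* cannot determine the chip in use */
	g->chip->free_calls += g->hogs;
	g->hogs = 0;
}

/* Order described as broken: pointer reset first, hogs stay requested. */
int remove_reset_first(struct gdev *g)
{
	g->chip = NULL;
	free_hogs(g);
	return g->hogs;                 /* lines still requested */
}

/* Order the fix moves to: free hogs while the chip pointer is valid. */
int remove_free_first(struct gdev *g)
{
	free_hogs(g);
	g->chip = NULL;
	return g->hogs;
}
```

In this sketch a non-zero return from the removal function corresponds to the "GPIOS STILL REQUESTED" condition.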
      
      Fixes: ff2b1359 ("gpio: make the gpiochip a real device")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • nl80211: fix sched scan netlink socket owner destruction · 0a28f539
      Johannes Berg authored
      commit 753aacfd upstream.
      
      A single netlink socket might own multiple interfaces *and* a
      scheduled scan request (which might belong to another interface),
      so when it goes away both may need to be destroyed.
      
      Remove the schedule_scan_stop indirection to fix this - it's only
      needed for interface destruction because of the way this works
      right now, with a single work taking care of all interfaces.
      
      Fixes: 93a1e86c ("nl80211: Stop scheduled scan if netlink client disappears")
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/efi: Don't allocate memmap through memblock after mm_init() · 14d6c966
      Nicolai Stange authored
      commit 20b1e22d upstream.
      
      With the following commit:
      
        4bc9f92e ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
      
      ...  efi_bgrt_init() calls into the memblock allocator through
      efi_mem_reserve() => efi_arch_mem_reserve() *after* mm_init() has been called.
      
      Indeed, KASAN reports a bad read access later on in efi_free_boot_services():
      
        BUG: KASAN: use-after-free in efi_free_boot_services+0xae/0x24c
                  at addr ffff88022de12740
        Read of size 4 by task swapper/0/0
        page:ffffea0008b78480 count:0 mapcount:-127
        mapping:          (null) index:0x1 flags: 0x5fff8000000000()
        [...]
        Call Trace:
         dump_stack+0x68/0x9f
         kasan_report_error+0x4c8/0x500
         kasan_report+0x58/0x60
         __asan_load4+0x61/0x80
         efi_free_boot_services+0xae/0x24c
         start_kernel+0x527/0x562
         x86_64_start_reservations+0x24/0x26
         x86_64_start_kernel+0x157/0x17a
         start_cpu+0x5/0x14
      
      The instruction at the given address is the first read from the memmap's
      memory, i.e. the read of md->type in efi_free_boot_services().
      
      Note that the writes earlier in efi_arch_mem_reserve() don't splat because
      they're done through early_memremap()ed addresses.
      
      So, after memblock is gone, allocations should be done through the "normal"
      page allocator. Introduce a helper, efi_memmap_alloc() for this. Use
      it from efi_arch_mem_reserve(), efi_free_boot_services() and, for the sake
      of consistency, from efi_fake_memmap() as well.
      
      Note that for the latter, the memmap allocations cease to be page aligned.
      This isn't needed though.
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Fixes: 4bc9f92e ("x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data")
      Link: http://lkml.kernel.org/r/20170105125130.2815-1-nicstange@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • efi/x86: Prune invalid memory map entries and fix boot regression · 99b17ac0
      Peter Jones authored
      commit 0100a3e6 upstream.
      
      Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
      (2.28), include memory map entries with phys_addr=0x0 and num_pages=0.
      
      These machines fail to boot after the following commit,
      
        commit 8e80632f ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")
      
      Fix this by removing such bogus entries from the memory map.
      
      Furthermore, currently the log output for this case (with efi=debug)
      looks like:
      
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |  |  |  |  |  ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)
      
      This is clearly wrong, and also not as informative as it could be.  This
      patch changes it so that if we find obviously invalid memory map
      entries, we print an error and skip those entries.  It also detects the
      display of the address range calculation overflow, so the new output is:
      
       [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[0x0000000000000000-0x0000000000000000] (invalid)
      
      It also detects memory map sizes that would overflow the physical
      address, for example phys_addr=0xfffffffffffff000 and
      num_pages=0x0200000000000001, and prints:
      
       [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
       [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)
      
      It then removes these entries from the memory map.
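
      The validity rule described above can be sketched in user-space C.
      This is a simplified illustration, not the kernel code: struct
      mem_desc, mem_desc_valid(), prune_invalid() and the SKETCH_* constants
      are hypothetical names, and 4 KiB pages are assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define SKETCH_PAGES_MAX  (UINT64_MAX >> SKETCH_PAGE_SHIFT)

/* Hypothetical, simplified stand-in for a memory map descriptor. */
struct mem_desc {
	uint64_t phys_addr;
	uint64_t num_pages;
};

/* An entry is bogus if it is empty or if its end address would overflow. */
static bool mem_desc_valid(const struct mem_desc *md)
{
	if (md->num_pages == 0)
		return false;
	if (md->num_pages > SKETCH_PAGES_MAX)
		return false;
	/* Compare in units of pages so the check itself cannot overflow. */
	if (SKETCH_PAGES_MAX - md->num_pages <
	    (md->phys_addr >> SKETCH_PAGE_SHIFT))
		return false;
	return true;
}

/* Compact the map in place, keeping valid entries; returns the new count. */
static size_t prune_invalid(struct mem_desc *map, size_t n)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++)
		if (mem_desc_valid(&map[i]))
			map[kept++] = map[i];
	return kept;
}
```

      Both examples from the text (phys_addr=0x0 with num_pages=0, and
      phys_addr=0xfffffffffffff000 with num_pages=0x0200000000000001) are
      rejected by this check.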
      Signed-off-by: Peter Jones <pjones@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      [ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      [Matt: Include bugzilla info in commit log]
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      99b17ac0
    • Ard Biesheuvel's avatar
      efi/libstub/arm*: Pass latest memory map to the kernel · 74ce3fd6
      Ard Biesheuvel authored
      commit abfb7b68 upstream.
      
      As reported by James Morse, the current libstub code involving the
      annotated memory map only works somewhat correctly by accident, due
      to the fact that a pool allocation happens to be reused immediately,
      retaining its former contents on most implementations of the
      UEFI boot services.
      
      Instead of juggling memory maps, which makes the code more complex than
      it needs to be, simply put placeholder values into the FDT for the memory
      map parameters, and only write the actual values after ExitBootServices()
      has been called.
      Reported-by: James Morse <james.morse@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Jeffrey Hugo <jhugo@codeaurora.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-efi@vger.kernel.org
      Fixes: ed9cc156 ("efi/libstub: Use efi_exit_boot_services() in FDT")
      Link: http://lkml.kernel.org/r/1482587963-20183-2-git-send-email-ard.biesheuvel@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      74ce3fd6
    • Steve Rutherford's avatar
      KVM: x86: Introduce segmented_write_std · 736e77c0
      Steve Rutherford authored
      commit 129a72a0 upstream.
      
      Introduces segmented_write_std.
      
      Switches from emulated reads/writes to standard read/writes in fxsave,
      fxrstor, sgdt, and sidt.  This fixes CVE-2017-2584, a longstanding
      kernel memory leak.
      
      Since commit 283c95d0 ("KVM: x86: emulate FXSAVE and FXRSTOR",
      2016-11-09), which is luckily not yet in any final release, this would
      also be an exploitable kernel memory *write*!
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: 96051572
      Fixes: 283c95d0
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      736e77c0
    • Radim Krčmář's avatar
      KVM: x86: emulate FXSAVE and FXRSTOR · 83fedbb7
      Radim Krčmář authored
      commit 283c95d0 upstream.
      
      Internal errors were reported on 16 bit fxsave and fxrstor with ipxe.
      Old Intels don't have unrestricted_guest, so we have to emulate them.
      
      The patch takes advantage of the hardware implementation.
      
      AMD and Intel differ in saving and restoring other fields in the first
      32 bytes.  A test wrote 0xff to the fxsave area, 0 to upper bits of MXCSR
      in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee,
      and executed fxsave:
      
        Intel (Nehalem):
          7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00
          ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00
        Intel (Haswell -- deprecated FPU CS and FPU DS):
          7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00
          ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00
        AMD (Opteron 2300-series):
          7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee
          ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00
      
      fxsave/fxrstor will only be emulated on early Intels, so KVM can't do
      much to improve the situation.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      83fedbb7
    • Radim Krčmář's avatar
      KVM: x86: add asm_safe wrapper · aae8f346
      Radim Krčmář authored
      commit aabba3c6 upstream.
      
      Move the existing exception handling for inline assembly into a macro
      and switch its return values to X86EMUL type.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aae8f346
    • Radim Krčmář's avatar
      KVM: x86: add Align16 instruction flag · bc5e1316
      Radim Krčmář authored
      commit d3fe959f upstream.
      
      Needed for FXSAVE and FXRSTOR.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc5e1316
    • Wanpeng Li's avatar
      KVM: x86: fix NULL deref in vcpu_scan_ioapic · 90f70fcd
      Wanpeng Li authored
      commit 546d87e5 upstream.
      
      Reported by syzkaller:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
          IP: _raw_spin_lock+0xc/0x30
          PGD 3e28eb067
          PUD 3f0ac6067
          PMD 0
          Oops: 0002 [#1] SMP
          CPU: 0 PID: 2431 Comm: test Tainted: G           OE   4.10.0-rc1+ #3
          Call Trace:
           ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
           kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
           ? pick_next_task_fair+0xe1/0x4e0
           ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
           kvm_vcpu_ioctl+0x33a/0x600 [kvm]
           ? hrtimer_try_to_cancel+0x29/0x130
           ? do_nanosleep+0x97/0xf0
           do_vfs_ioctl+0xa1/0x5d0
           ? __hrtimer_init+0x90/0x90
           ? do_nanosleep+0x5b/0xf0
           SyS_ioctl+0x79/0x90
           do_syscall_64+0x6e/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0
      
      The syzkaller folks reported a NULL pointer dereference due to
      ENABLE_CAP succeeding even without an irqchip.  The Hyper-V
      synthetic interrupt controller is activated, resulting in a
      wrong request to rescan the ioapic and a NULL pointer dereference.
      
          #include <fcntl.h>
          #include <sys/ioctl.h>
          #include <sys/mman.h>
          #include <sys/types.h>
          #include <linux/kvm.h>
          #include <pthread.h>
          #include <stddef.h>
          #include <stdint.h>
          #include <stdlib.h>
          #include <string.h>
          #include <unistd.h>
      
          #ifndef KVM_CAP_HYPERV_SYNIC
          #define KVM_CAP_HYPERV_SYNIC 123
          #endif
      
          void* thr(void* arg)
          {
      	struct kvm_enable_cap cap;
      	cap.flags = 0;
      	cap.cap = KVM_CAP_HYPERV_SYNIC;
      	ioctl((long)arg, KVM_ENABLE_CAP, &cap);
      	return 0;
          }
      
          int main()
          {
      	char *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
      			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      	int kvmfd = open("/dev/kvm", 0);
      	int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
      	struct kvm_userspace_memory_region memreg;
      	memreg.slot = 0;
      	memreg.flags = 0;
      	memreg.guest_phys_addr = 0;
      	memreg.memory_size = 0x1000;
      	memreg.userspace_addr = (unsigned long)host_mem;
      	host_mem[0] = 0xf4;
      	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
      	int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
      	struct kvm_sregs sregs;
      	ioctl(cpufd, KVM_GET_SREGS, &sregs);
      	sregs.cr0 = 0;
      	sregs.cr4 = 0;
      	sregs.efer = 0;
      	sregs.cs.selector = 0;
      	sregs.cs.base = 0;
      	ioctl(cpufd, KVM_SET_SREGS, &sregs);
      	struct kvm_regs regs = { .rflags = 2 };
      	ioctl(cpufd, KVM_SET_REGS, &regs);
      	ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
      	pthread_t th;
      	pthread_create(&th, 0, thr, (void*)(long)cpufd);
      	usleep(rand() % 10000);
      	ioctl(cpufd, KVM_RUN, 0);
      	pthread_join(th, 0);
      	return 0;
          }
      
      This patch fixes it by failing ENABLE_CAP if there is no irqchip.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: 5c919412 (kvm/x86: Hyper-V synthetic interrupt controller)
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      90f70fcd
    • David Matlack's avatar
      KVM: x86: flush pending lapic jump label updates on module unload · 5ed21cc0
      David Matlack authored
      commit cef84c30 upstream.
      
      KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
      These are implemented with delayed_work structs which can still be
      pending when the KVM module is unloaded. We've seen this cause kernel
      panics when the kvm_intel module is quickly reloaded.
      
      Use the new static_key_deferred_flush() API to flush pending updates on
      module unload.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ed21cc0
    • David Matlack's avatar
      jump_labels: API for flushing deferred jump label updates · 483ecebb
      David Matlack authored
      commit b6416e61 upstream.
      
      Modules that use static_key_deferred need a way to synchronize with
      any delayed work that is still pending when the module is unloaded.
      Introduce static_key_deferred_flush() which flushes any pending
      jump label updates.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      483ecebb
    • Wanpeng Li's avatar
      KVM: eventfd: fix NULL deref irqbypass consumer · 7caf473f
      Wanpeng Li authored
      commit 4f3dbdf4 upstream.
      
      Reported by syzkaller:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
          IP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
          PGD 0
      
          Oops: 0002 [#1] SMP
          CPU: 1 PID: 125 Comm: kworker/1:1 Not tainted 4.9.0+ #1
          Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
          task: ffff9bbe0dfbb900 task.stack: ffffb61802014000
          RIP: 0010:irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass]
          Call Trace:
           irqfd_shutdown+0x66/0xa0 [kvm]
           process_one_work+0x16b/0x480
           worker_thread+0x4b/0x500
           kthread+0x101/0x140
           ? process_one_work+0x480/0x480
           ? kthread_create_on_node+0x60/0x60
           ret_from_fork+0x25/0x30
          RIP: irq_bypass_unregister_consumer+0x9d/0xb70 [irqbypass] RSP: ffffb61802017e20
          CR2: 0000000000000008
      
      The syzkaller folks reported a NULL pointer dereference caused by
      unregistering a consumer whose registration had previously failed.
      syzkaller occasionally creates two VMs with the same eventfd, so the
      second VM fails to register an irqbypass consumer. It then marks the
      irqfd as inactive and queues work to shut down the irqfd and unregister
      the irqbypass consumer when the eventfd is closed. However, the second
      consumer has been initialized even though its registration failed. When
      the workqueue uses its token (the same as the first VM's) to unregister
      the consumer, the first VM's consumer is found and unregistered instead,
      and a NULL dereference then occurs while deleting the consumer from the
      consumers list.
      
      This patch fixes it by making irq_bypass_register/unregister_consumer()
      look for the consumer entry based on the consumer pointer itself instead
      of token matching.
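
      The difference between token matching and pointer matching can be
      sketched with a minimal singly linked list in user-space C. This is an
      illustration only: struct consumer, register_consumer() and
      unregister_consumer() are hypothetical stand-ins, not the irqbypass
      API, and locking is omitted.

```c
#include <stddef.h>

/* Hypothetical stand-in for a consumer; only the fields the sketch needs. */
struct consumer {
	struct consumer *next;
	void *token;
};

static struct consumer *consumers; /* head of the registered list */

/* Registration refuses a second consumer with an already used token. */
static int register_consumer(struct consumer *c)
{
	for (struct consumer *tmp = consumers; tmp; tmp = tmp->next)
		if (tmp->token == c->token)
			return -1;
	c->next = consumers;
	consumers = c;
	return 0;
}

/*
 * Unregistration matches on the consumer pointer itself, so a consumer
 * that never made it onto the list cannot remove someone else's entry.
 * Returns 1 if the consumer was found and removed, 0 otherwise.
 */
static int unregister_consumer(struct consumer *c)
{
	for (struct consumer **pp = &consumers; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			c->next = NULL;
			return 1;
		}
	}
	return 0;
}
```

      With two consumers sharing one token, the second registration fails,
      and a later unregister of that second consumer leaves the first one
      on the list untouched.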
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Suggested-by: Alex Williamson <alex.williamson@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7caf473f
    • Paolo Bonzini's avatar
      KVM: x86: fix emulation of "MOV SS, null selector" · 7718ffcf
      Paolo Bonzini authored
      commit 33ab9110 upstream.
      
      This is CVE-2017-2583.  On Intel this causes a failed vmentry because
      SS's type is neither 3 nor 7 (even though the manual says this check is
      only done for usable SS, and the dmesg splat says that SS is unusable!).
      On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.
      
      The fix fabricates a data segment descriptor when SS is set to a null
      selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
      Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
      this in turn ensures CPL < 3 because RPL must be equal to CPL.
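
      The RPL rule in the paragraph above can be sketched as a small
      predicate. This is a hedged illustration with hypothetical helper
      names; it does not model the fabricated data segment descriptor or any
      CPU mode checks.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A selector is null when its index and table-indicator bits are all
 * zero; the low two bits hold the requested privilege level (RPL).
 */
static bool selector_is_null(uint16_t sel)
{
	return (sel & ~0x3u) == 0;
}

static unsigned int selector_rpl(uint16_t sel)
{
	return sel & 0x3u;
}

/* Rule from the text: a null SS is acceptable only when RPL < 3. */
static bool null_ss_load_allowed(uint16_t sel)
{
	return selector_is_null(sel) && selector_rpl(sel) < 3;
}
```

      Selectors 0 through 2 pass the check, while selector 3 (null, RPL 3)
      is refused.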
      
      Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
      the bug and deciphering the manuals.
      Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
      Fixes: 79d5b4c3
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7718ffcf
    • Mike Kravetz's avatar
      mm/hugetlb.c: fix reservation race when freeing surplus pages · 1e26cec6
      Mike Kravetz authored
      commit e5bbc8a6 upstream.
      
      return_unused_surplus_pages() decrements the global reservation count,
      and frees any unused surplus pages that were backing the reservation.
      
      Commit 7848a4bf ("mm/hugetlb.c: add cond_resched_lock() in
      return_unused_surplus_pages()") added a call to cond_resched_lock in the
      loop freeing the pages.
      
      As a result, the hugetlb_lock could be dropped, and someone else could
      use the pages that will be freed in subsequent iterations of the loop.
      This could result in inconsistent global hugetlb page state, application
      API failures (such as mmap failures) or application crashes.
      
      When dropping the lock in return_unused_surplus_pages, make sure that
      the global reservation count (resv_huge_pages) remains sufficiently
      large to prevent someone else from claiming pages about to be freed.
      
      Analyzed by Paul Cassella.
      
      Fixes: 7848a4bf ("mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()")
      Link: http://lkml.kernel.org/r/1483991767-6879-1-git-send-email-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Paul Cassella <cassella@cray.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1e26cec6
    • John Sperbeck's avatar
      mm/slab.c: fix SLAB freelist randomization duplicate entries · 8315c22e
      John Sperbeck authored
      commit c4e490cf upstream.
      
      This patch fixes a bug in the freelist randomization code.  When a high
      random number is used, the freelist will contain duplicate entries, so
      different allocations end up sharing the same chunk.
      
      This results in odd behaviours and crashes.  It should be uncommon, but
      how often it happens depends on the machine.  We saw it happening more
      often on some machines (every few hours of running tests).
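
      One way a high random number can produce duplicates is unsigned
      wraparound before a modulo. The sketch below assumes that mechanism,
      which the text above does not spell out, and uses hypothetical
      function names. It contrasts adding the random value to every index
      with using it only to choose a starting position.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Suspect pattern: add the random value to each pre-shuffled index and
 * reduce modulo count. The 32-bit sum can wrap before the modulo.
 */
static void order_add_offset(const uint32_t *list, uint32_t count,
			     uint32_t rnd, uint32_t *out)
{
	for (uint32_t i = 0; i < count; i++)
		out[i] = (uint32_t)(list[i] + rnd) % count;
}

/*
 * Alternative: use the random value only to pick the starting position
 * and walk the list circularly, so each entry is emitted exactly once.
 */
static void order_rotate(const uint32_t *list, uint32_t count,
			 uint32_t rnd, uint32_t *out)
{
	uint32_t start = rnd % count;

	for (uint32_t i = 0; i < count; i++)
		out[i] = list[(start + i) % count];
}

/* Quadratic duplicate check, good enough for a small demonstration. */
static bool has_duplicates(const uint32_t *v, uint32_t n)
{
	for (uint32_t i = 0; i < n; i++)
		for (uint32_t j = i + 1; j < n; j++)
			if (v[i] == v[j])
				return true;
	return false;
}
```

      With count = 3 and rnd = UINT32_MAX, the first variant maps the list
      {0, 1, 2} to {0, 0, 1}, while the rotating variant keeps all three
      entries distinct.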
      
      Fixes: c7ce4f60 ("mm: SLAB freelist randomization")
      Link: http://lkml.kernel.org/r/20170103181908.143178-1-thgarnie@google.com
      Signed-off-by: John Sperbeck <jsperbeck@google.com>
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8315c22e
    • Minchan Kim's avatar
      mm: support anonymous stable page · 6ca29ee3
      Minchan Kim authored
      commit f0571429 upstream.
      
      During development of zram-swap asynchronous writeback, I found strange
      corruption of a compressed page, resulting in:
      
        Modules linked in: zram(E)
        CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        task: ffff88007620b840 task.stack: ffff880078090000
        RIP: set_freeobj.part.43+0x1c/0x1f
        RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
        RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
        RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
        RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
        R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
        FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
        Call Trace:
          obj_malloc+0x22b/0x260
          zs_malloc+0x1e4/0x580
          zram_bvec_rw+0x4cd/0x830 [zram]
          page_requests_rw+0x9c/0x130 [zram]
          zram_thread+0xe6/0x173 [zram]
          kthread+0xca/0xe0
          ret_from_fork+0x25/0x30
      
      Investigation revealed that stable pages currently do not support
      anonymous pages.  IOW, reuse_swap_page can reuse the page without
      waiting for writeback to complete, so it can overwrite a page that zram
      is compressing.
      
      Unfortunately, zram has used the per-cpu stream feature since v4.7.
      It aims to increase the cache hit ratio of the scratch buffer used for
      compression. The downside of that approach is that zram must request
      memory for the compressed page in per-cpu context, which requires a
      stricter gfp flag, so the allocation can fail. If it does, zram retries
      the allocation outside of per-cpu context, where it can succeed, then
      compresses the data again and copies it to the newly allocated memory.
      
      In this scenario, zram assumes the data never changes, but that is not
      true unless stable pages are supported. So, if the data changes under
      us, zram can overrun the buffer: the second compression can produce a
      larger result than the first attempt, and zram blindly copies the
      bigger object into the smaller buffer. The overrun breaks zsmalloc's
      free object chaining, so the system crashes as shown above.
      
      I think the report below describes the same problem.
      https://bugzilla.suse.com/show_bug.cgi?id=997574
      
      Unfortunately, reuse_swap_page must be atomic, so we cannot wait on
      writeback there. The approach in this patch is simply to return false if
      we find that the page needs to be stable.  Although this increases the
      memory footprint temporarily, it happens rarely and the memory should be
      reclaimed easily when it does.  Also, it is better than waiting for IO
      completion, which is on the critical path for application latency.
      
      Fixes: da9556a2 ("zram: user per-cpu compression streams")
      Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
      Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hyeoncheol Lee <cheol.lee@lge.com>
      Cc: <yjay.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ca29ee3
    • Michal Hocko's avatar
      mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · 07fc9575
      Michal Hocko authored
      commit b4536f0c upstream.
      
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations with 32b kernels starting with 4.8:
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      The OOM killer is clearly premature because there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.
      
      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts 0 and return
      false.  This issue is not limited to 32b kernels but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocation requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
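
      The arithmetic problem can be sketched as follows. This is a
      simplified illustration with hypothetical names, not the kernel
      implementation: the eligible LRU size is the LRU size minus the pages
      sitting in zones above the highest zone the request may use.

```c
#include <stdint.h>

#define SKETCH_NR_ZONES 3 /* DMA, Normal, HighMem in this illustration */

/*
 * Subtract the per-zone counts of every zone above highest_zone from the
 * LRU size, clamping at zero. The result is only meaningful when
 * lru_size and zone_lru describe the same set of pages.
 */
static uint64_t eligible_lru_size(uint64_t lru_size,
				  const uint64_t zone_lru[SKETCH_NR_ZONES],
				  int highest_zone)
{
	uint64_t above = 0;

	for (int z = highest_zone + 1; z < SKETCH_NR_ZONES; z++)
		above += zone_lru[z];

	return lru_size > above ? lru_size - above : 0;
}
```

      Feeding a memcg's LRU size together with node-wide zone counters, as
      the old code effectively did, mixes two different populations: a small
      memcg minus a large node-wide highmem count clamps to zero. Using the
      memcg's own per-zone counts keeps the subtraction consistent.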
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Tested-by: Nils Holland <nholland@tisys.org>
      Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      07fc9575
    • Eric Ren's avatar
      ocfs2: fix crash caused by stale lvb with fsdlm plugin · 6c9bd81c
      Eric Ren authored
      commit e7ee2c08 upstream.
      
      The crash happens rather often when we reset some cluster nodes while
      nodes contend fiercely to do truncate and append.
      
      The crash backtrace is below:
      
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
         ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: Beginning quota recovery on device (253,18) for slot 2
         ocfs2: Finishing quota recovery on device (253,18) for slot 2
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
         ------------[ cut here ]------------
         kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
         Supported: No, Unsupported modules are loaded
         CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
         task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
         RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
         RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
         RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
         RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
         RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
         R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
         R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
         FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
         CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
         Call Trace:
           ocfs2_setattr+0x698/0xa90 [ocfs2]
           notify_change+0x1ae/0x380
           do_truncate+0x5e/0x90
           do_sys_ftruncate.constprop.11+0x108/0x160
           entry_SYSCALL_64_fastpath+0x12/0x6d
         Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
         RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
      
      This is because ocfs2_inode_lock() gets us a stale LVB in which the
      i_size is not equal to the disk i_size.  We mistakenly trust the LVB
      because the underlying fsdlm dlm_lock() doesn't set lkb_sbflags with
      DLM_SBF_VALNOTVALID properly for us.  But why?
      
      The current code downconverts the lock without the DLM_LKF_VALBLK flag
      to tell o2cb not to update the RSB's LVB if it's a PR->NULL conversion,
      even if the lock resource type needs the LVB.  This is not the right way
      for fsdlm.
      
      The fsdlm plugin behaves differently with respect to DLM_LKF_VALBLK: it
      depends on DLM_LKF_VALBLK to decide whether we care about the LVB in the
      LKB.  If DLM_LKF_VALBLK is not set, fsdlm will skip both recovering the
      RSB's LVB from this lkb and setting DLM_SBF_VALNOTVALID appropriately
      when a node failure happens.
      
      The following diagram briefly illustrates how this crash happens:
      
      RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
      
      The 1st round:
      
                   Node1                                    Node2
      RSB1: PR
                                                        RSB1(master): NULL->EX
      ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
        ocfs2_dlm_lock(no DLM_LKF_VALBLK)
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      
      dlm_lock(no DLM_LKF_VALBLK)
        convert_lock(overwrite lkb->lkb_exflags
                     with no DLM_LKF_VALBLK)
      
      RSB1: NULL                                        RSB1: EX
                                                        reset Node2
      dlm_recover_rsbs()
        recover_lvb()
      
      /* The LVB is not trustable if the node with EX fails and
       * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
       */
      
       if (!(lkb->lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance
                 return;                          * to invalidate the LVB here.
                                                  */
      
      The 2nd round:
      
               Node 1                                Node2
      RSB1(become master from recovery)
      
      ocfs2_setattr()
        ocfs2_inode_lock(NULL->EX)
          /* dlm_lock() returns the stale LVB without setting DLM_SBF_VALNOTVALID */
          ocfs2_meta_lvb_is_trustable() returns 1 /* so we don't refresh inode from disk */
        ocfs2_truncate_file()
            mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
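
      The recovery decision in the diagram above can be modelled with a small
      sketch.  This is a simplified illustration based only on the description
      in this message, not the actual fs/dlm code; every name prefixed with
      sketch_ or SKETCH_ is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock modes, ordered from weakest to strongest. */
enum sketch_mode { SKETCH_NL, SKETCH_CR, SKETCH_PR, SKETCH_EX };

/* Hypothetical stand-in for a lock block that survived recovery. */
struct sketch_lkb {
	enum sketch_mode granted; /* granted mode after the node failure */
	bool valblk;              /* was the last request made with VALBLK? */
};

/*
 * Return true if recovery would mark the LVB as not valid.
 * Locks without VALBLK are ignored; if none is left to inspect,
 * the function returns early and the stale LVB stays "valid".
 */
static bool sketch_recovery_invalidates_lvb(const struct sketch_lkb *lkbs,
					    size_t n)
{
	bool lvb_lock_seen = false;
	bool strong_lock_seen = false;

	for (size_t i = 0; i < n; i++) {
		if (!lkbs[i].valblk)
			continue; /* this lock does not take part in LVB recovery */
		lvb_lock_seen = true;
		if (lkbs[i].granted >= SKETCH_PR)
			strong_lock_seen = true;
	}

	if (!lvb_lock_seen)
		return false; /* the missed chance shown in the diagram */

	/* With no lock >= PR left, the LVB cannot be trusted. */
	return !strong_lock_seen;
}
```

      In this model, Node1's NULL lock from the 1st round has valblk == false,
      so the function returns false and the stale LVB is later trusted.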
      
      The fix is quite straightforward.  We keep setting the DLM_LKF_VALBLK
      flag for dlm_lock() if the lock resource type needs an LVB and the fsdlm
      plugin is used.
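
      A minimal sketch of that flag selection, assuming a helper that builds
      the flags for a downconvert request.  The helper name, its parameters and
      the flag values are illustrative stand-ins, not the kernel's definitions
      or the literal patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bits for illustration only; not taken from kernel headers. */
#define SKETCH_LKF_CONVERT 0x1u
#define SKETCH_LKF_VALBLK  0x2u

/*
 * Hypothetical helper: choose the flags for a downconvert request.
 *   set_lvb     - the caller wants to write the LVB on this conversion
 *   uses_lvb    - the lock resource type carries an LVB
 *   fsdlm_stack - the fsdlm plugin (not o2cb) is in use
 */
static unsigned int sketch_downconvert_flags(bool set_lvb, bool uses_lvb,
					     bool fsdlm_stack)
{
	unsigned int flags = SKETCH_LKF_CONVERT;

	/* Behaviour before the fix: VALBLK only when the LVB is written. */
	if (set_lvb)
		flags |= SKETCH_LKF_VALBLK;

	/* The fix: with fsdlm, keep VALBLK for every LVB-carrying resource. */
	if (fsdlm_stack && uses_lvb)
		flags |= SKETCH_LKF_VALBLK;

	return flags;
}
```

      Under this sketch, a PR->NULL downconvert with set_lvb == 0 still
      carries the VALBLK bit when fsdlm is in use, while the o2cb case is
      unchanged.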
      
      Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6c9bd81c