1. 10 Aug, 2016 36 commits
    • Sinclair Yeh's avatar
      Input: vmmouse - remove port reservation · bc05227c
      Sinclair Yeh authored
      commit 60842ef8 upstream.
      
      The VMWare EFI BIOS will expose port 0x5658 as an ACPI resource.  This
      causes the port to be reserved by the APCI module as the system comes up,
      making it unavailable to be reserved again by other drivers, thus
      preserving this VMWare port for special use in a VMWare guest.
      
      This port is designed to be shared among multiple VMWare services, such as
      the VMMOUSE.  Because of this, VMMOUSE should not try to reserve this port
      on its own.
      
      The VMWare non-EFI BIOS does not do this to preserve compatibility with
      existing/legacy VMs.  It is known that there is small chance a VM may be
      configured such that these ports get reserved by other non-VMWare devices,
      and if this ever happens, the result is undefined.
      Signed-off-by: default avatarSinclair Yeh <syeh@vmware.com>
      Reviewed-by: default avatarThomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc05227c
    • Kangjie Lu's avatar
      ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt · 1fae440a
      Kangjie Lu authored
      commit e4ec8cc8 upstream.
      
      The stack object “r1” has a total size of 32 bytes. Its field
      “event” and “val” both contain 4 bytes padding. These 8 bytes
      padding bytes are sent to user without being initialized.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1fae440a
    • Kangjie Lu's avatar
      ALSA: timer: Fix leak in events via snd_timer_user_ccallback · 5b6fc00b
      Kangjie Lu authored
      commit 9a47e9cf upstream.
      
      The stack object “r1” has a total size of 32 bytes. Its field
      “event” and “val” both contain 4 bytes padding. These 8 bytes
      padding bytes are sent to user without being initialized.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b6fc00b
    • Kangjie Lu's avatar
      ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS · 82a638a2
      Kangjie Lu authored
      commit cec8f96e upstream.
      
      The stack object “tread” has a total size of 32 bytes. Its field
      “event” and “val” both contain 4 bytes padding. These 8 bytes
      padding bytes are sent to user without being initialized.
      Signed-off-by: default avatarKangjie Lu <kjlu@gatech.edu>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      82a638a2
    • Bob Liu's avatar
      xen-blkfront: don't call talk_to_blkback when already connected to blkback · 2e9aedc4
      Bob Liu authored
      commit efd15352 upstream.
      
      Sometimes blkfront may twice receive blkback_changed() notification
      (XenbusStateConnected) after migration, which will cause
      talk_to_blkback() to be called twice too and confuse xen-blkback.
      
      The flow is as follow:
         blkfront                                        blkback
      blkfront_resume()
       > talk_to_blkback()
        > Set blkfront to XenbusStateInitialised
                                                      front changed()
                                                       > Connect()
                                                        > Set blkback to XenbusStateConnected
      
      blkback_changed()
       > Skip talk_to_blkback()
         because frontstate == XenbusStateInitialised
       > blkfront_connect()
        > Set blkfront to XenbusStateConnected
      
      -----
      And here we get another XenbusStateConnected notification leading
      to:
      -----
      blkback_changed()
       > because now frontstate != XenbusStateInitialised
         talk_to_blkback() is also called again
        > blkfront state changed from
        XenbusStateConnected to XenbusStateInitialised
          (Which is not correct!)
      
      						front_changed():
                                                       > Do nothing because blkback
                                                         already in XenbusStateConnected
      
      Now blkback is in XenbusStateConnected but blkfront is still
      in XenbusStateInitialised - leading to no disks.
      
      Poking of the XenbusStateConnected state is allowed (to deal with
      block disk change) and has to be dealt with. The most likely
      cause of this bug are custom udev scripts hooking up the disks
      and then validating the size.
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e9aedc4
    • Bob Liu's avatar
      xen-blkfront: fix resume issues after a migration · 1d8d6ed6
      Bob Liu authored
      commit 2a6f71ad upstream.
      
      After a migrate to another host (which may not have multiqueue
      support), the number of rings (block hardware queues)
      may be changed and the ring info structure will also be reallocated.
      
      This patch fixes two related bugs:
       * call blk_mq_update_nr_hw_queues() to make blk-core know the number
         of hardware queues have been changed.
       * Don't store rinfo pointer to hctx->driver_data, because rinfo may be
         reallocated so use hctx->queue_num to get the rinfo structure instead.
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d8d6ed6
    • Jan Beulich's avatar
      xenbus: don't bail early from xenbus_dev_request_and_reply() · 77bdc603
      Jan Beulich authored
      commit 7469be95 upstream.
      
      xenbus_dev_request_and_reply() needs to track whether a transaction is
      open.  For XS_TRANSACTION_START messages it calls transaction_start()
      and for XS_TRANSACTION_END messages it calls transaction_end().
      
      If sending an XS_TRANSACTION_START message fails or responds with an
      an error, the transaction is not open and transaction_end() must be
      called.
      
      If sending an XS_TRANSACTION_END message fails, the transaction is
      still open, but if an error response is returned the transaction is
      closed.
      
      Commit 027bd7e8 ("xen/xenbus: Avoid synchronous wait on XenBus
      stalling shutdown/restart") introduced a regression where failed
      XS_TRANSACTION_START messages were leaving the transaction open.  This
      can cause problems with suspend (and migration) as all transactions
      must be closed before suspending.
      
      It appears that the problematic change was added accidentally, so just
      remove it.
      Signed-off-by: default avatarJan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      77bdc603
    • Jan Beulich's avatar
      xenbus: don't BUG() on user mode induced condition · e8f5fe05
      Jan Beulich authored
      commit 0beef634 upstream.
      
      Inability to locate a user mode specified transaction ID should not
      lead to a kernel crash. For other than XS_TRANSACTION_START also
      don't issue anything to xenbus if the specified ID doesn't match that
      of any active transaction.
      Signed-off-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e8f5fe05
    • Bob Liu's avatar
      xen-blkfront: save uncompleted reqs in blkfront_resume() · 61675789
      Bob Liu authored
      commit 7b427a59 upstream.
      
      Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() during
      migration, but that's too late after multi-queue was introduced.
      
      After a migrate to another host (which may not have multiqueue support), the
      number of rings (block hardware queues) may be changed and the ring and shadow
      structure will also be reallocated.
      
      The blkfront_recover() then can't 'save and resubmit' the real
      uncompleted reqs because shadow structure have been reallocated.
      
      This patch fixes this issue by moving the 'save' logic out of
      blkfront_recover() to earlier place in blkfront_resume().
      
      The 'resubmit' is not changed and still in blkfront_recover().
      Signed-off-by: default avatarBob Liu <bob.liu@oracle.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61675789
    • Andrey Grodzovsky's avatar
      xen/pciback: Fix conf_space read/write overlap check. · 45e4b704
      Andrey Grodzovsky authored
      commit 02ef871e upstream.
      
      Current overlap check is evaluating to false a case where a filter
      field is fully contained (proper subset) of a r/w request.  This
      change applies classical overlap check instead to include all the
      scenarios.
      
      More specifically, for (Hilscher GmbH CIFX 50E-DP(M/S)) device driver
      the logic is such that the entire confspace is read and written in 4
      byte chunks. In this case as an example, CACHE_LINE_SIZE,
      LATENCY_TIMER and PCI_BIST are arriving together in one call to
      xen_pcibk_config_write() with offset == 0xc and size == 4.  With the
      exsisting overlap check the LATENCY_TIMER field (offset == 0xd, length
      == 1) is fully contained in the write request and hence is excluded
      from write, which is incorrect.
      Signed-off-by: default avatarAndrey Grodzovsky <andrey2805@gmail.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: default avatarJan Beulich <JBeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45e4b704
    • Vineet Gupta's avatar
      ARC: unwind: ensure that .debug_frame is generated (vs. .eh_frame) · 4d1fa7a0
      Vineet Gupta authored
      commit f52e126c upstream.
      
      With recent binutils update to support dwarf CFI pseudo-ops in gas, we
      now get .eh_frame vs. .debug_frame. Although the call frame info is
      exactly the same in both, the CIE differs, which the current kernel
      unwinder can't cope with.
      
      This broke both the kernel unwinder as well as loadable modules (latter
      because of a new unhandled relo R_ARC_32_PCREL from .rela.eh_frame in
      the module loader)
      
      The ideal solution would be to switch unwinder to .eh_frame.
      For now however we can make do by just ensureing .debug_frame is
      generated by removing -fasynchronous-unwind-tables
      
       .eh_frame    generated with -gdwarf-2 -fasynchronous-unwind-tables
       .debug_frame generated with -gdwarf-2
      
      Fixes STAR 9001058196
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d1fa7a0
    • Alexey Brodkin's avatar
      arc: unwind: warn only once if DW2_UNWIND is disabled · bcd07ba6
      Alexey Brodkin authored
      commit 9bd54517 upstream.
      
      If CONFIG_ARC_DW2_UNWIND is disabled every time arc_unwind_core()
      gets called following message gets printed in debug console:
      ----------------->8---------------
      CONFIG_ARC_DW2_UNWIND needs to be enabled
      ----------------->8---------------
      
      That message makes sense if user indeed wants to see a backtrace or
      get nice function call-graphs in perf but what if user disabled
      unwinder for the purpose? Why pollute his debug console?
      
      So instead we'll warn user about possibly missing feature once and
      let him decide if that was what he or she really wanted.
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bcd07ba6
    • Josh Poimboeuf's avatar
      sched/debug: Fix deadlock when enabling sched events · f6e9bad4
      Josh Poimboeuf authored
      commit eda8dca5 upstream.
      
      I see a hang when enabling sched events:
      
        echo 1 > /sys/kernel/debug/tracing/events/sched/enable
      
      The printk buffer shows:
      
        BUG: spinlock recursion on CPU#1, swapper/1/0
         lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
        ...
        Call Trace:
         <IRQ>  [<ffffffff8143d663>] dump_stack+0x85/0xc2
         [<ffffffff81115948>] spin_dump+0x78/0xc0
         [<ffffffff81115aea>] do_raw_spin_lock+0x11a/0x150
         [<ffffffff81891471>] _raw_spin_lock+0x61/0x80
         [<ffffffff810e5466>] ? try_to_wake_up+0x256/0x4e0
         [<ffffffff810e5466>] try_to_wake_up+0x256/0x4e0
         [<ffffffff81891a0a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
         [<ffffffff810e5705>] wake_up_process+0x15/0x20
         [<ffffffff810cebb4>] insert_work+0x84/0xc0
         [<ffffffff810ced7f>] __queue_work+0x18f/0x660
         [<ffffffff810cf9a6>] queue_work_on+0x46/0x90
         [<ffffffffa00cd95b>] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
         [<ffffffffa00cdac0>] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
         [<ffffffff814babcd>] soft_cursor+0x1ad/0x230
         [<ffffffff814ba379>] bit_cursor+0x649/0x680
         [<ffffffff814b9d30>] ? update_attr.isra.2+0x90/0x90
         [<ffffffff814b5e6a>] fbcon_cursor+0x14a/0x1c0
         [<ffffffff81555ef8>] hide_cursor+0x28/0x90
         [<ffffffff81558b6f>] vt_console_print+0x3bf/0x3f0
         [<ffffffff81122c63>] call_console_drivers.constprop.24+0x183/0x200
         [<ffffffff811241f4>] console_unlock+0x3d4/0x610
         [<ffffffff811247f5>] vprintk_emit+0x3c5/0x610
         [<ffffffff81124bc9>] vprintk_default+0x29/0x40
         [<ffffffff811e965b>] printk+0x57/0x73
         [<ffffffff810f7a9e>] enqueue_entity+0xc2e/0xc70
         [<ffffffff810f7b39>] enqueue_task_fair+0x59/0xab0
         [<ffffffff8106dcd9>] ? kvm_sched_clock_read+0x9/0x20
         [<ffffffff8103fb39>] ? sched_clock+0x9/0x10
         [<ffffffff810e3fcc>] activate_task+0x5c/0xa0
         [<ffffffff810e4514>] ttwu_do_activate+0x54/0xb0
         [<ffffffff810e5cea>] sched_ttwu_pending+0x7a/0xb0
         [<ffffffff810e5e51>] scheduler_ipi+0x61/0x170
         [<ffffffff81059e7f>] smp_trace_reschedule_interrupt+0x4f/0x2a0
         [<ffffffff81893ba6>] trace_reschedule_interrupt+0x96/0xa0
         <EOI>  [<ffffffff8106e0d6>] ? native_safe_halt+0x6/0x10
         [<ffffffff8110fb1d>] ? trace_hardirqs_on+0xd/0x10
         [<ffffffff81040ac0>] default_idle+0x20/0x1a0
         [<ffffffff8104147f>] arch_cpu_idle+0xf/0x20
         [<ffffffff81102f8f>] default_idle_call+0x2f/0x50
         [<ffffffff8110332e>] cpu_startup_entry+0x37e/0x450
         [<ffffffff8105af70>] start_secondary+0x160/0x1a0
      
      Note the hang only occurs when echoing the above from a physical serial
      console, not from an ssh session.
      
      The bug is caused by a deadlock where the task is trying to grab the rq
      lock twice because printk()'s aren't safe in sched code.
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@trebleSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6e9bad4
    • Andrey Ryabinin's avatar
      kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w · 543a87ab
      Andrey Ryabinin authored
      commit 57675cb9 upstream.
      
      Lengthy output of sysrq-w may take a lot of time on slow serial console.
      
      Currently we reset NMI-watchdog on the current CPU to avoid spurious
      lockup messages. Sometimes this doesn't work since softlockup watchdog
      might trigger on another CPU which is waiting for an IPI to proceed.
      We reset softlockup watchdogs on all CPUs, but we do this only after
      listing all tasks, and this may be too late on a busy system.
      
      So, reset watchdogs CPUs earlier, in for_each_process_thread() loop.
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      543a87ab
    • Jiri Slaby's avatar
      pps: do not crash when failed to register · 266f1d62
      Jiri Slaby authored
      commit 368301f2 upstream.
      
      With this command sequence:
      
        modprobe plip
        modprobe pps_parport
        rmmod pps_parport
      
      the partport_pps modules causes this crash:
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: parport_detach+0x1d/0x60 [pps_parport]
        Oops: 0000 [#1] SMP
        ...
        Call Trace:
          parport_unregister_driver+0x65/0xc0 [parport]
          SyS_delete_module+0x187/0x210
      
      The sequence that builds up to this is:
      
       1) plip is loaded and takes the parport device for exclusive use:
      
          plip0: Parallel port at 0x378, using IRQ 7.
      
       2) pps_parport then fails to grab the device:
      
          pps_parport: parallel port PPS client
          parport0: cannot grant exclusive access for device pps_parport
          pps_parport: couldn't register with parport0
      
       3) rmmod of pps_parport is then killed because it tries to access
          pardev->name, but pardev (taken from port->cad) is NULL.
      
      So add a check for NULL in the test there too.
      
      Link: http://lkml.kernel.org/r/20160714115245.12651-1-jslaby@suse.czSigned-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Acked-by: default avatarRodolfo Giometti <giometti@enneenne.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      266f1d62
    • Andrey Ryabinin's avatar
      radix-tree: fix radix_tree_iter_retry() for tagged iterators. · d9e8be11
      Andrey Ryabinin authored
      commit 3cb9185c upstream.
      
      radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
      Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
      leading to crash:
      
        RIP: radix_tree_next_slot include/linux/radix-tree.h:473
          find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
        ....
        Call Trace:
          pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
          mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
          ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
          do_writepages+0x97/0x100 mm/page-writeback.c:2364
          __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
          filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
          ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
          vfs_fsync_range+0x10a/0x250 fs/sync.c:195
          vfs_fsync fs/sync.c:209
          do_fsync+0x42/0x70 fs/sync.c:219
          SYSC_fdatasync fs/sync.c:232
          SyS_fdatasync+0x19/0x20 fs/sync.c:230
          entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
      
      We must reset iterator's tags to bail out from radix_tree_next_slot()
      and go to the slow-path in radix_tree_next_chunk().
      
      Fixes: 46437f9a ("radix-tree: fix race in gang lookup")
      Link: http://lkml.kernel.org/r/1468495196-10604-1-git-send-email-aryabinin@virtuozzo.comSigned-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9e8be11
    • Johannes Weiner's avatar
      mm: memcontrol: fix cgroup creation failure after many small jobs · db70cd18
      Johannes Weiner authored
      commit 73f576c0 upstream.
      
      The memory controller has quite a bit of state that usually outlives the
      cgroup and pins its CSS until said state disappears.  At the same time
      it imposes a 16-bit limit on the CSS ID space to economically store IDs
      in the wild.  Consequently, when we use cgroups to contain frequent but
      small and short-lived jobs that leave behind some page cache, we quickly
      run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
      fails with -ENOSPC while there are only a few, or even no user-visible
      cgroups in existence.
      
      Although pinning CSSs past cgroup removal is common, there are only two
      instances that actually need an ID after a cgroup is deleted: cache
      shadow entries and swapout records.
      
      Cache shadow entries reference the ID weakly and can deal with the CSS
      having disappeared when it's looked up later.  They pose no hurdle.
      
      Swap-out records do need to pin the css to hierarchically attribute
      swapins after the cgroup has been deleted; though the only pages that
      remain swapped out after offlining are tmpfs/shmem pages.  And those
      references are under the user's control, so they are manageable.
      
      This patch introduces a private 16-bit memcg ID and switches swap and
      cache shadow entries over to using that.  This ID can then be recycled
      after offlining when the CSS remains pinned only by objects that don't
      specifically need it.
      
      This script demonstrates the problem by faulting one cache page in a new
      cgroup and deleting it again:
      
        set -e
        mkdir -p pages
        for x in `seq 128000`; do
          [ $((x % 1000)) -eq 0 ] && echo $x
          mkdir /cgroup/foo
          echo $$ >/cgroup/foo/cgroup.procs
          echo trex >pages/$x
          echo $$ >/cgroup/cgroup.procs
          rmdir /cgroup/foo
        done
      
      When run on an unpatched kernel, we eventually run out of possible IDs
      even though there are no visible cgroups:
      
        [root@ham ~]# ./cssidstress.sh
        [...]
        65000
        mkdir: cannot create directory '/cgroup/foo': No space left on device
      
      After this patch, the IDs get released upon cgroup destruction and the
      cache and css objects get released once memory reclaim kicks in.
      
      [hannes@cmpxchg.org: init the IDR]
        Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
      Fixes: b2052564 ("mm: memcontrol: continue cache reclaim from offlined groups")
      Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarJohn Garcia <john.garcia@mesosphere.io>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      db70cd18
    • Hugh Dickins's avatar
      mm: thp: refix false positive BUG in page_move_anon_rmap() · 11a11016
      Hugh Dickins authored
      commit 5a49973d upstream.
      
      The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
      worth: the syzkaller fuzzer hit it again.  It's still wrong for some THP
      cases, because linear_page_index() was never intended to apply to
      addresses before the start of a vma.
      
      That's easily fixed with a signed long cast inside linear_page_index();
      and Dmitry has tested such a patch, to verify the false positive.  But
      why extend linear_page_index() just for this case? when the avoidance in
      page_move_anon_rmap() has already grown ugly, and there's no reason for
      the check at all (nothing else there is using address or index).
      
      Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
      remove CONFIG_DEBUG_VM PageTransHuge adjustment.
      
      And one more thing: should the compound_head(page) be done inside or
      outside page_move_anon_rmap()? It's usually pushed down to the lowest
      level nowadays (and mm/memory.c shows no other explicit use of it), so I
      think it's better done in page_move_anon_rmap() than by caller.
      
      Fixes: 0798d3c0 ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvilsSigned-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      11a11016
    • Dmitry Vyukov's avatar
      vmlinux.lds: account for destructor sections · aa434e11
      Dmitry Vyukov authored
      commit e41f501d upstream.
      
      If CONFIG_KASAN is enabled and gcc is configured with
      --disable-initfini-array and/or gold linker is used, gcc emits
      .ctors/.dtors and .text.startup/.text.exit sections instead of
      .init_array/.fini_array.  .dtors section is not explicitly accounted in
      the linker script and messes vvar/percpu layout.
      
      We want:
        ffffffff822bfd80 D _edata
        ffffffff822c0000 D __vvar_beginning_hack
        ffffffff822c0000 A __vvar_page
        ffffffff822c0080 0000000000000098 D vsyscall_gtod_data
        ffffffff822c1000 A __init_begin
        ffffffff822c1000 D init_per_cpu__irq_stack_union
        ffffffff822c1000 A __per_cpu_load
        ffffffff822d3000 D init_per_cpu__gdt_page
      
      We got:
        ffffffff8279a600 D _edata
        ffffffff8279b000 A __vvar_page
        ffffffff8279c000 A __init_begin
        ffffffff8279c000 D init_per_cpu__irq_stack_union
        ffffffff8279c000 A __per_cpu_load
        ffffffff8279e000 D __vvar_beginning_hack
        ffffffff8279e080 0000000000000098 D vsyscall_gtod_data
        ffffffff827ae000 D init_per_cpu__gdt_page
      
      This happens because __vvar_page and .vvar get different addresses in
      arch/x86/kernel/vmlinux.lds.S:
      
      	. = ALIGN(PAGE_SIZE);
      	__vvar_page = .;
      
      	.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
      		/* work around gold bug 13023 */
      		__vvar_beginning_hack = .;
      
      Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
      Merge .text.startup into init text.
      
      Link: http://lkml.kernel.org/r/1467386363-120030-1-git-send-email-dvyukov@google.comSigned-off-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aa434e11
    • Mel Gorman's avatar
      mm, meminit: ensure node is online before checking whether pages are uninitialised · 7ec7eb63
      Mel Gorman authored
      commit ef70b6f4 upstream.
      
      early_page_uninitialised looks up an arbitrary PFN.  While a machine
      without node 0 will boot with "mm, page_alloc: Always return a valid
      node from early_pfn_to_nid", it works because it assumes that nodes are
      always in PFN order.  This is not guaranteed so this patch adds
      robustness by always checking if the node being checked is online.
      
      Link: http://lkml.kernel.org/r/1468008031-3848-4-git-send-email-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ec7eb63
    • Mel Gorman's avatar
      mm, meminit: always return a valid node from early_pfn_to_nid · f6ea62ca
      Mel Gorman authored
      commit e4568d38 upstream.
      
      early_pfn_to_nid can return node 0 if a PFN is invalid on machines that
      has no node 0.  A machine with only node 1 was observed to crash with
      the following message:
      
         BUG: unable to handle kernel paging request at 000000000002a3c8
         PGD 0
         Modules linked in:
         Hardware name: Supermicro H8DSP-8/H8DSP-8, BIOS 080011  06/30/2006
         task: ffffffff81c0d500 ti: ffffffff81c00000 task.ti: ffffffff81c00000
         RIP: reserve_bootmem_region+0x6a/0xef
         CR2: 000000000002a3c8 CR3: 0000000001c06000 CR4: 00000000000006b0
         Call Trace:
            free_all_bootmem+0x4b/0x12a
            mem_init+0x70/0xa3
            start_kernel+0x25b/0x49b
      
      The problem is that early_page_uninitialised uses the early_pfn_to_nid
      helper which returns node 0 for invalid PFNs.  No caller of
      early_pfn_to_nid cares except early_page_uninitialised.  This patch has
      early_pfn_to_nid always return a valid node.
      
      Link: http://lkml.kernel.org/r/1468008031-3848-3-git-send-email-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6ea62ca
    • Mauro Carvalho Chehab's avatar
      uapi: export lirc.h header · 3428b992
      Mauro Carvalho Chehab authored
      commit 12cb22bb upstream.
      
      This header contains the userspace API for lirc.
      
      This is a fixup for commit b7be7557 ("[media] bz#75751: Move
      internal header file lirc.h to uapi/").  It moved the header to the
      right place, but it forgot to add it at Kbuild.  So, despite being at
      uapi, it is not copied to the right place.
      
      Fixes: b7be7557 ("[media] bz#75751: Move internal header file lirc.h to uapi/")
      Link: http://lkml.kernel.org/r/320c765d32bfc82c582e336d52ffe1026c73c644.1468439021.git.mchehab@s-opensource.comSigned-off-by: default avatarMauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Alec Leamas <leamas.alec@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3428b992
    • David Rientjes's avatar
      mm, compaction: prevent VM_BUG_ON when terminating freeing scanner · 1a30fc84
      David Rientjes authored
      commit a46cbf3b upstream.
      
      It's possible to isolate some freepages in a pageblock and then fail
      split_free_page() due to the low watermark check.  In this case, we hit
      VM_BUG_ON() because the freeing scanner terminated early without a
      contended lock or enough freepages.
      
      This should never have been a VM_BUG_ON() since it's not a fatal
      condition.  It should have been a VM_WARN_ON() at best, or even handled
      gracefully.
      
      Regardless, we need to terminate anytime the full pageblock scan was not
      done.  The logic belongs in isolate_freepages_block(), so handle its
      state gracefully by terminating the pageblock loop and making a note to
      restart at the same pageblock next time since it was not possible to
      complete the scan this time.
      
      [rientjes@google.com: don't rescan pages in a pageblock]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1607111244150.83138@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606291436300.145590@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarMinchan Kim <minchan@kernel.org>
      Tested-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a30fc84
    • Torsten Hilbrich's avatar
      fs/nilfs2: fix potential underflow in call to crc32_le · bb01babb
      Torsten Hilbrich authored
      commit 63d2f95d upstream.
      
      The value `bytes' comes from the filesystem which is about to be
      mounted.  We cannot trust that the value is always in the range we
      expect it to be.
      
      Check its value before using it to calculate the length for the crc32_le
      call.  It value must be larger (or equal) sumoff + 4.
      
      This fixes a kernel bug when accidentially mounting an image file which
      had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
      The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a
      s_bytes value of 1.  This caused an underflow when substracting sumoff +
      4 (20) in the call to crc32_le.
      
        BUG: unable to handle kernel paging request at ffff88021e600000
        IP:  crc32_le+0x36/0x100
        ...
        Call Trace:
          nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
          nilfs_load_super_block+0x142/0x300 [nilfs2]
          init_nilfs+0x60/0x390 [nilfs2]
          nilfs_mount+0x302/0x520 [nilfs2]
          mount_fs+0x38/0x160
          vfs_kern_mount+0x67/0x110
          do_mount+0x269/0xe00
          SyS_mount+0x9f/0x100
          entry_SYSCALL_64_fastpath+0x16/0x71
      
      Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jpSigned-off-by: default avatarTorsten Hilbrich <torsten.hilbrich@secunet.com>
      Tested-by: default avatarTorsten Hilbrich <torsten.hilbrich@secunet.com>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb01babb
    • David Rientjes's avatar
      mm, compaction: abort free scanner if split fails · 8f40a441
      David Rientjes authored
      commit a4f04f2c upstream.
      
      If the memory compaction free scanner cannot successfully split a free
      page (only possible due to per-zone low watermark), terminate the free
      scanner rather than continuing to scan memory needlessly.  If the
      watermark is insufficient for a free page of order <= cc->order, then
      terminate the scanner since all future splits will also likely fail.
      
      This prevents the compaction freeing scanner from scanning all memory on
      very large zones (very noticeable for zones > 128GB, for instance) when
      all splits will likely fail while holding zone->lock.
      
      compaction_alloc() iterating a 128GB zone has been benchmarked to take
      over 400ms on some systems whereas any free page isolated and ready to
      be split ends up failing in split_free_page() because of the low
      watermark check and thus the iteration continues.
      
      The next time compaction occurs, the freeing scanner will likely start
      at the end of the zone again since no success was made previously and we
      get the same lengthy iteration until the zone is brought above the low
      watermark.  All thp page faults can take >400ms in such a state without
      this fix.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8f40a441
    • Lukasz Odzioba's avatar
      mm/swap.c: flush lru pvecs on compound page arrival · af809c0e
      Lukasz Odzioba authored
      commit 8f182270 upstream.
      
      Currently we can have compound pages held on per cpu pagevecs, which
      leads to a lot of memory unavailable for reclaim when needed.  In the
      systems with hundreads of processors it can be GBs of memory.
      
      On of the way of reproducing the problem is to not call munmap
      explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
      that some pages (with THP enabled also huge pages) may end up on
      lru_add_pvec, example below.
      
        void main() {
        #pragma omp parallel
        {
      	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
      	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
      		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
      	if (p != MAP_FAILED)
      		memset(p, 0, size);
      	//munmap(p, size); // uncomment to make the problem go away
        }
        }
      
      When we run it with THP enabled it will leave significant amount of
      memory on lru_add_pvec.  This memory will be not reclaimed if we hit
      OOM, so when we run above program in a loop:
      
      	for i in `seq 100`; do ./a.out; done
      
      many processes (95% in my case) will be killed by OOM.
      
      The primary point of the LRU add cache is to save the zone lru_lock
      contention with a hope that more pages will belong to the same zone and
      so their addition can be batched.  The huge page is already a form of
      batched addition (it will add 512 worth of memory in one go) so skipping
      the batching seems like a safer option when compared to a potential
      excess in the caching which can be quite large and much harder to fix
      because lru_add_drain_all is way to expensive and it is not really clear
      what would be a good moment to call it.
      
      Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
      madvise(p, size, MADV_FREE); after memset.
      
      This patch flushes lru pvecs on compound page arrival making the problem
      less severe - after applying it kill rate of above example drops to 0%,
      due to reducing maximum amount of memory held on pvec from 28MB (with
      THP) to 56kB per CPU.
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.comSigned-off-by: default avatarLukasz Odzioba <lukasz.odzioba@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Ming Li <mingli199x@qq.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      af809c0e
    • Tejun Heo's avatar
      memcg: css_alloc should return an ERR_PTR value on error · b31a27b8
      Tejun Heo authored
      commit ea3a9645 upstream.
      
      mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
      expected it to return an ERR_PTR value leading to the following NULL
      deref after a css allocation failure.  Fix it by return
      ERR_PTR(-ENOMEM) instead.  I'll also update cgroup core so that it
      can handle NULL returns.
      
        mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        ...
        Call Trace:
          dump_stack+0x68/0xa1
          warn_alloc_failed+0xd6/0x130
          __alloc_pages_nodemask+0x4c6/0xf20
          alloc_pages_current+0x66/0xe0
          alloc_kmem_pages+0x14/0x80
          kmalloc_order_trace+0x2a/0x1a0
          __kmalloc+0x291/0x310
          memcg_update_all_caches+0x6c/0x130
          mem_cgroup_css_alloc+0x590/0x610
          cgroup_apply_control_enable+0x18b/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        ...
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP:  init_and_link_css+0x37/0x220
        PGD 34b1e067 PUD 3a109067 PMD 0
        Oops: 0002 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
        task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
        RIP: 0010:[<ffffffff810f2ca7>]  [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220
        RSP: 0018:ffff8800666d7d90  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
        RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
        R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
        R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
        FS:  00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          cgroup_apply_control_enable+0x1ac/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
        RIP   init_and_link_css+0x37/0x220
         RSP <ffff8800666d7d90>
        CR2: 00000000000000d0
        ---[ end trace a2d8836ae1e852d1 ]---
      
      Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.orgSigned-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b31a27b8
    • Tejun Heo's avatar
      memcg: mem_cgroup_migrate() may be called with irq disabled · 5acd89d3
      Tejun Heo authored
      commit d93c4130 upstream.
      
      mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
      with irq disabled from migrate_page_copy().  This ends up enabling irq
      while holding a irq context lock triggering the following lockdep
      warning.  Fix it by using irq_save/restore instead.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.7.0-rc1+ #52 Tainted: G        W
        ---------------------------------
        inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8
        {IN-SOFTIRQ-W} state was registered at:
           __lock_acquire+0x5b6/0x1930
           lock_acquire+0xee/0x270
           _raw_spin_lock_irqsave+0x66/0xb0
           aio_complete+0x98/0x328
           dio_complete+0xe4/0x1e0
           blk_update_request+0xd4/0x450
           scsi_end_request+0x48/0x1c8
           scsi_io_completion+0x272/0x698
           blk_done_softirq+0xca/0xe8
           __do_softirq+0xc8/0x518
           irq_exit+0xee/0x110
           do_IRQ+0x6a/0x88
           io_int_handler+0x11a/0x25c
           __mutex_unlock_slowpath+0x144/0x1d8
           __mutex_unlock_slowpath+0x140/0x1d8
           kernfs_iop_permission+0x64/0x80
           __inode_permission+0x9e/0xf0
           link_path_walk+0x6e/0x510
           path_lookupat+0xc4/0x1a8
           filename_lookup+0x9c/0x160
           user_path_at_empty+0x5c/0x70
           SyS_readlinkat+0x68/0x140
           system_call+0xd6/0x270
        irq event stamp: 971410
        hardirqs last  enabled at (971409):  migrate_page_move_mapping+0x3ea/0x588
        hardirqs last disabled at (971410):  _raw_spin_lock_irqsave+0x3c/0xb0
        softirqs last  enabled at (970526):  __do_softirq+0x460/0x518
        softirqs last disabled at (970519):  irq_exit+0xee/0x110
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
      	 CPU0
      	 ----
          lock(&(&ctx->completion_lock)->rlock);
          <Interrupt>
            lock(&(&ctx->completion_lock)->rlock);
      
          *** DEADLOCK ***
      
        3 locks held by kcompactd0/151:
         #0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at:  aio_migratepage+0x42/0x1e8
         #1:  (&ctx->ring_lock){+.+.+.}, at:  aio_migratepage+0x5a/0x1e8
         #2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at:  aio_migratepage+0x156/0x1e8
      
        stack backtrace:
        CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G        W       4.7.0-rc1+ #52
        Call Trace:
          show_trace+0xea/0xf0
          show_stack+0x72/0xf0
          dump_stack+0x9a/0xd8
          print_usage_bug.part.27+0x2d4/0x2e8
          mark_lock+0x17e/0x758
          mark_held_locks+0xa2/0xd0
          trace_hardirqs_on_caller+0x140/0x1c0
          mem_cgroup_migrate+0x266/0x370
          aio_migratepage+0x16a/0x1e8
          move_to_new_page+0xb0/0x260
          migrate_pages+0x8f4/0x9f0
          compact_zone+0x4dc/0xdc8
          kcompactd_do_work+0x1aa/0x358
          kcompactd+0xba/0x2c8
          kthread+0x10a/0x110
          kernel_thread_starter+0x6/0xc
          kernel_thread_starter+0x0/0xc
        INFO: lockdep is turned off.
      
      Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
      Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
      Fixes: 74485cf2 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5acd89d3
    • Mel Gorman's avatar
      mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask · 521fe1d2
      Mel Gorman authored
      commit e838a45f upstream.
      
      Commit d0164adc ("mm, page_alloc: distinguish between being unable
      to sleep, unwilling to sleep and avoiding waking kswapd") modified
      __GFP_WAIT to explicitly identify the difference between atomic callers
      and those that were unwilling to sleep.  Later the definition was
      removed entirely.
      
      The GFP_RECLAIM_MASK is the set of flags that affect watermark checking
      and reclaim behaviour but __GFP_ATOMIC was never added.  Without it,
      atomic users of the slab allocator strip the __GFP_ATOMIC flag and
      cannot access the page allocator atomic reserves.  This patch addresses
      the problem.
      
      The user-visible impact depends on the workload but potentially atomic
      allocations unnecessarily fail without this path.
      
      Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Reported-by: default avatarMarcin Wojtas <mw@semihalf.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      521fe1d2
    • Ludovic Desroches's avatar
      dmaengine: at_xdmac: double FIFO flush needed to compute residue · f883b924
      Ludovic Desroches authored
      commit 9295c41d upstream.
      
      Due to the way CUBC register is updated, a double flush is needed to
      compute an accurate residue. First flush aim is to get data from the DMA
      FIFO and second one ensures that we won't report data which are not in
      memory.
      Signed-off-by: default avatarLudovic Desroches <ludovic.desroches@atmel.com>
      Fixes: e1f7c9ee ("dmaengine: at_xdmac: creation of the atmel
      eXtended DMA Controller driver")
      Reviewed-by: default avatarNicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: default avatarVinod Koul <vinod.koul@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f883b924
    • Ludovic Desroches's avatar
      dmaengine: at_xdmac: fix residue corruption · d1a41160
      Ludovic Desroches authored
      commit 53398f48 upstream.
      
      An unexpected value of CUBC can lead to a corrupted residue. A more
      complex sequence is needed to detect an inaccurate value for NCA or CUBC.
      Signed-off-by: default avatarLudovic Desroches <ludovic.desroches@atmel.com>
      Fixes: e1f7c9ee ("dmaengine: at_xdmac: creation of the atmel
      eXtended DMA Controller driver")
      Reviewed-by: default avatarNicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: default avatarVinod Koul <vinod.koul@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1a41160
    • Ludovic Desroches's avatar
      dmaengine: at_xdmac: align descriptors on 64 bits · de462956
      Ludovic Desroches authored
      commit 4a9723e8 upstream.
      
      Having descriptors aligned on 64 bits allows update CNDA and CUBC in an
      atomic way.
      Signed-off-by: default avatarLudovic Desroches <ludovic.desroches@atmel.com>
      Fixes: e1f7c9ee ("dmaengine: at_xdmac: creation of the atmel
      eXtended DMA Controller driver")
      Reviewed-by: default avatarNicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: default avatarVinod Koul <vinod.koul@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de462956
    • Lukas Wunner's avatar
      x86/quirks: Add early quirk to reset Apple AirPort card · 980d99cd
      Lukas Wunner authored
      commit abb2bafd upstream.
      
      The EFI firmware on Macs contains a full-fledged network stack for
      downloading OS X images from osrecovery.apple.com. Unfortunately
      on Macs introduced 2011 and 2012, EFI brings up the Broadcom 4331
      wireless card on every boot and leaves it enabled even after
      ExitBootServices has been called. The card continues to assert its IRQ
      line, causing spurious interrupts if the IRQ is shared. It also corrupts
      memory by DMAing received packets, allowing for remote code execution
      over the air. This only stops when a driver is loaded for the wireless
      card, which may be never if the driver is not installed or blacklisted.
      
      The issue seems to be constrained to the Broadcom 4331. Chris Milsted
      has verified that the newer Broadcom 4360 built into the MacBookPro11,3
      (2013/2014) does not exhibit this behaviour. The chances that Apple will
      ever supply a firmware fix for the older machines appear to be zero.
      
      The solution is to reset the card on boot by writing to a reset bit in
      its mmio space. This must be done as an early quirk and not as a plain
      vanilla PCI quirk to successfully combat memory corruption by DMAed
      packets: Matthew Garrett found out in 2012 that the packets are written
      to EfiBootServicesData memory (http://mjg59.dreamwidth.org/11235.html).
      This type of memory is made available to the page allocator by
      efi_free_boot_services(). Plain vanilla PCI quirks run much later, in
      subsys initcall level. In-between a time window would be open for memory
      corruption. Random crashes occurring in this time window and attributed
      to DMAed packets have indeed been observed in the wild by Chris
      Bainbridge.
      
      When Matthew Garrett analyzed the memory corruption issue in 2012, he
      sought to fix it with a grub quirk which transitions the card to D3hot:
      http://git.savannah.gnu.org/cgit/grub.git/commit/?id=9d34bb85da56
      
      This approach does not help users with other bootloaders and while it
      may prevent DMAed packets, it does not cure the spurious interrupts
      emanating from the card. Unfortunately the card's mmio space is
      inaccessible in D3hot, so to reset it, we have to undo the effect of
      Matthew's grub patch and transition the card back to D0.
      
      Note that the quirk takes a few shortcuts to reduce the amount of code:
      The size of BAR 0 and the location of the PM capability is identical
      on all affected machines and therefore hardcoded. Only the address of
      BAR 0 differs between models. Also, it is assumed that the BCMA core
      currently mapped is the 802.11 core. The EFI driver seems to always take
      care of this.
      
      Michael Büsch, Bjorn Helgaas and Matt Fleming contributed feedback
      towards finding the best solution to this problem.
      
      The following should be a comprehensive list of affected models:
          iMac13,1        2012  21.5"       [Root Port 00:1c.3 = 8086:1e16]
          iMac13,2        2012  27"         [Root Port 00:1c.3 = 8086:1e16]
          Macmini5,1      2011  i5 2.3 GHz  [Root Port 00:1c.1 = 8086:1c12]
          Macmini5,2      2011  i5 2.5 GHz  [Root Port 00:1c.1 = 8086:1c12]
          Macmini5,3      2011  i7 2.0 GHz  [Root Port 00:1c.1 = 8086:1c12]
          Macmini6,1      2012  i5 2.5 GHz  [Root Port 00:1c.1 = 8086:1e12]
          Macmini6,2      2012  i7 2.3 GHz  [Root Port 00:1c.1 = 8086:1e12]
          MacBookPro8,1   2011  13"         [Root Port 00:1c.1 = 8086:1c12]
          MacBookPro8,2   2011  15"         [Root Port 00:1c.1 = 8086:1c12]
          MacBookPro8,3   2011  17"         [Root Port 00:1c.1 = 8086:1c12]
          MacBookPro9,1   2012  15"         [Root Port 00:1c.1 = 8086:1e12]
          MacBookPro9,2   2012  13"         [Root Port 00:1c.1 = 8086:1e12]
          MacBookPro10,1  2012  15"         [Root Port 00:1c.1 = 8086:1e12]
          MacBookPro10,2  2012  13"         [Root Port 00:1c.1 = 8086:1e12]
      
      For posterity, spurious interrupts caused by the Broadcom 4331 wireless
      card resulted in splats like this (stacktrace omitted):
      
          irq 17: nobody cared (try booting with the "irqpoll" option)
          handlers:
          [<ffffffff81374370>] pcie_isr
          [<ffffffffc0704550>] sdhci_irq [sdhci] threaded [<ffffffffc07013c0>] sdhci_thread_irq [sdhci]
          [<ffffffffc0a0b960>] azx_interrupt [snd_hda_codec]
          Disabling IRQ #17
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=79301
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111781
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=728916
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=895951#c16
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1009819
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1098621
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1149632#c5
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1279130
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1332732
      Tested-by: Konstantin Simanov <k.simanov@stlk.ru>        # [MacBookPro8,1]
      Tested-by: Lukas Wunner <lukas@wunner.de>                # [MacBookPro9,1]
      Tested-by: Bryan Paradis <bryan.paradis@gmail.com>       # [MacBookPro9,2]
      Tested-by: Andrew Worsley <amworsley@gmail.com>          # [MacBookPro10,1]
      Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> # [MacBookPro10,2]
      Signed-off-by: default avatarLukas Wunner <lukas@wunner.de>
      Acked-by: default avatarRafał Miłecki <zajec5@gmail.com>
      Acked-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chris Milsted <cmilsted@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Michael Buesch <m@bues.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: b43-dev@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: linux-wireless@vger.kernel.org
      Link: http://lkml.kernel.org/r/48d0972ac82a53d460e5fce77a07b2560db95203.1465690253.git.lukas@wunner.de
      [ Did minor readability edits. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      980d99cd
    • Lukas Wunner's avatar
      x86/quirks: Reintroduce scanning of secondary buses · 7013ef90
      Lukas Wunner authored
      commit 850c3210 upstream.
      
      We used to scan secondary buses until the following commit that
      was applied in 2009:
      
        8659c406 ("x86: only scan the root bus in early PCI quirks")
      
      which commit constrained early quirks to the root bus only. Its
      motivation was to prevent application of the nvidia_bugs quirk
      on secondary buses.
      
      We're about to add a quirk to reset the Broadcom 4331 wireless card on
      2011/2012 Macs, which is located on a secondary bus behind a PCIe root
      port. To facilitate that, reintroduce scanning of secondary buses.
      
      The commit message of 8659c406 notes that scanning only the root bus
      "saves quite some unnecessary scanning work". The algorithm used prior
      to 8659c406 was particularly time consuming because it scanned
      buses 0 to 31 brute force. To avoid lengthening boot time, employ a
      recursive strategy which only scans buses that are actually reachable
      from the root bus.
      
      Yinghai Lu pointed out that the secondary bus number read from a
      bridge's config space may be invalid, in particular a value of 0 would
      cause an infinite loop. The PCI core goes beyond that and recurses to a
      child bus only if its bus number is greater than the parent bus number
      (see pci_scan_bridge()). Since the root bus is numbered 0, this implies
      that secondary buses may not be 0. Do the same on early scanning.
      
      If this algorithm is found to significantly impact boot time or cause
      infinite loops on broken hardware, it would be possible to limit its
      recursion depth: The Broadcom 4331 quirk applies at depth 1, all others
      at depth 0, so the bus need not be scanned deeper than that for now. An
      alternative approach would be to revert to scanning only the root bus,
      and apply the Broadcom 4331 quirk to the root ports 8086:1c12, 8086:1e12
      and 8086:1e16. Apple always positioned the card behind either of these
      three ports. The quirk would then check presence of the card in slot 0
      below the root port and do its deed.
      Signed-off-by: default avatarLukas Wunner <lukas@wunner.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: linux-pci@vger.kernel.org
      Link: http://lkml.kernel.org/r/f0daa70dac1a9b2483abdb31887173eb6ab77bdf.1465690253.git.lukas@wunner.deSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7013ef90
    • Lukas Wunner's avatar
      x86/quirks: Apply nvidia_bugs quirk only on root bus · 9c170009
      Lukas Wunner authored
      commit 447d29d1 upstream.
      
      Since the following commit:
      
        8659c406 ("x86: only scan the root bus in early PCI quirks")
      
      ... early quirks are only applied to devices on the root bus.
      
      The motivation was to prevent application of the nvidia_bugs quirk on
      secondary buses.
      
      We're about to reintroduce scanning of secondary buses for a quirk to
      reset the Broadcom 4331 wireless card on 2011/2012 Macs. To prevent
      regressions, open code the requirement to apply nvidia_bugs only on the
      root bus.
      Signed-off-by: default avatarLukas Wunner <lukas@wunner.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/4d5477c1d76b2f0387a780f2142bbcdd9fee869b.1465690253.git.lukas@wunner.deSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c170009
    • Michał Pecio's avatar
      USB: OHCI: Don't mark EDs as ED_OPER if scheduling fails · e043a37e
      Michał Pecio authored
      commit c66f59ee upstream.
      
      Since ed_schedule begins with marking the ED as "operational",
      the ED may be left in such state even if scheduling actually
      fails.
      
      This allows future submission attempts to smuggle this ED to the
      hardware behind the scheduler's back and without linking it to
      the ohci->eds_in_use list.
      
      The former causes bandwidth saturation and data loss on isoc
      endpoints, the latter crashes the kernel when attempt is made
      to unlink such ED from this list.
      
      Fix ed_schedule to update ED state only on successful return.
      Signed-off-by: default avatarMichal Pecio <michal.pecio@gmail.com>
      Acked-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e043a37e
  2. 27 Jul, 2016 4 commits