1. 26 Feb, 2016 17 commits
  2. 25 Feb, 2016 23 commits
    • Linux 4.4.3 · 2134d97a
      Greg Kroah-Hartman authored
      2134d97a
    • modules: fix modparam async_probe request · e2f712dc
      Luis R. Rodriguez authored
      commit 4355efbd upstream.
      
      Commit f2411da7 ("driver-core: add driver module
      asynchronous probe support") added async probe support,
      in two forms:
      
        * in-kernel driver specification annotation
        * generic async_probe module parameter (modprobe foo async_probe)
      
      To support the generic kernel parameter, parse_args() was
      extended via commit ecc86170 ("module: add extra
      argument for parse_params() callback"); however, commit
      f2411da7 failed to add the required argument.
      
      This causes a crash whenever the generic async_probe
      module parameter is used. This was overlooked when the
      form of in-kernel async probe support was reworked
      a bit. Fix this as originally intended.
      
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [minimized]
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2f712dc
    • module: wrapper for symbol name. · a24d9a2f
      Rusty Russell authored
      commit 2e7bac53 upstream.
      
      This trivial wrapper adds clarity and makes the following patch
      smaller.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a24d9a2f
    • itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper · 82e730ba
      Thomas Gleixner authored
      commit 51cbb524 upstream.
      
      As Helge reported for timerfd we have the same issue in itimers. We return
      remaining time larger than the programmed relative time to user space in case
      of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time
      added in hrtimer_start_range_ns().
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: dhowells@redhat.com
      Link: http://lkml.kernel.org/r/20160114164159.528222587@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      82e730ba
    • posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper · 1c94da3e
      Thomas Gleixner authored
      commit 572c3917 upstream.
      
      As Helge reported for timerfd we have the same issue in posix timers. We
      return remaining time larger than the programmed relative time to user space
      in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra
      time added in hrtimer_start_range_ns().
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: dhowells@redhat.com
      Link: http://lkml.kernel.org/r/20160114164159.450510905@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c94da3e
    • timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper · 565f2229
      Thomas Gleixner authored
      commit b62526ed upstream.
      
      Helge reported that a relative timer can return a remaining time larger than
      the programmed relative time on parisc and other architectures which have
      CONFIG_TIME_LOW_RES set. This happens because we add a jiffy to the resulting
      expiry time to prevent short timeouts.
      
      Use the new function hrtimer_expires_remaining_adjusted() to calculate the
      remaining time. It takes that extra added time into account for relative
      timers.
      Reported-and-tested-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: dhowells@redhat.com
      Link: http://lkml.kernel.org/r/20160114164159.354500742@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      565f2229
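      The adjustment described in the three CONFIG_TIME_LOW_RES commits above can
      be illustrated with a small user-space sketch. It is not the kernel
      implementation: struct sketch_timer, SKETCH_JIFFY_NS (one tick at an assumed
      HZ=100) and the three helper functions are hypothetical names. Only the idea
      from the commit message is modelled: a relative timer's expiry carries one
      extra jiffy, which has to be subtracted again when reporting the remaining
      time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed tick length: 10 ms, i.e. HZ=100. Hypothetical value for this sketch. */
#define SKETCH_JIFFY_NS INT64_C(10000000)

/* Hypothetical stand-in for a timer; not the kernel's struct hrtimer. */
struct sketch_timer {
	int64_t expires_ns; /* absolute expiry time */
	bool is_relative;   /* armed with a relative timeout? */
};

/* Arm a relative timer as the message describes for low-resolution kernels:
 * one extra jiffy is added so the timeout can never fire early. */
static struct sketch_timer sketch_start_relative(int64_t now_ns, int64_t delay_ns)
{
	struct sketch_timer t;

	assert(delay_ns >= 0);
	t.expires_ns = now_ns + delay_ns + SKETCH_JIFFY_NS;
	t.is_relative = true;
	return t;
}

/* Naive remaining time: still contains the extra jiffy, so it can exceed
 * the delay the caller programmed. */
static int64_t sketch_remaining_naive(const struct sketch_timer *t, int64_t now_ns)
{
	return t->expires_ns - now_ns;
}

/* Adjusted remaining time: take the extra jiffy back out for relative timers. */
static int64_t sketch_remaining_adjusted(const struct sketch_timer *t, int64_t now_ns)
{
	int64_t rem = t->expires_ns - now_ns;

	if (t->is_relative)
		rem -= SKETCH_JIFFY_NS;
	return rem;
}
```

      With a 50 ms relative timeout, the naive calculation reports 60 ms remaining
      right after arming, while the adjusted one reports the programmed 50 ms.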
    • prctl: take mmap sem for writing to protect against others · e5e99792
      Mateusz Guzik authored
      commit ddf1d398 upstream.
      
      An unprivileged user can trigger an oops on a kernel with
      CONFIG_CHECKPOINT_RESTORE.
      
      proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
      start/end values. These get sanity checked as follows:
              BUG_ON(arg_start > arg_end);
              BUG_ON(env_start > env_end);
      
      These can be changed by prctl_set_mm. It turns out that prctl_set_mm also
      takes the semaphore for reading, which effectively renders the locking
      useless. This results in:
      
        kernel BUG at fs/proc/base.c:240!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: virtio_net
        CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
        RIP: proc_pid_cmdline_read+0x520/0x530
        RSP: 0018:ffff8800784d3db8  EFLAGS: 00010206
        RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
        RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
        RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
        R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
        FS:  00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
        Call Trace:
          __vfs_read+0x37/0x100
          vfs_read+0x82/0x130
          SyS_read+0x58/0xd0
          entry_SYSCALL_64_fastpath+0x12/0x76
        Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
        RIP   proc_pid_cmdline_read+0x520/0x530
        ---[ end trace 97882617ae9c6818 ]---
      
      Turns out there are instances where the code just reads the aforementioned
      values without any locking whatsoever - namely environ_read and get_cmdline.
      
      Interestingly these functions look quite resilient against bogus values,
      but I don't believe this should be relied upon.
      
      The first patch gets rid of the oops bug by grabbing mmap_sem for
      writing.
      
      The second patch is optional and puts locking around the aforementioned
      consumers for safety.  Consumers of other fields don't seem to benefit
      from similar treatment and are left untouched.
      
      This patch (of 2):
      
      The code was taking the semaphore for reading, which protects against
      neither readers nor concurrent modifications.
      
      The problem could cause a sanity check to fail in procfs's cmdline
      reader, resulting in an OOPS.
      
      Note that some functions perform an unlocked read of various mm fields,
      but they seem to be fine despite possible modification.
      Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anshuman Khandual <anshuman.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5e99792
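      Why taking the semaphore in read mode does not help the prctl commit above
      can be seen in a toy model of reader/writer admission rules. This is a
      single-threaded illustration under stated assumptions, not the kernel's
      rwsem: struct sketch_rwsem and the sketch_* functions are hypothetical, and
      a real lock would block rather than return false.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of reader/writer admission rules; not kernel code. */
struct sketch_rwsem {
	int readers; /* number of holders in read mode */
	bool writer; /* true while one holder has the lock in write mode */
};

/* Read mode is shared: it is refused only while a writer holds the lock. */
static bool sketch_try_read(struct sketch_rwsem *s)
{
	if (s->writer)
		return false;
	s->readers++;
	return true;
}

/* Write mode is exclusive: it is refused while anyone holds the lock. */
static bool sketch_try_write(struct sketch_rwsem *s)
{
	if (s->writer || s->readers > 0)
		return false;
	s->writer = true;
	return true;
}

/* Drop one read-mode hold. */
static void sketch_release_read(struct sketch_rwsem *s)
{
	assert(s->readers > 0);
	s->readers--;
}
```

      In this model, an updater that takes the lock in read mode is admitted
      alongside a reader that is in the middle of checking arg_start and arg_end.
      Only write mode forces the updater to wait until the reader is done.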
    • xfs: log mount failures don't wait for buffers to be released · f86701c4
      Dave Chinner authored
      commit 85bec546 upstream.
      
      Recently I've been seeing xfs/051 fail on 1k block size filesystems.
      Trying to trace the events during the test led to the problem going
      away, indicating that it was a race condition that led to this
      ASSERT failure:
      
      XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
      .....
      [<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
      [<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
      [<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
      [<ffffffff811d8800>] mount_bdev+0x180/0x1b0
      [<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
      [<ffffffff811d90a8>] mount_fs+0x38/0x170
      [<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
      [<ffffffff811f7018>] do_mount+0x218/0xd60
      [<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0
      
      When I finally caught it with tracing enabled, I saw that AG 2 had
      an elevated reference count and a buffer was responsible for it. I
      tracked down the specific buffer, and found that it was missing the
      final reference count release that would put it back on the LRU and
      hence be found by xfs_wait_buftarg() calls in the log mount failure
      handling.
      
      The last four traces for the buffer before the assert were (trimmed
      for relevance)
      
      kworker/0:1-5259   xfs_buf_iodone:        hold 2  lock 0 flags ASYNC
      kworker/0:1-5259   xfs_buf_ioerror:       hold 2  lock 0 error -5
      mount-7163	   xfs_buf_lock_done:     hold 2  lock 0 flags ASYNC
      mount-7163	   xfs_buf_unlock:        hold 2  lock 1 flags ASYNC
      
      This is an async write that is completing, so there's nobody waiting
      for it directly.  Hence we call xfs_buf_relse() once all the
      processing is complete. That does:
      
      static inline void xfs_buf_relse(xfs_buf_t *bp)
      {
      	xfs_buf_unlock(bp);
      	xfs_buf_rele(bp);
      }
      
      Now, it's clear that mount is waiting on the buffer lock, and that
      it has been released by xfs_buf_relse() and gained by mount. This is
      expected, because at this point the mount process is in
      xfs_buf_delwri_submit() waiting for all the IO it submitted to
      complete.
      
      The mount process, however, is waiting on the lock for the buffer
      because it is in xfs_buf_delwri_submit(). This waits for IO
      completion, but it doesn't wait for the buffer reference owned by
      the IO to go away. The mount process collects all the completions,
      fails the log recovery, and the higher level code then calls
      xfs_wait_buftarg() to free all the remaining buffers in the
      filesystem.
      
      The issue is that on unlocking the buffer, the scheduler has decided
      that the mount process has higher priority than the kworker
      thread that is running the IO completion, and so immediately
      switched contexts to the mount process from the semaphore unlock
      code, hence preventing the kworker thread from finishing the IO
      completion and releasing the IO reference to the buffer.
      
      Hence by the time that xfs_wait_buftarg() is run, the buffer still
      has an active reference and so isn't on the LRU list that the
      function walks to free the remaining buffers. Hence we miss that
      buffer and continue onwards to tear down the mount structures,
      at which time we find a stray reference count on the perag
      structure. On a non-debug kernel, this will be ignored and the
      structure torn down and freed. Hence when the kworker thread is then
      rescheduled and the buffer released and freed, it will access a
      freed perag structure.
      
      The problem here is that when the log mount fails, we still need to
      quiesce the log to ensure that the IO workqueues have returned to
      idle before we run xfs_wait_buftarg(). By synchronising the
      workqueues, we ensure that all IO completions are fully processed,
      not just to the point where buffers have been unlocked. This ensures
      we don't end up in the situation above.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f86701c4
    • Revert "xfs: clear PF_NOFREEZE for xfsaild kthread" · 16f14a28
      Dave Chinner authored
      commit 3e85286e upstream.
      
      This reverts commit 24ba16bb as it
      prevents machines from suspending. This regression occurs when the
      xfsaild is idle on entry to suspend, and so there is no activity to
      wake it from its idle sleep and hence see that it is supposed to
      freeze. Hence the freezer times out waiting for it and suspend is
      cancelled.
      
      There is no obvious fix for this short of freezing the filesystem
      properly, so revert this change for now.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      16f14a28
    • xfs: inode recovery readahead can race with inode buffer creation · 7530e6fd
      Dave Chinner authored
      commit b79f4a1c upstream.
      
      When we do inode readahead in log recovery, we can do the
      readahead before we've replayed the icreate transaction that stamps
      the buffer with inode cores. The inode readahead verifier catches
      this and marks the buffer as !done to indicate that it doesn't yet
      contain valid inodes.
      
      In adding buffer error notification (i.e. setting b_error = -EIO at
      the same time as we clear the done flag) to such a readahead
      verifier failure, we can then get subsequent inode recovery failing
      with this error:
      
      XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32
      
      This occurs when readahead completion races with icreate item replay
      such as:
      
      	inode readahead
      		find buffer
      		lock buffer
      		submit RA io
      	....
      	icreate recovery
      	    xfs_trans_get_buffer
      		find buffer
      		lock buffer
      		<blocks on RA completion>
      	.....
      	<ra completion>
      		fails verifier
      		clear XBF_DONE
      		set bp->b_error = -EIO
      		release and unlock buffer
      	<icreate gains lock>
      	icreate initialises buffer
      	marks buffer as done
      	adds buffer to delayed write queue
      	releases buffer
      
      At this point, we have an initialised inode buffer that is up to
      date but has an -EIO state registered against it. When we finally
      get to recovering an inode in that buffer:
      
      	inode item recovery
      	    xfs_trans_read_buffer
      		find buffer
      		lock buffer
      		sees XBF_DONE is set, returns buffer
      	    sees bp->b_error is set
      		fail log recovery!
      
      Essentially, we need xfs_trans_get_buf_map() to clear the error status of
      the buffer when doing a lookup. This function returns uninitialised
      buffers, so the buffer returned can not be in an error state and
      none of the code that uses this function expects b_error to be set
      on return. Indeed, there is an ASSERT(!bp->b_error); in the
      transaction case in xfs_trans_get_buf_map() that would have caught
      this if log recovery used transactions....
      
      This patch firstly changes the inode readahead failure to set -EIO
      on the buffer, and secondly changes xfs_buf_get_map() to never
      return a buffer with an error state set so this first change doesn't
      cause unexpected log recovery failures.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7530e6fd
    • libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct · 888959f2
      Darrick J. Wong authored
      commit 96f859d5 upstream.
      
      Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
      inside it, gcc will quietly round the structure size up to the nearest
      64 bits -- in this case, 40 bytes.  This results in the XFS_AGFL_SIZE
      macro returning incorrect results for v5 filesystems on 64-bit
      machines (118 items instead of 119).  As a result, a 32-bit xfs_repair
      will see garbage in AGFL item 119 and complain.
      
      Therefore, tell gcc not to pad the structure so that the AGFL size
      calculation is correct.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      888959f2
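      The padding arithmetic in the libxfs commit above can be reproduced with a
      short sketch. The field layout below is an assumed one that merely adds up
      to the 36 bytes the message mentions; the struct and function names are
      hypothetical, the 512-byte sector size is an assumption chosen because it
      yields the 118 and 119 figures from the message, and the packed attribute is
      the GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 36-byte header: 4 + 4 + 16 + 8 + 4 bytes of fields. */
struct sketch_agfl_padded {
	uint32_t magic;
	uint32_t seqno;
	uint8_t  uuid[16];
	uint64_t lsn; /* 8-byte alignment on typical 64-bit ABIs */
	uint32_t crc;
};

/* Same fields with padding disabled (GCC/Clang attribute). */
struct sketch_agfl_packed {
	uint32_t magic;
	uint32_t seqno;
	uint8_t  uuid[16];
	uint64_t lsn;
	uint32_t crc;
} __attribute__((packed));

/* Number of 4-byte free-list slots left in a sector after the header. */
static size_t sketch_agfl_slots(size_t sector_size, size_t header_size)
{
	assert(header_size <= sector_size);
	return (sector_size - header_size) / sizeof(uint32_t);
}
```

      On a typical 64-bit ABI, sizeof(struct sketch_agfl_padded) is rounded up to
      40, so a 512-byte sector holds (512 - 40) / 4 = 118 slots; the packed
      variant is 36 bytes and leaves room for (512 - 36) / 4 = 119.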
    • ovl: setattr: check permissions before copy-up · 8373f659
      Miklos Szeredi authored
      commit cf9a6784 upstream.
      
      Without this, copy-up of a file can be forced, even without actually being
      allowed to do anything on the file.
      
      [Arnd Bergmann] include <linux/pagemap.h> for PAGE_CACHE_SIZE (used by
      MAX_LFS_FILESIZE definition).
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8373f659
    • ovl: root: copy attr · 7193e802
      Miklos Szeredi authored
      commit ed06e069 upstream.
      
      We copy i_uid and i_gid of the underlying inode into the overlayfs inode,
      except for the root inode.
      
      Fix this omission.
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7193e802
    • ovl: check dentry positiveness in ovl_cleanup_whiteouts() · 367e439d
      Konstantin Khlebnikov authored
      commit 84889d49 upstream.
      
      This patch fixes a kernel crash when removing a directory which contains
      whiteouts from lower layers.
      
      The cache of directory content passed as "list" contains entries from all
      layers, including whiteouts from lower layers. So a lookup in the upper dir
      (moved into work at this stage) will return a negative entry. In addition,
      this cache is filled long before and we can race with external removal.
      
      Example:
       mkdir -p lower0/dir lower1/dir upper work overlay
       touch lower0/dir/a lower0/dir/b
       mknod lower1/dir/a c 0 0
       mount -t overlay none overlay -o lowerdir=lower1:lower0,upperdir=upper,workdir=work
       rm -fr overlay/dir
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      367e439d
    • ovl: use a minimal buffer in ovl_copy_xattr · fa932190
      Vito Caputo authored
      commit e4ad29fa upstream.
      
      Rather than always allocating the high-order XATTR_SIZE_MAX buffer
      which is costly and prone to failure, only allocate what is needed and
      realloc if necessary.
      
      Fixes https://github.com/coreos/bugs/issues/489
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa932190
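      The allocate-on-demand strategy named in the commit above can be sketched in
      user space as follows. This is a generic illustration, not the overlayfs
      code: struct sketch_buf and sketch_reserve() are hypothetical, and realloc()
      stands in for whatever allocator the kernel code uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable value buffer; starts empty instead of at the maximum. */
struct sketch_buf {
	char *data;
	size_t cap;
};

/* Make sure the buffer can hold `need` bytes. It grows only when the current
 * capacity is too small. Returns 0 on success, -1 on allocation failure. */
static int sketch_reserve(struct sketch_buf *b, size_t need)
{
	char *p;

	assert(b != NULL);
	if (need <= b->cap)
		return 0; /* existing buffer is already large enough */

	p = realloc(b->data, need);
	if (p == NULL)
		return -1; /* the old buffer is still valid and owned by the caller */

	b->data = p;
	b->cap = need;
	return 0;
}
```

      A caller would start from an empty buffer, call sketch_reserve() with the
      size reported for each attribute value, and free the buffer once at the end.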
    • ovl: allow zero size xattr · 85a7ed32
      Miklos Szeredi authored
      commit 97daf8b9 upstream.
      
      When ovl_copy_xattr() encountered a zero size xattr no more xattrs were
      copied and the function returned success.  This is clearly not the desired
      behavior.
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      85a7ed32
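      The behavior described in the zero size xattr commit above can be modelled
      with two small loops. This is a simplified sketch and not the overlayfs
      source: the function names are hypothetical, and the sizes array stands in
      for the value size a getxattr-style call would report for each attribute
      (negative meaning an error, zero meaning an empty value).

```c
#include <assert.h>
#include <stddef.h>

/* Old behavior as described: a zero size ends the loop, so later attributes
 * are silently skipped. Returns how many attributes were copied. */
static int sketch_copy_buggy(const long *sizes, size_t n)
{
	int copied = 0;

	assert(sizes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (sizes[i] <= 0)
			break; /* zero is treated like "stop" */
		copied++;
	}
	return copied;
}

/* Intended behavior: only a real error (negative size) ends the loop. */
static int sketch_copy_fixed(const long *sizes, size_t n)
{
	int copied = 0;

	assert(sizes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (sizes[i] < 0)
			break;
		copied++; /* empty values are copied like any other */
	}
	return copied;
}
```

      For attribute sizes of 4, 0 and 8 bytes, the first loop copies one attribute
      and the second copies all three.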
    • futex: Drop refcount if requeue_pi() acquired the rtmutex · acaf8425
      Thomas Gleixner authored
      commit fb75a428 upstream.
      
      If the proxy lock in the requeue loop acquires the rtmutex for a
      waiter then it also acquired a refcount on the pi_state related to the
      futex, but the waiter side does not drop the reference count.
      
      Add the missing free_pi_state() call.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Bhuvanesh_Surachari@mentor.com
      Cc: Andy Lowe <Andy_Lowe@mentor.com>
      Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      acaf8425
    • devm_memremap_release(): fix memremap'd addr handling · 30066dcd
      Toshi Kani authored
      commit 9273a8bb upstream.
      
      The pmem driver calls devm_memremap() to map a persistent memory range.
      When the pmem driver is unloaded, this memremap'd range is not released
      so the kernel will leak a vma.
      
      Fix devm_memremap_release() to handle a given memremap'd address
      properly.
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      30066dcd
    • ipc/shm: handle removed segments gracefully in shm_mmap() · 15db15e2
      Kirill A. Shutemov authored
      commit 1ac0b6de upstream.
      
      remap_file_pages(2) emulation can reach a file which represents a removed
      IPC ID as long as a memory segment is mapped.  This breaks the expectations
      of the IPC subsystem.
      
      Test case (rewritten to be more human readable, originally autogenerated
      by syzkaller[1]):
      
      	#define _GNU_SOURCE
      	#include <stdlib.h>
      	#include <sys/ipc.h>
      	#include <sys/mman.h>
      	#include <sys/shm.h>
      
      	#define PAGE_SIZE 4096
      
      	int main()
      	{
      		int id;
      		void *p;
      
      		id = shmget(IPC_PRIVATE, 3 * PAGE_SIZE, 0);
      		p = shmat(id, NULL, 0);
      		shmctl(id, IPC_RMID, NULL);
      		remap_file_pages(p, 3 * PAGE_SIZE, 0, 7, 0);
      
      		return 0;
      	}
      
      The patch changes shm_mmap() and code around shm_lock() to propagate
      locking error back to caller of shm_mmap().
      
      [1] http://github.com/google/syzkaller
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      15db15e2
    • intel_scu_ipcutil: underflow in scu_reg_access() · fe90acff
      Dan Carpenter authored
      commit b1d353ad upstream.
      
      "count" is controlled by the user and it can be negative.  Let's prevent
      that by making it unsigned.  You have to have CAP_SYS_RAWIO to call this
      function so the bug is not as serious as it could be.
      
      Fixes: 5369c02d ('intel_scu_ipc: Utility driver for intel scu ipc')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fe90acff
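      The effect of making "count" unsigned, as in the commit above, can be shown
      with a pair of range checks. This is a hedged sketch rather than the driver
      code: the function names and the SKETCH_MAX_REGS bound are hypothetical, and
      the checks only illustrate how an upper-bound test treats a negative value
      under each type.

```c
#include <errno.h>

/* Hypothetical upper bound on the number of registers per request. */
#define SKETCH_MAX_REGS 4

/* Signed count: a negative value is below the upper bound and is accepted. */
static int sketch_check_signed(int count)
{
	if (count == 0 || count > SKETCH_MAX_REGS)
		return -EINVAL;
	return 0;
}

/* Unsigned count: the same bit pattern is a huge positive number, so the
 * upper-bound test rejects it. */
static int sketch_check_unsigned(unsigned int count)
{
	if (count == 0 || count > SKETCH_MAX_REGS)
		return -EINVAL;
	return 0;
}
```

      Passing -1 to the signed variant returns 0, while passing the same value
      converted to unsigned int returns -EINVAL.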
    • mm,thp: khugepaged: call pte flush at the time of collapse · edfde263
      Vineet Gupta authored
      commit 6a6ac72f upstream.
      
      This showed up on ARC when running LMBench bw_mem tests as Overlapping
      TLB Machine Check Exception triggered due to STLB entry (2M pages)
      overlapping some NTLB entry (regular 8K page).
      
      bw_mem 2m touches a large chunk of vaddr creating NTLB entries.  In the
      interim khugepaged kicks in, collapsing the contiguous ptes into a
      single pmd.  pmdp_collapse_flush()->flush_pmd_tlb_range() is called to
      flush out NTLB entries for the ptes.  This for ARC (by design) can only
      shootdown STLB entries (for pmd).  The stray NTLB entries cause the
      overlap with the subsequent STLB entry for collapsed page.  So make
      pmdp_collapse_flush() call pte flush interface not pmd flush.
      
      Note that originally all thp flush call sites in generic code called
      flush_tlb_range() leaving it to architecture to implement the flush for
      pte and/or pmd.  Commit 12ebc158 changed this by calling a new
      opt-in API flush_pmd_tlb_range() which made the semantics more explicit
      but failed to distinguish the pte vs pmd flush in generic code, which is
      what this patch fixes.
      
      Note that ARC could be fixed without touching the generic
      pmdp_collapse_flush() by defining an ARC version, but that defeats the
      purpose of the generic version, plus semantically this is the right thing
      to do.
      
      Fixes STAR 9000961194: LMBench on AXS103 triggering duplicate TLB
      exceptions with super pages
      
      Fixes: 12ebc158 ("mm,thp: introduce flush_pmd_tlb_range")
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      edfde263
    • dump_stack: avoid potential deadlocks · e31e4672
      Eric Dumazet authored
      commit d7ce3692 upstream.
      
      Some servers experienced fatal deadlocks because of a combination of
      bugs, leading to multiple cpus calling dump_stack().
      
      The checksumming bug was fixed in commit 34ae6a1a ("ipv6: update
      skb->csum when CE mark is propagated").
      
      The second problem is faulty locking in dump_stack():
      
      CPU1 runs in process context and calls dump_stack(), grabs dump_lock.
      
         CPU2 receives a TCP packet under softirq, grabs socket spinlock, and
         calls dump_stack() from netdev_rx_csum_fault().
      
         dump_stack() spins on atomic_cmpxchg(&dump_lock, -1, 2), since
         dump_lock is owned by CPU1
      
      While dumping its stack, CPU1 is interrupted by a softirq, and happens
      to process a packet for the TCP socket locked by CPU2.
      
      CPU1 spins forever in spin_lock() : deadlock
      
      Stack trace on CPU1 looked like :
      
          NMI backtrace for cpu 1
          RIP: _raw_spin_lock+0x25/0x30
          ...
          Call Trace:
            <IRQ>
            tcp_v6_rcv+0x243/0x620
            ip6_input_finish+0x11f/0x330
            ip6_input+0x38/0x40
            ip6_rcv_finish+0x3c/0x90
            ipv6_rcv+0x2a9/0x500
            process_backlog+0x461/0xaa0
            net_rx_action+0x147/0x430
            __do_softirq+0x167/0x2d0
            call_softirq+0x1c/0x30
            do_softirq+0x3f/0x80
            irq_exit+0x6e/0xc0
            smp_call_function_single_interrupt+0x35/0x40
            call_function_single_interrupt+0x6a/0x70
            <EOI>
            printk+0x4d/0x4f
            printk_address+0x31/0x33
            print_trace_address+0x33/0x3c
            print_context_stack+0x7f/0x119
            dump_trace+0x26b/0x28e
            show_trace_log_lvl+0x4f/0x5c
            show_stack_log_lvl+0x104/0x113
            show_stack+0x42/0x44
            dump_stack+0x46/0x58
            netdev_rx_csum_fault+0x38/0x3c
            __skb_checksum_complete_head+0x6e/0x80
            __skb_checksum_complete+0x11/0x20
            tcp_rcv_established+0x2bd5/0x2fd0
            tcp_v6_do_rcv+0x13c/0x620
            sk_backlog_rcv+0x15/0x30
            release_sock+0xd2/0x150
            tcp_recvmsg+0x1c1/0xfc0
            inet_recvmsg+0x7d/0x90
            sock_recvmsg+0xaf/0xe0
            ___sys_recvmsg+0x111/0x3b0
            SyS_recvmsg+0x5c/0xb0
            system_call_fastpath+0x16/0x1b
      
      Fixes: b58d9774 ("dump_stack: serialize the output from dump_stack()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e31e4672
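      The owner-word lock that the dump_stack commit above refers to (the
      atomic_cmpxchg() on dump_lock, where -1 means unowned) can be approximated
      in user space with C11 atomics. This is an illustrative sketch only: the
      sketch_* names are hypothetical, a plain integer stands in for the CPU
      number, and interrupts, which are central to the deadlock in the message,
      are not modelled at all.

```c
#include <stdatomic.h>

/* Owner word: -1 means "no owner", otherwise the id of the owning CPU. */
static atomic_int sketch_dump_lock = -1;

/* Try to take the lock for `cpu`.
 * Returns 0 if the lock was free and is now owned by `cpu`,
 *         1 if `cpu` already owned it (nested call),
 *        -1 if another CPU owns it and the caller would have to spin. */
static int sketch_try_dump_lock(int cpu)
{
	int expected = -1;

	if (atomic_compare_exchange_strong(&sketch_dump_lock, &expected, cpu))
		return 0;
	/* On failure, `expected` now holds the current owner. */
	return expected == cpu ? 1 : -1;
}

/* Release the lock by restoring the "no owner" value. */
static void sketch_dump_unlock(void)
{
	atomic_store(&sketch_dump_lock, -1);
}
```

      In the scenario from the message, CPU2 sits in the -1 case and keeps
      retrying while still holding the socket spinlock that CPU1's softirq is
      waiting for.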
    • radix-tree: fix oops after radix_tree_iter_retry · 55e0d986
      Konstantin Khlebnikov authored
      commit 73204282 upstream.
      
      Helper radix_tree_iter_retry() resets next_index to the current index.
      In the following radix_tree_next_slot() the current chunk size becomes
      zero.  This isn't checked, and it tries to dereference a null pointer in
      slot.
      
      The tagged iterator is fine because retry happens only at slot 0, where the
      tag bitmask in iter->tags is filled with a single bit.
      
      Fixes: 46437f9a ("radix-tree: fix race in gang lookup")
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ohad Ben-Cohen <ohad@wizery.com>
      Cc: Jeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      55e0d986