1. 08 Apr, 2017 7 commits
    • xfs: fix eofblocks race with file extending async dio writes · 4127a5d9
      Brian Foster authored
      commit e4229d6b upstream.
      
      It's possible for post-eof blocks to end up being used for direct I/O
      writes. dio write performs an upfront unwritten extent allocation, sends
      the dio and then updates the inode size (if necessary) on write
      completion. If a file release occurs while a file extending dio write is
      in flight, it is possible to mistake the post-eof blocks for speculative
      preallocation and incorrectly truncate them from the inode. This means
      that the resulting dio write completion can discover a hole and allocate
      new blocks rather than perform unwritten extent conversion.
      
      This requires a strange mix of I/O and is thus not likely to reproduce
      in real world workloads. It is intermittently reproduced by generic/299.
      The error manifests as an assert failure due to transaction overrun
      because the aforementioned write completion transaction has only
      reserved enough blocks for btree operations:
      
        XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
         file: fs/xfs//xfs_trans.c, line: 309
      
      The root cause is that xfs_free_eofblocks() uses i_size to truncate
      post-eof blocks from the inode, but async, file extending direct writes
      do not update i_size until write completion, long after inode locks are
      dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
      to the incorrect size.
      
      Update xfs_free_eofblocks() to serialize against dio similar to how
      extending writes are serialized against i_size updates before post-eof
      block zeroing. Specifically, wait on dio while under the iolock. This
      ensures that dio write completions have updated i_size before post-eof
      blocks are processed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4127a5d9
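The race described in the commit above can be illustrated with a small userspace model. This is only a sketch: struct model_inode and every function name below are hypothetical stand-ins, not the XFS code. It assumes that waiting on dio can be modelled by running the pending write completion before trimming.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of an inode; not the kernel's struct xfs_inode. */
struct model_inode {
    uint64_t i_size;          /* size seen by the trim path */
    uint64_t alloc_end;       /* end of allocated blocks, may lie beyond i_size */
    uint64_t dio_pending_end; /* end of an in-flight extending write, 0 if none */
};

/* Submit an extending direct write: blocks are allocated now, size moves later. */
static void dio_submit(struct model_inode *ip, uint64_t end)
{
    if (end > ip->alloc_end)
        ip->alloc_end = end;
    ip->dio_pending_end = end;
}

/* Write completion: only now does i_size move forward. */
static void dio_complete(struct model_inode *ip)
{
    if (ip->dio_pending_end > ip->i_size)
        ip->i_size = ip->dio_pending_end;
    ip->dio_pending_end = 0;
}

/*
 * Trim post-EOF blocks. With wait_for_dio set, in-flight writes are drained
 * first (modelled by completing them), so i_size has settled before trimming.
 */
static void free_eofblocks(struct model_inode *ip, bool wait_for_dio)
{
    if (wait_for_dio && ip->dio_pending_end)
        dio_complete(ip);
    if (ip->alloc_end > ip->i_size)
        ip->alloc_end = ip->i_size;
}

/* Returns true if the extending write still finds its blocks allocated. */
static bool write_blocks_survive(bool wait_for_dio)
{
    struct model_inode ip = { .i_size = 4096, .alloc_end = 4096, .dio_pending_end = 0 };
    uint64_t end = 8192;

    dio_submit(&ip, end);
    free_eofblocks(&ip, wait_for_dio);   /* file release races with the write */
    return ip.alloc_end >= end;
}
```

In this model, write_blocks_survive(false) reproduces the bug (the blocks are trimmed while the write is in flight) and write_blocks_survive(true) corresponds to the serialized behaviour.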
    • xfs: sync eofblocks scans under iolock are livelock prone · 4d725d74
      Brian Foster authored
      commit c3155097 upstream.
      
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      4d725d74
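The "drop the lock, scan, relock and redo the checks" flow from the commit above might look roughly like the following single-threaded sketch. All names (model_file, buffered_write and the helpers) are hypothetical, and the lock is reduced to a flag so that the control flow is easy to follow. It is not the XFS write path.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins: a lock flag, a check counter and a free-space counter. */
struct model_file {
    bool iolock_held;
    int  checks_run;
    int  free_blocks;
    bool scan_ran_unlocked;
};

static void iolock(struct model_file *f)   { f->iolock_held = true; }
static void iounlock(struct model_file *f) { f->iolock_held = false; }

/* Pre-write checks must be repeated every time the lock is retaken. */
static int write_checks(struct model_file *f)
{
    f->checks_run++;
    return 0;
}

/* One write attempt: fails with ENOSPC while no blocks are free. */
static int do_write(struct model_file *f)
{
    if (f->free_blocks == 0)
        return -ENOSPC;
    f->free_blocks--;
    return 0;
}

/* The scan takes other inodes' locks, so it must run with ours dropped. */
static void eofblocks_scan(struct model_file *f)
{
    f->scan_ran_unlocked = !f->iolock_held;
    f->free_blocks = 1;     /* pretend the scan reclaimed space */
}

static int buffered_write(struct model_file *f)
{
    bool scanned = false;
    int ret;

retry:
    iolock(f);
    ret = write_checks(f);
    if (ret)
        goto out;

    ret = do_write(f);
    if (ret == -ENOSPC && !scanned) {
        iounlock(f);        /* never scan while holding the lock */
        eofblocks_scan(f);
        scanned = true;
        goto retry;         /* relock and redo the checks */
    }
out:
    iounlock(f);
    return ret;
}
```

The point of the sketch is that the checks run twice for one retried write, because whatever they verified may have changed while the lock was dropped.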
    • xfs: pull up iolock from xfs_free_eofblocks() · 798b1dc5
      Brian Foster authored
      commit a36b9261 upstream.
      
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well as the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      798b1dc5
    • xfs: use per-AG reservations for the finobt · 08a2a268
      Christoph Hellwig authored
      commit 76d771b4 upstream.
      
      Currently we try to rely on the global reserved block pool for block
      allocations for the free inode btree, but I have customer reports
      (fairly complex workload, need to find an easier reproducer) where that
      is not enough as the AG where we free an inode that requires a new
      finobt block is entirely full.  This causes us to cancel a dirty
      transaction and thus a file system shutdown.
      
      I think the right way to guard against this is to treat the finobt the same
      way as the refcount btree and have a per-AG reservation for the possible
      worst case size of it, and the patch below implements that.
      
      Note that this could increase mount times with large finobt trees.  In
      an ideal world we would have added a field for the number of finobt
      blocks to the AGI, similar to what we did for the refcount blocks.
      We should add it next time we rev the AGI or AGF format by adding
      new fields.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      08a2a268
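To give a feel for what a "worst case size" reservation means for a btree, here is a generic sketch that counts the blocks of a B+tree when every block is only minimally full. The function name and its parameters are assumptions made for illustration. The kernel computes its reservation with its own helpers and geometry, which are not reproduced here.

```c
#include <stdint.h>

/*
 * Worst-case block count of a B+tree holding nrecs records when each block
 * carries only the minimum number of entries. leaf_min and node_min are the
 * minimum entries per leaf block and per interior block (hypothetical
 * parameters). Returns 0 for an empty tree or for unusable parameters.
 */
static uint64_t btree_worst_case_blocks(uint64_t nrecs, uint64_t leaf_min,
                                        uint64_t node_min)
{
    uint64_t per_block = leaf_min;
    uint64_t level_blocks = nrecs;
    uint64_t total = 0;

    if (nrecs == 0 || leaf_min == 0 || node_min < 2)
        return 0;

    for (;;) {
        /* blocks needed at this level, rounding up */
        level_blocks = (level_blocks + per_block - 1) / per_block;
        total += level_blocks;
        if (level_blocks == 1)
            return total;       /* reached the root */
        per_block = node_min;   /* upper levels index the blocks below */
    }
}
```

For example, 1000 records with a minimum of 10 entries per block need 100 leaves, 10 interior blocks and one root, so the function returns 111.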
    • xfs: only update mount/resv fields on success in __xfs_ag_resv_init · 9be1c33d
      Christoph Hellwig authored
      commit 4dfa2b84 upstream.
      
      Try to reserve the blocks first and only then update the fields in
      or hanging off the mount structure.  This way we can call __xfs_ag_resv_init
      again after a previous failure.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9be1c33d
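The ordering described above (reserve first, update bookkeeping only on success) can be sketched as follows. The structures and function names are simplified, hypothetical stand-ins, not the real mount or per-AG reservation types.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the mount and reservation state. */
struct model_mount { uint64_t free_blocks; uint64_t resv_total; };
struct model_resv  { uint64_t asked; uint64_t reserved; };

/* Take blocks out of the free pool, or fail without side effects. */
static int take_free_blocks(struct model_mount *mp, uint64_t n)
{
    if (n > mp->free_blocks)
        return -ENOSPC;
    mp->free_blocks -= n;
    return 0;
}

static int resv_init(struct model_mount *mp, struct model_resv *r, uint64_t ask)
{
    int error = take_free_blocks(mp, ask);

    if (error)
        return error;       /* nothing was touched, so a retry is safe */

    /* Only after the reservation succeeded is any state updated. */
    mp->resv_total += ask;
    r->asked = ask;
    r->reserved = ask;
    return 0;
}
```

Because a failed call leaves both structures untouched, the caller can call resv_init() again later with a smaller request.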
    • xen/setup: Don't relocate p2m over existing one · 8b08aec6
      Ross Lagerwall authored
      commit 7ecec850 upstream.
      
      When relocating the p2m, take special care not to relocate it so
      that it overlaps with the current location of the p2m/initrd. This is
      needed since the full extent of the current location is not marked as a
      reserved region in the e820.
      
      This was seen to happen to a dom0 with a large initial p2m and a small
      reserved region in the middle of the initial p2m.
      Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b08aec6
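The overlap check implied by the commit above is a plain interval test. The sketch below is a generic illustration: struct mem_range, pick_relocation_target() and the candidate list are hypothetical, and the real code works on the e820 map rather than a flat array.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open physical range [start, start + size). */
struct mem_range { uint64_t start; uint64_t size; };

static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
    return a.start < b.start + b.size && b.start < a.start + a.size;
}

/*
 * Pick the first candidate that overlaps neither the current p2m nor the
 * initrd. Returns the index of the chosen candidate, or -1 if none fits.
 */
static int pick_relocation_target(const struct mem_range *cand, int ncand,
                                  struct mem_range cur_p2m,
                                  struct mem_range initrd)
{
    for (int i = 0; i < ncand; i++) {
        if (ranges_overlap(cand[i], cur_p2m) || ranges_overlap(cand[i], initrd))
            continue;   /* would clobber data that is still in use */
        return i;
    }
    return -1;
}
```

A candidate that merely touches the end of the current p2m is accepted, since the ranges are half-open.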
    • libceph: force GFP_NOIO for socket allocations · 86015377
      Ilya Dryomov authored
      commit 633ee407 upstream.
      
      sock_alloc_inode() allocates socket+inode and socket_wq with
      GFP_KERNEL, which is not allowed on the writeback path:
      
          Workqueue: ceph-msgr con_work [libceph]
          ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
          0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
          ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
          Call Trace:
          [<ffffffff816dd629>] schedule+0x29/0x70
          [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
          [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
          [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
          [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
          [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
          [<ffffffff81086335>] flush_work+0x165/0x250
          [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
          [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
          [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
          [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
          [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
          [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
          [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
          [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
          [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
          [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
          [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
          [<ffffffff811c0c18>] super_cache_scan+0x178/0x180
          [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340
          [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450
          [<ffffffff8115af70>] shrink_slab+0x100/0x140
          [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490
          [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0
          [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be
          [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40
          [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110
          [<ffffffff811a0ac5>] new_slab+0x2c5/0x390
          [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0
          [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0
          [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0
          [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0
          [<ffffffff811d8566>] alloc_inode+0x26/0xa0
          [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70
          [<ffffffff815b933e>] sock_alloc+0x1e/0x80
          [<ffffffff815ba855>] __sock_create+0x95/0x220
          [<ffffffff815baa04>] sock_create_kern+0x24/0x30
          [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
          [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
          [<ffffffff81084c19>] process_one_work+0x159/0x4f0
          [<ffffffff8108561b>] worker_thread+0x11b/0x530
          [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
          [<ffffffff8108b6f9>] kthread+0xc9/0xe0
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
          [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
          [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
      
      Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
      
      Link: http://tracker.ceph.com/issues/19309
      Reported-by: Sergey Jerusalimov <wintchester@gmail.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      86015377
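The save/restore idea behind memalloc_noio_{save,restore}() can be modelled in userspace as a scoped flag. The code below is only such a model: the model_* names, the flag bit and the global task_flags word are assumptions made for illustration and do not reproduce the kernel implementation.

```c
#include <stdbool.h>

#define MODEL_PF_NOIO 0x1u      /* hypothetical bit, stands in for the task flag */

static unsigned int task_flags; /* models the per-task flags word */

/* Set the NOIO bit and return its previous value so that scopes can nest. */
static unsigned int model_noio_save(void)
{
    unsigned int old = task_flags & MODEL_PF_NOIO;

    task_flags |= MODEL_PF_NOIO;
    return old;
}

/* Put the NOIO bit back to what it was before the matching save. */
static void model_noio_restore(unsigned int old)
{
    task_flags = (task_flags & ~MODEL_PF_NOIO) | old;
}

/* An allocator would consult this to decide whether reclaim may issue I/O. */
static bool alloc_may_do_io(void)
{
    return !(task_flags & MODEL_PF_NOIO);
}

/* Stand-in for creating a socket from the messenger worker. */
static bool create_socket_noio(void)
{
    unsigned int old = model_noio_save();
    bool io_allowed_inside = alloc_may_do_io(); /* what a nested allocation sees */

    model_noio_restore(old);
    return io_allowed_inside;
}
```

Returning the old value from the save call is what makes nesting safe: an inner scope restores the bit to "set" rather than clearing it for the outer scope.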
  2. 31 Mar, 2017 17 commits
  3. 30 Mar, 2017 16 commits
    • Linux 4.9.19 · c8e13160
      Greg Kroah-Hartman authored
      c8e13160
    • crypto: algif_hash - avoid zero-sized array · bc959a40
      Jiri Slaby authored
      commit 62071194 upstream.
      
      With this reproducer:
        struct sockaddr_alg alg = {
                .salg_family = 0x26,
                .salg_type = "hash",
                .salg_feat = 0xf,
                .salg_mask = 0x5,
                .salg_name = "digest_null",
        };
        int sock, sock2;
      
        sock = socket(AF_ALG, SOCK_SEQPACKET, 0);
        bind(sock, (struct sockaddr *)&alg, sizeof(alg));
        sock2 = accept(sock, NULL, NULL);
        setsockopt(sock, SOL_ALG, ALG_SET_KEY, "\x9b\xca", 2);
        accept(sock2, NULL, NULL);
      
      ==== 8< ======== 8< ======== 8< ======== 8< ====
      
      one can immediately see an UBSAN warning:
      UBSAN: Undefined behaviour in crypto/algif_hash.c:187:7
      variable length array bound value 0 <= 0
      CPU: 0 PID: 15949 Comm: syz-executor Tainted: G            E      4.4.30-0-default #1
      ...
      Call Trace:
      ...
       [<ffffffff81d598fd>] ? __ubsan_handle_vla_bound_not_positive+0x13d/0x188
       [<ffffffff81d597c0>] ? __ubsan_handle_out_of_bounds+0x1bc/0x1bc
       [<ffffffffa0e2204d>] ? hash_accept+0x5bd/0x7d0 [algif_hash]
       [<ffffffffa0e2293f>] ? hash_accept_nokey+0x3f/0x51 [algif_hash]
       [<ffffffffa0e206b0>] ? hash_accept_parent_nokey+0x4a0/0x4a0 [algif_hash]
       [<ffffffff8235c42b>] ? SyS_accept+0x2b/0x40
      
      It is a correct warning, as hash state is propagated to accept as zero,
      but creating a zero-length variable array is not allowed in C.
      
      Fix this as proposed by Herbert -- do "?: 1" on that site. No sizeof or
      similar happens in the code there, so we just allocate one byte even
      though we do not use the array.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net> (maintainer:CRYPTO API)
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc959a40
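The "?: 1" fix mentioned above boils down to never giving a variable-length array a bound of zero. A minimal userspace sketch follows, with hypothetical function names and the GNU "?:" shorthand spelled out portably. It is not the algif_hash code.

```c
#include <stddef.h>
#include <string.h>

/* A variable-length array needs a positive bound, so map 0 to 1. */
static size_t vla_bound(size_t statesize)
{
    return statesize ? statesize : 1;   /* portable spelling of "?: 1" */
}

/* Copies statesize bytes through a stack buffer; returns the bytes copied. */
static size_t copy_state(const unsigned char *src, unsigned char *dst,
                         size_t statesize)
{
    unsigned char state[vla_bound(statesize)];  /* never zero-sized */

    memcpy(state, src, statesize);
    memcpy(dst, state, statesize);
    return statesize;
}
```

When statesize is zero the one-byte array is allocated but never read or written, which matches the commit's remark that the array is simply unused in that case.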
    • fbcon: Fix vc attr at deinit · 3fd37725
      Takashi Iwai authored
      commit 8aac7f34 upstream.
      
      fbcon can deal with vc_hi_font_mask (the upper 256 chars) and adjust
      the vc attrs dynamically when vc_hi_font_mask is changed at
      fbcon_init().  When the vc_hi_font_mask is set, it remaps the attrs in
      the existing console buffer with one bit shift up (for 9 bits), while
      it remaps with one bit shift down (for 8 bits) when the value is
      cleared.  It works fine as long as the font gets updated after fbcon
      was initialized.
      
      However, we hit a bizarre problem when the console is switched to
      another fb driver (typically from vesafb or efifb to drmfb).  At
      switching to the new fb driver, we temporarily rebind the console to
      the dummy console, then rebind to the new driver.  During the
      switching, we leave the modified attrs as is.  Thus, the new fbcon
      takes over the old buffer as if it contained 8-bit chars
      (although the attrs are still shifted for 9 bits), and effectively
      this results in yellow text instead of the original white
      text, as found in the bugzilla entry below.
      
      An easy fix for this is to re-adjust the attrs before leaving the
      fbcon at con_deinit callback.  Since the code to adjust the attrs is
      already present in the current fbcon code, in this patch, we simply
      factor out the relevant code, and call it from fbcon_deinit().
      
      Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1000619
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3fd37725
    • drm: reference count event->completion · c75fe789
      Daniel Vetter authored
      commit 24835e44 upstream.
      
      When writing the generic nonblocking commit code I assumed that
      through clever lifetime management I can assure that the completion
      (stored in drm_crtc_commit) only gets freed after it is completed. And
      that worked.
      
      I also wanted to make nonblocking helpers resilient against driver
      bugs, by having timeouts everywhere. And that worked too.
      
      Unfortunately taking both things together results in oopses :( Well,
      at least sometimes: What seems to happen is that the drm event hangs
      around forever stuck in limbo land. The nonblocking helpers eventually
      time out, move on and release it. Now the bug I tested all this
      against is drivers that just entirely fail to deliver the vblank
      events like they should, and in those cases the event is simply
      leaked. But what seems to happen, at least sometimes, on i915 is that
      the event is set up correctly, but somehow the vblank fails to fire in
      time. Which means the event isn't leaked, it's still there waiting for
      a vblank to eventually fire. That tends to happen when re-enabling the
      pipe, and then the trap springs and the kernel oopses.
      
      The correct fix here is simply to refcount the crtc commit to make
      sure that the event sticks around even for drivers which only
      sometimes fail to deliver vblanks for some arbitrary reasons. Since
      crtc commits are already refcounted that's easy to do.
      
      References: https://bugs.freedesktop.org/show_bug.cgi?id=96781
      Cc: Jim Rees <rees@umich.edu>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20161221102331.31033-1-daniel.vetter@ffwll.ch
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c75fe789
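The lifetime problem from the commit above (a helper times out and releases an object that a late event still points at) is the textbook case for reference counting. The following userspace sketch uses hypothetical names and a plain integer counter. It only models the idea and is not the drm_crtc_commit code.

```c
#include <stdbool.h>

/* Hypothetical model of a refcounted commit object that embeds a completion. */
struct model_commit {
    int  refs;
    bool completed;
    bool freed;
};

static void commit_get(struct model_commit *c) { c->refs++; }

static void commit_put(struct model_commit *c)
{
    if (--c->refs == 0)
        c->freed = true;    /* a real implementation would release memory here */
}

/* Arm an event: the event takes its own reference on the commit. */
static void event_arm(struct model_commit *c) { commit_get(c); }

/* Late vblank: signal the completion, then drop the event's reference. */
static bool event_fire(struct model_commit *c)
{
    bool was_alive = !c->freed;

    c->completed = true;
    commit_put(c);
    return was_alive;
}
```

With the event holding its own reference, the helper can time out and drop its reference without freeing the object underneath a vblank that fires late.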
    • nl80211: fix dumpit error path RTNL deadlocks · 56769e7a
      Johannes Berg authored
      commit ea90e0dc upstream.
      
      Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
      to be perfectly accurate - there are various error paths that miss unlock
      of the RTNL.
      
      To fix those, change the locking a bit to not be conditional in all those
      nl80211_prepare_*_dump() functions, but make those require the RTNL to
      start with, and fix the buggy error paths. This also let me use sparse
      (by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
      validate the changes.
      Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      56769e7a
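The locking change described above, where the prepare helper requires the lock and the caller unlocks on every exit, can be sketched with a counter standing in for the RTNL. All names are hypothetical and the dump itself is elided. This is an illustration of the pattern, not the nl80211 code.

```c
#include <errno.h>
#include <stdbool.h>

static int lock_depth;  /* models the RTNL: 1 while held, 0 otherwise */

static void model_lock(void)   { lock_depth++; }
static void model_unlock(void) { lock_depth--; }

/* Requires the lock to be held by the caller; never takes or drops it. */
static int prepare_dump(bool have_wdev)
{
    if (!have_wdev)
        return -ENODEV;
    return 0;
}

/* Every exit, success or error, goes through the single unlock label. */
static int dump_handler(bool have_wdev)
{
    int err;

    model_lock();
    err = prepare_dump(have_wdev);
    if (err)
        goto out_unlock;

    /* ... fill the dump message here ... */
    err = 0;
out_unlock:
    model_unlock();
    return err;
}
```

Keeping the helper free of conditional locking means an error path cannot return with the lock still held, which is the class of bug the commit fixes.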
    • drm/bridge: analogix dp: Fix runtime PM state on driver bind · 7b3c8b2a
      Marek Szyprowski authored
      commit f0a8b49c upstream.
      
      analogix_dp_bind() can be called from the component framework, which doesn't
      guarantee the proper runtime PM state of the device during the bind operation,
      so ensure that the device is runtime active before doing any register access.
      This ensures that the power domain to which the DP module belongs is turned
      on. While at it, also fix the unbalanced call to phy_power_on() in
      the analogix_dp_bind() function.
      
      This patch solves the following kernel oops on Samsung Exynos5250 Snow
      board:
      
      Unhandled fault: imprecise external abort (0x406) at 0x00000000
      pgd = c0004000
      [00000000] *pgd=00000000
      Internal error: : 406 [#1] PREEMPT SMP ARM
      Modules linked in:
      CPU: 0 PID: 75 Comm: kworker/0:2 Not tainted 4.9.0 #1046
      Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
      Workqueue: events deferred_probe_work_func
      task: ee272300 task.stack: ee312000
      PC is at analogix_dp_enable_sw_function+0x18/0x2c
      LR is at analogix_dp_init_dp+0x2c/0x50
      ...
      [<c03fcb38>] (analogix_dp_enable_sw_function) from [<c03fa9c4>] (analogix_dp_init_dp+0x2c/0x50)
      [<c03fa9c4>] (analogix_dp_init_dp) from [<c03fab6c>] (analogix_dp_bind+0x184/0x42c)
      [<c03fab6c>] (analogix_dp_bind) from [<c03fdb84>] (component_bind_all+0xf0/0x218)
      [<c03fdb84>] (component_bind_all) from [<c03ed64c>] (exynos_drm_load+0x134/0x200)
      [<c03ed64c>] (exynos_drm_load) from [<c03d5058>] (drm_dev_register+0xa0/0xd0)
      [<c03d5058>] (drm_dev_register) from [<c03d66b8>] (drm_platform_init+0x58/0xb0)
      [<c03d66b8>] (drm_platform_init) from [<c03fe0c4>] (try_to_bring_up_master+0x14c/0x188)
      [<c03fe0c4>] (try_to_bring_up_master) from [<c03fe188>] (component_add+0x88/0x138)
      [<c03fe188>] (component_add) from [<c0403a38>] (platform_drv_probe+0x50/0xb0)
      [<c0403a38>] (platform_drv_probe) from [<c0402470>] (driver_probe_device+0x1f0/0x2a8)
      [<c0402470>] (driver_probe_device) from [<c0400a54>] (bus_for_each_drv+0x44/0x8c)
      [<c0400a54>] (bus_for_each_drv) from [<c04021f8>] (__device_attach+0x9c/0x100)
      [<c04021f8>] (__device_attach) from [<c04018e8>] (bus_probe_device+0x84/0x8c)
      [<c04018e8>] (bus_probe_device) from [<c0401d1c>] (deferred_probe_work_func+0x60/0x8c)
      [<c0401d1c>] (deferred_probe_work_func) from [<c012fc14>] (process_one_work+0x120/0x318)
      [<c012fc14>] (process_one_work) from [<c012fe34>] (process_scheduled_works+0x28/0x38)
      [<c012fe34>] (process_scheduled_works) from [<c0130048>] (worker_thread+0x204/0x4ac)
      [<c0130048>] (worker_thread) from [<c01352c4>] (kthread+0xd8/0xf4)
      [<c01352c4>] (kthread) from [<c0107978>] (ret_from_fork+0x14/0x3c)
      Code: e59035f0 e5935018 f57ff04f e3c55001 (f57ff04e)
      ---[ end trace 3d1d0d87796de344 ]---
      Reviewed-by: Sean Paul <seanpaul@chromium.org>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Archit Taneja <architt@codeaurora.org>
      Link: http://patchwork.freedesktop.org/patch/msgid/1483091866-1088-1-git-send-email-m.szyprowski@samsung.com
      Cc: Javier Martinez Canillas <javier@osg.samsung.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b3c8b2a
    • device-dax: fix pmd/pte fault fallback handling · eae72468
      Dave Jiang authored
      commit 0134ed4f upstream.
      
      Jeff Moyer reports:
      
          With a device dax alignment of 4KB or 2MB, I get sigbus when running
          the attached fio job file for the current kernel (4.11.0-rc1+).  If
          I specify an alignment of 1GB, it works.
      
          I turned on debug output, and saw that it was failing in the huge
          fault code.
      
           dax dax1.0: dax_open
           dax dax1.0: dax_mmap
           dax dax1.0: dax_dev_huge_fault: fio: write (0x7f08f0a00000 -
           dax dax1.0: __dax_dev_pud_fault: phys_to_pgoff(0xffffffffcf60
           dax dax1.0: dax_release
      
          fio config for reproduce:
          [global]
          ioengine=dev-dax
          direct=0
          filename=/dev/dax0.0
          bs=2m
      
          [write]
          rw=write
      
          [read]
          stonewall
          rw=read
      
      The driver fails to fall back when taking a fault that is larger than
      the device alignment, or when handling a larger fault while a smaller
      mapping is already established. While we could support larger
      mappings for a device with a smaller alignment, that change is
      too large for the immediate fix. The simplest change is to force
      fallback until the fault size matches the alignment.
      
      Fixes: dee41079 ("/dev/dax, core: file operations and dax-mmap")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eae72468
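"Force fallback until the fault size matches the alignment" can be pictured as the small decision function below. The enum, the size macros and both function names are assumptions chosen for illustration (the sizes are the usual x86-64 page, PMD and PUD sizes). The driver's real fault handlers do considerably more.

```c
#include <stddef.h>

enum fault_result { FAULT_HANDLED, FAULT_FALLBACK };

/* Hypothetical constants for the three mapping sizes. */
#define SZ_PTE ((size_t)4 << 10)    /* 4 KiB */
#define SZ_PMD ((size_t)2 << 20)    /* 2 MiB */
#define SZ_PUD ((size_t)1 << 30)    /* 1 GiB */

/* Handle a fault only when its size equals the device alignment. */
static enum fault_result dax_fault(size_t fault_size, size_t dev_align)
{
    if (fault_size != dev_align)
        return FAULT_FALLBACK;
    return FAULT_HANDLED;
}

/* Try the largest size first and fall back until one is handled. */
static size_t resolve_fault(size_t dev_align)
{
    static const size_t sizes[] = { SZ_PUD, SZ_PMD, SZ_PTE };

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        if (dax_fault(sizes[i], dev_align) == FAULT_HANDLED)
            return sizes[i];
    return 0;   /* no size matched */
}
```

With a 2 MiB alignment, the 1 GiB attempt falls back and the 2 MiB attempt is handled, instead of the oversized fault failing outright as in the report.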
    • libceph: don't set weight to IN when OSD is destroyed · 81ec3dc1
      Ilya Dryomov authored
      commit b581a585 upstream.
      
      Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
      osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
      This changes the result of applying an incremental for clients, not
      just OSDs.  Because CRUSH computations are obviously affected,
      pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
      object placement, resulting in misdirected requests.
      
      Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.
      
      Fixes: 930c5328 ("libceph: apply new_state before new_up_client on incrementals")
      Link: http://tracker.ceph.com/issues/19122
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Sage Weil <sage@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      81ec3dc1
    • Drivers: hv: vmbus: Don't leak memory when a channel is rescinded · df1fe6c9
      K. Y. Srinivasan authored
      commit 5e030d5c upstream.
      
      When we close a channel that has been rescinded, we will leak memory since
      vmbus_teardown_gpadl() returns an error. Fix this so that we can properly
      clean up the memory allocated to the ring buffers.
      
      Fixes: ccb61f8a ("Drivers: hv: vmbus: Fix a rescind handling bug")
      Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      df1fe6c9
    • Drivers: hv: vmbus: Don't leak channel ids · b1f6b0a5
      K. Y. Srinivasan authored
      commit 9a547602 upstream.
      
      If we cannot allocate memory for the channel, free the relid
      associated with the channel.
      Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b1f6b0a5
    • intel_th: Don't leak module refcount on failure to activate · 3076066b
      Alexander Shishkin authored
      commit e609ccef upstream.
      
      Output 'activation' may fail for reasons specific to the output driver,
      for example, if msc's buffer is not allocated. We forget, however,
      to drop the module reference in this case. So each attempt at
      activation in this case leaks a reference, preventing the module
      from ever unloading.
      
      This patch adds the missing module_put() in the activation error
      path.
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3076066b
    • jbd2: don't leak memory if setting up journal fails · b176a6ee
      Eric Biggers authored
      commit cd9cb405 upstream.
      
      In journal_init_common(), if we failed to allocate the j_wbuf array, or
      if we failed to create the buffer_head for the journal superblock, we
      leaked the memory allocated for the revocation tables.  Fix this.
      
      Fixes: f0c9fd54
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b176a6ee
    • auxdisplay: img-ascii-lcd: add missing sentinel entry in img_ascii_lcd_matches · 90f39ad2
      Dmitry Torokhov authored
      commit abda288b upstream.
      
      The OF device table must be terminated, otherwise we'll be walking past it
      and into areas unknown.
      
      Fixes: 0cad855f ("auxdisplay: img-ascii-lcd: driver for simple ASCII...")
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      90f39ad2
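Why a match table needs a terminating entry is easy to see in a cut-down userspace analogue. The structure, the compatible strings and the lookup function below are made up for illustration. The real table uses struct of_device_id and is walked by the OF core.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down analogue of an OF match table entry. */
struct match_entry {
    const char *compatible; /* NULL marks the end of the table */
    int data;
};

static const struct match_entry lcd_matches[] = {
    { "vendor,board-a", 1 },    /* made-up compatible strings */
    { "vendor,board-b", 2 },
    { NULL, 0 },                /* sentinel: without it the walk runs off the array */
};

/* Walk the table until the sentinel; return the entry's data or -1. */
static int match_compatible(const struct match_entry *table, const char *name)
{
    for (; table->compatible; table++)
        if (strcmp(table->compatible, name) == 0)
            return table->data;
    return -1;
}
```

The walker has no length argument, so the sentinel is the only thing that stops a lookup for an unknown name from reading past the end of the array.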
    • drm/amdgpu: reinstate oland workaround for sclk · 9740abe0
      Alex Deucher authored
      commit e11ddff6 upstream.
      
      Higher sclks seem to be unstable on some boards.
      
      bug: https://bugs.freedesktop.org/show_bug.cgi?id=100222
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9740abe0
    • blk-mq: don't complete un-started request in timeout handler · 21d17f1b
      Ming Lei authored
      commit 95a49603 upstream.
      
      When iterating busy requests in the timeout handler,
      if the STARTED flag of a request isn't set, the request
      is still being processed in the block layer or driver and
      has not been submitted to hardware yet.
      
      In the current implementation of blk_mq_check_expired(),
      if the request queue becomes dying, un-started requests are
      handled as being completed/freed immediately. This is
      wrong, and can cause rq corruption or double allocation[1][2]
      when doing I/O and removing and resetting an NVMe device at the same time.
      
      This patch fixes several issues reported by Yi Zhang.
      
      [1]. oops log 1
      [  581.789754] ------------[ cut here ]------------
      [  581.789758] kernel BUG at block/blk-mq.c:374!
      [  581.789760] invalid opcode: 0000 [#1] SMP
      [  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
      irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
      intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
      intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
      pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
      sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
      crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
      dm_region_hash dm_log dm_mod
      [  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
      [  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  581.789804] Workqueue: kblockd blk_mq_timeout_work
      [  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
      [  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
      [  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
      [  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
      [  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
      [  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
      [  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
      [  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
      [  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
      [  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
      [  581.789820] Call Trace:
      [  581.789825]  blk_mq_check_expired+0x76/0x80
      [  581.789828]  bt_iter+0x45/0x50
      [  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
      [  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789840]  ? __switch_to+0x140/0x450
      [  581.789841]  blk_mq_timeout_work+0x88/0x170
      [  581.789845]  process_one_work+0x165/0x410
      [  581.789847]  worker_thread+0x137/0x4c0
      [  581.789851]  kthread+0x101/0x140
      [  581.789853]  ? rescuer_thread+0x3b0/0x3b0
      [  581.789855]  ? kthread_park+0x90/0x90
      [  581.789860]  ret_from_fork+0x2c/0x40
      [  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
      8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
      0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
      [  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
      [  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---
      
      [2]. oops log 2
      [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857373] PGD 0
      [ 6984.857374]
      [ 6984.857376] Oops: 0000 [#1] SMP
      [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
      irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
      iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
      mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
      acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
      sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
      libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
      dm_region_hash dm_log dm_mod
      [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
      4.10.0-2.el7.bz1420297.x86_64 #1
      [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
      [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
      [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
      [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
      [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
      [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
      [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
      [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
      [ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
      [ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
      [ 6984.857443] Call Trace:
      [ 6984.857451]  ? mempool_free+0x2b/0x80
      [ 6984.857455]  ? bio_free+0x4e/0x60
      [ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
      [ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
      [ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
      [ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
      [ 6984.857473]  process_one_work+0x165/0x410
      [ 6984.857475]  worker_thread+0x137/0x4c0
      [ 6984.857478]  kthread+0x101/0x140
      [ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
      [ 6984.857481]  ? kthread_park+0x90/0x90
      [ 6984.857489]  ret_from_fork+0x2c/0x40
      [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
      89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
      8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
      [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
      [ 6984.857512] CR2: 0000000000000010
      [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---
      Reported-by: Yi Zhang <yizhan@redhat.com>
      Tested-by: Yi Zhang <yizhan@redhat.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      21d17f1b
    • Tejun Heo's avatar
      cgroup, net_cls: iterate the fds of only the tasks which are being migrated · 62f6341c
      Tejun Heo authored
      commit a05d4fd9 upstream.
      
      The net_cls controller controls the classid field of each socket which
      is associated with the cgroup.  Because the classid is a per-socket
      attribute, when a task migrates to another cgroup or the configured
      classid of the cgroup changes, the controller needs to walk all
      sockets and update the classid value, which was implemented by
      3b13758f ("cgroups: Allow dynamically changing net_classid").
      
      While the approach is not scalable, migrating tasks which have a lot
      of fds attached to them is rare and the cost is borne by the ones
      initiating the operations.  However, for simplicity, both the
      migration and classid config change paths call update_classid() which
      scans all fds of all tasks in the target css.  This is overkill for
      the migration path which only needs to cover a much smaller subset of
      tasks which are actually getting migrated in.
      
      On cgroup v1, this can lead to unexpected scalability issues when one
      tries to migrate a task or process into a net_cls cgroup which already
      contains a lot of fds.  Even if the migration target doesn't have many
      fds to get scanned, update_classid() ends up scanning all fds in the
      target cgroup which can be extremely numerous.
      
      Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
      even worse.  Before bfc2cf6f ("cgroup: call subsys->*attach() only
      for subsystems which are actually affected by migration"), cgroup core
      would call the ->css_attach callback even for controllers which don't
      see actual migration to a different css.
      
      As net_cls is always disabled but still mounted on cgroup v2, whenever
      a process is migrated on the cgroup v2 hierarchy, net_cls sees
      identity migration from root to root and cgroup core used to call the
      ->css_attach callback for those.  The net_cls ->css_attach ends up
      calling update_classid() on the root net_cls css to which all
      processes on the system belong as the controller isn't used.  This
      makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
      which is horrible and easily leads to noticeable stalls triggering RCU
      stall warnings and so on.
      
      The worst symptom is already fixed in upstream by bfc2cf6f
      ("cgroup: call subsys->*attach() only for subsystems which are
      actually affected by migration"); however, backporting that commit is
      too invasive and we want to avoid other cases too.
      
      This patch updates net_cls's cgrp_attach() to iterate fds of only the
      processes which are actually getting migrated.  This removes the
      surprising migration cost which is dependent on the total number of
      fds in the target cgroup.  As this leaves write_classid() the only
      user of update_classid(), open-code the helper into write_classid().
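
The cost difference between the two approaches can be illustrated with a small user-space model. This is a sketch under stated assumptions rather than the patch itself: model_task, model_fds_scanned_all and model_fds_scanned_migrating are hypothetical names, and the model only counts fds instead of updating any socket classid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a task in the target cgroup; not kernel API. */
struct model_task {
    size_t nr_fds;   /* number of open fds held by the task */
    int migrating;   /* non-zero if the task is part of this migration */
};

/* Models the old migration path: every fd of every task already in
 * the target group is scanned. */
static size_t model_fds_scanned_all(const struct model_task *tasks, size_t n)
{
    size_t scanned = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        scanned += tasks[i].nr_fds;
    return scanned;
}

/* Models the patched migration path: only the fds of tasks that are
 * actually being migrated are scanned. */
static size_t model_fds_scanned_migrating(const struct model_task *tasks,
                                          size_t n)
{
    size_t scanned = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (tasks[i].migrating)
            scanned += tasks[i].nr_fds;
    return scanned;
}
```

With two resident tasks holding 100000 and 50000 fds and one incoming task holding 3 fds, the first function counts 150003 fds and the second counts 3, which is the gap the commit message describes.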
      Reported-by: David Goode <dgoode@fb.com>
      Fixes: 3b13758f ("cgroups: Allow dynamically changing net_classid")
      Cc: Nina Schiff <ninasc@fb.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      62f6341c