1. 23 Mar, 2017 10 commits
    • Jan Kara's avatar
      block: Fix oops scsi_disk_get() · d01b2dcb
      Jan Kara authored
      When device open races with device shutdown, we can get the following
      oops in scsi_disk_get():
      
      [11863.044351] general protection fault: 0000 [#1] SMP
      [11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
      [11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W      4.10.0-rc2-xen+ #35
      [11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
      [11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
      [11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
      [11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
      [11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
      [11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
      [11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
      [11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
      [11863.059217] FS:  00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
      [11863.059217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
      [11863.059217] Call Trace:
      [11863.059217]  ? disk_get_part+0x22/0x1f0
      [11863.059217]  sd_open+0x39/0x130
      [11863.059217]  __blkdev_get+0x69/0x430
      [11863.059217]  ? bd_acquire+0x7f/0xc0
      [11863.059217]  ? bd_acquire+0x96/0xc0
      [11863.059217]  ? blkdev_get+0x350/0x350
      [11863.059217]  blkdev_get+0x126/0x350
      [11863.059217]  ? _raw_spin_unlock+0x2b/0x40
      [11863.059217]  ? bd_acquire+0x7f/0xc0
      [11863.059217]  ? blkdev_get+0x350/0x350
      [11863.059217]  blkdev_open+0x65/0x80
      ...
      
      As you can see RAX value is already poisoned showing that gendisk we got
      is already freed. The problem is that get_gendisk() looks up device
      number in ext_devt_idr and then does get_disk() which does kobject_get()
      on the disks kobject. However the disk gets removed from ext_devt_idr
      only in disk_release() (through blk_free_devt()) at which moment it has
      already 0 refcount and is already on its way to be freed. Indeed we've
      got a warning from kobject_get() about 0 refcount shortly before the
      oops.
      
      We fix the problem by using kobject_get_unless_zero() in get_disk() so
      that get_disk() cannot get reference on a disk that is already being
      freed.
      Tested-by: default avatarLekshmi Pillai <lekshmicpillai@in.ibm.com>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d01b2dcb
    • Jan Kara's avatar
      kobject: Export kobject_get_unless_zero() · c70c176f
      Jan Kara authored
      Make the function available for outside use and fortify it against NULL
      kobject.
      
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c70c176f
    • Jan Kara's avatar
      block: Fix oops in locked_inode_to_wb_and_lock_list() · f759741d
      Jan Kara authored
      When block device is closed, we call inode_detach_wb() in __blkdev_put()
      which sets inode->i_wb to NULL. That is contrary to expectations that
      inode->i_wb stays valid once set during the whole inode's lifetime and
      leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
      inode_to_wb() returned NULL.
      
      The reason why we called inode_detach_wb() is not valid anymore though.
      BDI is guaranteed to stay along until we call bdi_put() from
      bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
      moment.
      
      Also add a warning to catch if someone uses inode_detach_wb() in a
      dangerous way.
      Reported-by: default avatarThiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      f759741d
    • Jan Kara's avatar
      bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() · b1c51afc
      Jan Kara authored
      Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
      from bdi_unregister() which is not necessarily called from bdi_destroy()
      and thus the name is somewhat misleading.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b1c51afc
    • Jan Kara's avatar
      bdi: Do not wait for cgwbs release in bdi_unregister() · 4514451e
      Jan Kara authored
      Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
      (called from bdi_unregister()). That is however unnecessary now when
      cgwb->bdi is a proper refcounted reference (thus bdi cannot get
      released before all cgwbs are released) and when cgwb_bdi_destroy()
      shuts down writeback directly.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      4514451e
    • Jan Kara's avatar
      bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() · 5318ce7d
      Jan Kara authored
      Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
      which also means that writeback has been shutdown on them. Since this
      wait is going away, directly shutdown writeback on cgwbs from
      cgwb_bdi_destroy() to avoid live writeback structures after
      bdi_unregister() has finished. To make that safe with concurrent
      shutdown from cgwb_release_workfn(), we also have to make sure
      wb_shutdown() returns only after the bdi_writeback structure is really
      shutdown.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      5318ce7d
    • Jan Kara's avatar
      bdi: Unify bdi->wb_list handling for root wb_writeback · e8cb72b3
      Jan Kara authored
      Currently root wb_writeback structure is added to bdi->wb_list in
      bdi_init() and never removed. That is different from all other
      wb_writeback structures which get added to the list when created and
      removed from it before wb_shutdown().
      
      So move list addition of root bdi_writeback to bdi_register() and list
      removal of all wb_writeback structures to wb_shutdown(). That way a
      wb_writeback structure is on bdi->wb_list if and only if it can handle
      writeback and it will make it easier for us to handle shutdown of all
      wb_writeback structures in bdi_unregister().
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      e8cb72b3
    • Jan Kara's avatar
      bdi: Make wb->bdi a proper reference · 810df54a
      Jan Kara authored
      Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
      structures except for the one embedded inside struct backing_dev_info.
      That will allow us to simplify bdi unregistration.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      810df54a
    • Jan Kara's avatar
      bdi: Mark congested->bdi as internal · b7d680d7
      Jan Kara authored
      congested->bdi pointer is used only to be able to remove congested
      structure from bdi->cgwb_congested_tree on structure release. Moreover
      the pointer can become NULL when we unregister the bdi. Rename the field
      to __bdi and add a comment to make it more explicit this is internal
      stuff of memcg writeback code and people should not use the field as
      such use will be likely race prone.
      
      We do not bother with converting congested->bdi to a proper refcounted
      reference. It will be slightly ugly to special-case bdi->wb.congested to
      avoid effectively a cyclic reference of bdi to itself and the reference
      gets cleared from bdi_unregister() making it impossible to reference
      a freed bdi.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b7d680d7
    • Jan Kara's avatar
      block: Fix bdi assignment to bdev inode when racing with disk delete · 03e26279
      Jan Kara authored
      When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
      restart the process of opening the block device. However we forget to
      switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
      inode will be pointing to a stale bdi. Fix the problem by setting
      bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      03e26279
  2. 21 Mar, 2017 6 commits
    • Jens Axboe's avatar
      block: fix stacked driver stats init and free · a83b576c
      Jens Axboe authored
      If a driver allocates a queue for stacked usage, then it does
      not currently get stats allocated. This causes the later init
      of, eg, writeback throttling to blow up. Move the init to the
      queue allocation instead.
      
      Additionally, allow a NULL callback unregistration. This avoids
      having the caller check for that, fixing another oops on
      removal of a block device that doesn't have poll stats allocated.
      
      Fixes: 34dbad5d ("blk-stat: convert to callback-based statistics reporting")
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a83b576c
    • Omar Sandoval's avatar
      blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Omar Sandoval authored
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for both in-tree
      users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      34dbad5d
    • Omar Sandoval's avatar
      blk-stat: move BLK_RQ_STAT_BATCH definition to blk-stat.c · 4875253f
      Omar Sandoval authored
      This is an implementation detail that no-one outside of blk-stat.c uses.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      4875253f
    • Omar Sandoval's avatar
      blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} · fa2e39cb
      Omar Sandoval authored
      The stats buckets will become generic soon, so make the existing users
      use the common READ and WRITE definitions instead of one internal to
      blk-stat.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      fa2e39cb
    • Omar Sandoval's avatar
      block: remove extra calls to wbt_exit() · 0315b159
      Omar Sandoval authored
      We always call wbt_exit() from blk_release_queue(), so these are
      unnecessary.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0315b159
    • Omar Sandoval's avatar
      blk-stat: fix blk_stat_sum() if all samples are batched · 7d8d0014
      Omar Sandoval authored
      We need to flush the batch _before_ we check the number of samples,
      otherwise we'll miss all of the batched samples.
      
      Fixes: cf43e6be ("block: add scalable completion tracking of requests")
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      7d8d0014
  3. 20 Mar, 2017 5 commits
    • Linus Torvalds's avatar
      Linux 4.11-rc3 · 97da3854
      Linus Torvalds authored
      97da3854
    • Linus Torvalds's avatar
      mm/swap: don't BUG_ON() due to uninitialized swap slot cache · 452b94b8
      Linus Torvalds authored
      This BUG_ON() triggered for me once at shutdown, and I don't see a
      reason for the check.  The code correctly checks whether the swap slot
      cache is usable or not, so an uninitialized swap slot cache is not
      actually problematic afaik.
      
      I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
      I'm not sure why that seemingly pointless check was there.  I suspect
      the real fix is to just remove it entirely, but for now we'll warn about
      it but not bring the machine down.
      
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      452b94b8
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · a07a6e41
      Linus Torvalds authored
      Pull more powerpc fixes from Michael Ellerman:
       "A couple of minor powerpc fixes for 4.11:
      
         - wire up statx() syscall
      
         - don't print a warning on memory hotplug when HPT resizing isn't
           available
      
        Thanks to: David Gibson, Chandan Rajendra"
      
      * tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/pseries: Don't give a warning when HPT resizing isn't available
        powerpc: Wire up statx() syscall
      a07a6e41
    • Linus Torvalds's avatar
      Merge branch 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 4571bc5a
      Linus Torvalds authored
      Pull parisc fixes from Helge Deller:
      
       - Mikulas Patocka added support for R_PARISC_SECREL32 relocations in
         modules with CONFIG_MODVERSIONS.
      
       - Dave Anglin optimized the cache flushing for vmap ranges.
      
       - Arvind Yadav provided a fix for a potential NULL pointer dereference
         in the parisc perf code (and some code cleanups).
      
       - I wired up the new statx system call, fixed some compiler warnings
         with the access_ok() macro and fixed shutdown code to really halt a
         system at shutdown instead of crashing & rebooting.
      
      * 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Fix system shutdown halt
        parisc: perf: Fix potential NULL pointer dereference
        parisc: Avoid compiler warnings with access_ok()
        parisc: Wire up statx system call
        parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
        parisc: support R_PARISC_SECREL32 relocation in modules
      4571bc5a
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 8aa34172
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "The bulk of the changes are in qla2xxx target driver code to address
        various issues found during Cavium/QLogic's internal testing (stable
        CC's included), along with a few other stability and smaller
        miscellaneous improvements.
      
        There are also a couple of different patch sets from Mike Christie,
        which have been a result of his work to use target-core ALUA logic
        together with tcm-user backend driver.
      
        Finally, a patch to address some long standing issues with
        pass-through SCSI export of TYPE_TAPE + TYPE_MEDIUM_CHANGER devices,
        which will make folks using physical (or virtual) magnetic tape happy"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
        qla2xxx: Update driver version to 9.00.00.00-k
        qla2xxx: Fix delayed response to command for loop mode/direct connect.
        qla2xxx: Change scsi host lookup method.
        qla2xxx: Add DebugFS node to display Port Database
        qla2xxx: Use IOCB interface to submit non-critical MBX.
        qla2xxx: Add async new target notification
        qla2xxx: Export DIF stats via debugfs
        qla2xxx: Improve T10-DIF/PI handling in driver.
        qla2xxx: Allow relogin to proceed if remote login did not finish
        qla2xxx: Fix sess_lock & hardware_lock lock order problem.
        qla2xxx: Fix inadequate lock protection for ABTS.
        qla2xxx: Fix request queue corruption.
        qla2xxx: Fix memory leak for abts processing
        qla2xxx: Allow vref count to timeout on vport delete.
        tcmu: Convert cmd_time_out into backend device attribute
        tcmu: make cmd timeout configurable
        tcmu: add helper to check if dev was configured
        target: fix race during implicit transition work flushes
        target: allow userspace to set state to transitioning
        target: fix ALUA transition timeout handling
        ...
      8aa34172
  4. 19 Mar, 2017 15 commits
  5. 18 Mar, 2017 4 commits
    • Nicholas Bellinger's avatar
      tcmu: Convert cmd_time_out into backend device attribute · 7d7a7435
      Nicholas Bellinger authored
      Instead of putting cmd_time_out under ../target/core/user_0/foo/control,
      which has historically been used by parameters needed for initial
      backend device configuration, go ahead and move cmd_time_out into
      a backend device attribute.
      
      In order to do this, tcmu_module_init() has been updated to create
      a local struct configfs_attribute **tcmu_attrs, that is based upon
      the existing passthrough_attrib_attrs along with the new cmd_time_out
      attribute.  Once **tcm_attrs has been setup, go ahead and point
      it at tcmu_ops->tb_dev_attrib_attrs so it's picked up by target-core.
      
      Also following MNC's previous change, ->cmd_time_out is stored in
      milliseconds but exposed via configfs in seconds.  Also, note this
      patch restricts the modification of ->cmd_time_out to before +
      after the TCMU device has been configured, but not while it has
      active fabric exports.
      
      Cc: Mike Christie <mchristi@redhat.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      7d7a7435
    • Mike Christie's avatar
      tcmu: make cmd timeout configurable · af980e46
      Mike Christie authored
      A single daemon could implement multiple types of devices
      using multuple types of real devices that may not support
      restarting from crashes and/or handling tcmu timeouts. This
      makes the cmd timeout configurable, so handlers that do not
      support it can turn if off for now.
      Signed-off-by: default avatarMike Christie <mchristi@redhat.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      af980e46
    • Mike Christie's avatar
      tcmu: add helper to check if dev was configured · 972c7f16
      Mike Christie authored
      This adds a helper to check if the dev was configured. It
      will be used in the next patch to prevent updates to some
      config settings after the device has been setup.
      Signed-off-by: default avatarMike Christie <mchristi@redhat.com>
      Signed-off-by: default avatarNicholas Bellinger <nab@linux-iscsi.org>
      972c7f16
    • Linus Torvalds's avatar
      Merge tag 'openrisc-for-linus' of git://github.com/openrisc/linux · 93afaa45
      Linus Torvalds authored
      Pull OpenRISC fixes from Stafford Horne:
       "OpenRISC fixes for build issues that were exposed by kbuild robots
        after 4.11 merge. All from allmodconfig builds. This includes:
      
         - bug in the handling of 8-byte get_user() calls
      
         - module build failure due to multile missing symbol exports"
      
      * tag 'openrisc-for-linus' of git://github.com/openrisc/linux:
        openrisc: Export symbols needed by modules
        openrisc: fix issue handling 8 byte get_user calls
        openrisc: xchg: fix `computed is not used` warning
      93afaa45