1. 30 Jun, 2020 3 commits
  2. 29 Jun, 2020 16 commits
  3. 28 Jun, 2020 1 commit
  4. 26 Jun, 2020 1 commit
    • Jan Kara's avatar
      blktrace: Provide event for request merging · f3bdc62f
      Jan Kara authored
      Currently blk-mq does not report any event when two requests get merged
      in the elevator. This then results in difficult to understand sequence
      of events like:
      
      ...
        8,0   34     1579     0.608765271  2718  I  WS 215023504 + 40 [dbench]
        8,0   34     1584     0.609184613  2719  A  WS 215023544 + 56 <- (8,4) 2160568
        8,0   34     1585     0.609184850  2719  Q  WS 215023544 + 56 [dbench]
        8,0   34     1586     0.609188524  2719  G  WS 215023544 + 56 [dbench]
        8,0    3      602     0.609684162   773  D  WS 215023504 + 96 [kworker/3:1H]
        8,0   34     1591     0.609843593     0  C  WS 215023504 + 96 [0]
      
      and you can only guess (after quite some headscratching since the above
      excerpt is intermixed with a lot of other IO) that request 215023544+56
      got merged to request 215023504+40. Provide proper event for request
      merging like we used to do in the legacy block layer.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f3bdc62f
  5. 24 Jun, 2020 19 commits
    • Christoph Hellwig's avatar
      block: move struct block_device to blk_types.h · 621c1f42
      Christoph Hellwig authored
      Move the struct block_device definition together with most of the
      block layer definitions, as it has nothing to do with the rest of fs.h.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      621c1f42
    • Christoph Hellwig's avatar
      block: reduce ifdef CONFIG_BLOCK madness in headers · 1a4dcfa8
      Christoph Hellwig authored
      Large part of bio.h, blkdev.h and genhd.h are under ifdef CONFIG_BLOCK
      for no good reason.  Only stub out function that are called from
      code that is not dependent on CONFIG_BLOCK and leave the harmless
      other declarations around.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1a4dcfa8
    • Christoph Hellwig's avatar
      fs: move the buffer_heads_over_limit stub to buffer_head.h · d2de7ea4
      Christoph Hellwig authored
      Move the !CONFIG_BLOCK stub to the same place as the non-stub
      declaration.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d2de7ea4
    • Christoph Hellwig's avatar
      block: move block-related definitions out of fs.h · 3f1266f1
      Christoph Hellwig authored
      Move most of the block related definition out of fs.h into more suitable
      headers.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3f1266f1
    • Christoph Hellwig's avatar
      block: simplify sb_is_blkdev_sb · dd0dca22
      Christoph Hellwig authored
      Just use IS_ENABLED instead of providing a stub for !CONFIG_BLOCK.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dd0dca22
    • Christoph Hellwig's avatar
      fs: remove the mount_bdev and kill_block_super stubs · 75362a17
      Christoph Hellwig authored
      No one calls these functions without CONFIG_BLOCK, so don't bother
      stubbing them out.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      75362a17
    • Christoph Hellwig's avatar
      fs: remove the HAVE_UNLOCKED_IOCTL and HAVE_COMPAT_IOCTL defines · 4e24566a
      Christoph Hellwig authored
      These are not defined anywhere, and contrary to the comments we really
      do not care about out of tree code at all.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4e24566a
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
      764b23bd
    • Christoph Hellwig's avatar
      tty/sysrq: emergency_thaw_all does not depend on CONFIG_BLOCK · b818f09e
      Christoph Hellwig authored
      We can also thaw non-block file systems.  Remove the CONFIG_BLOCK in
      sysrq.c after making the prototype available unconditionally.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b818f09e
    • Christoph Hellwig's avatar
      nvme-rdma: fix a missing completion with remove invalidation · 7a804c34
      Christoph Hellwig authored
      Revert and incorret transformation that caused requests using remote
      invalidation to never complete.
      
      Fixes: 421147be863b ("nvme-rdma: factor out a nvme_rdma_end_request helper")
      Reported-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7a804c34
    • Gustavo A. R. Silva's avatar
      blk-iocost: Use struct_size() in kzalloc_node() · f61d6e25
      Gustavo A. R. Silva authored
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes.
      
      This code was detected with the help of Coccinelle and, audited and
      fixed manually.
      Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f61d6e25
    • Gustavo A. R. Silva's avatar
      block: bio: Use struct_size() in kmalloc() · 1f4fe21c
      Gustavo A. R. Silva authored
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes.
      
      This code was detected with the help of Coccinelle and, audited and
      fixed manually.
      Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1f4fe21c
    • Luis Chamberlain's avatar
      block: create the request_queue debugfs_dir on registration · 85e0cbbb
      Luis Chamberlain authored
      We were only creating the request_queue debugfs_dir only
      for make_request block drivers (multiqueue), but never for
      request-based block drivers. We did this as we were only
      creating non-blktrace additional debugfs files on that directory
      for make_request drivers. However, since blktrace *always* creates
      that directory anyway, we special-case the use of that directory
      on blktrace. Other than this being an eye-sore, this exposes
      request-based block drivers to the same debugfs fragile
      race that used to exist with make_request block drivers
      where if we start adding files onto that directory we can later
      run a race with a double removal of dentries on the directory
      if we don't deal with this carefully on blktrace.
      
      Instead, just simplify things by always creating the request_queue
      debugfs_dir on request_queue registration. Rename the mutex also to
      reflect the fact that this is used outside of the blktrace context.
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      85e0cbbb
    • Luis Chamberlain's avatar
      blktrace: ensure our debugfs dir exists · b431ef83
      Luis Chamberlain authored
      We make an assumption that a debugfs directory exists, but since
      this can fail ensure it exists before allowing blktrace setup to
      complete. Otherwise we end up stuffing blktrace files on the debugfs
      root directory. In the worst case scenario this *in theory* can create
      an eventual panic *iff* in the future a similarly named file is created
      prior on the debugfs root directory. This theoretical crash can happen
      due to a recursive removal followed by a specific dentry removal.
      
      This doesn't fix any known crash, however I have seen the files
      go into the main debugfs root directory in cases where the debugfs
      directory was not created due to other internal bugs with blktrace
      now fixed.
      
      blktrace is also completely useless without this directory, so
      this ensures to userspace we only setup blktrace if the kernel
      can stuff files where they are supposed to go into.
      
      debugfs directory creations typically aren't checked for, and we have
      maintainers doing sweep removals of these checks, but since we need this
      check to ensure proper userspace blktrace functionality we make sure
      to annotate the justification for the check.
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b431ef83
    • Luis Chamberlain's avatar
      blktrace: fix debugfs use after free · bad8e64f
      Luis Chamberlain authored
      On commit 6ac93117 ("blktrace: use existing disk debugfs directory")
      merged on v4.12 Omar fixed the original blktrace code for request-based
      drivers (multiqueue). This however left in place a possible crash, if you
      happen to abuse blktrace while racing to remove / add a device.
      
      We used to use asynchronous removal of the request_queue, and with that
      the issue was easier to reproduce. Now that we have reverted to
      synchronous removal of the request_queue, the issue is still possible to
      reproduce, its however just a bit more difficult.
      
      We essentially run two instances of break-blktrace which add/remove
      a loop device, and setup a blktrace and just never tear the blktrace
      down. We do this twice in parallel. This is easily reproduced with the
      script run_0004.sh from break-blktrace [0].
      
      We can end up with two types of panics each reflecting where we
      race, one a failed blktrace setup:
      
      [  252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
      [  252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
      [  252.436592] #PF: supervisor write access in kernel mode
      [  252.439822] #PF: error_code(0x0002) - not-present page
      [  252.442967] PGD 0 P4D 0
      [  252.444656] Oops: 0002 [#1] SMP NOPTI
      [  252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
      [  252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
      [  252.456343] RIP: 0010:down_write+0x15/0x40
      [  252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
                     cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
                     00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
                     45 08 5d
      [  252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
      [  252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
      [  252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
      [  252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
      [  252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
      [  252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
      [  252.473346] FS:  00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
      [  252.475225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
      [  252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  252.479866] Call Trace:
      [  252.480322]  simple_recursive_removal+0x4e/0x2e0
      [  252.481078]  ? debugfs_remove+0x60/0x60
      [  252.481725]  ? relay_destroy_buf+0x77/0xb0
      [  252.482662]  debugfs_remove+0x40/0x60
      [  252.483518]  blk_remove_buf_file_callback+0x5/0x10
      [  252.484328]  relay_close_buf+0x2e/0x60
      [  252.484930]  relay_open+0x1ce/0x2c0
      [  252.485520]  do_blk_trace_setup+0x14f/0x2b0
      [  252.486187]  __blk_trace_setup+0x54/0xb0
      [  252.486803]  blk_trace_ioctl+0x90/0x140
      [  252.487423]  ? do_sys_openat2+0x1ab/0x2d0
      [  252.488053]  blkdev_ioctl+0x4d/0x260
      [  252.488636]  block_ioctl+0x39/0x40
      [  252.489139]  ksys_ioctl+0x87/0xc0
      [  252.489675]  __x64_sys_ioctl+0x16/0x20
      [  252.490380]  do_syscall_64+0x52/0x180
      [  252.491032]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      And the other on the device removal:
      
      [  128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
      [  128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
      [  128.619537] #PF: supervisor write access in kernel mode
      [  128.622700] #PF: error_code(0x0002) - not-present page
      [  128.625842] PGD 0 P4D 0
      [  128.627585] Oops: 0002 [#1] SMP NOPTI
      [  128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
      [  128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
      [  128.640471] RIP: 0010:down_write+0x15/0x40
      [  128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
                     cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
                     00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
                     45 08 5d
      [  128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
      [  128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
      [  128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
      [  128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
      [  128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
      [  128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
      [  128.660500] FS:  00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
      [  128.662204] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
      [  128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  128.667282] Call Trace:
      [  128.667801]  simple_recursive_removal+0x4e/0x2e0
      [  128.668663]  ? debugfs_remove+0x60/0x60
      [  128.669368]  debugfs_remove+0x40/0x60
      [  128.669985]  blk_trace_free+0xd/0x50
      [  128.670593]  __blk_trace_remove+0x27/0x40
      [  128.671274]  blk_trace_shutdown+0x30/0x40
      [  128.671935]  blk_release_queue+0x95/0xf0
      [  128.672589]  kobject_put+0xa5/0x1b0
      [  128.673188]  disk_release+0xa2/0xc0
      [  128.673786]  device_release+0x28/0x80
      [  128.674376]  kobject_put+0xa5/0x1b0
      [  128.674915]  loop_remove+0x39/0x50 [loop]
      [  128.675511]  loop_control_ioctl+0x113/0x130 [loop]
      [  128.676199]  ksys_ioctl+0x87/0xc0
      [  128.676708]  __x64_sys_ioctl+0x16/0x20
      [  128.677274]  do_syscall_64+0x52/0x180
      [  128.677823]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The common theme here is:
      
      debugfs: Directory 'loop0' with parent 'block' already present
      
      This crash happens because of how blktrace uses the debugfs directory
      where it places its files. Upon init we always create the same directory
      which would be needed by blktrace but we only do this for make_request
      drivers (multiqueue) block drivers. When you race a removal of these
      devices with a blktrace setup you end up in a situation where the
      make_request recursive debugfs removal will sweep away the blktrace
      files and then later blktrace will also try to remove individual
      dentries which are already NULL. The inverse is also possible and hence
      the two types of use after frees.
      
      We don't create the block debugfs directory on init for these types of
      block devices:
      
        * request-based block driver block devices
        * every possible partition
        * scsi-generic
      
      And so, this race should in theory only be possible with make_request
      drivers.
      
      We can fix the UAF by simply re-using the debugfs directory for
      make_request drivers (multiqueue) and only creating the ephemeral
      directory for the other type of block devices. The new clarifications
      on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace
      *prior* to processing a blktrace ensures the debugfs directories are
      only created if no possible directory name clashes are possible.
      
      This goes tested with:
      
        o nvme partitions
        o ISCSI with tgt, and blktracing against scsi-generic with:
          o block
          o tape
          o cdrom
          o media changer
        o blktests
      
      This patch is part of the work which disputes the severity of
      CVE-2019-19770 which shows this issue is not a core debugfs issue, but
      a misuse of debugfs within blktace.
      
      Fixes: 6ac93117 ("blktrace: use existing disk debugfs directory")
      Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
      Cc: yu kuai <yukuai3@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bad8e64f
    • Luis Chamberlain's avatar
      loop: be paranoid on exit and prevent new additions / removals · 200f9337
      Luis Chamberlain authored
      Be pedantic on removal as well and hold the mutex.
      This should prevent uses of addition while we exit.
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      200f9337
    • Luis Chamberlain's avatar
      blktrace: annotate required lock on do_blk_trace_setup() · a67549c8
      Luis Chamberlain authored
      Ensure it is clear which lock is required on do_blk_trace_setup().
      Suggested-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a67549c8
    • Luis Chamberlain's avatar
      block: revert back to synchronous request_queue removal · e8c7d14a
      Luis Chamberlain authored
      Commit dc9edc44 ("block: Fix a blk_exit_rl() regression") merged on
      v4.12 moved the work behind blk_release_queue() into a workqueue after a
      splat floated around which indicated some work on blk_release_queue()
      could sleep in blk_exit_rl(). This splat would be possible when a driver
      called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
      as its final call) from an atomic context.
      
      blk_put_queue() decrements the refcount for the request_queue kobject, and
      upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
      now removed through commit db6d9952 ("block: remove request_list code")
      on v5.0, we reserve the right to be able to sleep within
      blk_release_queue() context.
      
      The last reference for the request_queue must not be called from atomic
      context. *When* the last reference to the request_queue reaches 0 varies,
      and so let's take the opportunity to document when that is expected to
      happen and also document the context of the related calls as best as
      possible so we can avoid future issues, and with the hopes that the
      synchronous request_queue removal sticks.
      
      We revert back to synchronous request_queue removal because asynchronous
      removal creates a regression with expected userspace interaction with
      several drivers. An example is when removing the loopback driver, one
      uses ioctls from userspace to do so, but upon return and if successful,
      one expects the device to be removed. Likewise if one races to add another
      device the new one may not be added as it is still being removed. This was
      expected behavior before and it now fails as the device is still present
      and busy still. Moving to asynchronous request_queue removal could have
      broken many scripts which relied on the removal to have been completed if
      there was no error. Document this expectation as well so that this
      doesn't regress userspace again.
      
      Using asynchronous request_queue removal however has helped us find
      other bugs. In the future we can test what could break with this
      arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
      
      While at it, update the docs with the context expectations for the
      request_queue / gendisk refcount decrement, and make these
      expectations explicit by using might_sleep().
      
      Fixes: dc9edc44 ("block: Fix a blk_exit_rl() regression")
      Suggested-by: default avatarNicolai Stange <nstange@suse.de>
      Signed-off-by: default avatarLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: yu kuai <yukuai3@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e8c7d14a