1. 09 Dec, 2016 2 commits
    • block: improve handling of the magic discard payload · f9d03f96
      Christoph Hellwig authored
      Instead of allocating a single unused biovec for discard requests, send
      them down without any payload.  The driver may then add a "special"
      payload using a biovec embedded into struct request (unioned over other
      fields never used while in the driver), overloading the number of
      segments for this case.
      
      This has a couple of advantages:
      
       - we don't have to allocate the bio_vec
       - the amount of special casing for discard requests in the block
         layer is significantly reduced
       - using this same scheme for other request types is trivial,
         which will be important for implementing the new WRITE_ZEROES
         op on devices where it actually requires a payload (e.g. SCSI)
       - we can get rid of playing games with the request length, as
         we'll never touch it and completions will work just fine
       - it will allow us to support ranged discard operations in the
         future by merging non-contiguous discard bios into a single
         request
       - last but not least it removes a lot of code
      
      This patch is the common base for my WIP series for ranged discards and
      for removing discard_zeroes_data in favor of always using
      REQ_OP_WRITE_ZEROES, so it would be good to get it in quickly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
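
The scheme described in this commit message can be modelled in user space roughly as follows. This is a minimal sketch, not the kernel code: the types `bio_vec_model` and `request_model`, the helper names, the placeholder union member and the flag value are all assumptions made for illustration. It shows a payload vector sharing storage with fields the driver never reads, and the segment count being overridden when a special payload is attached.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bit; the value is an assumption for this sketch. */
#define RQF_SPECIAL_PAYLOAD (1u << 0)

/* Simplified stand-in for the kernel's struct bio_vec. */
struct bio_vec_model {
    void *data;
    unsigned int len;
    unsigned int offset;
};

/* Simplified stand-in for struct request. */
struct request_model {
    unsigned int rq_flags;
    unsigned short nr_phys_segments;
    union {
        /* Placeholder for fields that are unused once the driver owns the request. */
        void *pre_dispatch_data[3];
        /* Driver-supplied payload, e.g. a range descriptor for a discard. */
        struct bio_vec_model special_vec;
    };
};

/* Attach a driver-built payload without allocating a separate bio_vec. */
static void rq_set_special_payload(struct request_model *rq, void *buf,
                                   unsigned int len)
{
    assert(rq != NULL && buf != NULL);
    rq->special_vec.data = buf;
    rq->special_vec.len = len;
    rq->special_vec.offset = 0;
    rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
}

/* A special payload always counts as exactly one segment. */
static unsigned short rq_nr_phys_segments(const struct request_model *rq)
{
    if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
        return 1;
    return rq->nr_phys_segments;
}
```

In this model the request length is never touched when the payload is attached, which matches the point above that completions keep working without adjusting it.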
    • blk-wbt: don't throttle discard or write zeroes · be07e14f
      Christoph Hellwig authored
      Both of these are metadata-only commands that are not issued by the
      writeback code and are not directly relevant to the writeback bandwidth.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 08 Dec, 2016 2 commits
  3. 06 Dec, 2016 21 commits
  4. 05 Dec, 2016 2 commits
    • Jens Axboe
      6e85eaf3
    • block: fix unintended fallthrough in generic_make_request_checks() · 58886785
      Nicolai Stange authored
      Since commit e73c23ff ("block: add async variant of
      blkdev_issue_zeroout") messages like the following show up:
      
        EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at
                        logical offset 0 with max blocks 1 with error 95
        EXT4-fs (dm-1): This should not happen!! Data will be lost
      
      Due to the following fallthrough introduced with
      commit 2d253440 ("block: Define zoned block device operations"),
      generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only
      if the block device supports "write same" *and* is a zoned one:
      
        switch (bio_op(bio)) {
        [...]
        case REQ_OP_WRITE_SAME:
                if (!bdev_write_same(bio->bi_bdev))
                        goto not_supported;
        case REQ_OP_ZONE_REPORT:
        case REQ_OP_ZONE_RESET:
                if (!bdev_is_zoned(bio->bi_bdev))
                        goto not_supported;
                break;
        [...]
        }
      
      Thus, although the bio setup as done by __blkdev_issue_write_same() from
      commit e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      would succeed, its actual submission would not, failing with
      EOPNOTSUPP (error 95).
      
      Fix this by removing the fallthrough which, due to the lack of an explicit
      comment, seems to be unintended anyway.
      
      Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      Fixes: 2d253440 ("block: Define zoned block device operations")
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
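
The effect of the missing break can be reproduced in a small user-space model. This is a sketch under stated assumptions: `enum req_op_model`, `struct dev_caps`, `check_buggy()` and `check_fixed()` are hypothetical names that mirror the structure of the switch quoted above rather than the kernel function itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum req_op_model { OP_WRITE_SAME, OP_ZONE_REPORT, OP_ZONE_RESET, OP_OTHER };

/* Capabilities of a hypothetical block device. */
struct dev_caps {
    bool write_same;
    bool zoned;
};

/* Models the switch before the fix: WRITE_SAME also needs a zoned device. */
static int check_buggy(enum req_op_model op, const struct dev_caps *dev)
{
    assert(dev != NULL);
    switch (op) {
    case OP_WRITE_SAME:
        if (!dev->write_same)
            return -EOPNOTSUPP;
        /* No break here: this is the bug being modelled. */
        /* fall through */
    case OP_ZONE_REPORT:
    case OP_ZONE_RESET:
        if (!dev->zoned)
            return -EOPNOTSUPP;
        break;
    default:
        break;
    }
    return 0;
}

/* Models the switch after the fix: each case ends with its own break. */
static int check_fixed(enum req_op_model op, const struct dev_caps *dev)
{
    assert(dev != NULL);
    switch (op) {
    case OP_WRITE_SAME:
        if (!dev->write_same)
            return -EOPNOTSUPP;
        break;
    case OP_ZONE_REPORT:
    case OP_ZONE_RESET:
        if (!dev->zoned)
            return -EOPNOTSUPP;
        break;
    default:
        break;
    }
    return 0;
}
```

For a device that supports write same but is not zoned, the first variant rejects the operation and the second accepts it, which is the behaviour change the fix is after.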
  5. 03 Dec, 2016 3 commits
  6. 01 Dec, 2016 9 commits
    • block: factor out req_set_nomerge · e0c72300
      Ritesh Harjani authored
      Factor the common code that sets the REQ_NOMERGE flag, which is
      repeated in several places, into a helper, req_set_nomerge().
      Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
      
      Get rid of the inline.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: protect iterate_bdevs() against concurrent close · af309226
      Rabin Vincent authored
      If a block device is closed while iterate_bdevs() is handling it, the
      following NULL pointer dereference occurs because bdev->b_disk is NULL
      in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
      turn called by the mapping_cap_writeback_dirty() call in
      __filemap_fdatawrite_range()):
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
       IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
       PGD 9e62067 PUD 9ee8067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
       task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
       RIP: 0010:[<ffffffff81314790>]  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
       RSP: 0018:ffff880009f5fe68  EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
       RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
       RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
       R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
       FS:  00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
       Stack:
        ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
        0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
        ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
       Call Trace:
        [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90
        [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30
        [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20
        [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130
        [<ffffffff811b2763>] sys_sync+0x63/0x90
        [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76
       Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
       RIP  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
        RSP <ffff880009f5fe68>
       CR2: 0000000000000508
       ---[ end trace 2487336ceb3de62d ]---
      
      The crash is easily reproducible by running the following command, if an
      msleep(100) is inserted before the call to func() in iterate_bdevs():
      
       while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done
      
      Fix it by holding the bd_mutex across the func() call and only calling
      func() if the bdev is opened.
      
      Cc: stable@vger.kernel.org
      Fixes: 5c0d6b60 ("vfs: Create function for iterating over block devices")
      Reported-and-tested-by: Wei Fang <fangwei1@huawei.com>
      Signed-off-by: Rabin Vincent <rabinv@axis.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
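
The locking pattern of the fix (hold the mutex across the callback, and only call it while the device is open) can be sketched in user space with pthreads. The names `bdev_model`, `call_if_open()` and `count_visit()` are assumptions for this illustration; the `lock` and `openers` fields stand in for the kernel's bd_mutex and open count.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct bdev_model {
    pthread_mutex_t lock;   /* stands in for bd_mutex */
    int openers;            /* non-zero while the device is open */
};

typedef void (*bdev_func)(struct bdev_model *, void *);

/*
 * Call func only while the device is open, holding the lock so that a
 * concurrent close (which would take the same lock) cannot tear the
 * device down in the middle of the callback.
 * Returns 1 if func was called, 0 if the device was skipped.
 */
static int call_if_open(struct bdev_model *bdev, bdev_func func, void *arg)
{
    int called = 0;

    assert(bdev != NULL && func != NULL);
    pthread_mutex_lock(&bdev->lock);
    if (bdev->openers) {
        func(bdev, arg);
        called = 1;
    }
    pthread_mutex_unlock(&bdev->lock);
    return called;
}

/* Example callback: counts how many devices were visited. */
static void count_visit(struct bdev_model *bdev, void *arg)
{
    (void)bdev;
    ++*(int *)arg;
}
```

The sketch only protects against a close path that takes the same lock before clearing device state; that is the assumption the pattern rests on.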
    • block: mtip32xx: set error code on failure · 5b0e34e1
      Pan Bian authored
      Fix bug https://bugzilla.kernel.org/show_bug.cgi?id=188531. In
      mtip_block_initialize(), the variable rv holds the return value, which
      should be negative on error. rv is initialized to 0 and is not updated
      when the call to ida_pre_get() fails, so the function may return 0,
      signalling success even though initialization failed. Fix this by
      explicitly assigning -ENOMEM to rv on the branch where ida_pre_get()
      fails.
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
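
The bug class described here, an error path that jumps to the exit label without setting the return code, can be shown with a reduced example. `init_buggy()` and `init_fixed()` are hypothetical functions, and the `prealloc_ok` parameter is a stand-in for the result of the allocation step; none of this is the driver's actual code.

```c
#include <assert.h>
#include <errno.h>

/* Before the fix: the failure branch leaves rv at its initial value of 0. */
static int init_buggy(int prealloc_ok)
{
    int rv = 0;

    if (!prealloc_ok)
        goto out;           /* rv is still 0, so the caller sees success */

    /* ... remaining initialization steps would go here ... */
out:
    return rv;
}

/* After the fix: the failure branch sets a negative errno before leaving. */
static int init_fixed(int prealloc_ok)
{
    int rv = 0;

    if (!prealloc_ok) {
        rv = -ENOMEM;
        goto out;
    }

    /* ... remaining initialization steps would go here ... */
out:
    assert(rv <= 0);        /* convention: 0 on success, negative errno on error */
    return rv;
}
```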
    • nvmet: add support for the Write Zeroes command · d2629209
      Chaitanya Kulkarni authored
      Add support for handling the Write Zeroes command on the target.
      Call into __blkdev_issue_zeroout, which the block layer expands into the
      most suitable variant of zeroing the LBAs. Allow the write zeroes
      operation to deallocate the LBAs when calling __blkdev_issue_zeroout.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • nvme: add support for the Write Zeroes command · 6d31e3ba
      Chaitanya Kulkarni authored
      Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block
      device if the device sets the optional command support bit for Write
      Zeroes. Add support for setting up the Write Zeroes command. Set the
      maximum number of sectors per Write Zeroes command according to the
      NVMe Write Zeroes command definition.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
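
One way to derive the per-command limit mentioned above is sketched below. It assumes that the Write Zeroes command carries its block count in a 16-bit, zero-based field, so that a single command covers at most 65536 logical blocks, and it expresses the result in 512-byte sectors. The function name is hypothetical and the commit's own calculation may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest range one Write Zeroes command can cover, in 512-byte sectors,
 * for a namespace with the given logical block size in bytes.
 * Assumption: the block count field is 16 bits wide and zero-based.
 */
static uint32_t max_write_zeroes_sectors(uint32_t lba_size)
{
    /* Logical block sizes are powers of two, at least 512 bytes. */
    assert(lba_size >= 512 && (lba_size & (lba_size - 1)) == 0);

    uint64_t max_blocks = (uint64_t)UINT16_MAX + 1;     /* 65536 blocks */
    return (uint32_t)((max_blocks * lba_size) >> 9);    /* bytes to sectors */
}
```

With 512-byte blocks this gives 65536 sectors (32 MiB) per command, and with 4096-byte blocks it gives 524288 sectors (256 MiB).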
    • nvme.h: add Write Zeroes definitions · 3b7c33b2
      Chaitanya Kulkarni authored
      Add the command structure, optional command set support (ONCS) bit and
      a new error code for the Write Zeroes command.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: add support for REQ_OP_WRITE_ZEROES · a6f0788e
      Chaitanya Kulkarni authored
      This adds a new block layer operation to zero out a range of
      LBAs. It allows implementing zeroing for devices that don't use
      either discard with a predictable zero pattern or WRITE SAME of zeroes.
      The prominent example is NVMe with the Write Zeroes command,
      but in the future this should also help improve the way
      zeroing discards work. For this operation, a sysfs entry is exported
      that indicates the maximum number of bytes the device allows in one
      write zeroes operation.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: add async variant of blkdev_issue_zeroout · e73c23ff
      Chaitanya Kulkarni authored
      Similar to __blkdev_issue_discard this variant allows submitting
      the final bio asynchronously and chaining multiple ranges
      into a single completion.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Check partition alignment on zoned block devices · b02d8aae
      Damien Le Moal authored
      Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
      a zoned block device. However, the first and last zones reported for a
      partition make sense only if the partition start sector and size are aligned
      on the device zone size. The same applies for zone reset. Resetting the first
      or the last zone of a partition straddling zones may impact neighboring
      partitions. Finally, if a partition start sector is not at the beginning of a
      sequential zone, it will be impossible to write to the first sectors of the
      partition on a host-managed device.
      Avoid all these problems and inconsistencies by ignoring partitions that
      are not zone aligned.
      
      Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
      correct disk zoning type (host-aware, host-managed or none) but
      bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
      size is unknown). So test this as a way to ensure that a zoned block device is
      being handled as such. As a result, for host-aware devices, partitions that
      are not zone aligned will be accepted with CONFIG_BLK_DEV_ZONED disabled. That
      is, the disk will be treated as a regular block device (as it should). If
      zoned block device support is enabled, only aligned partitions will be accepted.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
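
The alignment rule described above can be written as a small predicate. This is a sketch, assuming all quantities are in 512-byte sectors and that a zone size of 0 means "unknown", in which case the partition is accepted as on a regular block device. The function name and signature are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a partition starting at sector `start` with `size`
 * sectors begins and ends on a zone boundary. A zone size of 0 means the
 * zone size is unknown, so no alignment requirement is applied.
 */
static bool partition_zone_aligned(uint64_t start, uint64_t size,
                                   uint64_t zone_sectors)
{
    /* Guard against wrap-around when computing the end sector. */
    assert(start + size >= start);

    if (zone_sectors == 0)
        return true;
    if (start % zone_sectors != 0)
        return false;
    return (start + size) % zone_sectors == 0;
}
```

For example, with 524288-sector (256 MiB) zones, a partition starting at sector 2048 is rejected, while one starting at sector 524288 and spanning exactly two zones is accepted.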
  7. 29 Nov, 2016 1 commit